Software Technologies for Embedded and Ubiquitous Systems: 7th IFIP WG 10.2 International Workshop, SEUS 2009 Newport Beach, CA, USA, November 16-18, ... Applications, incl. Internet Web, and HCI)
Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5860
Sunggu Lee Priya Narasimhan (Eds.)
Software Technologies for Embedded and Ubiquitous Systems 7th IFIP WG 10.2 International Workshop, SEUS 2009 Newport Beach, CA, USA, November 16-18, 2009 Proceedings
Volume Editors Sunggu Lee Pohang University of Science and Technology (POSTECH) Department of Electronic and Electrical Engineering San 31 Hyoja Dong, Nam Gu, Pohang, Gyeongbuk 790-784, South Korea E-mail: [email protected] Priya Narasimhan Carnegie Mellon University Electrical and Computer Engineering Department 5000 Forbes Avenue, Pittsburgh, PA 15213-3890, USA E-mail: [email protected]
Library of Congress Control Number: 2009937935
CR Subject Classification (1998): C.2, C.3, D.2, D.4, H.4, H.3, H.5
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-10264-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-10264-6 Springer Berlin Heidelberg New York
The 7th IFIP Workshop on Software Technologies for Future Embedded and Ubiquitous Systems (SEUS) followed on the success of six previous editions in Capri, Italy (2008), Santorini, Greece (2007), Gyeongju, Korea (2006), Seattle, USA (2005), Vienna, Austria (2004), and Hakodate, Japan (2003), establishing SEUS as one of the emerging workshops in the field of embedded and ubiquitous systems. SEUS 2009 continued the tradition of fostering cross-community scientific excellence and establishing strong links between research and industry. The fields of both embedded computing and ubiquitous systems have seen considerable growth over the past few years. Given the advances in these fields, and also those in the areas of distributed computing, sensor networks, middleware, etc., the area of ubiquitous embedded computing is now being envisioned as the way of the future. The systems and technologies that will arise in support of ubiquitous embedded computing will undoubtedly need to address a variety of issues, including dependability, real-time behavior, human–computer interaction, autonomy, resource constraints, etc. All of these requirements pose a challenge to the research community. The purpose of SEUS 2009 was to bring together researchers and practitioners with an interest in advancing the state of the art and the state of practice in this emerging field, with the hope of fostering new ideas, collaborations and technologies. SEUS 2009 would not have been possible without the effort of many people. First of all, we would like to thank the authors, who contributed the papers that made up the essence of this workshop. We are particularly thankful to the Steering Committee Co-chairs, Peter Puschner, Yunmook Nah, Uwe Brinkschulte, Franz Rammig, Sang Son and Kane H. Kim, without whose help this workshop would not have been possible.
We would also like to thank the General Co-chairs, Eltefaat Shokri and Vana Kalogeraki, who organized the entire workshop, and the Program Committee members, who each contributed their valuable time to review and discuss each of the submitted papers. We would also like to thank the Publicity Chair Soila Kavulya and the Local Arrangements Chair Steve Meyers for their help with organizational issues. Thanks are also due to Springer for producing this publication and providing the online conferencing system used to receive, review and process all of the papers submitted to this workshop. Last, but not least, we would like to thank the IFIP Working Group 10.2 on Embedded Systems for sponsoring this workshop. November 2009
Sunggu Lee Priya Narasimhan
Organization
General Co-chairs
Eltefaat Shokri, The Aerospace Corporation, USA
Vana Kalogeraki, University of California at Riverside, USA
Program Co-chairs
Sunggu Lee, Pohang University of Science and Technology (POSTECH), Korea
Priya Narasimhan, Carnegie Mellon University, USA
Steering Committee
Peter Puschner, Technische Universität Wien, Austria
Yunmook Nah, Dankook University, Korea
Uwe Brinkschulte, Goethe University, Frankfurt am Main, Germany
Franz Rammig, University of Paderborn, Germany
Sang Son, University of Virginia, USA
Kane H. Kim, University of California at Irvine, USA
Program Committee
Allan Wong, Hong Kong Polytech, China
Doo-Hyun Kim, Konkuk University, Korea
Franz J. Rammig, University of Paderborn, Germany
Jan Gustafsson, Mälardalen University, Sweden
Kaori Fujinami, Tokyo University of Agriculture and Technology, Japan
Kee Wook Rim, Sunmoon University, Korea
Lynn Choi, Korea University, Korea
Minyi Guo, University of Aizu, Japan
Paul Couderc, INRIA, France
Robert G. Pettit IV, The Aerospace Corporation, USA
Roman Obermaisser, Vienna University of Technology, Austria
Tei-Wei Kuo, National Taiwan University, Taiwan
Theo Ungerer, University of Augsburg, Germany
Wenbing Zhao, Cleveland State University, USA
Wilfried Elmenreich, University of Klagenfurt, Austria
Yukikazu Nakamoto, University of Hyogo and Nagoya University, Japan
Publicity and Local Arrangements Chairs
Soila Kavulya, Carnegie Mellon University, USA
Steve Meyers, The Aerospace Corporation, USA
Table of Contents

Design and Implementation of an Operational Flight Program for an Unmanned Helicopter FCC Based on the TMO Scheme ..... 1
  Se-Gi Kim, Seung-Hwa Song, Chun-Hyon Chang, Doo-Hyun Kim, Shin Heu, and JungGuk Kim

Power Modeling of Solid State Disk for Dynamic Power Management Policy Design in Embedded Systems ..... 24
  Jinha Park, Sungjoo Yoo, Sunggu Lee, and Chanik Park

Optimizing Mobile Application Performance with Model-Driven Engineering ..... 36
  Chris Thompson, Jules White, Brian Dougherty, and Douglas C. Schmidt

A Single-Path Chip-Multiprocessor System ..... 47
  Martin Schoeberl, Peter Puschner, and Raimund Kirner

Towards Trustworthy Self-optimization for Distributed Systems ..... 58
  Benjamin Satzger, Florian Mutschelknaus, Faruk Bagci, Florian Kluge, and Theo Ungerer

An Experimental Framework for the Analysis and Validation of Software Clocks ..... 69
  Andrea Bondavalli, Francesco Brancati, Andrea Ceccarelli, and Lorenzo Falai

Towards a Statistical Model of a Microprocessor's Throughput by Analyzing Pipeline Stalls
  Uwe Brinkschulte, Daniel Lohn, and Mathias Pacher

Model-Based Analysis of Contract-Based Real-Time Scheduling ..... 227
  Georgiana Macariu and Vladimir Crețu

Exploring the Design Space for Network Protocol Stacks on Special-Purpose Embedded Systems ..... 240
  Hyun-Wook Jin and Junbeom Yoo

HiperSense: An Integrated System for Dense Wireless Sensing and Massively Scalable Data Visualization ..... 252
  Pai H. Chou, Chong-Jing Chen, Stephen F. Jenks, and Sung-Jin Kim

Applying Architectural Hybridization in Networked Embedded Systems ..... 264
  Antonio Casimiro, Jose Rufino, Luis Marques, Mario Calha, and Paulo Verissimo

Concurrency and Communication: Lessons from the SHIM Project ..... 276
  Stephen A. Edwards

Location-Aware Web Service by Utilizing Web Contents Including Location Information ..... 288
  YongUk Kim, Chulbum Ahn, Joonwoo Lee, and Yunmook Nah

The GENESYS Architecture: A Conceptual Model for Component-Based Distributed Real-Time Systems ..... 296
  Roman Obermaisser and Bernhard Huber

Approximate Worst-Case Execution Time Analysis for Early Stage Embedded Systems Development ..... 308
  Jan Gustafsson, Peter Altenbernd, Andreas Ermedahl, and Björn Lisper

Using Context Awareness to Improve Quality of Information Retrieval in Pervasive Computing
  Joseph P. Loyall and Richard E. Schantz
Design and Implementation of an Operational Flight Program for an Unmanned Helicopter FCC Based on the TMO Scheme

Se-Gi Kim (1), Seung-Hwa Song (2), Chun-Hyon Chang (2), Doo-Hyun Kim (2), Shin Heu (3), and JungGuk Kim (1)

(1) Hankuk University of Foreign Studies, {undeadrage,jgkim}@hufs.ac.kr
(2) Konkuk University, [email protected], {chchang,doohyun}@konkuk.ac.kr
(3) Hanyang University, [email protected]
Abstract. HELISCOPE is the name of a project supported by the MKE (Ministry of Knowledge Economy) of Korea to develop a flying-camera service that transmits the scene of a fire from an unmanned helicopter. In this paper, we introduce the design and implementation of the OFP (Operational Flight Program) for the unmanned helicopter's navigation based on the well-known TMO scheme. Navigation of the unmanned helicopter is directed by flight-mode commands from our GCS (Ground Control System). As the RTOS on the FCC (Flight Control Computer), RT-eCos3.0, which has been developed on top of eCos3.0 to support the basic task model of the TMO scheme, is used. To verify this navigation system, a HILS (Hardware-in-the-Loop Simulation) system using the FlightGear simulator has also been developed. The structure and functions of RT-eCos3.0 and the HILS are also introduced briefly. Keywords: unmanned helicopter, on-flight software, TMO.
and communication devices connected to the FCC. The OFP must process real-time sensor inputs and commands coming from the GPS/INS (GPS/Inertial Navigation System), the AHRS (Attitude and Heading Reference System) and the GCS in a deadline-based manner. A main FC (Flight Control) task of the OFP calculates real-time control signals from these inputs and must send them to the actuators of the helicopter within the pre-given deadline. Also, the FCC must report the status of the aircraft to the GCS periodically or upon requests from the GCS. Our OFP consists of one time-triggered task and several message-triggered tasks. Collection of sensor data and commands is handled by several message-triggered tasks. Upon receiving data, these tasks store them into the ODS (Object Data Store) of the OFP-TMO. The ODS also contains some parameters describing the capability of the aircraft. Periodic calculation and sending of control outputs based on the data in the ODS is performed by the main time-triggered task. All these tasks are scheduled by RT-eCos3.0 based on their timing constraints. To verify the navigation system, a HILS (Hardware-in-the-Loop Simulation) system using the open-source FlightGear simulator has also been developed. In section 2, the TMO model and the RT-eCos3.0 kernel are introduced briefly as related work, and in section 3, the design and implementation of the OFP are described. In section 4, the HILS system for verification is discussed, and in section 5, we conclude.
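As a rough Python sketch of this task structure (all names here are illustrative; the real OFP is written against the TMOSL on RT-eCos3.0, not this API), message-triggered reader tasks fill a shared ODS, and a time-triggered FC task periodically consumes it:

```python
import threading
import time

class ODS:
    """Object Data Store shared by all tasks of the OFP-TMO (illustrative)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}   # latest GPS/INS, AHRS and SWM readings, flight mode, ...

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def snapshot(self):
        # Return a consistent copy so the FC task can compute without holding the lock.
        with self._lock:
            return dict(self._data)

def reader_task(ods, key, queue):
    """Message-triggered (SvM-like) task: blocks until a sensor message
    arrives on its channel, then stores it into the ODS."""
    while True:
        ods.put(key, queue.get())

def fc_task(ods, period_s, compute_control, send_to_actuators):
    """Time-triggered (SpM-like) task: every period it reads the ODS,
    computes the control outputs and sends them to the actuators."""
    next_release = time.monotonic()
    while True:
        send_to_actuators(compute_control(ods.snapshot()))
        next_release += period_s
        time.sleep(max(0.0, next_release - time.monotonic()))
```

The point of the snapshot is that the periodic FC task never blocks on a reader for longer than one dictionary copy, mirroring how the BCC keeps SvMs from disturbing SpM executions.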
2 The TMO Model and the RT-eCos3.0 Kernel

In this section, the distributed real-time object model TMO and the RTOS that has been used in implementing the OFP are introduced briefly as related work.

2.1 TMO Model [2], [3]

The TMO model is a real-time distributed object model for timeliness-guaranteed computing at design time. A TMO instance consists of three types of object member: an ODS (Object Data Store), time-triggered methods (SpM: Spontaneous Method) and message-triggered methods (SvM: Service Method). An SpM is actually a member thread that is activated by a pre-given timing constraint and must finish its periodic executions within the given deadline. An SvM is also a member thread, activated by an event message from a source outside the TMO. The main differences between the TMO model and conventional objects can be summarized as follows.

- TMO is a distributed computing component, and thus TMOs distributed over multiple nodes may interact via distributed IPC.
- The two types of method are active member threads in the TMO model (SpMs and SvMs).
- SvMs cannot disturb the executions of SpMs. This rule is called the BCC (Basic Concurrency Constraint). Basically, activation of an SvM is allowed only when potentially conflicting SpM executions are not in place.

2.2 RT-eCos3.0 Scheduling and IPC [2], [5]

RT-eCos3.0, which is a real-time extension of eCos3.0, supports multiple per-thread real-time scheduling policies. The policies are the EDF (Earliest Deadline
First)/BCC, LLF (Least Laxity First)/BCC and the non-preemptive FTFS (First Triggered First Scheduled) schedulers. The non-preemptive FTFS scheduler is used when an off-line scheduling scenario is given by our task serializer [7], which determines the initial offsets of time-triggered tasks so that all task instances can be executed without overlap and preemption. On the other hand, the EDF/BCC and the LLF/BCC are normally used when there is no pre-analysis tool or when task serialization is not possible. This means that the EDF and LLF schedulers are intended for systems with dynamic execution behaviors. With these real-time schedulers, two basic types of real-time task, time-triggered and message-triggered, are supported by the kernel. SpMs and SvMs of the TMO model are mapped to these tasks by the TMOSL (TMO Support Library) for TMO programmers. The timing precision used for representing timing constraints such as start/stop time, period and deadline has been enhanced to the microsecond unit. Although scheduling is performed every millisecond, the kernel computes the current time in microseconds by checking and compensating the raw PIC clock ticks since the last clock interrupt. With these schedulers, the kernel shows a task-switch overhead of 1.51 microseconds and a scheduling overhead of 2.73 microseconds in a 206 MHz ARM9 processor environment. Management of message-triggered tasks is always done in conjunction with the logical multicast distributed IPC of RT-eCos3.0. Once a message arrives via the network-transparent IPC at a channel associated with a message-triggered task, the task is activated and scheduled to finish its service within the pre-given deadline. The IPC subsystem consists of two layers: the lower layer is the intra-node channel IPC layer and the upper one is the network-transparent distributed IPC layer. This layering allows flexible configurations of the RT-eCos3.0 IPC based on various protocols.
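As a rough illustration of the difference between the two dynamic policies, the selection step of EDF and LLF can be sketched as follows. This is a simplification (the real kernel schedules preemptively at millisecond granularity); the task fields are our own names:

```python
def edf_pick(tasks, now):
    """Earliest-Deadline-First: among released tasks, run the one whose
    absolute deadline is nearest."""
    ready = [t for t in tasks if t["release"] <= now]
    return min(ready, key=lambda t: t["deadline"]) if ready else None

def llf_pick(tasks, now):
    """Least-Laxity-First: laxity = deadline - now - remaining execution
    time; run the released task with the smallest laxity."""
    ready = [t for t in tasks if t["release"] <= now]
    return min(ready, key=lambda t: t["deadline"] - now - t["remaining"]) if ready else None
```

A task with a later deadline but a large remaining execution time can have less laxity than an earlier-deadline task, so the two policies can make different choices on the same task set.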
Besides supporting this basic channel IPC, the TMOSL has been enhanced to support the Gate and the RMMC (Real-time Multicast Memory replication Channel), a highly abstracted distributed IPC model of the TMOSM of U.C. Irvine [4]. Since the role of an SvM is to handle external asynchronous events, the channel IPC has been extended so that an external device such as an I/O device or a sensor-alarm device can be associated with a channel. In this case, a message-triggered task can be activated when an asynchronous input occurs and is scheduled to finish within the predefined deadline.
3 Design and Implementation of the OFP for an Unmanned Helicopter FCC Based on the TMO Scheme

In this section, the control points of a helicopter and the OFP are described.

3.1 Helicopter Mechanics

Unlike fixed-wing aircraft, a helicopter makes a stable flight by using the thrust and upward force generated by the fixed-speed rotation of the engine and the angles of the main and tail rotor blades. For changes of heading, vertical flight and forward movement, it uses the tilt of the rotor disk and the tail rotor.
Fig. 1. Main components of a helicopter [1]
A helicopter can maintain the powerful hovering state that is not possible in fixed-wing aircraft, but it has difficulty maintaining a stable attitude because of its complicated lifting mechanism. Fig. 1 shows the main components of a helicopter and Table 1 shows the major control points and their motion effects.

Table 1. Helicopter controls and effects [1]

Cyclic lateral
  Directly controls: Varies main rotor blade pitch left/right
  Primary effect: Tilts main rotor disk left and right through the swashplate
  Secondary effect: Induces roll in direction moved
  Used in forward flight: To turn the aircraft
  Used in hover flight: To move sideways

Cyclic longitudinal
  Directly controls: Varies main rotor blade pitch fore/aft
  Primary effect: Tilts main rotor disk forward and back via the swashplate
  Secondary effect: Induces pitch nose down or up
  Used in forward flight: Control attitude
  Used in hover flight: To move forwards/backwards

Collective
  Directly controls: Collective angle of attack for the rotor main blades via the swashplate
  Primary effect: Inc./dec. pitch angle of rotor blades causing the aircraft to rise/descend
  Secondary effect: Inc./dec. torque and engine RPM
  Used in forward flight: To adjust power through rotor blade pitch setting
  Used in hover flight: To adjust skid height/vertical speed

Tail rotor: Rudder
  Directly controls: Collective pitch supplied to tail rotor blades
  Primary effect: Yaw rate
  Secondary effect: Inc./dec. torque and engine RPM (less than collective)
  Used in forward flight: Adjust sideslip angle
  Used in hover flight: Control yaw rate/heading
3.2 Flight Modes of the Unmanned Helicopter

The unmanned helicopter receives commands from the GCS, and the OFP on the FCC calculates the values of the control signals to be sent to the control points using the current sensor values from the GPS/INS, AHRS and SWM (Helicopter Servo Actuator Switching Module). These control signals are sent to the control points via the SWM. The OFP also sends information on location and attitude to the GCS for periodic monitoring. In case of loss of control, the SWM is set to a manual remote-control mode. Fig. 2 describes the control structure. Actually, Fig. 2 describes the whole structure of the HELISCOPE project, including the MCC (Multimedia Communication Computer) board and its communication mechanism; however, that part is not described in this paper. The auto-flight modes of our unmanned helicopter are as follows (Fig. 3):

- Hovering
- Auto-landing
- Point navigation
- Multi-point navigation
In hovering mode, the helicopter tries to maintain its current position and attitude even in wind. Hovering is the final state of all flight modes except auto-landing, and is the most difficult of the auto-flight modes. In auto-landing mode, the landing velocity is controlled according to the current altitude. In point-navigation mode, the helicopter moves to a target position given by the GCS. In this mode, a compensation algorithm is used when there is a deviation from the original track to the target caused by wind. When the craft arrives at the target, it switches to hovering mode. Multi-point navigation is a sequential repetition of point navigations.
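The mode transitions of Fig. 3 can be approximated by a small transition table. The event names below are hypothetical, since the paper describes the transitions only informally; the key property encoded is that every mode except auto-landing ends in hovering:

```python
# Hypothetical encoding of the flight-mode transition diagram (Fig. 3).
TRANSITIONS = {
    ("hovering", "cmd_land"): "auto-landing",
    ("hovering", "cmd_goto"): "point-navigation",
    ("hovering", "cmd_route"): "multi-point-navigation",
    ("point-navigation", "arrived"): "hovering",          # arrival ends in hovering
    ("multi-point-navigation", "last_point_reached"): "hovering",
    ("auto-landing", "touchdown"): "landed",
}

def next_mode(mode, event):
    """Return the successor mode; events with no transition are ignored."""
    return TRANSITIONS.get((mode, event), mode)
```

A table-driven state machine keeps the FC task's per-cycle mode logic to a single dictionary lookup, which is easy to audit against the diagram.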
Fig. 2. Structure of the HELISCOPE
Fig. 3. Flight Mode Transition
Fig. 4. Structure of the OFP-TMO
3.3 Design of the OFP Based on the TMO Scheme

The OFP basically consists of a TMO instance containing one time-triggered task, four message-triggered tasks and the ODS. The ODS contains data that are periodically read from the GPS/INS, AHRS and SWM. The following describes the roles of these five tasks and the data contained in the ODS.

- GPS/INSReader (IO-SvM): This task collects data sent from the GPS/INS device, which periodically sends temporal and spatial data at 10 Hz. The data mainly consists of information on altitude, position and velocity for each direction.
- AHRSReader (IO-SvM): This task collects data sent from the AHRS device, which periodically sends attitude information of the aircraft at 20 Hz.
- SWMReader (IO-SvM): This task collects data sent from the SWM at 10 Hz. The data consists of the current values of the four control points (Cyclic-lateral, Cyclic-longitudinal, Collective, Rudder).
- GCSReader (SvM): This task receives various flight commands from the GCS sporadically. Commands from the GCS include information on flight mode, target positions, switch-to-manual-mode, etc.
- FC (SpM): FC means Flight Control; this task runs with a 20 Hz period and a 40 ms deadline. It calculates the next values for the four control points from the ODS and sends the control values back to the SWM for controlling the helicopter.
- ODS: Besides the data collected by the reader tasks, the ODS also contains information on the current flight mode, the next values for the control points and capability parameters of the aircraft.

The frequencies of the SvMs and the FC-SpM can be changed according to the actual frequencies of the real devices and the capability of the CPU used. The frequencies of the IO-SvMs above are set to those of the physical devices that will be used in the field test. The calculations performed by the FC for the four flight modes consist of four basic auto-flight rule operations: SetForwardSpeed, SetSideSpeed, SetAltitude and SetHeading. SetForwardSpeed and SetSideSpeed generate the two-axis Cyclic values for forward and side speeds. SetAltitude generates a Collective value to rise, descend or preserve the current altitude. Finally, SetHeading generates a value to change heading. For each flight mode, an appropriate combination of these four basic operations is used.
For example, in the point-navigation mode, the SetHeading operation generates a value for heading (rudder) to the target position and the SetForwardSpeed operation generates values for the two Cyclic control points for the maximum speed to the target. To avoid too rapid changes of attitude, upper and lower limits are imposed on the values generated by the four basic operations, because the maximum and minimum values for Cyclic Lateral/Longitudinal, Collective and Rudder depend on the kind of aircraft [6].

3.4 GCS (Ground Control System)

The GCS is an application tool with which a user can monitor the UAV status and send control messages to the FCC. Our GCS provides a graphical interface for easy control. A protocol for the wireless communication devices (RF modem) between the UAV and the GCS has been designed. There are two types of packet: the to-GCS-packet for monitoring and the from-GCS-packet for control. Both packets consist of a header, a length, status or control data, and a checksum in the tail.

- to-GCS-packet
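The saturation step applied to the outputs of the four basic operations might be sketched as follows. The limit values are placeholders, since the real maxima and minima depend on the airframe [6]:

```python
def clamp(v, lo, hi):
    """Saturate a control value to its aircraft-dependent limits."""
    return max(lo, min(hi, v))

# Illustrative normalized limits; the real values depend on the aircraft [6].
LIMITS = {
    "cyclic_lat": (-1.0, 1.0),
    "cyclic_lon": (-1.0, 1.0),
    "collective": (0.0, 1.0),
    "rudder":     (-1.0, 1.0),
}

def fc_outputs(raw):
    """Apply the limits to the raw values produced by the four basic
    operations (SetForwardSpeed, SetSideSpeed, SetAltitude, SetHeading)."""
    return {name: clamp(raw[name], *LIMITS[name]) for name in LIMITS}
```

Saturating every output in one place guarantees that no combination of rule operations can command an attitude change outside what the airframe tolerates.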
The FCC system transmits to-GCS-packets containing the UAV status data described in Table 2 to the GCS at 20 Hz.

Table 2. UAV status data

- Attitude
- Location
- Velocity
- Control signal
- FCC system
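The packet layout described above (header, length, data, checksum in the tail) might look like the following sketch. The sync byte and the additive checksum are our assumptions; the paper does not specify the exact encoding:

```python
import struct

HEADER = 0xA5  # illustrative sync byte; the real protocol value is not given

def make_to_gcs_packet(payload: bytes) -> bytes:
    """Encode header | length | payload | checksum with a one-byte
    additive checksum over everything before it."""
    body = struct.pack("BB", HEADER, len(payload)) + payload
    checksum = sum(body) & 0xFF
    return body + struct.pack("B", checksum)

def parse_packet(pkt: bytes) -> bytes:
    """Validate the header and checksum, then return the payload."""
    header, length = struct.unpack_from("BB", pkt)
    if header != HEADER or (sum(pkt[:-1]) & 0xFF) != pkt[-1]:
        raise ValueError("bad packet")
    return pkt[2:2 + length]
```

A tail checksum lets the GCS drop frames corrupted on the RF modem link before any field is interpreted.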
- from-GCS-packet

A from-GCS-packet contains a command message to control the UAV flight mode described in section 3.2.

- User Interface

The user interface has been designed for prompt monitoring and convenient control. Figs. 5 and 6 show the user interface of our GCS. Various status data including the
Fig. 5. Main interface window
Fig. 6. Control signal window
attitude of a UAV contained in the downlink packets are displayed in 2D/3D/bar graphs and gauges of the GCS for easy analysis and debugging of the FCC control logic. The GCS has six widgets for control operations such as landing, taking off, hovering, moving, self-return and status update.
4 Experiments with the Hardware-in-the-Loop Simulation System

To test and verify the OFP in the TMO scheme, we used a HILS system with the open-source FlightGear v0.9.10 simulator. The FlightGear simulator supports various flying-object models, 3-D regional environments and model-dependent algorithms. In the HILS environment, the OFP receives the GPS/INS, AHRS and SWM information from the FlightGear simulator. Fig. 7 shows the HILS architecture. For the FCC, a board with the ST Thomson DX-66 STPC Client chip has been used to enhance floating-point operations. Among the various flight-stability tests performed, the results of hovering and of heading in the point-navigation mode are shown in Figs. 8 and 9. In the hovering test, the helicopter takes off at an altitude of 7 m, hovers at an altitude of 17 m for 10 seconds, and then auto-lands. Fig. 8 shows the desired references and the actual responses of the HILS system in this scenario; we can see a maximum deflection of almost 0.5 m when hovering. This result is tolerable in our application.
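The basic HILS exchange can be sketched as a loop in which the simulator stands in for the real sensor and actuator devices. The function signatures below are illustrative, not FlightGear's actual I/O interface, which depends on the configured protocol:

```python
def hils_loop(read_sensors, write_actuators, fcc_step, cycles):
    """One HILS cycle: read the simulated GPS/INS + AHRS + SWM state,
    run one FC computation, and feed the commands back into the
    simulated flight model."""
    for _ in range(cycles):
        sensors = read_sensors()       # simulator plays the sensor devices
        commands = fcc_step(sensors)   # OFP control calculation (FC task)
        write_actuators(commands)      # closes the loop in the simulator
```

Keeping the simulator behind plain read/write callables means the same `fcc_step` logic can later be pointed at the real devices for the field test.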
Fig. 7. Structure of the HILS
Fig. 8. Stability of hovering control
Fig. 9. Heading-control in the point navigation mode
5 Conclusions and Future Works

In this paper, we introduced the design and implementation of the OFP (Operational Flight Program) for an unmanned helicopter's navigation based on the well-known TMO scheme and RT-eCos3.0, and verified the system using HILS. Since the OFP can naturally be composed of time-triggered and message-triggered tasks, use of the TMO model, which supports object-oriented real-time concurrent programming, provides a well-structured and easily extendable scheme for designing and implementing the OFP. Moreover, we also found that the RTOS, RT-eCos3.0,
is very suitable for this kind of application because of its accurate timing behavior and small size. Having finished the HILS testing with the flying-object model supported by FlightGear, the remaining work is to make some minor corrections to the parameters and detailed algorithms that depend on the real flying-object model. Our plan is to start field testing with a real aircraft this year.

Acknowledgement. This work was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2009-C1090-0902-0026), and also partially supported by the MKE and the KETI, Korea, under the HVI (Human-Vehicle Interface) project (100333312).
References

1. Kim, D.H., Nodir, K., Chang, C.H., Kim, J.G.: HELISCOPE Project: Research Goal and Survey on Related Technologies. In: 12th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing, pp. 112–118. IEEE Computer Society Press, Tokyo (2009)
2. Kim, J.H., Kim, H.J., Park, J.H., Ju, H.T., Lee, B.E., Kim, S.G., Heu, S.: TMO-eCos2.0 and Its Development Environment for Timeliness-Guaranteed Computing. In: 1st Software Technologies for Dependable Distributed Systems, pp. 164–168. IEEE Computer Society Press, Tokyo (2009)
3. Kim, K.H., Kopetz, H.: A Real-Time Object Model RTO.k and an Experimental Investigation of Its Potentials. In: 18th IEEE Computer Software & Applications Conference, pp. 392–402. IEEE Computer Society Press, Los Alamitos (1994)
4. Jenks, S.F., Kim, K.H., et al.: A Middleware Model Supporting Time-Triggered Message-Triggered Objects for Standard Linux Systems. Real-Time Systems – J. of Time-Critical Computing Systems 36, 75–99 (2007)
5. Kim, J.G., et al.: TMO-eCos: An eCos-Based Real-Time Micro Operating System Supporting Execution of a TMO Structured Program. In: 8th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing, pp. 182–189. IEEE Computer Society Press, Seattle (2005)
6. Kim, S.P.: Guide and Control Rules for an Unmanned Helicopter. In: 2nd Workshop on HELISCOPE, pp. 1–12. ITRC, Konkuk University, Seoul (2008)
7. Kim, H.J., Kim, J.G., et al.: An Efficient Task Serializer for Hard Real-Time TMO Systems. In: 11th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing, pp. 405–413. IEEE Computer Society Press, Orlando (2008)
Energy-Efficient Process Allocation Algorithms in Peer-to-Peer Systems

Ailixier Aikebaier, Tomoya Enokido, and Makoto Takizawa
Abstract. Information systems are composed of various types of computers interconnected in networks. In addition, information systems are shifting from the traditional client-server model to the peer-to-peer (P2P) model. P2P systems are scalable and fully distributed, without any centralized coordinator. It is getting more significant to discuss how to reduce the total electric power consumption of the computers in information systems, in addition to developing distributed algorithms that minimize the computation time. In this paper, we do not discuss the micro level, like the hardware specification of each computer. We discuss a simple model that shows the relation between the computation and the total power consumption of multiple peer computers performing various types of processes at the macro level. We also discuss algorithms for allocating a process to a computer so that the deadline constraint is satisfied and the total power consumption is reduced.
1 Introduction

Information systems are getting scalable, so that various types of computational devices like server computers and sensor nodes [1] are interconnected in various types of networks, like wireless and wired networks. Various types of distributed algorithms [6] have so far been developed, e.g., for allocating computation resources to processes and synchronizing multiple conflicting processes, to minimize the computation and response times, maximize the throughput, and minimize the memory space. On the other hand, green IT technologies [4] have to be realized in order to reduce the consumption of natural resources like oil and resolve air pollution on the Earth. In information systems, the total electric power consumption has to be reduced. Various hardware technologies like low-power-consumption CPUs [2,3] are now being developed. Biancini et al. [8] discuss how to reduce the power consumption of a data center with a cluster of homogeneous server computers by turning off servers which are not required for executing a collection of web requests. Various types of algorithms to find the required number of servers among homogeneous and heterogeneous servers are discussed [5,9]. In wireless sensor networks [1], routing algorithms to reduce the power consumption of the battery in a sensor node are discussed.

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 12–23, 2009.
© IFIP International Federation for Information Processing 2009

In this paper, we consider peer-to-peer (P2P) overlay networks [7] where computers are in nature heterogeneous and cannot be turned off by persons other than
the owners. In addition, P2P overlay networks are scalable and fully distributed, with no centralized coordination. Each peer has to find peers which not only satisfy the QoS requirements but also spend less electric power. First, we discuss a model for performing processes on a computer. Then, we measure how much electric power one type of computer spends to perform a Web application process. Next, we discuss a simple power consumption model for performing a process on a computer, based on experiments with servers and personal computers. In the simple model, a computer consumes the maximum electric power if at least one process is performed; otherwise, the computer consumes the minimum electric power, i.e. it is in the idle state. According to our experiments, the simple model fits personal computers with one CPU, independently of the number of cores. A request to perform a process, like a Web page request, is allocated to one of the computers in the network. We discuss a laxity-based allocation algorithm to reduce not only the execution time but also the power consumption in a collection of computers. In the laxity-based algorithm, processes are allocated to computers so that the deadline constraints are satisfied, based on the laxity concept. In section 2, we present a system model for performing a process on a computer. In section 3, we discuss a simple power consumption model obtained from the experiments. In section 4, we discuss how to allocate each process to a computer so as to reduce the power consumption. In section 5, we evaluate the process allocation algorithms.
2 Computation Model

2.1 Normalized Computation Rate

A system S includes a set C of computers c1, ..., cn (n ≥ 1) interconnected in reliable networks. A user issues a request to perform a process, like a Web page request. The process is performed on one computer. There is a set P of application processes p1, ..., pm (m ≥ 1). The term process means an application process in this paper. We assume each process ps can be performed on any computer in the computer set C. A user issues a request to perform a process ps to a load balancer K. For example, a user issues a request to read a Web page on a remote computer. The load balancer K selects one computer ci in the set C for a process ps and sends the request to the computer ci. On receipt of the request, the process ps is performed on the computer ci and a reply, e.g. a Web page, is sent back to the requesting user. Requests from multiple users are performed on a computer ci. A process being performed at time t is referred to as current. A process which has already terminated before time t is referred to as previous. Let Pi(t) (⊆ P) be the set of current processes on a computer ci at time t. Ni(t) shows the number of current processes in the set Pi(t), Ni(t) = |Pi(t)|. Let P(t) show the set of all current processes on the computers in the system S at time t, ∪i=1,...,n Pi(t). Suppose a process ps is performed on a computer ci. Here, Tis is the total computation time of ps on ci and minTis shows the computation time Tis where the process ps is exclusively performed on ci, i.e. without any other process. Hence, minTis ≤ Tis for every process ps. Let maxTs and minTs be max(minT1s, ..., minTns) and min(minT1s, ..., minTns), respectively. If a process ps is exclusively performed on the fastest computer ci and the slowest computer cj, minTs = minTis and maxTs =
A. Aikebaier, T. Enokido, and M. Takizawa
minTjs, respectively. A time unit (tu) shows the minimum time to perform the smallest process. We assume it takes at least one time unit [tu] to perform a process on any computer, i.e. 1 ≤ minTs ≤ maxTs. The average computation rate (ACR) Fis of a process ps on a computer ci is defined as follows:

Fis = 1 / Tis [1/tu].    (1)
Here, 0 < Fis ≤ 1/minTis ≤ 1. The maximum ACR maxFis is 1/minTis. Fis shows what fraction of the total amount of computation of a process ps is performed in one time unit. maxFs = max(maxF1s, ..., maxFns) and minFs = min(maxF1s, ..., maxFns). maxFs and minFs show the maximum ACRs maxFis and maxFjs of the fastest computer ci and the slowest computer cj, respectively. The more processes are performed on a computer ci, the longer it takes to perform each of the processes on the computer ci. Let αi(t) indicate the degradation rate of a computer ci at time t (0 ≤ αi(t) ≤ 1). αi(t1) ≥ αi(t2) if Ni(t1) ≤ Ni(t2) for every pair of different times t1 and t2. We assume αi(t) = 1 if Ni(t) ≤ 1 and αi(t) < 1 if Ni(t) > 1. Suppose it takes 50 [msec] to exclusively perform a process ps on a computer ci. Here, minTis = 50 and Fis = maxFis = 1/50 [1/msec]. Suppose it takes 75 [msec] to perform the process ps while other processes are performed on the computer ci. Here, Fis = 1/75 [1/msec]. Hence, αi(t) = 50/75 ≈ 0.67. We define the normalized computation rate (NCR) fis(t) of a process ps on a computer ci at time t as follows:

fis(t) = αi(t) · maxFis / maxFs = αi(t) · minTs / minTis [1/tu]    (2)

For the fastest computer ci, fis(t) = 1 if αi(t) = 1, i.e. Ni(t) = 1. If a computer ci is faster than another computer cj and the process ps is exclusively performed on ci and cj at times ti and tj, respectively, fis(ti) > fjs(tj). If a process ps is exclusively performed on a computer ci, αi(t) = 1 and fis(t) = maxFis / maxFs. The maximum NCR maxfis is maxFis / maxFs, and 0 ≤ fis(t) ≤ maxfis ≤ 1. The NCR fis(t) shows how many steps of a process ps are performed on a computer ci at time t. The average computation rate (ACR) Fis depends on the size of the process ps while fis(t) depends on the speed of the computer ci.
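The rates defined above can be sketched in code. The following helper functions are illustrative only (hypothetical names, not from the paper), using the 50/75 [msec] example from the text:

```python
# Illustrative sketch (hypothetical helpers, not the paper's code):
# the ACR F_is, degradation rate alpha_i, and NCR f_is of Section 2.1.

def acr(T_is):
    """Average computation rate F_is = 1 / T_is [1/tu] (equation (1))."""
    return 1.0 / T_is

def ncr(alpha_i, minT_s, minT_is):
    """Normalized computation rate f_is(t) = alpha_i(t) * minT_s / minT_is (equation (2))."""
    return alpha_i * minT_s / minT_is

# Numbers from the example in the text: minT_is = 50 [msec] when run exclusively,
# T_is = 75 [msec] when run with other processes, hence alpha_i = 50/75.
minT_is, T_is = 50.0, 75.0
alpha_i = minT_is / T_is
print(round(alpha_i, 2))   # 0.67
print(acr(minT_is))        # maxF_is = 1/50 = 0.02
```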
Next, suppose that a process ps is started and terminated on a computer ci at times stis and etis, respectively. Here, the total computation time Tis is etis − stis. The following formulas hold for the degradation rate αi(t) and the NCR fis(t):

∫[stis, etis] (αi(t) / minTis) dt = 1    (3)

∫[stis, etis] fis(t) dt = minTs · ∫[stis, etis] (αi(t) / minTis) dt = minTs    (4)

If there is no other process, i.e. αi(t) = 1 on the computer ci, fis(t) = maxFis / maxFs = minTs / minTis. Hence, Tis = etis − stis = minTis. If other processes are performed, Tis = etis − stis > minTis. Here, minTs shows the total amount of computation to be performed by the process ps.
Figure 1 shows the NCRs fis(t) and fjs(t) of a process ps which is exclusively performed on a pair of computers ci and cj, respectively. Here, the computer ci is the fastest in the computer set C. The NCR fis(t) = maxfis = 1 for stis ≤ t ≤ etis, and Tis = etis − stis = minTs. On the slower computer cj, fjs(t) = maxfjs < 1 and Tjs = etjs − stjs > minTs. Here, maxfis · minTis = minTs = maxfjs · minTjs from equation (4). The areas under fis(t) and fjs(t) have the same size minTs (= Tis). Figure 2 shows the NCR fis(t) of a process ps on a computer ci at time t, where multiple processes are performed concurrently with the process ps. fis(t) is smaller than maxfis if other processes are concurrently performed on the computer ci. Here, Tis = etis − stis > minTs and ∫[stis, etis] fis(t) dt = minTs.

Fig. 1. Normalized computation rates (NCRs)

Fig. 2. Normalized computation rate fis(t)
Next, we define the computation laxity Lis(t) [tu] of a process ps on a computer ci at time t as follows:

Lis(t) = minTs − ∫[stis, t] fis(x) dx.    (5)
The laxity Lis(t) shows how much computation the computer ci still has to spend to complete a process ps at time t. Lis(stis) = minTs and Lis(etis) = 0. If the process ps were exclusively performed on the computer ci, the process ps would be expected to terminate at time etis = t + Lis(t).

2.2 Simple Computation Model

Computers differ in performance. First, we consider a simple computation model. In the simple computation model, a computer ci satisfies the following properties:

[Simple computation model]
1. maxfis = maxfiu for every pair of different processes ps and pu performed on a computer ci.
2. Σ[ps ∈ Pi(t)] fis(t) = maxfi.    (6)
The maximum normalized computation rate (NCR) maxfi of a computer ci is maxfis for any process ps. This means the computer ci performs any process at the maximum clock frequency. Pi(t) shows the set of processes being performed on a computer ci at time t. In the simple computation model, we assume the degradation rate αi(t) = 1. On a computer ci, each process ps starts at time stis and terminates at time etis. We discuss how the NCR fis(t) of each process ps changes in the presence of multiple processes on a computer ci. A process ps is referred to as preceding another process pu on a computer ci if etis < stiu. A process ps is interleaved with another process pu on a computer ci iff etiu ≥ etis ≥ stiu. The interleaving relation is symmetric but not transitive. A process ps is referred to as connected with another process pu iff (1) ps is interleaved with pu or (2) ps is interleaved with some process pv and pv is connected with pu. The connected relation is symmetric and transitive. A schedule schi of a computer ci is a history of the processes performed on the computer ci. Processes in schi are partially ordered by the precedence relation and related by the connected relation. Here, let Ki(ps) be the closure subset of the processes in the schedule schi which are connected with a process ps, i.e. Ki(ps) = {pu | pu is connected with ps}. Ki(ps) is an equivalence class of the connected relation, i.e. Ki(ps) = Ki(pu) for every process pu in Ki(ps). Ki(ps) is referred to as a knot in schi. The schedule schi is divided into knots Ki1, ..., Kili which are pairwise disjoint. Let pu and pv be a pair of processes in a knot Ki(ps) where the starting time stiu is the minimum and the termination time etiv is the maximum. That is, the process pu is performed first and the process pv finishes last in the knot Ki(ps). The execution time TKi of the knot Ki(ps) is etiv − stiu.
Let KPi(t) be the current knot, i.e. the set of current or previous processes which are connected with at least one current process in Pi(t) at time t. In the simple model, the following theorem holds directly from the simple model properties:

[Theorem] Let Ki be a knot in a schedule schi of a computer ci. The execution time TKi of the knot Ki is Σ[ps ∈ Ki] minTis.

Let us consider a knot Ki of three processes p1, p2, and p3 on a computer ci as shown in Figure 3 (1). Here, Ki = {p1, p2, p3}. First, suppose that the processes p1, p2, and p3 are serially performed, i.e. eti1 = sti2 and eti2 = sti3. Here, the execution time TKi is eti3 − sti1 = minTi1 + minTi2 + minTi3. Next, the three processes p1, p2, and p3 start at time st and terminate at time et as shown in Figure 3 (2). Here, the execution time TKi = minTi1 + minTi2 + minTi3. Lastly, let us consider a knot Ki where the processes are concurrently performed. The processes p1, p2, and p3 start at the same time, sti1 = sti2 = sti3, are concurrently performed, and the process p3 terminates last at time eti3 after p1 and p2, as shown in Figure 3 (3). Here, the execution time TKi of the knot Ki is eti3 − sti1 = minTi1 + minTi2 + minTi3. The current knot KPi(t1) is {p1, p2, p3} at time t1 and KPi(t2) is {p1, p2} at time t2.

Fig. 3. Execution time of a knot: (1) serial execution, (2) parallel execution, (3) mixed execution

It depends on the scheduling algorithm how large each NCR fis(t) is in equation (6), fis(t) = αis · maxfi where Σ[ps ∈ Pi(t)] αis = 1. In the fair scheduler, each fis(t) is the same as the others, i.e. αis = 1/|Pi(t)|:

fis(t) = maxfi / |Pi(t)|.    (7)
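The partition of a schedule into knots can be sketched in code. The following is an illustrative sketch (the representation of each process as a (start, end) interval on one computer is an assumption for illustration, not the paper's data structure); it groups processes related by the transitive closure of the interleaving relation:

```python
def knots(schedule):
    """Partition a schedule (list of (start, end) intervals) into knots:
    maximal groups of processes related by the transitive closure of the
    interleaving (interval overlap) relation."""
    groups = []
    current, current_end = [], None
    for st, et in sorted(schedule):
        if current and st <= current_end:
            current.append((st, et))           # interleaved with the current knot
            current_end = max(current_end, et)
        else:
            if current:
                groups.append(current)         # a gap closes the previous knot
            current, current_end = [(st, et)], et
    if current:
        groups.append(current)
    return groups

# Three serially executed processes (et of one = st of the next) form one knot;
# a process starting after a gap begins a new knot.
print(knots([(0, 5), (5, 8), (10, 12)]))  # [[(0, 5), (5, 8)], [(10, 12)]]
```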
2.3 Estimated Termination Time

Suppose there are a set P of processes {p1, ..., pm} and a set C of computers {c1, ..., cn} in a system S. We discuss how to allocate a process ps in the process set P to a computer ci in the computer set C. Here, we assume the system S to be heterogeneous, i.e. some pairs of computers ci and cj have different specifications and performance. Suppose a process ps is started on a computer ci at time stis. A set Pi(t) of current processes is being performed on the computer ci at time t.

[Computation model] Let KPi(t) = {pi1, ..., pili} be the current knot of processes, where the starting time is st. The total execution time T(st, t) of the processes in the current knot KPi(t) is given as:

T(st, t) = minTi1 + minTi2 + · · · + minTili    (8)
In Figure 3 (3), t1 shows the current time. A process p1 is first initiated at time sti1 and terminates before time t1 on a computer ci. A pair of processes p2 and p3 are currently performed at time t1. Here, KPi(t1) is the current knot {p1, p2, p3} at time t1, and T(sti1, t1) = minTi1 + minTi2 + minTi3. The execution time from time sti1 to t1 is t1 − sti1. At time t1, we can estimate that the processes p2 and p3, which are concurrently performed, terminate at time t1 + T(sti1, t1) − (t1 − sti1) = sti1 + T(sti1, t1). sti1 is referred to as the starting time of the current knot KPi(t). No process is performed for some period before sti1, and some process is performed at every time from sti1 to t. The estimated termination time ETi(t) of the current processes on a computer ci means the time when every current process of time t terminates if no other process were performed after time t. ETi(t) is given as follows:

ETi(t) = t + T(stis, t) − (t − stis) = stis + T(stis, t)    (9)
Suppose a new process ps is started at the current time t. By using equation (9), we can obtain the estimated termination time ETi(t) of the current processes on each computer ci at time t. From the computation model, the estimated termination time ETis(t) of a new process ps starting on a computer ci at time t is given as follows:

ETis(t) = ETi(t) + minTis    (10)
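Equations (9) and (10) can be illustrated with a small sketch (hypothetical names; we assume the minimum computation times of the processes in the current knot are known):

```python
def ET_i(knot_start, knot_min_times):
    """Estimated termination time of the current processes (equation (9)):
    ET_i(t) = st + T(st, t), where T(st, t) is the sum of the minimum
    computation times of the processes in the current knot."""
    return knot_start + sum(knot_min_times)

def ET_is(knot_start, knot_min_times, minT_is_new):
    """Estimated termination time of a newly started process ps (equation (10)):
    the current knot's estimated end plus the new process's minimum time."""
    return ET_i(knot_start, knot_min_times) + minT_is_new

# A knot started at time 10 with processes of minimum times 5, 3, and 4 [tu];
# a new process with minT_is = 6 [tu] is expected to end at 10 + 12 + 6 = 28.
print(ET_is(10, [5, 3, 4], 6))  # 28
```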
3 Simple Power Consumption Model

Suppose there are n (≥ 1) computers c1, ..., cn and m (≥ 1) processes p1, ..., pm. In this paper, we assume the simple computation model is taken for each computer, i.e. the maximum clock frequency is stable for each computer ci. Let Ei(t) show the electric power consumption of a computer ci at time t [W/tu] (i = 1, ..., n). maxEi and minEi indicate the maximum and minimum electric power consumption of a computer ci, respectively. That is, minEi ≤ Ei(t) ≤ maxEi. maxE and minE show max(maxE1, ..., maxEn) and min(minE1, ..., minEn), respectively. Here, minEi shows the power consumption of a computer ci which is in the idle state. We define the normalized power consumption rate (NPCR) ei(t) [1/tu] of a computer ci at time t as follows:

ei(t) = Ei(t) / maxE (≤ 1).    (11)
Let minei and maxei show the minimum power consumption rate minEi / maxE and the maximum one maxEi / maxE of the computer ci, respectively. If the fastest computer ci maximally consumes electric power at the maximum clock frequency, ei(t) = maxei = 1. For a lower-speed computer cj, i.e. maxfj < maxfi, ej(t) = maxej < 1. We propose two types of power consumption models for a computer ci, the simple and multi-level models. In the simple model, the NPCR ei(t) depends on how many processes are performed, as follows:

ei(t) = maxei if Ni(t) ≥ 1; minei otherwise.    (12)

This means that if one process is performed on a computer ci, the electric power is maximally consumed on the computer ci. Even if more than one process is performed, the maximum power is consumed on the computer ci. A personal computer with one CPU satisfies the simple model, as discussed in the experiments of the succeeding section. The total normalized power consumption TPCi(t1, t2) of a computer ci from time t1 to time t2 is given as follows:

TPCi(t1, t2) = ∫[t1, t2] ei(t) dt    (13)
Note that TPCi(t1, t2) ≤ t2 − t1. For the fastest computer ci, TPCi(t1, t2) = maxei · (t2 − t1) = t2 − t1 if at least one process is performed at every time from t1 to t2 in the simple model. Let Ki be a knot of a computer ci whose starting time is sti and termination time is eti. The normalized total power consumption of the computer ci to perform every process in the knot Ki is TPCi(sti, eti). In the simple model, TPCi(sti, eti) = ∫[sti, eti] maxei dt = (eti − sti) · maxei = Σ[ps ∈ Ki] minTis · maxei.
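Under the simple model, equations (12) and (13) reduce power accounting to counting busy time. A small illustrative sketch (hypothetical helpers, not the paper's code):

```python
def npcr(n_processes, min_e, max_e):
    """NPCR e_i(t) under the simple model (equation (12)): the maximum rate
    whenever at least one process runs, the minimum (idle) rate otherwise."""
    return max_e if n_processes >= 1 else min_e

def tpc_knot(knot_min_times, max_e):
    """Total normalized power consumption over a knot in the simple model:
    TPC_i = (sum of minT_is over the knot) * maxe_i."""
    return sum(knot_min_times) * max_e

print(npcr(3, 0.2, 0.5))         # 0.5: busy, regardless of how many processes run
print(tpc_knot([5, 3, 4], 0.5))  # 6.0: a knot of total length 12 [tu] at rate 0.5
```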
4 Process Allocation Algorithms

4.1 Round-Robin Algorithms

We consider two types of algorithms, the weighted round robin (WRR) [20] and weighted least connection (WLC) [21] algorithms. For each of the WRR and WLC algorithms,
we consider two cases, Per (performance) and Pow (power). In Per, the weight is given in terms of the performance ratio of the servers. That is, the higher the performance a server supports, the more processes are allocated to the server. In Pow, the weight is defined in terms of the power consumption ratio of the servers. The less power a server consumes, the more processes are allocated to the server.

4.2 Laxity-Based Algorithm

Some applications have a deadline constraint TCs on a process ps issued by the application, i.e. the process ps has to terminate by the deadline. Here, a process ps has to be allocated to a computer ci so that the process ps can terminate by the deadline TCs. Cs(t) denotes the set of computers which satisfy the condition TCs, i.e. Cs(t) = {ci | ETis(t) ≤ TCs}. That is, on a computer ci in Cs(t), the process ps is expected to terminate by the deadline TCs. Hence, if the process ps is allocated to one computer ci in Cs(t), the process ps can terminate before TCs. Next, we assume that the normalized power consumption rate (NPCR) ei(t) of each computer ci is given by equation (12) according to the simple model. We can estimate the total power consumption laxity leis(t) of a process ps between time t and ETis(t) at time t when the process ps is allocated to the computer ci [Figure 4]. leis(t) of the computer ci is given by equation (14):

leis(t) = maxei · (ETis(t) − t)    (14)
Suppose a process ps is issued at time t. A computer ci in the computer set C is selected for the process ps with the constraint TCs at time t as follows:

Alloc(t, C, ps, TCs) {
  Cs = φ; NoCs = φ;
  for each computer ci in C {
    if ETis(t) ≤ TCs, Cs = Cs ∪ {ci};
    else /* ETis(t) > TCs */ NoCs = NoCs ∪ {ci};
  }
  if Cs ≠ φ { /* candidate computers are found */
    computer = ci such that leis(t) is the minimum in Cs;
    return(computer);
  } else { /* Cs = φ */
    computer = ci such that ETis(t) is the minimum in NoCs;
    return(computer);
  }
}

Cs and NoCs are the sets of computers which can and cannot satisfy the constraint TCs, respectively. Here, Cs ∪ NoCs = C and Cs ∩ NoCs = φ. In the procedure Alloc, if there is at least one computer which can satisfy the time constraint TCs of the process ps, one of the computers with the minimum power consumption laxity is selected. If there is no computer which can satisfy the application time constraint TCs, one of the computers which can terminate the process ps earliest is selected in the computer set C.
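The Alloc procedure can be sketched directly in executable form. The following Python version is illustrative only: the data layout (dictionaries mapping each computer to its estimated termination time ETi(t), minimum computation time minTis, and maximum power rate maxei) is an assumption, not the paper's interface.

```python
def alloc(t, ET, minT, max_e, TC_s):
    """Laxity-based allocation (sketch of Alloc): t is the current time,
    ET[ci] = ET_i(t), minT[ci] = minT_is, max_e[ci] = maxe_i,
    TC_s is the deadline constraint of process ps."""
    Cs, NoCs = {}, {}
    for ci in ET:
        ET_is = ET[ci] + minT[ci]        # equation (10)
        le_is = max_e[ci] * (ET_is - t)  # power consumption laxity, equation (14)
        if ET_is <= TC_s:
            Cs[ci] = (le_is, ET_is)      # candidate: meets the deadline
        else:
            NoCs[ci] = ET_is
    if Cs:  # pick the candidate with the minimum power consumption laxity
        return min(Cs, key=lambda ci: Cs[ci][0])
    # otherwise pick the computer with the earliest estimated termination time
    return min(NoCs, key=lambda ci: NoCs[ci])

# Two computers: c1 is slower but frugal, c2 is faster but power-hungry.
choice = alloc(t=0,
               ET={"c1": 4, "c2": 2}, minT={"c1": 6, "c2": 3},
               max_e={"c1": 0.4, "c2": 1.0}, TC_s=10)
print(choice)  # prints "c1": both meet the deadline, but c1's laxity 4.0 < c2's 5.0
```

With a tighter deadline that no computer can meet (e.g. TC_s = 4), the sketch falls back to the earliest-terminating computer, as in the paper's else branch.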
Fig. 4. Estimation of power consumption
5 Evaluation

5.1 Environment

We measure how much electric power computers consume for Web applications. We consider a cluster system composed of Linux Virtual Server (LVS) systems which are interconnected in gigabit networks as shown in Figure 5. The NAT-based routing system VS-NAT [12] is used as the load balancer K. The cluster system includes three servers s1, s2, and s3, on each of which Apache 2.0 [11] is installed, as shown in Figure 5. The load generator server L first issues requests to the load balancer K. Then, the load balancer K assigns each request to one of the servers according to some allocation algorithm. Each server si compresses the reply file by using the Deflate module [13] on receipt of a request from the load generator server L. We measure the peak consumption of electric power and the average response time of each server si (i = 1, 2, 3). The power consumption ratio of the servers s1, s2, and s3 is 0.9 : 0.6 : 1 as shown in Figure 5.

Fig. 5. Cluster system: the load generation server and the load balancer are connected to the servers s1, s2, and s3 through gigabit switches; the servers s1, s2, and s3 have 1, 1, and 2 CPUs (1, 1, and 2 cores), respectively.

On receipt of a Web request, each server si finds the reply file of the request and compresses the reply file by using the Deflate module. The size of the original reply file is 1 Mbyte and the compressed reply file is 7.8 Kbyte in size. The Apache benchmark software [10] is used to generate Web requests, where a total of 10,000 requests are issued and 100 requests are concurrently issued to each server. The performance ratio of the servers s1, s2, and s3 is 1 : 1.2 : 4 as shown in Figure 5. The server s3 is the fastest and consumes the most electric power. The server s1 is slower than s3 but consumes more electric power than the server s2.

5.2 Experimental Results

If the weight is based on the performance ratio (Per), the requests are allocated to the servers s1, s2, and s3 with the ratio 1 : 1.2 : 4, respectively. On the other hand, if the weight is based on the power consumption ratio (Pow), the requests are allocated to the servers s1, s2, and s3 with the ratio 0.9 : 0.6 : 1, respectively. Here, by using the Apache benchmark software, the load generation server L transmits a total of 100,000 requests to the servers s1, s2, and s3, where six requests are concurrently issued to the load balancer K. The total power consumption of the cluster system and the average response time of a request from a Web server are measured. We consider a static Web server where the size of a reply file for a request is not dynamically changed, i.e. the compressed version of the same HTML reply file is sent back to each user. In this experiment, the original HTML file and the compressed file are 1,025,027 [Byte] and 13,698 [Byte] in size, respectively. On the load balancer K, the following process allocation algorithms are adopted: the weighted round-robin (WRR) [20] algorithms, WRR-Per and WRR-Pow, and the weighted least connection (WLC) [21] algorithms, WLC-Per and WLC-Pow. Figure 6 shows the total power consumption [W/H] of the cluster system over time.
WRR-Per and WLC-Per show the total power consumption of the servers in the WRR and WLC algorithms with the performance-based weight (Per), respectively. WRR-Pow and WLC-Pow indicate the power consumption of WRR and WLC with the power-consumption-based weight (Pow), respectively.

Fig. 6. Total power consumption [W/H] of the cluster system over time for WRR-Per, WLC-Per, WRR-Pow, and WLC-Pow

In WRR-Per and WLC-Per, the total execution time and peak power consumption are almost the same. In addition, the total execution time and peak power consumption are almost the same in WRR-Pow and WLC-Pow. This experimental result shows that the total power consumption and total execution time are almost the same for the two allocation algorithms if the same weight ratio is used. If the weight of the load balancing algorithm is given in terms of the performance ratio (Per), the peak power consumption is higher than with the power consumption ratio (Pow). However, with Pow, the total execution time is longer than with Per. Here, the total power consumption is calculated by multiplying the execution time and the power consumption. The experiment shows that the total power consumption is reduced by using the performance-based weight (Per).
6 Concluding Remarks

In this paper, we discussed a simple power consumption model of computers. We discussed the laxity-based algorithm to allocate a process to a computer so that the deadline constraint is satisfied and the total power consumption is reduced on the basis of the laxity concept. We obtained experimental results on the electric power consumption of Web servers. We evaluated the simple model through the experiment on the PC cluster and showed that the PC cluster follows the simple model. We are now considering other types of applications, like database transactions, and measuring the power consumption of multi-CPU servers.
References

1. Akyildiz, I.F., Kasimoglu, I.H.: Wireless Sensor and Actor Networks: Research Challenges. Ad Hoc Networks Journal (Elsevier) 2, 351–367 (2004)
2. AMD, http://www.amd.com/
3. Intel, http://www.intel.com/
4. Green IT, http://www.greenit.net
5. Heath, T., Diniz, B., Carrera, E.V., Meira Jr., W., Bianchini, R.: Energy Conservation in Heterogeneous Server Clusters. In: PPoPP 2005: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 186–195 (2005)
6. Lynch, N.A.: Distributed Algorithms, 1st edn. Morgan Kaufmann Publishers, San Francisco (1997)
7. Montresor, A.: A Robust Protocol for Building Superpeer Overlay Topologies. In: Proc. of the 4th International Conference on Peer-to-Peer Computing, pp. 202–209 (2004)
8. Bianchini, R., Rajamony, R.: Power and Energy Management for Server Systems. IEEE Computer 37(11) (November 2004); Special issue on Internet data centers
9. Rajamani, K., Lefurgy, C.: On Evaluating Request-Distribution Schemes for Saving Energy in Server Clusters. In: Proc. of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 111–122 (2003)
10. ab - Apache HTTP server benchmarking tool, http://httpd.apache.org/docs/2.0/programs/ab.html
11. Apache 2.0, http://httpd.apache.org/
12. VS-NAT, http://www.linuxvirtualserver.org/
13. Apache Module mod_deflate, http://httpd.apache.org
14. Aron, M., Druschel, P., Zwaenepoel, W.: Cluster Reserves: A Mechanism for Resource Management in Cluster-Based Network Servers. In: Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pp. 90–101 (2000)
15. Bevilacqua, A.: A Dynamic Load Balancing Method on a Heterogeneous Cluster of Workstations. Informatica 23(1), 49–56 (1999)
16. Bianchini, R., Carrera, E.V.: Analytical and Experimental Evaluation of Cluster-Based WWW Servers. World Wide Web Journal 3(4) (December 2000)
17. Heath, T., Diniz, B., Carrera, E.V., Meira Jr., W., Bianchini, R.: Self-Configuring Heterogeneous Server Clusters. In: Proceedings of the Workshop on Compilers and Operating Systems for Low Power (2003)
18. Rajamani, K., Lefurgy, C.: On Evaluating Request-Distribution Schemes for Saving Energy in Server Clusters. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, pp. 111–122 (2003)
19. Colajanni, M., Cardellini, V., Yu, P.S.: Dynamic Load Balancing in Geographically Distributed Heterogeneous Web Servers. In: Proceedings of the 18th International Conference on Distributed Computing Systems, p. 295 (1998)
20. Weighted Round Robin (WRR), http://www.linuxvirtualserver.org/docs/scheduling.html
21. Weighted Least Connection (WLC), http://www.linuxvirtualserver.org/docs/scheduling.html
Power Modeling of Solid State Disk for Dynamic Power Management Policy Design in Embedded Systems

Jinha Park1, Sungjoo Yoo1, Sunggu Lee1, and Chanik Park2

1 Department of Electronic and Electrical Engineering, POSTECH (Pohang University of Science and Technology), 790-784 Hyoja-dong, Nam-gu, Pohang, Korea {litrine,sungjoo.yoo,slee}@postech.ac.kr
2 Flash Software Team, Memory Division, DS Solution, Samsung Electronics, Hwasung, Gyeonggi-do, Korea [email protected]
Abstract. Power consumption has now become the most critical performance-limiting factor for solid state disks (SSDs) in embedded systems. It is imperative to devise design methods and architectures for power-efficient SSD designs. In our work, we present the first step towards low power SSD design, i.e., power estimation of SSD. We present a practical approach to SSD power estimation which tries to keep the advantage of real measurement, i.e., accuracy, while overcoming its limitations, i.e., long execution time and lack of repeatability (and high cost), by a trace-based simulation. Since it is based on real measurements, it takes into account the power consumption of the SSD controller as well as that of the Flash memories. We show the effectiveness of the presented method in designing a dynamic power management policy for SSD.

Keywords: Solid state disk, power consumption, measurement, trace-based simulation, dynamic power management, low power states.
SSD can achieve higher performance than HDD mostly by parallel accesses to relatively low speed Flash devices (e.g., achieving a throughput higher than 240MB/s by accessing 8 Flash devices at 33Mbytes/sec each). High performance SSD inherits the same level of power consumption constraints that traditional HDD has in embedded systems. For instance, SSD has a peak power budget of about 1A and an average power consumption budget of about 1.2W in typical notebook PCs [2] and is expected to have a much lower power budget in smart phones. Further performance improvement in SSD will require more power consumption, especially due to more aggressive parallel accesses to Flash devices. However, such an aggressive parallel scheme is not easily applicable given the power consumption constraints of embedded systems: there is little room left in the peak and average power budgets to be spent, via aggressively parallel accesses, on further performance improvement of SSD. Power consumption has now become the most critical performance-limiting factor in SSD. It is imperative to devise design methods and architectures for power efficient SSD designs. There has been little work on low power SSD designs. In our work, we present the first step towards low power SSD design, i.e., power estimation of SSD. We also report our application of the power estimation method to a low power SSD design, where a parameter of dynamic power management is explored in a fast and cost effective way with the help of the presented power estimation method.
1.2 Power Estimation of SSD

The design space for low power SSD design is huge due to the various possible design choices, e.g., parameter sets in dynamic power management (e.g., time-out parameters, DPM policies, etc.), Flash Translation Layer (FTL) algorithms, and SSD controller architectural parameters (e.g., I/O frequency of Flash devices, DRAM buffer size and caching/prefetch methods in the controller, etc.).1 When exploring the design space, there are two typical methods of evaluating the power consumption of design candidates: real measurement and full system simulation with a power model. Real measurement, which gives accurate power information, is in use in SSD product designs. There are two critical problems with real measurements: long design cycles (e.g., hours of SSD execution are required for the evaluation of one design candidate) and changing battery characteristics over repeated runs [3].2 The two problems prevent designers from performing extensive design space exploration, which may require evaluating numerous design candidates. Thus, it is impractical to evaluate all the choices with real SSD executions due to the long execution time and the high cost of batteries.3 The second candidate, cycle-level full system simulation with a power model, is prohibitive due to too long a simulation runtime. Assuming a 200MHz SSD controller and ~100 Kcycles/sec of simulation speed, it may take 125 days to simulate less than 1.5 hours of real SSD execution. Thus, SSD power estimation based on a detailed simulation model may not be practical in real SSD designs.

1 In this paper, we consider only software-based solutions. There can be hardware design candidates such as # of channels, # of ways/channel, DRAM capacity/speed, # of CPUs, etc.
2 Battery lifetime measurements require a procedure of fully charging and then completely discharging the battery while running the system. The battery characteristics degrade significantly after several (~10) such procedures. Thus, a new battery needs to be used for subsequent battery lifetime measurements.
3 Statistical approximation and optimization methods, e.g., response surface models, can also be applied to reduce the number of real executions.

1.3 Our Contribution

In this paper, we present a practical approach to SSD power estimation which tries to keep the advantage of real measurement, i.e., accuracy, while overcoming its limitations, i.e., long execution time and lack of repeatability (and high cost). The power estimation method takes as input real power measurements and an SSD access trace. Then, it performs a trace-based simulation of the SSD operation, gathering the information on power consumption. Finally, it gives as output a power profile (power consumption over time). Since it is based on real measurements, it takes into account the power consumption of the SSD controller as well as that of the Flash memories. The presented method of SSD power estimation gives fast estimation, via trace-based simulation, and accuracy, based on real power measurements. We also present an application of our method to designing a DPM policy for SSD. The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces Flash memory and SSD operations. Section 4 explains the power estimation method. Section 5 reports experimental results including the application to DPM policy design. Section 6 concludes the paper.
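The core idea above — replay an access trace and charge measured per-state power to each interval — can be sketched as follows. The state names and data layout are assumptions for illustration; the actual estimation method is detailed in Section 4.

```python
def power_profile(trace, measured_power):
    """Trace-based power estimation sketch: trace is a list of
    (state, duration_in_seconds) intervals from SSD operation;
    measured_power maps each state to watts obtained from real measurements
    (so controller and Flash power are both included).
    Returns the per-interval profile and the total energy in joules."""
    profile, energy = [], 0.0
    for state, dur in trace:
        p = measured_power[state]
        profile.append((state, dur, p))  # power consumption over time
        energy += p * dur
    return profile, energy

# Hypothetical measured numbers: 1.2 W active, 0.1 W idle.
_, energy = power_profile([("active", 2.0), ("idle", 3.0)],
                          {"active": 1.2, "idle": 0.1})
print(round(energy, 3))  # 2.7
```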
2 Related Work

There have been several studies on power characterization and optimization of HDDs [4][5][6][7][8][9][10]. Hylick et al. explain that the mechanical parts incur large power overheads, especially when the HDD starts to run the spindle and head [4]. Zedlewski et al. present a performance and power model of HDDs based on a disk simulation model, DiskSim [5]. Their HDD power model considers the entire HDD as a single entity and has four active-mode power states (seeking, rotation, reading, and writing) and two idle-mode ones. IBM reports that fine-grained power states enable greater energy reduction than conventional coarse-grained ones, because short idle periods can be exploited to enter low power states more frequently and stay there longer [6]. Lu et al. compare several DPM policies for HDDs [7]. Douglis et al. offer a DPM policy where the time-out (if there is no new access during the time-out, a low power state is entered) is determined adaptively based on the accuracy of previous time-out predictions [8]. Helmbold et al. present a machine learning-based disk spin-down method [9]. Bisson et al. propose an algorithm that adaptively calculates the time-out for disk spin-down, utilizing multiple time-out parameters and considering the spin-up latency cost [10].

Regarding SSDs, a performance model was presented only recently in [11], and there is little work on power characterization and modeling for SSDs. In terms of power optimization in Flash-based storage, approaches fall into two categories depending on whether the optimization target is active-mode or idle-mode power consumption. Joo et al. present a low power coding scheme for MLC (multi-level cell) Flash memory, which has value-dependent power characteristics (e.g., in the case of 2-bit MLC, coding the two-bit data values 00 and 01 consumes different amounts of power) [12]. Recently, a low power
Power Modeling of SSD for Dynamic Power Management Policy Design
27
solution was presented for 3D die stacking of Flash memory, where multiple Flash dies share a common charge pump circuit [13]. Regarding idle-mode power reduction, commercial SSD products [14] utilize a fixed time-out and DIPM (device initiated power management): when the SSD detects an idle period, it asks the host to grant a transition to the low power state.
3 Preliminary: Flash Memory Operation and SSD Architecture

Typically, a Flash memory device contains, in a single package, multiple silicon Flash memory dies in a 3D die stack [13]. The Flash memory dies share the I/O signals of the device in a time-multiplexed way. We call the I/O signals of the package a channel and each memory die a way. A single Flash memory device can support a data throughput of up to data width * I/O frequency, e.g., 33 MBps at 33 MHz. In order to support higher bandwidth, we need to access Flash memory dies (ways) and devices (channels) in parallel. Fig. 1 shows an example SSD architecture consisting of two channels (i.e., two Flash devices) and four ways (i.e., four Flash memory dies on a single device) per channel.

The controller takes commands from the host (e.g., a smartphone or notebook CPU) and performs its Flash Translation Layer (FTL) algorithm to find the physical page address(es) to access. Then, it accesses the corresponding channels and ways, if needed, in parallel. In terms of available parallelism, each way can run in parallel when performing internal operations, e.g., internal read and program operations that transfer data between the internal memory cell array and its I/O buffer. However, the controller can access, at a time, only one way on each channel, i.e., the I/O signals of the package, in order to transfer data between the I/O buffer of the Flash memory die and the controller. Thus, the peak throughput is determined by the number of channels * I/O frequency.

One salient characteristic of Flash memory is that no update is allowed on already written data. When a memory cell needs an update, it first needs to be erased before new data is written to it. We call such a constraint "erase before write". In order to overcome the low performance due to "erase before write", log buffers (often called update blocks) are utilized, and new write data is written to the log buffers.
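The peak-throughput arithmetic above can be sketched in a few lines of Python. This is only an illustration: the 8-bit data width and 33 MHz figure follow the example in the text, and the function names are ours, not from the paper.

```python
def device_throughput_mbps(data_width_bits: int, io_freq_mhz: float) -> float:
    """Throughput of a single Flash device: data width * I/O frequency.
    With an 8-bit bus, 1 byte moves per I/O cycle, so MHz maps to MB/s."""
    return (data_width_bits / 8) * io_freq_mhz

def peak_ssd_throughput_mbps(num_channels: int,
                             data_width_bits: int,
                             io_freq_mhz: float) -> float:
    """Only one way per channel can drive the shared I/O signals at a time,
    so peak throughput scales with the number of channels, not ways."""
    return num_channels * device_throughput_mbps(data_width_bits, io_freq_mhz)

# Example: the two-channel architecture of Fig. 1 at 33 MHz, 8-bit width
print(peak_ssd_throughput_mbps(2, 8, 33.0))  # 66.0 MB/s
```

Note that adding more ways per channel improves the overlap of internal operations (program, erase) but does not raise this peak I/O figure, which matches the channel-level conflict model used later in Section 4.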
The Flash translation layer (FTL) on the SSD controller maintains the address mapping between logical data and physical data [15][16][17]. In reality, the controller, or more specifically the FTL, determines the performance and power consumption, especially of random reads/writes. Thus, the controller overhead needs to be taken into account in the power estimation of an SSD.
(We use the two term pairs, channel / I/O signals of the device and way / die, interchangeably throughout this paper.)
4 SSD Power Estimation

The power estimation requires as input:

- Performance and power measurement data: read/write latency for each of the read/write data sizes of 1/2/4/8/16/32/64/128/256 sectors, power consumption of sequential reads/writes, power consumption per power state (idle, partial, and slumber), and power state transition delay values;
- Information on the SSD architecture: # channels, ways/channel, and tR/tRC/tPROG/tERASE; and
- An SSD access trace obtained from a real execution on the host system.
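As a rough illustration, the inputs listed above could be grouped into a single configuration record along the following lines. This is a sketch with field names we invented; the paper does not prescribe such a structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class SSDPowerModelInputs:
    """Inputs to the trace-based power estimation (illustrative layout)."""
    # measured latency (seconds) per (operation, size-in-sectors), sizes 1..256
    latency: Dict[Tuple[str, int], float]
    # measured power (W) of sequential reads/writes
    p_sequential: Dict[str, float]            # {"read": ..., "write": ...}
    # measured power (W) per power state
    p_state: Dict[str, float]                 # {"idle": ..., "partial": ..., "slumber": ...}
    # power state transition (wakeup) delays in seconds
    state_transition_delay: Dict[str, float]
    # SSD architecture parameters
    num_channels: int = 8
    ways_per_channel: int = 8
    timing: Dict[str, float] = field(default_factory=dict)  # tR, tRC, tPROG, tERASE
```

The SSD access trace itself (a list of timestamped read/write commands) would be supplied separately as the workload to replay.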
We perform a trace-based simulation of the entire SSD (controller as well as Flash memories), considering critical architectural resource conflicts at the channel level. Then, we obtain a power profile over time as well as an output execution trace (e.g., the status of each channel/way and the latency of each SSD access). The trace-based simulation also allows for the simulation of a DPM policy, where power state transitions are performed based on the given DPM policy.

4.1 Performance and Power Modeling

Resource conflict modeling is critical in performance modeling. Given the performance data per SSD access size and the information on the SSD architecture, we model the critical resource conflict at the channel level. To do that, we decompose the measured latency into two parts in terms of resource parallelism: the latency of Flash I/O and that of the controller and Flash internal operation. We model the channel-level resource conflict with the Flash I/O latency, since only one Flash I/O operation can be active at a time on each channel. However, the controller and Flash internal operations (read from cell array to I/O buffer, program, and erase) can be performed in parallel. Fig. 2 (a) illustrates the decomposition of latency for a single one-sector SSD write operation (i.e., a SATA write command of one sector size) in the case of the SSD architecture in Fig. 1. At time t0, the controller transfers, via the corresponding channel, one sector of data to the I/O buffer of the target Flash die. After the I/O operation is finished at time t1, we model that the controller and Flash program operations take the remaining portion of the latency, from time t1 to t2.

For the power modeling of active-mode operation, we decompose the power consumption into two parts: a baseline part and an access-dependent part. The baseline part corresponds to the power consumption of the idle state, when there is no SSD access in progress while the SSD controller is turned on.
We measure the power consumption of the idle state and use it as the baseline part. The access-dependent part is obtained by subtracting the baseline from the measured power consumption of sequential reads/writes. The access-dependent part is further decomposed into the power consumption of per-way operations. Fig. 2 (b) illustrates the decomposition, showing the power profile for the case of Fig. 2 (a). The baseline, i.e., the power consumption of the idle state, is consumed over the entire period in the figure (to be specific, until a transition to a low power state is made). The access-dependent part for a single write operation (its derivation will be explained later in this section) is added to the total power between time
Fig. 2. Performance and power modeling ((b) power profile of a single write, with the measured level Psequential_write; (c) trace-based simulation of sequential writes; (d) power profile of sequential writes)
t0 and t2, during which the write operation, including Flash I/O, program, and controller execution, is executed. Due to limitations in measuring power consumption, we do not make a further decomposition to separately handle each of Flash I/O, read, program, erase, and controller execution. We expect that more detailed power measurement will enable such a fine-grained decomposition and thus more accurate power estimation, which is left for future work.

Figs. 2 (c) and (d) illustrate how we derive the power consumption of per-way operations from the access-dependent part of sequential reads/writes. Fig. 2 (c) shows the result of trace-based simulation for the case of 8-page sequential writes to the SSD of Fig. 1. At times t0, t1, t2, and t3, the controller starts to access one Flash die on each channel to transfer a page of data from the controller to the I/O buffer of the Flash die. Then, it initiates a program operation in the Flash die. After the latency of program and controller, the next Flash I/O operations start at times t5, t6, t7, and t8, respectively. Fig. 2 (d) shows the corresponding power profile. First, the baseline portion covers the entire period. Then, we add the contribution of each Flash die to the total power, as the figure shows. Thus, we see the peak plateau from t3 to t10, when all eight Flash dies and the controller are active. The measured power consumption of sequential writes corresponds to the power consumption at the plateau. Thus, we obtain the per-way power consumption of a write operation (Pper_way_write) as follows:

Pper_way_write = (Psequential_write - Pidle) / (# active Flash dies at the plateau)

The per-way power consumption of a read operation is calculated in the same way.

4.2 Trace-Based Simulation of Performance, Power and DPM Policy

The trace-based simulation covers the DPM policy as well as performance and power consumption. Fig. 3 (a) illustrates how a given DPM policy is simulated in the trace-based simulation.
In the figure, we assume that a time-out (TO)-based DPM policy is simulated. Thus, the SSD needs to enter a low power state after an idle period of TO
since the completion of the previous access. In the figure, at time t13, an idle period starts and a TO timer starts to count down. At the same time, the power consumption drops to the level of the idle-state power consumption. At t14, the TO timer expires and a low power state is entered. The figure shows that the total power consumption drops again, down to the level of the entered low power state. At t15, a read command for 8 pages arrives. However, since the SSD is in a low power state, it takes a wakeup time, Twakeup, to make a state transition to the active state. Thus, the read operations start at time t16, after the wakeup delay. Fig. 3 (b) shows the current profile obtained in the trace-based simulation.

Fig. 4 shows the pseudo code of the trace-based simulation. The simulation is event-driven and handles three types of events: host command arrival (e.g., a SATA read/write command), start/end of Flash operations (I/O operations, and read/program/erase operations), and power state transitions (e.g., when a time-out counter expires). The simulation advances the simulation time, Tnow, to the time point when the next event occurs (line 2 in the figure). At a time point where there is any event (line 3), we select the event in the order end → start → state transition, where 'end' and 'start' represent events for the end and start of a Flash operation, respectively (line 4). If the selected event is a host command, then we run the FTL algorithm to find the corresponding physical page addresses (PPAs) (line 6). Note that, if there is no available log buffer space, the FTL can invoke garbage collection, which schedules data move and erase operations on the corresponding channels and ways; the garbage collection method is specific to the FTL algorithm. Note also that the power consumption and runtime of the FTL algorithm are included in the access-dependent part of the per-way power consumption and the controller latency.
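The per-way power derivation from Section 4.1 amounts to a one-line computation. A sketch follows; the numeric values are illustrative placeholders, not measurements from the paper.

```python
def per_way_power(p_sequential: float, p_idle: float,
                  active_dies_at_plateau: int) -> float:
    """Per-way (per-die) power of a read or write operation, derived from the
    measured sequential-access power and the idle-state baseline:
    P_per_way = (P_sequential - P_idle) / (# active dies at the plateau)."""
    if active_dies_at_plateau <= 0:
        raise ValueError("at least one die must be active at the plateau")
    return (p_sequential - p_idle) / active_dies_at_plateau

# Illustrative numbers: 2.1 W during sequential writes, 0.5 W idle baseline,
# 8 dies active at the plateau (as in the 8-page example of Fig. 2 (c)/(d)).
print(per_way_power(2.1, 0.5, 8))  # 0.2 W per die
```

During simulation, this per-die increment is added to the total power when a way's operation starts and subtracted when it ends, on top of the always-present baseline.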
If the current state is a low power state (line 8), we need a state transition to the active state. Thus, during the wakeup period, we mark the power state as an intermediate transition state (PState = Transition2Active), set the total power to the idle-state power consumption (Ptotal = Pidle), and remove, if any, a future event for transition to a low power state (lines 8-12).
A host command can create 2 to 128 event pairs for the start/end of Flash operations. If it is a read or write command for a data size less than or equal to the page size, it creates two event pairs: one pair, <start, end>, for the controller and Flash internal read or write operation, and the other pair, <start, end>, for the Flash I/O operation. The maximum of 128 event pairs is created by a host command for a 64-page (256-sector) read or write. In the trace-based simulation, the created event pairs are scheduled by a function, Schedule_events_for_Flash_operations(PPAs, Tinit), where PPAs is a list of physical page addresses obtained from the FTL algorithm run (on line 6). The function performs ASAP (as soon as possible) scheduling of the event pairs: it schedules each event pair at the earliest time point when the corresponding Flash channel (for Flash I/O operations) or way (for Flash internal operations and controller operation) becomes available. If the current state is a low power state, the scheduled time of the first event pair is adjusted to account for the wakeup delay (lines 11 and 13).

If the new event (selected on line 4) is a start event and the power state is the intermediate transition state (line 16), then the power state is set to the active state (line 17). Then, the power consumption of the newly started operation is added to the total power consumption (line 18). If the new event is an end event (line 19), then the power consumption of the just-finished operation is subtracted from the total power consumption (line 20). If there is no future event for a Flash operation, it means that there is no active Flash channel or way and an idle period starts. Thus, we can insert at this point the function of the DPM policy under development. Since we assume a simple TO-based DPM policy in Fig. 4, we schedule a TO event at Tnow + TO (line 23).
1  while (Tnow < end of simulation) {
2    Advance time Tnow to the next event
3    while (any event at time Tnow) {
4      new_event = pop(event_list(Tnow))  // pop the end-of-Flash-operation events first
5      If (new_event == host command)
6        Run FTL to find the corresponding PPAs
7        Tinit = Tnow
8        If (current status == low power state)
9          PState = Transition2Active
10         Ptotal = Pidle
11         Tinit = Tnow + Twakeup
12         Clear future events for transition to low power state
13       Schedule_events_for_Flash_operations(PPAs, Tinit)
14     Else if (new_event == start or end of Flash operation)
15       If (new_event == start of Flash operation), then
16         If (PState == Transition2Active), then
17           PState = Active
18         Add the power consumption of the newly started operation to Ptotal
19       Else  // new_event == end of Flash operation
20         Subtract the power consumption of the just finished operation from Ptotal
21         If there is no more future event for Flash operation, then
22           // Insert DPM policy here. The following is a TO-based DPM policy example
23           Schedule a TO event at Tnow + TO
24     Else  // power state transition event
25       // TO event for a power state transition in the DPM policy example
26       PState = LowPowerState
27       Ptotal = Plow_power_state
28       // If there is any lower power state, then schedule a TO event here
29   }  // end of "any event at time Tnow"
30 }
Fig. 4. Pseudo code of trace-based simulation algorithm
If the new event (selected on line 4) is a power state transition event (a TO event in the DPM policy example), then the power state is set to the low power state (line 26), and the total power consumption is set to that of the low power state (line 27). If there is any lower power state, i.e., in the case of a TO-based DPM policy with more than one low power state, then we can schedule another TO event (line 28). The entire trace-based simulation continues until all the input SSD accesses have been simulated.
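As a companion to the pseudo code in Fig. 4, a much-simplified runnable sketch of TO-based DPM over a trace is given below. It replays one access at a time (no channel/way parallelism, no intermediate transition state), and the power and timing constants in the example call are invented, so it illustrates only the energy bookkeeping, not the full simulator.

```python
def simulate_to_policy(accesses, p_idle, p_low, p_active, timeout, t_wakeup):
    """Replay a trace of (arrival_time, duration) accesses, sorted by arrival,
    under a single time-out DPM policy.

    Energy accounting per idle gap: P_idle until the TO timer expires, then the
    low-power level; waking up costs t_wakeup at P_idle and delays the access.
    Returns (total_energy_J, total_wakeup_delay_s)."""
    energy = 0.0
    wakeup_penalty = 0.0
    t = 0.0          # completion time of the previous access
    low_power = False
    for arrival, duration in accesses:
        gap = max(0.0, arrival - t)
        if gap > timeout:
            # idle at P_idle until the timer expires, then the low power state
            energy += timeout * p_idle + (gap - timeout) * p_low
            low_power = True
        else:
            energy += gap * p_idle
        start = max(arrival, t)
        if low_power:
            # wake up before servicing the access
            energy += t_wakeup * p_idle
            wakeup_penalty += t_wakeup
            start += t_wakeup
            low_power = False
        energy += duration * p_active
        t = start + duration
    return energy, wakeup_penalty

# Two accesses 1 s apart; the 0.9 s gap exceeds TO=0.2 s, so one wakeup occurs.
print(simulate_to_policy([(0.0, 0.1), (1.0, 0.1)],
                         p_idle=0.5, p_low=0.05, p_active=1.0,
                         timeout=0.2, t_wakeup=0.01))
```

A short TO saves idle energy in long gaps but accumulates wakeup delay, which is exactly the trade-off swept in the experiments of Section 5.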
5 Experiments

We implemented the power estimator in Matlab. For the experiments, we used a Samsung SSD (2.5", 128 GB, SATA2) [18]. As the input performance and power consumption data, we used measurement data obtained from real usage of the SSD on a notebook PC running Windows Vista. For performance, we used the measured latency of read/write commands for data sizes of 1, 2, 4, 8, 16, 32, 64, 128, and 256 sectors, respectively. We also used the measured power consumption of sequential reads/writes and that of the low power states (power consumption was measured for the idle state and two low power states called partial and slumber). We collected the input traces of SSD accesses from the notebook PC by running three scenarios of MobileMark 2007: Reader, Productivity, and DVD [19]. We also used PCMark05 to collect an SSD access trace as a heavy SSD usage case [20].

The accuracy of performance/power estimation was evaluated by comparing the estimation results with the corresponding measurement data. The comparison showed that the trace-based simulation gives the same estimation results as the measurement data in both power consumption (of sequential reads/writes) and latency (of all the read/write data sizes).

We applied the trace-based simulation to a time-out (TO)-based DPM policy design for an SSD with two low power states (partial and slumber). TO-based DPM is used in HDD [6] and SSD [14] products. DPM in SSDs differs from that in HDDs, since DPM in SSDs can exploit short idle periods (much shorter than a second) which could not be utilized in HDDs due to the high wakeup delay of the mechanical parts (in seconds). Thus, DPM in SSDs can give a greater reduction in energy consumption by entering low power states more frequently. A TO-based DPM policy requires selecting a suitable TO which gives the minimum energy consumption over various scenarios. Fig. 5 shows performance estimation results obtained by sweeping the TO value for each of the three MobileMark scenarios.
In the TO sweep, for simplicity, we set the two TO parameters (one for the transition from the active state to the first low power state, partial, and the other for the transition from partial to the second low power state, slumber) to the same value. A single run of trace-based simulation takes 1~20 minutes, depending on the number of accesses and the TO values, which is 6~100+ times faster than real SSD runs. (We assumed MLC, 4 KB/page, 33 MHz I/O frequency, 8 channels, and 8 ways/channel based on the peak performance and capacity. We expect that faster trace-based simulation can be achieved by implementing the algorithm in C/C++ rather than Matlab.)
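The TO sweep can be mimicked with a toy model that charges the idle-state power until the timer expires, the low-power level afterwards, and a fixed wakeup cost per expired timer. All constants and gap lengths below are invented for illustration; the real sweep replays full MobileMark/PCMark traces.

```python
# Illustrative constants: idle power (W), low-power state power (W), wakeup delay (s)
P_IDLE, P_LOW, T_WAKEUP = 0.5, 0.05, 0.01

def idle_energy(gap, to):
    """Energy (J) and wakeup delay (s) for one idle gap under time-out `to`."""
    if gap <= to:
        return gap * P_IDLE, 0.0            # timer never expires: no wakeup
    low = gap - to
    return to * P_IDLE + low * P_LOW + T_WAKEUP * P_IDLE, T_WAKEUP

def sweep(gaps, candidates):
    """For each candidate TO, total idle energy and accumulated wakeup delay."""
    results = {}
    for to in candidates:
        e = d = 0.0
        for g in gaps:
            eg, dg = idle_energy(g, to)
            e += eg
            d += dg
        results[to] = (round(e, 6), round(d, 6))
    return results

gaps = [0.002, 0.5, 0.03, 2.0]              # a mix of short and long idle periods
print(sweep(gaps, [0.001, 0.01, 0.1, 1.0]))
```

Even this toy reproduces the qualitative trend of Fig. 5: small TO values minimize energy but maximize accumulated wakeup delay, and vice versa.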
Fig. 5. MobileMark 2007 (a)~(c) and PCMark05 results (d)
Fig. 5 shows that there can be a trade-off between energy reduction and performance drop. The general trend is that as TO increases, energy consumption increases, since fewer idle periods are utilized for energy reduction, while the performance penalty (due to accumulated wakeup delay) decreases, since there are fewer wakeups. As shown in the figure, among the MobileMark scenarios, the sensitivity of the performance drop to TO is most significant in the Productivity scenario. This is because the Productivity scenario has 5.34 and 6.26 times the SSD accesses of the DVD and Reader scenarios, respectively. However, the absolute level of performance drop is not significant in the Reader and DVD scenarios and is moderate in the Productivity scenario. This is because the MobileMark runtime is dominated by idle periods, which occupy about 95% of the total runtime. Thus, the performance impact of the DPM policy is not easily visible with the MobileMark traces.

However, users may experience performance loss due to aggressive DPM policies (e.g., a short TO such as 1 ms) when the SSD is heavily accessed. PCMark05 represents such a scenario, where the host runs continuously, accessing the SSD more frequently than MobileMark. Fig. 5 (d) shows the result for PCMark05, which incurs up to a 16.9% performance drop in the case of an aggressive DPM policy (TO = 1 ms). This is mainly because PCMark05 has 9.7 times more SSD accesses (per minute) than the Productivity scenario. Considering the results in Fig. 5, there is a trade-off between energy reduction and performance drop which designers need to investigate in low power SSD design. As a final remark, Fig. 5 shows that there is still a large gap between the result of the Oracle DPM (where we obtain maximum energy reduction without performance drop) and that of the optimal single-TO case. Our power estimation method will contribute to the design of sophisticated DPM policies that exploit this trade-off while closing the gap.
6 Conclusion

In this paper, we presented a power estimation method for SSDs in embedded systems. It takes as input an SSD access trace and measured performance and power consumption data of SSD accesses, and gives as output the power profile, considering the DPM policy under development. The presented method achieves accuracy in power consumption, being based on real measurement data, and fast estimation, by applying a trace-based simulation approach. We also presented a case study of applying the method to designing a DPM policy for an SSD. As future work, we will perform an extensive analysis of the accuracy of the power estimation, with comparisons against real measurements in various scenarios.
Acknowledgement This work was supported in part by Samsung Electronics.
References
1. Kim, B.: Design Space Surrounding Flash Memory. In: International Workshop on Software Support for Portable Storage (IWSSPS) (2008)
2. Creasey, J.: Hybrid Hard Drives with Non-Volatile Flash and Longhorn. In: Windows Hardware Engineering Conference (WinHEC), Microsoft (2005)
3. Communications with Samsung engineers
4. Hylick, A., Rice, A., Jones, B., Sohan, R.: Hard Drive Power Consumption Uncovered. ACM SIGMETRICS Performance Evaluation Review 35(3), 54–55 (2007)
5. Zedlewski, J., Sobti, S., Garg, N., Zheng, F., Krishnamurthy, A., Wang, R.: Modeling Hard-Disk Power Consumption. In: The USENIX Conference on File and Storage Technologies (FAST), pp. 217–230. USENIX Association (2003)
6. IBM: Adaptive Power Management for Mobile Hard Drives. IBM (1999), http://www.almaden.ibm.com/almaden/mobile_hard_drives.html
7. Lu, Y., De Micheli, G.: Comparing System-Level Power Management Policies. IEEE Design & Test of Computers 18(2), 10–19 (2001)
8. Douglis, F., Krishnam, P., Bershad, B.: Adaptive Disk Spin-down Policies for Mobile Computers. In: 2nd Symposium on Mobile and Location-Independent Computing, pp. 121–137. USENIX Association (1995)
9. Helmbold, D., Long, D., Sconyers, T., Sherrod, B.: Adaptive Disk Spin-Down for Mobile Computers. Mobile Networks and Applications 5(4), 285–297 (2000)
10. Bisson, T., Brandt, S.: Adaptive Disk Spin-Down Algorithms in Practice. In: The USENIX Conference on File and Storage Technologies (FAST). USENIX Association (2004)
11. Dirik, C., Jacob, B.: The Performance of PC Solid-State Disks (SSDs) as a Function of Bandwidth, Concurrency, Device Architecture, and System Organization. In: International Symposium on Computer Architecture, pp. 279–289. ACM, New York (2009)
12. Joo, Y., Cho, Y., Shin, D., Chang, N.: Energy-Aware Data Compression for Multi-Level Cell (MLC) Flash Memory. In: Design Automation Conference, pp. 716–719. ACM, New York (2007)
13. Ishida, K., Yasufuku, T., Miyamoto, S., Nakai, H., Takamiya, M., Sakurai, T., Takeuchi, K.: A 1.8V 30nJ Adaptive Program-Voltage (20V) Generator for 3D-Integrated NAND Flash SSD. In: International Solid-State Circuits Conference, pp. 238–239. IEEE, Los Alamitos (2009)
14. Intel: X25-M and X18-M Mainstream SATA Solid-State Drives. Intel (2009), http://www.intel.com/design/flash/nand/mainstream/index.htm
15. Lee, S., Park, D., Jung, T., Lee, D., Park, S., Song, H.: A Log Buffer-Based Flash Translation Layer Using Fully-Associative Sector Translation. ACM Transactions on Embedded Computing Systems (TECS) 6(3) (2007)
16. Kang, J., Cho, H., Kim, J., Lee, J.: A Superblock-Based Flash Translation Layer for NAND Flash Memory. In: The 6th ACM & IEEE International Conference on Embedded Software (EMSOFT), pp. 161–170. ACM, New York (2006)
17. Lee, S., Shin, D., Kim, Y., Kim, J.: LAST: Locality-Aware Sector Translation for NAND Flash Memory-Based Storage Systems. ACM SIGOPS Operating Systems Review 42(6) (2008)
18. Samsung: Samsung SSD. Samsung (2009), http://www.samsung.com/global/business/semiconductor/products/flash/ssd/2008/product/pc.html
19. Business Applications Performance Corporation: MobileMark 2007. BAPCo (2007), http://www.bapco.com/products/mobilemark2007/
20. Futuremark: PCMark05. Futuremark (2009), http://www.futuremark.com/products/pcmark05/
Optimizing Mobile Application Performance with Model–Driven Engineering Chris Thompson, Jules White, Brian Dougherty, and Douglas C. Schmidt Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN USA {jules,briand,schmidt}@dre.vanderbilt.edu, [email protected]
Abstract. Future embedded and ubiquitous computing systems will operate continuously on mobile devices, such as smartphones, with limited processing capabilities, memory, and power. A critical aspect of developing future applications for mobile devices will be ensuring that the application provides sufficient performance while maximizing battery life. Determining how a software architecture will affect power consumption is hard because the impact of software design on power consumption is not well understood. Typically, the power consumption of a mobile software architecture can only be determined after the architecture is implemented, which is late in the development cycle when design changes are costly. Model-driven Engineering (MDE) is a promising solution to this problem. In an MDE process, a model of the software architecture can be built and analyzed early in the design cycle to identify key characteristics, such as power consumption. This paper describes current research in developing an MDE tool for modeling mobile software architectures and using them to generate synthetic emulation code to estimate power consumption properties. The paper provides the following contributions to the study of mobile software development: (1) it shows how models of a mobile software architecture can be built, (2) it describes how instrumented emulation code can be generated to run on the target mobile device, and (3) it discusses how this emulation code can be used to glean important estimates of software power consumption and performance.
those made during earlier software lifecycle phases [12], e.g., during architectural design and analysis.

Conventional techniques for developing mobile device software are not well suited to identifying performance and power consumption trade-offs during earlier phases of the software lifecycle. These limitations stem largely from the difficulty of comparing the power consumption of one architectural design against another without implementing and testing each on the target device. Moreover, for each function an application performs, there are often multiple possible designs for accomplishing the same task, each differing in terms of operational speed, battery consumption, and accuracy. Even though these design variations can significantly impact device performance, there are too many permutations to implement and test each.

For example, if a mobile application communicates with a server, it can do so via several protocols, such as HTTP, HTTPS, or other socket connections. Developers can also elect to have the application and/or mobile device infrastructure submit data immediately or in a batch at periodic intervals. Each design option can result in a different power consumption profile [13]. If the developer elects to use HTTPS over HTTP, the developer gains additional security; the overhead associated with key exchange and the encryption/decryption process, however, incurs additional processing time and increases the amount of information that must be transmitted over the network. Both of these require more power and time than standard HTTP would. The combination of these architectural options results in too many possible variations to implement and test each one within a reasonable budget and production cycle: a given application could have hundreds or thousands of viable configurations that satisfy the stated requirements.

Solution approach → Emulation of application behavior through model-driven testing and auto-generated code.
Model-driven engineering (MDE) [15] provides a promising solution to the challenges described above. MDE relies on modeling languages, such as domain-specific modeling languages (DSMLs) [16], to visually represent various aspects of application and system design. These models can then be utilized for code generation and performance analysis. By creating a model of candidate solution architectures early in the design phase, instrumented architectural emulation code can be generated and then run on actual mobile devices. This MDE-based approach allows developers to quickly emulate a multitude of possible configurations and provides them with actual device performance data without investing the time and effort of manually writing application code. The generated code emulates the modeled architecture by consuming sensor data, computational cycles, and memory as specified in the model, as well as transmitting/receiving faux data over the network. Since wireless transmissions consume most of the power on mobile devices [3] and network interaction is a key performance bottleneck, large-scale power consumption and performance trends can be gleaned by executing the emulation code. Moreover, as the real implementation is built, the actual application logic can be used to replace the faux resource-consuming code blocks to refine the accuracy of the model. This MDE-based solution has been utilized previously to eliminate some inherent flaws of serialized phasing in layered systems, specifically as they apply to system QoS, and to identify design flaws early in the
software production life-cycle [9]. Some prior work [8] also employs model-driven analysis to conduct what-if analysis on potential application architectures.

By utilizing MDE-based analysis, mobile software developers can quantitatively evaluate key performance and power consumption characteristics earlier in the software lifecycle (e.g., at design time) rather than later (e.g., during and after implementation), thereby significantly reducing the software refactoring costs caused by design flaws. MDE provides this by not only allowing developers to generate emulation code, but also by giving them a high-level understanding of their application that is easy to modify on the fly. Changes can be made at design time by simply moving model elements around rather than rewriting code. Moreover, since emulation code is automatically generated from the model, developers can quickly understand key performance and power consumption characteristics of potential solution architectures without investing the time and effort to implement them.

This paper describes emerging R&D efforts that seek to provide developers of mobile applications with an MDE-based approach to optimizing application resource consumption across a multitude of platforms at design time. This paper also describes a methodology for increasing battery longevity in mobile devices through application-layer modifications. By focusing on the application layer, developers can still reap the benefits of advanced SDKs and compilers that shield the developer from hardware-centric decisions.

Paper organization.
The remainder of this paper is organized as follows: Section 2 presents a sample mobile application running on Google’s Android platform and introduces several challenges associated with resource consumption optimization and mobile application development; Section 3 discusses our current research work on developing an MDE tool that allows developers to predict software architecture performance and power consumption properties earlier in the development process; finally, Section 4 presents concluding remarks and lessons learned.
2 Motivating Example

This section presents a motivating mobile application running on Google’s Android platform and describes several challenges associated with resource consumption optimization and mobile application development.

2.1 Overview of Wreck Watch

Managing system resources properly can significantly affect device performance and battery life. For instance, reducing CPU instructions not only speeds performance but also reduces the time the process is in a non-idle state, thereby reducing power consumption; reducing network traffic also speeds performance and reduces the power supplied to the radio. To demonstrate the importance of proper resource management and the value of model-based resource analysis, we present the following example mobile application, called Wreck Watch, shown in Figure 1.
Optimizing Mobile Application Performance with Model–Driven Engineering
Fig. 1. Wreck Watch Behavior
Wreck Watch runs on Google Android smartphones to detect car accidents (1) by analyzing data from the device’s GPS receiver and accelerometer and looking for sudden acceleration events from a high velocity that are indicative of a collision. Car accident data is then posted to an HTTP server where (2) it can be retrieved by other devices in the area to help alleviate traffic congestion, notify first responders, and (3) provide accident photos to an emergency response center. Users of Wreck Watch can also elect to have certain people contacted in the event of an accident via SMS message or a digital PBX. Figure 1 shows this behavior of Wreck Watch. Since the Wreck Watch application runs continuously in the background, it must carefully manage its power consumption. The application needs to run at all times and consume a great deal of sensor information to accurately detect wrecks. If not designed properly, therefore, these characteristics could result in a substantial decrease in battery life. In our testing, for example, the Wreck Watch application was able to completely drain the device battery in less than an hour simply through its use of sensors and network connectivity. In the case of Wi-Fi, the radio represents nearly 70% of device power consumption [2] and in extreme cases can consume 100 times the power of one CPU instruction to transmit one byte of data [3]. The amount of power consumed by the network adapter is generally proportional to the amount of information transmitted [1]. The framing and overhead associated with each protocol can therefore significantly affect the power consumption of the network adapter. Prior work [5] demonstrated that significant power savings could be achieved by modifying the MAC layer to minimize collisions and maximize time spent in the idle state.
This work also recognized that network operations generally involve only the CPU and transceiver, and that by reducing client-side processing, the power consumed by network transactions could be substantially reduced. Similarly, other work [7] demonstrated that such power savings could also be achieved through transport-layer modifications.
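Returning to the detection step itself: the heuristic described earlier (a large acceleration spike while travelling at high speed) might be sketched as follows. The class name, thresholds, and method signature here are illustrative assumptions, not Wreck Watch's actual code:

```java
// Hypothetical sketch of a Wreck Watch-style detection rule: a collision is
// suspected when the device was recently travelling fast and then experiences
// a sudden, large acceleration spike. Thresholds are illustrative only.
public class WreckDetector {
    // Assumed thresholds (not from the paper): 4 g spike, 40 km/h prior speed.
    private static final double ACCEL_THRESHOLD_G = 4.0;
    private static final double SPEED_THRESHOLD_KMH = 40.0;

    /** Returns true if the readings are consistent with a collision. */
    public static boolean isWreck(double lastSpeedKmh,
                                  double ax, double ay, double az) {
        // Magnitude of the acceleration vector, in g.
        double magnitude = Math.sqrt(ax * ax + ay * ay + az * az);
        return lastSpeedKmh > SPEED_THRESHOLD_KMH
                && magnitude > ACCEL_THRESHOLD_G;
    }

    public static void main(String[] args) {
        System.out.println(isWreck(65.0, 5.2, 1.1, 0.3)); // spike at speed: true
        System.out.println(isWreck(5.0, 5.2, 1.1, 0.3));  // parked bump: false
    }
}
```

On Android, the acceleration vector would come from a `SensorEventListener` and the prior speed from a GPS `LocationListener`; real thresholds would need tuning against crash data.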
Although MAC and transport layer modifications are typically beyond the scope of most software projects, especially mobile application development, the data transmitted on the network can be optimized so it is as lightweight as possible, thereby accomplishing, on a much smaller scale, some of the same effects. The remainder of this paper uses the Wreck Watch application to showcase key design challenges that developers face when building power-aware applications for mobile devices.

2.2 Design and Behavioral Challenges of Mobile Application Development

Despite the ease with which mobile applications can be developed via advanced SDKs (such as Google Android and Apple iPhone), developers still face many challenges related to power consumption. If developers do not fully understand the implications of their designs, they can substantially reduce device performance. Battery life represents a major metric used to compare devices and can be influenced significantly by design decisions. Designing mobile applications while remaining cognizant of battery performance presents the following challenges to developers:

Challenge 1: Accurately predicting battery consumption of arbitrary architectural decisions is hard. Each instruction executed can result in the consumption of an unknown amount of battery power. Accurately predicting the power consumed for each line of code is hard given the level of abstraction present in modern SDKs, as well as the complexity of, and numerous variations between, physical devices. Moreover, disregarding language commonalities between completely unrelated devices, mobile platforms, such as Android, are designed to operate on a plethora of hardware configurations, each of which may exhibit different power consumption characteristics.

Challenge 2: Trade-offs between performance and battery life are not readily apparent.
Although performance and power consumption are generally design trade-offs, the actual relationship between the two metrics is not readily apparent. For example, when comparing two networking protocols, plain HTTP might operate much faster, requiring only 10 ms to transmit data that SOAP requires 50 ms to transmit. At the same time, HTTP might consume 0.5 mW, while SOAP consumes 1.5 mW. Without the context of real-world performance on a physical device, it would be difficult to predict the overhead associated with SOAP. Moreover, this data may vary from one device to the next.

Challenge 3: Effects of transmission medium on power consumed are largely device, application, and environment specific. Wireless radios consume a substantial amount of device power relative to other mobile-device components [6], where the power consumed is directly proportional to the amount of information transmitted [1]. Each radio also provides differing data rates, as well as power consumption characteristics. Depending on the application, developers must choose the connection medium best suited to application requirements, such as medium availability and transmission rate. The differences between transmission media are generally subtle and may even depend on environmental factors [10], such as network congestion, that are impossible to accurately predict. To deterministically and accurately quantify performance, therefore, testing must be performed in environmentally accurate situations.
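The hypothetical HTTP/SOAP figures quoted in Challenge 2 can be turned into a concrete energy comparison, since energy per transmission is simply power draw multiplied by transmission time. The numbers below are the illustrative ones from the text, not measurements:

```java
// Worked example of the Challenge 2 trade-off, using the hypothetical
// figures from the text: energy per transmission = power draw x duration.
public class ProtocolEnergy {

    /** Energy in microjoules: milliwatts x milliseconds = microjoules. */
    static double energyMicroJoules(double powerMilliWatts, double durationMillis) {
        return powerMilliWatts * durationMillis;
    }

    public static void main(String[] args) {
        double http = energyMicroJoules(0.5, 10);   // 5.0 uJ per transmission
        double soap = energyMicroJoules(1.5, 50);   // 75.0 uJ per transmission
        // SOAP costs 15x the energy of plain HTTP for the same payload,
        // even though its power draw is only 3x higher.
        System.out.println("HTTP: " + http + " uJ, SOAP: " + soap
                + " uJ, ratio: " + (soap / http));
    }
}
```

The point of the arithmetic is that neither the duration nor the power figure alone reveals the trade-off; only their product, measured on a real device, does.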
Challenge 4: It is hard to accurately predict the effects of reducing sensor data consumption rates on power utilization. To provide the most accurate readings and results, device sensors would be polled as frequently as they sample data. This method consumes the most power, however, by not only requiring that the sensor be enabled constantly, but also by increasing the amount of data the device must process. In turn, reducing the time that the sensor is active significantly reduces the effectiveness and accuracy of the readings. Determining the exact amount of power saved by a reduction in polling rate or other sensor accuracy change is difficult without profiling such a change on a device.

Challenge 5: Accurately assessing effects of different communication protocols on performance is hard without real-world analysis. Each communication protocol has a specific overhead associated with it that directly affects its overall throughput. The natural choice would be to select the protocol with the lowest overhead. While this decision yields the highest performance, it also results in a tightly coupled architecture [11] and substantially increases production time. Such a protocol would only be useful for the specific data set for which it was designed, in contrast to a standardized protocol, such as HTTP. Standardized protocols often support features that are unnecessary for many mobile applications, however, making much of the additional data required for HTTP transactions pure overhead. It is challenging to predict how much of a trade-off in performance is required to select the more extensible protocol because the power cost of such protocols cannot be known without profiling them in a real-world usage scenario.

Discussions on performance optimization have often focused on hardware- or firmware-level changes and ignored potential application-layer enhancements [3,5,6].
Interestingly, this corresponds to the level of abstraction present in each layer: device drivers and hardware have little or no abstraction, while software applications are often more thoroughly abstracted. It is this level of abstraction, however, that makes such optimizations challenging, because the developer often has little or no control over the final machine code. Application code thus cannot be benchmarked until it has been fully developed and compiled. Moreover, problems identified after the code is developed are substantially more costly to correct than those that can be identified at design time. Optimizing the performance of an application before any code is written is therefore of great value. Moreover, because power consumption is generally hardware-specific [1], such optimizations result in a tightly coupled architecture that requires the developer to rewrite code to benchmark other configurations.
3 Model-Based Testing and Performance Analysis

This section describes our current work in developing a modeling language extension to the Generic Eclipse Modeling System (GEMS) (www.eclipse.org/gmt/gems) [17], called the System Power Optimization Tool (SPOT), for optimizing performance and power consumption of mobile applications at design time. GEMS is an MDE tool for building Domain Specific Modeling Languages (DSMLs) for the Eclipse platform. The goal of SPOT is to allow developers to rapidly model potential application architectures and
obtain feedback on the performance and power consumption of the architecture without manual implementation. The performance data is produced by generating instrumented architectural emulation code from the architectural model that is then run on the target hardware. After execution, cumulative results can be downloaded from the target device for analysis. This section describes the modeling language, emulation code generation, and performance measurement infrastructure that we are developing to address the five challenges described in Section 2.2.

3.1 Mobile Application Architecture Modeling and Power Consumption Estimation with SPOT

To accurately model mobile device applications, SPOT provides a domain-specific modeling language (DSML) with components that (1) represent key, resource-consuming aspects of a mobile application’s architecture and (2) allow developers to specify visual diagrams of a mobile application architecture, as shown in the workflow diagram in Figure 2. SPOT’s DSML, called the System Power Optimization Modeling Language (SPOML), allows developers to build architectural specifications from the following types of model elements:

• CPU consumers, which represent computationally intense code segments, such as location-based notifications that require distance calculations on hundreds of points.
• Memory consumers, which represent sections of application code that will incur heavy memory operations, reducing performance and increasing power consumption, e.g., displaying an image, stored on disk, on the screen.
• Sensor data consumers, which poll device sensors at user-defined intervals.
• Network consumers, which periodically utilize network resources, emulating actual application traffic.
• Screen drawing agents, which interact with device graphics libraries, such as OpenGL, to consume power by rendering images to the display.
The sensor and network data consumers operate independently of application logic and simply present an interface through which their data can be accessed. The CPU consumer, however, needs to incorporate application-specific logic, as well as logic from other aspects of the application. The CPU consumer module also allows developers to integrate actual application logic as it becomes available, to replace emulation code that is generated by SPOML. To provide the software developer with the most flexibility and extensibility possible, SPOML provides them with many key power-consumptive architectural options that would be present if they were actually writing device code. For example, if the device presents 10 possible options for granularity of GPS readings, SPOML provides all 10 possibilities via visual elements, such as drop-down menus and check boxes. SPOML also provides constraint checking that warns developers at design time if certain configuration options are unlikely to work together. Ultimately, SPOT provides developers with the ability to modify design characteristics rapidly and model their system without any application-specific logic, as well as provides them with a means to incorporate actual application code.
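As one example of what a generated emulation building block might look like, a sensor data consumer could be sketched as below. This is a hand-written illustration under the assumption that SPOT emits something functionally similar; the class and method names are not SPOT's actual generated code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a generated sensor data consumer: it polls a faux
// sensor at the interval specified in the model and exposes the last reading
// through an accessor, mirroring the "interface through which their data can
// be accessed" described in Section 3.1. Names are illustrative assumptions.
public class SensorDataConsumer implements Runnable {
    private final long pollIntervalMillis;          // taken from the model
    private final AtomicLong lastReading = new AtomicLong();
    private volatile boolean running = true;

    public SensorDataConsumer(long pollIntervalMillis) {
        this.pollIntervalMillis = pollIntervalMillis;
    }

    /** Interface through which other emulation components read sensor data. */
    public long getLastReading() { return lastReading.get(); }

    public void stop() { running = false; }

    @Override
    public void run() {
        while (running) {
            // Faux sensor read: consumes a poll cycle without real hardware.
            lastReading.set(System.nanoTime());
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Varying `pollIntervalMillis` in the model and re-running the emulation is exactly the kind of experiment Challenge 4 calls for.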
Fig. 2. SPOT Analysis Cycle
3.2 Architectural Emulation Code Generation

Due to the difficulty of estimating power consumption for an arbitrary device and software architecture, it is essential to evaluate application performance on the actual physical hardware in production conditions. To accomplish this task, SPOT can automatically generate instrumented code to perform the functions outlined by the architecture modeled in SPOML. This code generation is done by traversing the in-memory object graph of the model and outputting optimized code to perform the resource-intensive operations specified in the model. The architectural emulation code is constructed from several basic building blocks, as described above. The sensor consumers remain largely the same between applications and require little input from the user developing the model. The only variable in their construction is the rate at which they poll the sensor. They present an interface through which their data can be accessed. The network consumer itself consists of several modules: a protocol, a transmission scheme, and a payload interface. The payload interface defines methods that allow other components of the application to utilize the network connection and, for the purposes of emulation and analysis, this interface also helps define the structure of the data to transmit. The protocol module allows the developer to select from a set of predefined protocols (e.g., HTTP or SOAP) or create a custom protocol with a small amount of code. The transmission scheme defines a set of behaviors for how to
transmit data back to the server, which allows developers to specify whether the application should transmit as soon as data is available, wait until a certain amount of data is available, or even wait until a certain connection medium is available (such as Wi-Fi or EDGE). Finally, the screen rendering agent allows users to specify the interval at which the screen is refreshed or invalidated for a given view. Each module described above relies almost entirely on prewritten and optimized code. Of greater complexity for users are the CPU and memory consumers. Users may elect to utilize prewritten code that closely resembles the functionality they wish to provide. Alternatively, they can write their own code to use in these modules to profile their architecture more accurately. This iterative approach allows developers to quickly model their architecture without writing detailed application logic and then, as this code becomes available, refine their analysis to better represent the performance and behavior of the ultimate system.

3.3 Performance and Resource Consumption Management

When generating emulation code, SPOT also generates instrumentation code to record device performance and track power consumption. This code writes these metrics to a file on the device that can later be downloaded to a host machine for analysis after testing. This approach allows developers to quantitatively compare metrics such as application responsiveness (by way of processor idle time, etc.), network utilization and throughput, and battery longevity. These comparisons provide the developer with a means to quickly and accurately design a system that minimizes power consumption without sacrificing performance. In some instances, this analysis could even highlight simple changes, such as reducing the size of XML tags to reduce the overhead associated with downloading information from a server.
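The instrumentation side could be as simple as the sketch below: a recorder that appends timestamped metric samples to a device-local file for later download and analysis. The CSV layout and metric names are assumptions for illustration, not SPOT's real on-device format:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Locale;

// Hypothetical sketch of the instrumentation code generated alongside the
// emulation code: it appends timestamped samples to a file on the device
// that is later pulled to a host machine for analysis, as Section 3.3
// describes. Format and names are illustrative assumptions.
public class MetricsRecorder implements AutoCloseable {
    private final PrintWriter out;

    public MetricsRecorder(String path) throws IOException {
        out = new PrintWriter(new FileWriter(path, /* append = */ true));
        out.println("timestamp_ms,metric,value");   // simple CSV header
    }

    /** Record one sample, e.g. record("battery_pct", 87.5). */
    public void record(String metric, double value) {
        out.printf(Locale.ROOT, "%d,%s,%.3f%n",
                System.currentTimeMillis(), metric, value);
    }

    @Override
    public void close() {
        out.close();
    }
}
```

A CSV-per-run layout keeps post-test analysis on the host trivial (any spreadsheet or script can diff two candidate architectures' traces).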
For each challenge presented in Section 2.2, we establish that with current methods certain characteristics of a design can only be fully understood post-implementation. Additionally, with newer platforms such as Google’s Android, the mobile device has become an embedded multi-application system. Since each device has significantly fewer resources than its tethered brethren, however, individual applications must be cognizant of their resource consumption. The value of understanding a given application’s power consumption profile is thus greatly increased. The solutions to each of these challenges lie within the same space: utilization of a model that can be used to accurately assess battery life. SPOT addresses mobile application performance analysis through the use of auto-generated code specified by a DSML, which allows users to estimate performance and power consumption early in the development process. Moreover, developers can perform continuous integration testing by replacing faux code with application logic as it is developed.
4 Concluding Remarks

The capabilities of mobile devices have increased substantially over the last several years and, with platforms such as Apple’s iPhone and Google’s Android, will no doubt continue to expand. These platforms have ushered in a new era of applications and have presented developers with a wealth of new opportunities. Unfortunately,
with these new opportunities have come new challenges that developers must overcome to make the most of these cutting-edge platforms. In particular, predicting performance characteristics of a given design is hard, especially those characteristics associated with power consumption. A promising approach to address these challenges is to enhance model-driven engineering (MDE) tools to enable developers to quickly understand the consequences of architectural decisions. These conclusions can be drawn long before implementation, significantly reducing production costs and time while substantially increasing battery longevity and overall system performance. From our experience developing SPOT, we have learned the following lessons:

• By utilizing MDE it becomes possible to quantitatively compare design decisions and deliver some level of optimization with regard to power consumption.
• Developing applications for platforms such as Android requires extensive testing, as hardware configurations can greatly influence performance.
• It is impossible to completely profile a system configuration, because ultimate device performance and power consumption depend on user interaction, network traffic, and other applications on the device.
The WreckWatch application is available under the Apache open-source license and can be downloaded at http://vuphone.googlecode.com.
References

1. Feeney, L., Nilsson, M.: Investigating the energy consumption of a wireless network interface in an ad hoc networking environment. In: IEEE INFOCOM, vol. 3, pp. 1548–1557 (2001)
2. Liu, T., Sadler, C., Zhang, P., Martonosi, M.: Implementing software on resource-constrained mobile sensors: experiences with Impala and ZebraNet. In: Proceedings of the 2nd International Conference on Mobile Systems, Applications, and Services, pp. 256–269 (2004)
3. Pering, T., Agarwal, Y., Gupta, R., Want, R.: CoolSpots: Reducing the power consumption of wireless mobile devices with multiple radio interfaces. In: Proceedings of the Annual ACM/USENIX International Conference on Mobile Systems, Applications and Services, MobiSys (2006)
4. Poole, J.: Model-driven architecture: Vision, standards and emerging technologies. In: Workshop on Metamodeling and Adaptive Object Models, ECOOP (2001)
5. Chen, J., Sivalingam, K., Agrawal, P., Kishore, S.: A comparison of MAC protocols for wireless local networks based on battery power consumption. In: IEEE INFOCOM 1998, Seventeenth Annual Joint Conference of the IEEE Computer and Communications Societies (1998)
6. Krashinsky, R., Balakrishnan, H.: Minimizing energy for wireless web access with bounded slowdown. Wireless Networks 11, 135–148 (2005)
7. Kravets, R., Krishnan, P.: Application-driven power management for mobile communication. Wireless Networks 6, 263–277 (2000)
8. Paunov, S., Hill, J., Schmidt, D., Baker, S., Slaby, J.: Domain-specific modeling languages for configuring and evaluating enterprise DRE system quality of service. In: 13th Annual IEEE International Symposium and Workshop on Engineering of Computer Based Systems, ECBS 2006 (2006)
9. Hill, J., Tambe, S., Gokhale, A.: Model-driven engineering for development-time QoS validation of component-based software systems. In: Proceedings of the International Conference on Engineering of Component Based Systems (2007)
10. Carvalho, M., Margi, C., Obraczka, K., Garcia-Luna-Aceves, J.J.: Modeling energy consumption in single-hop IEEE 802.11 ad hoc networks. In: Thirteenth International Conference on Computer Communications and Networks (ICCCN 2004), pp. 367–377 (2004)
11. Gay, D., Levis, P., Culler, D.: Software design patterns for TinyOS. ACM, New York (2007)
12. Boehm, B.: A spiral model of software development and enhancement. In: Software Engineering: Barry W. Boehm’s Lifetime Contributions to Software Development, Management, and Research, vol. 21, p. 345. Wiley-IEEE Computer Society Press (2007)
13. Tan, E., Guo, L., Chen, S., Zhang, X.: PSM-throttling: Minimizing energy consumption for bulk data communications in WLANs. In: IEEE International Conference on Network Protocols, ICNP 2007, pp. 123–132 (2007)
14. Kang, J., Park, C., Seo, S., Choi, M., Hong, J.: User-centric prediction for battery lifetime of mobile devices. In: Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management, pp. 531–534 (2008)
15. Kent, S.: Model Driven Engineering. In: Butler, M., Petre, L., Sere, K. (eds.) IFM 2002. LNCS, vol. 2335, pp. 286–298. Springer, Heidelberg (2002)
16. Lédeczi, A., Bakay, A., Maroti, M., Völgyesi, P., Nordstrom, G., Sprinkle, J., Karsai, G.: Composing domain-specific design environments. Computer, 44–51 (2001)
17. White, J., Schmidt, D.C., Mulligan, S.: The Generic Eclipse Modeling System. In: Model-Driven Development Tool Implementer’s Forum at the 45th International Conference on Objects, Models, Components and Patterns, Zurich, Switzerland (June 2007)
A Single-Path Chip-Multiprocessor System

Martin Schoeberl, Peter Puschner, and Raimund Kirner
Institute of Computer Engineering, Vienna University of Technology, Austria
[email protected], {peter,raimund}@vmars.tuwien.ac.at
Abstract. In this paper we explore the combination of a time-predictable chip-multiprocessor system with the single-path programming paradigm. Time-sliced arbitration of the main memory access provides time-predictable memory load and store instructions. Single-path programming avoids control-flow-dependent timing variations. To keep the execution time of tasks constant, even in the case of shared memory access of several processor cores, the tasks on the cores are synchronized with the time-sliced memory arbitration unit.
1 Introduction

As more and more speedup features are added to modern processors and we are moving from single-core to multi-core processor systems, the analysis of the timing of the applications running on these systems is getting increasingly complex. The timing of single tasks per se is difficult to understand and to analyze. Besides that, task timing can no longer be considered as an isolated issue in such systems, as the competition for shared resources and interferences via the state of the shared hardware lead to mutual dependencies of the progress and timing of different tasks. We are convinced that the only way of making these highly complex processing systems time-predictable is to impose some restrictions on their architecture and on the way in which the mechanisms of the architecture are used. So far we have worked along two main lines of research aiming at real-time processing systems with predictable timing: On the software side we have conceived the single-path execution strategy [1]. The single-path approach allows us to translate task code in a way that the resulting code has exactly one execution trace that all executions of the task have to follow. To this end, the single-path conversion eliminates all input-dependent control flow decisions – by applying a set of code transformations [2] and if-conversion [3] it translates all input-dependent alternatives (i.e., code with if-then-else semantics) into straight-line predicated code. Loops with input-dependent termination are converted into loops that are semantically equivalent but whose iteration count is fully determined at system construction time. Architecture-wise we have been working on time-predictable processors and chip-multiprocessor (CMP) systems. We have developed the JOP prototype of a time-predictable processor [4] and built a CMP system with a number of JOP cores [5].
In this multiprocessor system a static time-division multiple access (TDMA) arbitration scheme controls the accesses of the cores to the common memory.

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 47–57, 2009.
© IFIP International Federation for Information Processing 2009

The pre-planning of
48
M. Schoeberl, P. Puschner, and R. Kirner
memory access schedules eliminates the need for dynamic conflict resolution and guarantees the temporal isolation that is necessary to allow for an independent progression of the computations on the CMP cores. So far, we have dealt with each of the two topics in separation. This paper is the first that describes our work on combining the concepts of the single-path approach and our time-predictable CMP architecture. We thus present an execution environment that provides both temporal predictability to the highest degree and the performance benefits of parallel code execution on multiple cores. By generating deterministic single-path code, running this code on predictable processor cores, and using a rigid, pre-planned scheme to access the global memory we manage to achieve completely stable, and therefore predictable execution times for each single task in isolation as well as for entire applications consisting of multiple cooperating tasks running on different cores. To the best of our knowledge this has not been achieved for any other state-of-the-art CMP system so far.
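The two single-path transformations named in the introduction can be illustrated in source form. The sketch below is in Java for readability; the actual single-path conversion operates on compiled code using predicated machine instructions, and the 0/1 arithmetic here merely stands in for predication. All names and bounds are illustrative:

```java
// Illustrative sketch of the single-path transformations: (a) if-conversion
// of an input-dependent alternative and (b) rewriting a search loop with an
// input-dependent exit into a loop with a fixed trip count. In both methods
// every call executes the same instruction sequence, regardless of input.
public class SinglePath {

    /** (a) If-conversion: max() without an input-dependent branch. The
     *  ternary stands in for a hardware compare/set, not a jump. */
    static int maxSinglePath(int a, int b) {
        int pred = (a > b) ? 1 : 0;        // predicate as a 0/1 value
        return pred * a + (1 - pred) * b;  // both operands always used
    }

    /** (b) Loop conversion: linear search whose trip count depends only on
     *  the statically known array length, never on the data values. */
    static int indexOfSinglePath(int[] data, int key) {
        int result = -1;
        for (int i = 0; i < data.length; i++) {   // constant trip count
            // Predicate: record the index only for the first match.
            int hit = (result == -1 && data[i] == key) ? 1 : 0;
            result = hit * i + (1 - hit) * result;
        }
        return result;
    }
}
```

The price of the constant trace is that the loop always runs to its bound even when the key is found early; the payoff is an execution time that is identical for every input.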
2 The Single-Path Chip-Multiprocessor System

The main goal of our approach is to build an architecture that provides a combination of good performance and high temporal predictability. We rely on chip-multiprocessing to achieve the performance goal and on an offline-planning approach to make our system predictable. The idea of the latter is to take as many control decisions as possible before the system is actually run. This reduces the number of branching decisions that need to be taken during system operation, which, in turn, causes a reduction of the number of possible action sequences with possibly different timings that need to be considered when planning respectively evaluating the system’s timely operation.

2.1 System Overview

We consider a CMP architecture that hosts n processor cores, as shown in Figure 1. On each core the execution of simple tasks is scheduled statically as a cyclic executive. All cores’ schedulers have the same major cycle, which is synchronized to the shared memory arbiter. Each of the processors has a small local method cache (M$) for storing recently used methods, a local stack cache (S$), and a small local scratchpad memory (SPM) for storing temporary data. The scratchpad memory can be mapped to thread-local scopes [6] for integration into the Java programming language. All caches contain only thread-local data and therefore no cache coherence protocol is needed. To avoid cache conflicts between the different cores, our CMP system does not provide a shared cache. Instead, the cores of the time-predictable CMP system access the shared main memory via a TDMA-based memory arbiter with fine-grained, statically scheduled access.

2.2 TDMA Memory Arbiter

The TDMA-based memory arbiter provides a static schedule for the memory access. Therefore, access time to the memory is independent of tasks running on other cores. In the default configuration each processor core has an equally sized slot for the memory
Fig. 1. A JOP based CMP system with core local caches (M$, S$) and scratchpad memories (SPM), a TDMA based shared memory arbiter, and the memory controller
access. The TDMA schedule can also be optimized for different utilizations of processing cores. In [7] we have optimized the TDMA schedule to distribute slack time of tasks to other tasks with a tighter deadline. The worst-case execution time (WCET) of memory loads or stores can be calculated by considering the worst-case phasing of the memory access pattern relative to the TDMA schedule [8]. With single-path programming, and the resulting static memory access pattern, the execution time of tasks on a TDMA-based CMP system is almost constant. The only jitter results from different phases of the task start time relative to the TDMA schedule. The maximal execution time jitter, due to different phases between the task start time and the TDMA schedule, is the length of the TDMA round minus one. Thus, the TDMA arbiter supports time-predictable program execution very well. The maximal jitter due to TDMA delays is bounded and relatively small. If one is interested in completely avoiding even this short, bounded execution time jitter, this can be achieved by synchronizing the task start with the TDMA schedule, using the deadline instruction described in Section 3.2.

2.3 Tasks

All tasks in our system are periodic. Tasks are considered to be simple tasks according to the Simple-Task Model introduced in [9]:¹ Task inputs are assumed to be available when
¹ More complex task structures can be simulated by splitting tasks into sets of cooperating simple tasks.
a task instance starts, and outputs become ready for further processing upon completion of a task execution. Within its body a task is purely functional, i.e., it neither accesses common resources nor includes delays or synchronization operations. To realize the simple-task abstraction, a task implementation actually consists of a sequence of three parts: read inputs – execute – write outputs. While the application programmer must provide the code for the execute part (i.e., the functional part), the first and the third part can be automatically generated from the description of the task interface. These read and write parts of the task implementations copy data between the shared state and task-local copies of that state. The local copies can reside in the common main memory or in the processor-local scratchpad memory. The placement depends on the access frequency and size of the local state. Care must be taken to schedule the data transfers between the local state copy and the global, shared state such that all precedence and mutual exclusion constraints between tasks are met. This scheduling problem is very similar to the problem of constructing static scheduling tables for distributed hard real-time computer systems with TDMA message scheduling, in which task execution has to be planned such that task-order relations are obeyed and the message and task sequencing guarantees that all communication constraints are met. A solution to this scheduling problem can be found in [10]. Following our strategy to achieve predictability by minimizing the number of control decisions taken during runtime, all tasks are implemented in single-path code. This means that we apply the single-path transformation described in [1,2] to (a) serialize all input-dependent branches and (b) transform all loops with input-dependent termination into loops with a constant iteration count.
In this way, each instance of a task executes the same sequence of instructions and has the same temporal access pattern to instructions and data.

2.4 Mechanisms for Performance and Time Predictability

By executing tasks on different cores, each with some local cache and scratchpad memory, we increase the system's performance over a single-processor system. The following mechanisms make the operation of our system highly predictable:

– Tasks on a single core are executed in a cyclic executive, avoiding cache influences due to preemption.
– Accesses to the global shared memory are arbitrated by a static TDMA memory arbitration scheme, thus leaving no room for unpredictable conflict-resolution schemes and unknown memory access times.
– The starting point of all task periods and the starting point of the TDMA cycle for memory accesses are synchronized, and each task execution starts at a pre-defined offset within its period. Further, the single-path task implementation guarantees a unique trace of instruction and memory accesses. All these properties taken together allow for an exact prediction of instruction execution times and memory access times, thus making the overall task timing fully transparent and predictable.
– As the read and write sections of the tasks may need more than a single TDMA slot for transferring their data between the local and the global memory, read and write operations are pre-planned and executed in synchrony with the global execution cycle of all tasks.
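The three-part task structure described above can be sketched as follows. This is a minimal illustration of the simple-task model, not code from the actual system: class and member names are ours, and plain Java fields stand in for the shared main memory and the scratchpad-resident local copies.

```java
// Sketch of the simple-task structure: read inputs – execute – write outputs.
// On the real system, the read/write parts would be generated from the
// task-interface description; only execute() is programmer-provided.
public class SimpleTask {
    // Global shared state (stands in for the shared main memory).
    static int sharedIn;
    static int sharedOut;

    // Task-local copies (stand in for scratchpad or local memory).
    private int localIn;
    private int localOut;

    void read()    { localIn = sharedIn; }          // generated copy-in part
    void execute() { localOut = 2 * localIn + 1; }  // purely functional body
    void write()   { sharedOut = localOut; }        // generated copy-out part

    // One task instance: the three parts always run in this order.
    void runInstance() {
        read();
        execute();
        write();
    }
}
```

Because the execute part touches only local state, its timing is independent of other tasks; all memory contention is confined to the pre-planned read and write parts.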
A Single-Path Chip-Multiprocessor System
Besides its support for predictability, our planning-based approach allows for the following optimizations of the TDMA schedules for global memory accesses. These optimizations are based on the knowledge available at planning time:

– The single-path implementation of tasks allows us to spot exactly which parts of a task's execute part need a higher and which parts need a lower bandwidth for accessing the global memory (e.g., a task does not have to fetch instructions from global memory while executing a method that it has just loaded into its local cache). This information can be used to adapt the memory-access schedule to optimize the overall performance of memory accesses. While an adaptation of memory-access schedules to the bandwidth requirements of different processing phases has been proposed before [11,12], it seems that this technique can provide its maximum benefit when applied to single-path code – only the execution of single-path code yields a unique, and therefore fully predictable, sequence and timing of memory accesses.
– A similar optimization can be applied to the timing of memory accesses during the read and write sections of the task implementations. These sections access shared data and should therefore run under mutual exclusion. Mutual exclusion is guaranteed by the static, table-driven execution regime of the system. Still, the critical sections should be kept short. The latter could be achieved by an adaptation of the TDMA memory schedule that assigns additional time slots to tasks at times when they perform memory-transfer operations.

Our target is a time-deterministic system, which means that not only the value of a function is deterministic, but also its execution time. It is desirable to know exactly which instruction is executed at each point in time. Execution time shall be a repeatable and predictable property of the system [13].
3 Implementation

The proposed design is evaluated in the context of the Java optimized processor (JOP) [4] based CMP system [5]. We have extended JOP with two instructions: a predicated move instruction for single-path programming in Java and a deadline instruction to synchronize application tasks with the TDMA-based memory arbiter.

3.1 Conditional Move

Single-path programming substitutes control decisions (if-then-else) by predicated move instructions. To avoid execution-time jitter, the predicated move has to have a constant execution time. On JOP we have implemented a predicated move for integer values and references. This instruction represents a new, system-specific Java virtual machine (JVM) bytecode. The new bytecode is mapped to a native function for access from Java code. The semantics of the function

result = Native.condMove(x, y, b);
is equivalent to
result = b ? x : y;
without the need for any branch instruction. The following listing shows the usage of the conditional move for integer and reference data types. The program will print 1 and true.

String a = "true";
String b = "false";
String result;
int val;
boolean cond = true;

val = Native.condMove(1, 2, cond);
System.out.println(val);
result = (String) Native.condMoveRef(a, b, cond);
System.out.println(result);
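Input-dependent loops are handled analogously. The following sketch is our own illustration, not output of the transformation tool: the loop always runs for a constant iteration count and a predicate neutralizes the iterations beyond the input-dependent bound. On JOP the ternary operator would be realized with Native.condMove so that no branch remains.

```java
// Single-path style summation: the loop always executes MAX_N iterations,
// so the instruction sequence and timing do not depend on the input n.
public class SinglePathLoop {
    static final int MAX_N = 8; // assumed worst-case iteration bound

    static int sum(int[] a, int n) {
        int s = 0;
        for (int i = 0; i < MAX_N; i++) {
            // Predicate: is this iteration still "live"?
            boolean live = i < n && i < a.length;
            int v = live ? a[i] : 0; // conditional move on JOP
            s += v;
        }
        return s;
    }
}
```

The result is the same as that of the original input-terminated loop, but every call executes the identical instruction sequence.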
The representation of the conditional move as a native function call has no call overhead; the function call is substituted by the system-specific bytecode at link time (similar to function inlining).

3.2 Deadline Instruction

In order to synchronize a task with the TDMA schedule, a wait instruction with a resolution of single clock cycles is needed. We have implemented a deadline instruction as proposed in [14]. The deadline instruction stalls the processor pipeline until the desired time in clock cycles. To avoid a change in the execution pipeline, we have implemented a semantic equivalent of the deadline instruction: instead of changing the instruction set of JOP, we have implemented an I/O device for the cycle-accurate delay. The time value for the absolute delay is written to the I/O device, and the device delays the acknowledgment of the I/O operation until the cycle counter reaches this value. This simple device is independent of the processor and can be used in any architecture where an I/O request needs an acknowledgment.

I/O devices on JOP are mapped to so-called hardware objects [15]. A hardware object represents an I/O device as a plain Java object; field reads and writes are actual I/O register reads and writes. The following code shows the usage of the deadline I/O device.

SysDevice sys = IOFactory.getFactory().getSysDevice();
int time = sys.cntInt;
time += 1000;
sys.deadLine = time;
The first statement requests a reference to the system device hardware object. This object (sys) is accessed to read out the current value of the clock-cycle counter. The deadline is set to 1000 cycles after the current time; the assignment sys.deadLine = time writes the deadline time stamp into the I/O device and blocks until that time.
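A software model can clarify these semantics. The following is our own simulation, not the JOP hardware or its API: writing an absolute time value stalls the writer until the cycle counter reaches that value, which lets a task align its next activation with a multiple of the TDMA round.

```java
// Simulated deadline I/O device: the write "blocks" (here: the simulated
// clock jumps forward) until the cycle counter reaches the written value.
public class DeadlineDeviceSim {
    private long cycle;                 // simulated clock-cycle counter

    long cnt() { return cycle; }
    void tick(long n) { cycle += n; }   // computation consumes n cycles

    void deadline(long absoluteTime) {
        if (absoluteTime > cycle) cycle = absoluteTime; // stall until then
    }

    // Stall until the next multiple of the TDMA round length.
    long alignToRound(long round) {
        long next = ((cycle + round - 1) / round) * round;
        deadline(next);
        return next;
    }
}
```

For example, after 343 cycles of work, alignToRound(18) stalls until cycle 360, so every task iteration observes the same arbiter phase.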
4 Evaluation

We evaluate our proposed system within a Cyclone EP1C12 field-programmable gate array that contains 3 processor cores and 1 MB of shared memory. The shared memory is an SRAM with a 2-cycle read access time and a 3-cycle write access time. Some bytecode instructions contain several memory accesses (e.g., an array access needs three memory reads: a read of the array size for the bounds check, an indirection through a forwarding handle,² and the actual read of the array value). For several bytecode instructions the WCET is minimized with a slot length of 6 cycles. The resulting TDMA round for three cores is 18 cycles.

As a first experiment we measure the execution time of a short program fragment with access to the main memory. Without synchronizing the task start with the TDMA arbiter we expect some jitter. To provoke all possible phase relations between the task and the TDMA schedule, the deadline instruction was used to shift the task start relative to the TDMA schedule. The resulting execution time varies between 342 and 359 clock cycles. Therefore, the maximum observed execution-time jitter is the length of the TDMA round minus one (17 cycles). With the deadline instruction we make each iteration of the task start at a multiple of the TDMA round (18 clock cycles in our example). In that case each task executes for a cycle-accurate constant duration. This little experiment shows that single-path programming on a CMP system, synchronized with the TDMA-based memory arbitration, results in repeatable execution time [13].

4.1 A Sample Application

To validate our programming model for cycle-accurate real-time computing, we developed a controller application that consists of five communicating tasks. This case study demonstrates that cycle-accurate computing is possible on a CMP system. Further, it gives us some insights into the practical aspects of using the proposed programming model.
The architecture of the sample application is given in Figure 2. The application is demonstrative because of its rather complex inter-task communication pattern, which shows the need for precise scheduling decisions to meet the different precedence constraints. The application consists of the following tasks:

– τ1 and τ2 are the sampling tasks that read from sensors. τ1 samples the reference value and τ2 samples the system value. These two tasks share the same code base and run at twice the frequency of the controller task to allow low-pass filtering by averaging the sensor values.
– τ3 is the proportional-integral-derivative controller (PID controller) that gets the reference value from τ1 and the feedback of the current system value from τ2.
– τ4 is a system guard, similar to a watchdog timer, that monitors the liveness of τ1, τ2, and τ3. Whenever the write phase of τ1, τ2, or τ3 has not been executed between two subsequent activations of τ4, the system is set into an error state.
² The forwarding handle is needed for the implementation of the real-time garbage collector.
[Figure: the tasks τ1 (STSampler), τ2 (STSampler), τ3 (STController), τ4 (STGuard), and τ5 (STMonitor) distributed over Core 1, Core 2, and Core 3 of the JOP chip-multiprocessor.]

Fig. 2. Sample application: control application
[Figure: communication links between the tasks τ1–τ5.]

Fig. 3. Communication directions of the control application
– τ5 is a monitoring task that periodically collects the sensor values (from τ1 and τ2) and the control value (from τ3). The write part of τ5 is currently empty, but it can be used to include the code for transferring the collected system state to a host computer.

The inter-task communication of the sample application is summarized in Figure 3. It shows that this small application has a relatively complex communication pattern: each task communicates with almost all other tasks. The communication pattern has a direct influence on the system schedule. The resulting precedence constraints have to be taken into account when scheduling the read, execute, and write phases of each task. And of course, since this is a CMP system, some of the task phases are executed in parallel, which complicates the search for a tight schedule.

Tasks τ1–τ5 are implemented in single-path code, so their execution time does not depend on control-flow decisions. Since the scheduler also has a single-path implementation, the system executes exactly the same instruction sequence in each scheduling round.
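The table-driven execution regime can be sketched as follows. This is our own illustration (the entry layout and names are assumptions, not the system's data structures): a pre-computed table lists, per scheduling round, the activation offset and the task phase to run, so dispatch itself contains no input-dependent control decisions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a table-driven cyclic executive: the scheduler just walks the
// static table in order; offsets and phase order are fixed at planning time.
public class CyclicExecutive {
    static class Entry {
        final int offset;    // activation offset within the round (cycles)
        final String phase;  // e.g. "tau1.read"
        Entry(int offset, String phase) { this.offset = offset; this.phase = phase; }
    }

    // Execute one scheduling round; returns the dispatch trace.
    static List<String> runRound(Entry[] table) {
        List<String> trace = new ArrayList<>();
        for (Entry e : table) {
            // Here the real system would stall (deadline instruction) until
            // 'offset' and then run the phase; we only record the dispatch.
            trace.add(e.offset + ":" + e.phase);
        }
        return trace;
    }
}
```

Because the table is fixed, every round produces the identical dispatch sequence, matching the repeatable timing argued for above.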
Table 1. Measured single-path execution time in clock cycles

Task      Read    Execute    Write    Total
τ1, τ2     594        774      576     1944
τ3         864      65250      576    66690
τ4       26604        324    28422    55350
τ5        1368        324      324     2016
All tasks are synchronized on each activation with the same phase of the TDMA-based memory arbiter. Therefore, their execution time does not have any jitter due to different phase alignments of the memory arbiter. With such an implementation style it is possible on JOP to determine the WCET of each task directly by a single execution-time measurement (by enforcing either a cache hit or a cache miss of the method). Table 1 shows the observed WCET values for each task, given separately for the read, execute, and write parts. The absolute WCET values are not that important; more important is the fact that the execution time of each task is deterministic and does not depend on the input data.

To summarize the practical aspects of the programming model: even this relatively simple application results in a scheduling problem that is rather tricky to solve without tool support. For the purpose of this paper we solved it manually, using a graphical visualization of the relative execution times to determine the activation times of each task. However, to successfully use this programming model for industrial production code, the use of a scheduling tool is highly advisable [10]. With respect to generating a tight schedule, the predictable execution time of all tasks turned out to be very helpful.
5 Related Work

Time-predictable multi-threading is developed within the PRET project [14]. The processor cores are based on a RISC architecture. Chip-level multi-threading for up to six threads eliminates the need for data forwarding, pipeline stalling, and branch prediction. The accesses of the individual threads to the shared main memory are scheduled, similarly to our TDMA arbiter, by the so-called memory wheel. The PRET architecture implements the deadline instruction to perform time-based, instead of lock-based, synchronization for access to shared data. In contrast to our simple-task model, where synchronization is avoided due to the three different execution phases, the PRET architecture performs time-based synchronization within the execution phase of a task.

The approach most closely related to our work is presented in [11,12]. The proposed CMP system is also intended for tasks according to the simple-task model [9]. Furthermore, the local cache loading for the cores is performed from a shared main memory. Similar to our approach, a TDMA-based memory arbitration is used. The papers deal with optimization of the TDMA schedule to reduce the WCET of the tasks. The design also considers changes of the arbiter schedule during task execution to optimize the execution time. We think that this optimization can be best performed when the
access pattern to the memory is statically known – which is only possible with single-path programming. Therefore, the former approach to TDMA schedule optimization should be combined with our single-path based CMP system.

Optimization of the TDMA schedule of a CMP-based real-time system has also been proposed in [7]. The described system proposes a single core per thread to avoid the overhead of thread preemption. It is argued that future systems will contain many cores and that the limiting resource will be the memory bandwidth. Therefore, the memory access is scheduled instead of the processing time.
6 Conclusion

A statically scheduled chip-multiprocessor system with single-path programming and a TDMA-based memory arbitration delivers repeatable timing. The repeatable and predictable timing of the system simplifies the safety argument: measurement of the execution time can be used instead of WCET analysis. We have evaluated the idea in the context of a time-predictable Java chip-multiprocessor system. The cycle-accurate measurements showed that the approach is sound.

For the evaluation of the system we have chosen a TDMA slot length that is optimal for the WCET of individual bytecodes. Whether this slot length is also optimal for single-path code is an open question. In future work we will evaluate different slot lengths to optimize the execution time of single-path tasks. Furthermore, the change of the TDMA schedule at predefined points in time is another option we want to explore.
Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement numbers 214373 (Artist Design) and 216682 (JEOPARD).
References

1. Puschner, P., Burns, A.: Writing temporally predictable code. In: Proc. 7th IEEE International Workshop on Object-Oriented Real-Time Dependable Systems, pp. 85–91 (January 2002)
2. Puschner, P.: Transforming execution-time boundable code into temporally predictable code. In: Kleinjohann, B., Kim, K.K., Kleinjohann, L., Rettberg, A. (eds.) Design and Analysis of Distributed Embedded Systems, pp. 163–172. Kluwer Academic Publishers, Dordrecht (2002); IFIP 17th World Computer Congress – TC10 Stream on Distributed and Parallel Embedded Systems (DIPES 2002)
3. Allen, J., Kennedy, K., Porterfield, C., Warren, J.: Conversion of Control Dependence to Data Dependence. In: Proc. 10th ACM Symposium on Principles of Programming Languages, pp. 177–189 (January 1983)
4. Schoeberl, M.: A Java processor architecture for embedded real-time systems. Journal of Systems Architecture 54(1-2), 265–286 (2008)
5. Pitter, C., Schoeberl, M.: A real-time Java chip-multiprocessor. Trans. on Embedded Computing Sys. (accepted for publication, 2009)
6. Wellings, A., Schoeberl, M.: Thread-local scope caching for real-time Java. In: Proceedings of the 12th IEEE International Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC 2009), Tokyo, Japan. IEEE Computer Society, Los Alamitos (2009)
7. Schoeberl, M., Puschner, P.: Is chip-multiprocessing the end of real-time scheduling? In: Proceedings of the 9th International Workshop on Worst-Case Execution Time (WCET) Analysis, Dublin, Ireland, OCG (July 2009)
8. Pitter, C.: Time-predictable memory arbitration for a Java chip-multiprocessor. In: Proceedings of the 6th International Workshop on Java Technologies for Real-time and Embedded Systems (JTRES 2008) (2008)
9. Kopetz, H.: Real-Time Systems. Kluwer Academic Publishers, Dordrecht (1997)
10. Fohler, G.: Joint scheduling of distributed complex periodic and hard aperiodic tasks in statically scheduled systems. In: Proceedings of the 16th Real-Time Systems Symposium, pp. 152–161 (December 1995)
11. Andrei, A., Eles, P., Peng, Z., Rosen, J.: Predictable implementation of real-time applications on multiprocessor systems on chip. In: Proceedings of the 21st Intl. Conference on VLSI Design, pp. 103–110 (January 2008)
12. Rosen, J., Andrei, A., Eles, P., Peng, Z.: Bus access optimization for predictable implementation of real-time applications on multiprocessor systems-on-chip. In: Proceedings of the Real-Time Systems Symposium (RTSS 2007), pp. 49–60 (December 2007)
13. Lee, E.A.: Computing needs time. Commun. ACM 52(5), 70–79 (2009)
14. Lickly, B., Liu, I., Kim, S., Patel, H.D., Edwards, S.A., Lee, E.A.: Predictable programming on a precision timed architecture. In: Altman, E.R. (ed.) Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES 2008), Atlanta, GA, USA, pp. 137–146. ACM, New York (2008)
15. Schoeberl, M., Korsholm, S., Thalinger, C., Ravn, A.P.: Hardware objects for Java. In: Proceedings of the 11th IEEE International Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC 2008), Orlando, Florida, USA. IEEE Computer Society, Los Alamitos (2008)
Towards Trustworthy Self-optimization for Distributed Systems

Benjamin Satzger, Florian Mutschelknaus, Faruk Bagci, Florian Kluge, and Theo Ungerer

Department of Computer Science, University of Augsburg, Germany
{satzger,bagci,kluge,ungerer}@informatik.uni-augsburg.de
http://www.informatik.uni-augsburg.de/sik
Abstract. The increasing complexity of computer-based technical systems requires new ways to control them. The initiatives Organic Computing and Autonomic Computing address exactly this issue. They demand that future computer systems adapt dynamically and autonomously to their environment, and they postulate so-called self-* properties. These are typically based on decentralized autonomous cooperation of the system's entities. Trust can be used as a means to enhance cooperation schemes, taking into account trust facets such as reliability. The contributions of this paper are algorithms to manage and query trust information. It is shown how such information can be used to improve self-* algorithms. To quantify our approach, evaluations have been conducted.

Keywords: Trust, self-*, self-optimization, Organic Computing, Autonomic Computing.

1 Introduction
The evolution of computer systems, starting from mainframes towards ubiquitous distributed systems, has progressed rapidly. Common to early systems is the need for human administrators. Future systems, however, should act to a large extent autonomously in order to keep them manageable. The investigation of techniques that allow complex distributed systems to self-organize is therefore of high importance. The initiatives Autonomic Computing [14,5] and Organic Computing [3] propose systems with life-like properties and the ability to self-configure, self-optimize, self-heal, and self-protect.

Safety and security play a major role in information technology, and especially in the area of ubiquitous computing. Nodes in such a system are restricted to a local view, and typically no central instance can be responsible for the control and organization of the whole network. Trust and reputation can serve as a means to build safe and secure distributed systems in a decentralized way. With appropriate trust mechanisms, nodes of a system have a clue about which nodes

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 58–68, 2009. © IFIP International Federation for Information Processing 2009
to cooperate with. This is very important to improve reliability and robustness in systems that depend on the cooperation of autonomous nodes.

In this paper we adopt the definition that trust is a peer's belief in another peer's trust facet. There are many facets of trust in computer systems. Such facets may concern, for instance, availability, reliability, functional correctness, and honesty. The related term reputation emphasizes that trust information is based on recommendation. The development of trustworthy self-* systems concerns aspects of (1) the generation of trust values based on direct experiences, (2) the storage, management, and retrieval of trust values, and (3) the usage of this information to enhance the trustworthiness of the overall system. Generating direct trust values strongly depends on the trust facet. Direct experiences concerning the facet availability may be gathered by using heartbeat messages. The facet functional correctness of a sensor node may be estimated by comparison with the measured values of sensors nearby. In this paper we focus on (2) and (3), i.e., how to manage, access, and use trust information.

An instance of a distributed ubiquitous system which exploits self-* properties is our Smart Doorplate Project [11]. This project envisions the use of smart doorplates within an office building. The doorplates are, among other things, able to display current situational information about the office owner and to direct visitors to his current location based on a location-tracking system. A middleware called "Organic Computing Middleware for Ubiquitous Environments" (OCµ) [10] serves as a common platform for all included devices. The middleware system OCµ was developed to offer self-configuration, self-optimization, self-healing, and self-protection capabilities. It is based on the assumption that applications are composed of services, which are distributed to the nodes of the network.
Service distribution is performed during the initial self-configuration phase, considering the available resources and the requirements of the services. At runtime, resource consumption is monitored. In earlier work we developed a self-optimization mechanism [13,12] to balance the resource consumption (load) between nodes by transferring services to other nodes. OCµ is an open system and is designed to allow services and nodes from different manufacturers to interact. In this work we incorporate a trust mechanism into our middleware to allow network entities to decide to what extent to cooperate with other nodes/services. This is used to enhance the self-optimization algorithm.

The paper is organized in seven sections. Section 2 gives an overview of the state of the art of trust in distributed systems. Section 3 presents the basic self-optimization algorithm we have developed. Section 4 introduces different algorithms to build a trust management layer which is able to provide the functionalities covered by (2) as mentioned above. In Section 5 we present the trustworthy self-optimization, which extends the basic self-optimization and takes trust into account; this refers to (3). Then, Section 6 describes measurements of an implementation of the algorithm. Finally, Section 7 concludes the paper.
2 Related Work
There are many approaches to incorporating trust into distributed systems. In this section some relevant papers are highlighted. Repantis et al. [7] describe a middleware based on reputation. In their model, nodes can request data and services (objects) and may receive several responses from different nodes. In this case the object from the most trustworthy provider is chosen. The information about the reputation of a node is stored on its direct neighbors and appended to response messages. The nodes define thresholds for any object they request; thus, only providers with a higher reputation are taken into account. After reception of an object the provider is rated based on the satisfaction of the requester. In [7] nodes share one common reputation value, which means that all nodes have the same trust in a certain node. In contrast, Theodorakopoulos et al. [9] describe a model based on a directed graph in which the vertices are the network's nodes and the weighted edges describe trust relations. The weight of an edge (u, v) describes the trust of node u in v. Each weight consists of a trust value and the confidence in its correctness.

The TrustMe [8] protocol focuses on the anonymity of network members. It represents a technique to store and access trust information within a peer-to-peer network; the mining of trust information plays a minor role. An asymmetric encryption technique is used to allow for protection against attacks. In contrast to many trust management systems, which support very limited anonymity or assume anonymity to be an undesired feature, TrustMe emphasizes the importance of anonymity. The protocol provides anonymity for both the trust provider and the trust requester. Cornelli et al. [4] present a common approach to requesting trust values: a node A interested in the reputation of node B sends a broadcast message and receives responses from all nodes that have a trust value for B. The response messages are encrypted with the public key of A.
After reception of an encrypted answer the node contacts the responder to identify bogus messages. In [6], a special approach is used to store trust information: Distributed Hash Tables (DHTs) store the trust value of a node on a number of parent nodes, which are identified by hash functions applied to the id of the child. A node requesting the trust value of a network member uses the hash functions to calculate the parents which hold the value and sends a request to them.

Aberer et al. [2] present a reputation system to gather trust values. An interesting point is that trust values are mutual, while traditionally nodes judge independently; the hope is to achieve an advantage through this cooperation. This idea is integrated into the calculation of the global trust value of a node: both the interactions triggered by the node itself and the requested interactions are taken into account. Global trust values are binary, i.e., nodes are considered either trustworthy or not. If nodes cheat during an interaction they are considered globally untrustworthy. If a node detects a cheating communication partner it files a complaint. With a growing number of interactions with different nodes, the probability rises that a liar is unmasked. In this model reputation values are stored within the network in a distributed way, using the so-called PGrid [1].
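The parent-selection step of the DHT-based scheme in [6] can be sketched as follows. This is our own illustration; the hash functions and parameters are placeholders, not those of the original system. The key idea is that the parents storing a child's trust value are derived deterministically from the child's id, so any requester can recompute where the value is held.

```java
// Derive the ids of the k parent nodes that store a child's trust value.
// Each "hash function" is simulated by salting the child id with an index.
public class TrustDht {
    static int[] parents(String childId, int k, int networkSize) {
        int[] p = new int[k];
        for (int i = 0; i < k; i++) {
            int h = (childId + "#" + i).hashCode();
            p[i] = Math.floorMod(h, networkSize); // node id in [0, networkSize)
        }
        return p;
    }
}
```

Because the derivation is deterministic, the requester and the storing nodes always agree on the parent set without any lookup messages.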
3 Basic Self-optimization Algorithm
The basic self-optimization algorithm [13,12] is inspired by the human hormone system. Its task is to balance the load of a service-based distributed system. This artificial hormone system consists of metrics which calculate a reaction (a service transfer), nodes producing digital hormones which indicate their load, receptors collecting hormones and handing them over to the metrics, and finally the digital hormones holding load information. To minimize overhead, the digital hormone value encodes both the activator and the inhibitor hormone: if the value of the digital hormone is above a given level it activates the reaction, while a lower value inhibits it. To further reduce overhead, hormones are piggybacked onto application messages and do not result in additional messages.

The basic idea behind the self-optimization is: when a heavily loaded node receives a message containing a hormone which states that the sender is lightly loaded, services are transferred to this sender. The metrics used to decide whether to balance the load between two nodes of the network are named transfer strategies because they decide on the transfer of a service. Our self-optimization has the ability to improve a system during runtime. It yields very good results in load-balancing, using only local decisions and with minimal overhead. However, it is not considered whether a service is transferred to a trustworthy node. Bogus nodes (e.g., nodes running a malicious middleware) might attract services in a systematic way and could induce a breakdown of the system. Unreliable, faulty nodes might not have the ability to properly host services. Conversely, one would want to utilize particularly reliable, trustworthy nodes for important services. Therefore, we propose to incorporate trust information into the transfer decision.
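A minimal transfer strategy along these lines might look as follows. The thresholds and names are our assumptions for illustration; the actual metrics in [13,12] are more elaborate. The single hormone value acts as an activator when it signals low remote load and as an inhibitor otherwise.

```java
// Sketch of a hormone-based transfer strategy: transfer a service when the
// local node is heavily loaded and the received hormone indicates that the
// sender is lightly loaded. Loads are normalized to [0, 1].
public class TransferStrategy {
    static final double LOCAL_HIGH = 0.8; // assumed activation threshold
    static final double REMOTE_LOW = 0.3; // assumed inhibition threshold

    // Decide on transferring a service to the sender of the hormone.
    static boolean transfer(double localLoad, double senderHormone) {
        return localLoad > LOCAL_HIGH && senderHormone < REMOTE_LOW;
    }
}
```

The trustworthy variant proposed in this paper would additionally require the sender's trust value to exceed a threshold before a transfer is allowed.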
4 Trust Management
As mentioned, a trust system needs a component which generates trust values based on direct experiences, and it needs a component which is able to process this information. The generation of direct trust values depends strongly on the trust facet and the domain. Since a network can be seen as a graph, each node has a set of direct neighbors. In order to be able to estimate the trust facet availability, direct neighbors may periodically check each other's availability status. But depending on domain and application the mining of trust values differs. In a sensor network the measured data of a node can be compared with the measurements of its neighbors; in this way nonconforming sensors can be identified. Since we do not focus on the generation of trust values by observation but on the generic management of trust information, we simply assume that any node has a trust value about each of its direct neighbors. This trust value T(k1, k2) is within [0, 1] and reflects the subjective trust of node k1 in node k2 based on its experiences. T(k1, k2) = 0 means k1 does not trust k2 at all, while a value of 1 stands for 'full' trust. We assume that direct neighbors directly monitor each other to determine a trust value. This trust value might be inadequate due to insufficient monitoring
data. It may be possible that the trust value is either too optimistic or too pessimistic. With continuous monitoring of the neighbors, their trust can be estimated better and better.

In the following, three trust algorithms are presented that manage trust information in order to make it useful for the network's entities. The algorithms are explained as being used together with an algorithm that spreads information via hormones, like our self-optimization algorithm. However, the trust management is not limited to such a usage. It is assumed that nodes do not alter messages they forward. In a real-world application this should be assured by the usage of security techniques like encryption.

4.1 Forwarder
Only direct neighbors are able to measure trust directly. Nodes outside the direct neighborhood need to rely on some kind of reputation; direct neighbors can also use reputation as additional information. This first approach to propagating trust is quite simple. When a node B sends an application message to a node F, the direct neighbor D which forwards the message appends its trust value T(D, B) to it. Node F thus receives a message containing a hormone (with, e.g., load information) and D's trust in B, as shown in Figure 1. This trust value is highly subjective, as it is measured by only one node, but the approach introduces no additional messages for trust retrieval: immediately after the receipt of an application message, the receiver has information about the trust of the sender.
Fig. 1. Forwarder algorithm
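As a sketch of this step (the message and trust-table layouts below are hypothetical, not the paper's implementation), the forwarder simply attaches its own direct trust in the original sender:

```python
def forward(message, forwarder_id, trust_table):
    """Return a copy of the message with the forwarder's trust in the sender attached."""
    annotated = dict(message)
    sender = message["sender"]
    # T(D, B): the forwarder's direct, subjective trust in the sender (None if unknown).
    annotated["forwarder_trust"] = (forwarder_id, trust_table.get(sender))
    return annotated

# Node D forwards a hormone-carrying message from B towards F.
msg = {"sender": "B", "hormone": {"load": 0.4, "capacity": 1.0}}
trust_of_D = {"B": 0.8, "C": 0.3}            # D's direct observations
received = forward(msg, "D", trust_of_D)
print(received["forwarder_trust"])           # ('D', 0.8)
```

No extra messages are needed: the receiver learns one (subjective) trust value for free with every forwarded application message.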
4.2 Distant Neighborhood
In this approach, not only the trust value of one direct neighbor is considered but those of all neighbors of the node in question. A node sends an application message with an appended hormone to the receiver. If the receiver needs to know the trust of the sender, it sends a trust request to all neighbors of the sender, which reply with a trust response message. Finally, trust in the sender is calculated as the average of all received trust values. This method produces many more messages than the Forwarder algorithm, but also provides more reliable information. A further advantage is the ability to detect diverging trust values, which might be used to identify bogus nodes. In Figure 2, node B sends
Fig. 2. Distant Neighborhood algorithm
an application message together with its capacity and load to node J, which afterwards asks A, C, D, and E for their trust in node B.

4.3 Near Neighborhood
In this variant (see Fig. 3), a node spreads trust requests to query trust information, e.g., after the receipt of an application message. These trust requests carry a hop counter. Initially the hop counter is set to one, which means that the node first asks its own direct neighbors. If they have information about the target node, they reply with their trust value; otherwise they send a negative response. If the node receives positive answers, it averages the corresponding values. If it receives only negative answers, it increments the hop counter and repeats the trust request. At the latest when the request reaches a direct neighbor of the target node, a trust value is returned. The requesting node stores the resulting trust value in order to answer other requests. Note that a node will execute this algorithm in order to update and refine its data even if it already has trust information about the target node; in that case, the existing trust value is integrated into the averaging process. Initially, this algorithm produces many more messages than
Fig. 3. Near Neighborhood algorithm
the ones described above. However, as trust values are distributed within the network, the number of messages decreases over time, because it becomes more and more likely that only a few hops are needed until a node with information about the target node is found. The same holds for the accuracy of trust values, which increases with the runtime of the algorithm.
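The expanding-ring search described above can be sketched as follows. This is a toy simulation with hypothetical data structures (a dict-of-lists graph and per-node trust tables), not the paper's implementation:

```python
def hop_neighbors(graph, start, hops):
    """All nodes at exactly `hops` hops from `start` (BFS by level)."""
    level, seen = {start}, {start}
    for _ in range(hops):
        level = {n for v in level for n in graph[v] if n not in seen}
        seen |= level
    return level

def near_neighborhood_trust(graph, requester, target, trust_tables):
    """Expanding-ring trust query: ask nodes at hop distance 1, 2, ... until at
    least one positive answer arrives; average the received trust values."""
    hops = 1
    while True:
        ring = hop_neighbors(graph, requester, hops)
        if not ring:
            return None                      # target unreachable: no answer possible
        answers = [trust_tables[n][target] for n in ring if target in trust_tables[n]]
        if answers:
            return sum(answers) / len(answers)
        hops += 1

graph = {"X": ["Y"], "Y": ["X", "Z"], "Z": ["Y", "T"], "T": ["Z"]}
trust = {"X": {}, "Y": {}, "Z": {"T": 0.6}, "T": {}}     # only Z has observed T
print(near_neighborhood_trust(graph, "X", "T", trust))   # 0.6 (found at hop 2)
```

Caching the result (as the text describes) would make later queries for the same target terminate at hop 1.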
5 Trustworthy Self-optimization
The self-optimization reallocates services during runtime to ensure a uniform distribution of load. The version described in Section 3 does not take the trust of nodes into account. In the following, an approach is presented to incorporate trust into the self-optimization; any of the three trust management algorithms presented above can be used for this purpose. We assume that different services have different priorities: if a service is of high importance for the functionality of the system or collects sensitive data, its priority is high. The prioritization may be defined at design time or adapted dynamically during runtime. The trustworthy self-optimization aims at load-sharing with additional consideration of the services' priority, i.e., it avoids hosting services with high priority on nodes with low trust. A self-optimization step is performed strictly locally, with no global control. As in the basic self-optimization algorithm, each node piggybacks hormones containing information about its current load and its capacity onto each application message. This enables the receiver to compare this load value with its own load. Additionally, the sender's trust can be queried via the proposed trust management layer. A transfer strategy takes this information and decides whether or not to transfer a service. If a service is identified, an attempt is made to transfer it to the sender. This is only possible if the sender still has enough free resources to host the service; due to the dynamics of the network, this is not always guaranteed. The basic idea of the transfer strategy is to find a balance between pure load-balancing and trustworthiness of the service distribution. The parameter α determines which aspect to focus on: a higher value for α emphasizes the need to transfer services to nodes with optimal trust values, while a lower value results in a focus on pure load-balancing.
All services that B is able to host, for which A's trust in B is higher than their priority, and whose relocation would balance the load significantly, are considered for transfer.
6 Evaluation
Several test scenarios have been investigated in order to evaluate the trustworthy self-optimization. Each scenario consists of 100 nodes with random resource capacities (e.g., RAM); in addition, a random global but hidden trust value is assigned to each node. In real-world applications the trust of a node must be measured; as this strongly depends on the trust facet and the application, we have chosen a theoretical approach to simulate trust estimation by direct observation. It is based on the assumption that nodes are able to estimate the trust of another node better and better over time.

Fig. 4. Simulation of direct trust monitoring

In the simulation, the trust of a node in its direct neighbor converges to the true hidden trust value with an increasing number of mutual interactions, as shown in Figure 4. In this example, the true trust value of the node is 0.5. Initially, the node is only able to estimate the trust very roughly, while the error decreases statistically with the number of interactions. Formally, the trust of node r in node k after n interactions is modeled by Tn(r, k) = t(k) + ρn. In this formula, t(k) is the true global but hidden trust value of k and ρn is a random value which falsifies t(k). With further interactions the possible range of ρn decreases, i.e., |ρn| > |ρn+1|. This random value simulates the inability to estimate trustworthiness precisely. In the simulation, the nodes send dummy application messages to random nodes. These are used to piggyback the information necessary for self-optimization as described above. After the reception of such a message, the receiving node determines the trust of the sender using one of the trust algorithms; then it is decided whether a service is transferred or not. Initially, each node obtains a random number of services. Resource consumption and priority of a service are chosen randomly, while the sum of all service weights is not allowed to exceed a node's capacity. The proposed trustworthy self-optimization is used for load-balancing and additionally tries to assign services with high priority to highly trusted nodes. Rating functions are used to evaluate the fitness of a network configuration concerning trust and equal load-sharing. The main idea of the rating function for trusted service distribution fT is to reward services with high priority and resource consumption running on very trustworthy nodes:

fT = Σ_{n ∈ N} t(n) · Σ_{s ∈ S(n)} c(s) · p(s).
N is the set of all nodes, S(n) is the set of services of a node n, t(n) is its true trust value, and c(s) and p(s) are the resource consumption and priority of a service s.
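A minimal sketch of computing fT under these definitions (the data layout is illustrative):

```python
def f_T(network):
    """fT = sum over nodes n of t(n) * sum over services s of c(s)*p(s).
    `network` maps a node id to (t(n), [(c(s), p(s)), ...])."""
    return sum(t_n * sum(c * p for c, p in services)
               for t_n, services in network.values())

net = {
    "n1": (0.9, [(0.5, 1.0), (0.2, 0.1)]),   # high-priority work on a trusted node
    "n2": (0.2, [(0.5, 1.0)]),               # the same work on an untrusted node scores less
}
print(round(f_T(net), 3))  # 0.568
```

Note how the same high-priority service contributes 0.45 on the trusted node n1 but only 0.1 on the untrusted node n2, which is exactly the reward structure described above.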
The rating function for load-sharing fL compares the current service distribution with the theoretical global optimum. For each simulation the network consists of 100 nodes. At the beginning, the network is rated by fT and fL; then the simulation is started, and after each step the network is rated again. Within one step, 100 application messages are sent randomly, which means that some nodes may send more than one message and others none, reflecting the asynchronous character of the distributed system. The receipt of an application message may result in a service transfer, depending on the trust algorithm used and the node's load. Additionally, it is measured how many services are transferred. Each evaluation scenario has been tested 250 times with randomly generated networks, and the values have been averaged. Figure 5 shows the gains in trustworthiness of the service distribution with regard to the rating function fT. Distant Neighborhood reached the best results, followed by Forwarder and Near Neighborhood; however, Forwarder introduces no additional messages for trust value distribution. Without consideration of trust in the self-optimization, services are transferred to random nodes and the overall trust is not improved, but even declines slightly.

Fig. 5. Trustworthy service distribution (fT)

Figure 6 shows the network's load-sharing with regard to the function fL. Compared to the initial average load distribution of 75% of the theoretical optimum, the self-optimization combined with any trust algorithm improves the load balance. This means that the consideration of trust does not prevent the self-optimization from balancing the load within the system. However, the quality of pure load-sharing cannot be reached by any trustworthy algorithm.

Fig. 6. Load-sharing (fL)

Distant Neighborhood performs best in the conducted measurements. It improves the trustworthiness of the service distribution by about 20% while causing a deterioration of the load-sharing by about 4% (compared to load-sharing with no consideration of trust). This is considered a beneficial trade-off: a slightly better load-sharing would have little impact on the whole system, whereas running important services on unreliable or malicious nodes may result in very poor system performance. Forwarder avoids overhead due to queries, as trust values are appended to application messages. This algorithm improves the trustworthy service distribution by about 13% and decreases load-sharing by about 5%. Near Neighborhood shows results similar to Forwarder, but its explicit trust queries cause a higher overhead. However, due to its working principle, it might show better results with a longer experiment duration.
7 Conclusion
This paper presents an approach for a trust management layer and shows how information provided by this layer can be used to improve a self-optimization algorithm. Three trust management algorithms have been introduced. With Forwarder, direct neighbors automatically append their directly observed trust values to each application message. Distant Neighborhood explicitly asks all direct neighbors of a node for a trust value. The last trust algorithm distributes trust values within the network for faster gathering of trust information, especially after a longer runtime. The trustworthy self-optimization does not only consider pure load-balancing but also aims to transfer services only to nodes regarded as sufficiently trustworthy. A feature of the self-optimization is that it does not use explicitly sent messages but appends information in the form of hormones to application messages in order to minimize overhead. Three different transfer strategies have been proposed which determine, with regard to load and trust, whether a service is transferred to another node or not.
The presented techniques have been evaluated. The results show that trust aspects can be integrated into the system with few restrictions regarding load-balancing. The proposed trust mechanisms describe a way to increase the robustness of self-* systems with cooperating nodes.
References

1. Aberer, K.: P-Grid: A self-organizing access structure for P2P information systems. In: Batini, C., Giunchiglia, F., Giorgini, P., Mecella, M. (eds.) CoopIS 2001. LNCS, vol. 2172, p. 179. Springer, Heidelberg (2001)
2. Aberer, K., Despotovic, Z.: Managing trust in a peer-2-peer information system. In: CIKM, pp. 310–317. ACM, New York (2001)
3. Allrutz, R., Cap, C., Eilers, S., Fey, D., Haase, H., Hochberger, C., Karl, W., Kolpatzik, B., Krebs, J., Langhammer, F., Lukowicz, P., Maehle, E., Maas, J., Müller-Schloer, C., Riedl, R., Schallenberger, B., Schanz, V., Schmeck, H., Schmid, D., Schröder-Preikschat, W., Ungerer, T., Veiser, H.-O., Wolf, L.: Organic Computing - Computer- und Systemarchitektur im Jahr 2010 (in German). VDE/ITG/GI position paper (2003)
4. Cornelli, F., Damiani, E., di Vimercati, S.D.C., Paraboschi, S., Samarati, P.: Choosing reputable servents in a P2P network. In: WWW, pp. 376–386 (2002)
5. Horn, P.: Autonomic computing: IBM's perspective on the state of information technology (2001)
6. Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H.: EigenRep: Reputation management in P2P networks. In: Proceedings of the 12th International World Wide Web Conference, WWW 2003 (2003)
7. Repantis, T., Kalogeraki, V.: Decentralized trust management for ad-hoc peer-to-peer networks. In: Terzis, S. (ed.) MPAC. ACM International Conference Proceeding Series, vol. 182, p. 6. ACM, New York (2006)
8. Singh, A., Liu, L.: TrustMe: Anonymous management of trust relationships in decentralized P2P systems. In: Shahmehri, N., Graham, R.L., Caronni, G. (eds.) Peer-to-Peer Computing, pp. 142–149. IEEE Computer Society, Los Alamitos (2003)
9. Theodorakopoulos, G., Baras, J.S.: Trust evaluation in ad-hoc networks. In: WiSe 2004: Proceedings of the 3rd ACM Workshop on Wireless Security, pp. 1–10. ACM, New York (2004)
10. Trumler, W.: Organic Ubiquitous Middleware. PhD thesis, Universität Augsburg (July 2006)
11. Trumler, W., Bagci, F., Petzold, J., Ungerer, T.: Smart Doorplate. In: First International Conference on Appliance Design (1AD), Bristol, GB, May 2003, pp. 24–28 (2003)
12. Trumler, W., Pietzowski, A., Satzger, B., Ungerer, T.: Adaptive self-optimization in distributed dynamic environments. In: Di Marzo Serugendo, G., Martin-Flatin, J.-P., Jélasity, M., Zambonelli, F. (eds.) First IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO 2007), Cambridge, Boston, Massachusetts, pp. 320–323. IEEE Computer Society, Los Alamitos (2007)
13. Trumler, W., Thiemann, T., Ungerer, T.: An artificial hormone system for self-organization of networked nodes. In: Pan, Y., Rammig, F.J., Schmeck, H., Solar, M. (eds.) Biologically Inspired Cooperative Computing, Santiago de Chile, pp. 85–94. Springer, Heidelberg (2006)
14. Weiser, M.: The computer for the 21st century (1995)
An Experimental Framework for the Analysis and Validation of Software Clocks Andrea Bondavalli1, Francesco Brancati1, Andrea Ceccarelli1, and Lorenzo Falai2 1 University of Florence, Viale Morgagni 65, I-50134, Firenze, Italy {bondavalli,francesco.brancati,andrea.ceccarelli}@unifi.it 2 Resiltech S.r.l., Via Filippo Turati 2, 56025, Pontedera (Pisa), Italy [email protected]
Abstract. The experimental evaluation of software clocks requires the availability of a high-quality clock to be used as reference time, and particular care to be able to immediately compare the value provided by the software clock with the reference time. This paper focuses i) on the definition of a proper evaluation process and consequent methodology, and ii) on the assessment of both the measuring system and the results. These aspects of experimental evaluation activities are mandatory in order to obtain valid results and reproducible experiments, including the comparison of possible different realizations or prototypes. As a case study to demonstrate the framework, we describe the experimental evaluation performed on a basic prototype of the Reliable and Self-Aware Clock (R&SAClock), a recently proposed software clock for resilient time information that provides both the current time and the current synchronization uncertainty (a conservative and self-adaptive estimation of the distance from an external reference time). Keywords: experimental framework and methodology, assessment and measurements, software clocks, R&SAClock, NTP.
paper, the evaluation process is carefully planned, and the validity of the measuring system and of the results is investigated and assessed through principles of measurement theory. As a further benefit, the experimental set-up and the whole evaluation process can be easily adapted and reused for the evaluation of different types of software clocks and for comparisons of possible different implementations or prototypes within the same category. The paper illustrates the experimental process and set-up by showing the evaluation of a prototype of the Reliable and Self-Aware Clock (R&SAClock [5]), a recently proposed software clock. R&SAClock exploits services and data provided by any chosen synchronization mechanism (for external clock synchronization) to provide both the current time and the current synchronization uncertainty (an adaptive and conservative estimation of the distance of the local clock from the reference time). The rest of this paper is organized as follows. In Section 2 we introduce our case study: the R&SAClock prototype that will be analyzed. Section 3 describes our experimental process and the measuring sub-system. Section 4 presents the results obtained by the planned experiments and their analysis. Conclusions are in Section 5.
2 The Reliable and Self-aware Clock

2.1 Basic Notions of Time and Clocks

Let us consider a distributed system composed of a set of nodes. We define reference time as the unique time view shared by the nodes of the system, reference clock as the clock that always holds the reference time, and reference node as the node that owns the reference clock. Given a local clock c and any time instant t, we define c(t) as the time value read by local clock c at time t. The behavior of the local clock is characterized by the quantities offset, accuracy and drift. The offset Θc(t) = t − c(t) is the actual distance of local clock c from reference time at time t [9]. This distance may vary through time. Accuracy Ac is an upper bound of the offset [10] and is often adopted in the definition of system requirements and therefore targeted by clock synchronization mechanisms. Drift ρc(t) describes the rate of deviation of a local clock c at time instant t from the reference time [10]. Unfortunately, accuracy and offset are usually of little practical use for systems: accuracy is usually a large value that is not a representative estimation of the current distance from reference time, and offset is difficult to measure exactly at any time t. Synchronization mechanisms typically compute an estimated offset Θ̃c(t) (and an estimated drift ρ̃c(t)), without offering guarantees and only at synchronization instants. Instead of the static notion of accuracy, a dynamic conservative estimation of the offset provides more useful information. For this purpose, the notion of uncertainty as used in metrology [4], [3] can provide such a useful estimation: we define the synchronization uncertainty Uc(t) as an adaptive and conservative evaluation of the offset Θc(t) at any time t; that is, Ac ≥ Uc(t) ≥ |Θc(t)| ≥ 0 [5].
Finally, we define the root delay RDc(t) as the transmission delay (one-way or round trip, depending on the synchronization mechanism), including all system-related delays, from the node that holds local clock c to the reference node [9].
2.2 Basic Specifications of the R&SAClock

R&SAClock is a new software clock for external clock synchronization (a unique reference time is used as target of the synchronization) that provides to users (e.g., system processes) both the time value and the synchronization uncertainty associated with that time value [5]. R&SAClock is not a synchronization mechanism; rather, it acts as a new software clock that exploits services and data provided by any chosen synchronization mechanism (e.g., [9], [11]). When a user asks R&SAClock for the current time (by invoking the function getTime), R&SAClock provides an enriched time value [likelyTime, minTime, maxTime, FLAG]. LikelyTime is the time value obtained by reading the local clock, i.e., likelyTime = c(t). MinTime and maxTime are computed using the synchronization uncertainty provided by the internal mechanisms of R&SAClock. More specifically, for a clock c at any time instant t, we extend the notion of synchronization uncertainty Uc(t) by distinguishing between a right synchronization uncertainty (positive) Ucr(t) and a left synchronization uncertainty (negative) Ucl(t), such that Uc(t) = max[Ucr(t); −Ucl(t)]. The values minTime and maxTime are respectively a left and a right bound of the reasonable values that can be attributed to the actual time: minTime is set to c(t) + Ucl(t) and maxTime is set to c(t) + Ucr(t). The user that exploits the R&SAClock can impose an accuracy requirement, that is, the largest synchronization uncertainty that the user can accept to work correctly. Correspondingly, R&SAClock sets its output FLAG, a Boolean value that indicates whether the current synchronization uncertainty is within the accuracy requirement or not. The main core of R&SAClock is the Uncertainty Evaluation Algorithm (UEA), which equips the R&SAClock with the ability to compute the synchronization uncertainty.
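The construction of the enriched time value can be sketched as follows (function and parameter names are ours, not the prototype's API; the numeric values are made up):

```python
import math

def get_time(c_t, u_left, u_right, accuracy_req):
    """Sketch of the enriched time value [likelyTime, minTime, maxTime, FLAG].
    c_t: local clock reading; u_left <= 0 and u_right >= 0 are the left and right
    synchronization uncertainties; FLAG reports whether the overall uncertainty
    max(u_right, -u_left) meets the user's accuracy requirement."""
    likely_time = c_t
    min_time = c_t + u_left          # left bound on the actual time
    max_time = c_t + u_right         # right bound on the actual time
    flag = max(u_right, -u_left) <= accuracy_req
    return likely_time, min_time, max_time, flag

# 4 ms uncertainty on each side, 10 ms accuracy requirement -> FLAG is True.
likely, lo, hi, ok = get_time(100.000, -0.004, 0.004, accuracy_req=0.010)
assert math.isclose(lo, 99.996) and math.isclose(hi, 100.004) and ok
```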
Different implementations of the UEA may lead to different versions of R&SAClock (for example, in [5] and [12] two different versions are shown). Besides the R&SAClock specification shown, we identify the following two non-functional requirements:

REQ1. The service response time provided by R&SAClock is bounded: there exists a maximum reply time ∆RT from a getTime request made by a user to the delivery of the enriched time value (the probability that the getTime is not provided within ∆RT is negligible).

REQ2. For any minTime and maxTime in any enriched time value generated at time t, it must hold that minTime ≤ t ≤ maxTime with a coverage ∆CV (by coverage we mean the probability that this equation is true).

2.3 R&SAClock Prototype and Target System

Here we describe the prototype of the R&SAClock and the system in which it executes as the target system used for the subsequent experimental evaluations. The R&SAClock prototype works with Linux and with the NTP synchronization mechanism. The UEA implemented in this prototype computes a symmetric left and right synchronization uncertainty with respect to likelyTime, i.e., −Ucl(t) = Ucr(t) and Uc(t) = Ucr(t) [5]. Using functionalities of both NTP and Linux, the UEA gets i) c(t) by querying the local clock and ii) the root delay RDc(t) and the estimated offset Θ̃c(t) by
monitoring the NTP log file (NTP refreshes the root delay and estimated offset when it performs a synchronization). The behavior of the UEA is as follows. First, the UEA reads from a configuration file an upper bound δc on the clock drift, fixed to 50 parts per million (ppm) in the experiments, and listens on the NTP log file. When NTP updates the log file, the UEA reads the estimated offset and the root delay and starts a counter called TSLU, which represents the Time elapsed Since the Last (most recent) Update of root delay and estimated offset. Given t, the most recent time instant at which root delay and estimated offset have been updated, at any time t1 ≥ t the synchronization uncertainty Uc(t1) is computed as:
Uc(t1) = |Θ̃c(t)| + RDc(t) + (δc · TSLU).   (1)
The basic idea of (1) is that, given Uc(t) = |Θ̃c(t)| + RDc(t) ≥ |Θc(t)| at time t, we have Uc(t1) ≥ |Θc(t1)| at t1 ≥ t (a detailed discussion is in [5]). The target system is depicted in Fig. 1. The hardware components are a Toshiba Satellite laptop, which we call PC_R&SAC, and the NTP servers connected to PC_R&SAC through a high-speed Internet connection. The software components are the R&SAClock prototype, the NTP client (a daemon process) and the software local clock of PC_R&SAC. The NTP client synchronizes the local clock using information from the NTP servers.
Fig. 1. The target system: R&SAClock for NTP and Linux
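Equation (1) can be illustrated with a toy calculation (not the prototype's code; the input values are made up):

```python
def uea_uncertainty(est_offset, root_delay, tslu, drift_bound=50e-6):
    """Equation (1): Uc(t1) = |estimated offset| + root delay + drift_bound * TSLU,
    with drift_bound = delta_c (50 ppm in the experiments) and TSLU the time in
    seconds elapsed since NTP last updated the offset and root delay."""
    return abs(est_offset) + root_delay + drift_bound * tslu

# 2 ms estimated offset, 10 ms root delay, 60 s since the last NTP update:
u = uea_uncertainty(est_offset=-0.002, root_delay=0.010, tslu=60.0)
print(f"{u * 1000:.1f} ms")  # 15.0 ms
```

The uncertainty grows linearly between NTP updates (the δc · TSLU term) and snaps back down whenever NTP refreshes offset and root delay, which is exactly the sawtooth behavior the text describes.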
3 The Experimental Evaluation Process and Methodology

The process of our experimental evaluation starts by identifying the goals, then designing the experimental set-up (composed of injectors, probes and the experiment control subsystems), then planning the experiments to conduct, and finally defining the structure and organization of the data related to the experiments.

3.1 Objective

The objective of our analysis is in this case to validate an R&SAClock prototype, verifying whether and to what extent it fulfills its requirements under varying operating conditions, both nominal and faulty. We aim to assign values to ∆RT and ∆CV.
3.2 Planning of the Experimental Set Up

The experimental set-up is described by the grey components of Fig. 2. Its hardware components are an HP Pavilion desktop, which we call PC_GPS, and a high-quality GPS (Global Positioning System [8]) receiver. Through such a receiver, the PC_GPS is able to construct a clock tightly synchronized to the reference time, which is used as the reference clock. Obviously such a reference clock does not hold the exact reference time, but it is orders of magnitude closer to the reference time than the clock of the target system: it is sufficiently accurate to suit our experiments. The PC_GPS contains a software component for the control of the experiment (Controller hereafter) that is composed of an actuator, which commands the execution of workload and faultload to the Client (e.g., requests for the enriched time value, which the Client will forward to the R&SAClock), and of a monitor, which receives from the Client the information on the completion of services, accesses the reference clock and writes data on the disk. The Client is a software component located on the target system: it performs injection functions, to inject the faultload and to generate the workload, and probe functions, to collect the relevant quantities and write these data on the disk. An experimental set-up in which the target system and the Controller are placed on the same PC would require two software clocks, thus introducing perturbations that are hard to address.
Fig. 2. Measuring system and target system
Given the description of the target system (an implementation of an R&SAClock) and of the experimental set-up, we now describe how the relevant measures can be collected. To verify requirements REQ1 and REQ2, our measuring system implements solutions which are specific to R&SAClock but general for any instance of it. To evaluate REQ1, the Client logs, for each request for the enriched time value, the time Client.start at which the Client queries the R&SAClock and the time Client.end at which it receives the answer from R&SAClock (these two values are collected by reading the local clock of PC_R&SAC). To verify REQ2 (see Fig. 3), the measuring system computes a time interval [Controller.start, Controller.end] that contains the actual time at which the enriched time value is generated. This time interval is collected by the Controller's monitor, which reads the reference clock. Controller.start is the time instant at which a request for the enriched time value is sent by the Controller to the Client, and Controller.end is the time instant at which the enriched time value is received by the Controller. REQ2 is satisfied if [Controller.start, Controller.end] is within [minTime, maxTime].
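The two checks can be sketched as simple predicates (variable names follow the text; the numeric values are made up):

```python
def check_req1(client_start, client_end, max_reply):
    """REQ1: the getTime reply must arrive within the bound (local-clock timestamps)."""
    return client_end - client_start <= max_reply

def check_req2(controller_start, controller_end, min_time, max_time):
    """REQ2: the reference-time interval bracketing the generation of the
    enriched time value must lie within [minTime, maxTime]."""
    return min_time <= controller_start and controller_end <= max_time

assert check_req1(5.000, 5.002, max_reply=0.010)
assert check_req2(99.998, 100.001, min_time=99.996, max_time=100.004)
assert not check_req2(99.998, 100.010, min_time=99.996, max_time=100.004)
```

Note that REQ2 is checked conservatively: the whole interval must fit inside [minTime, maxTime], since the exact generation instant is only known to lie somewhere within it.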
Fig. 3. The time interval [Controller.start, Controller.end] used to evaluate REQ2
3.3 Instrumentation of the Experimental Set Up

PC_R&SAC is a Linux PC connected to (one or more) NTP servers by means of an Internet connection and to PC_GPS (another Linux-based PC) by means of an Ethernet crossover cable. The Controller and the Client are two high-priority processes that communicate using a socket. Fig. 4 shows their interactions to execute the workload. The Client waits for the Controller's commands. At periodic time intervals, the Controller sends a message containing a getTime request and an identifier ID to the Client, and logs ID and Controller.start. When the Client receives the message, it logs ID and Client.start and performs a getTime request to the R&SAClock. When the Client receives the enriched time value from R&SAClock, it logs the enriched time value, Client.end and ID, and sends a message containing an acknowledgment and ID to the Controller. The Controller receives the acknowledgment and logs ID and Controller.end. At the experiment termination, the log files created on the two machines are combined by pairing entries with the same ID. Controller and Client interact to execute the faultload as follows. The Controller sends to the Client the commands to inject the faults (e.g., to close the Internet connection or to kill the NTP client), and logs the related event; the Client executes the received command and logs the related event. Data logging is handled by NetLogger [6], a tool for logging data in distributed systems that guarantees negligible system perturbation.
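The pairing of the two log files by ID can be sketched as follows (the record layout is illustrative, not NetLogger's actual format):

```python
def pair_logs(controller_log, client_log):
    """Join the two log files on the request ID.
    Each record is a dict: {'id': ..., 'start': ..., 'end': ...}."""
    by_id = {rec["id"]: rec for rec in client_log}
    return [
        {"id": c["id"],
         "controller": (c["start"], c["end"]),                        # GPS reference clock
         "client": (by_id[c["id"]]["start"], by_id[c["id"]]["end"])}  # local clock
        for c in controller_log if c["id"] in by_id
    ]

controller = [{"id": 1, "start": 10.000, "end": 10.004}]
client = [{"id": 1, "start": 3.001, "end": 3.002}]
paired = pair_logs(controller, client)
print(paired[0]["client"])  # (3.001, 3.002)
```

Each paired record then feeds the REQ1 check (from the client timestamps) and the REQ2 check (from the controller timestamps and the logged enriched time value).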
Fig. 4. Controller, Client and R&SAClock interactions to execute the workload
3.4 Planning of the Experiments

Here the execution scenarios, the faultload, the workload and the experiments need to be defined. Our framework allows us to easily define and modify these aspects of the experimental planning, as desired or as needed by the objectives of the analysis. In this case of the basic R&SAClock prototype, with the objective of showing how our set-up works, the selection has been quite simple and not particularly demanding (for the target system) or rich (for obtaining a complete assessment).

Execution scenarios. Two execution scenarios are considered, corresponding to the two most important operating conditions of the R&SAClock: i) beginning of synchronization (the NTP client has an initial transient phase and starts to synchronize to the NTP servers, and the PC_R&SAC is connected to the network), and ii) nominal operating condition of the NTP client (this represents a steady-state phase: the NTP client is active and fully working, holding information on the local clock).

Faultload. Besides the situation with no faults, the following situations are considered: i) loss of connection between the NTP client and servers (thus making the NTP servers unreachable), and ii) failure of the NTP client (the Controller commands to shut down the NTP client). These are the most common scenarios that the R&SAClock is expected to face during its execution.

Workload. The workload selected is simply composed of getTime requests to the R&SAClock, sent once per second. This workload does not allow us to observe the behavior of the target system under overload or stress conditions, and must be replaced by a more demanding one if one wants to thoroughly evaluate the behavior of the target system.

Experiments.
Combining the scenarios, the faultload and the workload, we identify four significant experiments: i) beginning of synchronization, no faults injected; ii) nominal operating condition, no faults injected; iii) nominal operating condition, failure of the NTP client; iv) nominal operating condition, loss of connection. The duration of each experiment is 12 hours; the rationale (confirmed by the collected evidence) is that 12 hours are more than sufficient to observe all the relevant events in each experiment.

3.5 Structure of the Data

The structure and organization of the data related to the experiments is shown in Fig. 5; the data are organized using a star schema [7], an intuitive model composed of fact and dimension tables. Fact tables (such as table R&SAClock_Results in Fig. 5) contain an entry for each experiment run; each entry in turn contains the values observed for the relevant metrics and the values of the parameters of the experimental setup used in that specific run. Each dimension table refers to a specific characteristic of the experimental setup and contains the possible values for that feature (in Fig. 5, tables Scenario, Workload, Faultload, Experiment, Target_System).
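As a concrete illustration of the star schema described above, the following sketch builds a fact table referencing the five dimension tables of Fig. 5 with SQLite. The column names of the fact table (and the ampersand-free table name RSAClock_Results) are assumptions for the sake of a runnable example, not the actual schema used by the authors.

```python
import sqlite3

# In-memory database; the dimension tables mirror those named in Fig. 5.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

for dim in ("Scenario", "Workload", "Faultload", "Experiment", "Target_System"):
    cur.execute(f"CREATE TABLE {dim} (id INTEGER PRIMARY KEY, description TEXT)")

# Fact table: one row per experiment run, referencing every dimension and
# holding the observed metrics (the metric columns here are illustrative).
cur.execute("""
    CREATE TABLE RSAClock_Results (
        run_id INTEGER PRIMARY KEY,
        scenario_id INTEGER REFERENCES Scenario(id),
        workload_id INTEGER REFERENCES Workload(id),
        faultload_id INTEGER REFERENCES Faultload(id),
        experiment_id INTEGER REFERENCES Experiment(id),
        target_system_id INTEGER REFERENCES Target_System(id),
        max_offset_ms REAL,
        max_uncertainty_ms REAL
    )
""")
conn.commit()

tables = {r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(sorted(tables))
```

Queries that join the fact table to one dimension at a time then answer questions such as "how did the offset behave per faultload", which is the kind of reasoning the star schema is meant to support.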
A. Bondavalli et al.
Fig. 5. Structure of the data related to the experiments, organized following a star schema
This model allows us to structure and highlight the objectives, the results and the key elements of each evaluation; consequently, it helps to reason on and keep clear the purposes and contexts of the analysis.
4 Analysis of Results

We subdivide the offline process of investigating the collected results into three activities: i) data staging on the collected raw data, to populate a database where data can be easily analyzed, ii) investigation of the validity of the results, and iii) presentation and discussion of the results.

4.1 Data Staging

We organize the data staging in three steps: log collection, log parsing and database loading. In log collection the events are merged into a unique log file using NetLogger's API (Application Programming Interface). In log parsing we use an AWK script (AWK is a programming language for processing text-based data) to parse the raw data and create CSV (Comma Separated Value: a standard data file format used for the storage of data structured in table form) files, which are easier to handle than the raw data. In database loading we create SQL (Structured Query Language) queries starting from the content of the CSV files to populate the database.

4.2 Quality of the Measuring System and Results

We assess the quality of the measuring system along the principles of experimental validation and fault injection [1], [14], and the confidence of the results through principles of measurement theory [3], [4]; we focus on the uncertainty of the results and the intrusiveness of the measuring system [3], [4]. Furthermore, repeatability [4] is discussed to identify to what extent the results can be replicated in other experiments.
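The three staging steps above (log collection, log parsing, database loading) can be sketched end to end in a few lines. The raw log format and field names below are assumptions in a NetLogger-like "key=value" style, not the actual NetLogger schema; the sketch plays the role of the AWK script (text to CSV) and the SQL loading step (CSV to database).

```python
import csv
import io
import sqlite3

# Hypothetical raw log lines, already merged into one file by log collection.
raw_log = """\
ts=1000.000 event=Client.start id=1
ts=1000.001 event=Client.end id=1
ts=1000.002 event=Controller.start id=1
ts=1000.004 event=Controller.end id=1
"""

# Log parsing (the role of the AWK script): raw text -> typed rows -> CSV.
rows = []
for line in raw_log.splitlines():
    fields = dict(pair.split("=", 1) for pair in line.split())
    rows.append((float(fields["ts"]), fields["event"], int(fields["id"])))

csv_buf = io.StringIO()
csv.writer(csv_buf).writerows(rows)

# Database loading: CSV content -> SQL inserts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts REAL, event TEXT, run_id INTEGER)")
for ts, event, run_id in csv.reader(io.StringIO(csv_buf.getvalue())):
    conn.execute("INSERT INTO events VALUES (?, ?, ?)",
                 (float(ts), event, int(run_id)))
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # → 4
```

Going through an intermediate CSV, as the authors do, keeps the parsing step independent of the database technology and leaves an easily inspectable artifact between the two steps.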
Intrusiveness (perturbation). Although the Client is a high-priority thread, its execution does not perturb the target system and does not affect the results. In fact, the only other relevant thread that, if delayed, could induce a change in the system behavior is the R&SAClock thread, responsible for the generation of the enriched time value; this thread has the highest priority in the target system.

Uncertainty. The actual time instant in which the enriched time value is computed lies within the interval [Controller.start, Controller.end], whose length is the length of the interval [Client.start, Client.end] plus the delay due to the communication between the Client and the Controller. The sampled duration of this interval is within 1.7 ms in 99% of cases. Analyzing the remaining 1% of executions, we discovered that they were affected by large communication delays; we therefore decided to discard these runs. We set the time instant in which the enriched time value is computed to the middle of the interval [Controller.start, Controller.end], that is, (Controller.end + Controller.start) / 2. Such a value is affected by an uncertainty of (Controller.end − Controller.start) / 2 = 0.85 ms (milliseconds) with confidence 1 [4]. Since the time instant in which the enriched time value is computed is the time instant in which the likelyTime is generated, we can attribute the same uncertainty to the likelyTime. As a consequence, the measured offset (difference between reference time and likelyTime) also suffers from the same uncertainty. In principle the resolution of our measurement system should also contribute to the final uncertainty; in our case, however, the Linux clock resolution is 1 µs (microsecond) and its contribution to the computation of uncertainty is negligible.

Repeatability. Re-executing the same experiment will almost certainly produce different data: repeatability in deterministic terms as defined in [4] is not achievable. However, re-executing the same set of experiments will lead to statistically compatible results and to the same conclusions.
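The timestamp-and-uncertainty attribution described above can be sketched as follows. The 1.7 ms discard threshold comes from the text; the timestamps used in the example calls are illustrative.

```python
# Midpoint timestamping with half-width uncertainty, as described above:
# runs whose [Controller.start, Controller.end] interval exceeds 1.7 ms
# are discarded because of large communication delays.

MAX_INTERVAL_S = 1.7e-3  # bound observed in 99% of the sampled runs

def attribute_timestamp(controller_start, controller_end):
    """Return (timestamp, uncertainty) in seconds, or None when the run
    must be discarded because the interval is too wide."""
    width = controller_end - controller_start
    if width > MAX_INTERVAL_S:
        return None
    midpoint = (controller_end + controller_start) / 2
    uncertainty = width / 2
    return midpoint, uncertainty

ok = attribute_timestamp(100.0000, 100.0017)   # width 1.7 ms: kept
bad = attribute_timestamp(100.0000, 100.0050)  # width 5 ms: discarded
print(ok, bad)
```

A 1.7 ms interval yields an uncertainty of 0.85 ms, matching the figure quoted in the text; the same uncertainty is then attributed to likelyTime and to the measured offset.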
[Figure 6 plots the distance from reference time (seconds, y-axis from −0.06 to 0.06) against time (seconds, x-axis from 200 to 1000): the curves maxTime, likelyTime and minTime are shown around the GPS-provided reference time, with the uncertainty bounds Ucr(t) = Uc(t) and Ucl(t) = −Uc(t) growing at Gc = 50 ppm between synchronizations.]
Fig. 6. A sample trace of execution of the R&SAClock
4.3 Results

Fig. 6 explains how to read the results shown in Figs. 7-9. Reference time is on the x-axis. The central dashed line labeled likelyTime is the distance between likelyTime and reference time: it is the offset of the local clock of PC_R&SAC, which may vary during the execution of the experiments. The two external lines represent the distances of minTime and maxTime from reference time; these distances also vary during the execution. If the NTP client performs a synchronization to the NTP servers at time t, the synchronization uncertainty Uc(t) is reset to Θ. After the synchronization, the synchronization uncertainty grows steadily at a rate of 50 ppm until the next synchronization. The time interval between two adjacent synchronizations varies depending on the behavior of the NTP client.
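The sawtooth behavior of Uc(t) described above, reset at each synchronization and growing linearly at 50 ppm, can be sketched in a few lines. The base value Theta and the synchronization instants below are illustrative, not measured values.

```python
# Synchronization uncertainty as described in the text: Uc(t) is reset to a
# base value Theta at each NTP synchronization and then grows at 50 ppm
# until the next one. minTime/maxTime form the envelope around likelyTime.

DRIFT_PPM = 50e-6   # 50 ppm growth rate between synchronizations
THETA = 0.010       # assumed uncertainty right after a sync, in seconds

def uncertainty(t, sync_times):
    """Uc(t), given the (sorted) instants of past synchronizations."""
    last_sync = max(s for s in sync_times if s <= t)
    return THETA + DRIFT_PPM * (t - last_sync)

def interval(likely_time, t, sync_times):
    """The [minTime, maxTime] interval around likelyTime at time t."""
    u = uncertainty(t, sync_times)
    return likely_time - u, likely_time + u

u = uncertainty(300.0, [0.0, 200.0])  # 100 s after the last synchronization
print(u)  # → 0.015  (0.010 + 50e-6 * 100)
```

Frequent synchronizations keep Uc(t) small; rare ones let it grow, which is exactly the trade-off visible in the experiments below.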
[Figure 7 contains two panels plotting the distance from reference time (seconds) over 12 hours, each showing the maxTime, likelyTime and minTime curves: panel a) spans −0.25 to 0.25 s, panel b) spans −0.15 to 0.15 s.]
Fig. 7. a) Exp. 1: beginning of synchronization. b) Exp. 2: nominal operating condition.
Experiment 1. Fig. 7a shows the behavior of the R&SAClock prototype at the beginning of synchronization. The initial offset of the local clock of PC_R&SAC is 100.21 ms. At the beginning of the experiment, the NTP client performs frequent synchronizations to correct the local clock. After 8 hours, the offset is close to zero and consequently the NTP client performs less frequent synchronizations; this behavior affects Uc(t), which increases. Reference time is always within [minTime, maxTime].

Experiment 2. Fig. 7b shows the behavior of the R&SAClock prototype when the target system is in nominal operating condition and no faults are injected. The offset is close to zero and the local clock of PC_R&SAC is stable: the NTP client performs rare synchronization attempts. Reference time is always within [minTime, maxTime]. Uc(t) varies from 65.34 ms to 281.78 ms; the offset is at worst 4.22 ms.

Experiment 3. Fig. 8a shows the behavior of the R&SAClock prototype when the target system is in nominal operating condition and the NTP client has failed (the figure does not include the beginning of synchronization, about 8 hours). LikelyTime drifts from reference time, since the NTP client no longer disciplines the local clock: after 12 hours, the offset is close to 500 ms. Since the actual drift of the local clock is smaller than , the reference time is always within [minTime, maxTime].
[Figure 8 contains two panels plotting the distance from reference time (seconds, spanning roughly −3 to 2.5) over 12 hours, each showing the maxTime, likelyTime and minTime curves.]
Fig. 8. a) Exp. 3: failure of the NTP client. b) Exp. 4: unavailability of the NTP servers.
Experiment 4. Fig. 8b shows the behavior of the R&SAClock prototype when the target system is in nominal operating condition and the Internet connection is lost (the NTP client is unable to communicate with the NTP servers and consequently does not perform synchronizations). The NTP client disciplines the local clock using information from the most recent synchronization. After 12 hours the offset is 26.09 ms: the NTP client, thanks to stable environmental conditions, succeeds in keeping likelyTime relatively close to the reference time. Reference time is within [minTime, maxTime].

Assessment of REQ1 and REQ2. The time intervals [Client.start, Client.end] from all the samples collected in the experiments are shown in Fig. 9. The highest value is 1.630 ms, thus REQ1 is satisfied simply by setting ∆RT ≥ 1.630 ms. However, the multimodal distribution shows that the response time of a getTime varies significantly depending on the current system activity and possible overloads of system resources. This suggests the possibility of building a new, improved prototype with a reduced ∆RT and less variance in the interval [Client.start, Client.end] (e.g., implementing the R&SAClock within the OS layer and the getTime as an OS call).
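The REQ1 check described above reduces to comparing the configured bound ∆RT against the largest observed response-time sample. The sample durations and the ∆RT value below are illustrative, chosen only so that the worst case matches the 1.630 ms reported in the text.

```python
# REQ1 as described above: the bound ∆RT must cover the largest observed
# [Client.start, Client.end] interval. Sample durations are in milliseconds.
samples_ms = [0.8, 0.9, 1.1, 1.3, 1.630, 1.2]

worst_case_ms = max(samples_ms)
delta_rt_ms = 1.7  # an assumed configuration choice, >= 1.630 ms

req1_satisfied = delta_rt_ms >= worst_case_ms
print(worst_case_ms, req1_satisfied)  # → 1.63 True
```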
[Figure 9 is a histogram of the intervals [Client.start, Client.end]: number of samples (up to 10000) on the y-axis, duration in microseconds (700 to 1700) on the x-axis.]

Fig. 9. Intervals [Client.start, Client.end]
In the experiments shown, REQ2 is always satisfied (∆CV = 1). However, the interval [minTime, maxTime] is often very wide, even when the offset is continuously close to zero. The results of the experiments suggest that different (more efficient) UEAs, which predict the oscillator drift behavior using statistical information on past values, may be developed and used. A preliminary investigation is in [12].
6 Conclusions and Future Work

In this paper we described a process and setup for the experimental evaluation of software clocks. The main issues addressed have been the provision of a high-quality clock (resorting to a high-quality GPS) to be used as reference time in the experimental setup, and particular care in designing the measuring system, which has allowed us to assess the validity of both the measuring system and the results. Besides the design and planning of the experimental activities, the paper illustrates the experimental process and setup by showing the evaluation of a prototype of the R&SAClock [5], a recently proposed software clock. Even the simple experiments described allowed us to gain insight into the major deficiencies of the considered prototype and to identify directions for improvement.

Acknowledgments. This work has been partially supported by the European Community through the project IST-FP6-STREP-26979 (HIDENETS - HIghly DEpendable ip-based NETworks and Services).
References

1. Hsueh, M., Tsai, T.K., Iyer, R.K.: Fault Injection Techniques and Tools. Computer 30(4), 75-82 (1997)
2. Avizienis, A., Laprie, J., Randell, B., Landwehr, C.: Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Trans. on Dependable and Secure Computing 1(1), 11-33 (2004)
3. BIPM, IEC, IFCC, ISO, IUPAC, OIML: Guide to the Expression of Uncertainty in Measurement (2008)
4. BIPM, IEC, IFCC, ISO, IUPAC, OIML: ISO International Vocabulary of Basic and General Terms in Metrology (VIM), Third Edition (2008)
5. Bondavalli, A., Ceccarelli, A., Falai, L.: Assuring Resilient Time Synchronization. In: Proceedings of the 27th IEEE Symposium on Reliable Distributed Systems (SRDS), pp. 3-12. IEEE Computer Society, Washington (2008)
6. Gunter, D., Tierney, B.: NetLogger: a Toolkit for Distributed System Performance Tuning and Debugging. In: IFIP/IEEE Eighth International Symposium on Integrated Network Management, pp. 97-100 (2003)
7. Kimball, R., Ross, M., Thornthwaite, W.: The Data Warehouse Lifecycle Toolkit. J. Wiley & Sons, Inc., Chichester (2008)
8. Dana, P.H.: Global Positioning System (GPS) Time Dissemination for Real-Time Applications. Real-Time Systems 12(1), 9-40 (1997)
9. Mills, D.: Internet Time Synchronization: the Network Time Protocol. IEEE Trans. on Communications 39, 1482-1493 (1991)
10. Verissimo, P., Rodriguez, L.: Distributed Systems for System Architects. Kluwer Academic Publishers, Dordrecht (2001)
11. Cristian, F.: Probabilistic Clock Synchronization. Distributed Computing 3, 146-158 (1989)
12. Bondavalli, A., Brancati, B., Ceccarelli, A.: Safe Estimation of Time Uncertainty of Local Clocks. In: IEEE Symposium on Precision Clock Synchronization for Measurement, Control and Communication, ISPCS (to appear, 2009)
13. Bondavalli, A., Ceccarelli, A., Falai, L., Vadursi, M.: Foundations of Measurement Theory Applied to the Evaluation of Dependability Attributes. In: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 522-533 (2007)
14. Arlat, J., Aguera, M., Amat, L., Crouzet, Y., Fabre, J.-C., Laprie, J.-C., Martins, E., Powell, D.: Fault Injection for Dependability Validation: a Methodology and Some Applications. IEEE Trans. on Software Engineering 16(2), 166-182 (1990)
15. Veitch, D., Babu, S., Pàsztor, A.: Robust Synchronization of Software Clocks across the Internet. In: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, pp. 219-232 (2004)
Towards a Statistical Model of a Microprocessor's Throughput by Analyzing Pipeline Stalls

Uwe Brinkschulte, Daniel Lohn, and Mathias Pacher

Institut für Informatik, Johann Wolfgang Goethe-Universität Frankfurt, Germany
{brinks,lohn,pacher}@es.cs.uni-frankfurt.de
Abstract. In this paper we model a thread's throughput, the instructions per cycle rate (IPC rate), running on a general microprocessor as used in common embedded systems. Our model is not limited to a particular microprocessor, because our aim is to develop a general model that can be adapted to fit different microprocessor architectures. We include stalls caused by different pipeline obstacles such as data dependencies, branch misprediction, etc. These stalls involve latency clock cycles blocking the processor. We describe each kind of stall in detail and develop a statistical model for the throughput covering the entire processor pipeline.
1 Introduction
Nowadays, the development of embedded and ubiquitous systems is strongly advancing. We find microprocessors embedded and networked in all areas of life, e.g. in cell phones, cars, planes, and household aids. In many of these areas the microprocessors need special capabilities, e.g. guaranteeing execution time bounds for real-time applications like a control task on an autonomous guided vehicle. Therefore, we need models of the timing behavior of these microprocessors by which the execution time bounds can be computed.

In this paper we develop a statistical model for the IPC rate of a general-purpose multi-threaded microprocessor to predict timing behavior, thus improving the real-time capability. We consider both effects like data dependencies and processor speed-up techniques like branch and branch target prediction, or caches. The model is a transfer function computing the IPC rate. By analyzing this model we obtain bounds for the IPC rate which can be used to compute bounds for the execution time of user applications.

Another important use of a model like this is to control the IPC rate similar to [1,2,3]. Controlling the IPC rate in pipelined microprocessors is one of the long-term goals of our work: if we develop precise statistical models of the throughput, we are able to adjust the controller parameters in a very fine-grained way. In addition, we can compute estimations for the applications' time bounds, which is necessary for real-time systems.

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 82-90, 2009. © IFIP International Federation for Information Processing 2009
The paper is structured as follows: Section 2 presents related work and similar approaches. In Section 3 we discuss modern scalar and multi-threaded microprocessors, and in Section 4 we present our model, which is validated by an example. Section 5 concludes the paper and gives an outlook on future work.
2 State of the Art
Many approaches for Worst Case Execution Time (WCET) analysis are known. Most of them examine the semantics of the program code with respect to the pipeline used for execution, resulting in a cycle-accurate analysis of the program code. One example is the work in [4]: the authors examine the WCET of the Motorola ColdFire 5307 and study, e.g., cache interferences occurring during loop execution. In [5], the authors discuss WCET analysis for an out-of-order execution processor. They transform the WCET analysis problem by computing and examining the execution graph (whose nodes represent tuples consisting of an instruction identifier and a pipeline stage) of the program code to be executed. The authors of [6] consider WCET analysis for processors with branch prediction. They classify each control transfer instruction with respect to branch prediction and use a timing analyzer to estimate the WCET according to the instruction classification. A WCET analysis with respect to industrial requirements is discussed in [7]. A good review of existing WCET analysis techniques is given in [8]; the authors also present a generic framework for WCET analysis. In [9], the author proposes to split the cache into several independent caches to simplify the WCET analysis and obtain tighter upper bounds. The authors of [10] design a model of a single-issue in-order pipeline for static WCET analysis and consider time dependencies between instructions.

The papers presented above mostly provide cycle-accurate techniques for the WCET analysis of program code. This differs from our approach, as we use a probabilistic approach based on the pipeline structure. The characteristics of the program code are generalized by statistical values such as the probability of a misprediction. As a result, our model is not necessarily cycle-accurate, but we are able to use analytical techniques to examine the throughput behavior for individual programs as well as for program classes. Furthermore, as mentioned in Section 1, our long-term goal is to control the IPC rate. Using control theory as a basis, a model of a processor as proposed in this paper is necessary not only to analyze, but also to improve and guarantee real-time behavior by a closed control loop.
3 Pipeline Stalls in Microprocessors
Techniques like long pipelines, branch prediction and caches were developed to improve the average performance of modern super-scalar microprocessors, but the worst-case oriented real-time behavior suffers for various reasons like branch misprediction, cache misses, etc. Besides, there is only one set of registers in single-threaded processors, producing context switch costs of several clock cycles in case of a thread switch. On the application level, thread synchronization also introduces latency clock cycles if one thread has to wait for a synchronization event. This problem depends on the programming model and is not affected by the architecture of the microprocessor used for its execution.

Multi-threaded processors suffer from the same real-time problems as single-threaded processors. However, there are several differences which make them an interesting research platform for modeling the IPC rate: contrary to single-threaded processors, multi-threaded processors have mostly separated internal resources, like program counters, status and general-purpose registers, for each thread. This decreases the interdependencies between different threads caused by the processor hardware. The remaining thread interdependencies depend only on the programming model; e.g., the context switching time between different threads is eliminated. In addition, if a scheduling strategy like Guaranteed Percentage Scheduling (GP scheduling, see [11]) is used, the controller is able to control each thread in a fine-grained way. In GP scheduling, a requested number of clock cycles is assigned to each thread, and this assignment is guaranteed within a certain time period (e.g. 100 clock cycles). Fig. 1 gives an example of three threads with a GP rate of 30% for Thread A, 20% for Thread B, and 40% for Thread C, and a time period of 100 clock cycles. This means that thread A gets 30 clock cycles, thread B gets 20 clock cycles, and thread C gets 40 clock cycles within each 100 clock cycle period.
[Figure 1 depicts two consecutive 100-clock-cycle periods, each containing a slot of 30 clock cycles for Thread A (30%), 20 clock cycles for Thread B (20%), and 40 clock cycles for Thread C (40%).]
Fig. 1. Example for a GP schedule
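The GP schedule of Fig. 1 can be sketched as a small cycle-allocation function. The slot layout (one contiguous slot per thread per period, in request order) is an assumption matching the figure, not a requirement of GP scheduling itself.

```python
# Guaranteed Percentage scheduling sketch: within each period of 100 clock
# cycles, every thread receives its requested share of cycles, as in Fig. 1.

PERIOD = 100

def gp_schedule(quotas, periods=1):
    """quotas: dict thread -> requested percentage. Returns the list of
    (thread, cycles) slots executed over the given number of periods."""
    assert sum(quotas.values()) <= 100, "requested rates exceed the period"
    schedule = []
    for _ in range(periods):
        for thread, pct in quotas.items():
            schedule.append((thread, pct * PERIOD // 100))
    return schedule

slots = gp_schedule({"A": 30, "B": 20, "C": 40}, periods=2)
print(slots)
# Each period: A gets 30 cycles, B gets 20, C gets 40;
# the remaining 10 cycles per period are unassigned.
```

The guarantee is per period, which is what later allows the model to treat the Guaranteed Percentage rate G(n) of a thread as its ideal IPC rate.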
4 Modeling
In this section we present a statistical model of a microprocessor evolving the parameters influencing the throughput. Our first approach considers a processor with
one core and a simple scalar multi-threaded pipeline. Our analysis of throughput hazards starts at the lowest hardware level, the pipeline. Every instruction to be executed passes through the pipeline. Therefore, we have to consider the first pipeline stage, the instruction fetch unit; this stage exists in almost all modern microprocessors [11,12].

The instruction set of a processor can be divided into different classes: an instruction is either controlflow or data related. We can compute a probability for the occurrence of instructions from these classes. The probability of the occurrence of a controlflow class instruction in the interval n is denoted by pa(n), while pb(n) represents the probability of a data related instruction in the interval n. We assume the probabilities to be time dependent, because they may change with the currently executed program code.

First, we consider controlflow related instructions like unconditional and conditional branches. These instructions may lead to a lower IPC rate, caused by delay cycles in the pipeline. Therefore, it is necessary to identify them as early as possible and handle them appropriately¹. This is done with the help of a branch target buffer (BTB) [11]. The BTB contains the target addresses of unconditional branches, and some additional prediction bits for conditional branches to predict the branch direction. Whenever the target address of an instruction cannot be found in the BTB, the pipeline has to be stalled until the target address has been computed. Therefore, we model these delay cycles by a penalty Datarget, while patarget(n) is the probability that such a stall event occurs in the time interval n. If a conditional branch is fetched, the predictor may fail. In this case the pipeline has to be flushed and the instructions of the other branch direction have to be fed into the pipeline, mostly leading to a long pipeline stall. The actual number of delay cycles depends on the length of the pipeline [11].
We call pamp(n) the probability that a branch is mispredicted in the interval n and Damp the penalty in delay cycles for pipeline flushing.

Now, we consider data related instructions, because data dependencies also influence the IPC rate. There are three different kinds of dependencies [11]. The anti dependency or write-after-read hazard (WAR) is the easiest one, because it does not affect the execution in an in-order pipeline at all: as an example, assume an instruction writes to a register that was read earlier by another instruction; this does not influence the execution in any way. Output dependencies or write-after-write hazards (WAW) can be solved by register renaming, thus not affecting the IPC rate either. True dependencies or read-after-write hazards (RAW) are the worst kind of data dependencies. Their impact on the IPC rate can be reduced by hardware (forwarding techniques [12]) or by software (instruction reordering [12]). However, in several cases instructions have to wait in the reservation stations and several delay cycles have to be inserted into the pipeline until the dependency is solved. pbd denotes the statistical probability of a pipeline-stalling data dependency and Dbd denotes the average penalty in clock cycles.
¹ A modern microprocessor is able to detect controlflow related instructions already in the instruction fetch stage of its pipeline.
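The BTB behavior described above, a miss stalls the pipeline until the target is computed, a hit yields a target plus a direction prediction, can be sketched with a dictionary-backed buffer. The 2-bit saturating counter is one common realization of the "prediction bits" the text mentions; the structure is a simplification, not a model of any particular processor.

```python
# Illustrative branch target buffer: unconditional branches store a target,
# conditional branches additionally carry 2-bit prediction state.

class BTB:
    def __init__(self):
        self.entries = {}  # pc -> (target, 2-bit saturating counter)

    def lookup(self, pc):
        """Return (target, predicted_taken), or None on a miss
        (the pipeline stalls until the target is computed)."""
        entry = self.entries.get(pc)
        if entry is None:
            return None
        target, counter = entry
        return target, counter >= 2

    def update(self, pc, target, taken):
        """Record the resolved branch outcome and its target."""
        _, counter = self.entries.get(pc, (target, 1))
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.entries[pc] = (target, counter)

btb = BTB()
assert btb.lookup(0x40) is None      # miss: stall, compute target (Datarget)
btb.update(0x40, 0x80, taken=True)   # counter 1 -> 2: weakly taken
print(btb.lookup(0x40))              # target 0x80, predicted taken
```

A miss corresponds to the Datarget penalty in the model below; a hit with a wrong direction prediction corresponds to the Damp flush penalty.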
The following formula (1) computes the IPC rate I of a microprocessor, including the above mentioned pipeline obstacles, in the interval n:

    I(n) = G(n) / (1 + X(n))
    X(n) = pa(n) (patarget(n) Datarget + pamp(n) Damp) + pb(n) pbd(n) Dbd          (1)
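Formula (1) transcribes directly into code. The sketch below is a plain evaluation of the formula with illustrative inputs; in the ideal case (perfect prediction, no stalling dependencies) the penalty term vanishes and I(n) equals the Guaranteed Percentage rate.

```python
# Formula (1): the expected penalty X(n) lowers the IPC rate below the
# Guaranteed Percentage rate G(n). All arguments are per-interval values.

def ipc_rate(G, p_a, p_target, D_target, p_mp, D_mp, p_b, p_bd, D_bd):
    X = p_a * (p_target * D_target + p_mp * D_mp) + p_b * p_bd * D_bd
    return G / (1 + X)

# Perfect prediction and no data hazards: X = 0, so I(n) == G(n).
ideal = ipc_rate(G=0.5, p_a=0.2, p_target=0.0, D_target=2,
                 p_mp=0.0, D_mp=5, p_b=0.8, p_bd=0.0, D_bd=1)
print(ideal)  # → 0.5
```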
The IPC rate I(n) of the executed thread in the interval n is the Guaranteed Percentage rate G(n) divided by one plus a penalty term X(n), where X(n) is the expected value of all inserted penalty delay cycles. If we assume perfect branch prediction and no pipeline stalls caused by data dependencies, the probabilities for pipeline-stalling events would be zero, turning the whole term X(n) into zero. The resulting IPC rate would equal the Guaranteed Percentage rate, since no latency cycles occur. However, in case a data dependency could not be solved by forwarding, pbd(n) would not be zero and X(n) would contain the penalty for the data dependency; therefore, the IPC rate would suffer. Figure 2 shows the impact of those pipeline hazards on the IPC rate.

Fig. 2. Impact of pipeline hazards on the IPC rate

The next step is to consider the effects of caches on the IPC rate, ignoring any delay cycles from other pipeline stages. Since we have no out-of-order execution, every cache miss leads to a pipeline stall until the required data or instruction is available. The statistical probability of a cache miss occurring in the interval n is denoted by pc(n) and Dc is the average penalty in delay cycles. So the resulting formula is quite similar to formula (1):

    I(n) = GP(n) / (1 + Y(n))
    Y(n) = pc(n) Dc          (2)
Y(n) is the expected value of all delay cycles in the interval n, lowering the IPC rate I(n). Figure 3 shows the effects of cache misses on the IPC rate.

Fig. 3. Impact of cache misses on the IPC rate

Our final goal in this paper is to combine the pipeline hazard and the cache miss effects in one formula. As there is no dependency between cache misses and pipeline hazards, all the inserted delay cycles can simply be added, resulting in a final penalty of Z(n). Thus, we can bring together the effects of pipeline hazards and cache misses, leading to the following formula (3):

    I(n) = GP(n) / (1 + Z(n))
    Z(n) = X(n) + Y(n)
    X(n) = pa(n) (patarget(n) Datarget + pamp(n) Damp) + pb(n) pbd(n) Dbd          (3)
    Y(n) = pc(n) Dc

Figure 4 shows the corresponding IPC rate, taking into account all effects at the hardware level. Now, we show that formula (3) is an adequate model of a simple microprocessor. Therefore, we examine a short code fragment of ten instructions executed in the time interval i:

1. data instruction
2. controlflow instruction (jump target not known)
3. data instruction
4. data instruction (with dependency)
5. data instruction
Fig. 4. The final IPC rate including pipeline hazards and cache misses
6. data instruction (with dependency)
7. controlflow instruction
8. data instruction (cache miss)
9. data instruction
10. data instruction
We assume our microprocessor has a five-stage pipeline, runs two different threads, and grants a Guaranteed Percentage value of 0.5 to each of them. Furthermore, we assume a penalty of 2 clock cycles for an unknown branch target, 5 clock cycles for flushing the pipeline after a mispredicted branch, 1 clock cycle for an unresolved data dependency, and 30 clock cycles for a cache miss. Analyzing the code fragment produces the following probability values (patarget and pbd are conditional on a controlflow or data instruction, respectively: one of the two controlflow instructions has an unknown target, and two of the eight data instructions have a stalling dependency):

pa(i) = 0.2
patarget(i) = 0.5
pamp(i) = 0
pb(i) = 0.8
pbd(i) = 0.25
pc(i) = 0.1

Having these values, we are able to compute the IPC rate according to our model:

X(i) = 0.2 · (0.5 · 2 + 0) + 0.8 · 0.25 · 1 = 0.4
Y(i) = 0.1 · 30 = 3
Z(i) = 0.4 + 3 = 3.4
I(i) = 0.5 / (1 + 3.4) ≈ 0.114
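The computation above can be reproduced directly from formula (3). The probabilities come from the ten-instruction fragment and the penalties are the ones assumed in the text; nothing below is new data.

```python
# Worked example with formula (3): ten instructions, two threads at GP = 0.5,
# penalties of 2 (unknown target), 5 (flush), 1 (dependency), 30 (cache miss).

p_a, p_target, p_mp = 0.2, 0.5, 0.0  # 2/10 controlflow, 1 of 2 unknown target
p_b, p_bd = 0.8, 0.25                # 8/10 data, 2 of 8 stalling dependencies
p_c = 0.1                            # 1/10 cache misses
D_target, D_mp, D_bd, D_c = 2, 5, 1, 30
GP = 0.5

X = p_a * (p_target * D_target + p_mp * D_mp) + p_b * p_bd * D_bd
Y = p_c * D_c
Z = X + Y
I = GP / (1 + Z)
print(X, Y, Z, round(I, 3))  # → 0.4 3.0 3.4 0.114
```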
To verify the model, we examine what happens at pipeline level. At the beginning of interval i it takes five clock cycles to fill the pipeline. At the 6th clock cycle the first instruction is completed, and then the pipeline is stalled for two cycles to compute the branch target of instruction 2, so instruction 2 finishes at the 9th clock cycle. Instruction 3 is completed at the 10th clock cycle and instruction 4 at the 12th clock cycle, because the unresolved data dependency of instruction 4 leads to a pipeline stall of one cycle. At the 13th clock cycle instruction 5 is finished, and at the 15th and 16th clock cycles instructions 6 and 7, too. Because a cache miss happens during the execution of instruction 8, it finishes at the 47th clock cycle. The last two instructions finish at the 48th and 49th clock cycles. Since the thread has a GP value of 0.5, we have to double the execution time. This means the execution of the code fragment would take 98 clock cycles on the real processor, which is already very close to our model (within about 10%). If we neglect the first cycles needed to fill the pipeline, we even get exactly an IPC rate of I(i) = 10/88 ≈ 0.114. Since real programs consist of many instructions, the time for the first pipeline filling can easily be neglected, thus enabling our model to predict the correct value of the IPC rate.
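The cycle count in the verification above can be checked mechanically: after a 5-cycle fill, an in-order five-stage pipeline completes one instruction per cycle, plus the stated penalty whenever an instruction stalls. This is a counting sketch of the walk-through, not a pipeline simulator.

```python
# Cycle-count check of the verification above.
FILL = 5
penalties = {2: 2,    # unknown branch target
             4: 1,    # unresolved data dependency
             6: 1,    # unresolved data dependency
             8: 30}   # cache miss

cycle = FILL
completion = {}
for instr in range(1, 11):
    cycle += 1 + penalties.get(instr, 0)
    completion[instr] = cycle

print(completion[10])                 # → 49 cycles including pipeline fill
print(2 * completion[10])             # → 98 on the real processor at GP = 0.5
print(round(10 / (2 * (completion[10] - FILL)), 3))  # → 0.114, i.e. 10/88
```

The completion cycles 6, 9, 10, 12, 13, 15, 16, 47, 48, 49 match the text exactly, and neglecting the fill reproduces the model's IPC rate of 0.114.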
5 Conclusion and Future Work
In this paper we developed a statistical model of a simple multi-threaded microprocessor to compute the throughput of a thread. We started by considering the influence of hardware effects like pipeline hazards and cache misses on the IPC rate: first, we considered each hardware effect on its own, then we combined them all into a single formula, formula (3). We showed with the help of an example that our model adequately describes the IPC rate.

Future work will concern further improvements of the model, taking into account more advanced hardware techniques like multicore or out-of-order execution. As already mentioned above, our future work will not only concern computing the throughput of a thread, but also controlling and stabilizing it to a given IPC rate by closed control loops. Therefore, we want to develop a model with which we are able to identify the most important parameters for the IPC rate.
References

1. Brinkschulte, U., Pacher, M.: A Control Theory Approach to Improve the Real-Time Capability of Multi-Threaded Microprocessors. In: ISORC, pp. 399-404 (2008)
2. Pacher, M., Brinkschulte, U.: Implementing Control Algorithms Within a Multithreaded Java Microcontroller. In: Beigl, M., Lukowicz, P. (eds.) ARCS 2005. LNCS, vol. 3432, pp. 33-49. Springer, Heidelberg (2005)
3. Brinkschulte, U., Pacher, M.: Improving the Real-time Behaviour of a Multithreaded Java Microcontroller by Control Theory and Model Based Latency Prediction. In: WORDS 2005, Tenth IEEE International Workshop on Object-oriented Real-time Dependable Systems, Sedona, Arizona, USA (2005)
Joining a Distributed Shared Memory Computation in a Dynamic Distributed System

Roberto Baldoni1, Silvia Bonomi1, and Michel Raynal2

1 Università La Sapienza, Via Ariosto 25, I-00185 Roma, Italy
2 IRISA, Université de Rennes, Campus de Beaulieu, F-35042 Rennes, France
{baldoni,bonomi}@dis.uniroma1.it, [email protected]
Abstract. This paper is on the implementation of high-level communication abstractions in dynamic systems (i.e., systems where the entities can enter and leave arbitrarily). Two abstractions are investigated, namely the read/write register and the add/remove/get set data structure. The paper studies the join protocol that a process has to execute when it enters the system, in order to obtain a consistent copy of the (register or set) object despite the uncertainty created by the net effect of concurrency and dynamicity. It presents two join protocols, one for each abstraction, with provable guarantees.

Keywords: Churn, Dynamic system, Provable guarantee, Regular register, Set object, Synchronous system.
1 Introduction

Dynamic systems. The passage from statically structured distributed systems to unstructured ones is now a reality. Smart environments, P2P systems and networked systems are examples of modern systems where the application processes are not aware of the current system composition. Because they run on top of a dynamic distributed system, these applications have to accommodate constant changes of their membership (i.e., churn) as a natural ingredient of their life. As an extreme, an application can cease to run when no entity belongs to the membership, and can later have a membership formed by thousands of entities. Considering the family of state-based applications, the main issue consists in maintaining their state despite membership changes. This means that a newcomer has to obtain a valid state of the application before joining it (state transfer operation). This is a critical operation, as too high a churn may prevent the newcomer from obtaining such a valid state. The shorter the time taken by the join procedure to transfer a state, the higher the churn rate the join protocol is able to cope with.

Join protocol with provable guarantees. This paper studies the problem of joining a computation that implements a distributed shared memory on top of a message-passing dynamic distributed system. The memory we consider is made up of the noteworthy object abstractions that are the regular registers and the sets. For each of them, a notion of admissible value is defined. The aim of that notion is to give a precise meaning to the object value a process can obtain in the presence of concurrency and dynamicity.

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 91–102, 2009.
© IFIP International Federation for Information Processing 2009
92
R. Baldoni, S. Bonomi, and M. Raynal
The paper proposes two join protocols (one for each object type) that provide the newcomer with an admissible value. To that end, the paper considers an underlying synchronous system where, while processes can enter and leave the application, their number always remains constant. While the regular register object is a well-known shared memory abstraction introduced by Lamport [10], the notion of a set object in a distributed context is less familiar. The corresponding specification given in this paper extends the notion of weak set introduced by Delporte-Gallet and Fauconnier in [5].

Roadmap. The paper is made up of 5 sections. First, Section 2 introduces the register and set objects (high-level communication abstractions), and Section 3 presents the underlying computation model. Then, Sections 4 and 5 present two join protocols, each suited to a specific object.
2 Distributed Shared Memory Paradigm

A distributed shared memory is a programming abstraction, built on top of a message-passing system, that allows processes to communicate and exchange information by invoking operations that return or modify the content of shared objects, thereby hiding the complexity of the message exchanges needed to maintain them. One of the simplest shared objects that can be considered is a register. Such an object provides the processes with two operations, called read and write. Objects such as queues and stacks are more sophisticated. It is assumed that every process is sequential: it invokes a new operation on an object only after receiving an answer from its previous object invocation. Moreover, we assume a global clock that is not accessible to the processes. This clock can be seen as measuring the real time as perceived by an external observer that is not part of the system.

2.1 Base Definitions

An operation issued on a shared object is not instantaneous: it takes time. Hence, two operations executed by two different processes may overlap in time. Two events (denoted invocation and response) are associated with each operation. They occur at the beginning of the operation (invocation time) and at its end (return time). Given two operations op and op′ having invocation times tB(op) and tB(op′) and return times tE(op) and tE(op′), respectively, we say that op precedes op′ (op ≺ op′) iff tE(op) < tB(op′). If op does not precede op′ and op′ does not precede op, then they are concurrent (op || op′).

Definition 1 (Execution History). Let H be the set of all the operations issued on a shared object O. An execution history Ĥ = (H, ≺) is a partial order on H satisfying the relation ≺.

Definition 2 (Sub-history Ĥt of Ĥ at time t). Given an execution history Ĥ = (H, ≺) and a time t, the sub-history Ĥt = (Ht, ≺t) of Ĥ at time t is such that: (i) Ht ⊆ H, and (ii) ∀ op ∈ H such that tB(op) ≤ t, op ∈ Ht.
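The precedence and concurrency relations of Section 2.1 can be sketched in a few lines (a hypothetical illustration, not part of the paper's formalism; the class and function names are ours):

```python
# op precedes op' iff op returns before op' is invoked; otherwise,
# if neither precedes the other, the two operations are concurrent.
from dataclasses import dataclass

@dataclass
class Op:
    t_B: float  # invocation time
    t_E: float  # return time

def precedes(op, op2):
    return op.t_E < op2.t_B

def concurrent(op, op2):
    return not precedes(op, op2) and not precedes(op2, op)

w = Op(t_B=0.0, t_E=2.0)   # a write
r = Op(t_B=1.0, t_E=3.0)   # a read overlapping the write
print(precedes(w, r), concurrent(w, r))  # False True
```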
Joining a Distributed Shared Memory Computation
93
2.2 Regular Register: Definition

A register object R has two operations. The operation write(v) defines the new value v of the register, while the operation read() returns a value from the register. The semantics of a register is given by specifying which values are returned by its read operations. Without loss of generality, we consider that no two write operations write the same value. This paper considers a variant of the regular register abstraction as defined by Lamport [10]. In our case, a regular register can have any number of writers and any number of readers [13]. The writes appear as if they were executed sequentially, this sequence complying with their real-time occurrence order (i.e., if two writes w1 and w2 are concurrent they can appear in any order, but if w1 terminates before w2 starts, w1 has to appear as being executed before w2). As far as a read operation is concerned, we have the following. If no write operation is concurrent with a read operation, that read operation returns the last value written in the register. Otherwise, the read operation returns any value written by a concurrent write operation, or the last value of the register before these concurrent writes.

Definition 3 (Admissible value for a read() operation). Given a read() operation op, a value v is admissible for op if:
– ∃ write(v) : write(v) ≺ op ∨ write(v) || op, and
– ∄ write(v′) (with v′ ≠ v) : write(v) ≺ write(v′) ≺ op.

Definition 4 (Admissible value for a regular register at time t). Given an execution history Ĥ = (H, ≺) of a regular register R and a time t, let Ĥt = (Ht, ≺t) be the sub-history of Ĥ at time t. An admissible value at time t for R is any possible admissible value v for an instantaneous read operation op executed at time t.
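Definition 3 can be checked mechanically. The sketch below (hypothetical helper names; operations are reduced to their invocation/return times, and writes are assumed unique per value as in the text) tests whether a value is admissible for a read:

```python
# v is admissible for a read op if some write(v) precedes or is
# concurrent with op, and no write(v') is interposed between
# write(v) and op (second clause of Definition 3).
from dataclasses import dataclass

@dataclass
class Write:
    value: int
    t_B: float  # invocation time
    t_E: float  # return time

def admissible(v, read_t_B, read_t_E, writes):
    for w in (w for w in writes if w.value == v):
        w_precedes = w.t_E < read_t_B
        w_concurrent = not w_precedes and not (read_t_E < w.t_B)
        if not (w_precedes or w_concurrent):
            continue
        # check there is no write(v') with write(v) ≺ write(v') ≺ op
        if not any(w.t_E < w2.t_B and w2.t_E < read_t_B
                   for w2 in writes if w2.value != v):
            return True
    return False

writes = [Write(1, 0, 1), Write(2, 2, 3)]
print(admissible(2, 4, 5, writes))  # True: 2 is the last written value
print(admissible(1, 4, 5, writes))  # False: 1 was overwritten by write(2)
```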
The add(v) operation takes as input a parameter v and returns the value ok when it terminates. Its aim is to add the element v to S. Hence, if {x1, x2, . . . , xk} are the values belonging to S before the invocation of add(v), and if no remove(v) operation is executed concurrently, the value of the set will be {x1, x2, . . . , xk, v} after it terminates. The remove(v) operation takes as input a parameter v and returns the value ok. Its aim is to delete the element v from S if v belongs to S; otherwise the remove operation has no effect. The get() operation takes no input parameter. It returns a set containing the current content of S, without modifying the content of the object. In a concurrency-free context, every get() operation returns the current content of the set: the content of the set is well-defined when the operations occur sequentially. In order to state without ambiguity the value returned by a get() operation in a concurrency context, let us introduce the notion of admissible values for a get() operation op (i.e., Vad(op)) by defining two sets, denoted the sequential set (Vseq(op)) and the concurrent set (Vconc(op)).
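As a baseline, the sequential (concurrency-free) semantics of the set object can be sketched as follows (a hypothetical illustration; the distributed, churn-tolerant implementation is the subject of the rest of the paper):

```python
# Minimal sequential sketch of the set object's add/remove/get interface.
class SetObject:
    def __init__(self):
        self._s = set()

    def add(self, v):
        self._s.add(v)
        return 'ok'            # add(v) returns ok when it terminates

    def remove(self, v):
        self._s.discard(v)     # no effect if v does not belong to S
        return 'ok'

    def get(self):
        return set(self._s)    # a copy: get() does not modify S

s = SetObject()
s.add(1); s.add(2); s.remove(1); s.remove(7)
print(s.get())  # {2}
```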
Fig. 1. Vseq and Vconc in distinct executions: (a) Vseq = {1, 2}, Vconc = ∅; (b) Vseq = ∅, Vconc = ∅; (c) Vseq = ∅, Vconc = {1}
Definition 5 (Sequential set for a get() operation). Given a get() operation op executed on S, the set of sequential values for op is a set (denoted Vseq(op)) that contains all the values v such that:
1. ∃ add(v) : add(v) ≺ op, and
2. if remove(v) exists, then add(v) ≺ op ≺ remove(v).

As an example, let us consider Figure 1(a). The sequential set Vseq(op) = {1, 2} because there exist two operations adding the values 1 and 2, respectively, that terminate before the get() operation starts, and there is neither a remove(1) nor a remove(2) before get() terminates. Differently, Vseq(op) = ∅ in Figure 1(b).

Definition 6 (Concurrent set for a get() operation). Given a get() operation op executed on S, the set of concurrent values for the get() operation is a set (denoted Vconc(op)) that contains all the values v such that:
1. ∃ add(v) : add(v) || op, or
2. ∃ add(v), remove(v) : (add(v) ≺ op) ∧ (remove(v) || op), or
3. ∃ add(v), remove(v) : add(v) || remove(v) ∧ add(v) ≺ op ∧ remove(v) ≺ op.

When considering the execution of Figure 1(c), Vconc(op) = {1} due to item 1 of the previous definition.

Definition 7 (Admissible set for a get() operation). Given a get() operation op, a sequential set Vseq(op) and a concurrent set Vconc(op), a set Vad(op) is an admissible set of values for op if Vseq(op) ⊆ Vad(op) ∧ Vad(op) \ Vseq(op) ⊆ Vconc(op).

As an example, let us consider the executions depicted in Figure 1. For Figure 1(a) and Figure 1(b), there exists only one admissible set Vad for the get operation, namely Vad(op) = {1, 2} and Vad(op) = ∅, respectively. Differently, for Figure 1(c) there exist two different admissible sets for the get operation, namely Vad(op) = ∅ and Vad(op) = {1}. Note that, in the execution depicted in Figure 1(c), if there was another get() operation after the add() and remove() operations, these two get() operations could return different admissible sets. In order to take such a point into consideration, consistency criteria have to be defined.

Definition 8 (Admissible sets of values at time t). An admissible set of values at time t for S (denoted Vad(t)) is any possible admissible set Vad(op) for any get() operation op that would occur instantaneously at time t.

Fig. 2. Sub-history Ĥt at time t

As an example, consider the scenario depicted in Figure 2. The sub-history at time t is the partial order of all the operations started before t (i.e., the operations belonging to the set Ht are add(4) and get() executed by pi; get(), remove(4) and add(1) executed by pj; and add(3) and remove(3) executed by pk). The instantaneous get operation op is concurrent with add(1) executed by pj and remove(3) executed by pk. The sequential set Vseq(op) for op is ∅ because, for both add operations preceding op, there exists a remove that does not follow op, while the concurrent set Vconc(op) for op is {1, 3}. The possible admissible sets for op (and hence the possible admissible sets at time t) are then (i) ∅, (ii) {1}, (iii) {3} and (iv) {1, 3}.
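Definitions 5 and 6 can be turned into a small checker. The sketch below (hypothetical helper names; item 3 of Definition 6 and the case of repeated adds/removes of the same value are omitted for brevity) computes Vseq(op) and Vconc(op) from timed operations:

```python
# Operations are (kind, value, t_B, t_E) tuples; a ≺ b iff a returns
# before b is invoked, and a || b iff neither precedes the other.
def precedes(a, b):
    return a[3] < b[2]

def concurrent(a, b):
    return not precedes(a, b) and not precedes(b, a)

def seq_and_conc(ops, get_op):
    adds = [o for o in ops if o[0] == 'add']
    rems = {o[1]: o for o in ops if o[0] == 'remove'}  # one remove per value
    vseq, vconc = set(), set()
    for a in adds:
        v, r = a[1], rems.get(a[1])
        # Definition 5: add(v) ≺ op, and any remove(v) follows op
        if precedes(a, get_op) and (r is None or precedes(get_op, r)):
            vseq.add(v)
        # Definition 6, item 1: add(v) || op
        if concurrent(a, get_op):
            vconc.add(v)
        # Definition 6, item 2: add(v) ≺ op and remove(v) || op
        elif precedes(a, get_op) and r is not None and concurrent(r, get_op):
            vconc.add(v)
    return vseq, vconc

ops = [('add', 1, 0, 1), ('remove', 1, 2, 3), ('add', 2, 0, 1)]
get_op = ('get', None, 4, 5)
print(seq_and_conc(ops, get_op))   # ({2}, set())
```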
3 Joining a Computation in a Dynamic Distributed System

3.1 System Model

The distributed system is composed, at each time, of a fixed number (n) of processes that communicate by exchanging messages. Processes are uniquely identified by their indexes, and they may join and leave the system at any point in time. The system is synchronous in the following sense. The processing times of local computations are negligible with respect to communication delays, so they are assumed to be equal to 0. In contrast, messages take time to travel to their destination processes, but their transmission time is upper bounded. Moreover, we assume that processes can access a global clock (this is for ease of presentation; as we are in a synchronous system, such a global clock could be implemented by synchronized local clocks). We assume
that there exists an underlying protocol (implemented at the connectivity layer) that keeps processes connected.

3.2 The Problem

Given a shared object O (e.g., a register or a set), it is possible to associate with it, at each time t, a set of admissible values. Processes continuously join the system over time, and every process pi that enters the computation has no information about the current state of the object, with the consequence of being unable to perform any operation. Therefore, every process pi that wishes to enter the computation needs to retrieve an admissible value for the object O from the other processes. This problem is captured by adding a join() operation that has to be invoked by every joining process. This operation is implemented by a distributed protocol that builds an admissible value for the object.

3.3 Distributed Computation

A distributed computation is defined, at each time, by a subset of processes. A process p, belonging to the system, that wants to participate in the distributed computation has to execute the join() operation. Such an operation, invoked at some time t, is not instantaneous. But, from time t, the process p can receive and process messages sent by any other process that belongs to the system and participates in the computation. The processes participating in the computation implement a shared object. A process leaves the computation in an implicit way. When it does, it leaves the computation forever and no longer sends messages. (From a practical point of view, if a process wants to re-enter the system, it has to enter it as a new process, i.e., with a new name.) We assume that no process crashes during the computation (i.e., it does not crash from the time it joins the system until it leaves). In order to formalize the set of processes that participate actively in the computation, we give the following definition.

Definition 9.
A process is active from the time it returns from the join() operation until the time it leaves the system. A(t) denotes the set of processes that are active at time t, while A([t1, t2]) denotes the set of processes that are active during the interval [t1, t2].

3.4 Communication Primitives

Two communication primitives are used by processes belonging to the distributed computation: point-to-point and broadcast communication.

Point-to-point communication. This primitive allows a process pi to send a message to another process pj as soon as pi knows that pj has joined the computation. The network is reliable in the sense that it does not lose, create or modify messages. Moreover, the synchrony assumption guarantees that if pi invokes "send m to pj" at time t, then pj receives that message by time t + δ′ (if it has not left the system by that time). In that case, the message is said to be "sent" and "received".
Broadcast. Processes participating in the distributed computation are equipped with an appropriate broadcast communication sub-system that provides the processes with two operations, denoted broadcast() and deliver(). The former allows a process to send a message to all the processes in the distributed system, while the latter allows a process to deliver a message. Consequently, we say that such a message is "broadcast" and "delivered". These operations satisfy the following property.

– Timely delivery: Let t be the time at which a process p belonging to the computation invokes broadcast(m). There is a constant δ (δ ≥ δ′), known by the processes, such that if p does not leave the system by time t + δ, then all the processes that are in the system at time t and do not leave by time t + δ deliver m by time t + δ.

Such a pair of broadcast operations was first formalized in [8] in the context of systems where processes can commit crash failures. It has been extended to the context of dynamic systems in [7].

3.5 Churn Model

The phenomenon of continuous arrival and departure of nodes in the system is usually referred to as churn. In this paper, the churn of the system is modeled by means of the join distribution λ(t), the leave distribution µ(t) and the node distribution N(t) [3]. The join and leave distributions are discrete functions of time that return, for any time t, respectively the number of processes that have invoked the join operation at time t and the number of processes that have left the system at time t. The node distribution returns, for every time t, the number of processes inside the system. We assume n0 processes inside the system at the beginning, and we assume λ(t) = µ(t) = cn0 (where c ∈ [0, 1] is a percentage of the nodes of the system), meaning that at each time unit the number of processes that join the system is the same as the number of processes that leave, i.e., the number of processes inside the system N(t) is always equal to n0.
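As a toy check of this churn model (the helper name is ours), with λ(t) = µ(t) = c·n0 joins and leaves per time unit, the node distribution N(t) indeed stays constant at n0:

```python
# Step the system forward: at each time unit, c*n0 processes join and
# c*n0 processes leave, so N(t) never changes. (c*n0 is assumed to be
# an integer number of processes; int() truncates otherwise.)
def node_distribution(n0, c, steps):
    n, history = n0, []
    for _ in range(steps):
        n += int(c * n0)   # lambda(t): processes invoking join at time t
        n -= int(c * n0)   # mu(t): processes leaving at time t
        history.append(n)
    return history

print(node_distribution(100, 0.1, 5))  # [100, 100, 100, 100, 100]
```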
4 Joining a Register Computation

4.1 The Protocol

Local variables at a process pi. Each process pi has the following local variables.
– Two variables denoted registeri and sni; registeri contains the local copy of the regular register, while sni is the associated sequence number.
– A boolean activei, initialized to false, that is switched to true just after pi has joined the system.
– Two set variables, denoted repliesi and reply_toi, that are used during the period in which pi joins the system. The local variable repliesi contains the 3-uples <id, value, sn> that pi has received from other processes during its join period, while reply_toi contains the processes that are joining the system concurrently with pi (as far as pi knows).

The local variables of each process pk (of the n processes that compose the initial set of processes) are such that registerk contains the initial value of the regular register (say the value 0), snk = 0, activek = true, and repliesk = reply_tok = ∅.
operation join(i):
(01) registeri ← ⊥; sni ← −1; activei ← false; repliesi ← ∅; reply_toi ← ∅;
(02) wait(δ);
(03) if (registeri = ⊥) then
(04)   repliesi ← ∅; broadcast INQUIRY(i); wait(2δ);
(05)   let <id, val, sn> ∈ repliesi such that (∀ <−, −, sn′> ∈ repliesi : sn ≥ sn′);
(06)   if (sn > sni) then sni ← sn; registeri ← val end if
(07) end if;
(08) activei ← true;
(09) for each j ∈ reply_toi do send REPLY(<i, registeri, sni>) to pj end for;
(10) return(ok).
————————————————————————
(11) when INQUIRY(j) is delivered:
(12) if (activei) then send REPLY(<i, registeri, sni>) to pj
(13) else reply_toi ← reply_toi ∪ {j}
(14) end if.
(15) when REPLY(<j, value, sn>) is received: repliesi ← repliesi ∪ {<j, value, sn>}.

Fig. 3. The join() protocol for a register object in a synchronous system (code for pi)
The join() operation. When a process pi enters the system, it first invokes the join operation. The algorithm implementing that operation, described in Figure 3, involves all the processes that are currently present (be they active or not). The interested reader will find a proof in [3]. First, pi initializes its local variables (line 01) and waits for a period of δ time units (line 02); this waiting period is explained later. If registeri has not been updated during this waiting period (line 03), pi broadcasts (with the broadcast() operation) an INQUIRY(i) message to the processes that are in the system (line 04) and waits for 2δ time units, i.e., the maximum round-trip delay (line 04)¹. When this period terminates, pi updates its local variables registeri and sni to the most up-to-date values it has received (lines 05-06). Then pi becomes active (line 08), which means that it can answer the inquiries it has received from other processes, and does so if reply_toi ≠ ∅ (line 09). Finally, pi returns ok to indicate the end of the join() operation (line 10).

When a process pi receives a message INQUIRY(j), it answers pj by sending back a REPLY(<i, registeri, sni>) message containing its local variables if it is active (line 12). Otherwise, pi postpones its answer until it becomes active (line 13 and lines 08-09). Finally, when pi receives a message REPLY(<j, value, sn>) from a process pj, it adds the corresponding 3-uple to its set repliesi (line 15).
¹ The statement wait(2δ) can be replaced by wait(δ + δ′), which provides a more efficient join operation; δ is the upper bound for the dissemination of a message sent by the reliable broadcast, which is a one-to-many communication primitive, while δ′ is the upper bound for a response sent to a process whose id is known, using a one-to-one communication primitive. So, wait(δ) is related to the broadcast, while wait(δ′) is related to point-to-point communication. We use the wait(2δ) statement to simplify the presentation.
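The value-selection step at lines 05-06 of Figure 3 simply adopts the reply carrying the largest sequence number when it is fresher than the local copy; a sketch (hypothetical helper name, replies modeled as (id, value, sn) tuples):

```python
# Among all gathered REPLY triples, pick the one with the highest
# sequence number; keep the local copy if it is at least as fresh.
def select_freshest(replies, local_sn, local_val):
    if not replies:
        return local_sn, local_val
    pid, val, sn = max(replies, key=lambda r: r[2])  # ties: first such reply
    return (sn, val) if sn > local_sn else (local_sn, local_val)

replies = [(3, 'a', 4), (7, 'b', 9), (2, 'c', 9)]
print(select_freshest(replies, -1, None))  # (9, 'b')
```

Note that when several replies carry the same maximal sequence number they necessarily carry the same written value in the protocol, so the tie-breaking rule is irrelevant there.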
Fig. 4. Why wait(δ) is required: (a) without wait(δ); (b) with wait(δ)
Why the wait(δ) statement at line 02 of the join() operation? To motivate the wait(δ) statement at line 02, let us consider the execution of the join() operation depicted in Figure 4(a). At time τ, the processes pj, ph and pk are the three processes composing the system, and pj is the writer. Moreover, the process pi executes join() just after τ. The value of the copies of the regular register is 0 (square on the left of pj, ph and pk), while registeri = ⊥ (square on its left). The "timely delivery" property of the broadcast invoked by the writer pj ensures that ph and pk deliver the new value v = 1 by τ + δ. But, as it entered the system after τ, there is no such guarantee for pi. Hence, if pi does not execute the wait(δ) statement at line 02, its execution of lines 03-07 can provide it with the previous value of the regular register, namely 0. If, after obtaining 0, pi issues another read, it obtains again 0, while it should obtain the new value v = 1 (because 1 is the last value written and there is no write concurrent with this second read issued by pi). The execution depicted in Figure 4(b) shows that this incorrect scenario cannot occur if pi is forced to wait for δ time units before inquiring to obtain the last value of the regular register.
5 Joining a Set Computation

5.1 The Protocol

Local variables at process pi. Each process pi has the following local variables.
– Two variables denoted seti and sni; seti is a set variable and contains the local copy of the set, while sni is an integer variable that counts how many update operations have been executed by process pi on the local copy of the set.
– A FIFO set variable lastopsi used to maintain a history of the update operations executed by pi. This variable contains all the 3-uples <val, op_type, id>, each one characterizing an operation of type op_type ∈ {add, remove} of the value val issued by a process with identity id.
– A boolean activei, initialized to false, that is switched to true just after pi has joined the system.
– Three set variables, denoted repliesi, reply_toi and pendingi, that are used in the period during which pi joins the system. The local variable repliesi contains the
3-uples <set, sn, ops> that pi has received from other processes during its join period, while reply_toi contains the processes that are joining the system concurrently with pi (as far as pi knows). The set pendingi contains the 3-uples <val, op_type, id>, each one characterizing an update operation executed concurrently with the join.

Initially, n processes compose the system. The local variables of each of these processes pk are such that setk contains the initial value of the set (without loss of generality, we assume that, at the beginning, every process pk has nothing in its variable setk), snk = 0, activek = true, and pendingk = repliesk = reply_tok = ∅.

The join() operation. The algorithm implementing the join operation for a set object, described in Figure 5, involves all the processes that are currently present (be they active or not). First, pi initializes its local variables (line 01) and waits for a period of δ time units (line 02); the motivations for this waiting period are basically the same as those described for the regular register: it is needed to prevent pi from missing some updates. After this waiting period, pi broadcasts (with the broadcast() operation) an INQUIRY(i) message to the processes that are in the system and waits for 2δ time units, i.e., the maximum round-trip delay (line 02). When this period terminates, pi first updates its local variables seti, sni and lastopsi to the most up-to-date values it has received (lines 03-04) and then executes all the operations concurrent with the join that are contained in pendingi and not yet executed (lines 05-13). Then pi becomes active (line 14), which means that it can answer the inquiries it has received from other processes, and does so if reply_toi ≠ ∅ (line 15). Finally, pi returns ok to indicate the end of the join() operation (line 16).
When a process pi receives a message INQUIRY(j), it answers pj by sending back a REPLY(<seti, sni, lastopsi>) message containing its local variables if it is active (line 18). Otherwise, pi postpones its answer until it becomes active (line 19 and line 15). Finally, when pi receives a message REPLY(<set, sn, ops>) from a process pj, it adds the corresponding 3-uple to its set repliesi (line 21).

5.2 The add() and remove() Protocols

These protocols are trivially executed by sending an update message using the broadcast primitive (i.e., their execution time is bounded by δ). At the receipt of the update message, every process pi checks its state. If pi is active, it simply adds or removes the value from its local copy of the set. If pi is not active (i.e., it is still executing the join() protocol), it buffers the operation in the local set pendingi by adding the 3-uple <val, op_type, id>. Such a tuple is made up of (i) the value val to be updated, (ii) the type op_type of the operation (add or remove), and (iii) the id of the process that issued the update. Every operation in the set pendingi will then be executed by pi at the end of the join() protocol (lines 05-13 of Figure 5).

5.3 Correctness Proof

Due to page limitation, this section only states two lemmas and the main theorem. Their proofs can be found in [4].
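The replay of buffered concurrent operations (lines 05-13 of Figure 5) can be sketched as follows (hypothetical helper name; tuples mirror the <val, op_type, id> 3-uples, and lastops is modeled as a plain set):

```python
# Apply every pending operation not already recorded in lastops to the
# local copy of the set, bumping the sequence number for each one.
def replay_pending(local_set, sn, lastops, pending):
    for op in pending:                 # op = (val, op_type, id)
        if op in lastops:
            continue                   # already executed: skip it
        val, op_type, _ = op
        sn += 1
        lastops.add(op)
        if op_type == 'add':
            local_set.add(val)
        else:
            local_set.discard(val)     # remove has no effect if val is absent
    return local_set, sn

s, sn = replay_pending({1, 2}, 2, set(), [(3, 'add', 9), (1, 'remove', 5)])
print(sorted(s), sn)  # [2, 3] 4
```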
operation join(i):
(01) sni ← 0; lastopsi ← ∅; seti ← ∅; activei ← false; pendingi ← ∅; repliesi ← ∅; reply_toi ← ∅;
(02) wait(δ); broadcast INQUIRY(i); wait(2δ);
(03) let <set, sn, ls> ∈ repliesi such that (∀ <−, sn′, −> ∈ repliesi : sn ≥ sn′);
(04) seti ← set; sni ← sn; lastopsi ← ls;
(05) for each <val, op_type, id> ∈ pendingi do
(06)   if (<val, op_type, id> ∉ lastopsi) then
(07)     sni ← sni + 1;
(08)     lastopsi ← lastopsi ∪ {<val, op_type, id>};
(09)     if (op_type = add) then seti ← seti ∪ {val}
(10)     else seti ← seti \ {val}
(11)     end if
(12)   end if
(13) end for;
(14) activei ← true;
(15) for each j ∈ reply_toi do send REPLY(<seti, sni, lastopsi>) to pj end for;
(16) return(ok).
————————————————————————
(17) when INQUIRY(j) is delivered:
(18) if (activei) then send REPLY(<seti, sni, lastopsi>) to pj
(19) else reply_toi ← reply_toi ∪ {j}
(20) end if.
(21) when REPLY(<set, sn, ops>) is received: repliesi ← repliesi ∪ {<set, sn, ops>}.

Fig. 5. The join() protocol for a set object in a synchronous system (code for pi)
Lemma 1. Let c < 1/(3δ). ∀t : |A([t, t + 3δ])| ≥ n(1 − 3δc) > 0.

Lemma 2. Let t0 be the time at which the computation of a set object S starts, Ĥ = (H, ≺) an execution history of S, and Ĥt1+3δ = (Ht1+3δ, ≺) the sub-history of Ĥ at time t1 + 3δ. Let pi be a process that invokes join() on S at time t1 = t0 + 1. If c < 1/(3δ), then at time t1 + 3δ the local copy seti of S maintained by pi will be an admissible set at time t1 + 3δ.

Theorem 1. Let Ĥ = (H, ≺) be the execution history of a set object S, and pi a process that invokes join() on the set S at time t. If c < 1/(3δ), then at time t + 3δ the local copy seti of S maintained by pi will be an admissible set at time t + 3δ.
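A numeric sanity check of Lemma 1's bound (the helper name and the sample values are ours): as long as the churn rate satisfies c < 1/(3δ), at least n(1 − 3δc) processes remain active across any interval of length 3δ.

```python
# Lower bound on the number of processes active during [t, t + 3δ].
def active_lower_bound(n, c, delta):
    assert c < 1 / (3 * delta)     # the lemma's precondition on the churn
    return n * (1 - 3 * delta * c)

print(active_lower_bound(100, 0.125, 2))  # 25.0 processes stay active
```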
References

1. Aguilera, M.K.: A Pleasant Stroll Through the Land of Infinitely Many Creatures. ACM SIGACT News, Distributed Computing Column 35(2), 36–59 (2004)
2. Baldoni, R., Bonomi, S., Kermarrec, A.M., Raynal, M.: Implementing a Register in a Dynamic Distributed System. In: Proc. 29th IEEE Int'l Conference on Distributed Computing Systems (ICDCS 2009), pp. 639–647. IEEE Computer Society Press, Los Alamitos (2009)
3. Baldoni, R., Bonomi, S., Raynal, M.: Regular Register: an Implementation in a Churn Prone Environment. In: SIROCCO 2009. LNCS, vol. 5869. Springer, Heidelberg (2009)
4. Baldoni, R., Bonomi, S., Raynal, M.: Joining a Distributed Shared Memory Computation in a Dynamic Distributed System. Tech Report 5/09, MIDLAB, Università di Roma La Sapienza, Italy (July 2009), http://www.dis.uniroma1.it/˜midlab/publications
5. Delporte-Gallet, C., Fauconnier, H.: Two Consensus Algorithms with Atomic Registers and Failure Detector Ω. In: Garg, V., Wattenhofer, R., Kothapalli, K. (eds.) ICDCN 2009. LNCS, vol. 5408, pp. 251–262. Springer, Heidelberg (2009)
6. Dolev, S., Gilbert, S., Lynch, N., Shvartsman, A., Welch, J.: GeoQuorums: Implementing Atomic Memory in Mobile Ad Hoc Networks. In: Fich, F.E. (ed.) DISC 2003. LNCS, vol. 2848, pp. 306–320. Springer, Heidelberg (2003)
7. Friedman, R., Raynal, M., Travers, C.: Abstractions for Implementing Atomic Objects in Distributed Systems. In: Anderson, J.H., Prencipe, G., Wattenhofer, R. (eds.) OPODIS 2005. LNCS, vol. 3974, pp. 73–87. Springer, Heidelberg (2006)
8. Hadzilacos, V., Toueg, S.: Reliable Broadcast and Related Problems. In: Distributed Systems, pp. 97–145 (1993)
9. Ko, S., Hoque, I., Gupta, I.: Using Tractable and Realistic Churn Models to Analyze Quiescence Behavior of Distributed Protocols. In: Proc. 27th IEEE Int'l Symposium on Reliable Distributed Systems (SRDS 2008) (2008)
10. Lamport, L.: On Interprocess Communication, Part 1: Models, Part 2: Algorithms. Distributed Computing 1(2), 77–101 (1986)
11. Leonard, D., Yao, Z., Rai, V., Loguinov, D.: On Lifetime-Based Node Failure and Stochastic Resilience of Decentralized Peer-to-Peer Networks. IEEE/ACM Transactions on Networking 15(3), 644–656 (2007)
12. Merritt, M., Taubenfeld, G.: Computing with Infinitely Many Processes. In: Herlihy, M.P. (ed.) DISC 2000. LNCS, vol. 1914, pp. 164–178. Springer, Heidelberg (2000)
13. Shao, C., Pierce, E., Welch, J.: Multi-writer Consistency Conditions for Shared Memory Objects. In: Fich, F.E. (ed.) DISC 2003. LNCS, vol. 2848, pp. 106–120. Springer, Heidelberg (2003)
BSART (Broadcasting with Selected Acknowledgements and Repeat Transmissions) for Reliable and Low-Cost Broadcasting in the Mobile Ad-Hoc Network

Ingu Han*, Kee-Wook Rim, and Jung-Hyun Lee
Abstract. In this paper we propose an enhanced broadcasting method, BSART (Broadcasting with Selected Acknowledgements and Repeat Transmissions), which mitigates both the broadcast storm and ACK implosion in a mobile ad hoc network whose nodes carry switched-beam antenna elements enabling directional communication. To curb the broadcast storm we use the DPDP (Directional Partial Dominant Pruning) method. To control the ACK implosion that arises in ACK-based reliable transmission, each node checks, per antenna element, how many nodes must receive the message: if that number reaches a threshold, the node retransmits the message a fixed number of times without requesting ACKs, the number being chosen according to the transmission success probability of that antenna element (the R-method); otherwise the node verifies reception through ACKs on that element (the A-method). The combined R/A scheme bounds both the number of forwarding transmissions and the number of ACKs per antenna element. Simulations show that our method achieves a higher delivery ratio than existing schemes while reducing the number of broadcast messages and ACKs.
Keywords: selected broadcasting, mobile ad-hoc network, node selection.
1 Introduction

Because every node acts not only as a host but also as a router, broadcasting is indispensable in a wireless ad hoc network, for instance to locate a particular node's position or to discover whether a node exists. To control the broadcast storm problem, in which broadcasting generates an excessive number of duplicated messages, it is useful to let only a small subset of nodes forward a received message [1][2][3]. A CDS (connected dominating set) of the network can serve as the forward node set, but finding the lowest-cost CDS is an NP-complete problem.

* This research was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2009-C1090-0902-0020).

Among the various heuristics for building a CDS, one approach is source-independent broadcasting, which maintains a single CDS for the whole network; another is source-dependent broadcasting, which builds a CDS per broadcast source; a third mixes the two [2][3][5][6]. In general, the former reduces the number of selected nodes, while the latter supports node mobility and spreads the traffic. A wireless ad hoc network suffers a higher transmission error rate than a wired network, and the probability of message loss is high because signals interfere and collide with each other. One remedy is ACK transmission; another is selective flooding, which tolerates partially overlapping receptions [4][7][8]. However, if every node that receives a broadcast message responds with an ACK, many ACKs arrive simultaneously and cause congestion, a phenomenon known as ACK implosion [9]. A lost ACK further degrades link performance, because the sender must then retransmit the message [1][9]. Related work on ad hoc networks with omnidirectional antennas applies ACK-based confirmation only to nodes that must both receive and forward, while dead-end nodes, which only receive, rely on duplicated copies from neighboring nodes [2][11]. These methods, however, force the selection of forwarding nodes neighboring every dead-end node, which can inflate the forward node set; the numbers of broadcast messages and ACKs then both grow, so they are not an adequate answer to the broadcast storm or ACK implosion.
In networks with directional antennas, methods for reducing duplicated messages include MAC-layer message forwarding, directional self-pruning, and three-hop horizon pruning, but these efforts do not consider reliable transmission or the ACK implosion problem, even though they reduce the number of broadcast messages [5][8][12]. Most studies address only one of the two problems; Lou and Wu considered both [1][2][3][6][10][11][14][15]. In this paper we propose a low-cost, reliable broadcasting method for mobile ad hoc networks with switched-beam antennas that enable directional transmission. Our method controls the broadcast storm with DPDP and, where ACK implosion threatens, applies SART (selected acknowledgements and repeat transmissions) per antenna element. Simulations show that the proposed method greatly reduces both the broadcast storm and ACK implosion while enabling reliable directional transmission in a mobile ad hoc network.
2 System Model

The mobile ad hoc network discussed in this paper is divided into K non-overlapping sectors, and we assume each node carries a switched-beam antenna with one element per sector. Let Go be the transmission gain of an omnidirectional antenna and Gd that of a directional antenna; in general, Gd > Go. For example, an omnidirectional antenna using 10 dBm power reaches 250 m, while the same power with a beam angle of 60∘ reaches 450 m [16]. A switched-beam antenna uses only one antenna element at a time, so omnidirectional broadcasting can be realized by a sequential sweeping process [16]: clockwise, antenna elements 0, 1, 2, ..., K−1 transmit the message in turn with a constant delay. Transmitting on only a chosen group of elements realizes selective flooding. Let dd = λdo (λ > 1), where dd is the reach of the directional antenna and do that of the omnidirectional antenna; the directional antenna then covers λ2 times the area, so the network model gains λ2 times as many neighbors per node.

The mobile ad hoc network can be described by a unit disk graph G = (V, E), where V is the set of wireless mobile nodes and E the set of edges. An edge (u,v) ∈ E is a wireless link between nodes u and v that can reach each other. We assume every wireless link (u,v) is symmetric: if u can transmit messages to v, then v can transmit to u. We call the nodes that u can reach u's neighbors and denote u's neighbor set by N(u); by definition, u ∈ N(u). Denoting u's 2-hop neighbor set by N(N(u)) or N2(u), we have {u} ⊆ N(u) ⊆ N2(u), and N(v) ⊆ N2(u) whenever v ∈ N(u). Writing Nh(u) for the nodes within h hops of u and Hh(u) for the nodes exactly h hops from u, we have Nh(u) = Nh−1(u) ∪ Hh(u) for h ≥ 1, with N0(u) = H0(u) = {u}. For convenience, we omit the subscript when h = 1.
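The neighbor-set definitions above can be sketched in code. The following Python sketch assumes a unit-disk model; the node positions and radius are hypothetical values chosen for illustration, not taken from the paper.

```python
import math

# Hypothetical node positions and radius for a unit-disk model
# (illustrative values, not taken from the paper).
positions = {0: (100, 100), 1: (300, 120), 2: (520, 140), 3: (560, 380)}
RADIUS = 250.0

def neighbors(u):
    """N(u): nodes within reach of u; by definition u is in N(u)."""
    ux, uy = positions[u]
    return {v for v, (vx, vy) in positions.items()
            if math.hypot(ux - vx, uy - vy) <= RADIUS}

def n_hop(u, h):
    """N_h(u): nodes within h hops, via N_h(u) = N_{h-1}(u) ∪ H_h(u)."""
    reached = {u}
    for _ in range(h):
        reached = set().union(*(neighbors(w) for w in reached))
    return reached

def h_hop(u, h):
    """H_h(u): nodes exactly h hops from u; H_0(u) = {u}."""
    return n_hop(u, h) - n_hop(u, h - 1) if h >= 1 else {u}
```

With these positions, N(0) = {0, 1} and H2(0) = {2}, consistent with the recurrence N2(0) = N1(0) ∪ H2(0).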
Fig. 1. Omnidirectional antenna and directional antenna; a directional antenna consisting of 6 antenna elements (K=6)
Fig. 2. An example using 4 antenna elements
Fig. 2 illustrates N2(1) = N(1) ∪ H2(1) = {1,2,3,4,5,9,10} ∪ {6,7,8,11} = {1,2,3,4,5,6,7,8,9,10}. The set of nodes that can communicate directly with antenna element i of node u, that is, the 1-hop nodes reached through one of the K non-overlapping elements, is denoted Ni→(u). Then Ni→(u) ⊆ N(u) and N(u) = N0→(u) ∪ N1→(u) ∪ ... ∪ N(K−1)→(u) ∪ {u}. The degree of node u is |N(u)|−1 = |N0→(u)| + |N1→(u)| + ... + |N(K−1)→(u)|, where |Ni→(u)| is the number of nodes in Ni→(u). We assume every node keeps its antenna elements in a fixed orientation, for example by using a magnetic compass. Because radio waves travel in straight lines, the antenna elements that nodes u and v (u ∈ N(v)) use to communicate stand in a diagonal relationship: if u transmits to v on element j (0 ≤ j ≤ K−1), then v must receive on element (j + K/2) mod K. In Fig. 2, node 2 transmits to node 8 on antenna 1, so node 8 receives the message from node 2 via antenna 3. Let Dv→u = {i | u ∈ Ni→(v)} and Dv→V = ∪w∈V Dv→w for any node set V ⊆ N(v). For example, in Fig. 2, D8→2 = {3}, N(10) = {1,2,4,9}, and D10→N(10) = D10→1 ∪ D10→2 ∪ D10→4 ∪ D10→9 = {0} ∪ {1} ∪ {0} ∪ {1} = {0,1}. We assume each node u broadcasts HELLO periodically to obtain its neighbors' state; a node v that receives HELLO from u piggybacks its own 1-hop neighbor set N(v) on the HELLO it sends back to u.
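The diagonal relationship between antenna elements can be checked with a few lines of Python. The sector layout assumed here (element 0 starting at angle 0, proceeding counterclockwise) is an illustrative convention, not fixed by the paper.

```python
import math

K = 4  # number of non-overlapping antenna elements (sectors)

def sector(dx, dy, k=K):
    """Element covering direction (dx, dy); element 0 is assumed to
    start at angle 0 and sectors proceed counterclockwise."""
    ang = math.atan2(dy, dx) % (2 * math.pi)
    return int(ang // (2 * math.pi / k))

def opposite(j, k=K):
    """Element the receiver must use when the sender transmits on j:
    (j + K/2) mod K, the diagonal relationship from the text."""
    return (j + k // 2) % k
```

For instance, opposite(1) returns 3: node 2 sends to node 8 on antenna 1 and node 8 receives on antenna 3, as in Fig. 2.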
3 BSART: Broadcasting with Selected Acknowledgements and Repeat Transmissions

Suppose node v obtains its forward node set F(v) and dead-end set D(v) using DPDP. For each antenna element i (0 ≤ i ≤ K−1), v then computes the set of nodes Ti = Ni→(v) ∩ {F(v) ∪ D(v)} to which the message must be delivered. If |Ti| reaches a constant threshold, v transmits the message repeatedly a fixed number of times on element i (for convenience, the R-method); otherwise v confirms reception via ACKs (the A-method). For example, to avoid receiving three or more ACKs per antenna element simultaneously, set c = 3. If ACK confirmation in the style of Lou and Wu, which gives every node at least two reception opportunities from its neighbors, were applied uniformly to the region containing both F(v) and D(v), the ACKs and messages generated simultaneously would increase congestion; with a directional antenna, however, each element can be controlled separately [10]. When one node receives too many ACKs at once, ACK implosion occurs, causing not only a drop in performance but also extreme delay. Let M(v, s, seq#, F(v), mode, DATA) be the message to broadcast, where v is the ID of the forwarding node, s the ID of the broadcast source, and seq# the sequence number of the broadcast message generated by s; s and seq# together identify duplicates. DATA is the payload, and F(v) is the forward node set obtained by DPDP. In addition, every v maintains Rv, the set of antenna elements to which the R-method applies, and Av, the set of elements to which the A-method applies. For every antenna element i ∈ {0, 1, ..., K−1}, v computes Ti = Ni→(v) ∩ {F(v) ∪ D(v)}. The notation used in the algorithm is as follows.
• D(v): dead-end node set of v
• F(v): forward node set of v
• Ni→(v): neighbor set that v can reach with antenna element i
• TXmax: maximum number of retransmissions
• WAITmax: waiting time for an ACK
• Tint: time slot for transmitting M
• tx_cnti: number of transmissions made via antenna element i
• timeri:wait: timer for awaiting an ACK after transmitting M via antenna element i
• Av: set of antenna elements of v to which the A-method applies
• Rv: set of antenna elements of v to which the R-method applies
• Piv: set of nodes expected to respond with an ACK on receiving M via antenna element i, where i ∈ Av and Piv = Ni→(v) ∩ {F(v) ∪ D(v)}; if i ∈ Rv, Piv = Ф
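The per-element R/A classification can be sketched compactly. The following Python sketch follows the paper's notation, but the concrete data structure and names are illustrative.

```python
def classify(targets, c):
    """Split antenna elements into R_v (repeat without ACK, |P_i| >= c)
    and A_v (confirm via ACK, 0 < |P_i| < c).

    targets maps element id i -> P_i = N_i→(v) ∩ (F(v) ∪ D(v)).
    """
    r_v, a_v = set(), set()
    for i, p_i in targets.items():
        if not p_i:
            continue  # no forward or dead-end node on this element
        (r_v if len(p_i) >= c else a_v).add(i)
    return r_v, a_v
```

With c = 2 and, say, targets = {0: {1, 2}, 1: {3, 8}, 2: {4}, 3: {6}} (per-element sets invented for illustration), elements 0 and 1 fall into Rv and elements 2 and 3 into Av, mirroring R0 = {0,1}, A0 = {2,3} in the example at the end of this section.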
• ack_requ: flag set to 1 while v waits for an ACK from neighbor node u, used when |Piv| < c (A-method)
• ackedu: flag set to 1 once the ACK from u has been received
Algorithm: BSART (Broadcasting with Selected Acknowledgements and Repeat Transmissions)
input: M(u, s, seq#, F(u), Pu, mode, DATA), c
output: Av, Rv, F(v), M(v, s, seq#, F(v), mode, DATA)
initial state: tx_cnti = 0; Av = Rv = Ф; for all i ∈ {0,1,...,K−1}, Piv = Ф
// we assume that ACKs are never lost
case 1: node v is a broadcast source
  1.1 v = s; seq# = seq# + 1
  1.2 B(u,v) = N(v); U(u,v) = H2(v)  // u = Ф
  1.3 jump to 2.4
case 2: on receiving M(u, s, seq#, F(u), mode, DATA) from neighbor node u
  2.1 if mode = A and v ∈ F(u), execute the following; otherwise jump to 2.2  // ACK transmission
    2.1.1 if M is a duplicate, stop
    2.1.2 transmit ACK(v, u, s, seq#) via the antenna element f that received M; jump to 2.3
  2.2 if M is not a duplicate, receive M and stop  // dead-end node
  2.3 B(u,v) = N(v) − N(u); U(u,v) = H2(v) − N(u) − N(N(v) ∩ N(u))
  2.4 compute F(v) with DPDP
  2.5 for each antenna element i (i = 0 to K−1), compute Piv = Ni→(v) ∩ {F(v) ∪ D(v)}, then execute the following
    2.5.1 if |Piv| ≥ c, execute the following; otherwise jump to 2.5.2  // R-method
      2.5.1.1 Rv = Rv ∪ {i}
      2.5.1.2 for all x ∈ Piv, ack_reqx = 0
      2.5.1.3 mode = R; Piv = Ф  // retransmission mode
    2.5.2 if |Piv| < c, execute the following  // A-method
      2.5.2.1 Av = Av ∪ {i}
      2.5.2.2 for all x ∈ Piv, ack_reqx = 1
      2.5.2.3 mode = A
  2.6 for each antenna element i ∈ Rv ∪ Av, execute the following
    2.6.1 if i ∈ Rv, transmit M(v, s, seq#, F(v), mode, DATA) the fixed number of times via antenna element i
    2.6.2 if i ∈ Av, execute the following via antenna element i
      2.6.2.1 timeri:wait = WAITmax
      2.6.2.2 transmit M(v, s, seq#, F(v), mode, DATA)
case 3: on receiving ACK(w, v, s, seq#) from neighbor node w via antenna element f
  3.1 if f ∈ Av and ack_reqw = 1, execute the following
    3.1.1 ackedw = 1; ack_reqw = 0
    3.1.2 Pfv = Pfv − {w}; if Pfv = Ф, timerf:wait = Ф
case 4: when timeri:wait expires while tx_cnti < TXmax and Pfv ≠ Ф
  4.1 timeri:wait = WAITmax; tx_cnti = tx_cnti + 1
  4.2 set F(v) = Pfv and transmit M via antenna element i

Consider applying BSART when the broadcast source is node 0, with c = 2, so that an antenna element with only one target node works with ACKs, and suppose the success probability of a single directional transmission is p = 1/2. By steps 2.3 and 2.4 we get B(Ф,0) = N(0) − N(Ф) = N(0) = {0,1,2,3,4,6,8}, U(Ф,0) = H2(0) − N(Ф) − N(N(Ф) ∩ N(0)) = H2(0) = {5,7,9}, and F(0) = {2,4,8} (F(0) = {3,6,8} is also possible). By steps 2.5 and 2.6 we get R0 = {0,1}, A0 = {2,3}, P20 = {4}, P30 = {6}. Because the transmission success probability is p = 1/2, M is transmitted twice on antennas 0 and 1. By 2.5.2, node 0 awaits ACKs on the antenna elements 2 and 3 in A0, i.e., ack_req4 = 1 and ack_req6 = 1. In the same way, for the nodes 2, 4, 8 in F(0) = {2,4,8}: F(2) = Ф, A2 = {0}, R2 = Ф, P02 = {7}; F(4) = Ф, A4 = {3}, R4 = Ф, P34 = {5}; F(8) = Ф, A8 = {1}, R8 = Ф, P18 = {9}. Assuming ACKs are never lost, consider the messages generated by BSART with c = 2. The A-method produces 5 message transmissions that require ACKs, and hence 5 ACKs. The R-method transmits on antenna elements 0 and 1, broadcasting twice on each element regardless of the number of receivers, i.e., 2×2 retransmitted messages, for a total of 14 (= 5 + 5 + 2×2) messages. For comparison, if every transmission were confirmed by ACK, node 0 would generate 4 messages, since its receivers span 4 antenna elements; similarly node 2 would generate 2 messages, and nodes 3, 4, 5, 8 one message each. Even if transmission succeeds without ACK implosion, an ACK arrives from every node in the set {1,2,3,4,5,8,9}, i.e., everyone except the source node 0, so 7 ACKs are generated, more than under BSART.
Moreover, this is the minimum number that can be generated; in addition, ACK implosion can occur at node 0 if it receives two ACKs simultaneously on antennas 0 and 1.
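The arithmetic of the worked example, and the reliability the R-method buys, can be checked directly. The independence assumption on per-transmission success below is ours, made for this sketch.

```python
def total_messages(a_sends, ack_count, r_antennas, repeats):
    """Message count in the worked example: A-method sends, their ACKs,
    and R-method antennas each repeating the message `repeats` times."""
    return a_sends + ack_count + r_antennas * repeats

def r_success(p, repeats):
    """Chance that at least one of `repeats` unacknowledged copies
    arrives, assuming independent per-transmission success p."""
    return 1 - (1 - p) ** repeats
```

Here total_messages(5, 5, 2, 2) reproduces the 14 messages counted above, and r_success(0.5, 2) = 0.75 illustrates why two repeats were used with p = 1/2.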
4 Experiments and Evaluation

We considered a 1000×1000 area with 20, 40, 60, 80, and 100 nodes placed by a random number generator. Table 1 lists the major parameters. The protocols compared include BF (blind flooding), HHH, and SHJ [8][16]. BF generates many overlapped messages but achieves a high delivery ratio.
Node speed is set to 0-20 m/s under the random waypoint model. Under HHH, each forward node designates one forwarding node per direction, and a designated node retransmits a received message in every direction except the one it arrived from. In the SHJ algorithm, each node u forwards the broadcast message together with its neighbor information N(u), and a receiving node v forwards it in every direction f that satisfies Nf→(v) − N(u) − {u} ≠ Ф. The simulation measures the following:
□ forwarding direction (antenna element) ratio
□ message forwarding ratio by number of nodes, number of antenna elements, and movement speed
□ ACK message processing time by number of antenna elements
Table 1. Major parameters for simulation

  Parameter   Value
  TXmax       4
  K           4, 6, 8
  WAITmax     10
  c           2, 3, 4, 5
  p           0.3
The experiments were carried out with the NS-2 simulator, with each module programmed in C++ and Tcl/Tk. Fig. 3 shows the ratio of antenna elements selected for broadcasting with 20, 40, 60, 80, and 100 nodes. BSART uses fewer than 30% of the antenna elements, i.e., 1.2 (= 4×0.3) per node, very similar to the HHH algorithm. Under BF, messages are transmitted in all directions as the number of nodes grows. SHJ uses at most 2.4 elements per node. Figs. 4, 5, and 6 show the message delivery ratio as a function of the number of nodes, the number of antenna elements, and the movement speed, respectively.

Fig. 3. Antenna element ratio per node

Fig. 4. Message delivery ratio per nodes

As the number of nodes increases, the delivery ratio increases; HHH shows the lowest ratio. Apart from BF, the ratios are similar to one another, and BSART and DCB reach almost 100% once the number of nodes exceeds 60. Fig. 5 also shows the delivery ratio for c = 2, 3, 4, 5 when the A-method is applied with K = 1, 4, 8 antenna elements. In general, a larger c makes the A-method more frequent, leading to more ACK-confirmed transmissions; consequently, the delivery ratio increases with c, as Fig. 5 shows.
Fig. 5. Message delivery ratio per antenna element
Fig. 6 shows the delivery ratio versus node movement speed with 60 nodes. BF and SHJ keep a high ratio regardless of movement, while DCB-SD and BSART stay above 90%. HHH's ratio varies the most with node movement [11].
Fig. 6. Message delivery ratio per mobility
Fig. 7 shows the ACK processing time as a function of the constant c that decides the A-method. The smaller c is and the larger K is, the shorter the processing time. As K increases, fewer nodes must be served per antenna element, which effectively lowers c.
Fig. 7. ACK handling time per No. of antenna element
5 Conclusion

In this paper we proposed BSART (Broadcasting with Selected Acknowledgements and Repeat Transmissions), which provides low-cost, reliable directional broadcast with switched-beam antennas in mobile ad hoc networks. To handle the ACK implosion that accompanies reliable transmission, we combined, per antenna element, the ACK-based A-method with the R-method, which simply retransmits the message a fixed number of times without ACKs: an antenna element with at least c receiving nodes retransmits a fixed number of times, while the others request ACKs. Experiments showed that our algorithm reduces the numbers of broadcast messages and ACKs while supporting reliable delivery; with movement speeds under 20 m/s, K = 4, and c = 2, the message delivery ratio exceeded 90%. A closer performance analysis of BSART and its application to sensor networks remain as future work.
References
1. Ni, S., Tseng, Y., Chen, Y., Sheu, J.: The broadcast storm problem in a mobile ad hoc network. In: Proc. MOBICOM 1999, pp. 151–162 (1999)
2. Lim, H., Kim, C.: Flooding in wireless ad hoc networks. Computer Communications 24(3-4), 353–363 (2001)
3. Lou, W., Wu, J.: On reducing broadcast redundancy in ad hoc wireless networks. IEEE Trans. Mobile Computing 1(2), 111–122 (2002)
4. Basagni, S., Conti, M., Giordano, S., Stojmenovic, I. (eds.): Mobile Ad Hoc Networking. IEEE/Wiley (2004)
5. Dai, F., Wu, J.: Efficient broadcasting in ad hoc networks using directional antennas. IEEE Trans. Parallel & Distributed Systems 17(4), 1–13 (2006)
6. Wu, J., Dai, F.: A generic distributed broadcast scheme in ad hoc wireless networks. IEEE Trans. Computers 53(10), 1343–1354 (2004)
7. Boukerche, A., Chlamtac, I. (eds.): Handbook of Algorithms for Mobile and Wireless Networking and Computing. CRC Press, Boca Raton (2005)
8. Hu, C., Hong, Y., Hou, J.: On mitigating the broadcast storm problem with directional antennas. In: Proc. IEEE ICC 2003 (2003)
9. Impett, M., Corson, M.S., Park, V.: A receiver-oriented approach to reliable broadcast in ad hoc networks. In: Proc. WCNC 2000, pp. 117–122 (2000)
10. Lou, W., Wu, J.: A reliable broadcast algorithm with selected acknowledgements in mobile ad hoc networks. In: Proc. IEEE GLOBECOM 2003 (2003)
11. Lou, W., Wu, J.: Double-covered broadcast (DCB): a simple reliable broadcast algorithm in MANETs. In: Proc. IEEE INFOCOM 2003 (2003)
12. Spohn, M., Garcia-Luna-Aceves, J.J.: Improving route discovery in on-demand routing protocols using two-hop connected dominating sets. Ad Hoc Networks 4(4) (July 2006)
13. Lou, W., Wu, J.: Localized broadcasting in mobile ad hoc networks using neighbor designation. Technical Report, Dept. of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL (July 2003)
14. Qayyum, A., Viennot, L., Laouiti, A.: Multipoint relaying for flooding broadcast messages in mobile wireless networks. In: Proc. 35th Hawaii Int'l Conf. on System Sciences (HICSS-35), January 2002, pp. 3898–3907 (2002)
15. Alagar, S., Venkatesan, S., Cleveland, J.: Reliable broadcast in mobile wireless networks. In: Proc. MILCOM 1995, pp. 236–240 (1995)
16. Choudhury, R.R., Vaidya, N.H.: Performance of ad hoc routing using directional antennas. Ad Hoc Networks 3(2), 157–173 (2005)
17. Shen, C.C., Huang, Z., Jaikaeo, C.: Directional broadcast for ad hoc networks with percolation theory. Tech. Report, Comp. and Info. Sciences, Univ. of Delaware (February 2004)
DPDP: An Algorithm for Reliable and Smaller Congestion in the Mobile Ad-Hoc Network

Ingu Han*, Kee-Wook Rim, and Jung-Hyun Lee
Abstract. PDP (Partial Dominant Pruning) is known as the most practical method for reducing duplicated broadcast messages by designating forward nodes on the fly when a broadcast occurs. In this paper we introduce DPDP (Directional PDP) for mobile wireless networks with directional antennas, which reduces both the number of forward nodes and the number of antenna elements used. Simulations show that, compared with PDP, our algorithm reduces the number of forwarding nodes per antenna element and the number of duplicated messages each node receives, even as the number of antenna elements grows beyond the omnidirectional case.

Keywords: partial dominant pruning, selected broadcasting, mobile ad-hoc network.
1 Introduction

Because every node acts not only as a host but also as a router, broadcasting is necessary in a mobile ad hoc network to find a routing path to a certain node or to discover location information. To cope with the broadcast storm, it is common to let only designated nodes forward messages [1][2][3][4]. These forwarding nodes form a CDS (connected dominating set) of the network, but finding the lowest-cost CDS is known to be NP-complete. The heuristics for designating a CDS fall into source-independent and source-dependent broadcasting [1][3][5][6]. Source-independent broadcasting maintains a single CDS for the given network, while source-dependent broadcasting builds the CDS around the broadcasting node on the fly; the latter can therefore yield several CDSs, whereas the former cannot. In general, source-independent broadcasting guarantees a smaller number of nodes, but source-dependent broadcasting suits dynamic situations. A mobile ad hoc network with directional antennas makes good use of bandwidth and power and reduces interference with neighboring nodes, but owing to technical difficulty, research on broadcasting with directional antennas in mobile ad hoc networks has started only recently.*

* This research was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2009-C1090-0902-0020).

Most of this work designates a CDS by source-independent broadcasting or tries to reduce redundant broadcast messages by considering the antennas' directions [6][7]; no prior work designates the message forwarding node set around the broadcasting node as this paper does. We propose directional partial dominant pruning, an extension of PDP that reduces both the number of antenna elements used and the number of forwarding nodes [1]. Simulations show that our algorithm outperforms the original PDP in reducing both nodes and antenna elements.
2 Network Model

Fig. 1 shows an omnidirectional antenna and a directional antenna; Fig. 2 shows a switched-beam antenna whose 360∘ coverage is divided into fan-shaped sectors, each with its own antenna element [8][9]. An omnidirectional antenna using 10 dBm power reaches 250 m, while the same power with a beam angle of 60∘ reaches 450 m [9]. A switched-beam antenna uses only one antenna element at a time, so omnidirectional broadcasting can be realized by a sequential sweeping process [8]. We call the nodes that u can reach u's neighbors and denote u's neighbor set by N(u); by definition, u ∈ N(u). Denoting u's 2-hop neighbor set by N(N(u)) or N2(u), we have {u} ⊆ N(u) ⊆ N2(u), and N(v) ⊆ N2(u) follows if v ∈ N(u). Writing Nh(u) for the nodes within h hops of u and Hh(u) for the nodes exactly h hops from u, we have Nh(u) = Nh−1(u) ∪ Hh(u) for h ≥ 1, with N0(u) = H0(u) = {u}. For convenience, we omit the subscript when h = 1. The set of nodes that can communicate directly with antenna element i of node u, that is, the 1-hop nodes reached through one of the K non-overlapping elements, is denoted Ni→(u). Then Ni→(u) ⊆ N(u) and N(u) = N0→(u) ∪ N1→(u) ∪ ... ∪ N(K−1)→(u) ∪ {u}.
Fig. 1. Omnidirectional antenna and directional antenna
Fig. 2. Directional antenna consisting of 6 antenna elements (K=6)
Fig. 3. An example using 4 antenna elements
Because radio waves travel in straight lines, the antenna elements used by u and v (u ∈ N(v)) to communicate stand in a diagonal relationship: if u transmits to v on element j (0 ≤ j ≤ K−1), then v must receive on element (j + K/2) mod K. In Fig. 3, node 2 transmits to node 8 on antenna 1, so node 8 receives the message from node 2 via antenna 3.
Let Dv→u = {i | u ∈ Ni→(v)} and Dv→V = ∪w∈V Dv→w for any node set V ⊆ N(v). For example, in Fig. 3, D8→2 = {3}, N(10) = {1,2,4,9}, and D10→N(10) = D10→1 ∪ D10→2 ∪ D10→4 ∪ D10→9 = {0} ∪ {1} ∪ {0} ∪ {1} = {0,1}. We assume each node u broadcasts HELLO periodically to obtain its neighbors' state; a node v that receives HELLO from u piggybacks its 1-hop neighbor set N(v) on the HELLO it returns to u.
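The set Dv→V is a straightforward union over per-node element lookups. In the Python sketch below, the helper element_of and the ELEM mapping, which encodes the Fig. 3 example for node 10, are illustrative.

```python
def d_set(v, targets, element_of):
    """D_{v→V} = ∪_{w∈V} D_{v→w}: the antenna elements v needs in order
    to reach every node in targets. element_of(v, w) returns the element
    v uses toward w (a hypothetical lookup for this sketch)."""
    return set().union(*({element_of(v, w)} for w in targets))

# Encoding of the Fig. 3 example: node 10 reaches 1 and 4 on element 0,
# and 2 and 9 on element 1.
ELEM = {(10, 1): 0, (10, 2): 1, (10, 4): 0, (10, 9): 1}
```

For instance, d_set(10, {1, 2, 4, 9}, lambda v, w: ELEM[(v, w)]) yields {0, 1} = D10→N(10).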
3 Directional Partial Dominant Pruning

To apply PDP, designed for the omnidirectional antenna model, to the directional antenna model, we considered the following.
• We modified the criterion for selecting a node in B(u,v) (= N(v) − N(u)) to cover the nodes in U(u,v) (= H2(v) − N(u) − N(N(u) ∩ N(v))): among the nodes p ∈ B(u,v) covering some q ∈ U(u,v), we preferentially select the p with maximum |Ni→(p) ∩ U(u,v)|; on a tie, we select the p with maximum |N(p) ∩ U(u,v)|, and on a further tie a random node.
• We then determine the antenna element set Dv→B(u,v) used to forward the message to the nodes of B(u,v), i.e., the 1-hop nodes of v that must receive the message, including the selected F(v). Unlike the omnidirectional model, the directional model must transmit only toward the receiving nodes' directions, to reduce interference and wasted bandwidth.

Algorithm: DPDP (Directional Partial Dominant Pruning)
input: N(v), N2(v), F(u)
output: F(v), Dv→B(u,v)
initial state: F(v) = Dv→B(u,v) = Ф
1. B(u,v) = N(v) − N(u), U(u,v) = H2(v) − N(u) − N(N(u) ∩ N(v))
2. if t ∈ U(u,v) is covered by s ∈ B(u,v), do the following
  2.1 F(v) = F(v) ∪ {s}, Dv→B(u,v) = Dv→B(u,v) ∪ Dv→s
  2.2 B(u,v) = B(u,v) − {s}, U(u,v) = U(u,v) − {N(s) ∩ U(u,v)}
3. repeat until U(u,v) = Ф
  3.1 find Dv→B(u,v) = Dv→B(u,v) ∪ {i} for the p with maximum |Ni→(p) ∩ U(u,v)|; if a tie occurs, jump to 3.2, otherwise jump to 3.3
  3.2 find the p with maximum |N(p) ∩ U(u,v)|; if a tie occurs again, find the p with maximum |H(p)|; if a tie occurs yet again, select a random p; then set Dv→B(u,v) = Dv→B(u,v) ∪ Dv→p
  3.3 F(v) = F(v) ∪ {p}, B(u,v) = B(u,v) − {p}, U(u,v) = U(u,v) − {N(p) ∩ U(u,v)}

Fig. 3 shows F(v) in the case where node 2 is the broadcast source. At node 2, B(Ф,2) = N(2) − {} = {1,2,7,8,9,10} and U(Ф,2) = H2(2) − N(Ф) − N(N(2) ∩ N(Ф)) = H2(2) = {3,4,5}. We select a node in B(Ф,2) covering the maximum number of nodes in U(Ф,2). Because |N0→(1) ∩ U(Ф,2)| = |{3,5}| = 2 and N(1) ∩ U(Ф,2) = U(Ф,2), we get F(2) = {1} and D2→B(Ф,2) = {0,1,2,3}. At node 1, B(2,1) = N(1) − N(2) = {1,2,3,4,5,9,10} − {1,2,7,8,9,10} = {3,4,5} and U(2,1) = H2(1) − N(2) − N(N(2) ∩ N(1)).
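At its heart, the covering loop of step 3 is a greedy set cover over U(u,v). The Python sketch below keeps only that core, omitting the per-antenna criterion and the tie-breaking of steps 3.1-3.2; nbr, the example adjacency NB, and the assumption that U(u,v) is coverable by B(u,v) are all ours.

```python
def dpdp_core(b, u_set, nbr):
    """Greedy core of DPDP step 3: repeatedly add the node of B(u,v)
    covering the most still-uncovered nodes of U(u,v). Tie-breaking and
    the per-antenna criterion (steps 3.1-3.2) are omitted; nbr(p)
    returns N(p). Assumes U(u,v) is coverable by B(u,v)."""
    f, b, u = set(), set(b), set(u_set)
    while u:
        p = max(b, key=lambda q: len(nbr(q) & u))  # best cover of U(u,v)
        f.add(p)
        b.discard(p)
        u -= nbr(p)  # remove the newly covered 2-hop nodes
    return f

# Adjacency for the Fig. 3 example (only the sets the example needs).
NB = {1: {1, 2, 3, 4, 5, 9, 10}, 10: {1, 2, 4, 9}}
```

Running dpdp_core({1, 2, 7, 8, 9, 10}, {3, 4, 5}, lambda q: NB.get(q, set())) returns {1}, matching F(2) = {1} computed above.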
4 Simulation and Evaluation

To evaluate the proposed algorithm, we considered a 1000×1000 area with 20, 40, 60, 80, and 100 nodes distributed uniformly, and measured:
• the number of forwarding nodes
• the average number of forwarding nodes per antenna element
• the number of duplicated messages per node
The experiments were carried out with the NS-2 simulator; the PDP and DPDP modules were programmed in C++ and Tcl/Tk. For convenience, we did not model the MAC and physical layers. We tested K (the number of antennas per node) = 1, 4, 8, without node mobility. Fig. 4 shows the number of forwarding nodes selected by the DPDP algorithm. Using directional antennas requires more forwarding nodes than K = 1, i.e., the omnidirectional case, but the difference stays within 5; this follows from step 3.1, which selects the node covering the maximum number of neighbors per antenna element. As the number of antennas increases, the number of forwarding nodes grows, but not considerably. Fig. 5 shows the number of forwarding nodes per antenna, i.e., the number of forwarding nodes divided by K, which indirectly indicates interference and power consumption. The larger K is, the fewer forwarding nodes per antenna, so performance improves, and the ACK implosion problem is also relieved [10]. The gap widens as K grows; for K = 4 and 8, the difference reaches 230% compared with K = 1. This shows that DPDP profits from directional antennas in mobile ad hoc networks.
DPDP: An Algorithm for Reliable and Smaller Congestion
Fig. 4. Relation of No. of nodes and No. of forwarding nodes
Fig. 5. No. of forwarding nodes per antenna element
Fig. 6 shows the number of redundancy messages; in the case K = 8, fewer than 2 redundant messages occur per node. In the cases K > 1, nodes receive messages from fixed directions compared to K = 1 (omnidirectional antenna), so as K grows, the duplication ratio gets smaller.
Fig. 6. Redundancy ratio per node
I. Han, K.-W. Rim, and J.-H. Lee
In the case K = 8, the duplication ratio is reduced by 160%–190%. The simulation results show that our algorithm is superior to the legacy algorithm in many respects and very useful.
5 Conclusion
In this paper, we proposed directional partial dominant pruning, an expanded version of PDP which reduces not only the number of antenna elements but also the number of forwarding nodes. By simulation, we showed that our algorithm is superior to the legacy PDP in terms of reducing the number of forwarding nodes and antenna elements. The algorithm preferentially selects a node p ∈ B(u,v) maximizing |Ni→(p) ∩ U(u,v)|, i.e., a node that covers the most nodes q ∈ U(u,v). And to reduce redundancy messages, the algorithm finds the antenna element set Dv→B(u,v). The simulation results show that our algorithm is superior to the legacy algorithm in many respects and very useful. Finally, research that allows node mobility and uses the MAC layer is required.
References
1. Lou, W., Wu, J.: On reducing broadcast redundancy in ad hoc wireless networks. IEEE Trans. Mobile Computing 1(2), 111–122 (2002)
2. Ni, S., Tseng, Y., Chen, Y., Sheu, J.: The broadcast storm problem in a mobile ad hoc network. In: Proc. MOBICOM 1999, pp. 151–162 (1999)
3. Lim, H., Kim, C.: Flooding in wireless ad hoc networks. Computer Communications 24(3–4), 353–363 (2001)
4. Basagni, S., Conti, M., Giordano, S., Stojmenovic, I. (eds.): Mobile Ad Hoc Networking. IEEE/Wiley (2004)
5. Wu, J., Dai, F.: A generic distributed broadcast scheme in ad hoc wireless networks. IEEE Trans. Computers 53(10), 1343–1354 (2004)
6. Dai, F., Wu, J.: Efficient broadcasting in ad hoc networks using directional antennas. IEEE Trans. Parallel & Distributed Systems 17(4), 1–13 (2006)
7. Hu, C., Hong, Y., Hou, J.: On mitigating the broadcast storm problem with directional antennas. In: Proc. IEEE ICC 2003 (2003)
8. Ramanathan, R., Redi, J., Santivanez, C., Wiggins, D., Polit, S.: Ad hoc networking with directional antennas: a complete system solution. IEEE JSAC 23(3) (2005)
9. Ramanathan, R.: On the performance of ad hoc networks with beamforming antennas. In: Proc. MobiHOC 2001, pp. 95–105 (2001)
10. Impett, M., Corson, M.S., Park, V.: A receiver-oriented approach to reliable broadcast in ad hoc networks. In: Proc. WCNC 2000, pp. 117–122 (2000)
Development of Field Monitoring Server System and Its Application in Agriculture

Chang-Sun Shin1, Meong-Hun Lee1, Yong-Woong Lee1, Jong-Sik Cho1, Su-Chong Joo2, and Hyun Yoe1

1 School of Information and Communication Engineering, Sunchon National University, Korea
{csshin,leemh777,ywlee,cho1318,yhyun}@sunchon.ac.kr
2 School of Electrical, Electronic and Information Engineering, Wonkwang University, Korea
[email protected]
Abstract. In agricultural fields, environmental factors such as temperature, humidity, solar radiation, CO2, and soil moisture are essential elements that influence growth rate, productivity of produce, sugar content of fruit, acidity, etc. If we manage these environmental factors efficiently, we can achieve improved results in agricultural production. For monitoring and managing the growth environment, this paper suggests the Field Monitoring Server System (FMSS), which can operate on solar power. We implemented the Ubiquitous Field Server System (UFSS) in our previous work; compared with the UFSS, the FMSS improves power consumption, mobility, and user-friendly environment monitoring. The system collects environmental data directly from environment sensors, a soil sensor, and a CCTV camera. To implement a stand-alone system, we applied a solar cell panel so that the system can operate without an external power source. To indicate the location of the system, a Global Positioning System (GPS) module is installed. Finally, we confirmed that the FMSS monitors field conditions using various facilities and operates correctly without external support.
implementation of our system. Finally, we discuss conclusions and future work in Section 5.
2 Related Works
Environmental factors, like light, water, temperature, and soil, are the essential elements of agriculture. Several researchers have studied agricultural field monitoring. The Jet Propulsion Lab. at NASA studied solar-powered environment sensors to monitor the temperature, humidity, and oxygen of the environment and soil, applying Sensor Web 3.1 with low power and small size [5]. The Phytech Co. in Israel developed a plant growth monitoring system; this system measures the environment status using sensors attached to plants and sends the information to the farmer's home via the Internet [6].
Fig. 1. A plant growth monitoring system of the Phytech Co.
Tokyo Univ. in Japan monitored several farmlands in Asia, collecting environmental data and soil information from the farmlands using a stand-alone system [7]. The above studies do not consider communication between systems or sensors, or solar power, and they collect environmental data at only one spot in the field. In a large-scale agricultural field, we need to collect environmental data at several spots. Our system adopts a Ubiquitous Sensor Network (USN) and a solar battery. We also develop the FMSS as a stand-alone system integrating sensors, a CCTV camera, a solar cell, a database, a web server, and a GPS module.
3 Field Monitoring Server System (FMSS) Architecture
The Field Monitoring Server System (FMSS) collects real-time environmental field data from various sensors in the physical layer and implements agriculture application services, including real-time monitoring, at the higher application layer. Figure 2 shows the architecture of the FMSS.
Fig. 2. Field Monitoring Server System’s architecture consisting of three layers
Our suggested FMSS consists of three layers; each layer and its components are explained in detail as follows. The physical layer includes various sensors, a GPS module, a CCTV camera, and a solar cell. The system reduces electric power consumption by using a low-power embedded board consisting of a CPU, AD converter, DA converter, Ethernet controller, and wireless LAN; the system also includes a database and web server. Our system is a self-charging stand-alone system using solar-electric power. The middle layer has the sensor manager, the location manager, the motion manager, the information storage, and the web server. The sensor manager manages the information from the soil sensor and environment sensors. The location manager interacts with the GPS module at the physical layer, storing and managing the location data of the system. The motion manager provides stream data of the field status for users. The information storage stores the information from the physical devices in the database. The web server provides the environment information from the physical devices to users via the Internet. The application layer provides users with the environment monitoring service, the location monitoring service, and the motion monitoring service. These three layers are integrated into the FMSS; by interacting with each other, they provide field environment information to farmers. We implemented the Ubiquitous Field Server System (UFSS) in previous work [8]. Compared with the UFSS, the FMSS improves power consumption, mobility, and user-friendly environment monitoring. We also performed a field test in a yard to verify the system's executability.
C.-S. Shin et al.
3.1 Environment Monitoring Service
The environment monitoring service shows data collected from the soil and environment sensors in the physical layer. First, this service sends the raw data from the environment sensors to the sensor manager. The raw data are the temperature, humidity, soil Electrical Conductivity (EC) ratio, CO2, and illumination of the field. The sensor manager then changes the raw data into digital information and stores the information in the information storage. The web server shows this environment information to users. Figure 3 shows the procedure of the environment monitoring service.
Fig. 3. Procedure of the environment monitoring service
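The data flow of this service can be sketched as follows; the calibration function, its scale factors, and the sensor names are illustrative assumptions, not values from the paper:

```python
def convert(sensor, raw):
    # Illustrative calibration: map a raw 10-bit ADC count (0-1023) to an
    # engineering value; the scale factors here are assumptions.
    scale = {"temperature": 100.0, "humidity": 100.0, "co2": 5000.0}
    return raw / 1023.0 * scale.get(sensor, 1.0)

def sensor_manager(raw_readings, storage):
    # The sensor manager changes raw sensor data into digital information
    # and stores it in the information storage; the web server then reads
    # the storage to show the information to users.
    for sensor, raw in raw_readings.items():
        storage[sensor] = convert(sensor, raw)
    return storage
```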
3.2 Location Monitoring Service
This service monitors the system's location in the field. First, the GPS module sends the system's location data to the location manager. The location manager then stores the location data in the information storage, and the web server provides the location of the system to users. Figure 4 explains the procedure of this location monitoring service.
Fig. 4. Procedure of the location monitoring service
3.3 Motion Monitoring Service
This service provides motion data from the CCTV camera to users. First, the CCTV camera sends stream data to the motion manager. The motion manager stores the stream data in the information storage, and the web server shows the stream data to users via the Internet. Figure 5 describes the procedure of the motion monitoring service.
Fig. 5. Procedure of the motion monitoring service
4 Development of the FMSS
In this chapter, we develop the FMSS. Figure 6 shows the whole system model. The FMSS consists of autonomous systems: each system has a solar cell, a storage battery, and a low-power embedded board. The system stores electric power in the daytime and uses it at night.
Fig. 6. FMSS model including an embedded board, a GPS, solar-charging devices and sensors
4.1 System Components
This system consists of physical devices and software modules. We explained the software modules in Chapter 3. The physical devices are sensing and information-gathering devices; the devices attached to our system are shown in Figure 7.
Fig. 7. Solar cell, soil sensor, network sensor node, GPS module and CCTV camera in the FMSS

Table 1. Power consumption of each module and power supply of the solar battery

Power consumption:
  Module               Voltage    Current    Power
  GPS Receiver         DC 5V      0.2A       1W
  Embedded Board       DC 9V      500mA      5W
  CCTV Camera          DC 9V      400mA      4W
  Soil Sensor          DC 5V      10mA       0.05W
  Environment Sensor   DC 3V      2.3A       6.9W
  TOTAL                DC 12V     1.4A       16.95W

Supply power:
  Module       Voltage     Current / Capacity    Power
  Solar Cell   DC 26.4V    7.6A                  200W
  Battery      DC 12V      64A (20HR)
Table 1 shows the power consumption of the equipped modules and the solar power supply in the FMSS. The total power consumption of the equipped modules (GPS receiver, embedded board, CCTV camera, soil sensor, and environment sensor) is 16.95W. The solar cell supplies a maximum of 200W of electric power in the 25°C test environment, which is enough to operate the system. Figure 8 shows the main system with the sensors' data receiver, database, and web server, and Figure 9 shows the solar battery. Integrating the above components into the system, Figure 10 shows the FMSS prototype including the software modules. The FMSS can be applied to various environments such as precision agriculture, livestock monitoring, and greenhouse monitoring. For the field test, we deployed the system in a wild field.
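The totals in Table 1 can be re-checked with a few lines; the wattages below are the table's listed per-module values:

```python
# Per-module power consumption as listed in Table 1 (watts)
modules = {
    "GPS receiver": 1.0,
    "Embedded board": 5.0,
    "CCTV camera": 4.0,
    "Soil sensor": 0.05,
    "Environment sensor": 6.9,
}
total_w = sum(modules.values())     # matches the 16.95 W TOTAL row
solar_max_w = 200.0                 # maximum solar cell supply at 25 degrees C
headroom_w = solar_max_w - total_w  # ample margin left for charging the battery
```

With the listed values, total_w sums to the table's 16.95 W, well below the 200 W maximum supply.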
Fig. 8. Embedded board including environment sensor receiver, database, and the web server
Fig. 9. Soil sensor receiver, solar battery
Fig. 10. Prototype of the FMSS deployed in a wild field
4.2 Implementation Results
Figure 11 shows the results of the FMSS's GUI from the web server. Part (a) of the figure presents the real-time motion from the CCTV camera. Part (b) shows the location of the current system on a GIS map. We used the GPS data to map
the location. Parts (c), (d), and (e) of Figure 11 show the sensing values from the soil sensor, the sensing values from the environmental sensors, and the average temperature, respectively.
Fig. 11. A GUI for the FMSS’s application
Figure 12 shows the position of the FMSS. We can confirm the system's location or movement through the location monitoring service of the FMSS.
Fig. 12. Location of the FMSS on the map
To confirm the successful operation of the FMSS using the solar cell, we performed a field test on a sunny day with a mean temperature of 25°C. Figure 13 shows a
graph of the power consumption results from the field test. As a final result, a solar cell charged for about 10 hours can support the operation of the FMSS for about 24 hours. Hence, our FMSS can operate on solar power in the field without a wired link or an additional recharging process.
Fig. 13. Comparing estimated lifetime with actual lifetime of the system
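A back-of-envelope calculation is consistent with this result; the sketch below reads the battery row of Table 1 as a 64 Ah capacity at the 20-hour rate and assumes ideal, full-depth discharge, which is optimistic:

```python
battery_wh = 12 * 64    # nominal battery energy: 12 V x 64 Ah = 768 Wh
load_w = 16.95          # total module consumption from Table 1 (watts)
ideal_runtime_h = battery_wh / load_w  # nominal runtime at full load
```

This yields roughly 45 h of nominal stored energy at full load, so the observed 24 h of operation is plausible once real-world conversion losses and limited depth of discharge are accounted for.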
5 Conclusions
This paper proposed the Field Monitoring Server System (FMSS), which can collect and monitor the environmental information from a given field and the system's location. To verify the executability of our system, we implemented the FMSS prototype and showed the execution results of the system. From these results, we confirmed that the FMSS monitors field conditions using various facilities and operates correctly without external support. The FMSS also improves power consumption, mobility, and user-friendly environment monitoring. The FMSS can be a powerful system for solving fundamental problems in large-scale agricultural areas. In the future, we will develop an improved monitoring system which operates over CDMA or other USN technologies and apply it to reference point control in the GIS field.
Acknowledgements. This research was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2009-(C1090-0902-0047)).
References
1. Akyildiz, I.F., et al.: A Survey on Sensor Networks. IEEE Communications Magazine 40(8) (2002)
2. Burrell, J., Brooke, T., Beckwith, R.: Vineyard computing: sensor networks in agricultural production. IEEE Pervasive Computing 3(1), 38–45 (2004)
3. Fountas, S., Pedersen, S.M., Blackmore, S.: ICT in Precision Agriculture – diffusion of technology. In: ICT in agriculture: perspective of technological innovation (2005)
4. Tilman, D., Cassman, K.G., Matson, P.A., Naylor, R., Polasky, S.: Agricultural sustainability and intensive production practices. Nature 418, 671–677 (2002)
5. Delin, K.A., Jackson, S.P., Burleigh, S.C., Johnson, D.W., Woodrow, R.R., Britton, J.T.: The JPL Sensor Webs Project: Fielded Technology. In: Space Mission Challenges for IT Proceedings. Annual Conference Series, pp. 337–341 (2003)
6. http://www.phytech.com/products/phytalk_system.html
7. Mizoguchi, M., Mitsuishi, S., Ito, T., Ninomiya, S., Hirafuji, M., Fukatsu, T., Kiura, T., Tanaka, K., Toritani, H., Hamada, H., Honda, K.: Real-time Monitoring of Farmland in Asia using Field Server. In: International Symposium on Geoinformatics for Spatial Infrastructure Development in Earth and Allied Sciences (October 2008)
8. Shin, C.S., Joo, S.C., Lee, Y.W., Sim, C.B., Yoe, H.: An Implementation of Ubiquitous Field Server System Using Solar Energy Based on Wireless Sensor Networks. Studies in Computational Intelligence 209 (2009)
9. Yoe, H., Eom, K.-b.: Design of Energy Efficient Routing Method for Ubiquitous Green Houses. In: 1st International Conference on Hybrid Information Technology (November 2006)
10. Shin, C.S., Kang, M.S., Jeong, C.W., Joo, S.C.: TMO-Based Object Group Framework for Supporting Distributed Object Management and Real-Time Services. In: Zhou, X., Xu, M., Jähnichen, S., Cao, J. (eds.) APPT 2003. LNCS, vol. 2834, pp. 525–535. Springer, Heidelberg (2003)
11. Shin, C.S., Lee, Y.W., Lee, M.H., Park, J.W., Yoe, H.: Design of Ubiquitous Glass Green Houses. In: Software Technologies for Future Dependable Distributed Systems, pp. 169–172 (March 2009)
12. Kang, B.J., Park, D.H., Cho, K.R., Shin, C.S., Cho, S.E., Park, J.W.: A Study on the Greenhouse Auto Control System based on Wireless Sensor Network. In: International Conference on Security Technology, December 2008, pp. 41–44 (2008)
13. Lee, M.H., Be, K., Kang, H.J., Shin, C.S., Yoe, H.: Design and Implementation of Wireless Sensor Network for Ubiquitous Glass Houses. In: 7th IEEE/ACIS International Conference on Computer and Information Science, May 2008, pp. 397–400 (2008)
On-Line Model Checking as Operating System Service

Franz J. Rammig, Yuhong Zhao, and Sufyan Samara

Heinz Nixdorf Institute, University of Paderborn
Fürstenallee 11, D-33102 Paderborn, Germany
[email protected]
Abstract. A complementary verification method for real-time applications with a dynamic task structure has been developed. Here the real-time application is developed by means of Model-Driven Engineering. The basic verification technique is model checking. However, the model checking is executed at run-time whenever some reconfiguration of the task set takes place. Instead of exploring the entire state space of the model to be checked, only a partial state space at the model level covering the execution trace of the checked task is explored. This on-line model checking can be seen as an extension of the traditional schedulability acceptance test, which is needed anyway in systems with a dynamic task set. Therefore this runtime verification is implemented as a service of the underlying operating system. In this paper we describe the method in general, explain some design and implementation decisions, and provide experimental results.

Keywords: On-line model checking, Verification service, Real-time operating system.
1 Introduction
Real-time applications are safety critical in many cases. A careful quality assurance process therefore is mandatory. This process includes more and more formal verification techniques like model checking. Model checking has the advantage of being fully automated and inherently includes means for diagnosis in case of errors. On the other hand, model checking is substantially confronted with the so called state explosion problem. This means that the state space to be explored grows very quickly to an unmanageable size whenever problems of practical relevance are to be handled. Numerous approaches to overcome this deficiency have been developed, like partial order reduction [1], compositional reasoning [2], and other simplification and abstraction techniques, which aim to reduce the state space to be explored by over-approximation [3] or under-approximation
This work is developed in the course of the Collaborative Research Center 614 Self-Optimizing Concepts and Structures in Mechanical Engineering - Paderborn University, and is published on its behalf and funded by the Deutsche Forschungsgemeinschaft (DFG).
S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 131–143, 2009. c IFIP International Federation for Information Processing 2009
[4] techniques. Over-approximation techniques generate an abstract model by adding redundant behaviors to the original one (weakening constraints), such that correctness at the abstract level implies correctness of the original model. Under-approximation techniques generate an abstract model by removing irrelevant behaviors from the original one (strengthening constraints), so that falseness at the abstract level implies falseness of the original model. Applying these techniques can relieve the state explosion problem to some degree, but cannot resolve it totally. That is, the correctness of a complex system with respect to some properties cannot always be verified completely. In this paper we propose a complementary technique, namely on-line model checking (or model-based runtime verification) [5,6]. Deferring formal verification to the execution phase of a real-time application may seem a strange idea, especially in real-time computing, where one prefers to execute off-line as many activities related to a task as possible. However, we are looking at real-time applications with a highly dynamic task set. For a software system with self-adaptive capability, the task set consists of instances that are activated under various profiles; the actual environmental conditions determine which profile is used, i.e., which tasks are activated. In such real-time applications with dynamic task sets, an acceptance test concerning schedulability has to be executed whenever a new task is added to the task set. It seems natural to extend this acceptance test with a logical safety test, which may be implemented by means of model checking. But then we would be confronted with the state explosion problem again, now even under real-time constraints.
To make on-line model checking feasible, we suppose that the real-time application is developed by means of Model-Driven Engineering (MDE) [7], which is an efficient software engineering approach to complex systems development. According to MDE, we can follow three steps to develop a software system:
1. model the system according to the system specification,
2. verify the system model against the system specification, and
3. synthesize the system implementation (source code) from the system model.
Theoretically speaking, the following assertions are supposed to be true:
– The system model is consistent with the system specification.
– The system implementation is consistent with the system model.
However, are they really true under any specific running environment? We try to answer this question by doing model checking at runtime. The basic idea (as shown in Fig. 1) is to check on-line whether the monitored execution trace of the system
Fig. 1. On-line model checking framework
Fig. 2. Partial system model to be explored
conforms to the system model on the one hand, and whether a partial system model that covers the execution trace satisfies the system properties on the other hand. Here the partial system model is obtained by exploring only those states that can be reached from the current states monitored at runtime, as shown in Fig. 2. Intuitively, if this partial system model is checked safe against the system specification, and the monitored states conform to the corresponding states in the partial system model, then we have more confidence in the correctness of the actual execution trace; it does not matter even if the rest of the system model still contains errors. Of course, sophisticated techniques have to be used to make this really work, which will be detailed in the sequel. In this way, we obtain a natural solution to the state explosion problem: instead of looking at the entire state space, we pay attention only to a partial state space covering the execution trace. As a result, we do not need to simplify or abstract the value domains of system variables at all. It is worth mentioning that off-line model checking is usually valid only under the assumption that the platform on which the real-time application runs behaves correctly; this assumption is no longer needed for on-line model checking. Commonly used run-time services are usually provided by the underlying operating system, and this is exactly our approach: we provide on-line model checking as a service of the underlying Real-time Operating System (RTOS). The verification service is implemented as an isolated task in user space. This isolates model checking from the task to be verified and makes sure that errors in the task cannot infect the verification service. To enhance efficiency, the verification service runs in its own address space, which is attached to the kernel address space. The address space of the application is mapped into this verification address space as a "read only" partition. This avoids cache refilling on context switches between the verification service and the task to be checked, and allows the verification service fast access to the task's state variables.
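The partial exploration sketched in Fig. 2 amounts to a bounded search from the monitored current state. A minimal sketch in Python (the successor-map representation of the model and the safety predicate are illustrative assumptions):

```python
from collections import deque

def check_partial(model, current_state, safe, depth):
    """Explore only the states reachable from the monitored current state,
    up to `depth` transitions, checking a safety predicate on each visited
    state. `model` maps a state to its successors. Returns (ok, witness)."""
    visited = {current_state}
    queue = deque([(current_state, 0)])
    while queue:
        s, d = queue.popleft()
        if not safe(s):
            return False, s      # unsafe state reachable within the bound
        if d == depth:
            continue             # do not explore beyond the look-ahead bound
        for t in model.get(s, ()):
            if t not in visited:
                visited.add(t)
                queue.append((t, d + 1))
    return True, None            # partial state space is safe up to `depth`
```

As the monitored current state advances, the exploration front advances with it, so the full state space is never built.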
State-of-the-art runtime verification has been discussed in the literature for years (see Section 4). The basic idea is to monitor the execution of the source code and to check the so-far-observed execution trace against the given properties, usually specified as LTL formulas. The checking progress always falls behind the system execution, because the checking procedure can continue only after a new state has been observed. In contrast, our runtime verification is applied at the model level. The states observed from the execution trace are mainly used to reduce the state space to be explored at the model level. That is, the checking progress is not strictly bound to the progress of the system execution; our on-line model checking might run ahead of or behind the execution of the source code. If the processing speed is fast enough, our runtime verification can keep looking a certain number of time steps ahead of the system execution and then tell the real-time application how many time steps ahead are safe.
2 Problem Statement
Without loss of generality, let M = {M1, M2, ..., Mn} model a real-time reconfigurable system which consists of n (> 0) components M1, M2, ..., Mn running in parallel. M may reconfigure itself at runtime either by adding a new component Mi to M or by removing an existing component Mi from M. This also includes replacing one component with another, as shown in Fig. 3, which can be done by consecutive removing and adding operations. The components in M can communicate with each other only through the underlying RTOS. This forms a dependency relationship between the components in M. Without doubt, system reconfiguration might more or less affect the behavior of the related components in the system. What is more, the impact of the RTOS on the inter-process communication might also affect the behavior of the related components. For instance, component B might be affected most by replacing component C with component E in Fig. 3. Could these effects violate some safety conditions associated with the related components in the system?
Fig. 3. A reconfiguration example
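Since all inter-component communication passes through the RTOS, the RTOS can determine which components a reconfiguration may affect. A minimal sketch (the component names and the dependency-map representation are illustrative):

```python
def affected_components(dependencies, removed, added):
    """Return the components whose behavior may be affected by replacing
    `removed` with `added`. `dependencies` maps each component to the set
    of components it communicates with via the RTOS."""
    affected = set()
    for comp, deps in dependencies.items():
        if removed in deps or added in deps:
            affected.add(comp)
    return affected
```

For the example in Fig. 3, a component B that depends on C would be reported as affected when C is replaced by E.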
Since the reconfiguration might occur according to the actual running environment, it is hard to answer this question by off-line verification techniques alone, due to unpredictable factors. Therefore, it is necessary to check on-line, at the model level, whether the most affected component in M still maintains safety after the system reconfiguration. In doing so, we suppose that whenever the real-time application needs to reconfigure itself, the RTOS is informed about this in advance. As any modification of the task set can happen only under the control of the RTOS, this requirement is reasonable. With the information given by the real-time application, the RTOS triggers the verification service as an isolated task in user space and then schedules the verification service as early as possible before the component to be checked, without violating the real-time deadlines. To achieve this, we follow a deterministic approach by reserving a fixed time slot at the beginning of each scheduling cycle of the RTOS. This time slot is mainly reserved for the verification service. If there is no active on-line model checking task, the scheduler is allowed to allocate this slot to preemptive low-priority tasks, which can be moved and replaced by on-line model checking at any time the verification service is triggered. In this way, if the checking process is efficient enough, we can always check at the model level what might happen in the near future relative to the current state of the component's execution. In case an error is detected, or the checking progress falls behind the execution of the checked task, the real-time application is informed in order to allow it to undertake appropriate countermeasures. The properties to be checked are safety conditions that might be sensitive to the context of the related component. LTL (or ACTL) formulas are used to formally specify the safety properties, as the discrete-time extensions of LTL (or ACTL) formulas are just shorthand notations for the usual LTL (or ACTL) formulas [8].
3 On-Line Model Checking

3.1 Overview
As mentioned in Section 1, we suppose that the real-time application is developed following the MDE approach. In this way, we can model the system in UML with real-time extension1 on the one hand and specify constraints in OCL with real-time extension on the other hand. From the real-time UML model, we can derive an FSM model and synthesize source code, respectively. Since the FSM model and the source code come from the same origin, there exists a mapping function σ from concrete states (derived from the source code) to abstract states (derived from the FSM model). From the real-time OCL constraints, we can derive the LTL (or ACTL) formulas and then transform them into Büchi automata. Having the concrete model (source code), the abstract model (FSM model), and the properties (Büchi automata) at hand, our on-line model checking aims to check
1 http://wwwcs.uni-paderborn.de/cs/fujaba/
Fig. 4. Overview
if the execution trace conforms to the FSM model (consistency checking) and meanwhile if a partial state space of the FSM model conforms to the Büchi automaton (safety checking), as shown in Fig. 4. Here the partial state space reflects the near future relative to the current state observed from the execution trace of the running system.

3.2 Model Checking Paradigm
Recall that we have reserved a fixed time slot at the beginning of each scheduling cycle for on-line model checking. Without loss of generality, let the time slot be td time units. After the verification service is triggered, in each scheduling cycle we have td time units to do on-line model checking from the current state of the task to be checked, which was obtained in the previous scheduling cycle, as shown in Fig. 5. Of course, the current (concrete) state must first be mapped to the corresponding abstract state at the model level before it can be used by model checking. If the current state cannot be mapped to an appropriate abstract state, the execution trace no longer conforms to the behavioral model; in this case, the verification service terminates the checking process and informs the RTOS to deal with the problem. Otherwise, the on-line model checking continues until one of the following two cases happens:

Case No: if at some time point an error is detected, the verification service terminates with the answer No to the real-time application via the RTOS.

Case Yes: if a sufficient partial state space that covers the execution trace of the task is successfully checked, the verification service reports a definite Yes to the real-time application via the RTOS and then terminates the safety checking process (while the consistency checking can continue if necessary).

Notice that the "No" case only means that the detected errors might happen in the future, because we check at the model level and thus do not know whether
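One step of this mapping and consistency check can be sketched as follows; σ is represented here as a plain dictionary from concrete to abstract states, and the FSM model as a successor map, both of which are illustrative representations:

```python
def consistency_step(sigma, abstract_successors, prev_abs, concrete_state):
    """One consistency-checking step: map the monitored concrete state to
    an abstract state via sigma and verify it is a legal successor of the
    previously matched abstract state. Returns the new abstract state, or
    None if the execution trace no longer conforms to the FSM model."""
    abs_state = sigma.get(concrete_state)
    if abs_state is None:
        return None              # no abstract counterpart: inform the RTOS
    if prev_abs is not None and abs_state not in abstract_successors[prev_abs]:
        return None              # not a transition allowed by the FSM model
    return abs_state
```

Each returned abstract state then serves as the starting point for exploring the partial state space of the next look-ahead window.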
Fig. 5. Scheduling of verification service and task to be checked
the errors are spurious or not. To prevent the errors from really happening, we conservatively choose to inform the real-time application that an error might emerge in the future. That is, the RTOS might raise an exception together with a counterexample (if necessary). How to handle the exception is application-domain specific, so we do not discuss it here. The implementation of a component is in fact a refinement of the model of the component, i.e., the model is an abstraction of the implementation of the component. Thus, an ACTL/LTL formula being true at the model level implies that it is also true at the implementation level, while its being false at the model level does not imply that it is also false at the implementation level. In this sense, our runtime verification is conservative due to its being applied at the model level. However, the advantage of predicting and thus avoiding potential errors is gained precisely because it is applied at the model level. Experimental Results. A stand-alone prototype for on-line model checking of invariants, LTL and ACTL properties has been implemented. We have done some experiments on the BEEM2 benchmark set, derived from mutual exclusion algorithms, communication protocols and other examples from research and industry. The benchmark set contains only FSM models, so we randomly generate the execution traces from the same FSM models. This simplifies the monitoring procedure of capturing the runtime information (current states) to be used by on-line model checking. In this way, we can estimate the performance of our verification service to some degree. Two experiments were done on a Pentium-IV 3.00 GHz processor with 1 GB memory running Linux. One experiment is on-line invariant checking. This experiment helps find out the influence of the out-degrees of the states on the look-ahead performance, i.e., how far away the model checking can look ahead from each state of the given model within a predefined time interval.
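The per-cycle checking paradigm described above can be sketched as follows. This is a hedged, illustrative reconstruction, not the authors' implementation: the names `check_cycle`, `abstraction` (the mapping from concrete to abstract states) and `bad_states` are our assumptions, and the model is a plain successor map.

```python
import time

def check_cycle(model, abstraction, concrete_state, bad_states, td_seconds):
    """Run one on-line model-checking slot of at most td_seconds.

    Returns "inconsistent" if the trace no longer conforms to the model,
    "no" if a (potential) error is detected in the explored partial state
    space, "yes" if the partial state space was exhausted without errors,
    or "undecided" if the time budget ran out first.
    """
    start = time.monotonic()
    current = abstraction.get(concrete_state)
    if current is None:                      # trace does not conform to the model
        return "inconsistent"
    frontier, visited = [current], {current}
    while frontier:
        if time.monotonic() - start > td_seconds:
            return "undecided"               # resume from here in the next cycle
        state = frontier.pop()
        if state in bad_states:              # safety violation found at model level
            return "no"
        for succ in model.get(state, []):
            if succ not in visited:
                visited.add(succ)
                frontier.append(succ)
    return "yes"                             # partial state space fully checked
```

A real service would additionally carry the `visited`/`frontier` sets across scheduling cycles instead of restarting the exploration each time.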
16 typical models were selected from the BEEM benchmark set to perform on-line invariant checking. The features of these models are given in terms of the number of states, the number of transitions, the average degree of the states, the height of BFS, and the maximal stack of DFS, as well as the number of Boolean (state) variables. In this experiment, each transition in the models is set to represent 1 millisecond, i.e., it takes 1 millisecond to move from one state to the next. We also refer to one transition as one time step. For each model, this
2 http://anna.fi.muni.cz/models/
F.J. Rammig, Y. Zhao, and S. Samara
| Model | Type | States | Transitions | Avg. Degree | Max. Out-degree | BFS Height | Max. DFS Stack | Boolean Variables | Transition Time Unit (ms) | Min. Look-ahead | Max. Look-ahead | Avg. Look-ahead |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sorter_1 | Controller | 20544 | 30697 | 1.5 | 5 | 198 | 617 | 36 | 1 | 40 | 299 | 103 |
| collision_1 | Communication protocol | 5593 | 10792 | 1.9 | 5 | 57 | 617 | 25 | 1 | 26 | 81 | 48.7 |
| synapse_2 | Protocol | 61048 | 125334 | 2.1 | 18 | 41 | 2349 | 46 | 1 | 7 | 28 | 21.5 |
| driving_phils_2 | Mutual exclusion algorithm | 33173 | 81854 | 2.5 | 9 | 150 | 3702 | 27 | 1 | 31 | 97 | 65.7 |
| blocks_1 | Planning and scheduling | 7057 | 18552 | 2.6 | 6 | 19 | 4263 | 23 | 1 | 8 | 21 | 14 |
| peterson_1 | Mutual exclusion algorithm | 12498 | 33369 | 2.7 | 5 | 54 | 1862 | 30 | 1 | 13 | 39 | 31.7 |
| szymanski_1 | Mutual exclusion algorithm | 20264 | 56701 | 2.8 | 3 | 72 | 2064 | 27 | 1 | 13 | 90 | 49.7 |
| hanoi_1 | Puzzle | 6561 | 19680 | 3 | 3 | 256 | 4376 | 36 | 1 | 56 | 103 | 75.9 |
| iprotocol_2 | Communication protocol | 29994 | 100489 | 3.4 | 7 | 91 | 443 | 39 | 1 | 18 | 451 | 50 |
| phils_3 | Mutual exclusion algorithm | 729 | 2916 | 4 | 6 | 17 | 518 | 18 | 1 | 156 | 357 | 265 |
| cyclic_scheduler_1 | Protocol | 4606 | 20480 | 4.4 | 8 | 55 | 1819 | 40 | 1 | 23 | 437 | 278 |
| rushhour_1 | Puzzle | 1048 | 5446 | 5.2 | 9 | 73 | 535 | 28 | 1 | 66 | 248 | 150.7 |
| rushhour_2 | Puzzle | 2242 | 12603 | 5.6 | 10 | 80 | 906 | 32 | 1 | 36 | 408 | 116.4 |
| pouring_1 | Puzzle | 503 | 4481 | 8.9 | 9 | 13 | 348 | 16 | 1 | 42 | 101 | 71.9 |
| reader_writer_2 | Protocol | 4104 | 49190 | 12 | 19 | 13 | 4097 | 25 | 1 | 4 | 16 | 9.9 |
| pouring_2 | Puzzle | 51624 | 1232712 | 23.9 | 25 | 15 | 44509 | 18 | 1 | 1 | 4 | 2 |
Fig. 6. Experimental result of on-line invariant checking
experiment is designed to compute how many time steps model checking can look ahead from each state in the model within one time step (i.e., 1 ms). So the invariant to be checked is a Boolean formula derived from the set of states in each model. The experimental results in Fig. 6 show the minimal, the maximal and the average look-ahead from the states of each model. It is easy to see that the maximal out-degree of a model has a larger influence on look-ahead performance than the average degree of the model. The other experiment is on-line LTL model checking. The model driving_phils_2 is derived from a mutual exclusion algorithm of processes accessing several resources, motivated by "The Driving Philosophers" in [9]. The property to be checked is G(ac0 → F gr0), where the proposition ac0 denotes that process 0 requests a resource and the proposition gr0 denotes that the resource is granted to process 0. In other words, if process 0 requests a resource, it will eventually be granted. The experimental result in Fig. 7 is obtained by setting td = 5 ms and running 2000 scheduling cycles. That is, at each scheduling cycle the verification service is allocated 5 ms to perform on-line model checking. The property is not violated at least up to these 2000 checking rounds. Fortunately, the verification service can always run enough time steps ahead of the simulated execution of this model. The minimal look-ahead is 23 time steps, the maximal look-ahead is 74 time steps and the average look-ahead is 57.2 time steps relative to the corresponding current states monitored from the randomly generated execution trace. Compared to the usual (off-line) model checking, our on-line model checking can reduce the state space to be explored by using the monitored states obtained while the system is running. From this point of view, the computational complexity of on-line model checking is less than that of traditional model checking.
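The look-ahead measurement can be reconstructed as a time-bounded breadth-first exploration, where one BFS level corresponds to one time step. This sketch is an assumption about how such a measurement could be implemented; the function name and data layout are ours, not the authors'.

```python
import time

def look_ahead(model, start, budget_seconds):
    """Count how many time steps (BFS levels) can be explored from `start`
    within the given time budget. One level = one transition = one time step."""
    deadline = time.monotonic() + budget_seconds
    level, visited = {start}, {start}
    steps = 0
    while level and time.monotonic() < deadline:
        nxt = set()
        for state in level:
            for succ in model.get(state, []):
                if succ not in visited:
                    visited.add(succ)
                    nxt.add(succ)
        if not nxt:
            break                # whole reachable state space covered
        level = nxt
        steps += 1               # one more time step explored ahead
    return steps
```

Running this from every state of a model and taking the minimum, maximum and mean would yield the three look-ahead columns reported in Fig. 6.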
Compared to the usual runtime verification, our runtime verification checks the
Fig. 7. Experimental result of on-line LTL checking
system properties at the model level, while using the monitored states just to do consistency checking and then to shrink the state space to be explored. As a result, the computational complexity of the model-based runtime verification is greater than that of conventional runtime verification. However, if we make our model-based runtime verification look ahead only several time steps at each checking round, then its computational complexity in terms of time and memory overhead will be closer to that of state-of-the-art runtime verification. In addition, our model-based runtime verification can check more general properties specified by ACTL and/or LTL formulas, since [10] shows that the property patterns to be checked in practice are usually not very complex.
3.3 Pre-checking and Post-checking
Ideally, we wish that on-line model checking could always run enough (time) steps ahead of the execution of the task to be verified. This depends on the complexity of the behavioral model of the task as well as on the underlying hardware architecture. Therefore, we have to face the reality that the verification service might fall behind the execution of the task to be checked. As a result, we introduce two checking modes: pre-checking and post-checking. We say that the verification service is in pre-checking mode if it runs ahead of the execution of the task to be checked; otherwise, it is in post-checking mode, as shown in Fig. 8. In pre-checking mode, the verification service can naturally predict violations before they really happen. In post-checking mode, it seems that violations could only be detected after they have already happened. Fortunately, it is still possible to "predict" violations even in post-checking mode because our on-line verification works at the model level. In case an error is found at some place
Fig. 8. Pre-checking and Post-checking
other than on the monitored execution trace in the partial state space being checked, then we can "predict" that there might be an error in the model which has not happened yet. In this sense, both checking modes are useful for safety-critical systems. Notice that our on-line model checking can observe the actual execution trace of the task being checked once it falls behind. This means that only a rather small state space needs to be explored in post-checking mode. Thus, there is still a chance for the verification service to overtake the task being checked again. From this point of view, it seems as if the verification service and the task are involved in a two-player game. In the course of the game, we say that the verification service wins against the task being checked if the verification service holds the leading position for a longer time than the task does. Without doubt, we need to find an improved strategy to give the verification service a higher probability of winning against the task to be checked. Recall that the source code of the system implementation is usually validated by simulation and testing. Therefore, in the future we are going to learn some heuristic knowledge in the system testing phase so that the system model can be enriched with more useful information. This heuristic information can then guide on-line model checking to reduce the state space to be explored whenever necessary.
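The pre-/post-checking distinction and the "two-player game" reading above can be made concrete with a minimal sketch. The names (`checking_mode`, `game_winner`) and the representation of progress as integer time steps are our assumptions for illustration only.

```python
def checking_mode(verified_step, execution_step):
    """verified_step: farthest time step already checked at model level;
    execution_step: the task's current time step."""
    if verified_step > execution_step:
        return "pre-checking"    # violations can be predicted before they occur
    return "post-checking"       # only the observed trace needs exploration

def game_winner(history):
    """history: one (verified_step, execution_step) pair per scheduling cycle.
    The verification service wins if it leads in more cycles than it trails."""
    lead_cycles = sum(1 for v, e in history if v > e)
    if lead_cycles > len(history) - lead_cycles:
        return "verification service"
    return "task"
```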
4 Related Work
Unlike our on-line model checking, state-of-the-art runtime verification takes the system implementation and the system specification into account. The basic idea is to monitor the execution of the source code and afterwards check the so-far observed execution trace against the system properties, usually specified by LTL formulas. This kind of runtime verification can only do post-checking, i.e., the checking progress always falls behind the system execution because the checking procedure can continue only after a new state has been observed. Consequently, property violations are usually detected after they have already happened. Notice that even if a property is checked to be correct with this approach, it
does not imply that the monitored execution trace conforms to the system model, nor that the system model satisfies the same property as well. The former depends on the consistency between the system implementation and the system model, while the latter depends on the granularity of the system model and the property automaton to be checked. Typically, [11] presents runtime checking of the behavioral equivalence between a component implementation and its interface specification by writing the interface specification in the executable AsmL, so that one can synchronously run the interface specification and the component implementation while monitoring whether they are equivalent on the observed behaviors; [12] presents runtime certified computation, whereby an algorithm not only produces a result for a given input, but also proves by deductive reasoning that the result is correct with respect to the given input; [13] presents runtime checking of the conformance between a concurrent implementation of a data structure and a high-level executable specification with atomic operations, by first instrumenting the implementation code to extract the execution information into a log and then executing a verification thread concurrently with the implementation, using the logged information to check whether the execution conforms to the high-level specification; [14] presents monitoring-oriented programming (MOP) as a lightweight formal method to check conformance of an implementation to its specification at runtime, by first inserting specifications as annotations at various user-selected places in programs and then translating the annotations into efficient monitoring code in the same target language as the implementation during a pre-compilation stage.
Similar to MOP, Temporal Rover [15] is a commercial code generator that allows programmers to insert specifications in programs via comments and then generates from the specifications the executable verification code, which is compiled and linked as part of the application under test. In addition, Java PathExplorer (JPaX) [16] is a runtime verification environment for monitoring the execution traces of a Java program by first extracting events from the executing program and then analyzing the events via a remote observer process. Moreover, [17] extends the usual runtime verification techniques to on-line verify and steer a Discrete Event System (DES) by looking ahead into a partial system model to predict violations and then applying steering actions to prevent them. This method requires that the time delay for the DES to move from the current state to the next state be long enough that the runtime checking has sufficient time to explore a partial system model, which is generated after the current state is known. Our on-line model checking can explore the system model even before the current state is known, and then shrinks the state space after the current state is known. That is, the progress of our runtime verification is not strictly bound to the execution of the source code, i.e., it may run before or after the system execution. If the processing speed is fast enough, our runtime verification can keep running a certain number of time steps ahead of the system execution and then tell the system how many time steps ahead are safe. Also, our runtime verification can check more general properties specified by ACTL and/or LTL formulas.
5 Conclusion
On-line model checking has the potential to serve as a powerful complementary verification technique for real-time applications with dynamic task sets. It is complementary in the sense that we have to assume that the newly accepted task has been verified off-line under the assumptions necessary for such a verification. The on-line model checking can then be restricted to verifying whether the actual execution trace is correct under the real environmental conditions. As our on-line model checking dramatically reduces the state space to be verified, a much finer granularity concerning the value domains of system variables can be handled. By this, and due to the fact that a priori unknown run-time conditions can be considered as well, our run-time verification establishes an additional safety level. This method can also be seen as a complementary attempt to overcome the well-known state explosion problem of model checking. Whenever the state space is reduced, it is essential to reduce it to the states that are relevant. Our method automatically and dynamically reduces the state space to exactly those states that are relevant for the actual execution trace. The resulting verification method can be implemented as an operating system service comparable to the schedulability acceptance test that is part of any RTOS able to handle dynamic task sets. This service is triggered whenever some reconfiguration of the task set to be handled takes place. In contrast to the traditional a posteriori runtime verification methods published so far, our approach can look into the future, i.e., into a partial state space at the model level relative to the current state of the execution trace. Experimental results show that run-time model checking is possible when the approach outlined in this paper is followed. Although these experiments have so far been carried out based on simulations, there is a strong indication that systems of practical relevance can be handled as well.
References
1. Godefroid, P.: Partial-Order Methods for the Verification of Concurrent Systems. LNCS, vol. 1032. Springer, Heidelberg (1996); foreword by Wolper, P.
2. Berezin, S., Campos, S.V.A., Clarke, E.M.: Compositional reasoning in model checking. In: de Roever, W.-P., Langmaack, H., Pnueli, A. (eds.) COMPOS 1997. LNCS, vol. 1536, pp. 81–102. Springer, Heidelberg (1998)
3. Clarke, E.M., Grumberg, O., Long, D.E.: Model checking and abstraction. ACM Trans. Program. Lang. Syst. 16(5), 1512–1542 (1994)
4. Lee, W., Pardo, A., Jang, J.Y., Hachtel, G., Somenzi, F.: Tearing based automatic abstraction for CTL model checking. In: ICCAD 1996: Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design, Washington, DC, USA, pp. 76–81. IEEE Computer Society, Los Alamitos (1996)
5. Zhao, Y., Oberthür, S., Kardos, M., Rammig, F.J.: Model-based runtime verification framework for self-optimizing systems. Electr. Notes Theor. Comput. Sci. 144(4), 125–145 (2006)
6. Zhao, Y., Rammig, F.J.: Model-based runtime verification framework. In: Proceedings of Formal Engineering Approaches to Software Components and Architectures (FESCA 2009), York, UK (March 2009)
7. Kent, S.: Model driven engineering. In: Butler, M., Petre, L., Sere, K. (eds.) IFM 2002. LNCS, vol. 2335, pp. 286–298. Springer, Heidelberg (2002)
8. Clarke, E.M., Grumberg, O., Peled, D.A.: Model Checking. MIT Press, Cambridge (1999)
9. Baehni, S., Baldoni, R., Guerraoui, R., Pochon, B.: The driving philosophers. In: Proceedings of the 3rd IFIP International Conference on Theoretical Computer Science (TCS 2004) (2004)
10. Dwyer, M.B., Avrunin, G.S., Corbett, J.C.: Patterns in property specifications for finite-state verification. In: ICSE 1999: Proceedings of the 21st International Conference on Software Engineering, pp. 411–420. IEEE Computer Society Press, Los Alamitos (1999)
11. Barnett, M., Schulte, W.: Spying on components: A runtime verification technique. In: Leavens, G.T., Sitaraman, M., Giannakopoulou, D. (eds.) Workshop on Specification and Verification of Component-Based Systems (October 2001)
12. Arkoudas, K., Rinard, M.: Deductive runtime certification. In: Proceedings of the 2004 Workshop on Runtime Verification (RV 2004), Barcelona, Spain (April 2004)
13. Tasiran, S., Qadeer, S.: Runtime refinement checking of concurrent data structures. In: Proceedings of the 2004 Workshop on Runtime Verification (RV 2004), Barcelona, Spain (April 2004)
14. Chen, F., Rosu, G.: Towards monitoring-oriented programming: A paradigm combining specification and implementation. In: Proceedings of the 2003 Workshop on Runtime Verification (RV 2003), Boulder, Colorado, USA (2003)
15. Drusinsky, D.: The Temporal Rover and the ATG Rover. In: Havelund, K., Penix, J., Visser, W. (eds.) SPIN 2000. LNCS, vol. 1885, pp. 323–330. Springer, Heidelberg (2000)
16. Havelund, K., Rosu, G.: Java PathExplorer — a runtime verification tool. In: Proceedings of the 6th International Symposium on Artificial Intelligence, Robotics and Automation in Space (ISAIRAS 2001), Montreal, Canada (June 2001)
17. Easwaran, A., Kannan, S., Sokolsky, O.: Steering of discrete event systems: Control theory approach. Electr. Notes Theor. Comput. Sci. 144(4), 21–39 (2006)
Designing Highly Available Repositories for Heterogeneous Sensor Data in Open Home Automation Systems
Roberto Baldoni, Adriano Cerocchi, Giorgia Lodi, Luca Montanari, and Leonardo Querzoni
Dipartimento di Informatica e Sistemistica "A. Ruberti", Sapienza Università di Roma, Rome, Italy
{baldoni,cerocchi,lodi,montanari,querzoni}@dis.uniroma1.it
Abstract. Smart home applications are currently implemented by vendor-specific systems managing mainly a small number of homogeneous sensors and actuators. However, the sharp increase in the number of intelligent devices in a house and the foreseen explosion of the smart home application market will completely change this vendor-centric scenario towards open, expandable systems made up of a large number of cheap heterogeneous devices. As a matter of fact, new smart home solutions have to be able to tackle scalability, dynamicity and heterogeneity requirements. In this paper we present the architecture of a basic building block, namely a distributed repository service, for smart home systems. The repository stores data from heterogeneous devices deployed in the house, which can then be retrieved by context-aware applications implementing home automation functionalities. Our architecture, based on a DHT, offers a completely decentralized and reliable storage service able to support complex query functionalities.
1 Introduction
Thanks to recent progress in the areas of wired and wireless networking, sensor networks, networked appliances and embedded computing, all the enabling technologies needed to realize the vision of smart automation in home environments seem to be available. Despite this fact, currently available smart home applications are mainly represented by complex prototypes that still have a long way to go before reaching the status of commercial products. Existing applications are developed primarily with proprietary technology and seem to lack a long-term vision of evolution and interoperation. The future market for smart home applications will comprise a wide variety of devices and services from different manufacturers and developers. We must therefore achieve platform and vendor independence as well as architectural openness before smart homes become commonplace.
The work described in this paper was partially supported by the EU Projects SM4All, SOFIA and eDIANA.
Future open smart home applications will be ready for the market only if they meet the following requirements:
Scalability - current applications are limited to a small number of devices, but an open architecture would open the market to many different vendors offering a wider selection of devices, thus raising the current limit to hundreds or even thousands of devices per home;
Dynamics - while current applications are mainly based on cabled devices installed by experts, we envisage a future where new devices can be added to existing environments in a plug-and-play fashion and where wearable devices can follow users and join the home environment only when the user enters it. In this scenario, future applications must be able to tolerate or even leverage the dynamic environments where they will be required to run;
Heterogeneity - a large base of available devices offered by different vendors will clearly increase the heterogeneity of the environments, causing interoperability issues and requiring new approaches for resource sharing and scheduling;
Reliability - as users start to put confidence in smart home applications, the reliability aspects of these applications will gain more importance, requiring the definition of new techniques to guarantee their correct behaviour.
In this scenario, a fundamental building block is represented by a repository where devices (e.g., sensors, actuators, etc.) store data that can be retrieved either by the devices themselves or by some context-aware application for further processing. This paper describes the design of a scalable and reliable repository well suited to smart home applications. To this aim, the repository is implemented in a fully distributed fashion. Processes constituting the repository can be deployed on the various devices located in the house that offer sufficient computational and storage resources; TVs, smart phones, PCs, refrigerators, etc. are examples of such devices.
Processes cooperate in a peer-to-peer fashion to implement storage and query functionalities in a dynamic, scalable and reliable way. In order to accommodate the heterogeneity of the data that could be stored in the repository, these functionalities are realized through a mapping component able to efficiently store and organize pieces of data so as to facilitate their search and retrieval. The rest of this paper is organized as follows: Section 2 introduces the architecture of the repository, detailing its internal components and explaining how data can be stored in and retrieved from it; Section 3 introduces an application scenario that shows a possible usage of the repository in a realistic setting; Section 4 describes related work in this field of research and, finally, Section 5 concludes the paper.
2 Architecture of the Repository
This section is devoted to describing the architecture of our repository; however, before delving into its details, we will briefly introduce the reader to the smart home environment where our repository will be deployed.
2.1 The Smart Home Environment
Traditionally, smart home solutions (e.g., [1,5,6,2]) were provided by a single vendor, used a single, often closed, communication standard, and were expensive. In the future of this area we envisage a scenario where a single home will host a large number (up to hundreds) of devices coming from different vendors, with different hardware, communication interfaces, and operating software, that will interoperate and cooperate to offer complex services to house residents [9]. The cooperation among widely different technologies will be guaranteed through the use of middleware platforms [7] and interconnection standards [3,4] able to hide from software developers the complexities stemming from the unavoidable differences among communication standards. Devices interacting within home environments will be characterized by different capabilities and different available resources: dumb temperature/light sensors, automatic blinds, phones, light switches, home appliances, media centers, PCs, etc. Most of them will offer different kinds of services, like reading the current temperature value or showing a high-definition movie on a TV. Some of them, arguably the more powerful ones, will be able to make part of their resources available to the system to offer storage space or computing power. These resources will be used to build and offer to house inhabitants complex services that could not be offered by devices working alone in a "digitally closed" world. In the following we will assume that all these devices are able to communicate using a common middleware infrastructure. We also assume that all devices and services running in the system agree on a common schema representing the environment. This schema contains an organized description of all the devices present in the environment together with the services they offer and the state they maintain.
Such a shared schema can be queried and then used to automatically compose services offered by different devices. Figure 1 depicts a possible general schema for an environment hosting a number of heterogeneous sensor and actuator devices; note that the schema defines a hierarchy of elements through IS-A relationships; every element can contain a set of attributes used to describe its state (for example, a temperature sensor could have an attribute "current temperature"). The schema also describes the types and admitted values for every attribute it defines. The schema can be represented within the architecture using a markup language like OWL
Fig. 1. General descriptive schema of the environment
[8]. In this paper we assume that the schema is given and static, and that it is known by all devices in the system. Devices produce data represented by XML documents whose format respects the schema; we can imagine a document as a set of nested tags completely specifying the device where the data originated and the content of the data itself. Using a common format to describe data is fundamental to support operations among heterogeneous sources.
2.2 Architecture Overview
In this section we present the architecture of our repository service. The service is provided by a distributed set of processes, all running the same software component, that cooperate to provide the required functionalities: (i) storage of data provided by devices and (ii) retrieval of data matching queries possibly issued by devices or other software components. These processes can exchange messages using the communication primitives provided by the underlying middleware; here we do not make any specific assumptions on these primitives, as a simple TCP/IP-like channel would perfectly suit our needs. Figure 2 depicts the internal architecture of the repository service. It consists of three main components: the Mapper, a hash function and a key-value storage. Arrows in the picture depict data as it flows through the repository; the upper half of the picture shows what happens when an external component queries the repository for some data, while the bottom half shows the repository actions when some data provided by a device must be stored. Data provided to the repository is stored in a key-value storage component. This component provides a simple interface: a store(key,data) primitive that is used to store
Lookup keys
Results
Mapper Data
Mapped Data
Hash Data storage key
Key-Value storage
Repository
DATA
STORE
store(key,data)
RETRIEVE
mappings
lookup(key)
Query
XML data Storage key
Fig. 2. Architecture of the repository. The figure also shows main operations that can be executed on it: data storage (bottom half) and retrieval (top half).
a document data with an identifier key, and a lookup(key) primitive that returns all data associated to identifier key. Keys are usually represented by strings of bits. A key-value storage represents a very simple and flexible method to store large amounts of data; thanks to its simplicity, it can easily be distributed to improve its scalability. However, the interface it provides limits users to exact-match-like data searches. This severely reduces the expressiveness of queries that can be issued to the repository; in this way it would be impossible to issue a query like "retrieve all data from devices that sensed a temperature higher than 21°C in the last three hours". Such complex queries are quite common in smart home scenarios, because knowledge about a specific service context must often be built from scratch without retrieving the complete environmental context (e.g., I do not want to know the temperature in every room of the house if the service will be delivered only in the kitchen). To increase the flexibility of queries issued to the repository, we introduce the Mapper component, whose goal is to decouple interactions with the key-value storage by automatically mapping complex queries to sets of keys. Mappings must be defined such that, if some data generated by a device matches a query, then the data and the query are mapped to two sets of keys with non-empty intersection. This intersection property guarantees the correct behaviour of the data retrieval functionality. Mapping is realized by transforming original data and queries into their mapped versions and then hashing such documents to obtain keys that can be used to access the key-value storage.
2.3 The Key-Value Storage
This functional component can be embodied by any service able to store data associated to an identifier key. Many different technologies can be used to implement this service.
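The store/lookup interface described above can be illustrated with a trivial in-memory stand-in; this is a sketch under the assumption of a single local process, whereas the paper's actual service distributes the same interface over a DHT.

```python
from collections import defaultdict

class KeyValueStorage:
    """Minimal illustration of the store(key,data)/lookup(key) interface."""

    def __init__(self):
        self._table = defaultdict(list)

    def store(self, key, data):
        # Several documents may share a key, so we append rather than overwrite.
        self._table[key].append(data)

    def lookup(self, key):
        # Returns all data associated to the identifier key (possibly empty).
        return list(self._table[key])
```

Note the exact-match limitation discussed above: `lookup` can only answer for one concrete key, which is why the Mapper is needed to translate range queries into sets of keys.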
Given the need to run the repository on a possibly large number of devices that can be added or removed at runtime, we advocate the usage of a Distributed Hash Table (DHT). DHTs implement a simple key-value storage as a completely distributed and self-organizing system of processes running on different hosts. A DHT is able to change its composition at runtime by allowing new processes to join the storage system or by gracefully removing nodes that silently left it. Data stored within the DHT is automatically moved and replicated to adapt to the new system configuration and to resist unexpected changes. Moreover, thanks to their completely decentralized functionalities, DHTs are able to scale to a large number of processes and fairly balance the load among them. Past research on DHTs mainly focused on developing systems for large-scale wide area network settings [12]; however, given their nice properties with respect to reliability, scalability and the ability to autonomously react to changes in the system, we think DHTs could perfectly fit our smart home scenario.
2.4 The Mapper
Each query issued to the repository is an XML document with the same format used by devices to store their data, but where attributes can contain data or complex operators. These operators can be simple comparison operators (>, =, <) for numerical values, regular expressions for strings, or a composite construct with various constraints. When a query reaches the Mapper component, these complex constraints expressed on various attributes are
Designing Highly Available Repositories for Heterogeneous Sensor Data
used to derive from the original query a set of mapped queries; each of these mapped queries has a different value specified for every attribute. The translation from an attribute constraint to one or more specified values is executed using a mapping for the attribute. Mappings for the various attributes are retrieved from a local storage inside the Mapper component. Mappings must be defined for every attribute associated to every element of the schema. A mapping is built by partitioning the set of all valid values for an attribute into subsets and electing a representative value for each of them. This operation is quite intuitive if we consider a numerical attribute with values bounded by min and max: all the values in this interval can be partitioned into contiguous intervals and the first value of each interval can be elected as the representative. For example, the following mapping can be considered for a temperature sensor with an attribute “current temperature” whose values are floating point numbers bounded by -10 and 60:

/Device/Sensor/Temperature/current temperature ⇒ [−10, 0, 10, 20, 30, 40, 50]

Mappings are strictly related to the schema describing the environment; therefore we assume that they are defined at startup time and remain unchanged at runtime. Issues related to runtime updates to mappings will be considered in future work.

2.5 Data Storage and Retrieval

Let us now detail how data produced by devices can be stored in and then retrieved from the repository. For the sake of simplicity we will first show query management. When a query is issued to the repository it is first passed to the Mapper component. Figure 3 shows how a query is managed within the Mapper component. The Mapper splits it into its components, i.e. the attributes and their constraints (phase 1). It then checks its content one attribute at a time and retrieves the corresponding mapping from the internal storage.
Constraints associated to the attribute are then matched (phase 2) against the sets of values contained in the mapping: all intervals that do not match the constraints are discarded (grey intervals in the figure). When all attributes have
Fig. 3. Query management within the Mapper component
R. Baldoni et al.
been considered, the Mapper component builds the XML documents representing the mapped queries by combining, for all the different attributes, all the representative values from intervals that have not been discarded in the previous phase (phase 3); each mapped query contains, for each attribute of the original query, one of the representative values. Mapped queries produced by the Mapper are then passed to the hash component. This component simply implements a hashing function that returns a string of bits for any incoming mapped query. These strings are the keys that must be used to invoke the lookup procedure on the key-value storage component. If the storage component is implemented as a DHT the hashing function can be the same function provided by the DHT implementation; otherwise any collision-resistant function will fit. For each key passed to the storage, the associated documents are returned. All these data constitute the response to the query. The management of new data submitted by devices for storage is equivalent to query management. The main difference is that data provided by devices will contain values for every attribute, i.e. no constraints are admitted in these documents. When this data is passed to the Mapper component, it is decomposed into its components; the Mapper then checks its content one attribute at a time and retrieves the corresponding mapping from the internal storage. When the value of an attribute is matched against the sets of values defined by the mapping, one and only one set is positively matched; this is due to the fact that a mapping is a complete partitioning of the value space defined by the attribute. After all attributes have been matched, the corresponding representative values are combined to obtain the mapped data; note that, since a single representative value is returned by the matching phase for each defined attribute, only one instance of mapped data will be created in this process.
Mapped data produced by the Mapper is then passed to the hash component to obtain the storage key associated to the submitted data. The submitted data and the corresponding key are then used as parameters of the storage component’s store primitive. Clearly, the matching between a query and the mappings for the attributes it contains can generate false positives in the result set, i.e. data that was mapped to one of the keys associated to the query but that does not satisfy the constraints defined in the query. The amount of false positives depends on the granularity of the mappings defined for the various attributes: the more value sets are defined in the mapping for a specific attribute, the lower the probability of false positives caused by constraints expressed on that attribute. False negatives are not possible as long as the intersection property is satisfied for all mappings.
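The store and lookup paths through the Mapper can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: it covers a single Temp value attribute using the mapping defined in Section 3, substitutes SHA-1 for the DHT's hash function, and invents the helper names:

```python
import hashlib
from bisect import bisect_right

# Mapping for Temp value as used in Section 3: 5-degree intervals
# between the bounds 10 and 40, each represented by its first value.
TEMP_MAPPING = [10, 15, 20, 25, 30, 35]
TEMP_MAX = 40

def representative(value):
    # Representative of the one interval that contains the exact value.
    return TEMP_MAPPING[bisect_right(TEMP_MAPPING, value) - 1]

def key_for(mapped_value):
    # Hash the canonical form of a mapped document; any
    # collision-resistant function will fit.
    doc = "/Device/Sensor/Temperature/Temp_value=%s" % mapped_value
    return hashlib.sha1(doc.encode()).hexdigest()

def data_key(value):
    # Exact data matches one and only one interval, hence one key.
    return key_for(representative(value))

def query_keys(threshold):
    # A ">threshold" constraint matches every interval [lo, hi)
    # whose value set intersects (threshold, +inf).
    bounds = TEMP_MAPPING[1:] + [TEMP_MAX]
    return [key_for(lo) for lo, hi in zip(TEMP_MAPPING, bounds) if hi > threshold]

# Intersection property: matching data and query share a key.
assert data_key(26.5) in query_keys(25.5)  # 26.5 satisfies ">25.5"
# False positive: 26.5 does not satisfy ">29", yet both the reading
# and the query map to representative 25, so it is still returned.
assert data_key(26.5) in query_keys(29.0)
```

With several attributes, the representatives surviving the match for each attribute would be combined into one mapped query per combination before hashing.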
3 Application Scenario

Our example is based on a single house, and within this house we will focus our attention on only three locations: the kitchen, the dining room and the toilet. The left plan in Figure 4 shows the positioning of actuators (represented by triangles), temperature sensors (represented by circles) and light sensors (represented by squares). A fourth symbol is used to denote devices (like the PC or the TV set) whose computational and storage resources are sufficient to host an instance of the repository. Obviously, the set of devices in a real scenario would easily be larger than the one presented here; we decided
Fig. 4. Example of a smart home and its corresponding schema of the environment
to limit both the number and the different types of devices to improve the readability of this example. We consider two types of devices that can produce data: sensors (light and temperature) and actuators (that open and close windows and doors, and turn on and off lights, the refrigerator, the TV and the heating system). Starting from these considerations, a schema of the environment can be produced (right half of Figure 4). The schema describes any element in the environment as a subtype of the Device type. Different types are characterized by different attributes. Some attributes, like Temp value and Gas flow for the Temperature sensor and Stove types respectively, are bounded numerical values; other attributes, like Door for the Fridge type, are simple boolean values; finally, attributes like Channel for the Television type are enumerated text values. Regardless of the specificities of the considered scenario, each attribute in the schema is supposed to be detailed with its value type and possible bounds or valid value sets.

Mappings definition - Given the schema of Figure 4, a mapping for each attribute must be defined in the Mapper component. Let us consider the attributes Temp value and Light amount for the two sensor types. Assume that temperature values are bounded by 10 and 40 degrees Celsius and grouped in intervals 5 degrees wide. The Mapper internal storage unit will contain an entry like the following one1:

/Device/Sensor/Temperature/Temp value ⇒ [10, 15, 20, 25, 30, 35]

With respect to the Light amount attribute, we can imagine that we are interested only in four value intervals: dark, soft light, normal light and strong light. A light sensor returns values expressed in lumen, thus we have to discretize the great set of possible light amounts into four coarse intervals. Therefore, the Mapper will contain an entry like the following one:

1
In this example we use a compact representation of intervals for numerical attributes in the mapping. This compact representation is subject to variations depending on the specific attribute type.
/Device/Sensor/Light/Light amount ⇒ [0, 1000, 5000, 20000]

An enumeration attribute like Location has a predefined set of acceptable values; in our example it could have the values “Kitchen”, “Dining Room” and “Toilet”. In this case grouping the values in sets for the mapping is useless and a 1-to-1 mapping can be considered.

Op 1: temperature data update - Each device producing data knows the data schema, and this lets it produce well-formed data. An example of how data produced by a temperature sensor could be represented is shown in Listing 1; in this case the data is an instance of the environment schema.

Listing 1. Representation of data produced by a temperature sensor

<Device>
  <Home_ID>1</Home_ID>
  <Location>Kitchen</Location>
  <Sensor>
    <Temperature>
      <Temp_value>26.5</Temp_value>
    </Temperature>
  </Sensor>
</Device>
This listing represents data produced by a temperature sensor located in the kitchen of our home, which is identified by Home ID 1. When this data is passed to the Mapper component, it considers the list of attributes the data contains and retrieves the corresponding mappings from the internal storage. In this case the mappings are:

/Device/Home ID ⇒ [1]
/Device/Location ⇒ [“Kitchen”, “Dining Room”, “Toilet”]
/Device/Sensor/Temperature/Temp value ⇒ [10, 15, 20, 25, 30, 35]

The Home ID and Location attributes are thus mapped to their corresponding values. The Temp value attribute is mapped to the representative value 25, as the real value 26.5 is included in the range 25–30. Now that the Mapper knows the values for the mapped attributes it can build the mapped data:

Listing 2. Mapped data

<Device>
  <Home_ID>1</Home_ID>
  <Location>Kitchen</Location>
  <Sensor>
    <Temperature>
      <Temp_value>25</Temp_value>
    </Temperature>
  </Sensor>
</Device>
The mapped data is then used to calculate, through the hash function, the key representing the sensor data in the key-value storage component. Note how all data generated by temperature sensors located in the kitchen of house 1 and whose sensed value is in the range 25–30 is stored under the same key.

Op 2: querying temperature data - Suppose now that an external software component needs to interrogate the repository and obtain data from all temperature sensors in the house whose last reading reported a temperature value greater than 25.5°C. This kind of query could be useful, for example, to control the heating subsystem in order to maintain a constant temperature in the house. The query looks like the following:

Listing 3. A query submitted to the repository

<Device>
  <Home_ID>1</Home_ID>
  <Location>*</Location>
  <Sensor>
    <Temperature>
      <Temp_value>>25.5</Temp_value>
    </Temperature>
  </Sensor>
</Device>
Note how values for some attributes have been substituted by constraints: a star “*” wildcard is used to match any possible value of the attribute, meaning that the query is not interested in restricting the search to a specific location, while the constraint “>25.5” limits the range of temperatures that are returned. The Mapper receives the query and splits it into its components. It retrieves the mappings for all three attributes and starts to match the constraints against the intervals. All intervals in the mapping associated to the attribute Location are matched, given the “*” constraint. The Home ID attribute, defined with value 1, leads to a 1-to-1 mapping. Finally, the constraint “>25.5” defined for the attribute Temp value is matched against the intervals contained in the corresponding mapping; a match is positive if there is a non-empty intersection between the set of values defined by the constraint and the set of values contained in one of the intervals defined by the mapping; the matching operation thus returns the values 25, 30 and 35. All these matched values are combined in all possible ways to obtain the mapped queries. One of the 9 mapped queries generated by the Mapper component is the one presented as Listing 2. This mapped query is indistinguishable from the mapped data we showed in the previous section. This means that this query generates at least one key that corresponds to the key previously generated to store the sensor data. This is correct, indeed, as the sensor data perfectly matches the query. Note that a slightly different query, e.g. a query requiring all data from temperature sensors exposing a value greater than 29°C, would have produced the same mapped queries. In this case the previously stored sensor data would constitute a false positive, returned in the response due to the lack of precision in the attribute mapping.
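The combination step that yields the nine mapped queries can be sketched as a cartesian product over the matched representatives. This is an illustrative sketch only; the attribute names mirror the listings above:

```python
from itertools import product

# Representatives surviving the matching phase for the query of
# Listing 3 (Home ID = 1, Location = *, Temp value > 25.5).
matched = {
    "Home_ID": [1],
    "Location": ["Kitchen", "Dining Room", "Toilet"],
    "Temp_value": [25, 30, 35],
}

attrs = list(matched)
mapped_queries = [dict(zip(attrs, combo))
                  for combo in product(*(matched[a] for a in attrs))]

print(len(mapped_queries))  # 1 * 3 * 3 = 9 mapped queries
```

Each of the nine resulting documents is hashed to a lookup key; the combination for (1, "Kitchen", 25) is exactly the mapped data of Listing 2.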
4 Related Work

Even if the possibility to store and retrieve data is fundamental in all smart home applications, to the best of our knowledge issues related to the design of embedded repositories for such applications have hardly been tackled in the state of the art. Probably this is due to the fact that previous works in this area often considered simple centralized approaches to the problem. To find information about the distributed storage problem in this kind of environment, we have to look at works addressing problems related to context-awareness, which typically need to access a reliable repository. Schmidt in [13] explains that context-aware systems are computing systems that provide relevant services and information to users based on their situational conditions. This status of the system must be stored reliably in order to be queried and explored. From this point of view the availability of a reliable distributed repository can be very useful to context-aware systems deployed in a smart home. In our work we focused on how such a reliable distributed repository could be realized. Khungar and Riekki introduced Context Based Storage (CBS) in [11], a context-aware storage system. The structure of CBS is designed to store all types of available data related to a user and provide mechanisms to access this data everywhere using devices capable of retrieving and using the information. CBS provides a simple way to store documents using the context explicitly provided by the user. It allows users to retrieve documents from a ubiquitous storage using the context related directly to the document, or context related to the user that is then linked to the document through timestamping methods. The main difference between CBS and our system is that in CBS special emphasis is given to group activities and access control, since CBS is designed for a ubiquitous environment.
In our system access rights are not considered, since we assume that within a closed home environment the set of users is well known and security is tackled when accessing the system. Another great difference regards the way data is stored: the storage system in our architecture is completely distributed and built to provide reliable storage and access to devices deployed in the environment. All the previous solutions assume the presence of a central database, a storage server that in some cases could be overloaded. In [14] the authors show a solution targeted at enhancing the process of data retrieval. The system is based on the idea of prefetching data from a central repository to improve responsiveness to user requests. The solution proposed in this paper tries to overcome these problems using a scalable distributed approach where all participating nodes are able to provide the same functionalities. A solution similar to the one proposed in this paper has been previously adopted in the field of large-scale data diffusion to implement a content-based publish/subscribe system on top of a DHT [10]. The main difference between the two approaches is in the way data is accessed: while publish/subscribe systems assume that queries (subscriptions) are generated before data (publications) is diffused, our system targets a usage pattern closer to the way classical storage systems are accessed.
5 Conclusions

This paper presented the architecture of a distributed repository for smart home application environments characterized by the presence of a large number of heterogeneous
devices. The repository bases its reliability and scalability properties on an underlying DHT that is used to store and retrieve data. The limitation imposed by the DHT lookup primitive is solved by introducing a mapping component able to correctly map queries and matching data. The authors plan to start experimenting with this idea through an initial prototype that will be adopted for testing purposes. The aim of these tests will be to evaluate the adaptability of the proposed architecture to different application scenarios. A further improvement planned as future work will consist in modifying the system so that the mapping definitions are automatically adapted at run-time, in order to reduce the number of false positives returned by the repository in response to queries without adversely affecting its performance.
References
1. AMX, http://www.amx.com/
2. BTicino “My Home”, http://www.myhome-bticino.it/
3. KNX, http://www.knx.org/
4. LonWorks, http://www.echelon.com/
5. Lutron Electronics Co., Inc., http://www.lutron.com/
6. Philips Dynalite, http://www.dynalite-online.com/
7. The UPnP forum, http://www.upnp.org/
8. Web Ontology Language (OWL), http://www.w3.org/2004/OWL/
9. Smart Homes for All: An embedded middleware platform for pervasive and immersive environments for-all. EU STREP Project: FP7-224332 (2008)
10. Baldoni, R., Marchetti, C., Virgillito, A., Vitenberg, R.: Content-based publish-subscribe over structured overlay networks. In: Proceedings of the International Conference on Distributed Computing Systems, pp. 437–446 (2005)
11. Khungar, S., Riekki, J.: A context based storage system for mobile computing applications. SIGMOBILE Mob. Comput. Commun. Rev. 9(1), 64–68 (2005)
12. Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Guerraoui, R. (ed.) Middleware 2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001)
13. Schmidt, A.: Ubiquitous Computing – Computing in Context. PhD thesis, Lancaster University (2002)
14. Soundararajan, G., Mihailescu, M., Amza, C.: Context-aware prefetching at the storage server. In: Proceedings of the 33rd USENIX Technical Conference, pp. 377–390 (2008)
Fine-Grained Tailoring of Component Behaviour for Embedded Systems Nelson Matthys, Danny Hughes, Sam Michiels, Christophe Huygens, and Wouter Joosen IBBT-DistriNet, Department of Computer Science, Katholieke Universiteit Leuven, B-3001, Leuven, Belgium {firstname.lastname}@cs.kuleuven.be
Abstract. The application of run-time reconfigurable component models to networked embedded systems has a number of significant advantages, such as encouraging software reuse, adaptation to dynamic environmental conditions and management of changing application demands. However, reconfiguration at the granularity of components is inherently heavy-weight and thus costly in embedded scenarios. This paper argues that in some cases component-based reconfiguration imposes an unnecessary overhead and that more fine-grained support for the tailoring of component functionality is required. This paper advocates a high-level policy-based approach to tailoring component functionality. To that end, we introduce a lightweight framework that supports fine-grained adaptation of component functionality based upon high-level policy specifications. We have realized and evaluated a prototype of this framework for the LooCI component model.
1 Introduction
Run-time reconfigurable component models provide an attractive programming model for Wireless Sensor Networks (WSNs). As WSN environments are typically highly dynamic, run-time reconfigurable component models allow this dynamism to be effectively managed through the deployment of new functionality or the modification of existing compositions. WSNs are also increasingly expected to support multiple applications over the long term. In response, reconfigurable component models allow system functionality to evolve to meet changing application requirements. Run-time reconfigurable component models also promote reuse, which is essential in resource-constrained WSN environments. A number of run-time reconfigurable component models have been developed for embedded systems, most notably OpenCOM [4], RUNES [3], and OSGi [13]. These component models address the problems of dynamism, evolution and reuse by offering developers:

– Concrete interfaces that promote the reuse of components between applications.
– On-demand component deployment that can be used to manage dynamism and evolution through the injection of new functionality.

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 156–167, 2009.
© IFIP International Federation for Information Processing 2009
– Component rewiring that can be used to modify component compositions on the fly and thus offers a mechanism to manage dynamism and evolution. The ability to dynamically wire a third-party component into a composition also promotes reuse.

In sum, run-time reconfigurable component models allow for reconfiguration of system functionality through the introduction of new components, or the modification of relationships between existing components. However, component-based reconfiguration has two critical disadvantages:

– Coarse granularity: As reconfigurations may be enacted only by modifying relationships between components or deploying new components, component-based reconfiguration is a poor fit for enacting fine-grained changes. Thus, while component-based reconfiguration provides a generic mechanism for enacting changes, it is inefficient when the change may be represented by a few lines of code. This is particularly critical for embedded platforms, such as WSN nodes, where memory is limited and software updates are costly operations.
– Complexity of abstraction level: Component-based reconfiguration is complex and requires a domain expert to enact it properly. This complexity prevents end-users from tailoring the functionality of the deployed system themselves. Furthermore, expressing simple changes in a component-based system should be offered at the abstraction level of the end-user.

This paper addresses the problems of coarse granularity and complexity through the introduction of a lightweight policy framework for adapting component behaviour. Policies for this framework are high-level and platform-independent, thus allowing end-users to more easily tailor component behaviour. The performance of this system is evaluated through a number of case studies.
The remainder of this paper is structured as follows: Section 2 provides background on component and policy frameworks for networked embedded systems, while Section 3 presents the design of a policy language and corresponding framework for tailoring component behaviour. An initial prototype of this framework is evaluated based on a case study in Section 4. Section 5 critically discusses advantages and shortcomings of our approach. Finally, Section 6 concludes and presents directions for future work.
2 Background
This section first discusses the state of the art in component models for networked embedded systems. Section 2.2 then discusses existing policy-based mechanisms for tailoring component functionality. Finally, Section 2.3 provides a brief overview of the LooCI component model.

2.1 Component Models for Networked Embedded Systems
NesC [6] is perhaps the best known component model for networked embedded systems and is used to implement the TinyOS [12] operating system. NesC
provides an event-driven programming approach together with a static component model. NesC components cannot be dynamically reconfigured; however, the static approach of NesC allows for whole-program analysis and optimization. Maté [11] extends NesC and provides a framework to build application-specific virtual machines. As applications are composed using specific virtual machine instructions, they can be represented concisely, which saves power that would otherwise be consumed by transmitting software modules. However, compared to component-based approaches, Maté has one critical shortcoming: compositions are limited by the functionality that is already deployed on each node, and thus it is not possible to inject new functionality into a Maté application without reflashing each node. OpenCOM [4] is a general-purpose, run-time reconfigurable component model and, while it is not specifically targeted at networked embedded systems, it has been deployed in a number of WSN scenarios [7]. OpenCOM supports dynamic reconfiguration via a compact runtime kernel. Reconfiguration in OpenCOM is coarse-grained, being achieved through the deployment of new components and the modification of connections between components. The RUNES [3] component model brings OpenCOM functionality to more embedded devices. Along with a smaller footprint, RUNES adds a number of introspection API calls to the OpenCOM kernel. Like OpenCOM, RUNES allows for only coarse-grained component-based reconfiguration. The OSGi component model [13] targets powerful embedded devices along with desktop and enterprise computers. OSGi provides a secure execution environment, support for run-time reconfiguration and life-cycle management. Unfortunately, while OSGi is suitable for powerful embedded devices, the smallest implementation, Concierge [15], consumes more than 80 KB, making it unsuitable for highly resource-constrained devices.

2.2 Policy Techniques for Tailoring Component Behaviour
Over the last decade, research on policy-based management [2] has primarily been applied to facilitate management tasks, such as component configuration, security, or Quality of Service in large-scale distributed systems. Policy-based management allows the specification of requirements about the intended behaviour of a managed system using a high-level policy language, which are then automatically enforced in the system. Furthermore, policies can be changed dynamically without having to modify the underlying implementation or requiring the consent or cooperation of the components being governed. ESCAPE [16] is a component-based policy framework for programming sensor network applications using TinyOS [12]. Similar to our approach, ESCAPE advocates the use of policy rules to govern component behaviour. However, policies in ESCAPE are exclusively used to specify interactions between components, removing interaction code from the individual components, whereas in our approach we apply policy techniques to configure entire component compositions, including the existing information flow. In addition, ESCAPE is implemented
on top of the static NesC component model [6], whereas our policy framework builds on top of a more flexible run-time reconfigurable component model. Recently, the Service Component Architecture (SCA) defined a policy framework specification [14], which aims to use policies for describing capabilities and constraints that can be applied to service components or to the interactions between different service components. While not bound to a specific implementation technology, the SCA policy framework focuses on service-oriented environments such as OSGi [13], which may only be applied to relatively powerful embedded devices. The approach this paper proposes is to combine the key benefits of a run-time reconfigurable component model (i.e. the ability to inject new functionality dynamically and to reason about distributed relationships between components) with the efficiency of policy-based tailoring of functionality. As we will show in Section 4, this reduces the burden on developers while also reducing the performance overhead for simple reconfigurations. Furthermore, the policy language we propose is high-level and easy to understand, allowing end-users, as well as domain experts, to customize the functionality of component compositions.

2.3 LooCI: The Loosely-Coupled Component Infrastructure
The Loosely-coupled Component Infrastructure (LooCI) [8] is designed to support Java ME CLDC 1.1 platforms such as the Sun SPOT [17]. LooCI is comprised of a component model, a simple yet extensible networking framework and a common event bus abstraction. LooCI components support run-time reconfiguration, interface definitions, introspection and support for the rewiring of bindings. LooCI offers support for two component types, macrocomponents and microcomponents. Macrocomponents are coarse-grained and service-like, building upon the notion of Isolates inherent in embedded Java Virtual Machines such as Sentilla [1] or SQUAWK [18]. Isolates are process-like units of encapsulation and provide varying levels of control over their execution (exactly what is provided depends on the specific JVM). LooCI standardizes and extends the functionality offered by Isolates. Each macrocomponent runs in a separate Isolate and communicates with the runtime middleware via Inter Isolate RPC (IIRPC), which is offered by the underlying system. Unlike microcomponents, macrocomponents may use multiple threads and utility libraries. Microcomponents are fine-grained and self-contained. All microcomponents run in the master Isolate alongside the LooCI runtime. Unlike macrocomponents, microcomponents must be single threaded and self-contained, using no utility libraries. Aside from these restrictions, microcomponents offer identical functionality to macrocomponents in a smaller memory footprint. Unlike OpenCOM or RUNES, LooCI components are indirectly bound over a lightweight event bus. LooCI components define their provided interfaces as the set of LooCI events that they publish. The receptacles of a LooCI component are similarly defined as the events to which they subscribe. As bindings are indirect, they may be modified in a manner that is transparent to the composition.
Furthermore, as all events are part of a globally specified event hierarchy, it becomes easier to understand and modify data flows.
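The indirect binding described above can be sketched as a toy event bus in the spirit of LooCI (this is not the actual LooCI API; the class, event type and handler names are invented for illustration):

```python
class EventBus:
    """Toy indirect binding: components publish and subscribe to
    event types instead of holding direct references to each other."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, event_type, handler):
        # A component's receptacles are the event types it subscribes to.
        self._subscribers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, payload):
        # A component's provided interface is the set of events it publishes.
        for handler in self._subscribers.get(event_type, []):
            handler(payload)


bus = EventBus()
received = []
bus.subscribe("TEMPERATURE", received.append)  # consumer component
bus.publish("TEMPERATURE", 26.5)               # producer component
print(received)  # the consumer saw the event
```

Because the producer never learns who consumes its events, subscribers can be rewired without touching the producer, which is exactly what makes binding modification transparent to the composition.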
3 A Policy-Based Component Tailoring Framework

3.1 Policy Language Design and Tool Support
The specification of policies to tailor component behaviour is accomplished using policy rules following Event-Condition-Action (ECA) semantics, which correspond well to the event-driven nature of the target embedded platforms. An ECA policy consists of a description of the triggering events, an optional condition, which is a logical expression typically referring to external system aspects, and a list of actions to be enforced in response. In addition, our prototype policy language allows various functions to be called inside the condition and action parts of a policy. By using these policies, we offer a simple, yet powerful method for end-users to tailor component behaviour. In addition, we provide tool support to allow end-users to easily tailor system behaviour. Our tool first allows the end-user to select the components and interfaces that can be tailored. Secondly, after specification of the corresponding policies, the tool parses and analyzes each policy for syntactic consistency. Finally, the tool allows the end-user to choose the nodes to which the policy should be deployed. Concrete examples of the policy language can be found in Section 4.

3.2 Policy Framework Design
As illustrated in Figure 1, the policy framework is deployed on each sensor node and consists of three key components: the Policy Engine, the Rule Manager, and a Policy Distribution component. The Policy Engine is the main component in the framework and is responsible for intercepting events as they pass between two components and evaluating them against the set of policy rules on each node. In case of a match (i.e. a triggering event and a condition evaluating to true), the engine enforces the actions defined in the action part of the matching policy. Typical examples of actions are denying the event passage, publishing a custom event, or invoking a particular function in the middleware runtime. Potential conflicts between multiple matching policies are handled by following a priority-based ordering of policies, whereby only the actions of the highest-priority policy are executed. Distribution of policy files from the back-end to the sensor network is achieved using a Policy Distribution component hosted on each individual sensor node. After specification and analysis of a policy by our tool, the policy is transformed into a compact binary representation that can be efficiently disseminated to the sensor nodes. On reception of this binary policy representation, the policy distribution component passes it to the Rule Manager component. The Rule Manager on each individual sensor node is responsible for storing and managing the set of policy rules on the node. After reception of a binary
Fine-Grained Tailoring of Component Behaviour for Embedded Systems
Fig. 1. Overview of the policy framework
policy from the distribution component, the Rule Manager converts the policy into a data structure suitable for more efficient evaluation, which is then passed to the Policy Engine on a per-triggering-event basis. By retaining the ability to dynamically change the set of policies at run-time, the framework can be adapted to evolving application demands.
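The priority-based conflict handling described above can be sketched as follows (a simplified Python model with hypothetical field names, not the actual on-node implementation): when several policies match an event, only the highest-priority policy's actions are enforced.

```python
def dispatch(event, policies):
    """Evaluate an event against a rule set; if several policies match,
    enforce only the actions of the highest-priority one."""
    matching = [p for p in policies
                if p["trigger"] == event["type"] and p["condition"](event)]
    if not matching:
        return None  # no policy matched: the event passes through untouched
    winner = max(matching, key=lambda p: p["priority"])
    return winner["action"](event)

# Two rules match hot TEMP events; the higher-priority "deny" rule wins.
policies = [
    {"trigger": "TEMP", "priority": 1,
     "condition": lambda e: True, "action": lambda e: "allow"},
    {"trigger": "TEMP", "priority": 2,
     "condition": lambda e: e["value"] > 20, "action": lambda e: "deny"},
]
verdict = dispatch({"type": "TEMP", "value": 25}, policies)  # "deny"
```

A total order on priorities keeps conflict resolution deterministic on every node, at the cost of silently discarding the actions of lower-priority matches.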
4 Case-Study Based Evaluation
This section presents a scenario that requires two archetypal reconfigurations of a distributed component composition: (i.) introduction of filtering functionality and (ii.) binding interception and monitoring. For each case, we compare the overhead of realizing reconfigurations using LooCI macrocomponents and microcomponents to that of realizing them using the policy framework introduced in Section 3. Specifically, Section 4.1 describes our motivating application scenario. Section 4.2 describes how compositions may be modified through component reconfiguration and policy application. Section 4.3 then considers the overhead for developers inherent in each approach, while Section 4.4 analyzes the memory consumption of each approach. Finally, Section 4.5 explores the performance overhead of component-based versus policy-based reconfiguration.

4.1 Application Scenario
Consider a WSN-based warehouse monitoring scenario in which a company, STORAGE CO, provides temperature-controlled storage of
N. Matthys et al.
goods, wherein the temperature of stored packages is monitored using a WSN running the LooCI middleware. STORAGE CO offers two classes of service for stored goods: best effort temperature control and assured temperature control. The customers of STORAGE CO (CHOCOLATE CO and CHEMICAL CO) each have different storage requirements that evolve over time.

– Best effort temperature control: in this scheme, STORAGE CO sets temperature alarms, which alert warehouse employees if the temperature of a stored package has breached a specified threshold. As the scheme is alarm-based, it generates low levels of traffic, increasing battery life and reducing cost.

– Assured temperature control: in this scheme, STORAGE CO provides continuous data to warehouse employees, who may view detailed temperature data and take pre-emptive action to avoid package spoilage. As this scheme transmits continuous data, it decreases node battery life and increases costs.

Scenario 1. CHOCOLATE CO begins by requesting the assured temperature service level from STORAGE CO; however, due to tightening cost constraints, CHOCOLATE CO later requests that its service level be switched to best effort. CHEMICAL CO begins by requesting the low-cost best effort service; however, stricter government regulations require CHEMICAL CO to increase its coverage to assured temperature control.

Scenario 2. STORAGE CO wishes to perform a detailed analysis of how its WSN infrastructure is being used, and thus deploys functionality to monitor all component bindings in its WSN. This functionality includes accounting of all events that pass.

4.2 Component-Based Modification versus Policy-Based Modification
Scenario 1. This section explores how the changing requirements of the customers on both temperature monitoring schemes can be reflected using (i.) component-based modification of the compositions, and (ii.) a single composition customized using our policy-based approach.

Component-Based Tailoring of Functionality. The assured and best effort temperature monitoring schemes discussed in Section 4.1 may be represented by two distinct component compositions, as shown in Figure 2. In the assured monitoring scheme, a TEMP SENSOR exposes a single interface of type TEMP, which is wired to the matching receptacle of a TEMP MONITORING component. In the best effort temperature monitoring scheme, the TEMP SENSOR component is wired to the matching receptacle of a TEMP ALARM component, the ALARM interface of which is then wired to the matching receptacle of a TEMP MONITORING component.
Fig. 2. Component configurations
In the case of CHOCOLATE CO, switching from assured to best effort temperature monitoring, the existing TEMP SENSOR component is unwired from the TEMP MONITORING component and rewired to a TEMP ALARM component, the ALARM interface of which is wired to the TEMP MONITORING component. In the case of CHEMICAL CO, switching from best effort to assured monitoring, the existing TEMP ALARM component is unwired from the TEMP MONITORING and TEMP SENSOR components. Subsequently, the TEMP interface of the TEMP SENSOR component is wired to the matching receptacle of the TEMP MONITORING component.

Policy-Based Modification. To enable CHOCOLATE CO to switch from assured to best effort monitoring, the developer needs to specify and enable the following policy with priority 1:

policy "assured-to-best-effort" "1" {
  on TEMP as t;  // TEMP contains (source, dest, value)
  if (t.value > 20 && t.dest == TEMP_MONITORING_CHOC_CO) then (
    // publish an ALARM event to TEMP_MONITORING_CHOC_CO
    publish ALARM(t.source, TEMP_MONITORING_CHOC_CO, t.value);
    deny t;  // and block the TEMP event from further dissemination
  )
}
This policy specifies that the policy engine should intercept all TEMP events, allowing only those with a temperature value higher than 20 degrees Celsius to pass, converting them to ALARM events destined for the TEMP MONITORING component. To enable CHEMICAL CO to switch from best effort to assured temperature monitoring, the developer needs to specify and enable the following policy:
policy "best-effort-to-assured" "1" {
  on TEMP as t;
  if (t.dest == TEMP_ALARM) then (
    // allow sending to TEMP_ALARM for threshold checking
    allow t;
    t.dest = TEMP_MONITORING_CHEM_CO;  // change destination
    publish t;  // assure sending to TEMP_MONITORING_CHEM_CO
  )
}
This policy changes the destination of TEMP events from the TEMP ALARM to the TEMP MONITORING CHEM CO component to enforce the assured monitoring scheme. In addition, it must not break the existing composition (i.e. TEMP events must also still be sent to the TEMP ALARM component).

Scenario 2: Insertion of Global Monitoring Behaviour. The network-wide monitoring of component interactions described in Section 4.1 may also be implemented using a component-based or policy-based approach. In either case, the reception and transmission of each event should be logged to an ACCOUNTING component, which stores events for future retrieval and analysis. To implement logging or accounting using component-based modification, STORAGE CO would be required to continually probe the network to discover the state of compositions and then insert a BINDING MONITOR interception component into each discovered binding, clearly a resource-intensive process. In contrast, as the LooCI Event Manager provides a common point of interception for all events on each node, a single, generic policy may be inserted to perform equivalent monitoring. As all events are routed through the policy engine, such a configuration is agnostic to the component compositions executing on the WSN and entails significantly lower overhead. A policy to implement this is shown below:

policy "logging" "1" {
  on * as e;  // all events have source, dest, data[] as payload
  then (
    // always do accounting of event occurrence
    invoke ACCOUNTING(e.source, e.dest, e.data[]);
    allow e;  // do not block e, allow it to continue
  )
}
While this example is simple, we believe that the ability to install per-node as well as per-binding policies to enforce various non-functional concerns may reduce overhead in many scenarios.

4.3 Overhead for the Developer
In this section, we analyze the effort required to implement the TEMP ALARM component and compare this with the effort required to develop a functionally equivalent policy, as described in Section 4.2. Each implementation was analyzed in terms of Source Lines of Code (SLoC). The results are shown in Table 1.
Perhaps more critical than the conservation of development effort, illustrated by the SLoC savings shown in Table 1, is the high-level and platform-independent nature of the policy specification language, which, unlike a Java-based LooCI component, could equally be applied to a TinyOS [12] or Contiki [5] software configuration where a suitable policy interpreter exists.

4.4 Memory Footprint
The size of the policy framework is 26 kB. Subsequently, we analyzed the static memory (size on disk) and dynamic memory (RAM) consumed by the software elements introduced in Section 4.2. As can be seen in Table 2, policy-based reconfiguration consumes significantly less memory than component-based reconfiguration, a critical advantage in memory-constrained environments like WSNs. Table 2. Memory Consumption
4.5 Performance Overhead

We evaluated the performance of policy-based and component-based reconfiguration using a standard SunSPOT node (180 MHz ARM9 CPU, 512 kB RAM, SQUAWK VM ‘BLUE’ version) and a 3 GHz Pentium 4 desktop with 1 GB of RAM running Linux 2.6 and Java 1.6. We first logged the time required to deploy and initialize the policy specification and component implementation required to achieve the reconfigurations described in Section 4.2. We then analyzed the time each took to handle an incoming TEMP event (i.e. process it and disseminate an ALARM event to the gateway). In each case, the SPOT node was deployed between 20 cm and 30 cm from the network gateway and we performed 50 experiments, the averaged results of which are shown in Table 3. As can be seen from Table 3, not only is the overhead of deploying and initializing a policy significantly lower than that of deploying and initializing a component, the ongoing per-event overhead caused by applying a policy to a binding is also lower than that caused by inserting a new macrocomponent, and equal to microcomponent performance. In embedded environments where CPU and energy resources are scarce, we believe that policy-based reconfiguration provides concrete benefits over component-based reconfiguration for tailoring compositions, as it does not introduce additional overhead.
Table 3. Performance Comparison

                     Microcomponent  Macrocomponent  Policy
Deployment           11330 ms        11353 ms        200 ms
Initialization       8418 ms         7420 ms         6 ms
Execution overhead   28 ms           43 ms           28 ms

5 Discussion
The evaluation presented in the previous section clearly shows that policy-based modification of component compositions can have significant advantages in terms of: (i.) lowering development overhead, (ii.) reducing memory footprint and (iii.) improving performance. This leads to a critical question: when should component-based modification of functionality be applied, and when should policy-based tailoring be used? The policy-based approach is suited to enforcing non-functional concerns such as accounting or security on component compositions, as these concerns are orthogonal to the composition and do not radically change the end-to-end information flow in the component composition. Despite its concrete advantages, policy-based composition modification is not without drawbacks: it can reduce the reusability of components. In a pure (or functional) component composition, the functionality of each component is solely identified by its type along with the interfaces and receptacles it provides. As the application of policies to component bindings can modify functionality in a manner that is opaque, this can effectively render a component unreliable for use in other compositions and thus reduce the maintainability of the system. Managing long-term system evolution must therefore be done with care; we believe that policies should instead be used to efficiently realize transient modifications to compositions and to enforce non-functional concerns on compositions.
6 Conclusions
This paper has presented a policy-based framework that can be used to tailor the functionality of component compositions. We have presented a compact and lightweight prototype of this framework realized for the LooCI component model and, through evaluation, we have shown that policy-based tailoring can reduce overhead for developers, reduce memory consumption and improve the performance of reconfiguration when compared to purely component-based reconfiguration approaches. In the short term, future work will focus on further researching the impact of policy-based modifications on component compositions. In addition, we plan to evaluate policy-based tailoring of functionality in a logistics scenario with concrete WSN end-users. In the longer term, we hope to improve the expressiveness of our policy language, implement prototypes of our policy engine for the OpenCOM [4] and OSGi [13] component models, and evaluate its performance there.
Acknowledgments. Research for this paper was partially funded by the Interuniversity Attraction Poles Programme of the Belgian State (Belgian Science Policy) and the Research Fund K.U.Leuven, and was conducted in the context of the IBBT-DEUS project [9] and the IWT-SBO-STADiUM project No. 80037 [10].
References

1. Sentilla Perk Platform (July 2009), http://www.sentilla.com/
2. Boutaba, R., Aib, I.: Policy-based management: A historical perspective. J. Network Syst. Manage. 15(4), 447–480 (2007)
3. Costa, P., Coulson, G., Mascolo, C., Mottola, L., Picco, G.P., Zachariadis, S.: Reconfigurable component-based middleware for networked embedded systems. International Journal of Wireless Information Networks 14(2), 149–162 (2007)
4. Coulson, G., Blair, G., Grace, P., Taiani, F., Joolia, A., Lee, K., Ueyama, J., Sivaharan, T.: A generic component model for building systems software. ACM Trans. Comput. Syst. 26(1), 1–42 (2008)
5. Dunkels, A., Gronvall, B., Voigt, T.: Contiki - a lightweight and flexible operating system for tiny networked sensors. In: Proceedings of the 29th Annual IEEE International Conference on Local Computer Networks (LCN 2004), Washington, DC, USA, pp. 455–462. IEEE Computer Society, Los Alamitos (2004)
6. Gay, D., Levis, P., von Behren, R., Welsh, M., Brewer, E., Culler, D.: The nesC language: A holistic approach to networked embedded systems. In: PLDI 2003: Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, pp. 1–11. ACM Press, New York (2003)
7. Hughes, D., Greenwood, P., Blair, G., Coulson, G., Grace, P., Pappenberger, F., Smith, P., Beven, K.: An experiment with reflective middleware to support grid-based flood monitoring. Conc. Comp.: Pract. Exper. 20(11), 1303–1316 (2008)
8. Hughes, D., Thoelen, K., Horré, W., Matthys, N., Del Cid, J., Michiels, S., Huygens, C., Joosen, W.: LooCI: a loosely-coupled component infrastructure for networked embedded systems. Technical Report CW 564, K.U.Leuven (September 2009)
9. IBBT-DEUS project (July 2009), https://projects.ibbt.be/deus
10. IWT STADiUM project 80037: Software technology for adaptable distributed middleware (July 2009), http://distrinet.cs.kuleuven.be/projects/stadium/
11. Levis, P., Culler, D.: Maté: a tiny virtual machine for sensor networks. In: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, New York, USA, pp. 85–95 (2002)
12. Levis, P., Madden, S., Gay, D., Polastre, J., Szewczyk, R., Woo, A., Brewer, E.A., Culler, D.E.: The emergence of networking abstractions and techniques in TinyOS. In: Proceedings of the 1st Symposium on Networked Systems Design and Implementation (NSDI 2004), March 2004, pp. 1–14 (2004)
13. OSGi Alliance: About the OSGi Service Platform, whitepaper, rev. 4.1 (June 2007)
14. OSOA: SCA Policy Framework. SCA Version 1.00 (March 2007)
15. Rellermeyer, J.S., Alonso, G.: Concierge: a service platform for resource-constrained devices. SIGOPS Oper. Syst. Rev. 41(3), 245–258 (2007)
16. Russello, G., Mostarda, L., Dulay, N.: ESCAPE: A component-based policy framework for sense and react applications. In: Chaudron, M.R.V., Szyperski, C., Reussner, R. (eds.) CBSE 2008. LNCS, vol. 5282, pp. 212–229. Springer, Heidelberg (2008)
17. Sun Microsystems: Sun SPOT world (July 2009), http://www.sunspotworld.com/
18. Sun Squawk Virtual Machine (July 2009), http://squawk.dev.java.net/
MapReduce System over Heterogeneous Mobile Devices Peter R. Elespuru, Sagun Shakya, and Shivakant Mishra Department of Computer Science University of Colorado, Campus Box 0430 Boulder, CO 80309-0430, USA
Abstract. MapReduce is a distributed processing algorithm which breaks up large problem sets into small pieces, such that a large cluster of computers can work on those small pieces in an efficient, timely manner. MapReduce was created and popularized by Google, and is widely used as a means of processing large amounts of textual data for the purpose of indexing it for search later on. This paper examines the feasibility of using smart mobile devices in a MapReduce system by exploring several areas, including quantifying the contribution they make to computation throughput, end-user participation, power consumption, and security. The proposed MapReduce System over Heterogeneous Mobile Devices consists of three key components: a server component that coordinates and aggregates results, a mobile device client for iPhone, and a traditional client for reference and to obtain baseline data. A prototypical research implementation demonstrates that it is indeed feasible to leverage smart mobile devices in heterogeneous MapReduce systems, provided certain conditions are understood and accepted. MapReduce systems could see sizable gains of processing throughput by incorporating as many mobile devices as possible in such a heterogeneous environment. Considering the massive number of such devices available and in active use today, this is a reasonably attainable goal and represents an exciting area of study. This paper introduces relevant background material, discusses related work, describes the proposed system, explains obtained results, and finally, discusses topics for further research in this area. Keywords: MapReduce, iPhone, Android, Mobile Platforms, Apache, Ruby, PHP, jQuery, JavaScript, AJAX.
1 Introduction
Distributed computing has come into its own in the Internet age. Such a large computational pool has given rise to endeavors such as the SETI@Home [14] and Folding@Home [7] projects, which both allow any willing person to surrender a portion of their desktop computer or laptop to a much larger computational goal. In the case of SETI@Home, millions of users participate to analyze data in search of extraterrestrial signals, whereas Folding@Home

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 168–179, 2009.
© IFIP International Federation for Information Processing 2009
is a bit more practical: its goal is “to understand protein folding, misfolding, and related diseases”. These systems, along with others mentioned later, are conceptually similar to what we propose: a system that allows people to participate freely in these kinds of massive, computationally bound problems so that results may be obtained quickly. There are many similar approaches to solving large, computationally intensive problems. One of the most famous is the problem of providing relevant search of the Internet itself [2]. Google has emerged as the superior provider of that capability, and a portion of that superiority comes by way of the underlying algorithms used to make its process efficient, elegant, and reliable [13]: MapReduce [4]. MapReduce is similar to other mechanisms employing parallel computations, such as parallel prefix schemes [12] and scan primitives [3], and is even fairly similar to blocked sort-based indexing algorithms [16]. We believe there exists a blatant disregard of certain capable devices [11] in the context of these kinds of distributed systems. Existing implementations have neglected the mobile device computation pool, and we suspect this is due to a number of factors that hamper most current mobile devices. It seems only smart phones are powerful enough, computation-wise, for most of these distributed workloads. There are many additional concerns as well that have been covered by prior work, such as power usage, security concerns [9] and potential interference with the device’s intended usage model as a phone. All of these factors limit the viability of incorporating mobile devices into a distributed system. It is our belief that despite these limitations, there are solutions that allow the inclusion of the massive smart phone population [6] into a distributed system.
One logical progression of MapReduce, and other such distributed algorithms, is toward smart mobile devices, primarily because there are so many of them and they are largely untapped. Even a small-scale incorporation of this class of device can have an enormous impact on the systems at large and how they accomplish their goals. Increases in data volume underscore the need for additional computational power, as the world continues to create far more data than it can realistically and meaningfully process [5]. Using smart mobile devices, in addition to the more traditional set of servers, is one possible way to increase computational power for these kinds of systems, and is exactly what we attempt to prove and quantify in specific cases by leveraging prior work on MapReduce. This paper explores the feasibility of using smart mobile devices in a MapReduce system by exploring several areas, including quantifying the contribution they make to overall computation throughput, end-user participation, power consumption, and security. We have implemented and experimented with a prototype of a MapReduce system that incorporates three types of devices: a standard Linux server, an iPhone, and an iPhone simulator. Preliminary results from our performance measurements support our claim that mobile devices can indeed contribute positively in a large heterogeneous MapReduce system, as well as similar systems. Given that the number of smart phones is clearly on the rise, there is immense potential in using them to build computationally-intensive parallel processing applications.
The rest of the paper is organized as follows. In Section 2, we briefly outline the MapReduce system. Section 3 touches on similar endeavors. In Section 4, we describe the design of our system, and in Section 5, we describe the implementation details. In Section 7, we discuss experimental results measured from our prototype implementation. Next, we discuss some optimizations in Section 8 and then finally conclude our paper in Section 10.
2 MapReduce
MapReduce [4] is an increasingly popular programming paradigm for distributed data processing, above and beyond merely indexing text. At the highest architectural level, MapReduce is comprised of a few critical pieces and processes. If you have a large collection of documents or text, that corpus must be broken into manageable pieces, called splits. Commonly, a split is one line of a document, which is the model we follow as well. Once split, a master node must assign splits to workers, who then process each piece, store some aspect of it locally, but ultimately return it to the master node or something else for reduction. The reduction then typically partitions the results for faster usage, accounting for statistics, document identification and so on. We describe the MapReduce process in three phases: Map, Emit, and Reduce (see Figure 1). In our system, the Map phase is responsible for taking a large data set and chunking it into splits. The Emit phase entails the distributed processing nodes obtaining and working on the splits, and returning a processed result to another entity, the master node or job server that coordinates everything. Unlike most MapReduce implementations, the nature of mobile devices precludes us from using anything other than network communications to read and write data, as well as assign jobs and process them. The final phase is Reduce, which in our case further minimizes the received results into a unique set of data that ultimately gets stored in a database for simplicity. For example, given a large set of plain text files over which you may wish to search by keyword, a MapReduce system begins with a master node that takes all those text files, splits them up line by line, and parcels them out to participants. The participating computation nodes find the unique set of keywords in each line of text they were given, and emit that set back to the master node.

Fig. 1. High Level Map Reduce Explanation

The master node, after getting all of the pieces back, aggregates all
of the responses to determine the overall unique set of keywords for that whole set of data, and stores the result in a database, file, or some other persistent storage medium. At this point the data can be analyzed and searched in whatever way desired from within our web application. One of the biggest strengths of MapReduce lies in its inherent distribution of phases, which results in an extremely high degree of reliable parallelism when implemented properly. MapReduce is both fault- and slow-response-tolerant, which are very desirable characteristics in any large distributed system.
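The keyword-search example above can be condensed into a single-process Python sketch of the three phases (purely illustrative; the real system distributes the Emit phase over the network to heterogeneous clients):

```python
def map_phase(documents):
    """Master: break the corpus into one-line work units (splits)."""
    return [line for doc in documents for line in doc.splitlines() if line]

def emit_phase(split):
    """Worker: compute the unique set of keywords in one split."""
    return set(split.lower().split())

def reduce_phase(emitted):
    """Master: merge per-split keyword sets into one overall unique set."""
    keywords = set()
    for s in emitted:
        keywords |= s
    return keywords

docs = ["sensor data log\ntemperature alarm", "sensor alarm report"]
splits = map_phase(docs)                               # 3 one-line splits
results = reduce_phase(emit_phase(s) for s in splits)  # 6 unique keywords
```

Because each split is processed independently, the Emit calls can be farmed out to any mix of workers, which is what makes slow or failing clients tolerable.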
3 Related Work
There have been a number of other explorations of heterogeneous MapReduce implementations and their performance [15], as well as some more unique expansions on the idea, such as using JavaScript in an entirely client-side, in-browser processing framework [8] for MapReduce. None of this related work, however, focuses on using a mobile device pool as a major computation component. To complement these related works, we focus on mobile devices and, in particular, on the specifics of heterogeneity in the context of mobile devices mixed with more traditional computation resources.
4 The Heterogeneous Mobile Device MapReduce System
Our problem encompasses three areas: 1) provide a mechanism for interested parties to participate in a smart phone distributed computational system, and ensure they are aware of the potential side effects; 2) make use of this opt-in device pool to compute something and provide aggregate results; and 3) provide meaningful results to interested parties, and summarize them in a timely fashion, considering the reliability of devices on wireless and cellular networks. Our solution is the Heterogeneous Mobile Device MapReduce System.

Fig. 2. System Summary

There are several key components in our system: 1) a server which acts as the master node and coordinator for MapReduce processing; 2) server-side client code used to provide faster, more powerful client processing in conjunction with mobile devices; 3) the mobile device client, which
implements MapReduce code to get, work on, and emit results of data from the master node; and finally 4) the BUI, or browser user interface (web application), which lets the results be searched (see Figure 2). The MapReduce master node server leverages the Apache [17] web server for HTTP. To provide the MapReduce stack, we actually have two different implementations of our master node/job server code, one in Ruby [18] and one in PHP [19]; however, we primarily used the PHP variant during our testing. Once the master node has been seeded with some content to process, it is told to begin accepting participant connections. Once the process begins, clients of any type, mobile or traditional, may connect, get work, compute and return results. During processing, clients, whether they are mobile devices or processes running on a powerful server, can continually request work and compute results until nothing is left to do for a given collection. In this case, the server still responds to requests, but does not return work units since the cycle is complete (see Figure 3). After all the data has been processed, clients can still request work, but obviously are not furnished anything. At this point, our web application front end is used to search for keywords throughout the documents which were just processed. The web application was implemented in PHP and makes use of the jQuery [20] JavaScript framework to provide asynchronous (AJAX) page updates as workers complete units, in real-time.

Fig. 3. Client Flow

More can be seen in Figure 2. Further, Figure 4 illustrates exactly what the entire process looks like.
Fig. 4. Work Loop
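The client work loop of Figure 3 can be sketched as follows, with a toy in-memory `JobServer` standing in for the real HTTP master node (the actual request/response format of our server is not shown here; the class and method names are our own):

```python
class JobServer:
    """Toy stand-in for the master node: hands out splits, collects results."""
    def __init__(self, splits):
        self.pending = list(splits)
        self.results = []

    def get_work(self):
        # Return None once the cycle is complete, mirroring the real server
        # still answering requests but furnishing no work unit.
        return self.pending.pop() if self.pending else None

    def emit(self, result):
        self.results.append(result)

def client_loop(server, process):
    """Request work, compute, emit the result, and repeat until done."""
    completed = 0
    while (split := server.get_work()) is not None:
        server.emit(process(split))
        completed += 1
    return completed

server = JobServer(["alpha beta", "beta gamma"])
units = client_loop(server, lambda s: set(s.split()))  # completes 2 units
```

In the deployed system this loop runs over HTTP against the PHP job server, so any client type, mobile or traditional, can participate with the same protocol.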
5 System Development
There are a few additional aspects of developing this system that warrant discussion. Our experience with the development environment and lessons learned are worth sharing as well.

5.1 Mobile Client Application Development Experience
We developed our mobile client application on the iPhone OS platform using the iPhone SDK, the Cocoa Touch framework and the Objective-C programming language. As part of the iPhone SDK, the Xcode development environment was used for project management, source code editing and debugging. To run and test the MapReduce mobile client, we used the iPhone simulator in addition to actual devices. Apple’s Interface Builder provided a drag-and-drop tool to develop the user interface very rapidly. All in all, the experience was extremely positive [10].

5.2 Event-Driven Interruption on iPhone
Event handling on the iPhone client proved rather interesting, due largely to the fact that certain events can override an application and take control of the device against the application’s will. While the iPhone is processing data, events like an incoming phone call, an SMS message or a calendar alert can take control of the device. In the case of an incoming phone call, the application is paused. Once the user hangs up, the iPhone client is relaunched by iPhone OS, but it is up to the application to maintain state. While on the call, if the user goes back to the home screen or launches another application, the iPhone client does not resume, and again the application is responsible for maintaining state. When an SMS message or calendar event occurs, the computation continues in the background unless the user clicks on the message or views the calendar dialog; in that case, the behavior is the same as for a phone call. These events, which are entirely out of the control of the application, pose an interesting challenge and must be addressed during development.
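One common way to address such forced interruptions, sketched below in Python for brevity (the real client is written in Objective-C, and `client_state.json` is a hypothetical path), is to checkpoint the in-progress work unit so that a relaunched application can resume instead of losing the split:

```python
import json
import os

STATE_FILE = "client_state.json"  # hypothetical checkpoint location

def save_state(split_id, partial_result):
    """Hook for 'application is about to be interrupted' events."""
    with open(STATE_FILE, "w") as f:
        json.dump({"split_id": split_id,
                   "partial": sorted(partial_result)}, f)

def restore_state():
    """Hook for relaunch: return the checkpoint, or None if nothing to resume."""
    if not os.path.exists(STATE_FILE):
        return None
    with open(STATE_FILE) as f:
        state = json.load(f)
    os.remove(STATE_FILE)  # consume the checkpoint so it is used only once
    return state

save_state(42, {"alarm", "sensor"})   # e.g. called when a phone call arrives
resumed = restore_state()             # e.g. called when the app is relaunched
```

Alternatively, a client could simply abandon the split and let the job server reassign it, trading a little redundant computation for simpler client code; MapReduce's tolerance of slow or failed workers makes either choice workable.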
6 End-User Participation
Participants fall largely into two camps: captive and voluntary. For example, if a system such as ours were deployed in a large corporation where most employees have company-provided mobile devices, that company could require employees to allow their devices to participate in the system. These are what we consider captive users. Normal users, on the other hand, are true volunteers and participate for different reasons. The key is to come up with methods which engage both of these types of users so that the overall experience is positive for everyone involved. There are a large number of possible solutions to entice both types of users. Both captive and voluntary users could be offered prizes for participation, or perhaps simply receive accolades for being the participant with the most computed work units. This is similar to what both SETI@Home and Folding@Home
P.R. Elespuru, S. Shakya, and S. Mishra
do, and has proven effective. The sense of competition and participation drives people to team up in the hope of being the most productive participant. We return to this topic in Section 9.3.
7
Results
Our results were very interesting. We created several data sets composed of randomly generated text files. Overall data set sizes ranged from 5 MB to almost 50 MB, and within each data set the individual text documents ranged from a few kilobytes up to roughly 64 kilobytes. Processing throughput was largely consistent, independent of both the overall data set size and the distribution of included document sizes.

Fig. 5. Client Type Comparison

Figure 5 illustrates exactly what we expected: the simulated iPhone clients were the fastest, followed by the traditional Perl clients, and lastly the real iPhone clients, which processed data at the slowest rate of all clients tested. This behavior was expected because the simulated iPhone clients ran on the same machine as the server software during our tests, while the Perl clients were executed on remote Linux machines. Interestingly, mixing and matching client types did not seem to affect the contribution of any one client type: Perl clients processed data at roughly the same rate regardless of whether a given test included only Perl clients, as did the simulated and real iPhone clients.

Fig. 6. Min Max Average

Figure 6 presents another visualization that clearly shows a fair amount of variation across the client types. Again, the simulated iPhone clients processed the most data, primarily because they ran on the same machine as the server component; the traditional Perl clients were not far behind, and the real iPhone clients were the laggards of the bunch.
MapReduce System over Heterogeneous Mobile Devices
7.1 Interpretation of the Results
Simulated iPhone clients processed an average of 1.64 MB/sec, Perl clients an average of 1.29 MB/sec, and real iPhone clients an average of 0.12 MB/sec. The simulated iPhone clients can be thought of as another form of local client; they help highlight the overhead of the wireless connection and the processing capabilities of the real phones. These averages were consistent across a variety of data sets, both in size and in textual content. Our results show, very consistently, that the iPhones performed roughly an order of magnitude slower than the traditional clients, which is a very exciting result. It implies that a large portion of processing could be moved to these kinds of mobile clients, provided enough of them exist at a given time to perform the necessary workload. For example, a company could purchase one server to operate as the master node and farm all of the processing out to mobile devices within the company, provided it has on the order of one hundred or more employees with such devices, which is a very likely scenario. This also suggests that the system could be particularly useful for non-time-sensitive computations. For example, if a company had a large set of text documents to process, it could install a client on its employees' mobile devices; those devices could then connect and work on the data set over a long period, as long as they can process data faster than it is created. Since it is easy to quantify the contribution each device type can make, such a system could easily monitor its own progress. In summary, there is a large class of problems for which this system is a viable and exciting solution.

Fig. 7. Projected System Throughput
We had a limited number of actual devices to test with (three, to be specific), but all performed consistently across all tests and data sets, so we feel comfortable projecting forward to estimate the impact of more devices. As the number of actual devices increases, throughput should grow as represented in Figure 7. If a system utilized 500 mobile devices, we expect it would be capable of processing close to 60 MB/sec of textual data; similarly, 10,000 devices would likely yield the ability to process 1,200 MB/sec (1.2 GB/sec!) of data. This certainly suggests our system warrants further exploration, but it also points to the fact that other components of the system would start to become bottlenecks. For example, at those rates, massive network bandwidth would be needed just to support the data transfer necessary for processing to take place.
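The projection in Figure 7 is simple linear scaling from the measured per-device average. The arithmetic can be sketched as follows (assuming, as above, 0.12 MB/sec per real iPhone and no server or network bottleneck, which the text notes would eventually appear):

```python
# Per-client throughput averages measured in our tests (MB/sec).
SIMULATED_IPHONE = 1.64
PERL_CLIENT = 1.29
REAL_IPHONE = 0.12


def projected_throughput(n_devices, per_device_rate=REAL_IPHONE):
    """Aggregate throughput in MB/sec, assuming clients scale linearly
    and neither the server nor the network becomes the bottleneck."""
    return n_devices * per_device_rate
```

With this model, `projected_throughput(500)` gives the 60 MB/sec figure and `projected_throughput(10000)` the 1,200 MB/sec figure quoted above.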
8 Optimizations
There are a few areas where this system could be improved to provide a more automatic and pleasant experience. It is particularly important that the end-user experience be as seamless and elegant as possible.
8.1 Automatic Discovery
Currently, a client needs to know the IP address and port number of the server in order to participate. This required prior knowledge of the server address is a barrier to entry for our implementation of MapReduce. To allow auto-discovery, we could run Bonjour (also known as mDNS), a service discovery protocol, on the server and clients. Bonjour automatically broadcasts the service being offered. With Bonjour enabled on both server and clients, a WiFi network is not an absolute requirement. However, Bonjour has limitations as a service discovery protocol: all devices must be on the same subnet of the same local area network, which imposes client limits that would reduce the viability of our system in those situations.
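Since a real mDNS implementation needs a running network stack, the discovery idea can be illustrated with just the message layer: the announcement a server might broadcast and the client-side parser. This is a hypothetical sketch, not part of our implementation; the service name and payload format are assumptions.

```python
import json

SERVICE_TYPE = "_mapreduce._tcp"  # hypothetical service name


def build_announcement(host, port, version=1):
    """Serialize the server's contact information for a discovery broadcast."""
    return json.dumps({"service": SERVICE_TYPE, "host": host,
                       "port": port, "version": version}).encode("utf-8")


def parse_announcement(payload):
    """Return (host, port) if the payload advertises our service, else None."""
    try:
        msg = json.loads(payload.decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        return None  # ignore malformed or foreign broadcasts
    if msg.get("service") != SERVICE_TYPE:
        return None
    return msg["host"], msg["port"]
```

A client would listen for such broadcasts and connect to the first `(host, port)` pair it successfully parses, removing the need to configure the server address by hand.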
8.2 Device Specific Scaling
An important goal of this system is that it can be used on heterogeneous mobile devices, and not all mobile devices perform the same or have the same power-usage characteristics. The system should ideally know about each type of device it can run on and maintain a profile of sorts, allowing the system to optimize itself. For example, on a Google Android device the client would have one profile, and on an iPhone another, and in each case the client application would tailor itself to the environment on which it is running. The ultimate goal is to maximize performance relative to power consumption.
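Such a profile table might look like the following sketch (the device names and tuning values are illustrative assumptions, not measurements):

```python
# Hypothetical per-device tuning profiles: work-unit size and whether the
# client may process data while running on battery.
PROFILES = {
    "iphone":  {"chunk_kb": 64,  "battery_ok": False},
    "android": {"chunk_kb": 128, "battery_ok": True},
    "default": {"chunk_kb": 32,  "battery_ok": False},
}


def profile_for(device_type):
    """Pick the tuning profile for a device, falling back to a
    conservative default for unknown device types."""
    return PROFILES.get(device_type, PROFILES["default"])
```

The client would consult its profile at startup to decide how large a work unit to request and under what power conditions to run.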
8.3 Other Client Types
In addition to smart mobile devices of various types and traditional clients, other kinds of clients could be used alongside these two. In particular, a JavaScript client would allow any web browser to connect and participate in the system [8]. The combination of these three client types would be formidable indeed, forming a potentially massive computational system.
9 Additional Considerations
There are a number of areas we did not explore as part of our implementation. However, the following topics would need to be considered in an actual production implementation.
9.1 Security
Security is a multi-faceted topic in this context. Our concerns are twofold: first, whether the client implementation could affect the security of end users' mobile devices or in any way be used to compromise them; and second, whether the data mobile devices receive while participating would be risky to expose if a device were compromised by some other means. A production implementation would need to ensure that, even if a mobile device is compromised, any data associated with this processing system is inaccessible. One way to accomplish this would be to store local results in encrypted form and transmit them via HTTPS, so that the communication channel also minimizes the opportunity for compromise. Another consideration is whether to process sensitive information in the first place; certain regulations may prevent it altogether.
9.2 Power Usage
Power usage is a critical topic for a system such as this. The additional load placed on mobile devices will certainly draw more power, which could even be disastrous in some situations. For example, if the mobile device is an emergency phone, running down its battery to participate in a computation is a very bad idea. Ultimately, power usage must be considered when deciding which devices to allow in the mix. Several measures could address these concerns, such as adding code to the mobile client that prevents it from participating once battery charge drops below a certain level. This may prove tricky, however, since not all mobile platforms include API calls that allow an application to probe for that kind of low-level system information. A balance must be reached, and it is the responsibility of the client application implementation to maintain it.
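A minimal sketch of such a guard follows (the threshold value and signature are assumptions, not part of our implementation):

```python
def should_participate(battery_fraction, on_charger, min_battery=0.5):
    """Decline work when running on battery below a threshold, so that
    participation can never run an emergency phone flat. When the device
    is on a charger, participation is always allowed."""
    if on_charger:
        return True
    return battery_fraction >= min_battery
```

On platforms that expose no battery API, the client would have to fall back to a more conservative policy, such as participating only while charging.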
9.3 Participation Incentives
Regardless of whether a participating end user is a captive corporate user or a true volunteer, there should be an incentive structure that rewards participation in a manner that benefits all parties. Several incentives could be considered. One would be to offer a reward for participating, based for example on the number of work units completed; this would entice users to participate even more. A community or marketplace could even be built around this concept: companies could post documents they want processed and offer to pay some amount per work unit completed, and users could agree to participate for a given company, allowing their devices to churn out results as quickly as possible. The payment would have to be small to be viable, perhaps a few cents per work unit. Such a marketplace could easily become quite large and be beneficial to all involved. Amazon has a similar concept in place with its Mechanical Turk, which
allows people to post work units which other people then complete for a small sum of money [1]. Another possibility would be to bundle the processing into applications where it runs in the background, such as a music player, so that work continues while the player is playing music. The incentive could be a discount of a few cents when purchasing songs through that application, tied to some number of successfully completed jobs. The possibilities are numerous.
10 Conclusions
As our results make clear, mobile devices can certainly contribute positively to a large heterogeneous MapReduce system. The gain from even a few tens of mobile devices is substantial, and it will only grow as more and more mobile devices participate. Assuming a good server implementation exists, the mobile client contribution should increase with each new mobile device added. There is presumably a point of diminishing returns relative to network communication overhead, but the potential benefit is still very real. If non-captive user bases can be properly motivated, there is large potential here to process massive amounts of data for a wide range of uses. This is conceptually similar to existing cloud computing, except that the computation and storage resources happen to be mobile devices, or they interoperate between the traditional cloud and a new set of mobile cloud resources.
References
1. Amazon, Inc.: Amazon Mechanical Turk, https://www.mturk.com/mturk/welcome
2. Barroso, L.: Web Search for a Planet: The Google Cluster Architecture. IEEE Micro 23(2) (March 2003)
3. Blelloch, G.E.: Scans as Primitive Parallel Operations. IEEE Transactions on Computers 38(11) (November 1989)
4. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. ACM, New York (2004)
5. Dubey, P.: Recognition, Mining, and Synthesis Moves Computers to the Era of Tera. Technology@Intel Magazine (February 2005)
6. Egha, G.: Worldwide Smartphone Sales Analysis, UK (February 2008)
7. Folding@Home: Folding@Home project, http://folding.stanford.edu/
8. Grigorik, I.: Collaborative MapReduce in the Browser (2008)
9. Hunkins, J.: Will Smartphones be the Next Security Challenge (October 2008)
10. iPhone Developer Program: iPhone development, http://developer.apple.com/iphone/program/develop.html
11. Krazit, T.: Smartphones Will Soon Turn Computing on its Head. CNet (March 2008)
12. Ladner, R.E., Fischer, M.J.: Parallel Prefix Computation. Journal of the ACM 27(4) (October 1980)
13. Mitra, S.: Robust System Design with Built-in Soft-Error Resilience. IEEE Computer 38(2) (February 2005)
14. SETI@Home: SETI@Home Project, http://setiathome.ssl.berkeley.edu/
15. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving MapReduce Performance in Heterogeneous Environments. In: OSDI (2008)
16. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
17. Apache Web Server: Apache, http://httpd.apache.org/
18. Ruby Programming Language: Ruby, http://www.ruby-lang.org/en/
19. PHP Programming Language: PHP, http://www.php.net/
20. jQuery JavaScript Framework: jQuery, http://jquery.com/
Towards Time-Predictable Data Caches for Chip-Multiprocessors
Martin Schoeberl, Wolfgang Puffitsch, and Benedikt Huber
Institute of Computer Engineering, Vienna University of Technology, Austria
[email protected], [email protected], [email protected]
Abstract. Future embedded systems are expected to use chip-multiprocessors to provide the execution power for increasingly demanding applications. Multiprocessors increase the pressure on memory bandwidth, and processor-local caching is mandatory. However, data caches are known to be very hard to integrate into worst-case execution time (WCET) analysis. We tackle this issue from the computer architecture side: we provide a data cache organization that enables tight WCET analysis. Similar to the split between instruction and data caches, we argue for splitting the data cache for different data areas. In this paper we show cache simulation results for the split-cache organization, propose a modularization of the data cache analysis for the different data areas, and evaluate the implementation costs in a prototype chip-multiprocessor system.
1 Introduction
With respect to caching, memory is usually divided into instruction memory and data memory. This cache architecture was proposed in the first RISC architectures [1] to resolve the structural hazard of a pipelined machine, where an instruction has to be fetched concurrently with a memory access. The independent caching of instructions and data has also enabled the integration of cache hit classification of instruction caches into worst-case execution time (WCET) analysis [2]. While analysis of the instruction cache is a mature research topic, data cache analysis is still an open problem. After n accesses with unknown addresses to an n-way set-associative cache, the abstract cache state is lost.
In previous work we have argued for cache splitting in general [3], and in particular that caches for data with statically unknown addresses should be fully associative. In this paper we evaluate time-predictable data cache solutions in the context of the Java virtual machine (JVM). We provide simulation results for different cache organizations and sketch the resulting modular analysis. Furthermore, an implementation in the context of a Java processor shows the resource consumption and limitations of highly associative cache organizations. Access type examples are taken from the JVM implemented on the Java processor JOP [4]. Implementation details of other JVMs may vary, but the general classification of the data areas remains valid. Part of the proposed solution can be adapted to other object-oriented languages, such as C++ and C#, as well.
S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 180–191, 2009. © IFIP International Federation for Information Processing 2009
2 Data Areas and Access Instructions
The memory areas used by the JVM can be classified into five categories:
Method area. The instruction memory that contains the bytecodes for execution. On compiled Java systems this is the native code area.
Stack. Thread-local stack used for stack frames, arguments, and local variables.
Class information. A data structure representing the different types. Contains the type description, the method dispatch table, and the constant pool.
Heap. Garbage-collected heap of class instances. The object header, which contains auxiliary information, is stored on the heap or in a distinct handle area.
Class variables. Shared memory area for static variables.
Caching of the method area and the stack area has been covered in [5] and [6]. In this paper we are interested in a data cache solution for the remaining data areas. On standard cache architectures these memory areas and the stack memory share the same data cache.
2.1 Data Access Types
Data memory accesses (except stack accesses) can be classified as follows:
CLINFO. Type information, method dispatch table, and interface dispatch table. The method dispatch table is read on virtual and static method invocation and on the return from a method. The method dispatch table contains two words per method. Bytecodes: new, anewarray, multianewarray, newarray, checkcast, instanceof, invokestatic, invokevirtual, invokespecial, invokeinterface, *return.
CONST. Constant pool access; part of the class information. Bytecodes: ldc, ldc_w, ldc2_w, invokeinterface, invokespecial, invokestatic, invokevirtual.
STATIC. Access to static fields; this is the class variables area. Bytecodes: getstatic, putstatic.
HEADER. Dynamic type, array length, and fields for garbage collection. The type information on JOP is a pointer to the method dispatch table within CLINFO. On JOP each reference is accessed via one indirection, called the handle, to simplify the compacting garbage collection.
The header information is part of the handle area. Bytecodes: getfield, putfield, *aload, *astore, arraylength, invokevirtual, invokeinterface.
FIELD. Object field access; part of the heap. Bytecodes: getfield, putfield.
ARRAY. Array access; part of the heap. Bytecodes: *aload, *astore.
2.2 Cache Access Types
The different types of data cache accesses can be classified into four classes with respect to cache analysis:
– The address is always known statically. This is the case for static variables (STATIC), which are resolved at link time, and for the constant pool (CONST), which only depends on the currently executed method.
– The address depends on the dynamic type of the operand, but not on its value. Therefore, the set of possible addresses is restricted by the receiver types determined for the call site. The class info table, the interface table, and the method table are in this category (CLINFO).
– The address depends on the value of the reference. The exact address is unknown, as some value on the managed heap is accessed, but in addition to the symbolic address a relative offset is known. Instance fields and array elements, both showing some degree of spatial locality, belong to this category (FIELD, ARRAY).
– The last category contains handles, references to the method dispatch table, and array lengths (HEADER). They reside on the heap as well, but only the symbolic address is known.
2.3 Cache Coherence
For a chip-multiprocessor system, the cache coherence protocol is the major limiting factor on scalability. Splitting data caches also simplifies the cache coherence protocol. Most data areas are actually constant (CLINFO, CPOOL). Access into the handle area (HEADER) is pseudo-constant: the data is written into the header area during object creation and cannot be changed by a thread. However, the garbage collector can modify this area. To provide a coherent view of the handle area between the garbage collector and the mutators, a cache for the handle area has to be updated or invalidated appropriately. Data on the heap (FIELD, ARRAY) and in the static area (STATIC) is shared by all threads. With a write-through cache, coherence can be enforced by invalidating the cache on monitorenter and before reads from volatile fields.
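The invalidation rule can be illustrated with a small model (a Python sketch, not the JOP implementation): two cores share a backing store through private write-through caches, and a stale cached line disappears once the reader invalidates at a synchronization point.

```python
class WriteThroughCache:
    """Write-through cache over a shared backing store. Coherence is
    enforced by invalidating the whole cache at synchronization points,
    mirroring the rule above: invalidate on monitorenter and before
    reads from volatile fields."""

    def __init__(self, memory):
        self.memory = memory  # shared dict: address -> value
        self.lines = {}       # private cached copies

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]  # fill on miss
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value  # write-through keeps memory current

    def invalidate(self):
        """Called on monitorenter / before a volatile read."""
        self.lines.clear()
```

Between synchronization points a core may observe a stale value, which is permitted by the Java memory model for unsynchronized accesses; the invalidate call restores a coherent view exactly where the model requires it.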
3 Cache Benchmarks
Before developing a new cache organization we ran benchmarks to evaluate memory access patterns and possible cache solutions. Our assumption is that the hit rate in the average case correlates with the hit classification in the WCET analysis when different access types are cached independently. Therefore, we can reason about useful cache sizes from the benchmark results. For the benchmarks we use two real-world embedded applications [7]: Kfl is one node of a distributed control application, and Lift is a lift controller deployed in industrial automation. The Kfl application is very static, written in a conservative, procedural style; Lift was written in a more object-oriented style. Furthermore, two benchmarks from an embedded TCP/IP stack (UdpIp and Ejip) are used to collect performance data.
Table 1 shows the access frequencies to the different memory areas for all benchmarks. There are no write accesses to the constant data areas and also none to the pseudo-constant area (HEADER); as we measure applications without object allocation at runtime, the data in the HEADER area is not mutated. The general trend is that load instructions dominate the memory traffic (between 89% and 93%).
Table 1. Data memory traffic to different memory areas (in % of all data memory accesses). Rows: CLINFO, CONST, STATIC, HEADER, FIELD, ARRAY; columns: load and store fractions for Kfl, Lift, UdpIp, and Ejip.
For the Kfl application there are no field accesses (FIELD). The dominating accesses are to static fields (STATIC), static method invocation (CLINFO), and the constant pool (CONST); the rest are related to array accesses (HEADER, ARRAY). The Lift application has a quite different access pattern: instance field accesses dominate all reads (FIELD and HEADER). Fewer methods are invoked than in the Kfl application, and fewer static fields are accessed. The array access frequency of both applications is similar (4%–5%); for the TCP/IP benchmarks it is, due to many buffer manipulations, considerably higher (11% loads).
3.1 Cache Simulations
As a first step we simulate different cache configurations with a software simulation of JOP (JopSim) and evaluate the average-case hit count.
Handle Cache. As all operations on objects and arrays need an indirection through the handle, we first simulate a cache for the handles. The address of a handle is not known statically, therefore we assume a small fully associative cache with an LRU replacement policy. The results for different sizes are shown in Table 2; the size is in single words. It is quite interesting to note that even a single-entry cache provides a hit rate for the handle indirection of up to 72%. Caching a single handle should be simple enough that single-cycle hit detection, with the memory read started in the same cycle, should be possible. In that case, even a uniprocessor JOP with a two-cycle memory read will gain some speedup. A size of just 8 entries results in a reasonable hit rate between 84% and 95%.
Constants and the Method Table. Mixing accesses to the method table and to the constant pool in one direct-mapped cache is an option when the receiver types can be determined precisely. However, if the set of possible receiver types is large, the analysis becomes less precise.
Therefore, we evaluate individual caches for constant pool accesses (CPOOL) and method table accesses (CLINFO). Table 3 shows that a small direct-mapped cache of 512 words (2 KB) gives a hit rate of 100%. Keeping the cache sizes small is important for our intended system: we are targeting chip-multiprocessor systems with private caches, even for accesses to constants, to keep the individual tasks time-predictable. A shared cache would not allow any cache analysis of individual tasks.
Table 2. Hit rate of a handle cache, fully associative, LRU replacement

        Hit rate (%)
Size   Kfl  Lift  UdpIp  Ejip
1       72    15     43    69
2       82    20     80    78
4       84    94     87    82
8       88    95     91    84
16      92    95     94    84
32      95    95     96    86
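The JopSim measurement behind Table 2 can be approximated by a small trace-driven model (a Python sketch, not the actual JopSim code): a fully associative cache with LRU replacement is simply an ordered set in which a hit moves the entry to the most-recently-used position.

```python
from collections import OrderedDict


def lru_hit_rate(trace, size):
    """Hit rate of a fully associative cache with LRU replacement over an
    address trace, as in the handle-cache experiments."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark most recently used
        else:
            if len(cache) >= size:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits / len(trace)
```

Feeding such a model with the handle-address trace of an application directly yields the per-size hit rates tabulated above.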
Table 3. Hit rate of a constant pool cache, direct mapped

        Hit rate (%)
Size   Kfl  Lift  UdpIp  Ejip
32      68    69     77    82
64      96    69     79    95
128     98    69     88    95
256    100   100    100    95
512    100   100    100   100

Table 4. Hit rate of a method table cache, direct mapped

        Hit rate (%)
Size   Kfl  Lift  UdpIp  Ejip
32      64    83     62    49
64      85    83     77    74
128     91   100     85    93
256    100   100     97    95
The hit rate of a direct-mapped cache for the method table (MTAB) shows behavior similar to the constant pool caching, as shown in Table 4. A size of 256 words gives a hit rate between 95% and 100%. Note that the method table is accessed by both static and virtual method invocations. While the MTAB entry is known statically for static methods, the MTAB entry for virtual methods depends on the receiver type. If data-flow analysis can determine most of the receiver types, the combination of a single cache for the constant pool and the method table is an option to explore further.
Static Fields. Table 5 shows the results for a direct-mapped cache for static fields. For object-oriented programs (represented by Lift), this cache can be kept very small. Although the addresses are statically known, as are the addresses of constants, a combination of these two caches is not useful: static fields need to be kept cache coherent, while constant pool entries are implicitly cache coherent. Cache coherence enforcement, with cache invalidation at synchronized blocks, limits the hit rate in UdpIp and Ejip.
Table 5. Hit rate of a static field cache, direct mapped

        Hit rate (%)
Size   Kfl  Lift  UdpIp  Ejip
32      76   100     33    77
64      85   100     33    77
128     99   100     33    77
256    100   100     33    77
Table 6. Hit rate of an instance field cache, fully associative, LRU replacement

        Hit rate (%)
Size   Kfl  Lift  UdpIp  Ejip
1       84    17     47     9
2       84    75     59    13
4       84    86     65    18
8       84    88     67    20
16      84    88     67    20
32      84    88     67    20
Object Fields. The addresses of object fields are unknown to the analysis. Therefore, we can only attack the analysis problem via high associativity. Table 6 shows hit rates of fully associative caches with an LRU replacement policy. For the Lift benchmark we observe a moderate hit rate of 88% for a very small cache of just 8 entries. UdpIp and Ejip saturate at 8 entries due to cache invalidation during synchronized blocks of code.
3.2 Summary
The simulations of different caches for different memory areas show that quite small caches can provide a reasonable hit rate. Moreover, as the memory access latency for a CMP system with time-sliced memory arbitration can be quite high,1 even moderate cache hit rates are a reasonable improvement.
4 Cache Analysis
In the following section we sketch the cache analysis as it will be performed in a future version of our WCET analysis tool [8]. We leverage the cache splitting of the data areas for a modular analysis; e.g., the analysis of heap-allocated objects is independent of the analysis of the cache for constants or the cache for static fields.
1 Our 8-core CMP prototype with a time slot of 6 cycles per core has a worst-case memory latency of 48 cycles.
In multithreaded programs, it is necessary to invalidate the cache when entering a synchronized block or reading from volatile variables.2 We require that accesses to shared data are properly synchronized, which is the correct way to access shared data in Java. In this case it is safe to assume that object references on the heap are not changed by another thread at arbitrary points in the program, resulting in a significantly more precise analysis. The effect of synchronization, namely invalidating some of the caches, has to be taken into account, though. The running example, taken from the Lift application, is illustrated in Figure 1. The figure comprises the source code of the method checkLevel and the corresponding control flow graph in static single assignment (SSA) form. Each basic block is annotated with the cache accesses it triggers.
4.1 Static and Type-Dependent Addresses
If we only deal with statically known addresses in a data cache, the standard cache hit/miss classification (CHMC) for instruction caches delivers precise results and is therefore a good choice [9]. In the example, there is only one static variable, LEVEL_POS. If we assume a direct-mapped cache for static variables, and a separate one for values on the heap, all but the first access to the field will be a cache hit every time checkLevel is executed. When the address depends on the type of the operand, we have to deal with a set of possible addresses. The straightforward extension of CHMC to sets of memory addresses is to update the abstract cache state for each possible address and then join the resulting states. This leads to a very pessimistic classification when dynamic dispatch is used, however, and is therefore only acceptable if the exact address is known for most references.
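For straight-line code with statically known addresses, CHMC reduces to tracking a "must" abstract state per cache set. The following sketch (a simplification that ignores the joins needed at control-flow merges, so it is not a full analysis) illustrates the idea for a direct-mapped cache:

```python
def classify_accesses(trace, n_sets):
    """Cache hit/miss classification for a direct-mapped cache when every
    address is statically known: a 'must' abstract state maps each cache
    set to the single address guaranteed to be cached there. Returns
    'always-hit' / 'miss' labels for a straight-line access trace."""
    must = {}  # set index -> address known to be cached there
    labels = []
    for addr in trace:
        s = addr % n_sets  # set index for a direct-mapped cache
        if must.get(s) == addr:
            labels.append("always-hit")
        else:
            labels.append("miss")
            must[s] = addr  # the access loads the line, so it is now known
    return labels
```

In a real analysis, the "must" states of converging control-flow paths are intersected at join points, which is exactly where the extension to address sets becomes pessimistic under dynamic dispatch.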
4.2 Persistence Analysis
If dynamic types are more common, a more promising approach is to classify program fragments, or partitions, where it is known that one or all memory addresses are locally persistent. If this is the case, they will be missed at most once during one execution of the program fragment. For both direct-mapped and N-way set-associative caches with LRU replacement, a dataflow analysis for persistence has been introduced in [10]. For FIFO caches, the concept of persistence is useful as well, but it is no longer safe to assume that a persistent address is loaded at the first access.
Most work on persistence analysis focuses on dataflow equations and global persistence, leaving out some aspects which deserve more attention. Persistence across the whole program is rare and only of theoretical interest. We therefore identify a set of nested scopes [11] and determine for each scope which cache lines or cache sets are locally persistent. A scope is a subgraph of the control flow graph which represents a set of execution sequences. Methods, loops, and loop bodies are typical examples of scopes, but partitions of less regular shape are possible as well. To reduce the amount of analysis work, persistence is checked in a bottom-up manner, starting at the leaves of the scope nesting graph. In the example, we partitioned the flow graph of checkLevel into two scopes, the first of which contains the method turnOffLeds, while checkLevel.2.Loop is a subscope of the second one.
2 The semantics of volatile variables in the Java memory model is similar to synchronized blocks: the complete global state has to be locally visible before the read access. Simply bypassing the cache for volatile accesses is not sufficient.
4.3 Object Header and Fields
As the addresses of the object header and the object fields depend on the instance and are not known at compile time, we use small, fully associative caches to track the cache state. There is usually a high number of handle accesses in object-oriented programs,
but many of them do not change often. In our architecture, the object header is fully transparent at the bytecode level and is managed by the runtime system. Caching is hence expected to yield a substantial benefit. On other platforms, which compile the bytecode and hold handles and array length values in registers, caching those values is probably less beneficial.
To calculate the symbolic addresses of object headers and fields used in some scope, the data dependencies of the control flow graphs in SSA form are analyzed. In SSA form, each variable is only defined once. If the definition is of the form v = φ(v1, v2), the definition is called a φ node, and the value of v is either that of v1 or v2. For each object header used, those data dependencies which are defined in the same scope, and might be executed more than once within the scope, are identified. If none of those definitions is a φ node or depends on an indeterministic instruction, the variable representing the object corresponds to a unique symbolic address. Finally, if all references used within a scope correspond to a unique symbolic address, we are able to perform a local persistence analysis. Additionally, using a variant of the global value numbering technique used in optimizing compilers [12], the quality of the analysis is further improved by identifying variables mapping to the same symbolic address.
In the running example, no handle has a data dependency on a φ node, and therefore persistence analysis is relatively simple. If a fully associative cache with four cache lines is used, all object headers of scope checkLevel are locally persistent. If the object header cache only has two entries, at least those headers used in scope turnOffLeds and checkLevel.2.Loop are locally persistent.
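The scope-local persistence condition can be stated compactly. In the sketch below (Python; the SSA details are reduced to a set of variables defined by φ nodes, so this is an illustration of the condition, not the analysis itself), a scope's accesses are locally persistent in a fully associative LRU cache when every reference has a unique symbolic address and the distinct addresses fit in the cache:

```python
def unique_symbolic_address(var, phi_defs):
    """A reference has a unique symbolic address in a scope if its
    definition chain contains no phi node (simplified SSA check)."""
    return var not in phi_defs


def locally_persistent(scope_accesses, associativity, phi_defs=frozenset()):
    """All accesses in a scope are locally persistent (missed at most once)
    in a fully associative LRU cache if every reference has a unique
    symbolic address and the number of distinct addresses fits in the
    cache."""
    if any(not unique_symbolic_address(v, phi_defs) for v in scope_accesses):
        return False
    return len(set(scope_accesses)) <= associativity
```

This mirrors the example above: with four cache lines, all headers of checkLevel fit and are persistent; with two lines, only the smaller subscopes satisfy the condition.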
5 Cache Implementation

We have implemented various forms of caches in the context of the Java processor JOP [4]: (1) a small fully associative cache with LRU replacement, (2) a fully associative cache with FIFO replacement, and (3) a direct mapped cache. We have combined the different caches to distinguish between different data areas.

5.1 LRU and FIFO Caches

The crucial component of an LRU cache is the tag memory. In our implementation it is organized as a shift register structure to implement the aging of the entries (see Figure 2). The tag memory that represents the youngest cache entry (cache line 0) is fed by a multiplexer from all other tag entries and the address from the memory load. This multiplexer is the critical path in the design and limits the maximum associativity. Table 7 shows the resource consumption and maximum frequency of the LRU and FIFO caches. The resource consumption is given in logic cells (LC) and in memory bits. As a reference, a single core of JOP consumes around 3500 LCs, and the maximum frequency in the Cyclone-I device without data caches is 88 MHz. The large multiplexer in the LRU cache visibly lowers the maximum frequency for configurations with a high associativity. The implementation of a FIFO replacement strategy avoids changing all tag memories on each read. Therefore, the resource consumption is less than for an LRU
Fig. 2. LRU tag memory implementation

Table 7. Implementation results for LRU and FIFO based data caches (associativity 16-way to 256-way)
cache and the maximum frequency is higher. However, hit detection still has to be performed on all tag memories in parallel, and the matching entry has to be selected.

5.2 Split Cache Implementation

We have combined a direct mapped cache and an LRU cache with one JOP core. The LRU cache stores the object header and the object fields; the direct mapped cache stores class info, constants, and static fields; array data is not cached. Table 8 shows the resources and the maximum system frequency of different cache configurations. The first line gives the base numbers without any data cache. From the resource consumption we can see that a direct mapped cache is cheap to implement. Furthermore, the maximum clock frequency is independent of the direct mapped cache size. A highly associative LRU cache (i.e., 32-way and more) dominates the maximum clock frequency and consumes considerable logic resources.
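The behavioral difference between the two replacement policies discussed in this section can be illustrated with a small software model. This is a sketch of the replacement behavior only, not of the hardware; the class and method names are ours:

```python
from collections import OrderedDict, deque

class LRUCache:
    """Fully associative cache with LRU replacement. The tag memory is
    modelled as an ordered map; the hardware uses a shift-register tag
    structure to age the entries on every access."""
    def __init__(self, lines):
        self.lines, self.tags = lines, OrderedDict()

    def access(self, tag):
        hit = tag in self.tags
        if hit:
            self.tags.move_to_end(tag)         # accessed entry becomes youngest
        else:
            if len(self.tags) == self.lines:
                self.tags.popitem(last=False)  # evict least recently used
            self.tags[tag] = True
        return hit

class FIFOCache:
    """FIFO replacement: the tag memory only changes on a miss, which
    keeps the hardware cheaper than LRU at the cost of a lower hit rate."""
    def __init__(self, lines):
        self.lines, self.fifo = lines, deque()

    def access(self, tag):
        if tag in self.fifo:
            return True                        # hit: no state change
        if len(self.fifo) == self.lines:
            self.fifo.popleft()                # evict oldest entry
        self.fifo.append(tag)
        return False
```

On the access trace A, B, A, C, A with two cache lines, the LRU cache scores two hits while the FIFO cache scores only one, because FIFO does not refresh A's age on a hit.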
6 Related Work

Early work on data cache access classification by White et al. focuses on computing addresses and analyzing array access patterns [13]. It is assumed, however, that the
Table 8. Implementation results for a split cache design
exact memory accesses can be resolved. Ferdinand et al. [10] discuss the use of dataflow analysis for data cache analysis. They suggest using persistence analysis to deal with memory accesses which reference one out of a set of possible addresses. To overcome the problems with unknown memory addresses, Lundqvist et al. [14] suggest distinguishing unpredictable and predictable memory accesses to improve the analysis of data caches. If an address cannot be resolved at compile time, accesses to that address are considered unpredictable. Data structures which might be accessed by unpredictable memory accesses are marked to be moved into an uncached memory area. Vera et al. [15] lock the cache during accesses to unpredictable data. The locking proposed there affects all kinds of memory accesses, though, and is therefore necessarily coarse-grained.
7 Conclusion

Chip-multiprocessor systems increase the pressure on the memory bandwidth, and caching of instructions and data is mandatory. In order to estimate tight WCET values, we propose to split data caches for different data areas. Benchmarking of embedded applications shows possible trade-offs between achievable hit rates and the sizes of the different caches. Splitting the data cache for different access types (e.g., constant pool and heap) allows us to modularize the cache analysis. Furthermore, unknown addresses of one data access type have no impact on data accesses of a different type. Caches for data where the address is not known statically (e.g., heap allocated data) can only be analyzed when the cache has a very high associativity. From our prototype implementation within an FPGA we conclude that LRU caches scale up to an associativity of 16 and FIFO caches up to an associativity of 64.
Acknowledgements

The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement number 216682 (JEOPARD).
References

1. Patterson, D.A.: Reduced instruction set computers. Commun. ACM 28(1), 8–21 (1985)
2. Arnold, R., Mueller, F., Whalley, D., Harmon, M.: Bounding worst-case instruction cache performance. In: Proceedings of the Real-Time Systems Symposium 1994, pp. 172–181 (December 1994)
3. Schoeberl, M.: Time-predictable cache organization. In: Proceedings of the First International Workshop on Software Technologies for Future Dependable Distributed Systems (STFSSD 2009), Tokyo, Japan. IEEE Computer Society, Los Alamitos (2009)
4. Schoeberl, M.: A Java processor architecture for embedded real-time systems. Journal of Systems Architecture 54(1-2), 265–286 (2008)
5. Schoeberl, M.: A time predictable instruction cache for a Java processor. In: Meersman, R., Tari, Z., Corsaro, A. (eds.) OTM-WS 2004. LNCS, vol. 3292, pp. 371–382. Springer, Heidelberg (2004)
6. Schoeberl, M.: Design and implementation of an efficient stack machine. In: Proceedings of the 12th IEEE Reconfigurable Architecture Workshop (RAW 2005), Denver, Colorado, USA. IEEE, Los Alamitos (2005)
7. Schoeberl, M.: Application experiences with a real-time Java processor. In: Proceedings of the 17th IFAC World Congress, Seoul, Korea (July 2008)
8. Huber, B.: Worst-case execution time analysis for real-time Java. Master's thesis, Vienna University of Technology, Austria (2009)
9. Theiling, H., Ferdinand, C., Wilhelm, R.: Fast and precise WCET prediction by separated cache and path analyses. Real-Time Syst. 18(2/3), 157–179 (2000)
10. Ferdinand, C., Wilhelm, R.: On predicting data cache behavior for real-time systems. In: Müller, F., Bestavros, A. (eds.) LCTES 1998. LNCS, vol. 1474, pp. 16–30. Springer, Heidelberg (1998)
11. Engblom, J., Ermedahl, A.: Modeling complex flows for worst-case execution time analysis. In: RTSS 2000: Proceedings of the 21st IEEE Real-Time Systems Symposium, pp. 163–174. IEEE Computer Society, Los Alamitos (2000)
12. Click, C.: Global code motion/global value numbering. SIGPLAN Not. 30(6), 246–257 (1995)
13. White, R.T., Mueller, F., Healy, C., Whalley, D., Harmon, M.: Timing analysis for data and wrap-around fill caches. Real-Time Syst. 17(2-3), 209–233 (1999)
14. Lundqvist, T., Stenström, P.: A method to improve the estimated worst-case performance of data caching. In: RTCSA 1999: Proceedings of the Sixth International Conference on Real-Time Computing Systems and Applications, Washington, DC, USA, pp. 255–262. IEEE Computer Society, Los Alamitos (1999)
15. Vera, X., Lisper, B., Xue, J.: Data caches in multitasking hard real-time systems. In: RTSS 2003: Proceedings of the 24th IEEE International Real-Time Systems Symposium, Washington, DC, USA, pp. 154–165. IEEE Computer Society, Los Alamitos (2003)
From Intrusion Detection to Intrusion Detection and Diagnosis: An Ontology-Based Approach

Luigi Coppolino, Salvatore D'Antonio, Ivano Alessandro Elia, and Luigi Romano

Dipartimento per le Tecnologie - Università degli Studi di Napoli "Parthenope"
{luigi.romano,luigi.coppolino,salvatore.dantonio,ivano.elia}@uniparthenope.it
http://www.dit.uniparthenope.it/FITNESS
Abstract. Currently available products provide only some support in terms of Intrusion Prevention and Intrusion Detection, but they very much lack Intrusion Diagnosis features. We discuss the limitations of current Intrusion Detection System (IDS) technology, and propose a novel approach - which we call Intrusion Detection & Diagnosis System (ID2S) technology - to overcome such limitations. The basic idea is to collect information at several architectural levels, using multiple security probes, which are deployed as a distributed architecture, to perform sophisticated correlation analysis of intrusion symptoms. This makes it possible to escalate from intrusion symptoms to the adjudged cause of the intrusion, and to assess the damage in individual system components. The process is driven by ontologies. We also present preliminary experimental results, providing evidence that our approach is effective against stealthy and non-vulnerability attacks.

Keywords: Intrusion Detection and Diagnosis, Information Diversity, Ontologies, Stealthy and non-vulnerability attacks.
1 Rationale and Contribution
By Diagnosis, we mean the capability of: i) clearly identifying the causes of the attacks, and ii) accurately estimating their consequences on individual system components. Currently available products provide only some (indeed limited) support in terms of Intrusion Prevention and Intrusion Detection, but they very much lack Intrusion Diagnosis capabilities. We strongly believe that this technology trend should be subverted, and that more effort should be put into the development of effective techniques for implementing Intrusion Diagnosis features. We propose a novel approach, which extends Intrusion Detection System (IDS) technology to what we call Intrusion Detection & Diagnosis System (ID2S) technology, to overcome this limitation. The basic idea is to collect information at several architectural levels (namely: Network, Operating System, Data Base, and Application), using multiple security probes which are deployed as a distributed

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 192–202, 2009.
© IFIP International Federation for Information Processing 2009
architecture, and use Complex Event Processing (CEP) technology to perform sophisticated correlation analysis of intrusion symptoms. The idea of collecting information from different sources to gain more insight into attack/intrusion related phenomena is not new. A (far from complete) list of remarkable works is: [3], [4], and [5]. While these works exploit the concept of correlation and multilayer analysis, they do not address the issue of diagnosing the kind of anomaly or attack the system is experiencing. In our approach, the escalation process from intrusion symptoms to the adjudged cause of the intrusion and to an assessment of the damage in individual system components is driven by ontologies. More precisely, we have developed two sets of ontologies: the first one allows us, given that a particular symptom has been observed, to identify which attacks may have generated that symptom; the second one can be used to infer an estimate of the damage to specific system components from knowledge of the attack. The output of the process can then be used to drive remediation actions, and ultimately replenish system resources. To demonstrate the effectiveness of our approach, we have conducted preliminary experiments on a testbed consisting of a web server running Joomla, the well-known open-source Content Management System (CMS) written in PHP. The experiments exposed vulnerabilities of the target applications to SQL injection (SQLi) and Cross Site Scripting (XSS) attacks, which are described in the Bugtraq repository. The experimental tests have demonstrated that our approach leads to better detection results, both in terms of improved accuracy of the classification process and of enhanced reliability of the decision-making process. Also importantly, we were able to clearly identify the nature of the attack, as well as the specific system components affected by it.
We emphasize that the proposed approach is effective against an emerging class of new attacks, which is referred to in the literature [1] as "stealthy". These attacks represent a major threat, since not only do they have a dramatic impact in terms of economic losses [2], but (i) they are also invisible to current state-of-the-art IDSs, and (ii) current Intrusion Prevention Systems are ineffective against them. However, since such attacks have clear symptoms at architectural levels other than the network, by collecting information at multiple architectural levels using diverse security probes, and performing sophisticated correlation analysis of attack symptoms, we are able to detect them. The rest of the paper is organized as follows. Sect. 2 describes the approach we propose, and how it is currently implemented in the framework of the INTERSECTION and INSPIRE projects. Sect. 3 describes the ontology-based detection and diagnosis process. Sect. 4 provides a description of the case study and presents preliminary experimental results. Finally, Sect. 5 gives some concluding remarks, along with information concerning the directions of our future work.
2 Conceptual Architecture of the ID2S Technology
In order to effectively assess the security status of a networked system, the results of the monitoring activities performed at different observation points need to be correlated. Such observation points are distributed throughout the network
L. Coppolino et al.
Fig. 1. Conceptual architecture of the ID2S technology
as well as throughout the system to be protected. The more diverse the information sources and the processing methods, the more effective the correlation process. The deployment of probes at different observation points located in the networked system and at different architectural levels (network level, operating system level, application level, etc.) makes it possible to fulfill the requirement of diversity of the information sources. By exploiting information diversity, it is possible to improve the accuracy of the detection process, as well as to implement diagnostic capabilities. Fig. 1 shows the architecture of the proposed Intrusion Detection and Diagnosis System, which comprises the following functional blocks:

– Event Collection - Collecting security-related events from a wide range of detection probes implies dealing with heterogeneous data sources. To cope with heterogeneity, a promising solution is the use of Adaptable Parsers (APs). APs extract security-related information from multiple data feeds, and convert it to IDMEF (Intrusion Detection Message Exchange Format) standard messages, which are then routed to the Event Distribution Channel. A system component, called Decision Engine (DE), is in charge of processing data available on the Event Distribution Channel, and of correlating them, in order to decide whether the collected symptoms represent an actual attack or not. The current implementation of the AP components relies on Java Compiler Compiler (JavaCC) technology. More details on adaptable parsers are available in [8].
– Stream Mapping - This function converts IDMEF messages describing the symptoms of a possible intrusion to streams of tuples, which are then fed to a Complex Event Processing (CEP) engine. CEP technology enables the Decision Engine to perform sophisticated correlations on information gathered
from the multiple security probes. The current implementation of the CEP engine is based on Borealis [10].
– Detection - This phase consists in the extraction of higher-level knowledge from situational information. This is done in real time or near real time.
– Diagnosis - An ontology-based hierarchical organization of event patterns is used to automate the process of deriving the queries which implement the diagnostic analysis. The knowledge formalized by the ontology is used by the diagnostic process to identify which intrusions/attacks are the possible cause of the observed symptoms, and which part of the system has been affected by the malicious activity.
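The Event Collection and Stream Mapping stages described above can be sketched as follows. The probe log formats and the parser functions are invented for illustration, and the field names are only loosely inspired by the IDMEF data model; they do not reproduce the actual adaptable parsers:

```python
# Normalize heterogeneous probe output into IDMEF-style dictionaries
# before routing them to the Event Distribution Channel.
def parse_scalp_line(line):
    # Hypothetical Scalp output: "<timestamp> <attack-class> <request>"
    ts, attack_class, request = line.split(" ", 2)
    return {"analyzer": "Scalp", "create_time": ts,
            "classification": attack_class, "target": request}

def parse_aqfm_line(line):
    # Hypothetical AQFM output: "<timestamp> <symptom> rate=<value>"
    ts, symptom, rate = line.split(" ")
    return {"analyzer": "AQFM", "create_time": ts,
            "classification": symptom,
            "intensity": float(rate.split("=")[1])}

# All normalized events share one schema, so the Decision Engine can
# correlate them regardless of which probe produced them.
channel = [parse_scalp_line("2009-11-16T10:00:01 sqli /index.php?id=1"),
           parse_aqfm_line("2009-11-16T10:00:02 query_failure rate=0.4")]
```

The point of the normalization is that downstream queries only ever see one tuple schema, which is what makes the stream-mapping step into the CEP engine straightforward.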
3 Ontology-Based Detection and Diagnosis
The Decision Engine is in charge of detecting ongoing attacks by analyzing and correlating the security-related events which have been conveyed to the Event Distribution Channel. The queries are formulated based on the information and knowledge contained in the threat ontology. Fig. 2 presents a high-level view of such an ontology. Properties of concepts and sub-concepts (denoted by ellipses) are not shown because they would make the ontology unwieldy. An event generated by a probe is considered a potential symptom of an attack against a specific target. Each kind of attack is associated with a set of symptoms. An attack is described by using a number of indicators that assess the trustworthiness of the probe used to detect it. Attack has the isEvaluatedBy property, which is defined by the AttackIndicator concept. AttackIndicator has the following properties: (i) hasTrustworthinessValue, which is defined by the Trustworthiness concept; (ii) isAssociatedTo, which is defined by the Probe concept, and (iii) indicates, which is defined by the Symptom concept. Symptoms are classified into Abuses, Misuses and Suspicious Acts.
Fig. 2. The Threat Ontology
Abuses are actions which change the state of an asset. They are further divided into Anomaly-based and Knowledge-based Abuses. The first category of abuses includes anomalous behaviors (e.g., unusual application load, anomalous input requests, etc.), while the second category relies on the recognition of signatures of well-known attacks (e.g., brute force attacks). Misuses are out-of-policy behaviors which do not affect the state of the system components (e.g., authentication failures, query failures). Suspicious Acts do not violate any policy. They are events of interest to the probes (e.g., execution of commands providing information about the system state). The Symptom concept has the following properties: (i) isDetectedBy, which is defined by the Probe concept; (ii) isCausedBy, which is defined by the Attacker concept, and (iii) isDirectedTo, which is defined by the Target concept. Each Symptom is characterized by the hasDetectionTime property, which specifies the detection time of the symptom, and the hasIntensityScore property, which measures the probability of occurrence of the symptom with reference to the specific probe detecting it. Targets are classified into four categories: Network, Operating System, Data Base, and Application. A software tool, named Query Generator, browses the threat ontology, extracts the properties characterizing an attack, and generates the queries to be executed by the Complex Event Processor. Since the events feeding the distribution channel encompass information about both the attack symptom and the detection probe, the Symptom and Probe concepts of the threat ontology can be used to build the query which drives the correlation process performed by the CEP engine. More precisely, the structure of the threat ontology shows that the Attack Indicator (AI) concept indicates a Symptom and isAssociatedTo a Probe. Therefore, events on the Event Distribution Channel can be considered as AI instances.
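The escalation from an observed symptom to candidate attacks can be sketched with the threat ontology reduced to plain data. The concrete entries below are illustrative examples, not the full ontology, and the trustworthiness values are invented:

```python
# Each Attack isEvaluatedBy a list of attack indicators; each indicator
# indicates a symptom, isAssociatedTo a probe, and hasTrustworthinessValue.
ONTOLOGY = {
    "SQLi": [
        {"probe": "Scalp", "symptom": "sqli_signature",  "trust": 0.7},
        {"probe": "ACDM",  "symptom": "anomalous_chars", "trust": 0.5},
        {"probe": "AQFM",  "symptom": "query_failures",  "trust": 0.9},
    ],
    "XSS": [
        {"probe": "Scalp", "symptom": "xss_signature",   "trust": 0.6},
        {"probe": "ACDM",  "symptom": "anomalous_chars", "trust": 0.5},
    ],
}

def attacks_for_symptom(symptom):
    """Escalate from an observed symptom to the attacks that may
    have generated it, following the isEvaluatedBy/indicates links."""
    return [attack for attack, ais in ONTOLOGY.items()
            if any(ai["symptom"] == symptom for ai in ais)]
```

A shared symptom such as an anomalous character distribution maps to both SQLi and XSS, which is exactly why the later aggregation and filtering phases are needed to discriminate between them.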
While generating the queries, the Query Generator uses the Attack Indicator properties to define the query parameters which help detect the specific attack. Fig. 3 shows the main functions that compose the process, namely: Classification, Aggregation, and Filtering.

Classification. The purpose of the classification function is to create a separate stream for each possible attack (i.e., every Attack described in the ontology), in order to make the subsequent aggregation function more efficient. Every attack is associated with a stream containing the Attack Indicators, i.e., the attack information which should be used to detect the attack according to the isEvaluatedBy relationship in the threat ontology. The Classifier is implemented as a set of concatenated filter and map queries. A filter query extracts from the event descriptions on the Event Distribution Channel only the attack details that concern a specific attack, while a map query converts the selected information to a stream of attack indicators.

Aggregation. Once, for every class of Attack, a stream has been created containing only the relevant AIs, such indicators are aggregated in order to formulate hypotheses about the ongoing attacks. The aggregation is performed by combining subsets of AIs. The aggregation process scales with the number and the type
Fig. 3. The query generation process
of matching patterns, thanks to the use of the ontology, which makes it possible to select the most relevant information (for the specific domain). The implemented aggregation patterns include temporal proximity and Source and/or Target matching of the Symptoms. Event aggregation is performed by making join queries which generate meta-events containing hypotheses on the possible results of the diagnostic activity. In order to discriminate among such hypotheses, a confidence degree is associated with every meta-event by using the HasConfidenceLevel property. The confidence degree is computed through a weighted combination of the HasIntensityScore values of the aggregated Symptoms, where the weight of every Symptom is given by the hasTrustworthinessValue of its AI.

Filtering. The filtering function performs the crucial task of selecting the Attack instances that have a HasConfidenceLevel value exceeding a configurable threshold. This threshold-based filtering makes it possible to lower the number of false positives, as only the aggregated events showing a robust detection pattern will be considered an actual diagnosis of an attack and raise an alert.

The diagnostic process aims to extract a higher level of knowledge from the aggregated symptoms of an ongoing Attack. This process is not performed offline on the aggregated AIs. Conversely, it is performed during all the phases of the correlation process. The goal of the diagnostic process is to characterize the ongoing attack in terms of:

– Attack Type - The Attack Type indicates the class of the ongoing attack. It is determined during the classification phase, by discriminating AIs based on the detectable types of Attack. The aggregation performed on a given class of AIs will result in the detection of that kind of Attack.
– Attack Targets - Information about the attack target(s) is inferred by looking at the Target concepts of the aggregated AIs. In most cases, not all aggregated Symptoms are DirectedTo the same Target. For example, detecting an attack against an application server requires that the behavior of both the application server and the database be monitored, and in case of an ongoing attack the Event Distribution Channel will be fed with different events. Some events are raised by probes monitoring the application server, while others are generated by probes observing the database. The generated events correspond to symptoms having different targets, and a naive event correlation would therefore be unfeasible. The information allowing the CEP to identify the relationships and dependencies between the different system components is inferred from the Attack Indicator concepts. This approach enables the correlation of the different symptoms and the identification of the system components affected by the attack. These components are considered potentially violated targets, and as such they should not be considered trusted.
– Attack Latency - Basically, the Attack Latency (AL) parameter is the amount of time that an ongoing attack against a system has gone undetected. More precisely, it represents an upper bound estimate of the amount of time during which the system has been manipulated by the attacker. Given the timestamps t1, t2, ..., tN of the aggregated AIs, and tD the instant when the attack is detected (i.e., the instant when the alert is generated), AL can be evaluated as: AL = tD − min(t1, t2, ..., tN), with N being the number of available AIs.
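The confidence computation, threshold filtering, and latency estimate above can be sketched in a few lines. The paper does not fix the exact weighted-combination formula, so a normalized weighted mean is assumed here, and the event field names are ours:

```python
def confidence(ais):
    """HasConfidenceLevel: weighted combination of the aggregated
    symptoms' hasIntensityScore values, each weighted by the
    hasTrustworthinessValue of its attack indicator (assumed here
    to be a normalized weighted mean)."""
    total_weight = sum(ai["trust"] for ai in ais)
    return sum(ai["trust"] * ai["intensity"] for ai in ais) / total_weight

def filter_alerts(meta_events, threshold=0.6):
    # Threshold-based filtering: only attack hypotheses whose confidence
    # exceeds the configurable threshold raise an alert.
    return [m for m in meta_events if confidence(m["ais"]) > threshold]

def attack_latency(ai_timestamps, detection_time):
    """AL = tD - min(t1, ..., tN): an upper bound on the time the
    ongoing attack went undetected."""
    return detection_time - min(ai_timestamps)
```

Raising the threshold trades detection rate for fewer false positives, which is the tuning knob the filtering phase exposes.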
4 Case Study and Experimental Results
In this section we present preliminary results attained by applying the proposed approach in a laboratory experimental testbed. The testbed includes a MySQL database and an Apache web server running an open-source Content Management System (CMS) written in PHP, namely Joomla (v1.5). We carried out experimental tests in order to assess the capability of the proposed system to detect and diagnose two common attacks, namely SQL injection (SQLi) and Cross Site Scripting (XSS). Attacks were modeled and injected into the normal traffic profile. The following probes have been deployed throughout the networked system: (i) Apache Scalp [11], a host-level signature-based analyzer of Apache web server access logs, which uses a set of rules to spot malicious requests; (ii) ACDM (Anomalous Character Distribution Monitor) [6], a host-level anomaly-based probe which analyzes character distribution in HTTP requests and gives a score to every anomalous request; (iii) AQFM (Anomalous Query Failures Monitor), a database-level probe that monitors the rate of failed queries in a SQL database.

4.1 SQL Injection: Assumptions and Experimental Setup
Assuming that the system administrators apply security patches to the web application components in order to prevent attacks exploiting well-known vulnerabilities, any attacker will have to proceed by trial and error in order to find a
way to exploit new and unknown vulnerabilities. For this reason the SQLi attack was modeled as composed of a set of unsuccessful attempts possibly followed by a successful exploitation of the vulnerability. The preliminary attempts performed by the attacker will leave a trace in the web server access log. These requests will look increasingly similar to the successful one as the attacker learns more about the internal mechanisms of the application, and increasingly different from the requests in the "normal" traffic. Moreover, these blind attempts will at first result in the injection of syntactically wrong queries, leaving further traces of the ongoing attack, since normal traffic usually does not generate failing queries. Given this attack model, both the application server and the database are monitored in order to detect SQLi attacks. The application server is monitored by using two complementary probes: Scalp, which uses a signature-based approach, and ACDM, which uses an anomaly-based approach. The database is monitored by using the AQFM probe. From an ontology point of view, the SQLi Attack isEvaluatedBy three AIs: (i) SQLi attack detection performed by Scalp; (ii) Anomalous Character Distribution detected by ACDM; (iii) Anomalous Query Failures detected by AQFM. The proposed Intrusion Detection and Diagnosis system collects such AIs and merges them into a separate event stream. Afterward, the AIs are aggregated in order to group the traces of the same ongoing attack. Since both Scalp and ACDM analyze the Apache access logs, events raised by the two probes are aggregated on the basis of the access log entry they refer to. Subsequent attack attempts are correlated by using time proximity and source matching patterns. Once aggregation has been performed, a set of meta-events is produced. Each meta-event contains a global, multi-level evidence of an attack and can be considered a preliminary attack diagnosis.
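The time-proximity and source-matching aggregation just described can be sketched as follows; the event field names and the window length are assumptions for illustration:

```python
def aggregate(ais, window=60.0):
    """Group attack indicators that share the same source and fall
    within `window` seconds of the previous indicator in the group
    (temporal proximity + source matching)."""
    groups = []
    for ai in sorted(ais, key=lambda a: a["time"]):
        for g in groups:
            if (g[-1]["source"] == ai["source"]
                    and ai["time"] - g[-1]["time"] <= window):
                g.append(ai)  # continuation of the same attack trace
                break
        else:
            groups.append([ai])  # start a new candidate meta-event
    return groups
```

Each resulting group corresponds to one meta-event: a sequence of attempts from one source, close enough in time to plausibly belong to the same ongoing attack.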
The diagnostic process provides further information which is deduced by looking at the aggregated AIs. For example, the starting and ending times of the attack are obtained by evaluating the lowest and the highest timestamp values.

4.2 Cross Site Scripting: Assumptions and Experimental Setup
Strategies for the detection of Cross Site Scripting (XSS) attacks and SQLi attacks are very similar, since both rely on the analysis of the HTTP requests performed by the attacker in order to find attack traces. Cross Site Scripting is modeled as a sequence of attempts, which reflects the fact that initially the attacker has no information about how to inject the malicious code. The difference between SQLi attacks and XSS attacks is that no query failure is detected by monitoring the database, as long as the XSS code injection is not exploited to trigger the injection of SQL commands. This is because the attacker usually injects client-side code (like JavaScript) into web pages accessed by other users, so as to steal sensitive information (i.e., cookies or session tokens) from their browsers without modifying the server-side code of the application. This kind of Attack isEvaluatedBy the Scalp XSS and ACDM probes. Thanks to the use of the ontology, attack type, target and latency are correctly diagnosed.
4.3 SQLi and XSS: Experimental Results
In this section, we discuss the results of the experimental campaign with respect to SQLi and XSS attacks. Fig. 4a shows the performance of the detection process when a single AI is used. With respect to detection, the experiments show that:

– Scalp performs well when detecting SQLi attacks (72%), while its detection rate for XSS attacks is lower (63%);
– ACDM provides a very high detection rate (94%), at the cost of a rather high false positive rate (36% of the normal traffic is erroneously perceived as malicious);
– AQFM never fails to report database-level traces of SQLi attacks (100%), but normal traffic and XSS attacks may generate failing queries, too.

As to diagnosis, it is worth noting that:

– Due to their intrinsic nature, the anomaly-based probes, namely ACDM and AQFM, are only capable of raising alarms upon detection of anomalous requests; they are neither able to provide any information about the specific threat which originated the alarm, nor able to provide further diagnostic information, e.g., detection latency or the parts of the system under attack;
– Since SQLi and XSS attacks share similar patterns, they can easily be confused. Even Scalp, although a signature-based probe, marks many SQLi requests as XSS (44% of the SQLi attacks, with 20% mapped to both kinds of attack).

Fig. 4b shows the performance of the detection process when complex correlation rules are applied to symptoms detected by probes which monitor different architectural levels and use diverse detection approaches. Results show that the detection rate increases significantly, and the accuracy of the diagnosis is improved. In particular, the wrong marking performed by Scalp is eliminated. Furthermore, the false positives produced by the ACDM probes are drastically
Fig. 4. (a) Detection performance using a single AI; (b) detection performance correlating multiple AIs
reduced. Basically, even though Scalp does not detect all the malicious requests and, even worse, sometimes gives wrong hints, the system almost always detects the SQLi attacks by adding to the Scalp detections those obtained via correlation with symptoms detected by ACDM and AQFM. In this way, the rate of correctly diagnosed SQLi attacks rises from the 73% achieved by Scalp to the 91% of our ID2S. When application-level symptoms are aggregated with ACDM and AQFM symptoms, the successful diagnosis of XSS attacks rises from the 63% achieved by Scalp to the 71% of the correlation-based approach. The cases in which our ID2S fails can be associated to two main scenarios: (i) wrong diagnoses are performed when query failures are generated by an XSS attack, which leads to an incorrect SQLi diagnosis (4%); (ii) false positives are raised by normal traffic that triggers a false detection by the ACDM (36%). When such a detection is incorrectly aggregated with query failures generated by an overlapping SQLi attack, it produces a false SQLi diagnosis (5%); otherwise it can produce a false XSS diagnosis (11%). It should be emphasized that the false positives generated by ACDM are halved (16%) by means of the correlation with other symptoms.
5
Conclusions and Future Work
In this work, we have discussed the limitations of current Intrusion Detection System (IDS) technology and proposed a novel approach, which we call Intrusion Detection & Diagnosis System (ID²S) technology, to overcome such limitations. The basic idea is to collect information at several architectural levels (namely: network, operating system, database, and application), using multiple security probes deployed as a distributed architecture, and to use Complex Event Processing (CEP) technology to perform sophisticated correlation analysis of intrusion symptoms. The escalation process from intrusion symptoms to the adjudged cause of the intrusion, and to an assessment of the damage in individual system components, is driven by ontologies. We have conducted preliminary experiments on a testbed consisting of web servers running a well-known open-source Content Management System (CMS) written in PHP, namely Joomla. The experimental tests have demonstrated that our approach is effective against an emerging class of new attacks, referred to in the literature [1] as “stealthy”. These attacks represent a major threat, since not only do they have a dramatic impact in terms of economic losses [2], but (i) they are invisible to current state-of-the-art IDSs, and (ii) current Intrusion Prevention Systems are ineffective against them. Experiments conducted so far indicate that the proposed approach has three important advantages:
– it improves the performance of the detection process;
– it provides diagnostic features;
– it reduces the false positive rate.
Future work will follow two main directions. The first objective will be to conduct a thorough experimental analysis, in order to collect more evidence of the
L. Coppolino et al.
effectiveness of the approach. The second objective will be the implementation of more sophisticated correlation approaches and the development of more detailed ontologies, so as to allow finer-grained estimation of the consequences of attacks on individual system components.
Acknowledgements The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under Grant Agreement no. 216585 (INTERSECTION Project) and Grant Agreement no. 225553 (INSPIRE Project).
References
1. Jakobsson, M., XiaoFeng, W., Wetzel, S.: Stealth attacks in vehicular technologies. In: Proc. of the IEEE Vehicular Technology Conference, September 26–29, vol. 2, pp. 1218–1222 (2004)
2. IDC: Worldwide Threat Management Security Appliances 2007–2011 Forecast and 2006 Vendor Shares: Still Stacking the Racks, Doc # 209303 (November 2007)
3. Repp, N., Berbner, R., Heckmann, O., Steinmetz, R.: A Cross-Layer Approach to Performance Monitoring of Web Services. In: Proc. of the Workshop on Emerging Web Services Technology, CEUR-WS (December 2006)
4. Yu-Sung, W., Bagchi, S., Garg, S., Singh, N.: SCIDIVE: a stateful and cross protocol intrusion detection architecture for voice-over-IP environments. In: Proc. of the Dependable Systems and Networks Conference, June 28, pp. 433–442 (2004)
5. Vigna, G., Robertson, W., Vishal, K., Kemmerer, R.A.: A stateful intrusion detection system for World-Wide Web servers. In: Proc. of the 19th Annual Computer Security Applications Conference, December 8–12, pp. 34–43 (2003)
6. Kruegel, C., Vigna, G.: Anomaly detection of web based attacks. In: Proc. of the 10th ACM Conference on Computer and Communication Security (CCS 2003), pp. 251–261. ACM Press, New York (2003)
7. Majorczyk, F., Totel, E., Mé, L., Saïdane, A.: Anomaly Detection with Diagnosis in Diversified Systems using Information Flow Graphs. In: Proc. of the IFIP TC 11 23rd International Information Security Conference, July 17, pp. 301–315 (2008)
8. Campanile, F., Cilardo, A., Coppolino, L., Romano, L.: Adaptable Parsing of Real-Time Data Streams. In: 15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP 2007), February 7–9, pp. 412–418 (2007)
9. Fisher, K., Gruber, R.: PADS: a domain-specific language for processing ad hoc data. In: Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (2005)
10. The Borealis project, http://www.cs.brown.edu/research/borealis/public/
11. apache-scalp, Apache log analyzer for security, http://code.google.com/p/apache-scalp/
Model-Based Testing of GUI-Driven Applications
Vivien Chinnapongse¹, Insup Lee¹, Oleg Sokolsky¹, Shaohui Wang¹, and Paul L. Jones²
¹ University of Pennsylvania
² U.S. Food and Drug Administration
{vichi,lee,sokolsky,shaohui}@cis.upenn.edu, [email protected]
Abstract. While thorough testing of reactive systems is essential to ensure device safety, few testing methods center on GUI-driven applications. In this paper we present one approach for the model-based testing of such systems. Using the AHLTA-Mobile case study to demonstrate our approach, we first introduce a high-level method of modeling the expected behavior of GUI-driven applications. We show how to use the NModel tool to generate test cases from this model and present a way to execute these tests within the application, highlighting the challenges of using an API-geared tool in a GUI-based setting. Finally we present the results of our case study.
1
Introduction
Thorough testing of reactive systems is an active research area with a long history. Reactive systems are primarily event-driven systems that operate by continuously interacting with their environment, responding to received signals. Operation of reactive systems is often safety- and life-critical. Rigorous development and analysis techniques are required to ensure safe and correct operation of such systems. In many safety-critical domains, for example in avionics and medical device areas, government regulators certify or approve systems before they can be used. In particular, the U.S. Food and Drug Administration (FDA) approves medical devices for use in the United States. An important class of reactive systems comprises systems interacting with a human user. Such systems offer a user interface, through which the user can send signals to the system and observe its responses. The user typically learns to interact with the system by reading the user manual or through targeted training sessions. In either case, the user forms a mental model of the system
This research has been supported in part by the FDA/TATRC grant MIPR6MRXMM6093 and NSF grants CNS-0509327 and CNS-0720703.
S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 203–214, 2009. c IFIP International Federation for Information Processing 2009
in his/her head. This model is then used as a specification, against which operation of the system is assessed. In this paper, we are interested in establishing conformance between the system operation and user expectations. Conformance between the mental model and observable behavior of the system is important from different perspectives. From the development perspective, it will help avoid usability problems in the system. From the regulatory perspective, it may help to evaluate the necessary user training and instruction materials that accompany the device. We concentrate on GUI-driven handheld devices as a particular case of user-centric reactive systems. In the long-standing collaboration between experts at the FDA and the high-confidence systems design group at Penn (e.g., [1,3]), we have considered several medical devices that fall in this category. This paper has been motivated by a recent case study, in which we analyze a point-of-injury data entry device application called AHLTA-Mobile [2]. The Armed Forces Health Longitudinal Tracking Application–Mobile (AHLTA-Mobile) is a point-of-care handheld medical assistant developed by the Telemedicine and Advanced Technology Research Center (TATRC), approved for use by the FDA and deployed in the U.S. Army. AHLTA-Mobile is a C# application on the Microsoft® Windows Mobile™ platform. It assists medical personnel, whether deployed, on military bases, or at military medical centers, with diagnosis and treatment of patients. Medical personnel also use the solution to record patient clinical encounters and transmit those records to a central data repository. AHLTA-Mobile provides users access to service members’ complete medical records and offers advice for diagnosis and treatment. It contains a set of question-and-answer examinations that evaluate common battlefield injuries such as concussions.
For the safety of patients it is important that the device always functions correctly, because misdiagnosis and incorrect treatment can cause serious harm. For the purposes of this paper we are concerned with the correctness of a subset of AHLTA-Mobile’s behavior, the Military Acute Concussion Evaluation (MACE) module. MACE is a series of eight GUI screens, displaying forms to be completed by the user. Seven of these screens, to which we refer as MACE 1 through MACE 7, are used to enter the results of the user examination, while the last screen, MACE Results, is used to enter a diagnosis and offers the possibility to save the results by entering them in a database. Relevant screens including Start Screen, Resume Screen, No Unit, and Error are also considered. The screens are navigated by invoking the Next Screen button on each screen or the Previous menu item in the Tools menu. In response to users invoking an action, the system moves to a different screen or updates information on the current screen. Note that the user can enter data into the appropriate fields on the screen, but cannot modify user interface actions. This observation led us to represent the mental model of the device as a state machine, in which states are identified with GUI screens and transitions represent changing screens in response to invoking UI elements. Each transition in such a state machine is labeled with the UI element
¹ Throughout this paper we use a sans font for the names of GUI items. We use a fixed-width font to identify model and source code elements.
that effects the change. In our case study, we constructed the model manually through a careful reading of the AHLTA-Mobile user manual [16]. We discuss the model in more detail in Section 3.2. Of the 114,000 lines of C# code that comprise the AHLTA-Mobile application, MACE screen classes and auxiliary classes contain approximately 6,000 lines of code. Given the state-machine model of the system, we can pursue two approaches to ascertain compliance of the system with its model. One is model-based testing, where the model is used to generate a test suite, which is then applied to the system implementation. Several tools are available for model-based testing of software. In our case study, we used NModel [7] from Microsoft Research, one of the few tools that target C# applications. The other alternative is to extract a state-machine model from the application source code and compare it directly to the mental model using a suitable notion of state machine equivalence or preorder. Although it makes more thorough testing possible, this alternative is much more challenging, and is a subject of our ongoing work. The contributions of this paper are threefold. First, we present an approach to capture behavioral models of GUI-driven handheld devices. We believe that the high-level modeling approach we have applied to represent the mental model of the AHLTA-Mobile device will be equally applicable to most devices in this category. Second, we present lessons learned in applying model-based testing to the AHLTA-Mobile case study. We discuss the challenges we faced while applying the NModel methodology in the GUI-based setting, and the ways in which we overcame these challenges. Finally, we present the results of the case study, which uncovered inconsistencies between the device behavior and the desired behavior described in the manual. The paper is organized as follows: Section 2 describes the NModel framework to be used in analyzing AHLTA-Mobile.
Section 3 discusses the development of a mental model, both as an extended finite state machine (EFSM) and as an NModel model program. Section 4 explains the creation of a test harness to link an implementation with test cases. The testing of the AHLTA-Mobile application is described in Section 5. Section 6 discusses related research work. We conclude our paper with a discussion of our contributions in Section 7.
2
Using NModel to Analyze MACE
Developed at Microsoft Research, the NModel [6,7] framework is a model-based software testing and analysis tool for C# programs. NModel allows us to create a formal model of an implementation’s expected behavior and determine through model-based testing whether or not the implementation’s actual behavior and the model are consistent. The open-source tool is freely available online, and there is a good level of support and documentation. No other tool we discovered matched this description, and we decided NModel would suit our purposes reasonably well.
The NModel framework consists of the following components:
– a library for creating model programs, executable specifications for implementations,
– a model program viewer (mpv) for viewing model programs as finite-state machines (FSMs),
– an offline test generator (otg), which performs link coverage of model programs to produce test cases, and
– a conformance tester (ct), which takes test cases and executes them within the implementation.² This must be coupled to the implementation with a test harness, called a stepper.
A diagram of the steps involved in testing implementations with NModel is provided in Figure 1:
1. First, we take the specifications and/or the user manual and write a model program using the NModel library. We can use this model program to generate a graphical FSM using mpv for a visual representation.
2. Then, we use otg to generate a test suite from the model program.
3. To test the implementation with the test suite, we first write a stepper to couple the test cases described in the model program with the implementation.
4. Finally, we run ct with the test suite and the implementation coupled with the stepper to check for consistency between the implementation and the model. The output of ct is Success if the implementation is correct and Failure otherwise.
implementation
manual generation
model program
model program viewer (mpv)
graphical FSM
offline test generator (otg)
test suite
manual generation
stepper
conformance tester (ct)
output
Fig. 1. Testing implementations with NModel 2
² ct can also generate test cases on the fly from a model program during test execution, but this is not necessary and is therefore not discussed in this paper.
3
Creating the Mental Model
The first step of our process was to produce a mental model of MACE from the AHLTA-Mobile user manual. The challenge in creating the model was to find an adequate modeling approach that captures the user perception of the application. Taken in its full complexity, the problem of user perception goes well beyond the scope of the case study. However, after showing AHLTA-Mobile to several potential users, we concluded that the application can be modeled as an extended finite state machine (EFSM), which has long been used in model-based testing [5]. In the following, we give a brief definition of EFSM, followed by the description of our modeling approach and a discussion of the implementation of a given EFSM as a model program in NModel. 3.1
Extended Finite State Machines
Preliminaries. For a finite set of variables X = {x1, ..., xn}, each ranging over the space of values O, a valuation is a function v : X → O that assigns to each variable x its current value. The set of valuations of X is denoted V(X). A predicate P over X is a boolean-valued function P : V(X) → {true, false}. A valuation transformer T is a function T : V(X) → V(X). An EFSM M is a tuple ⟨Q, Σ, X, E, q0, v0⟩, where Q is a set of states with the designated initial state q0, Σ is a finite alphabet, X is a set of variables with the initial valuation v0, and E is a transition relation. A transition t ∈ E is a tuple ⟨q1, g, a, u, q2⟩, where q1, q2 are the source and destination states of the transition, respectively. The symbol a ∈ Σ is the event that triggers the transition. The guard g is a predicate over the variables of M that states when the transition is allowed to be taken. Finally, the update u is a valuation transformer that reflects changes to variables when the transition occurs. For the purpose of this paper, we represent each update as a sequence of assignments xi = fi(X). A run of M is an alternating sequence (q0, v0) a1 (q1, v1) a2 ... such that, for each i, M has a transition ⟨qi−1, gi, ai, ui, qi⟩ such that gi(vi−1) = true and vi = ui(vi−1). That is, in every step of the execution, a transition of M is taken such that its guard is satisfied by the variable values in the source state, and the valuation after the transition is taken is updated according to the update specified by the transition. The update occurs by performing the assignments in their syntactic order in ui. 3.2
EFSM Model for AHLTA-Mobile
The AHLTA-Mobile user manual uses two ways to convey the expected behavior of the application to the user: first, it offers pictures of each GUI screen, and second, it describes the actions that may be performed when a given screen is displayed. With the exception of editing actions, the outcome of performing an action is the new screen being displayed. We found it natural from the documentation to formulate the mental model as an EFSM that encompasses the observable behavior of the system, identifying screens with states and actions with transitions between the screens.
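To make the run semantics of Section 3.1 concrete, an EFSM can be sketched as a small interpreter over guarded transitions. The sketch below is purely illustrative (Python rather than the paper's C#/NModel setting); the states and variables form a toy fragment inspired by the MACE screens, not the actual model:

```python
# Illustrative EFSM interpreter: transitions are (source, guard, action,
# update, target) tuples, following the definition in Section 3.1.

def run(transitions, q0, v0, actions):
    """Replay a sequence of actions from (q0, v0); return the final
    (state, valuation), or raise if no enabled transition exists."""
    state, v = q0, dict(v0)
    for a in actions:
        for (q1, guard, sym, update, q2) in transitions:
            if q1 == state and sym == a and guard(v):
                v = update(v)      # apply the update u to the valuation
                state = q2
                break
        else:
            raise ValueError(f"action {a!r} not enabled in state {state!r}")
    return state, v

# Toy fragment in the spirit of MACE: Next is only enabled once the
# first screen's required fields have been edited.
transitions = [
    ("MACE 1", lambda v: True, "Edit",
     lambda v: {**v, "Edited1": True}, "MACE 1"),
    ("MACE 1", lambda v: v["Edited1"], "Next",
     lambda v: v, "MACE 2"),
]
print(run(transitions, "MACE 1", {"Edited1": False}, ["Edit", "Next"]))
```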
For this case study we focus on a subset of MACE’s behavior, capturing the actions Resume and Suspend. The resulting EFSM is MAM = ⟨QAM, ΣAM, XAM, EAM, StartScreen, v0⟩, where
– QAM = {StartScreen, MACE1, . . . , MACE7, MACEResults, ExamIndex, ResumeScreen, NoUnit},
– ΣAM = {Edit, Next, Suspend, Resume, Select, Start, MACE},
– XAM = {Edited1, . . . , Edited7, EditedResults, Suspended, Selected, UnitInfo}.
EAM is visually represented in Figure 2, somewhat simplified for readability, and v0 is discussed below. In the EFSM model, states represent the following subset of AHLTA-Mobile screens relevant to MACE. MACE is comprised of eight screen states, MACE 1 through MACE 7 and MACE Results. Each screen is a form that has to be completed before the next may be displayed. The Start Screen is the initial screen where the application begins after the user has logged in and a patient has been selected. The Exam Index is a menu from which the user can navigate to MACE, and the Resume Screen is a menu from which the user may resume a suspended exam. The alphabet of this EFSM consists of the following actions available within MACE:
– Edit completes the required fields in the current screen.
– Next clicks Next to navigate to the next screen.
– Suspend clicks Suspend to suspend the evaluation.
– Resume clicks Resume to resume the evaluation.
– Select selects the appropriate exam to resume.
– Start clicks Exam Index on the initial screen.
– MACE clicks MACE within the Exam Index.
Each action would label a transition in the EFSM representation of the mental model of MACE. Note that the Edit action is the only one that does not correspond to the invocation of a particular user interface element. In our approach, we do not model the contents of MACE forms. Instead, we capture only the fact
that some editing has to be performed before the user can move to the next screen. The variables of the EFSM model have been introduced to capture the conditional execution of user actions as specified in the user manual. For example, the action Resume may only be executed from the Start Screen state if the value of the boolean variable Suspended is true, indicating that the exam has been previously suspended. Other variables include UnitInfo, which indicates whether unit information has been previously specified for the patient; Selected, which indicates whether an exam has been selected in the Resume Screen state to resume; and Edited1 . . . EditedResults, which indicate whether the required fields in the MACE screens MACE 1 . . . MACE Results have been completed. The initial values of Suspended, Selected, and Edited1 . . . EditedResults are all false. We have separately considered initial valuations with UnitInfo either true or false. The reason for this is that the patient’s unit information is set in another part of the AHLTA-Mobile system, which has been excluded from the case study. 3.3
Model Program Representation of the Mental Model
After creating a formal representation of the mental model we needed to translate it into a model program that could be used by NModel to test the AHLTA-Mobile application. Model programs, executable specifications written in C# using the NModel library, are action oriented. They define which actions in an application may be taken in what circumstances. A model program contains a set of variables that captures the state of the model program, and a collection of methods that represent actions. For each action a, the model program contains two methods: a(), which represents the action itself, and aEnabled(), which, based on the current state of the model program, determines whether a is enabled. The body of a() is a collection of cases that update the variables of the model program when the action method is invoked. Given an EFSM M = ⟨Q, Σ, X, E, q0, v0⟩ that defines states as screens, we mechanically translate it into a model program as follows:
1. Declare the Model class, which references the System and NModel libraries.
2. Within Model, create all variables in X and initialize them according to v0. Add the string variable current that stores the label of the current state of M, initially q0.
3. For each a in Σ:
(a) Create an action skeleton:
[Action("a")]
static public void a() {}
static bool aEnabled() { return false; }
(b) For each t = ⟨q, g, a, u, q′⟩ in E:
i. Add the following lines to a():
if (current.Equals("q") && g) { current = "q'"; update(u); }
where update(u) updates all variables according to u. Given our sequential interpretation of u in the definition of EFSM, the assignments of u can be syntactically transcribed into C# statements.
ii. Add the following line to the beginning of aEnabled():
if (current.Equals("q") && g) return true;
The described translation is mechanical and can be easily made automatic. However, in the case study, we followed the described procedure manually. As an example of this translation, consider the action MACE in Figure 3. The action may be taken when the Exam Index screen is displayed and may lead to No Unit or MACE 1 depending on the value of UnitInfo. This fragment of the EFSM yields the model program shown below it.
(Fig. 3, top: EFSM fragment in which the MACE action leads from Exam Index to MACE 1 when UnitInfo holds, and to No Unit when !UnitInfo.)
[Action("MACE")]
static public void MACE() {
    if (unitinfo)
        current = "MACE 1";
    else
        current = "No Unit";
}
static bool MACEEnabled() {
    return current.Equals("Exam Index");
}
Fig. 3. Representing the MACE() action in a model program
After writing the model program representation of the MACE mental model, we used mpv to produce an FSM. We then used otg to automatically generate a test suite for MACE.
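The idea behind offline generation with link (transition) coverage can be sketched independently of NModel: compute a shortest action path to each reachable state, then extend that path over every outgoing edge, so that the resulting sequences jointly traverse all transitions. The sketch below is a simplified Python illustration of this idea, not NModel's actual otg algorithm, and the small FSM is a toy fragment inspired by the MACE screens:

```python
# Illustrative offline test generation with transition (link) coverage:
# one test per edge, each prefixed by a shortest path to the edge's source.
from collections import deque

def cover_transitions(fsm, q0):
    """fsm: dict mapping state -> list of (action, next_state).
    Returns action sequences from q0 that jointly cover every edge."""
    prefix = {q0: []}              # shortest action path to each state (BFS)
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        for action, q2 in fsm.get(q, []):
            if q2 not in prefix:
                prefix[q2] = prefix[q] + [action]
                queue.append(q2)
    # One test case per edge: reach the source state, then take the edge.
    return [prefix[q] + [action]
            for q in prefix for action, _ in fsm.get(q, [])]

fsm = {
    "Start Screen": [("Start", "Exam Index")],
    "Exam Index": [("MACE", "MACE 1")],
    "MACE 1": [("Edit", "MACE 1"), ("Next", "MACE 2")],
}
for test in cover_transitions(fsm, "Start Screen"):
    print(test)
```

A production generator would additionally merge overlapping tests into fewer, longer runs and account for guards over the EFSM variables; the sketch covers only the state graph.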
4
Writing a Test Harness
NModel requires the use of a stepper, a test harness that invokes an instance of the implementation to be tested and causes the appropriate actions to be executed when invoked by ct. For simple applications, like the samples provided on the NModel website [6], when ct requests that an action be executed, the stepper simply calls a corresponding method that exists within the implementation. In AHLTA-Mobile this was not possible: our actions did not directly correspond to single methods provided in the application but instead
to multiple methods triggered by user input events, like keystrokes and mouse clicks. Attempting to associate input actions with existing methods, like callbacks for buttons, was problematic for a few reasons. Since callback methods are normally private, code needed to be modified in many places for them to be used; an inelegant solution that presented many possibilities for errors to be introduced. We also needed to know which instance of any object we were manipulating, requiring further additions to the source code. This method also required detailed knowledge about how the implementation worked, which was both tedious and, as we found, unnecessary. Instead of using callbacks and related methods in order to simulate actions, we inserted actual keystroke and mouse click events into the application’s message loop. We did this in AHLTA-Mobile by retrieving object handles from the C# message loop within the application and sending our user input events directly to the appropriate handles via the message loop. This method allowed us to add code in only one part of the application, making it simpler to work with and reducing the opportunities to introduce errors into the application.
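This design choice, driving the GUI by injecting synthetic input events into the message loop rather than calling private callbacks, can be illustrated with a toy message loop. The sketch below is a deliberately simplified Python stand-in (the actual stepper posts keystroke and mouse-click events to Windows Mobile control handles through the C# message loop); it shows why the harness needs only one hook, the posting interface the application already exposes:

```python
# Toy message loop illustrating event injection: the harness enqueues
# synthetic events exactly as the windowing system would, so no private
# callback has to be exposed or modified.
from collections import deque

class ToyApp:
    def __init__(self):
        self.queue = deque()
        self.screen = "MACE 1"
        # Handlers stay private to the app; the harness never calls them.
        self._handlers = {("click", "Next"): self._on_next}

    def _on_next(self):
        self.screen = "MACE 2"

    def post(self, event):
        """The only hook the test harness needs."""
        self.queue.append(event)

    def pump(self):
        """Dispatch loop, as the application itself would run it."""
        while self.queue:
            handler = self._handlers.get(self.queue.popleft())
            if handler:
                handler()

app = ToyApp()
app.post(("click", "Next"))   # harness simulates a mouse click on Next
app.pump()
print(app.screen)             # MACE 2
```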
5
Testing AHLTA-Mobile
Once a stepper is written, we can run ct with the coupled implementation and stepper and the test suite generated by otg as arguments, as shown previously in Figure 1. Running the conformance tester quickly revealed an error: suspending an exam does not lead to the Start Screen as expected but instead to the Exam Index. This resulted in a timeout, a function included in NModel in case an implementation does not behave as expected and must be terminated. Since the Start Screen, and thus the Resume button, never appeared, it was never clicked and no exam could be selected, causing the application to stall. The test trace that caused the error is given below.
TestResult(0, Verdict("Failure"), "Action timed out",
  Trace(
    Test(0),
    Start(),
    MACE(),
    Suspend(),
    Resume(),
    Select()
  )
)
6
Related Work
The use of state machines for specifying user interfaces has been explored as early as the mid-1980s in [17]. At that time, however, state machines were applied to textual user interfaces, which are much simpler to model and analyze (for example, they do not involve callbacks). With the advent of flexible, dynamically modifiable GUI systems, research in the human-computer interface (HCI) area has focused primarily on dynamic aspects of GUI-based systems, where state
machines appear to be less useful. However, in the domain of GUI-driven handheld devices considered in our case study, EFSMs are quite appropriate and yield high-level and accurate models of user expectations of the system. Model-based testing of GUI programs is also explored in [9], where the authors use randomized online testing instead of providing offline tests that achieve transition coverage of the model. That paper presents a very different modeling approach based on labeled transition systems with concurrency. The approach involves two levels of models. A high-level model describes the various user-level actions that may be performed. A user-level action may require several GUI operations, such as popping up a menu and then selecting an item in the menu. A low-level model then describes how these actions are accomplished. We believe that the approach of [9] is targeted towards systems with dynamically created and manipulated GUI screens. In our case, their multi-level approach would be overkill. Several other research works focus on different aspects of model-based testing. [13] mentions the use of model-based test case generation for fault detection, and employs hierarchical predicate transition Petri nets as a formalism. [18] discusses and compares several testing methodologies for open source software using model-based testing. In [12], the authors present extensions to the Spec Explorer tool to automate testing based on Spec# specifications. A GUI mapping tool allows the tester to associate actions with physical objects that appear on the GUI display. The tool generates C# code with methods that have the same signature as those specified, and actions are performed externally according to tests generated by Spec Explorer. [14] specializes the task modelling notation to ConcurTaskTrees.
7
Conclusions and Discussion
In this paper we presented an approach for the behavioral modeling of GUI-driven handheld devices. We illustrated how the NModel methodology can be applied for the model-based testing of this class of devices. We discussed the challenges we faced in applying this approach and our ways of overcoming them. Finally, we presented the results of our case study of the AHLTA-Mobile application, demonstrating an inconsistency between the observed behavior and the behavior described in the user manual. We believe that our approach is applicable to most GUI-driven handheld devices, offering a viable method of establishing conformance between system operation and user expectations for these types of reactive systems. While this work is an encouraging step forward, it is still far from the comprehensive methodology needed for the analysis of user-centric GUI-driven devices that we envision. Several aspects of such a methodology remain open problems, as discussed below. Realistic mental models. In this paper, we constructed the mental model based on the contents of a user manual. Clearly, perception of the appropriate use for a device by a user is formed through other factors as well and can be quite different
from the literal representation of the user manual in formal notation [11,10]. Empirically constructed mental models, capturing probabilistic information about observed user behaviors, are used in the testing literature under the name of usage models or usage profiles [4]. A predictive way to construct such mental models is needed, especially for new kinds of devices. A practical mental modeling methodology should build on both cognitive science and computer science. Detecting and managing underspecification. A big part of the challenge in constructing mental models is that natural-language documents describing a system are never complete, and users interpret them by making assumptions based on their knowledge and prior experience with similar systems. The problem here is that these assumptions are so natural for the reader that it is often hard to detect that an implicit assumption has been made. Representation of alternatives would require us to apply a different modeling and testing approach. A possible way to capture alternatives is to use nondeterministic EFSMs, where different transitions labeled by the same symbol would correspond to different alternatives. During testing, as long as the implementation offers a behavior corresponding to one alternative, a test should succeed. A slight complication here is the need to ensure consistency: if an alternative has been resolved in some way during a test execution, then later in the same execution it has to be resolved the same way. We also will have to rely on a different tool to generate and execute tests, since NModel operates on input-deterministic model programs. Soft vs. hard inputs. Many approaches to model-based testing require that the model does not restrict the tester from performing an action. Technically, this corresponds to the notions of weak input enabledness [15] or input completeness [8].
In our case, this requirement should be relaxed, because the system interacts with its environment via what we call soft inputs, such as GUI buttons on the screen. Such a button may not be present on some screens, and in that case the tester should actually be prevented from invoking that input.
References
1. Alur, R., Arney, D., Gunter, E.L., Lee, I., Lee, J., Nam, W., Pearce, F., Van Albert, S., Zhou, J.: Formal specifications and analysis of the computer-assisted resuscitation algorithm (CARA) infusion pump control system. Software Tools for Technology Transfer 5(4), 308–319 (2004)
2. AHLTA-Mobile fact sheet. Medical Communications for Combat Casualty Care Web Site, https://www.mc4.army.mil/AHLTA-Mobile.asp
3. Arney, D., Jetley, R., Jones, P., Lee, I., Sokolsky, O.: Formal methods based development of a PCA infusion pump reference model: the generic infusion pump (GIP) project. In: Joint Workshop on High-Confidence Medical Devices, Software and Systems and Medical Device Plug-and-Play Interoperability, July 2007, pp. 23–33 (2007)
4. Brooks, P., Memon, A.M.: Automated GUI testing guided by usage profiles. In: Proceedings of the 22nd IEEE International Conference on Automated Software Engineering (ASE 2007) (November 2007)
214
V. Chinnapongse et al.
5. Cheng, K.T., Krishnakumar, A.S.: Automatic functional test generation using the extended finite state machine model. In: Proceedings of the 30th International Conference on Design Automation (DAC 1993), June 1993, pp. 86–91 (1993)
6. Microsoft Corporation: NModel website (2009), http://www.codeplex.com/NModel
7. Jacky, J., Veanes, M., Campbell, C., Schulte, W.: Model-based Software Testing and Analysis with C#. Cambridge University Press, Cambridge (2008)
8. Jard, C., Jéron, T.: TGV: theory, principles and algorithms. A tool for the automatic synthesis of conformance test cases for non-deterministic reactive systems. Software Tools for Technology Transfer (2004)
9. Kervinen, A., Maunumaa, M., Pääkkönen, T., Katara, M.: Model-based testing through a GUI. In: Grieskamp, W., Weise, C. (eds.) FATES 2005. LNCS, vol. 3997, pp. 16–31. Springer, Heidelberg (2006)
10. Legrenzi, P., Girotto, V.: Mental models in reasoning and decision making. In: Garnham, A., Oakhill, J. (eds.) Mental models in cognitive science, pp. 95–118 (1996)
11. Lewis, C.: A model of mental model construction. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 1986), pp. 306–313 (1986)
12. Paiva, A., Faria, J., Tillmann, N., Vidal, R.: A model-to-implementation mapping tool for automated model-based GUI testing. In: Lau, K.-K., Banach, R. (eds.) ICFEM 2005. LNCS, vol. 3785, pp. 450–464. Springer, Heidelberg (2005)
13. Reza, H., Endapally, S., Grant, E.S.: A model-based approach for testing GUI using hierarchical predicate transition nets. In: ITNG, pp. 366–370 (2007)
14. Silva, J.L., Campos, J.C., Paiva, A.C.R.: Model-based user interface testing with Spec Explorer and ConcurTaskTrees. Electronic Notes in Theoretical Computer Science 208, 77–93 (2008)
15. Tretmans, J.: Test generation with inputs, outputs and repetitive quiescence. Software - Concepts and Tools 17(3), 103–120 (1996)
16. U.S. Army Medical Research & Materiel Command, Mobile Computing Group, Telemedicine and Advanced Technology Research Center, Fort Detrick, Maryland: AHLTA-Mobile User Manual, v2.2.61
17. Wasserman, A.I.: Extending state transition diagrams for the specification of human-computer interaction. IEEE Transactions on Software Engineering 11(8), 699–713 (1985)
18. Xie, Q., Memon, A.: Model-based testing of community-driven open source GUI applications. In: 22nd International Conference on Software Maintenance (ICSM 2006), pp. 145–154 (2006)
Parallelizing Software-Implemented Error Detection
Ute Schiffel, André Schmitt, Martin Süßkraut, Stefan Weigert, and Christof Fetzer
Technische Universität Dresden, Department of Computer Science, Dresden, Germany
{ute,andre,suesskraut,stefan,christof}@se.inf.tu-dresden.de
http://wwwse.inf.tu-dresden.de
Abstract. Because of economic pressure, more commodity hardware with insufficient error detection is used in critical applications. Moreover, commodity hardware is expected to become less reliable because of the continuously decreasing feature size. Thus, we expect that software-implemented approaches to deal with unreliable hardware will be needed. Arithmetic codes are well suited for this purpose because they can provide very good error detection capabilities independent of the actual failure modes of the underlying hardware. But arithmetic codes generate high slowdowns. This paper first describes our encoding, which uses an expensive AN-code. Second, we show how we harness the power of modern multicore CPUs to parallelize this expensive but flexible and powerful software-implemented fault detection technique. Our measurements show that under continuous probabilistic error injection, AN-encoding reduces the number of runs with incorrect output from 15.9% for the unencoded execution to 0.5% in the encoded case. Our parallelization reduces the observed slowdowns by an order of magnitude.
1 Introduction
Historically, hardware reliability has been increasing with every new generation. In the future, however, it is expected that the decreasing feature size of hardware will lead to less reliable hardware [5]. Moreover, the error rate in logical circuits has overtaken error rates in memory [8]. Thus, the usage of memory protection alone is not sufficient anymore. Historically, critical and especially safety-critical systems have mostly been built using special-purpose hardware with better error detection and masking with the help of redundancy. Some hardware is even radiation hardened to prevent environment-induced execution errors. However, these solutions are expensive and usually an order of magnitude slower than commodity hardware. We expect that in the future there will be even higher economic pressure to use commodity hardware for dependable computing. Therefore, there is a need to cope with the restrictive failure detection capabilities of commodity hardware in
S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 215–226, 2009. © IFIP International Federation for Information Processing 2009
216
U. Schiffel et al.
software. Commodity hardware will not exhibit pure fail-stop behavior but will instead exhibit value failures, which are much more difficult to detect and to mask. We aim at implementing a system which will turn these arbitrary value failures into easier-to-handle crash failures without the need for special hardware. Arithmetic codes (see Sec. 2) facilitate software-implemented hardware error detection. In this paper, we use an AN-code. The main advantage of arithmetic codes is that one can ensure error detection with a given probability, independent of the underlying hardware. Arithmetic codes introduce a very high overhead. Previous approaches [6,12] reduced the overheads of arithmetic codes by not completely encoding applications and, additionally, by using less powerful AN-codes. We, instead, protect every instruction of a program using the same powerful AN-code. Section 3 demonstrates how we reduce the slowdowns of the encoding by parallelizing the encoded execution using the power of modern multicore CPUs. The measurements presented in Sec. 4 show that under continuous probabilistic error injection, AN-encoding reduces the number of runs with incorrect output from 15.9% for the unencoded execution to 0.5% in the encoded case. Note that one can improve the strength of the encoding (by using a larger A, see Section 2) to reduce the percentage of incorrect outputs even further. Of course, a larger A also increases the overheads. Our parallelization reduces the observed slowdowns by an order of magnitude on a 16-core system. We discuss related work in Sec. 5.
2 AN-Encoding
Arithmetic codes are a technique to detect hardware errors during runtime. The encoding adds redundancy to all data words. This results in a larger domain for data words, and only a small subset of the domain contains the valid code words. Arithmetic codes are conserved by correctly executed arithmetic operations: a correctly executed operation, given valid code words as input, outputs a valid code word. A faulty arithmetic operation destroys the code with a very high probability, i.e., results in an invalid code word [2]. In the following, we will briefly summarize our previous work with AN-code, which is published in [14]. We want to give the reader a general idea about the concept and an understanding why the application of AN-code is as computationally expensive as it is. The AN-code is one of the most widely known arithmetic codes. The encoded version x_c of variable x is obtained by multiplying its original functional value x_f with a constant A. This encoding is only done for input values of a program. All computations take multiples of A as inputs and, if executed error-free, produce multiples of A as outputs. Code checking is done by computing the modulus with A, which is zero for a valid code word. Before a variable is externalized, i.e., used as a parameter of an external function or as a memory address in a load or a store operation, it is checked if it is a valid code word. If the check fails, the application is aborted.
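The encode/check cycle above can be sketched as follows. This is an illustrative sketch: A = 65521 is a prime chosen for the example, not the constant used by our encoding compiler, and Python's unbounded integers hide the word-size issues discussed later. The last lines also hint at why a power of two is a poor choice for A:

```python
A = 65521                          # illustrative prime, not the compiler's A

def encode(x):                     # x_c = A * x_f
    return x * A

def decode(xc):                    # check the code word, then recover x_f
    assert xc % A == 0, "invalid code word -- abort"
    return xc // A

# Correctly executed arithmetic preserves the code:
s = encode(20) + encode(22)        # == A * 42, still a valid code word

# A bit flip almost surely destroys the code (value changes by 2**17,
# which is not a multiple of the prime A):
corrupted = encode(42) ^ (1 << 17)

# With a power-of-two A, a flip in the high-order bits goes undetected:
A2 = 1 << 16
flipped = 42 * A2 ^ (1 << 20)      # still a multiple of A2, value now wrong
```

Here `corrupted % A` is nonzero, so the check catches the flip, while `flipped` remains a "valid" code word under A2 although its functional value changed.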
The advantage of an AN-code is that the probability of detecting an operation error does not depend on the used hardware but only on the choice of A. Assuming a failure model with equally distributed bit flips and a constant Hamming distance between all code words, the resulting probability of detecting one error is approximately 1 − 2^(−k), where k is the size of A in bits. Thus, A should be as large as possible. Furthermore, it should not be a power of two because a multiplication by A would only shift the bits to the left and no bit flips in the higher bits would be detected. A should also have as few factors as possible to increase the probability of detecting an error. Large prime numbers are therefore a good choice for A. For encoding a program with an AN-code, every variable has to be replaced with its larger AN-encoded version. Every instruction is substituted by its AN-preserving counterpart. We perform the instrumentation at compilation time:
– The scope of the protection includes compiler errors. For example, errors in lowering the source code to an executable binary will most likely result in invalid codes and hence become detectable.
– We do not introduce further slowdowns because of dynamic instrumentation.
See [21] for a detailed discussion of advantages and disadvantages of encoding at compile time vs. at runtime. We have implemented our encoding compiler using the LLVM compiler framework [10]. We encode LLVM's bitcode, which is a static single assignment assembler-like language. The advantage of LLVM's bitcode, in comparison to any native assembler, is its manageable amount of instructions for which we have to provide encoded versions and the LLVM framework for analyzing and modifying LLVM bitcode. For AN-encoding LLVM bitcode, we solved the following problems:
– We need encoded versions of all operations supported by LLVM.
Therefore, we provide a set of basic hand-encoded arithmetic operations and a set of encodable replacement operations which use only the basic arithmetic operations.
– We have to handle calls to un-encoded external libraries.
– AN-codes are only applicable to integers. Thus, we encode floating point operations by replacing them with encodable software implementations which make use only of integers.
– We have to provide encoded versions of all constants and initialization values. LLVM enables us to find and modify all those initializations. Thus, we replace them with appropriate multiples of A.
– For the encoding of memory content, a specific word size had to be chosen: we chose 32 bit. We require the compiler to align all memory accesses to that word size because only whole encoded words can be read.
Basic Hand-Encoded Arithmetic Operations. Executing arithmetic operations on AN-encoded data mostly requires some corrections to obtain a correctly
Fig. 1. Slowdowns of encoded arithmetic operations compared to their native versions
encoded result. For example, consider the multiplication of two encoded values a_c = A ∗ a_f and b_c = A ∗ b_f. When just multiplying a_c and b_c, the obtained result is A² ∗ a_f ∗ b_f, but the correctly encoded result should be A ∗ a_f ∗ b_f. Thus, an additional division by A is required. Our encoded arithmetic operations are hand-encoded. See [19] for the details of an AN-code with signatures, i.e., where the value of a variable is not only multiplied by a factor A but a unique constant is also added. Fig. 1 shows the slowdowns of our AN-encoded operations compared to their unencoded versions. While AN-encoded additions and subtractions take only two times as long as their native versions, an AN-encoded multiplication takes 126 times longer. Divisions and comparisons are between 3 and 10 times slower. The main reasons for those slowdowns are: (1) the implementation of the overflow behavior of integer operations as defined in the C standard, and (2) that encoded multiplications and divisions require expensive 128-bit integer operations.
Replacement Operations. Since encoding by hand is a tedious and error-prone task, we automated as much of the remaining encoding tasks as possible. Thus, we provide a library of so-called replacement operations. Those contain implementations of the following operations: shifts, casts, bitwise logical operations, and remainder operations. The replacement operations are written in such a way that they can be automatically encoded by our encoding compiler, i.e., they only use arithmetic operations for which hand-encoded versions exist. Before encoding a program, all not directly encodable operations, i.e., all operations which have no encoded variant in our basic set of hand-encoded arithmetic operations, are replaced with their appropriate encodable replacement operation. Fig. 2 depicts the slowdowns generated by the slowest replacement operations compared to their native versions: for the unencoded and for the encoded replacement operations.
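The multiplication correction described above can be sketched as follows. This is a simplified illustration: a real implementation must reproduce C overflow semantics and needs 128-bit intermediates, both of which Python's unbounded integers gloss over, and A = 65521 is again an illustrative prime:

```python
A = 65521                          # illustrative prime

def an_add(ac, bc):
    # A*a_f + A*b_f == A*(a_f + b_f): addition needs no correction
    return ac + bc

def an_mul(ac, bc):
    # (A*a_f) * (A*b_f) == A**2 * a_f * b_f: divide once by A to
    # obtain the correctly encoded product A*(a_f * b_f)
    return (ac * bc) // A
```

For example, `an_mul(6 * A, 7 * A)` yields `42 * A`, which is again a valid code word.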
Especially the bitwise logical operations (not_x, and_x, or_x, and xor_x) generate large slowdowns. Their implementation uses tabulated results for 16-bit (not_x) and 8-bit (the other operations) blocks and expensive shift operations to combine those results. Arithmetic right shifts (ashr_x) are expensive because they require two accesses to tabulated data and an encoded division. Finally, the encoded versions of signed and unsigned remainder operations (srem_x and urem_x) and upcast and downcast operations (sext-x-to-y and trunc-x-to-y) still generate slowdowns between 8 and 64. They also require expensive encoded divisions.
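The table-based idea behind these replacement operations can be sketched as follows: a 32-bit AND is computed from tabulated 8-bit results using only lookups and arithmetic (+, *, //, %), so the routine itself consists of operations the encoding compiler can encode. The flat table layout and block handling are illustrative simplifications, not our actual implementation:

```python
# Tabulated AND results for all pairs of 8-bit blocks (65536 entries,
# indexed as a*256 + b).
AND8 = [a & b for a in range(256) for b in range(256)]

def and32(x, y):
    """Bitwise AND of two 32-bit words using only arithmetic operations
    and table lookups (no shifts, no bitwise operators)."""
    result = 0
    for i in range(4):                     # four 8-bit blocks per 32-bit word
        xb = (x // 256**i) % 256           # extract block i via div/mod
        yb = (y // 256**i) % 256
        result += AND8[xb * 256 + yb] * 256**i   # recombine via mul/add
    return result
```

The many divisions and multiplications in this recombination are exactly what makes the encoded bitwise operations so much slower than their native one-instruction counterparts.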
Fig. 2. Slowdowns of unencoded and AN-encoded versions of replacement operations compared to their native versions
Calls to Unencoded External Code. In contrast to dynamic binary instrumentation, static instrumentation does not allow for protection of external libraries whose source code is not available at compilation time. For calls to those libraries, we currently provide hand-coded wrappers which decode parameters and, after executing the unencoded original, encode the obtained results. Note that AN-encoding leads to unexpected performance characteristics. Some operations whose unencoded versions are very fast (casts, shifts, bitwise logical operations, multiplications and divisions) suddenly induce very large overheads. Depending on the encoded application, AN-encoding results in slowdowns between 7.5 and 238 (see Sec. 4).
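The decode-call-encode pattern of such a wrapper might look as follows. This is a sketch under assumptions: the real wrappers are hand-coded per library call rather than generated generically, and the names below are illustrative:

```python
A = 65521                                  # illustrative prime

def check_decode(xc):
    """Decoding doubles as a code check before the value leaves
    the encoded domain."""
    assert xc % A == 0, "invalid code word -- abort"
    return xc // A

def wrap_external(func):
    """Wrap an unencoded external function: decode (and thereby check)
    all arguments, run the original, re-encode its result."""
    def wrapper(*encoded_args):
        plain_args = [check_decode(a) for a in encoded_args]
        return func(*plain_args) * A       # result re-enters encoded domain
    return wrapper
```

For example, `wrap_external(max)(3 * A, 9 * A)` returns `9 * A`, while passing a corrupted argument trips the code check before the external call.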
3 Parallelizing Encoded Execution
We mitigate the performance overhead of the encoded execution by parallelization. Fig. 3 shows the general approach, which is very similar to [11] with two important exceptions: (1) to reduce the overall overhead and to increase safety, we use static instrumentation instead of dynamic binary instrumentation, and (2) we introduce speculative variables to decouple the execution of the individual epochs. Speculative variables are discussed in more detail in Sec. 3.2.
Fig. 3. (a) Original application execution without encoding and parallelization. (b) Sequential execution with encoding. (c) Parallelized encoded execution. Unencoded predictor for fast state prediction. Parallelized executors execute slow encoded variant.
An encoded execution (see Fig. 3 (b)) introduces, in general, a substantial runtime overhead compared to the unencoded execution (Fig. 3 (a)). To parallelize the encoded execution (Fig. 3 (c)), we execute the original unencoded application within the predictor process. The predictor's execution is partitioned into epochs. An executor process reexecutes a given epoch using its encoded version. Executors run on additional CPU cores in parallel to each other and to the predictor. They synchronize their state using speculative variables to make the approach scalable. The predictor runs up to two orders of magnitude faster than the executors. Hence, it can provide the snapshot for starting the executor of epoch e_{i+1} even if the executor of the previous epoch e_i has not yet finished. Our parallelization approach is not completely transparent to the application developer. The application developer has to mark potential places in the code where a snapshot could be taken, i.e., a new epoch can be started. Ideally, these snapshot places are executed periodically with constant frequency at runtime.

3.1 Platform Support
At compile time, we first generate two code bases from the unencoded application: the predictor code base and the executor code base. This stage just duplicates the functions of the original code base and renames them. Second, we instrument both code bases to allow switching from the predictor code base to the executor code base at epoch boundaries. At runtime, the added code rewrites the stack when switching from the predictor's code base to the executor's code base (e.g., it rewrites the return addresses on the stack to point into the executor's code base). After these preparations, the encoding compiler encodes the executor code base. The instrumentation process is the same as for encoding without parallelization. At runtime, we provide a snapshot mechanism (similar to fork [11]) which starts a new executor for each new epoch started by the predictor. The executor replays the same computation for an epoch e as the predictor performed for e but with encoding. Therefore, any input received by the predictor is deterministically replayed in the executors. The input is encoded at runtime by the hand-coded wrappers described in Sec. 2. All externally visible side-effects (issued via system calls) of the predictor are held back until they are verified by the executors. After successfully executing an epoch, an executor explicitly approves it to make its side-effects externally visible. Because the executors are running in parallel, the verification order of system calls might be different from the order in which they were issued in the predictor. We allow out-of-order issuing of system calls by the executors but ensure their in-order retirement. The deterministic replay and speculative execution of external side-effects are transparent to the encoding compiler and its runtime support. We implemented these two features as a kernel module for Linux, similar to Speck [11].

3.2 Speculative Variables
To parallelize the encoded execution of epochs by executors, the executor of epoch e_i starts independently from the state of the executor of its predecessor epoch e_{i−1}. Initially, the executor of e_i contains only the unencoded state from the snapshot of the predictor. Whenever the current state is read in e_i, it is lazily encoded. This approach does not protect against errors in the predictor's execution or the snapshots. Hence, after executing e_i, its executor verifies the initial encoded state of e_i against the final encoded state of e_{i−1}. By comparing only encoded states, we achieve end-to-end safety. We implemented the described approach using speculative variables. A speculative variable holds a value and optionally an obligation. The value is written within the executed epoch, i.e., computed in that epoch. The obligation is the initial value which was read from the snapshot and needs to be verified at some later point in time. At runtime, the whole encoded state is stored in speculative variables as follows. Every word in memory is assigned to one speculative variable. An epoch starts with an empty set of speculative variables. Speculative variables are created lazily when their corresponding memory addresses are accessed for the first time. When a memory address i is read or written, the value of its speculative variable v_i is read or written, respectively. A speculative variable v_i is either created by an encoded read from address i or an encoded write to i. When created by a write, the value of v_i is set to the (already) encoded value given to the write. A speculative variable created by a write does not have an obligation. If v_i is created by a read, the unencoded value at address i is read from the predictor's snapshot at the start of the current epoch e_i. This unencoded value is then encoded and written to the value of v_i and to its obligation. Subsequent accesses to v_i do not touch its obligation. At the end of epoch e_i, all obligations created in the encoded executor of e_i are checked against the final state of the encoded executor of the preceding epoch e_{i−1}.
Therefore, the executor of e_{i−1} writes its final encoded state into a global view shared by all executors. At the start of the application, the global view contains the encoded initial state of the application. The executor E_i of e_i waits for the executor of e_{i−1} to terminate. Then E_i verifies all obligations of e_i against the global view. If this verification fails, the application is stopped. In the future, we want to retry the execution of the current epoch to tolerate transient faults. After the verification, E_i updates the global view with the current values of all speculative variables of e_i.
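The life cycle of speculative variables described above can be sketched as follows. The data layout and global-view handling are simplified assumptions; the real implementation works on memory words inside a Linux kernel module:

```python
A = 65521                                # illustrative prime

class Epoch:
    """One executor epoch: lazily created speculative variables,
    each holding [encoded value, optional obligation]."""
    def __init__(self, snapshot):        # unencoded predictor snapshot
        self.snapshot = snapshot
        self.vars = {}                   # addr -> [value, obligation]

    def read(self, addr):
        if addr not in self.vars:
            enc = self.snapshot[addr] * A      # lazy encoding on first read
            self.vars[addr] = [enc, enc]       # obligation = initial value
        return self.vars[addr][0]

    def write(self, addr, enc_value):
        if addr in self.vars:
            self.vars[addr][0] = enc_value     # obligation stays untouched
        else:
            self.vars[addr] = [enc_value, None]  # write creates no obligation

    def verify_and_retire(self, global_view):
        # Check every obligation against the predecessor's final encoded
        # state, then publish this epoch's final state.
        for addr, (_, obligation) in self.vars.items():
            if obligation is not None and global_view.get(addr) != obligation:
                raise RuntimeError("state mismatch -- stop application")
        global_view.update({a: v for a, (v, _) in self.vars.items()})
```

Comparing only encoded values in `verify_and_retire` is what gives the end-to-end safety argument: a wrong predictor snapshot shows up as an unfulfilled obligation.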
4 Evaluation
We evaluated our approach using four applications: (1) md5 calculates the md5 hash of a string, (2) primes implements the Sieve of Eratosthenes, (3) tcas is an open-source implementation of the traffic alert and collision avoidance system [1] which is mandatory for airplanes, and (4) pid is a Proportional-Integral-Derivative controller [23].
Performance. We measured the performance as the slowdown of the runtime of the AN-encoded application over the runtime of the unencoded application. Fig. 4 depicts the slowdown of the AN-encoded applications compared to their unencoded versions using different amounts of parallelization by restricting the
Fig. 4. Slowdowns of sequential and parallelized AN-encoded applications
maximum number of AN-encoded executors running in parallel (x-axis). All tests were executed on 64-bit Fedora 10 running on a 16-core machine. The sequential slowdowns (1 parallel executor in Fig. 4) range from 7.5 to 238. The more expensive encoded operations an application uses, the larger its slowdown becomes. Especially md5 uses many bitwise logical operations and thus experiences a larger slowdown. Using two parallel executors does not halve the slowdown of the sequential execution because parallelization itself generates overhead by forking new executors, encoded switching from the predictor code base to the executor code base, and checking the obligations. Starting with 2 parallel executors, the slowdown decreases linearly with the number of parallel executors. Between 8 and 32 parallel executors the decrease stagnates since we are then overloading our 16-core machine. The more overhead the AN-encoded version generates, the better its parallelization scales. With 16 cores, the average slowdown of the tested AN-encoded applications can be reduced from 110 in the sequential case to 16 in the parallelized case. Thus, parallelizing our AN-encoded applications using 16 cores makes them at best 11.5 times (tcas), at worst 2.3 times (primes), and on average 6.9 times faster.
Error Detection. Fig. 5 shows the results of our error injection experiments. We implemented the error injection tool using LLVM. At compilation time, we insert so-called trigger points wherever possible. At runtime, we decide at each trigger point if it is triggered, i.e., if an error is inserted. If it is not triggered, execution is carried on as in the error-free case. The caption of Fig. 5 describes the implemented error model. While unencoded applications show high rates of undetected errors (incorrect output), this is not the case for AN-encoded programs.
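The trigger-point mechanism just described can be sketched as follows (probabilities, bit width, and the single encoded addition are illustrative). Note that with a prime A, a single bit flip in an otherwise valid code word is always caught by the modulo check, since flipping bit b changes the value by 2^b, which is never a multiple of an odd prime; the residual escape probability of roughly 2^(−k) stems from multi-bit corruptions:

```python
import random

A = 65521                                # illustrative prime

def trigger(value, p, rng, bits=64):
    """Trigger point: with probability p, flip one bit of the value
    (the FO error type, a faulty operation result)."""
    if rng.random() < p:
        return value ^ (1 << rng.randrange(bits))
    return value

def encoded_run(a, b, p, rng):
    """One encoded addition with possible fault injection, classified
    into the outcome categories used in Fig. 5."""
    r = trigger(a * A + b * A, p, rng)   # encoded addition, maybe faulty
    if r % A != 0:
        return "failure detected"        # invalid code word: abort
    return "correct output" if r == (a + b) * A else "incorrect output"
```

Running this with injection forced on (p = 1.0) classifies every run as "failure detected", matching the single-bit-flip argument above.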
The highest rate of incorrect output for AN-encoded programs, 7%, occurs for md5 with faulty operations, while the highest rate for unencoded programs is 71% for md5 with exchanged operators. For the probabilistic error injection, it even goes down to 0.5% of undetected errors on average, compared to 15.9% for the unencoded execution. In the case of failure detected, we either detected an invalid code word and stopped the application, or modified data led to another inconsistency in the
(Fig. 5: for each benchmark (primes, tcas, md5, pid), native and AN-encoded, the bars show the normalized behavior in % per error type (EO1, EO2, FO, LS, MO, ALL, Prob), classified as no error, correct output, failure detected, performance failure, or incorrect output.)
Fig. 5. We inserted the following transient error types: exchange operands (EO1): A different but valid operand is used. exchange operators (EO2): A different operator is used, e.g., a plus instead of a minus. faulty operations (FO): The result of an operation is modified by bit-flips. lost stores (LS): A store operation is omitted. modify operands (MO): An operand is modified by bit-flips. For each example program 10,000 runs with the same input were executed for each error type. In each run exactly one error of that type was inserted. ALL represents the average over all those experiments. Prob, however, shows the results for probabilistically executed injections of all types. At each possible injection point, we decided if an error should be injected. The same error probability was used for the unencoded and the AN-encoded version. This results in more errors being injected into the AN-encoded version because of its larger code size.
program causing a crash. The AN-encoded versions always show a higher rate of failure detected than the unencoded runs. The probabilistic runs (Prob), where multiple errors are injected in nearly all AN-encoded runs, result in a higher percentage of failures detected because the more errors are injected, the higher the probability of detection [20]. A large part of the injections produced correct output which does not differ from the output of the error-free run. For three of the four AN-encoded versions, the share of such runs increases. Especially with the AN-encoded versions of the tcas benchmark, we see performance failures, i.e., the application runs much longer than the error-free run. tcas contains several loops whose conditions contain comparisons with 8-bit integers. Encoding increases the size of those values to 64 bit. Injected errors probably increase the contained functional value.
This results in longer-running loops since AN-encoded comparisons currently do not check the code of their operands. Code checking is expensive, and especially for control systems such as tcas or pid it might be cheaper to use a watchdog to check for liveness. No error marks those runs of the probabilistic injections into which no error was injected. This is the case for 35% to 52% of the native runs but for none of the AN-encoded runs. Obviously, the increased code path length makes it more probable that a run is hit by an error. On the other hand, AN-encoding makes it possible to prevent erroneous output to a large extent.
5 Related Work
Error Detection. For an overview of hardware approaches for hardware error detection, see our previous work [14]. They all have in common that custom hardware is typically very expensive and not often adapted to new, faster technologies. Our intention is to provide a software-implemented error detection mechanism which provides detection guarantees independent of the used hardware. This allows the use of up-to-date hardware in safety-critical systems which require certification. Control-flow checking approaches such as [4,16], in contrast to AN-encoding, can detect invalid control flow for the executed program, that is, execution of sequences of instructions which are not permitted for the executed binary. Modified data values, which an AN-code detects, remain undetected by these approaches. In [15,22], invariants contained in the executed program are used to check the validity of the generated results. Thereby good error detection can be provided, but for most programs it is difficult—if not impossible—to design invariants with good failure detection capabilities and to assess the quality of these invariants. Replicated execution and comparison of the obtained results, as for example used by [18,3], provide no guarantees with respect to permanent hardware errors or soft errors which disturb the voting mechanism. ED4I [12] also uses an AN-code, but the authors choose a factor A which is a power of two whenever a program contains logical operations. Thereby, logical operations become easily encodable, but the detection capabilities are reduced immensely because the resulting code cannot detect bit flips in the higher-order bits of data values. But those contain the original functional value. [6] applies the AN-code only to registers and not to memory, and only to operations which can easily handle encoded values, such as additions and subtractions. In the end, that presumably leaves only small parts of applications AN-encoded.
As should be expected, their fault injection experiments show a non-negligible amount of undetected failures for most of the tested applications. Neither [12] nor [6] discusses overflow problems with AN-codes, which we pointed out and solved in [19].
Parallelized Checks. Recent work [11,13,17] exploits modern multi-core systems for reducing the overhead of expensive runtime checks. As in our parallelization framework, the execution is split into epochs. The application runs as speculator on a single core, while the other cores replay the execution of the speculator using multiple parallel checkers. But there are also major differences. Speck [11] and SuperPin [17] rely on dynamic binary instrumentation. Instrumenting AN-encoding at runtime is less safe than static instrumentation [21]. While Speck changes the whole OS kernel, we implemented a kernel module which is much easier to deploy and maintain. SuperPin does not have speculation support at the syscall level. Thus, erroneous output cannot always be blocked before it is verified. Different AN-encoded checker epochs have to share state because predecessors are required to verify the error-freeness of the start state of their successors. Our understanding is that SuperPin does not support sharing state between parallel running checker epochs. Speck and parallel DIFT [13] merge the checking data gathered by the checkers in a separate thread into sequential order. This thread can become a bottleneck when the number of checkers increases. Parallel DIFT uses a hardware extension [7] to stream data between cores. We do not rely on any specialized hardware but implemented state sharing using speculative variables.
6 Conclusion
We demonstrated that AN-codes can be used to reduce the rate of undetected incorrect output for frequently occurring error events from 15.9% to 0.5% and for rare error events with only one injected error per run from 30.8% down to 1.7%. The remaining undetected errors can be tackled by using the more powerful but also more expensive AN-code with signatures as introduced by Forin in [9].
References
1. The Paparazzi Project, http://paparazzi.enac.fr/wiki/Main_Page
2. Avizienis, A.: Arithmetic error codes: Cost and effectiveness studies for application in digital system design. Transactions on Computers (1971)
3. Bolchini, C., Miele, A., Rebaudengo, M., Salice, F., Sciuto, D., Sterpone, L., Violante, M.: Software and hardware techniques for SEU detection in IP processors. J. Electron. Test. 24(1-3), 35–44 (2008)
4. Borin, E., Wang, C., Wu, Y., Araujo, G.: Software-based transparent and comprehensive control-flow error detection. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO), Washington, DC, USA, pp. 333–345. IEEE Computer Society, Los Alamitos (2006)
5. Borkar, S.: Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro (2005)
6. Chang, J., Reis, G.A., August, D.I.: Automatic instruction-level software-only recovery. In: Proceedings of the International Conference on Dependable Systems and Networks (DSN). IEEE Computer Society, Los Alamitos (2006)
7. Chen, S., Kozuch, M., Strigkos, T., Falsafi, B., Gibbons, P.B., Mowry, T.C., Ramachandran, V., Ruwase, O., Ryan, M., Vlachos, E.: Flexible hardware acceleration for instruction-grain program monitoring. In: ISCA 2008: Proceedings of the 35th International Symposium on Computer Architecture. IEEE Computer Society, Los Alamitos (2008)
8. Dixit, A., Heald, R., Wood, A.: Trends from ten years of soft error experimentation. In: System Effects of Logic Soft Errors (SELSE) (2009)
9. Forin, P.: Vital coded microprocessor principles and application for various transit systems. In: IFA-GCCT, September 1989, pp. 79–84 (1989)
10. Lattner, C., Adve, V.: LLVM: A compilation framework for lifelong program analysis & transformation. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO). IEEE Computer Society, Los Alamitos (2004)
11. Nightingale, E.B., Peek, D., Chen, P.M., Flinn, J.: Parallelizing security checks on commodity hardware. SIGARCH Comput. Archit. News (2008)
12. Oh, N., Mitra, S., McCluskey, E.J.: ED4I: Error detection by diverse data and duplicated instructions. IEEE Trans. Comput. 51 (2002)
13. Ruwase, O., Gibbons, P.B., Mowry, T.C., Ramachandran, V., Chen, S., Kozuch, M., Ryan, M.: Parallelizing dynamic information flow tracking. In: SPAA 2008: Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures. ACM, New York (2008)
14. Schiffel, U., Süßkraut, M., Fetzer, C.: AN-encoding compiler: Building safety-critical systems with commodity hardware. In: The 28th International Conference on Computer Safety, Reliability and Security (SafeComp 2009) (2009)
15. Stefanidis, V.K., Margaritis, K.G.: Algorithm based fault tolerance: Review and experimental study. In: International Conference of Numerical Analysis and Applied Mathematics (2004)
16. Vemu, R., Abraham, J.A.: CEDA: Control-flow error detection through assertions. In: IOLTS 2006: Proceedings of the 12th IEEE International Symposium on On-Line Testing. IEEE Computer Society, Los Alamitos (2006)
17. Wallace, S., Hazelwood, K.: SuperPin: Parallelizing dynamic instrumentation for real-time performance. In: 5th Annual International Symposium on Code Generation and Optimization, San Jose, CA, March 2007, pp. 209–217 (2007)
18. Wang, C., Kim, H.-s., Wu, Y., Ying, V.: Compiler-managed software-based redundant multi-threading for transient fault detection. In: International Symposium on Code Generation and Optimization (CGO) (2007)
19. Wappler, U., Fetzer, C.: Hardware failure virtualization via software encoded processing. In: 5th IEEE International Conference on Industrial Informatics (INDIN 2007) (2007)
20. Wappler, U., Fetzer, C.: Software encoded processing: Building dependable systems with commodity hardware. In: Saglietti, F., Oster, N. (eds.) SAFECOMP 2007. LNCS, vol. 4680, pp. 356–369. Springer, Heidelberg (2007)
21. Wappler, U., Müller, M.: Software protection mechanisms for dependable systems. In: Design, Automation and Test in Europe (DATE 2008) (2008)
22. Wasserman, H., Blum, M.: Software reliability via run-time result-checking. J. ACM (1997)
23. Wescott, T.: PID without a PhD. Embedded Systems Programming 13(11) (2000)
Model-Based Analysis of Contract-Based Real-Time Scheduling
Georgiana Macariu and Vladimir Crețu
Computer Science and Engineering Department, Politehnica University of Timisoara, Timisoara, Romania
Abstract. We apply automata theory to analyze the schedulability of real-time component-based applications running on uniform multi-processor platforms. The resource requirements of each application or application component are specified in a service contract, resulting in a hierarchy of contracts. As we are interested in determining the schedulability of such applications, this hierarchy of contracts is mapped to a hierarchical scheduling strategy. We use model checking and transform the schedulability analysis problem into a reachability check on a timed automata model of the service contracts.
1 Introduction
In recent years, real-time embedded software development has focused more and more on building flexible and extensible applications. Component-based software systems achieve these objectives by gluing together individually designed, developed and tested software components, each with its own timing requirements. Therefore, when building such a component-based system, one must ensure that components can coexist without jeopardizing each other's execution. One solution for the temporal isolation of applications running on uni-processor systems is hierarchical scheduling based on execution time servers [1]. In hierarchical scheduling, each application has its own scheduler and can use the scheduling policy that best suits its needs. Based on such a hierarchical scheduling scheme, Harbour introduced the concept of service contracts [2]. In Harbour's model, every application or application component may have a set of service contracts describing its minimum resource requirements. These contracts are used in online or offline negotiations to determine whether the resource requirements can be guaranteed. Recently, hierarchical scheduling has been used in a multi-processor scheduling framework for integrating applications with hard, soft and non-real-time requirements [3]. Research is also under way to extend the service contract model to component-based multi-processor real-time systems. Chang et al. [4] have proposed a two-level resource contract model. First, each application has a contract specifying the resources to be reserved for its execution; this is called an external contract. Next, every component of the application has its own contract, called an internal contract, describing the portion of the resources specified in
S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 227–239, 2009. © IFIP International Federation for Information Processing 2009
the external contract that must be distributed to the component. Each component consists of one or more tasks which may require parallel execution. Internal contracts are mapped to abstract servers, which are further divided into execution time sub-servers (called simply servers in what follows) in order to support parallel execution of the components. External contracts, on the other hand, are mapped to multi-processor time partitions [5]. As each application is mapped to a separate time partition, a specific scheduling policy may be associated with it. Starting from the hierarchical scheduling solution proposed in [4], we apply timed automata theory [6] to the specification of service contracts for components and applications. A component contract describes the tasks of the component and their arrival patterns, modeled by a timed automaton. An application contract refers to the servers of all components in the application and to the arrival patterns of these servers, also modeled as a timed automaton. We allow a different scheduling policy for each component and application, with no restriction on task preemption. Furthermore, we present a compositional approach based on timed automata for the schedulability analysis of component-based real-time applications running on uniform multi-processor platforms. For each application, schedulability (i.e., checking that all application components can be executed such that all their tasks meet their deadlines) can be analyzed separately. In the timed automata formalism, schedulability analysis is reduced to reachability and can be performed using a tool like UPPAAL [7]. Schedulability analysis of real-time systems using timed automata has been proved decidable and applied successfully for non-preemptive task scheduling policies.
However, in timed automata models as defined in [6], time elapses at the same rate for all components; they therefore cannot be used for preemptive scheduling policies, where the execution of tasks can be suspended and resumed later. Stopwatch automata [8], a subclass of linear hybrid automata, have been proposed as a solution for modeling preemptible tasks. However, since reachability for linear hybrid automata has been proved undecidable [9], this result extends to stopwatch automata. An over-approximation method based on Difference Bound Matrices has been applied in [8] for a coarse reachability analysis of stopwatch automata. Even so, the schedulability checking problem has been shown to be decidable for non-uniformly recurring tasks triggered by events. [10] introduces timed automata extended with tasks, a class of timed automata with subtraction where clocks may be updated by subtraction in a bounded zone, and proves that schedulability checking relative to a preemptive scheduling policy is decidable for this class of automata. This result has been extended to multi-processor real-time systems in [11], where it is shown that the schedulability problem is decidable for preemptive scheduling policies with fixed execution time tasks. However, they do not allow task migration, meaning that a task instance is bound to one processor until it finishes. In our framework, a task instance may execute on any processor depending on the availability of the execution time servers and the configuration of the multi-processor time partition. Global multi-processor schedulability analysis using model checking has been investigated for tasks with static priorities in [12]. The models in [12] allow
restricted and full migration of task instances. Every task is modeled separately, and the schedulability of the tasks is checked in decreasing order of their priority, which limits the applicability of the analysis to static scheduling policies. It also implies that for a task set with N tasks, model checking has to be performed N times in order to determine the schedulability of the entire set, and at most N + 1 clocks are necessary for a task set of size N. Unlike this model checking solution, our proposal addresses both static and dynamic scheduling policies. Moreover, it requires just a single run of the model checker for the entire task set, using a single clock, in a setting with resources that are not continuously available and multiple levels of scheduling. This paper is organized as follows. Section 2 introduces our formal model for contract-based scheduling and Section 3 gives details on the timed automata used in the system model. We present performance evaluation results in Section 4. Section 5 concludes this paper.
2 The Contract-Based Scheduling Model
This section presents the formal model of the service contracts. As explained in Section 1, there are two levels of such contracts. The first level specifies the resource requirements of a single application, while the second level describes the requirements of each individual component of the application. Corresponding to the two levels of contracts there are two scheduling levels. At the upper level, each component of an application has a scheduler for scheduling its tasks, while at the lower level an application scheduler manages the servers associated with each component of the application.
2.1 Component Contracts
A component C consists of a finite set of n tasks T and a timed automaton A_C where:
- a component task τ_i ∈ T is a tuple τ_i = (w_i, p_i, o_i, d_i), with w_i the worst case execution time of the task, p_i the inter-arrival time between instances of the same task, o_i the first release of the task, and d_i the deadline of the task, where w_i ≤ d_i ≤ p_i,
- tasks may execute in parallel and are independent of each other,
- A_C models the execution of the tasks in T by taking transitions labeled with the actions tReady_i, tFinish_i and tOverrun_i, ∀ 1 ≤ i ≤ n, representing the release and ending of task τ_i, and the actions tGo_i and tPreempt_i through which the component scheduler notifies task execution start/restart and suspension.
The tasks of the component are executed according to a component-specific scheduling policy implemented by a scheduler associated with the component. The parameters of the tasks along with the task arrival pattern determine the resource requirements of the component. These resource requirements can be supported using one or more execution time servers, depending on whether the tasks
must execute in parallel or not. The period, deadline and budget of the servers associated with a component are specified in the component contract. A server is defined by a tuple (q, p, o), where q is the capacity of the server, p is its replenishment period (i.e., the server becomes active every p time units) and o is the time of its first release. Each server may also have a deadline equal to its period. It is assumed that there is a finite set of servers S containing the servers of all the components of an application.
Definition 1 (Component contract). A component contract C_C providing a set of n_s execution servers S_C ⊆ S is a timed automaton A_Cc over the set of actions Σ_C such that:
- A_Cc specifies the activation pattern of the servers σ_i ∈ S_C, 1 ≤ i ≤ n_s,
- Σ_C is split into two sets:
  - output actions: Σ_CO = {sReady_i, sFinish_i, sOverrun_i, sActive_i, sInactive_i | 1 ≤ i ≤ n_s}
  - input actions: Σ_CI = {sGo_i, sPreempt_i | 1 ≤ i ≤ n_s}.
The A_Cc automaton sends the output action sReady_i to the scheduler associated with the application as soon as server σ_i is ready for execution, and sends sFinish_i or sOverrun_i to the same scheduler to notify it that the server has finished its execution or missed its deadline, respectively. In response to its actions, A_Cc can receive from the application scheduler sGo_i, telling it that server σ_i can start its execution, or sPreempt_i, which results in server σ_i being suspended from execution until the next sGo_i action. The actions sActive_i and sInactive_i announce to the component scheduler that server σ_i has consumed all its budget or, respectively, that it has replenished its budget and can be used again to execute tasks.
2.2 Application Contracts
As proposed in [4], the application contracts are supported by a multi-processor time partition model. Each application is associated with a time partition which has a local scheduler that executes the execution time servers assigned to the components of the application. In a uni-processor system, a time partition is implemented as a fixed-length major time frame composed of several scheduling windows. A scheduling window is defined by its offset from the beginning of the partition's major time frame and by its length. The scheduling scheme of the major time frame repeats during the execution of the system, such that all scheduling windows are essentially periodic. In a multi-processor system, we assume there is a major time frame for each processor, but the frames on all processors have equal length and are synchronized. The scheduling windows of the frames on different processors can differ. From the above specification we next derive a formal definition of the multi-processor time partition.
Definition 2 (Time partition). A time partition TP in a multi-processor system is described by a set of major time frames {F_i | 1 ≤ i ≤ m, length(F_i) = L}, one for each of the m processors in the system, where F_i is a set of scheduling windows with periods that are exact divisors of L.
In our setting the time partition is used to support application contracts. In a simple scenario, the application contract could specify a few pairs of period and length values which, upon successful negotiation of the contract, would be mapped to a set of scheduling windows.
Definition 3 (Application contract). An application contract C_A is a pair (TP, A_Ca) where:
- TP is the multi-processor time partition provided by the contract, and
- A_Ca is a timed automaton over the action set Σ_SW modeling the scheduling scheme of the major time frame:
  - Σ_SW = {swActive_k, swInactive_k}, where k is a scheduling window in TP,
  - the action swActive_k signals to the application scheduler that scheduling window k is now active, while swInactive_k signals its deactivation.
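Definition 2 can be illustrated with a small sketch. The Window type and the constraint that a window fits inside its own period are our illustrative assumptions; only the condition that window periods exactly divide L comes from the definition.

```python
from dataclasses import dataclass

@dataclass
class Window:
    offset: int   # start relative to the major time frame (assumption)
    length: int   # duration of the window
    period: int   # must be an exact divisor of L

def valid_partition(frames, L):
    """Check Definition 2: one major time frame per processor, all of
    length L, and every scheduling window period exactly divides L."""
    return all(L % w.period == 0 and w.offset + w.length <= w.period
               for frame in frames for w in frame)

# Two processors sharing a synchronized major time frame of length L = 12.
tp = [[Window(0, 3, 6), Window(1, 2, 4)],   # processor 1
      [Window(2, 2, 12)]]                   # processor 2
assert valid_partition(tp, 12)
assert not valid_partition([[Window(0, 2, 5)]], 12)  # 5 does not divide 12
```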
3 The Timed Automata Models
As shown in the previous section, both the component and the application level include three automata: one for generating tasks or servers according to a given release pattern, one for generating the resources (servers or scheduling windows, respectively) on which the tasks and servers shall be executing, and one for scheduling. Notice that servers can be both schedulable entities (i.e., when referring to the application scheduler) and resources (i.e., for the component scheduler). For this reason, in the rest of this section they are referred to simply as tasks and resources, respectively. This plurality of roles also implies that the timed automaton generating tasks for the application level is the same as the one generating resources for the component level; this automaton can therefore be deduced immediately from the task generator and resource generator automaton types. The rest of the section gives detailed descriptions of each of the three types of automata. In addition, the model also includes a Timer automaton which uses a single continuous clock t; each time this clock ticks, it sends a tick signal to the task generator and resource generator automata. We first introduce some notation. Let W(i), P(i), D(i), R(i) and E(i) denote the worst case execution time, the period, the deadline, the next release time and the current execution time, respectively, of each task τ_i. For each task τ_i a status variable status(i) is defined, initialized to idle, meaning that a task instance has not been released yet. The value status(i) = ready denotes that an instance of τ_i is ready for execution (i.e., it has just been released or was preempted). Let status(i) = running stand for the fact that
an instance of τ_i is currently running on one of the active resources. To denote that an instance of task τ_i has finished or has missed its deadline we use status(i) = finished and status(i) = overrun, respectively.
3.1 Task Generator Automaton
Model checking of preemptive scheduling algorithms could be done using a stopwatch model, but schedulability of such models has been proved undecidable. Therefore, in order to support task preemption, a discrete time formalism is adopted for the model proposed in this paper. This imposes a limitation: all task parameters (i.e., worst case execution time, period, deadline, release time) must have integer values. In order to determine the actual execution time of a task, a variable E(i) keeps track of the time task τ_i has executed since its last release. Each time the task is released, E(i) is set to 0 while R(i) is set to the time of its next release. When the task generator automaton receives a tick signal from the Timer automaton, it increases E(i) for all tasks with status(i) = running and decreases R(i) for all tasks by a value MIN representing the minimum between the time to the next release of a task or resource and the time to the next termination of a task or deactivation of a resource. In other words, E(i) acts like a discrete clock which can be suspended and resumed. Instead of using a single task generator to release all n tasks of a component according to some pattern, it would have been possible to define a timed automaton for each of the n tasks, each with its own clock, leading to a total of n clocks. Since the state space of timed automata grows exponentially with the number of clocks in the model, the approach taken in this paper is preferable. Figure 1(a) shows the main locations and transitions of the task generator automaton, leaving out some self-loop transitions. The white locations in the figure have the semantics that the system cannot delay in them and the next transition must involve an outgoing edge from one of them.
Fig. 1. Task and resource generators: (a) the task generator automaton; (b) the non-preemptive resource generator automaton
The task generator automaton uses a variable next_release to remember the time until the next task release. At start-up this variable is initialized with the smallest R(i), and if next_release = 0 the automaton goes to the Ready location, selects a task τ_i with R(i) = 0, updates next_release, sets the shared variable ready_task = i and sends the ready signal to the scheduler automaton. Once next_release becomes greater than 0, the generator moves to the Idle location, where it waits for the next tick of the Timer. When the tick signal arrives, the transition to the Increment location is taken and inc_time() updates status(i), E(i) and R(i) as follows:
- for all tasks τ_i with status(i) = running, E(i) = E(i) + MIN, and if E(i) = W(i) then status(i) = finished,
- for all tasks τ_i, R(i) = R(i) − MIN and next_release = min(R(i)),
- for all tasks τ_i running or ready for execution with E(i) < W(i) and P(i) − D(i) = R(i), status(i) is set to overrun.
Next, for all tasks τ_j that have finished, the variable finished_task is set to j and the finished signal is sent to the scheduler, which frees the resources used by these tasks. If any task τ_j has missed its deadline, an overrun signal notifies the scheduler, which as a result goes to an Error location. After signaling all task finish events, the generator checks whether any task is ready for execution and goes back to the Ready location.
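The update rules of inc_time() can be sketched as follows. This is an illustrative Python rendering of the rules above, not the UPPAAL model itself; the dictionary representation of tasks is our assumption.

```python
def inc_time(tasks, MIN):
    """One 'Increment' step of the task generator: advance all discrete
    clocks by MIN and update the task statuses."""
    for t in tasks:
        # rule 1: advance execution clocks of running tasks
        if t['status'] == 'running':
            t['E'] += MIN
            if t['E'] == t['W']:
                t['status'] = 'finished'
        # rule 2: count down to the next release
        t['R'] -= MIN
        # rule 3: a deadline D after release is P - D before next release
        if (t['status'] in ('running', 'ready')
                and t['E'] < t['W'] and t['P'] - t['D'] == t['R']):
            t['status'] = 'overrun'
    return min(t['R'] for t in tasks)   # new value of next_release

tasks = [
    {'status': 'running', 'E': 1, 'W': 2, 'R': 7, 'P': 10, 'D': 6},
    {'status': 'ready',   'E': 0, 'W': 3, 'R': 5, 'P': 10, 'D': 6},
]
next_release = inc_time(tasks, 1)
assert tasks[0]['status'] == 'finished'   # E(i) reached W(i)
assert tasks[1]['status'] == 'overrun'    # R(i) hit P(i) - D(i) with E < W
assert next_release == 4
```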
3.2 Resource Generator Automaton
The task generator automaton presented above can be used to generate the servers which act as resources for the component level. By adding just two signals, active and inactive, which notify the scheduler about the availability of the resources, the task generator automaton becomes a resource generator automaton for preemptible resources. If resources are not preemptible (i.e., the scheduling windows of a time partition), the resource generator automaton is a simplified version of the task generator. Figure 1(b) presents the non-preemptive version of the resource generator automaton. The automaton keeps a discrete clock RE(k) for each resource r_k. RR(k) remembers the time until the next activation of resource r_k, and two variables named next_release and next_finish hold the time until the next resource activation and deactivation, respectively. When resource r_k is activated, RE(k) = L(k), where L(k) denotes the length of the resource's activation period. At every tick signal received from the Timer, RE(k) is decreased by MIN for all active resources r_k, and the variables next_release and next_finish are also decreased by the same value. When next_release reaches 0, all resources r_k with RR(k) = 0 are activated. If next_finish becomes 0, all resources r_k with RE(k) = 0 are deactivated.
3.3 Scheduler Automaton
As can be seen from the definitions in the previous sections, the component scheduler and the application scheduler have rather similar behavior. Both of
them must schedule a set of periodic tasks/servers with deadlines less than or equal to their period. The component tasks are scheduled on execution time servers, which may be active or inactive. Two or more servers may be active simultaneously, which implies that two or more tasks may run in parallel. For the application scheduler, the tasks to be scheduled are actually the servers used by the component scheduler as resources. The servers are scheduled for execution on the scheduling windows of a time partition; the scheduling windows represent the resources allocated to the application by the system. As more scheduling windows can be active simultaneously, parallel execution of the servers is also possible. A scheduler automaton for a service (i.e., application or component) contract has the following characteristics:
- it has a queue holding the tasks ready for execution,
- it implements a preemptive scheduling policy Sch, represented by a sorting function for the task queue,
- it maintains a map between active resources (servers or scheduling windows) and the tasks using those resources, and
- it has an Error location which is reached when a task misses its deadline.
To record the status of a resource, let rt_map(j) be a map where rt_map(j) = inactive denotes that resource j is inactive, rt_map(j) = active means that resource j is active but no task is executing on it, and rt_map(j) = i denotes that resource j is active and currently used by task τ_i. Figure 2 shows the scheduler automaton. The locations of the automaton have the following interpretations:
1. Idle - no task is ready for execution or no resources are active,
2. Prepare - a task has been released and a resource is active after a period during which either there were no tasks to schedule or no active resources,
3. Running - at least one task is currently executing,
4. AssignTask - a task has just finished and as a result an active resource can be used to schedule another ready task,
5. AssignResource - a task has just been released, or a resource has just become inactive, leaving its assigned task with no resource on which to execute; consequently the task has to be enqueued, and if it has the highest priority in the queue according to Sch then an active resource is assigned to it,
6. Check - a resource has become inactive,
7. Error - the task set is not schedulable with Sch.
The scheduler enters the Idle location when there are no ready tasks, no active resources, or both. As long as new tasks are released for execution but there are no active resources on which to execute them (i.e., task_no > 0 and res_no == 0), or as long as there are available resources but no ready tasks (i.e., task_no == 0 and res_no > 0), the scheduler stays in the Idle location. If the scheduler receives a ready signal, meaning that task τ_{ready_task} has been released, and res_no > 0, the scheduler goes to the Prepare location. Leaving the Prepare location for the Running location, it assigns the task to one of the active resources by setting rt_map(j) = i, sets the variable activated_task = i and sends a go signal to announce to the task generator automaton that task τ_i is running. After the scheduler has reached the Running location, it will leave it when one of the following situations occurs:
- resource r_k becomes active (signaled by the active signal and activated_resource = k): this is marked by updating rt_map[k] = ACTIVE on the transition to the AssignTask location. If tasks are ready for execution, the scheduler will assign the highest priority task τ_j to resource r_k by setting rt_map[k] = j and will notify the task generator with the go signal on a transition back to the Running location.
- a new task τ_i has been released (signaled by the ready signal and ready_task = i): the task is enqueued by setting status(i) to ready on the transition to the AssignResource location. If task τ_i is the highest priority released task and there are active resources, then τ_i must start executing. If there is a free active resource, task τ_i is assigned to it; otherwise the lowest priority task is chosen from the running tasks and preempted, and the automaton goes to the AssignTask location. On the transition from AssignTask to Running, the resource is assigned to τ_i and a go signal is sent to the task generator to notify it that task τ_i has started running.
- resource r_k becomes inactive (signaled by the inactive signal and deactivated_resource = k): this is marked by updating rt_map[k] = INACTIVE on the transition to the Check location. If the deactivated resource was free and there are still running tasks but no tasks in the queue, the transition back to the Running location is taken. If a task τ_i was using resource r_k, the scheduler must set status(i) = ready and go to the AssignResource location. Should resource r_k be the last active resource, the scheduler simply preempts task τ_i and goes back to the Idle location; otherwise an active resource is searched for, analogously to the situation when a new task is released.
- task τ_j finishes (signaled by the finish signal and finished_task = j): the resource used until now by τ_j can be assigned to the highest priority task waiting in the queue, if there is such a task.
- task τ_i misses its deadline (signaled by overrun): the scheduler automaton goes into the Error location.
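The event handling above can be sketched in Python. This is an illustrative rendering of the scheduler's bookkeeping, not the UPPAAL automaton: names such as on_ready and _dispatch are ours, and the explicit priority comparison before preemption is our reading of the AssignResource rule.

```python
ACTIVE, INACTIVE = 'active', 'inactive'

class Scheduler:
    """Sketch of the scheduler's queue and rt_map bookkeeping.
    Sch is the policy: a key function sorting ready tasks by
    decreasing priority (smaller key = higher priority)."""
    def __init__(self, resources, Sch):
        self.rt_map = {r: INACTIVE for r in resources}
        self.queue = []          # tasks ready for execution
        self.Sch = Sch

    def on_ready(self, task):
        """AssignResource: enqueue; run the task if it is the head."""
        self.queue.append(task)
        self.queue.sort(key=self.Sch)
        free = [r for r, t in self.rt_map.items() if t == ACTIVE]
        if self.queue[0] is task:
            if free:
                self._dispatch(free[0])
            else:
                running = [(r, t) for r, t in self.rt_map.items()
                           if t not in (ACTIVE, INACTIVE)]
                if running:
                    # lowest priority running task is the candidate victim
                    r, victim = max(running, key=lambda rt: self.Sch(rt[1]))
                    if self.Sch(task) < self.Sch(victim):
                        self.queue.append(victim)   # preempted -> ready
                        self.queue.sort(key=self.Sch)
                        self._dispatch(r)

    def on_resource_active(self, r):
        """AssignTask: a fresh resource takes the head of the queue."""
        self.rt_map[r] = ACTIVE
        if self.queue:
            self._dispatch(r)

    def _dispatch(self, r):
        # corresponds to sending the 'go' signal to the task generator
        self.rt_map[r] = self.queue.pop(0)

# Rate Monotonic as the policy: smaller period = higher priority.
sched = Scheduler(['s1', 's2'], Sch=lambda t: t['P'])
sched.on_resource_active('s1')
sched.on_ready({'name': 'a', 'P': 10})
sched.on_ready({'name': 'b', 'P': 5})     # outranks and preempts 'a'
assert sched.rt_map['s1']['name'] == 'b'
sched.on_resource_active('s2')            # preempted 'a' resumes here
assert sched.rt_map['s2']['name'] == 'a'
```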
4 Performance Analysis
This section presents an evaluation of the performance and scalability of model checking the contract-based scheduling model. The experiments were run on a machine with an Intel Core 2 Quad 2.40 GHz processor and 4 GB RAM running Ubuntu. The analysis of the model was automated using UPPAAL, and the utility program memtime [13] was used to measure model checking time and memory usage. Although the proposed model addresses scheduling at two levels, namely the task level and the server level, experiments were conducted only for the server level, as we consider the analysis of the task level to be a replica of the server level due to the similarities between the two levels. In all experiments, to verify schedulability we checked whether the property A[] not Error holds. In order to observe the behavior of the model for different numbers of application servers, we used randomly generated sets of servers with periods in the range [10, 100] and utilizations (i.e., budget/period) generated with a uniform distribution in the range [0.05, 1]. The offset of each server was set to its period multiplied by a randomly generated number in the interval [0, 0.3]. The server sets were accommodated by a time partition with 9 scheduling windows and a total utilization of 4.5. Figure 3 shows how the model checking time and memory usage increase with the number of servers in the set. It can also be noticed that for the same server set size the performance of the model checking can vary between rather large limits (e.g., for sets of 30 servers the model checking time ranges from 7 seconds to approximately 25 seconds). This is due to the size of the hyper-period of the server sets: the larger the hyper-period, the larger the model checking time and memory consumption. Next, we analyzed the scalability and performance of model checking when the number of scheduling windows in the time partition accommodating the servers varies.
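The random generation of server sets described above can be reproduced with a short sketch; the integer rounding of budgets and offsets is our assumption.

```python
import random

def generate_server_set(n, seed=0):
    """Generate n random servers (q, p, o) with period p in [10, 100],
    utilization q/p uniform in [0.05, 1], and offset o = p * U(0, 0.3)."""
    rng = random.Random(seed)
    servers = []
    for _ in range(n):
        p = rng.randint(10, 100)
        u = rng.uniform(0.05, 1.0)
        q = max(1, round(u * p))            # budget (server capacity)
        o = round(p * rng.uniform(0, 0.3))  # first release
        servers.append((q, p, o))
    return servers

servers = generate_server_set(25)
assert len(servers) == 25
assert all(1 <= q <= p and 10 <= p <= 100 for q, p, o in servers)
```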
Model-Based Analysis of Contract-Based Real-Time Scheduling

[Figure 3: (a) model checking time [sec] and (b) model checking memory usage [MB] plotted against server set size (0-35 servers)]
Fig. 3. Influence of server set size on model checking performance

[Figure 4: (a) model checking time [sec] and (b) model checking memory usage [MB] plotted against time partition size (0-10 scheduling windows)]
Fig. 4. Influence of time partition size on model checking performance

[Figure 5: model checking time [sec] plotted against task set size (0-35 tasks) for the RM, EDF, and T-C scheduling policies]
Fig. 5. Influence of scheduling policy on model checking time

[Figure 6: success rate [%] plotted against task set utilization (1.00-4.50) for the RM-20 and RM-30 task sets]
Fig. 6. Schedulability of task sets

For this, sets with 25 servers each and parameters in the same
limits as for the first experiment were generated, and time partitions with 2, 3, 5, 7 and 9 scheduling windows were tested. In Figure 4 it can be seen that both the model checking time and the memory usage grow with the number of scheduling windows in the time partition. In the first two experiments the server sets were scheduled using the Rate Monotonic (RM) priority scheduling policy. The goal of our next experiment is to determine the impact of the scheduling policy on the model checking time and peak memory usage. The same time partition configuration as in the first experiment was used, and sets of 5, 10, 15, 20, 25 and 30 servers were scheduled using the Rate Monotonic, the Earliest Deadline First (EDF), and the T-C (i.e., the higher the difference between the period and the budget of a server, the lower its priority) scheduling policies. As can be seen in Figure 5, the scheduling policy has little influence on the performance of the model checking. In the last experiment we are interested in the influence of the task set utilization on the schedulability analysis. We used the same time partition as in the first experiment, with a total utilization of 4.5, and task sets of 20 and 30 tasks with utilizations between 1 and 4.5 scheduled using the Rate Monotonic policy. Figure 6 depicts the number of schedulable task sets identified by our analysis. It can be noticed that even if the total utilization of a task set is maximal with respect to the available resources, our analysis is able to determine its schedulability, which is a clear advantage over the pessimistic schedulability bounds presented in [4].
G. Macariu and V. Creţu

5 Conclusions
In this paper we have presented a compositional approach, based on the timed automata formalism, for the schedulability analysis of component-based real-time applications that utilize multi-processor resource partitions. Starting from the assumption that the resource requirements of each application and component are stipulated in a service contract, we have defined a timed automata model for specifying the contracts and shown how to use model checking as a technique for analyzing the preemptive schedulability of a hierarchy of such contracts. The performance analysis of our technique using the UPPAAL model checker showed that, even with just one real-time clock used for the entire model, the applicability of the technique is limited by the state-explosion problem.
Acknowledgment. This research is supported by eMuCo, a European project funded by the European Union under the Seventh Framework Programme (FP7) for research and technological development.
References
1. Lipari, G., Bini, E.: A methodology for designing hierarchical scheduling systems. Journal of Embedded Computing 1(2), 257–269 (2005)
2. Harbour, M.G.: Architecture and contract model for processors and networks. Technical Report D-AC1, Universidad de Cantabria (2006)
3. Brandenburg, B.B., Anderson, J.H.: Integrating hard/soft real-time tasks and best-effort jobs on multiprocessors. In: ECRTS 2007: Proceedings of the 19th Euromicro Conference on Real-Time Systems, Washington, DC, USA, pp. 61–70. IEEE Computer Society, Los Alamitos (2007)
4. Chang, Y., Davis, R., Wellings, A.: Schedulability analysis for a real-time multiprocessor system based on service contracts and resource partitioning. Technical Report YCS-2008-432, Computer Science Department, University of York (2008)
5. Kaiser, R.: Combining partitioning and virtualization for safety-critical systems. White Paper, SYSGO AG (2007)
6. Alur, R., Dill, D.L.: A theory of timed automata. Theoretical Computer Science 126(2), 183–235 (1994)
7. Larsen, K.G., Pettersson, P., Yi, W.: UPPAAL in a nutshell. International Journal on Software Tools for Technology Transfer 2(1), 134–152 (1997)
8. Cassez, F., Larsen, K.G.: The impressive power of stopwatches. In: Palamidessi, C. (ed.) CONCUR 2000. LNCS, vol. 1877, pp. 138–152. Springer, Heidelberg (2000)
9. Henzinger, T.A., Kopke, P.W., Puri, A., Varaiya, P.: What's decidable about hybrid automata? In: STOC 1995: Proceedings of the 27th Annual ACM Symposium on Theory of Computing, pp. 373–382. ACM, New York (1995)
10. Fersman, E., Krcal, P., Pettersson, P., Yi, W.: Task automata: Schedulability, decidability and undecidability. Information and Computation 205(8), 1149–1172 (2007)
11. Krcal, P., Stigge, M., Yi, W.: Multi-processor schedulability analysis of preemptive real-time tasks with variable execution times. In: Raskin, J.-F., Thiagarajan, P.S. (eds.) FORMATS 2007. LNCS, vol. 4763, pp. 274–289. Springer, Heidelberg (2007)
12. Guan, N., Gu, Z., Deng, Q., Gao, S., Yu, G.: Exact schedulability analysis for static-priority global multiprocessor scheduling using model-checking. In: Obermaisser, R., Nah, Y., Puschner, P., Rammig, F.J. (eds.) SEUS 2007. LNCS, vol. 4761, pp. 263–272. Springer, Heidelberg (2007)
13. Memtime utility, http://freshmeat.net/projects/memtime/
Exploring the Design Space for Network Protocol Stacks on Special-Purpose Embedded Systems

Hyun-Wook Jin and Junbeom Yoo
Department of Computer Science and Engineering, Konkuk University, Seoul 143-701, Korea
{jinh,jbyoo}@konkuk.ac.kr

Abstract. Many special-purpose embedded systems such as automobiles and aircraft consist of multiple embedded controllers connected through embedded network interconnects. Such network interconnects have particular characteristics and thus different communication requirements. Accordingly, we frequently need to implement new protocol stacks for embedded systems. Implementing new protocol stacks on embedded systems involves a significant design space, but this space has not been explored in detail. In this paper, we aim to explore the design space of network protocol stacks for special-purpose embedded systems. We survey several design choices very carefully so that we can choose the best design for a given network with respect to performance, portability, complexity, and flexibility. More precisely, we discuss design alternatives for implementing new network protocol stacks over embedded operating systems, methodologies for verifying the network protocols, and designs for network gateways. Moreover, we perform case studies for the design alternatives and methodologies discussed in this paper.

Keywords: Embedded Networks, Embedded Operating Systems, Network Protocol Stacks, Formal Verification, Protocol Verification, Network Gateway.
Therefore, we need to consider several design choices very carefully so that we can choose the best design for a given network with respect to performance, portability, complexity, and flexibility. In this paper, we present various design alternatives and compare them in several aspects. Moreover, we perform case studies for the design alternatives. The rest of the paper is organized as follows: Section 2 discusses the design alternatives for implementing new network protocol stacks over embedded operating systems. Section 3 describes the methodologies for verifying the network protocols. Section 4 addresses the network interoperability issue and discusses designs for network gateways. Finally, we conclude the paper in Section 5.
2 Protocol Stacks on Embedded Nodes

In this section, we explore the design and implementation alternatives of network protocol stacks on embedded nodes. The designs can depend highly on operating systems and their task models, but we try to generalize this discussion as much as possible so that the designs described can be applied to most embedded operating systems. One of the most important issues when implementing new network protocol stacks is who takes charge of multiplexing and demultiplexing of network packets. Accordingly, we classify the possible designs into two: i) user-level design and ii) kernel-level design.

2.1 User-Level Design

In this design alternative, the protocol stacks are implemented as a user-level thread or process, which performs (de)multiplexing across networking tasks. The user-level protocol stacks can be portable across several embedded operating systems as far as they follow standard interfaces such as POSIX. The overall designs are shown in Figure 1. As we have mentioned, the way to implement new network protocol stacks depends on the task models of operating systems. Many embedded operating systems such as VxWorks [18] and uC/OS-II [19] define thread-based tasks on top of a flat memory model, in which the user-level protocol stacks are implemented as a user thread. On the other hand, some other embedded operating systems such as Embedded Linux and QNX [20] define isolated memory spaces between tasks. In such systems, the user-level protocol stacks are implemented as a user process in general. Though most of these process-based task models also support multiple threads, the design of process-based protocol stacks is still attractive. This is because, in this task model, if we implement the protocol stacks as a thread, they can only support the threads belonging to the same process. That is, thread-based protocol stacks over process-based task models are not suitable for supporting multiple networking processes.
In either the thread- or process-based design, the protocol stacks send the network packets by accessing the network device driver directly. Thus the device drivers have to provide interfaces (e.g., APIs or system calls) for the network protocol stacks. The user-level tasks request the protocol stacks to send their packets through Inter-Process Communication (IPC). In the case of the thread-based design, since the protocol stacks share the memory space with other tasks, the network data can be accessed directly from the protocol thread without a data copy, as far as the synchronization is
guaranteed by an IPC mechanism such as a semaphore. On the other hand, the process-based protocol stacks need to pass the network data between the networking tasks and the protocol stacks explicitly by using an IPC mechanism such as a message queue. This can add message passing overhead because the messaging IPCs usually require memory copy operations to move data between two different memory spaces. On the receiver side, the operation is similar to the sender side; however, there is a critical design issue of how to detect incoming new packets. Since the protocol stacks are implemented at the user level, there is no proper asynchronous signaling mechanism at the device driver to notify the user-level protocol stacks of a new packet arrival. Thus, the interfaces provided by the device driver are the only way to check for new packet arrivals. However, if the interface has blocking semantics, then the protocol stacks cannot handle other requests (e.g., sending requests) from the tasks while waiting for a new packet to arrive. There are two solutions to overcome this issue. One is to use an asynchronous interface and the other is to have multithreaded protocol stacks. The asynchronous interface is easy to use, but it is hard to come up with an optimal strategy for calling the interface in terms of when and how frequently. Thus it is likely to achieve lower performance than what the actual network can provide or to waste processor resources. Instead, the multithreaded protocol stacks can have separate threads to serve the sending and receiving operations respectively. That is, for both the thread- and process-based designs, the protocol stacks consist of a set of threads. The only difference is that the multiple threads belong to the same process in the case of the process-based design. The receiving thread can block waiting for a new packet while the sending thread handles the requests from the tasks.
Once a new packet has been recognized by returning from the blocked function, the receiving thread of the protocol stacks interprets the header and passes the packet to the corresponding process through an IPC. Since the protocol stacks are implemented at the user level, they are scheduled like other user-level tasks by the task scheduler. If we give the protocol stacks the same priority as other tasks, the execution of the protocol stacks can get delayed, which results in high network latency. Thus it is desirable that the protocol stacks have higher priority than general user-level tasks and block waiting for newly received packets or sending requests, which allows other tasks to utilize the processor resources if there are no pending jobs for the protocol stacks. As a case study, we have implemented the Network Service of Media Oriented System Transport (MOST) [1] at the user level over Linux [2]. MOST is an automotive high-speed network supporting multimedia data streaming. The current MOST standard specifies 25-150 Mbps network bandwidth with QoS support. To meet the demands of various automotive applications, MOST provides three different message channels: control, stream, and packet message channels. Network Service is the transport protocol for the control messages, which covers layer 3 through parts of layer 7 of the OSI 7-layer model. In order to implement Network Service, we have applied the process-based design, where the protocol stacks consist of sending and receiving threads. We have utilized the ioctl() system call to provide interfaces between the protocol stacks and the device driver. We have also implemented a library for applications, which provides interfaces to interact with the protocol stacks using a POSIX message queue. The performance results show a one-way latency of 0.9 ms with an 8-byte control message.
2.2 Kernel-Level Design

In this design alternative, the protocol stacks are implemented as a part of the operating system. Thus we do not need to move data between the device driver and the protocol stacks, because both use the kernel memory space and can share the network buffer. In addition, since the kernel context has higher priority than the user context, the kernel-level protocol stacks can guarantee the network performance. Accordingly, this design has more potential for achieving better performance than the user-level protocol stacks. However, it may require modifications of the kernel, which are not portable across operating systems. As shown in Figure 2, we classify the kernel-level design into the bottom half based design and the device driver based design, according to where the protocol stacks are implemented (especially for the receiver side). The traditional protocol stacks are generally implemented as a bottom half. In such a design, when a packet has been received from the network controller, the interrupt handler simply appends it to a queue shared with the bottom half. Then the bottom half takes care of most of the protocol processing, including demultiplexing. The bottom half is scheduled by the interrupt handler when there are no interrupts to be processed. On the other hand, in the device driver based design, the entire protocol stacks are implemented in the device driver. Therefore, if the protocol stacks are heavyweight like TCP/IP, the device driver based design may not be suitable. In the case of the kernel-level design, the user tasks request a sending operation through a system call. The system call eventually passes the request to the device driver. On the sender side, the main difference between the two design alternatives is that, in the case of the bottom half based design, the kernel performs most of the protocol processing before passing down the user request to the device driver.
It is to be noted that the data copy operation between the user and kernel spaces should be carefully designed. With either a synchronous or an asynchronous interface, we can copy the user data into the kernel and return immediately; however, this results in copy overhead. On the contrary, we can avoid the copy operation by delaying the notification of completion, but this can hinder the application's progress.
Fig. 2. Kernel-level protocol stacks: (a) bottom half based design and (b) device driver based design
On the receiver side, once a new packet comes in from the network controller, the interrupt handler performs urgent processing before passing it to the upper layer. In the bottom half based design, as we have mentioned earlier, the bottom half takes care of interpreting the header and demultiplexing. Some operating systems such as Embedded Linux provide an interface to insert a new bottom half (more precisely, a tasklet in Embedded Linux) without kernel modification. Microkernel-based operating systems such as QNX also allow adding new protocol stacks in a similar manner. In the device driver based design, the bottom half is not taken into account at all. In this design alternative, the protocol stacks are implemented in the system call and the interrupt handler. The distribution of weight between the system call and the interrupt handler can vary in terms of which does more protocol processing, but usually the interrupt handler does the majority of it. This is because doing the demultiplexing in the interrupt handler is more efficient. Otherwise, the system call needs to search the incoming packet queue internally, which requires exhaustive searching time and locking overhead between tasks. However, doing more work in the interrupt handler is not desirable because it is supposed to finish its work very quickly. Therefore, this design is valuable when the overhead of protocol processing is low. As a case study of the kernel-level design, we have implemented a device driver based protocol called RTDiP (Real-Time Direct Protocol) in Embedded Linux over Ethernet [3, 4]. RTDiP is a new transport protocol that can provide priority-aware communication, communication semantics for synchronization, and low communication overhead. In the synchronous semantics, the communication protocols do not queue the packets but keep only the last packet received, which is suitable for distributed synchronization over relatively small-area embedded networks.
The performance results show that RTDiP achieves a one-way latency of 48 us with an 8-byte message and provides better overhead prediction. We are currently implementing RTDiP over Controller Area Network (CAN) as well. In addition, we plan to implement it on other embedded operating systems such as QNX.
3 Verification Methodologies

Protocol verification [5] is an activity to assure the correctness of network communication protocols. The design alternatives we have studied in Section 2 should be verified
thoroughly before proceeding to the implementation. Formal verification is known as a prominent but costly technique. This section introduces formal verification techniques for verifying network protocol stacks. We briefly overview formal verification techniques and then review them from the perspective of network protocol stack verification. We then share our experience of verifying the protocol stacks of a system air conditioning system.

3.1 Formal Verification

Formal verification and formal specification together are called formal methods [6]. Formal specification [7] is a technique for specifying a system on the basis of mathematics and logic. It has various techniques and notations, e.g., algebra, logic, tables, graphics, and automata. After completing the formal specification, we can apply formal verification techniques to the specification to prove that the system satisfies the required properties. There are two main approaches to formal verification: deductive reasoning and algorithmic verification. Deductive reasoning is a verification methodology using axioms and proof rules to establish the reasoning. Experts construct the proofs by hand, and this usually requires greater expertise in mathematics and logic. Even though tools called theorem provers have been developed to provide a certain degree of automation, the inherent characteristics of deductive reasoning make it difficult to use widely for verifying recent network protocol stacks. The second methodology is algorithmic verification, usually called model checking [8]. Model checking is a technique for verifying finite-state systems by exhaustively searching the entire state space to check whether a specified correctness condition is satisfied. It is carried out automatically, almost without any intervention by experts, but it is restricted to the verification of finite-state systems. Deductive reasoning, on the other hand, has no such limitation.
With respect to protocol verification, model checking is more efficient and cost-effective than theorem proving. Theorem proving's main drawback, requiring considerable expertise, makes model checking techniques better suited for protocol stack verification. Indeed, as the performance of model checking techniques has increased rapidly, they can now perform various verifications more efficiently than when they were first proposed.

3.2 Formal Verification Techniques for Network Protocol Stacks

The formal verification techniques for network protocol stacks fall into several categories. General-purpose model checkers such as Cadence SMV [9] and SPIN [10] can verify protocols efficiently. General-purpose verification tools such as UPPAAL [11] are useful too. We can also use specialized protocol analysis tools (e.g., Meadows' NRL [12] and Cohen's TAPS [13]). A formal specification should be prepared before conducting formal verification. Finite State Machine (FSM) based formal specification techniques have been widely used for specifying network protocols and stacks. An FSM mainly consists of a set of transition rules. In the traditional FSM model, the environment of the FSM consists of two finite and disjoint sets of signals, input signals and output signals. A number of papers
using FSM-based formal specification have been reported. In particular, network protocols can be well specified using communicating FSMs or extended FSMs, as reported in [14, 15]. With respect to the formal verification of network protocol stacks, we have to consider two tasks: specification of the protocol stacks and modeling of the system implementing the protocol. In the first step, we have to model the protocol algorithm and stack hierarchy using a formal specification method. Then the modeling of the whole embedded network system, which includes the implementations of the protocol stacks, can proceed. Therefore, verifying the network protocol stacks requires formally specifying not only the protocol stacks but also the encompassing environment where the protocol stacks are implemented and used. Formal verification of network protocol stacks depends entirely on the formal specification developed beforehand. If we use FSM-based formal specifications (e.g., Statecharts [16] and the Specification and Description Language (SDL) [17]), most general-purpose model checkers are available. In cases where exact timing constraints should be preserved, a timed automata based formal specification as used by UPPAAL is a good choice. We can also use specialized protocol verification tools, but it is not easy to model the whole system with them. Therefore, the combination of FSM-based formal specification and general-purpose model checking tools will be more effective than the others.

3.3 SDL-Based Verification of Protocol Stacks

SDL is a formal specification language and tool suite widely used to model systems consisting of a number of independently communicating subsystems. An SDL specification can be translated into FSM form and then used as an input for general-purpose model checkers such as SMV and SPIN. Figure 3 describes the architecture of the system air conditioning system. We performed the formal verification of the
Fig. 3. The architecture of system air conditioning system
network protocol between distributed controllers called DMSs (Distributed Management Systems) and a personal controller called an MFC (Multi-Function Controller). A DMS manages all indoor air conditioners, outdoor compressors, and network routers under its control. An MFC is a touch-screen based personal controller like a PDA. In our experience, a special-purpose embedded network system such as the above can be well specified with SDL and verified formally through general-purpose model checkers such as SPIN. We implemented an automatic translator from SDL into PROMELA, SPIN's input language, and conducted SPIN model checking. We verified several properties, categorized as feasibility tests, responsiveness, environmental assumptions, and consistency checking. In addition to the SPIN model checking, the SDL tool has its own validation tool, which checks syntax errors and the completeness of the specification.
4 Network Interoperability

Since various network interconnects can be utilized in a distributed embedded system, network interoperability is a critical requirement in such a system. For example, in modern automobile systems, several network interconnects such as CAN, LIN, FlexRay, and MOST are widely used in an integrated manner. In such systems, we need a gateway for interoperation between different networks [2, 21, 22], similar to bridges or routers on the Internet. Thus the gateway needs to understand several network protocols and convert one into another. In this section, we explore the design alternatives for embedded network gateways. In particular, we classify the gateway designs into two, based on how the operating system is run on the gateway.

4.1 Single OS Based Gateway

In this design alternative, the gateway architecture has a single MCU or multiple homogeneous Micro-Controller Units (MCUs) that run a single operating system image. The MCU can include the network controllers for the different network interconnects supported by the gateway, or it can be connected with network controllers on the same board through buses such as Serial Peripheral Interface (SPI), Inter-Integrated Circuit (I2C), etc. The protocol stacks can be designed and implemented following any of the design choices described in Section 2, but a layer of the protocol stacks is required to perform the gateway functions. If the network layer performs the gateway functions, it can be transparent to the networking processes running on the embedded nodes. The network protocols for embedded systems, however, usually have no strict distinction between the network and transport layers, because their network layers are not supposed to allow arbitrary transport layers, while the Internet Protocol (IP) layer does.
In addition, even if the gateway performs the protocol conversion at the network layer, in many cases it is hard to preserve the end-to-end communication semantics due to significant differences between the transport protocols of embedded networks.
Fig. 4. Single OS based gateway: (a) transport layer based design and (b) global addressing layer based design
Another solution is to introduce a gateway module at the transport layer, as shown in Figure 4(a). In this design alternative, the gateway has to manage protocol conversion tables that map between the message headers of both the network and transport layers for the different networks. Since the transport layer translates the protocols internally, legacy applications do not need any modifications. A drawback of this design is its limited scalability. The number of possible header patterns can be numerous in some embedded systems, which can result in memory shortage on the gateway node. Therefore, this design is useful only when the number of entries in the protocol conversion tables is predictable. Fortunately, in many embedded systems, we can figure out at the design phase the number of embedded devices that need to collaborate (i.e., communicate) with each other across different networks. We can also add a new layer on top of the transport layer, as shown in Figure 4(b). The new layer defines global addressing and APIs for the applications to access the layer. If the gateway uses global addressing, the networking processes on every embedded node have to be aware of it. Thus the applications need to be modified, but once this is done, they can run transparently on any network into which the additional layer is inserted. In this case, the gateway only has to manage the routing table, and thus the scalability in terms of memory space can be better than in the previous design. However, if most of the embedded nodes perform intra-network communication, the overhead of the additional layering can harm performance. Therefore, the decision among the design alternatives described can vary based on the system requirements and characteristics. As a case study of a single OS gateway, we have implemented a gateway between MOST and CAN networks based on the transport layer based design [2]. In this case study, we utilize the MOST Network Service implemented in Section 2.1.
The communication semantics of MOST control messages are very different from the traditional send/receive semantics. A MOST control message invokes a function on a MOST device. CAN, however, does not provide such communication semantics, while it provides multicast-like communication semantics, which are absent from the MOST Network Service. Thus, simple message forwarding with address translation at the network layer does not work. To provide transparent conversion of communication semantics, we have suggested a gateway module. In addition, we have implemented the protocol conversion table and defined some entries for performance measurement. The performance results show that the suggested design adds hardly any additional
overhead, which is about 15% of the pure communication overhead, and can deliver control messages very efficiently.

4.2 Multi-OS Based Gateway

Since the embedded nodes on different networks can have different requirements, the desirable operating systems can vary. For example, the automobile gateway node can have many kinds of peripheral interfaces, such as USB and wireless network, for supporting infotainment applications over MOST. Therefore, an operating system with rich device driver support, such as Embedded Linux, is highly desirable. On the other hand, the electronic units such as chassis, powertrain, and body controllers connected to CAN or LIN demand guaranteed real-time behavior, and thus an RTOS is desirable. Since the gateway needs to meet such various requirements, we can consider having multiple operating systems on the gateway node. The address translation issue discussed in Section 4.1 still applies in a similar manner in this design alternative. However, an efficient scheme for communication between the operating systems has to be taken into account. A gateway node can be equipped with multiple heterogeneous MCUs that have different network controllers, as shown in Figure 5(a). Each MCU can run its own operating system that satisfies the requirements of the networks it is responsible for. The MCUs on the gateway node can collaborate by communicating with each other through a bus or a shared memory module. Since a single MCU may not have all the network controllers required for a specific embedded system, several MCUs may be needed, which makes the connection architecture between the MCUs very complicated. Thus the architecture based on multiple MCUs can be applied only in limited cases. Another approach is to exploit virtualization technology, which allows running several operating systems on the same MCU, as shown in Figure 5(b). Virtualization technology can isolate system faults from propagating to other operating systems and provide service assurance.
In addition, state-of-the-art virtualization technologies enable low-overhead virtualization and better resource scheduling, which lead to high scalability. Beyond the existing optimization technologies, a lighter-weight I/O virtualization can be suggested, because the network controllers on the gateway
Fig. 5. Multi-OS based gateway: (a) multiple MCUs based design and (b) system virtualization based design
250
H.-W. Jin and J. Yoo
node may not need to be shared between operating systems. An important issue is how efficiently the operating system domains can communicate with each other. In general, the portion of inter-domain communication on a virtualized node is small compared with inter-node communication. On the gateway node, however, many network messages cause inter-domain communication because they must be forwarded to another network interface, which may be managed by a different operating system domain. As a case study of a gateway with multiple operating systems, we are implementing a MOST-CAN gateway using the virtualization technology provided by Adeos [23]. Adeos provides a flexible environment for sharing hardware resources among multiple operating systems by forwarding hardware events to the appropriate operating system domain. We run Linux and Xenomai [24], a parasitic operating system for Linux, over Adeos. The Linux operating system takes charge of the MOST interface while Xenomai handles the CAN interface. The gateway processes run on each operating system and communicate with each other through the inter-domain communication interface provided by Xenomai. Since the protocol stacks for MOST and CAN run on different operating systems, we perform the protocol conversion above the transport layer, but we do not use global addressing. Instead, we define a protocol conversion table that maps network connections across the different networks.
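The protocol conversion table just described can be sketched as a pair of lookups between transport-level connections on the two networks. The class name, the shape of the MOST connection label, and the CAN identifier below are illustrative assumptions, not the actual gateway's data structures.

```python
# Hypothetical sketch of a MOST<->CAN protocol conversion table:
# instead of global addressing, the gateway maps transport-level
# connections on one network to connections on the other.

class ConversionTable:
    def __init__(self):
        self._most_to_can = {}   # MOST connection label -> CAN identifier
        self._can_to_most = {}   # CAN identifier -> MOST connection label

    def add_mapping(self, most_conn, can_id):
        self._most_to_can[most_conn] = can_id
        self._can_to_most[can_id] = most_conn

    def most_to_can(self, most_conn):
        return self._most_to_can[most_conn]

    def can_to_most(self, can_id):
        return self._can_to_most[can_id]

# The Linux domain (MOST side) and the Xenomai domain (CAN side) would
# each consult this table when forwarding a payload across the
# inter-domain communication interface.
table = ConversionTable()
table.add_mapping(most_conn=("node7", "channel2"), can_id=0x120)

def forward_most_message(most_conn, payload):
    """Convert a MOST transport message into a (CAN id, payload) pair."""
    return (table.most_to_can(most_conn), payload)

print(forward_most_message(("node7", "channel2"), b"\x01\x02"))
```

Because the table is symmetric, the Xenomai side can use `can_to_most` to forward CAN frames in the opposite direction without either side knowing the other network's addressing scheme.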
5 Conclusions

In this paper, we have explored the design space of network protocol stacks for special-purpose embedded systems. We have surveyed several design choices carefully so that the best design can be chosen for a given network with respect to performance, portability, complexity, and flexibility. More precisely, we have discussed design alternatives for implementing new network protocol stacks over embedded operating systems, methodologies for verifying network protocols, and designs for network gateways. Moreover, we have performed case studies for the design alternatives and methodologies.

Acknowledgments. This work was partly supported by grants #NIPA-2009-C1090-0902-0026 and #NIPA-2009-C1090-0903-0004 from the MKE (Ministry of Knowledge Economy) under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency), and by grant #R33-2008-000-10068-0 from MEST (Ministry of Education, Science and Technology) under the WCU (World Class University) support program.
References

1. MOST Cooperation: MOST Specification, Rev 3.0 (2008)
2. Lee, M.-Y., Chung, S.-M., Jin, H.-W.: Automotive Network Gateway to Control Electronic Units through MOST Network (2009) (under review)
3. Lee, S.-H., Jin, H.-W.: Real-Time Communication Support for Embedded Linux over Ethernet. In: International Conference on Embedded Systems and Applications (ESA 2008), pp. 239–245 (2008)
4. Lee, S.-H., Jin, H.-W.: Communication Primitives for Real-Time Distributed Synchronization over Small Area Networks. In: IEEE International Symposium on Object/component/service-oriented Real-Time distributed Computing (ISORC 2009), pp. 206–210 (2009)
5. Palmer, J.W., Sabnani, K.: A Survey of Protocol Verification Techniques. In: Military Communications Conference - Communications-Computers, pp. 1.5.1–1.5.5 (1986)
6. Peled, D.: Software Reliability Methods. Springer, Heidelberg (2001)
7. Wing, J.M.: A Specifier's Introduction to Formal Methods. IEEE Computer 23(9) (1990)
8. Clarke, E., Grumberg, O., Peled, D.: Model Checking. MIT Press, Cambridge (1999)
9. SMV, http://w2.cadence.com/webforms/cbl_software/index.aspx
10. SPIN, http://spinroot.com/spin/whatispin.html
11. UPPAAL, http://www.uppaal.com/
12. Meadows, C.: Analysis of the Internet Key Exchange Protocol Using the NRL Protocol Analyzer. In: SSP 1999, pp. 216–231 (1999)
13. Cohen, E.: TAPS: A First-Order Verifier for Cryptographic Protocols. In: 13th IEEE Computer Security Foundations Workshop, pp. 144–158 (2000)
14. Aggarwal, S., Kurshan, R.P., Sabnani, K.: A Calculus for Protocol Specification and Verification. In: Int. Workshop on Protocol Specification, Testing and Verification (1983)
15. Sabnani, K., Wolper, P., Lapone, A.: An Algorithmic Procedure for Protocol Verification. In: Globecom (1985)
16. Harel, D.: Statecharts: A Visual Formalism for Complex Systems. Science of Computer Programming 8, 231–274 (1987)
17. SDL, http://www.telelogic.com/products/sdl/index.cfm
18. Wind River, http://windriver.com
19. Labrosse, J.: MicroC/OS-II: The Real-Time Kernel. CMP Books (1998)
20. QNX Software Systems, http://www.qnx.com
21. Hergenhan, A., Heiser, G.: Operating Systems Technology for Converged ECUs. In: 7th Embedded Security in Cars Conference (2008)
22. Obermaisser, R.: Formal Specification of Gateways in Integrated Architectures. In: Brinkschulte, U., Givargis, T., Russo, S. (eds.) SEUS 2008. LNCS, vol. 5287, pp. 34–45. Springer, Heidelberg (2008)
23. Yaghmour, K.: Adaptive Domain Environment for Operating Systems (2001), http://www.opersys.com/adeos
24. Xenomai, http://www.xenomai.org
HiperSense: An Integrated System for Dense Wireless Sensing and Massively Scalable Data Visualization

Pai H. Chou 1,2,4, Chong-Jing Chen 1,2, Stephen F. Jenks 2,3, and Sung-Jin Kim 3

1 Center for Embedded Computer Systems, University of California, Irvine, CA
2 Electrical Engineering and Computer Science, University of California, Irvine, CA
3 California Institute for Telecommunications and Information Technology, Irvine, CA
4 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Abstract. HiperSense is a system for sensing and data visualization. Its sensing part comprises a heterogeneous wireless sensor network (WSN) enabled by infrastructure support for handoff and bridging. Handoff support enables simple, densely deployed, low-complexity, ultra-compact wireless sensor nodes operating at non-trivial data rates to achieve mobility by connecting to different gateways automatically. Bridging between multiple WSN standards is achieved by creating virtual identities on the gateways. The gateways deliver collected data over Fast Ethernet for post-processing and visualization. Data visualization is done on HIPerWall, a 200-megapixel display wall consisting of 5 rows by 10 columns of 30-inch displays. The system is designed to minimize complexity on the sensor nodes while retaining high flexibility and high scalability.
1 Introduction
Treating the physical world as part of the cyber infrastructure is no longer just a desirable feature. Cyber-physical systems (CPS) are now the mandate of many national funding agencies worldwide. CPS entails more than merely interfacing with the physical world. The goal is to form synergy between the cyber and the physical worlds by enabling cross-pollination of many more features. A wireless sensor network (WSN) is an example of a cyber-physical interface. A sensor converts a physical signal into a quantity that enables further processing and interpretation by a computing machine. However, it is still mostly an interface, rather than a system in the CPS sense. Most WSNs today lack the cyber part, which would leverage the vast amount of information available on the network to synthesize new views of data in ways never possible before. An example of a system that is one step towards CPS is SensorMap [1], which offers a GoogleEarth-style application augmented with sensor data collected at the corresponding positions. SensorMap provides an abstraction in the form of the sensor-to-map programming interface (API), so that data providers can

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 252–263, 2009. © IFIP International Federation for Information Processing 2009
HiperSense: An Integrated System
253
leverage the powerful cloud-computing backend without having to re-invent yet another tool for each highly specialized application. However, data visualization can be much more than merely superimposing data on geographical or topological maps by a cloud-computing system to be rendered on a personal computer. In fact, the emergence of large, high-resolution displays, high-performance workstations, and high-speed interconnection interfaces gives rise to large display walls as state-of-the-art visualization systems. An example is the HIPerWall, a 200-megapixel tiled screen driven by 25 PowerMac G5s interconnected by two high-speed networks [2, 3]. Such a system has found use in ultra-high-resolution medical imaging and appears to be a prime candidate for visualization of a wide variety of sensor data as well. This paper describes work in progress on such a massive-scale sensing and visualization system, called HiperSense. On the sensing side, we aim to further develop a scalable support infrastructure, called EcoPlex [4]. It consists of a tiered network, where the upper tier contains the gateways and the lower tier includes the sensor nodes. The gateways support handoff for mobility and bridging of identity for integrating heterogeneous radio and networking standards. On the data visualization side, we feed the data to the HIPerWall, which can then render the data in an interactive form across 50 screens acting as one logical screen. This paper reports on the technologies developed to date and discusses practical issues that we have encountered.
2 Related Work
Several multiple-access protocols that use multiple frequency channels have been proposed for wireless sensor networks [5, 6]. Some have been evaluated only by simulation, while others have been adopted by researchers in the ad hoc networking domain. Many protocols for WSNs have been implemented on the popular MicaZ or TelosB platforms. Y-MAC [7] is a TDMA-based protocol that schedules the receivers in the neighborhood by assigning available receiving time slots. Its lightweight channel-hopping mechanism can enable multiple pairs of nodes to communicate simultaneously, and it has been implemented on the RETOS operating system running on a TmoteSky-class sensor node. However, a problem with Y-MAC is its relatively low performance due to time synchronization and channel-switching overhead. Based on previous experimental results from another team [8] on the Chipcon CC2420 radio transceiver, which implements the IEEE 802.15.4 MAC as used on MicaZ and TelosB, the channel-switching time is nearly equal to the time it takes to transmit a 32-byte data packet. Therefore, changing to another frequency channel dynamically and frequently can become a damper on system performance. Le et al. proposed and implemented another multi-channel protocol on the MicaZ [8]. Their protocol design does not require time synchronization among the sensor nodes. They also take the channel-switching time into consideration for sensor nodes that are equipped with only a half-duplex radio transceiver. A
254
P.H. Chou et al.
distributed control algorithm is designed to assign accessible frequency channels to each sensor node dynamically to minimize the frequency-changing time. The compiled code is around 9.5 KB in ROM and 0.7 KB in RAM on top of TinyOS. Although it is smaller than other solutions, it is still too big to fit in either the Eco or µPart sensor node, both of which have much smaller RAM and ROM. After data collection, displaying sensor data in a meaningful way in real time is another emerging issue. Several solutions have been proposed to integrate sensing data with a geographic map [9, 10]. SensorMap [1] from Microsoft Research displays sensor data from SenseWeb [11] on a map interface and provides tools to query sensor nodes and visualize sensing data in real time. Google Maps with traffic can show not only live traffic but also historical traffic by day and time [12]. However, these works currently assume limited screen resolution and have not been scaled to the 200-megapixel resolution of the HIPerWall.
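The channel-switching claim cited above can be sanity-checked with a short calculation: IEEE 802.15.4 in the 2.4 GHz band runs at 250 kbit/s, so a 32-byte payload takes about 1 ms on the air, which is the same order as the reported CC2420 channel-switching time. The bitrate figure is the standard's, not a new measurement.

```python
# Back-of-the-envelope check: airtime of a 32-byte 802.15.4 payload.
BITRATE_BPS = 250_000       # IEEE 802.15.4, 2.4 GHz band
PACKET_BYTES = 32

tx_time_ms = PACKET_BYTES * 8 / BITRATE_BPS * 1000
print(f"32-byte packet airtime: {tx_time_ms:.3f} ms")  # about 1.024 ms
```

A channel switch that costs roughly one packet time halves the effective throughput of a node that must hop on every packet, which is why frequent dynamic hopping dampens performance.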
3 System Architecture
Fig. 1 shows the architecture of HiperSense. It consists of HIPerWall as the visualization subsystem and EcoPlex as the sensing infrastructure. This section summarizes each subsystem and their interconnection.

3.1 HIPerWall
HIPerWall is a tiled display system for interactive visualization, as shown in Fig. 2(d). The version we use consists of fifty 30-inch LCD monitors arranged in five rows by ten columns. Each monitor has a pixel count of 2560 × 1600 at 100 dots per inch, and therefore the entire HIPerWall has a total resolution of 204.8 million pixels. The tiled display system is driven by 25 PowerMac G5 workstations interconnected by two networks to form a high-performance computing cluster.

Fig. 1. The HiperSense architecture: 50 tiled displays (5 rows × 10 columns) driven by 25 PowerMac G5s, connected by Myrinet and Gigabit Ethernet to a front-end node; EZ-Gates (gateways for ZigBee and Eco nodes, with Ethernet uplinks) connect ZigBee meshes and Eco nodes to the front-end node over Fast Ethernet

One network uses Myrinet for very high-speed
Fig. 2. Components of HiperSense: (a) Eco Node; (b) Base Station; (c) EZ-Gate; (d) HIPerWall
data transfer, and the other network uses Gigabit Ethernet for control. The HIPerWall software is portable to other processors and operating systems, and it can be configured for a wide variety of screen dimensions. A user controls the entire HIPerWall from a separate computer called the front-end node. It contains a reduced view of the entire display wall, enabling the user to manipulate the display across several screens at a time. The front-end node also serves as the interface between the sensing subsystem and the visualization subsystem.

3.2 EcoPlex
EcoPlex is a tiered, infrastructure-based heterogeneous wireless sensor network system. Details of EcoPlex can be found in another paper [4]; here we highlight the distinguishing features. The bottom tier consists of the wireless sensor nodes; the top tier consists of a network of gateway nodes.

Lower Tier: Sensor Nodes. EcoPlex currently supports two types of nodes with different communication protocols: ZigBee and Eco. Our platform allows other protocols such as Bluetooth and Z-Wave to be bridged without any inherent difficulty. ZigBee is a wireless networking protocol standard primarily targeting low-duty-cycle wireless sensing applications [13]. In recent years, it has also been targeting the home automation domain. ZigBee is a network protocol that supports ad hoc mesh networking, although it also defines roles for not only end devices but also
routers and coordinators. ZigBee is built on top of the IEEE 802.15.4 media access control (MAC) layer, which is based on carrier-sense multiple access with collision avoidance (CSMA/CA). Currently, many wireless sensor applications are built on top of 802.15.4, though not necessarily with the ZigBee protocol stack, since the stack occupies about 64–96 KB of program memory. Another type of wireless sensor node supported in EcoPlex is Eco [14, 15], our own ultra-compact wireless sensing platform, shown in Fig. 2(a). It is 1 cm³ in volume including the MCU, RF, antenna, and sensor devices. It contains 4 KB RAM and 4 KB EEPROM. The radio is based on Nordic VLSI's ShockBurst protocol at 1 Mbps, a predecessor of Wibree, also known as Bluetooth Low Energy Technology, a subset of Bluetooth 3.0 [16]. Eco is possibly the world's smallest self-contained, programmable, expandable platform to date. A flex-PCB connector enables an Eco node to be connected to other I/O devices and power. Eco is meant to complement ZigBee in that Eco nodes can be made much smaller and cheaper than ZigBee ones, and thus they can be deployed where ZigBee nodes cannot, especially in some highly wearable or size-constrained applications.

Upper Tier: Gateways. The upper tier of EcoPlex consists of a network of gateway nodes called EZ-Gates. An EZ-Gate is essentially a Fast Ethernet router based on an ARM-9-core network processor running Linux 2.6. It is augmented with the radio transceivers needed to support the protocols used by the wireless sensor nodes. In this case, one ZigBee transceiver and two Eco transceivers are added to each EZ-Gate. For ZigBee support, the EZ-Gate implements the protocol stack of a ZigBee coordinator. Since Eco is more resource-constrained and can afford to implement only a much simpler protocol, the gateway provides relatively more support in the form of handoff and virtual identity.
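The virtual-identity mechanism can be sketched as follows: the owner gateway registers a ZigBee-space address for each Eco node, so ZigBee peers can address Eco nodes without the Eco nodes running a ZigBee stack. The class name, the address range, and the frame shapes below are our illustrative assumptions, not EZ-Gate internals.

```python
# Illustrative sketch of virtual-identity bridging on an EZ-Gate.
class EZGate:
    def __init__(self):
        self._virtual = {}       # virtual ZigBee short address -> Eco node id
        self._next_addr = 0x1000 # assumed start of the virtual address range

    def register_eco_node(self, eco_id):
        """Create a ZigBee-space identity for an Eco node joining EcoPlex."""
        addr = self._next_addr
        self._next_addr += 1
        self._virtual[addr] = eco_id
        return addr              # this address is what ZigBee peers see

    def deliver(self, zigbee_addr, payload):
        """Relay a ZigBee-addressed frame to the owning Eco node."""
        eco_id = self._virtual[zigbee_addr]
        return (eco_id, payload)  # would go out over the Eco radio

gate = EZGate()
addr = gate.register_eco_node(eco_id=42)
print(gate.deliver(addr, b"sample-request"))
```

From a ZigBee node's point of view, the Eco node is just another ZigBee peer; the gateway absorbs the entire heavy stack on its behalf.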
Eco nodes connect to the gateways and not to each other, the same way cellular phones connect to base stations but not to each other. Just as cellular towers perform handovers based on the proximity of the mobile, our EZ-Gates perform handoffs based on the link quality of the Eco nodes. Unlike cell phones, which are treated as independently operated units, EcoPlex supports the concept of clusters, which are groups of wireless sensor nodes that work together, are physically close to each other, and move as a group [17]. Instead of performing handoff for each node individually, cluster handoff relies on one node as a representative for the entire cluster and has been shown to be effective, especially in dense deployments. EZ-Gates also support bridging in the form of virtual identity. That is, for every Eco node connected to EcoPlex, the owner gateway maintains a node identity in the ZigBee space. This way, an Eco node appears just like any other ZigBee node and can communicate logically with other ZigBee nodes, without having to be burdened with the heavy ZigBee stack. A simpler base station without the handoff and virtual identity support was also developed. It is based on the Freescale DEMO9S12NE64 evaluation board connected to a Nordic nRF24L01 transceiver module, as shown in Fig. 2(b). It has a Fast Ethernet uplink to the front-end node. It was used for the purpose
of developing code between the Ethernet and Eco sides before porting to the EZ-Gate for final deployment.
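The link-quality-driven handoff with cluster support described above can be sketched as a small decision function: the representative node's per-gateway link quality decides the owning gateway for the whole cluster. The hysteresis margin and the numeric link-quality scale are invented for illustration; the real EZ-Gate policy may differ.

```python
# Minimal sketch of cluster handoff driven by link quality.
HYSTERESIS = 5  # assumed margin to avoid ping-ponging between gateways

def choose_gateway(current, link_quality):
    """Pick the owning gateway given {gateway: link quality} readings."""
    best = max(link_quality, key=link_quality.get)
    if current is not None and best != current:
        # only hand off if the new gateway is clearly better
        if link_quality[best] < link_quality.get(current, float("-inf")) + HYSTERESIS:
            return current
    return best

def cluster_handoff(current, representative_lq):
    # the representative's readings stand in for the whole cluster,
    # so one decision moves every node in the cluster at once
    return choose_gateway(current, representative_lq)

print(cluster_handoff("G1", {"G1": 40, "G2": 43}))  # stays on G1
print(cluster_handoff("G1", {"G1": 40, "G2": 48}))  # hands off to G2
```

Making one decision per cluster rather than per node is what keeps handoff cheap in dense deployments, where dozens of co-located nodes would otherwise each negotiate with the gateways.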
4 System Integration
HiperSense is more than merely connecting the EcoPlex sensing subsystem with the HIPerWall tiled display system. It entails the design of a communication and execution scheme to support the needs of sensing and visualization. This section first discusses considerations for HiperSense to support CPS-style visualization. Then, we describe the communication scheme for system integration.

4.1 Visualization Styles and Support
Visualization is the graphical rendering of data in a form that helps the user gain new insights into the system from which the data is collected. Unlike many other visualization systems that render only static data that has been collected and stored in advance, HiperSense is designed to support visualization of both static data files and live data streams. More importantly, we envision a visualization system that synthesizes views from static or live data and other sources available on the Internet. As an example, consider a WSN that collects vibration data from sensor nodes on a pipeline network. The user is not interested in vibration per se but wants to non-invasively measure the propagation speed of objects traveling inside the pipeline based on the peak vibration. In this case, time-history plots of raw vibration data are not so meaningful to the user; instead, the data streams must be compared and processed to derive the velocity. The visualization system then renders the velocity by superimposing color-coded regions over high-resolution images of the pipeline network and its surroundings. Moreover, the user may want the ability to navigate not only spatially, similar to GoogleEarth, but also temporally, by seeing how the peak velocity shifts over time. To support smooth spatial navigation, HIPerWall relies on replication or prefetching of large data (e.g., patches of GoogleEarth photos) from adjacent nodes. The data can be information in the local database system, images, or videos. The data shown on the screens are treated as independent objects and can be zoomed in, zoomed out, or rotated arbitrarily. The front-end node simply sends out commands to every computing node indicating which object should be displayed, where the object is positioned, and other properties to control the object. This mechanism reduces the traffic between the front-end node and the cluster of computing nodes.
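The pipeline example above can be made concrete: the propagation speed follows from the times at which the vibration peak passes two sensor junctions. The sensor spacing, sampling rate, and sample values below are invented for illustration.

```python
# Worked example: deriving propagation speed from peak-vibration times.
def peak_time(samples, t0, dt):
    """Return the timestamp of the largest-magnitude vibration sample."""
    idx = max(range(len(samples)), key=lambda i: abs(samples[i]))
    return t0 + idx * dt

def propagation_speed(distance_m, t_peak_a, t_peak_b):
    return distance_m / (t_peak_b - t_peak_a)

dt = 0.01                          # 100 Hz sampling, assumed
a = [0.1, 0.2, 3.0, 0.2]           # junction A: peak at t = 0.02 s
b = [0.1, 0.1, 0.2, 0.1, 2.8]      # junction B: peak at t = 0.04 s
v = propagation_speed(5.0, peak_time(a, 0.0, dt), peak_time(b, 0.0, dt))
print(f"propagation speed: {v:.0f} m/s")  # 5 m in 0.02 s -> 250 m/s
```

This is exactly the kind of cross-stream processing that raw time-history plots cannot convey: the derived velocity, not the vibration amplitude, is what gets color-coded onto the pipeline imagery.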
To support real-time access to data, supervisory control and data acquisition (SCADA) systems, which are commonly found in factories with up to thousands of sensing and actuation channels per site, have used real-time databases for logging and retrieval of data. A similar setup can be built for HiperSense. Historical data can also be retrieved from the real-time database via a uniform interface. However, one difference between a conventional SCADA system and
HiperSense is that the former is commonly handled by one or a small number of computers, while the latter relies on a cluster of 25 computers to parallelize the handling of the massive graphical bandwidth. For the purpose of tiled visualization of live sensor data, we program the front-end node of the HIPerWall to also be a fat client for data collection from wireless sensor nodes via the gateways in EcoPlex. The front-end node then broadcasts the collected data to the cluster of computing nodes inside HIPerWall. Every computing node decodes the whole data packet but shows only the portion that is visible on the two LCD screens that it controls. This broadcasting mechanism removes the need for time synchronization among the workstations and ensures that all sensing data can be shown on the tiled displays at the same time. If the backbone of the intranet and the cluster of workstations both support jumbo frames [18], we can increase the overall system performance and deploy more wireless sensor nodes at once.
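The broadcast-and-cull scheme just described can be sketched with the paper's wall geometry (5 × 10 tiles of 2560 × 1600, two tiles per computing node). The row-major tile-to-node assignment is our assumption for illustration; the actual HIPerWall mapping may differ.

```python
# Sketch: each node receives the full packet but renders only the
# region covered by its two horizontally adjacent tiles.
TILE_W, TILE_H = 2560, 1600
COLS, ROWS = 10, 5

assert TILE_W * TILE_H * COLS * ROWS == 204_800_000  # 204.8 megapixels

def node_region(node_id):
    """Pixel rectangle covered by one node's two tiles (assumed layout:
    node i drives tiles 2i and 2i+1, numbered row-major)."""
    first = node_id * 2
    row, col = divmod(first, COLS)
    return (col * TILE_W, row * TILE_H, 2 * TILE_W, TILE_H)

def visible(region, x, y):
    rx, ry, rw, rh = region
    return rx <= x < rx + rw and ry <= y < ry + rh

region = node_region(0)           # top-left node: two leftmost tiles
print(region)                     # (0, 0, 5120, 1600)
print(visible(region, 100, 100))  # True: this node renders the point
print(visible(region, 6000, 100)) # False: another node's tiles
```

Because every node can evaluate `visible` locally from the broadcast data, no per-node unicast or synchronization traffic is needed from the front-end node.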
4.2 Protocols for the Tiers
EcoPlex currently supports both ZigBee and Eco as two complementary wireless protocols. The ZigBee standard is designed for sporadic, low-bandwidth communication in an ad hoc mesh network, whereas Eco is capable of high-bandwidth, data-regular communication on ultra-compact hardware in a star network. Of course, it is possible for each platform to implement the other's characteristics, but they would be less efficient. For the purpose of integration with HiperSense, ZigBee does not pose a real problem due to its lack of timing constraints. We therefore concentrate our discussion on the integration of Eco nodes and gateways. On many wireless sensor nodes, the software complexity is dominated by the protocol stack. The code sizes of typical protocol stacks for Bluetooth, ZigBee, and Z-Wave are 128 KB, 64–96 KB, and 32 KB, respectively, whereas the main program of a sensing application can be as small as a few kilobytes. Our approach to minimizing complexity on the Eco nodes is to externalize it: we implement only the protocol-handling mechanisms on Eco and move most policies out to the infrastructure. This can be accomplished by making the nodes passive, with a thin-server, fat-client organization. That is, the sensor nodes are passive and the host actively pulls data from them. This effectively takes care of arbitration1 and provides effective acknowledgment. The core mechanism can be implemented in under 40 bytes of code. After adding support for multi-gateway handoff and joining, channel switching, and a number of other performance-enhancement policies (e.g., amortizing the pulling overhead by returning multiple packets), the code size can still be kept around 2 KB. Once the reliable communication primitives are in place, we can add another layer of software for dynamic code update and execution [19] on these nodes. Our vision is host-assisted dynamic compilation, where the host or its delegate dynamically generates optimized code to be loaded into the node for execution.
This will be much more energy efficient than a general-purpose protocol stack that must anticipate all possibilities. An example is the protocol stack

1 No intra-network collision unless there is a malfunctioning node.
for network routing. Since our gateway, as well as many commercially available gateways, runs Linux and has plenty of storage available, the gateways should also be able to run the synthesizer, compiler, and optimizer without difficulty. The front-end node of the HIPerWall acts as a client to query data from all gateways. Each gateway is connected to the front-end node via a wired interface with higher bandwidth than the wireless interface. At the beginning, all wireless sensor nodes communicate with the gateway via the control channel. The front-end node issues frequency-switching commands to the wireless sensor nodes based on the sampling rate of each wireless sensor node and the available bandwidth of each wireless frequency channel. Later on, the front-end node issues a command packet to the gateway to get data from the wireless sensor nodes controlled by that gateway. The gateway in turn broadcasts the command packet to all Eco sensor nodes on the gateway's own frequency channel. The gateway packs all pulled data together and forwards the data to the front-end node, which in turn broadcasts the data to the real-time databases for visualization. ZigBee nodes push data rather than being pulled. This way, the ZigBee network can coexist with the Eco network by sporadically taking up bandwidth only when necessary. For HiperSense, the front-end node could resend the command packet to a wireless sensor node if it does not get any reply packet within a certain amount of time. To improve system performance, however, we place the retransmission mechanism inside the gateways instead of at the front-end node. A gateway resends the pulling command packet that it received from the front-end node if it does not receive any reply packet from a wireless sensor node within a pre-defined timeout period.
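The pull-based scheme above can be sketched end to end: sensor nodes are passive thin servers, the gateway is the fat client that pulls on their behalf, and retransmission on timeout lives in the gateway rather than the front-end node. All class names, the retry bound, and the lossy-node model are illustrative assumptions, not the actual frame formats or firmware.

```python
# Minimal sketch of the thin-server pull with gateway-side retries.
class EcoNode:
    """Passive node: never initiates traffic, only answers pulls."""
    def __init__(self, node_id, lossy=0):
        self.node_id = node_id
        self.queue = []          # buffered sensor readings
        self._drop = lossy       # number of pulls to "lose" (demo only)

    def on_pull(self, count):
        if self._drop > 0:       # simulate a lost reply over the radio
            self._drop -= 1
            return None
        # return up to `count` packets per pull (amortizes pull overhead)
        out, self.queue = self.queue[:count], self.queue[count:]
        return out

class Gateway:
    """Fat client: pulls each node in turn; retries on timeout."""
    MAX_TRIES = 3                # assumed retry bound per pull

    def pull(self, node, count=1):
        for _ in range(self.MAX_TRIES):
            reply = node.on_pull(count)
            if reply is not None:
                return reply     # packed and forwarded to the front end
        return []                # report the node unreachable this round

node = EcoNode(7, lossy=1)
node.queue = [b"r1", b"r2", b"r3"]
gw = Gateway()
print(gw.pull(node, count=2))    # succeeds on the second try
```

Keeping the retry loop in the gateway means a single radio loss costs one wireless round trip, not a full Ethernet round trip back to the front-end node.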
5 Evaluation
This section presents experimental results on a preliminary version of HiperSense. The experimental setup consists of 100 Eco nodes (Fig. 3(a)) and two gateways densely deployed in an area of 2 to 16 m². The larger setup is for a miniature-scale water-pipe monitoring system, where nodes measure vibration at different junctions. The gateways are connected to the HIPerWall's front-end node (Fig. 3(b)) via Fast Ethernet. We compare the performance of our system with other works in terms of measured throughput, latency, and code size for different sets of included features.

5.1 Latency and Throughput
Fig. 4 shows the measured aggregate throughput and per-node latency for one and two gateways over different numbers of reply packets per pull. Returning more packets per pull amortizes the pulling overhead, though at the expense of increased latency. The lower and upper curves in each chart show the results for one and two gateways, respectively. In the case of one gateway, the throughput ranges from 6.9 KB/s for one reply packet per pull to 15.8 KB/s for 20 reply packets per pull, though the latency increases linearly from around 100 ms to over 1.5 seconds.
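The amortization effect can be captured by a simple model: each pull pays a fixed per-transaction overhead, so returning n packets per pull raises throughput toward an asymptote while latency grows roughly linearly with n. The two time constants below are fitted loosely to the one-gateway curve (about 6.9 KB/s at n = 1), not measured values.

```python
# Simple amortization model for throughput vs. reply packets per pull.
PAYLOAD_BYTES = 32
PACKET_TIME_S = 0.0018    # assumed per-packet air/handling time
PULL_OVERHEAD_S = 0.0028  # assumed fixed cost per pull transaction

def throughput_bps(n):
    """Aggregate bytes/s when each pull returns n reply packets."""
    return n * PAYLOAD_BYTES / (PULL_OVERHEAD_S + n * PACKET_TIME_S)

print(f"n=1:  {throughput_bps(1):7.0f} B/s")   # near the 6.9 KB/s point
print(f"n=20: {throughput_bps(20):7.0f} B/s")
# throughput saturates at PAYLOAD_BYTES / PACKET_TIME_S as n grows,
# while per-pull latency grows roughly as PULL_OVERHEAD_S + n * PACKET_TIME_S
```

The model explains the measured shape: most of the gain comes from the first few extra packets per pull, which is why the latency cost of very large n buys diminishing throughput.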
Fig. 3. Experimental Setup: (a) 100 Eco Nodes; (b) Front-End Node and HIPerWall

Fig. 4. Performance Results: (a) Throughput (Bytes/s) and (b) Latency (Response Time, ms), each plotted against the number of reply packets per pull (1–20), with one curve for 1 base station / 1 transceiver and one for 2 base stations / 2 transceivers
By doubling the number of gateways, the aggregate throughput ranges from 13.5 KB/s for one reply packet per pull to 27.9 KB/s for 20 reply packets per pull. In the latter case, the total throughput increases by 81% while the latency increases by only 9%. The data rate is rather low compared to the bandwidth within HIPerWall. With non-overlapping frequencies, EcoPlex can scale up to 50 nodes/channel × 124 channels = 6200 nodes with only 0.1% utilization of the bandwidth of Gigabit Ethernet, or 1% of Fast Ethernet.
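The scaling arithmetic above can be checked directly: the 6200-node total corresponds to 124 non-overlapping channels at 50 nodes each, and the stated Ethernet utilizations are consistent with an aggregate on the order of 1 Mbit/s. The aggregate-rate figure is our assumption for the check, not a measurement.

```python
# Arithmetic check of the scaling claim.
NODES_PER_CHANNEL = 50
CHANNELS = 124                   # non-overlapping channels, as implied
total_nodes = NODES_PER_CHANNEL * CHANNELS
print(total_nodes)               # 6200

AGGREGATE_BPS = 1_000_000        # assumed ~1 Mbit/s total sensor traffic
gige, fast = 1_000_000_000, 100_000_000
print(f"{AGGREGATE_BPS / gige:.1%} of Gigabit Ethernet")  # 0.1%
print(f"{AGGREGATE_BPS / fast:.1%} of Fast Ethernet")     # 1.0%
```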
5.2 Comparison
The closest work to ours in the area of communication protocols for wireless sensor nodes is the Multi-Channel MAC (MCM) protocol proposed by Le et al. from the Cyber-Physical Computing Group [8]. Their protocol was built on top of TinyOS v2.x, whose minimum code size is around 11 KB [20]. Depending
Table 1. Comparison between HiperSense for Eco nodes and the Multi-Channel MAC protocol for TinyOS 2 [8]

Code Size            HiperSense               MCM Protocol
MAC layer            31 bytes                 9544 bytes
Runtime Support      1.1 KB                   11–20 KB
Dynamic Execution    430 bytes                N/A
Total Code Size      1.56 KB + loaded code    20.5–29.5 KB
on the hardware and software configurations, the compiled code size of TinyOS v2.x can exceed 20 KB [20]. Table 1 shows the required code size in ROM after compilation. In HiperSense, the gateways and the front-end node handle most of the protocol policies originally handled by the sensor nodes. This enables the sensor nodes to be kept minimally simple, with only the essential routines. Moreover, the processing time on a sensor node can be shortened, and the firmware footprint in ROM is also minimized. We implemented a dynamic loading/dispatching layer, which occupies 430 bytes and enables a node to dynamically load and execute code fragments that can be highly optimized for the node in its operating context [19]. In contrast, the MCM protocol occupies 9544 bytes on top of TinyOS, which can increase the total code size to 29.5 KB. That is over an order of magnitude larger than our code size.
6 Conclusion and Future Work
This paper reports the progress on the HiperSense sensing and visualization system. The sensing aspect is based on EcoPlex, an infrastructure-based, tiered network system that supports heterogeneous wireless protocols for interoperability and handoff for mobility. We keep node complexity low by implementing only the bare-minimum mechanisms and either externalizing the policies to the host side or making them dynamically loadable. The visualization subsystem is based on HIPerWall, a tiled display system capable of rendering 200 megapixels of display data. By feeding the data streams from EcoPlex to the front-end node of the HIPerWall and replicating them among the nodes within HIPerWall, we are making possible a new kind of visualization system. Unlike previous applications that use static data, we can now visualize both live and historical data. Scalability in a dense area was shown with 100 wireless sensor nodes in a 2 m² area. By utilizing all frequency channels, we expect HiperSense to handle 6200 independent streams of data. Applications include crowd tracking, miniature-scale pipeline monitoring, and a wide variety of medical applications. Future work includes making the protocol more adaptive and power-manageable on the wireless sensor nodes. Dynamic code loading and execution has been implemented but still relies on manual coding, and it is a prime candidate for automatic code synthesis and optimization.
Acknowledgments

The authors would like to thank Seung-Mok Yoo, Jinsik Kim, and Qiang Xie for their assistance with this work on the Eco protocol, and Chung-Yi Ke, Nai-Yuan Ko, Chih-Hsiang Hsueh, and Chih-Hsuan Lee for their work on EcoPlex. The authors also would like to thank Duy Lai for his assistance with HIPerWall. This research project is sponsored in part by the National Science Foundation CAREER Grant CNS-0448668, UC Discovery Grant itl-com05-10154, the National Science Council (Taiwan) Grant NSC 96-2218-E-007-009, and Ministry of Economy (Taiwan) Grant 96-EC-17-A-04-S1-044. HIPerWall was funded through NSF Major Research Instrumentation award number 0421554. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Applying Architectural Hybridization in Networked Embedded Systems Antonio Casimiro, Jose Rufino, Luis Marques, Mario Calha, and Paulo Verissimo FC/UL {casim,ruf,lmarques,mjc,pjv}@di.fc.ul.pt Abstract. Building distributed embedded systems in wireless and mobile environments is more challenging than when fixed network infrastructures can be used. One of the main issues is the increased uncertainty and lack of reliability caused by interferences and fading in the communication, dynamic topologies, and so on. When predictability is an important requirement, the uncertainties created by wireless networks become a major concern. The problem may be even more severe if safety-critical requirements are also involved. In this paper we discuss the use of hybrid models and architectural hybridization as one of the possible alternatives to deal with the intrinsic uncertainties of wireless and mobile environments in the design of distributed embedded systems. In particular, we consider the case of safety-critical applications in the automotive domain, which must always operate correctly in spite of the existing uncertainties. We provide the guidelines and a generic architecture for the development of these applications in the considered hybrid systems. We also address interface issues and describe a programming model that is “hybridization-aware”. Finally, we illustrate the ideas and the approach presented in the paper using a practical application example.
1 Introduction
Over the last decade we have witnessed an explosive use of wireless technologies to support various kinds of applications. Unfortunately, when considering real-time systems, or systems that have at least some properties whose correctness depends on timely and reliable communication, the communication delay uncertainty and unreliability characteristic of wireless networks become a problem. It is not possible to ignore uncertainty and simply wait until a message arrives, hoping it will arrive soon enough. Our approach to address this problem is to consider a hybrid system model, in which one part of the system is asynchronous, namely the part that encompasses the wireless networks and the related computational subsystems, and another part is always timely, with well-defined interfaces to the asynchronous
Faculdade de Ciências da Universidade de Lisboa. Navigators Home Page: http://www.navigators.di.fc.ul.pt. This work was partially supported by FCT through the Multiannual Funding and the CMU-Portugal Programs.
S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 264–275, 2009. © IFIP International Federation for Information Processing 2009
subsystem. In this paper we discuss applicability aspects of this hybrid model, considering in particular safety-critical applications in the automotive domain. In vehicles, safety-critical functions related to automatic speed and steering control are implemented using real-time design approaches, with dedicated controllers that are connected to car sensors and actuators through predictable networks. Despite all the advances in wireless communication technologies, relying on wireless networks to collect information from external sources and using this information in safety-critical control processes seems too risky. We argue that this may become possible if a hybrid system model and architecture are used. The advantage is the following: with the additional information it may be possible to improve some quality parameters of the control functions, possibly optimizing speed curves and fuel consumption, or even improving the overall safety parameters. One fundamental aspect in making the approach viable is to devise appropriate interfaces between the different parts of the architecture. On the other hand, special care must be taken when programming safety-critical applications, as we illustrate by providing the general principles of a “hybridization-aware” programming model. The presented ideas and principles have been explored in the HIDENETS European project [9], in which a proof-of-concept prototype platooning application has been developed. We use this example to briefly illustrate the kind of benefits that may be achieved when using a hybrid system model and architecture to build a networked embedded system. The paper is structured as follows. Some related work is addressed in the next section. Then, Section 3 motivates the idea of using hybrid distributed system models and highlights their main advantages.
In Section 4 we discuss the applicability of the model in the automotive context; Section 5 addresses interface issues and introduces the hybridization-aware programming model. The platooning example case is then provided in Section 6, and we end the paper with some conclusions and future prospects.
2 Related Work
The availability of varied and increasingly better technologies for wireless communication explains the pervasiveness of these networks in our everyday life. In the area of vehicular applications, new standards like the one being developed by the IEEE 802.11p Task Group for Wireless Access in Vehicular Environments (WAVE) will probably become the basis of many future applications. The 802.11p standard provides a set of seven logical communication channels, among which one is a special dedicated control channel that specifically aims at allowing more critical vehicular applications to be developed [1]. In fact, improving the baseline technologies and standards is one of the ways to be able to implement safety-critical systems that operate over wireless networks. There is also a large body of research concerned with studying and proposing solutions to deal with the reliability and temporal uncertainties of wireless communication.
A line of research consists in devising new protocols for the MAC level using specific support at the physical level (e.g., Dedicated Short-Range Communications, DSRC) [17] or adopting decentralized coordination techniques, such as using rotating tokens [6]. In fact, the possibility of characterizing communication delays with a reasonable degree of confidence is sufficient for a number of applications that provide safety-related information to the driver, for instance to avoid collisions [5,14]. However, these applications are not meant to autonomously control the vehicles, and therefore the involved criticality levels are just moderate. In general, and in spite of all improvements, we are still a few steps away from ensuring the levels of reliability and timeliness that are required for the most critical systems. A recent approach that indeed aims at dealing with safety requirements and allows autonomous control of vehicles in wireless and mobile environments is proposed in [3]. The approach relies on the cooperation and coordination between the involved entities, and defines a coordination model that builds on a real-time communication model designated as the Space Elastic model [2]. The Space Elastic model is defined to represent the temporal uncertainties associated with real wireless communication environments. The work presented in [13] also addresses the coordination of automated vehicles in platoons. In particular, it focuses on the feasibility of coordination scenarios where vehicles are able to communicate with their closest neighbors. The authors argue that in these scenarios, communication between a vehicle and its leader/follower is possible, as supported by simulation results presented in [7]. In contrast with these works, we consider a hybrid system model, which accommodates both the asynchrony of the wireless environments and the predictable behavior of the embedded control systems and local networks.
In the area of wireless sensor networks, efforts have also been made in devising architectures and protocols to address the temporal requirements of applications. One of the first examples is the RAP architecture [11], which defines query and event services associated with new network scheduling policies, with the objective of lowering deadline miss ratios. More recent examples include VigilNet, for real-time target tracking in large-scale sensor networks [8], and TBDS [10], a mechanism for node synchronization in cluster-tree wireless sensor networks. Our focus is at a higher conceptual level, abstracting from the specific protocols, network topologies and wireless technologies that are used.
3 Hybrid System Models
Classical distributed system models range from purely asynchronous to fully synchronous, and assume different failure models, from crash to Byzantine. But independently of the particular synchrony or failure model that is assumed, they are typically homogeneous, meaning that the assumed properties apply to the entire system and do not change over time. However, in many real systems and environments, we observe that synchrony or failure modes are not homogeneous: they vary with time or with the part of the system being considered.
Therefore, in the last few years we have been exploring the possibility of using hybrid distributed system models, in which different parts of the system have different sets of properties (e.g. synchronism [16] or security [4]). Using hybrid models has a number of advantages when compared to approaches based on homogeneous models. The main advantages include more expressiveness with respect to reality, the provision of a sound theoretical basis for crystal-clear proofs of correctness, the possibility of being naturally supported by hybrid architectures and, finally, the possibility of enabling concepts for building totally new algorithms. One example of a hybrid distributed system model is the Wormholes model [15]. In essence, this model describes systems in which it is possible to identify a subsystem that presents exceptional properties, allowing fundamental limitations of the overall system to be overcome. For instance, a distributed system in which nodes are connected by a regular asynchronous network, but in which there is also a separate real-time network connecting some synchronization subsystem in each node, can be well described by the Wormholes model. Another very simple example, in which the wormhole subsystem is only local to a node, is a system equipped with a watchdog. Despite the possible asynchrony of the overall system, the watchdog is synchronous and always resets the system in a timely manner whenever necessary. We must note that designing systems based on the Wormholes model is not just a matter of assuming that uncertainty is not ubiquitous or does not last forever. The design philosophy also builds on the principle that predictability must be achieved in a proactive manner, that is, the system must be built in order to make predictability happen at the right time and in the right place.
4 Application in Automotive Context
The wormhole concept can in fact be instantiated in different ways, and here we discuss the possible application of the concept to car systems. Therefore, we first provide an overview of system components that are found in modern cars, and then we explain how a hybrid architecture can be projected over these systems.
4.1 In-Vehicle Components
Modern cars include a wide set of functions to be performed by electronics and microcontrollers, complementing and in many cases totally replacing the traditional mechanical and/or hydraulic mechanisms. These functions include both hardware and software components and are usually structured around Electronic Control Units (ECUs), using the terminology of the automotive industry, which are subsystems composed of a microcontroller complemented with an appropriate set of sensors and actuators. The functions taken over by these components are quite diverse and aim to assist the driver in the control of the vehicle. They range from critical functions associated with the control of the power train (e.g. engine and transmission related
functions), traction (e.g. driving torque), steering or braking, to less critical ones that control the different devices in the body of the vehicle, such as lights, wipers, doors and windows, seats, and climate systems, just to name a few. Recently, a slightly different set of functions has also been incorporated, related to information, communication and entertainment (e.g. navigation systems, radio, audio and video, multimedia, integrated cellular phones, etc.). The implementation of these functions is supported by specialized ECUs. However, many of these functions are distributed along the car infrastructure. Thus, there is a need for those functions to be distributed over several ECUs that exchange information and communicate through in-vehicle networking. Furthermore, it may be necessary to exchange information between ECUs implementing different functions. For example, the vehicle speed obtained from a wheel rotation sensor may be required for gearbox control or for the control of an active suspension subsystem, but it may also be useful for other subsystems. Given that the different functional domains have different requirements in terms of safety, timeliness and performance guarantees, the interconnection of the different ECUs is structured along several networks, classified according to their available bandwidth and function. There are four classes of operation, including one (Class C) with strict requirements in terms of dependability and timeliness, and another (Class D) for high-speed data exchanges such as those required for mobile multimedia applications. The combination of the functions typically provided by each of those four networking classes involves network interconnection through gateways, as illustrated in Figure 1.
Fig. 1. Typical In-Vehicle Networking
The in-vehicle ECUs provide support for the different functions implemented in today's cars. Each ECU is composed of a computing platform where the ECU software is executed. The computing platform is typically complemented with some specific hardware: a set of sensors for gathering data from the system under control, and a set of actuators for acting on the given car subsystem. The support of drive-by-wire functions, integrating a set of sensors (e.g. a proximity sensor) and actuators (e.g. speed and brake control), is just one example with relevance for the platooning application that we refer to in Section 6. Other ECUs may exhibit a slightly different architecture because they are intended to support different functions. One example is illustrated in Figure 2,
Fig. 2. Example of In-Vehicle Infotainment Functions
intended to support the integration of infotainment functions. In this case, the architecture of the computing platform is designed to interface with and integrate the operation of multiple gadgets (radio, cellular phone) and technologies.
4.2 Architectural Hybridization in Vehicles
Given the description provided above, it is clear that there is a separation between what may be called a general computing platform, able to run general purpose local and distributed applications connected through wireless networks, and embedded systems dedicated to the execution of specific car functions. Interestingly, there exist gateways between these different subsystems, which allow information to flow across their boundaries. For example, the information provided by a proximity sensor in the car electronic subsystem may be highly relevant for a driver warning application running in the general computing platform. However, sending information from the general purpose system to a critical control system is not so trivial and, as far as we know, is typically avoided. We argue that in this context it is interesting and useful to apply the wormholes hybrid model, in order to explicitly assume the existence of a general (payload) system, asynchronous, but in which complex applications can be executed without special concern for timeliness, and a wormhole subsystem, which is timely, reliable, and in which it is possible to execute critical functions to support interactions with the payload system. The wormhole must provide at least one Timely Timing Failure Detection (TTFD) service, available to payload applications, to allow the detection of timing failures in the payload part or in the payload-wormhole interactions. This TTFD service must also be able to timely trigger the execution of fault handling operations for safety purposes. These handlers have to be implemented as part of the wormhole subsystem and will necessarily be application-dependent.
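As a concrete illustration of the TTFD concept, the sketch below simulates the service as a small state machine driven by a logical clock. This is not the HIDENETS implementation: the start/stop names echo the startTFD/stopTFD interface presented in Section 5.2, but the clock-driven tick, the handler mechanism and all parameters are assumptions made for this sketch.

```python
# Illustrative simulation of a Timely Timing Failure Detection (TTFD)
# service driven by a logical clock. The start/stop names follow the
# paper's startTFD/stopTFD interface; everything else is an assumption.

class TTFD:
    def __init__(self):
        self.deadline = None   # absolute deadline of the monitored action
        self.failed = False    # shared failure indicator (cf. Figure 3)
        self.handler = None    # safety handler, programmed a priori

    def start_tfd(self, now, max_duration, handler=None):
        """Begin monitoring a timed action with a maximum duration."""
        self.deadline = now + max_duration
        self.handler = handler

    def tick(self, now):
        """Invoked timely by the synchronous subsystem: raises the
        failure condition as soon as the deadline is crossed."""
        if self.deadline is not None and now > self.deadline:
            self.failed = True
            if self.handler is not None:
                self.handler()   # trigger the fault handling operation

    def stop_tfd(self, now):
        """Stop monitoring; True iff the action terminated timely."""
        timely = (self.deadline is not None
                  and now <= self.deadline and not self.failed)
        self.deadline = None
        return timely
```

In this simulation, stopping at logical time 5 an action started at time 0 with a maximum duration of 10 is reported as timely, while crossing the deadline first raises the failure flag and invokes the handler, mirroring the timely triggering of fault handling described above.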
With these settings it is possible to deal with information flows from the payload side to the critical subsystems, thus allowing the development of applications that run in the general computing platform, that are able to exploit the availability of wireless communication, and that are still able to control critical systems
in a safe way. Of course, for this to be possible, the applications must be programmed in a way that is “hybridization-aware”, explicitly using the TTFD service provided by the wormhole subsystem and being complemented by safety functions that must be executed on predictable subsystems. In the following section we describe the architectural components that constitute the hybrid system, focusing on these interfacing and programming issues.
5 Designing Applications in Hybrid Systems
5.1 Generic Architecture
In the proposed approach for the design of safety-critical applications in hybrid systems, the system architecture must necessarily encompass the two realms of operation: the asynchronous payload and the synchronous real-time subsystem, as illustrated in Figure 3.
Fig. 3. System architecture for asynchronous control
A so-called asynchronous control task executes in the payload part, possibly interacting with external systems through wireless or other non-real-time networks. Interestingly, this asynchronous control task can perform complex calculations using varied data sources in order to achieve improved control decisions. On the real-time (or wormhole) part of the system, several tasks will be executed in a predictable way, always satisfying (by design) any required deadline. In order to exploit the synchronism properties of the wormhole part of the system, the interface to access wormhole services must be carefully designed. The solution requires the definition of a wormhole gateway, much like the gateways between the different network classes that are defined in car architectures. This wormhole gateway includes an admission layer, which restricts the patterns of service requests as a means to secure the synchrony properties of the wormhole subsystem (we assume that the payload system can be computationally powerful, and the number of requests sent to the wormhole subsystem is not
bounded a priori). Some service requests may be delayed, rejected or simply not executed because of lack of resources. This behavior is admissible because, from the perspective of the asynchronous system, no guarantees are given anyway. Several interface functions may be made available, some of which specifically related to the application being implemented (e.g., functions for control commands to be sent to actuators or ECUs, and for sensor information to be read). At a minimum, it is necessary to provide a set of functions to access and use the TTFD service. The role of the TTFD service is fundamental: in simple terms, it is a kind of “enhanced watchdog” programmed by the payload application, and it works as a switching device that gives control to the wormhole part when the payload becomes untimely. A more detailed description of the TTFD service and how it must be used is provided in Section 5.2. A control task is defined within the gateway, which will implement the specific functions and will also interact with the TTFD service, forwarding start and stop commands received from the payload. The task may also decide whether an actuation command can effectively be applied or not, depending on the timeliness status of the payload. A safety task must also be in place, which will be in charge of ensuring safe control whenever the asynchronous control task is unable to exercise control. This safe control will be done using only the locally available information, collected from local sensors. This task can be designed to keep the system in a safe state, but it provides a pessimistic control in the sense that it is based only on local information. The effective activation of this task is controlled by the TTFD service, using a status variable in a shared-memory structure, or some equivalent implementation. Quite naturally, each specific application must have its own associated safety task.
Therefore, although the architecture is generically applicable to safety-critical applications in hybrid systems, some components must be instantiated on a case-by-case basis. In Figure 3 we also represent the sensors and actuators, which are necessarily part of the real-time subsystem.
5.2 Using the TTFD Service
A fundamental idea underlying the approach is to use the TTFD service to monitor the timeliness of a payload process. The TTFD service provides the following functions: startTFD, stopTFD and restartTFD. The startTFD function specifies the instant at which monitoring of a timed action starts, and the maximum duration allowed for that action. The handling functions that are executed when a timing failure is detected must be programmed a priori as part of the wormhole. A specific handler may be specified when starting a timing failure monitoring activity. The stopTFD function stops the on-going monitoring activity, returning an indication of whether the timed execution terminated timely or not. The restartTFD function allows the atomic execution of a stopTFD request followed by a startTFD request. Before starting a timed execution, the TTFD service is requested to monitor this execution and a deadline is provided. If the execution is timely, then the
TTFD monitoring activity will be stopped before the deadline. Otherwise, when the deadline is reached the TTFD service will raise a timing fault condition (possibly a boolean variable in the shared memory, as shown in Figure 3). From a programmer's perspective, and considering that we are concerned with the development of asynchronous control applications, there are two important issues to deal with: a) determining the deadline values provided to the TTFD service; b) using the available functions in a way that ensures that either the execution is timely (thus allowing control commands to be issued) or else a timing failure is always detected (so that the safety handler can at least be executed). The deadline must be such that the application is likely to be able to perform the necessary computations within it. In control, there is a tradeoff between reducing the duration of the control cycle and the risk of not being able to compute a control decision within the allowed period. On the other hand, specifying large deadlines will have a negative influence on the quality of control. The other restriction on the deadline is determined by safety rules and by the characteristics of a fail-safe real-time control task that will be activated when the deadline is missed. The deadline must be such that the fail-safe control task, when activated, is still able to fulfill the safety requirements. The second issue concerns the way in which interactions between the payload and the wormhole must be programmed, which we discuss in what follows.
5.3 Payload-Wormhole Interactions
In the proposed architecture, TTFD requests are in fact directed to the control task, along with actuation commands. That is, when the asynchronous control task sends an actuation command, it is implicitly finishing an on-going timed action, and it must explicitly start a new one by specifying a deadline for the next actuation instant. The idea is the following: when an actuation command is sent from the payload to the wormhole, it is supposed to be sent before a previously specified deadline. Therefore, when the command is received by the control task, this task first has to stop the on-going TTFD monitoring activity. Depending on the returned result, the control task will know if the actuation command is timely (and hence can be safely used and applied to actuators) or if it is untimely (in which case, it will just be discarded). In the latter case, the TTFD must already have triggered the failure indication. In fact, this indicator is used by the safety task to decide if it should indeed become active and take over the control of the system. As soon as a timing failure occurs, the indicator is activated, and the safety task will take over the next time it is released. This means that a late command received from the payload will be ignored, and it will be ensured that the safety task will already be in control. In a steady state, the asynchronous control task will be continuously sending commands to the wormhole, timely stopping the on-going TTFD monitoring activity, atomically restarting the TTFD for a future point in time (the next actuation deadline), and applying the control command.
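The steady-state pattern just described — stop the ongoing monitoring, decide on the command's timeliness, atomically restart the TTFD for the next actuation deadline, and apply or discard the command — can be sketched as follows. The class name, the logical clock and the fixed actuation period are our own assumptions for illustration, not the HIDENETS gateway code; the decision logic is the point.

```python
# Sketch of the gateway control task: a command is applied only if it
# arrives before the current TTFD deadline; a late command is discarded
# and the failure indicator hands control to the safety task.
# The 100-unit actuation period is an illustrative assumption.

PERIOD = 100

class Gateway:
    def __init__(self):
        self.deadline = PERIOD   # deadline for the first actuation
        self.failure = False     # shared failure indicator (cf. Figure 3)
        self.applied = []        # commands actually forwarded to actuators

    def on_command(self, now, command):
        """stopTFD + restartTFD, executed atomically in the wormhole."""
        timely = now <= self.deadline and not self.failure
        self.deadline = now + PERIOD   # restart for the next actuation
        if timely:
            self.applied.append(command)   # safe to actuate
        return timely                      # late commands are discarded

    def safety_task_release(self, now):
        """Periodic release of the safety task: it takes over as soon
        as the failure indicator was (or is now) raised."""
        if now > self.deadline:
            self.failure = True
        return self.failure   # True => pessimistic local control active
```

Note that once the failure indicator is raised it stays raised, so a late command arriving after the safety task has taken over is never applied, exactly as the text requires.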
6 Platooning Example
Let us consider the example of a platooning application, in which the objective is to achieve a better platoon behavior (keep cars close together at the maximum possible speed), using not only the information available from local car sensors, but also information collected from other cars or from fixed stations along the road. The hybrid architecture will encompass an asynchronous platooning control task running on some on-board general purpose computer, processing all the available information and producing control decisions that must be sent to the vehicle ECUs. The information exchanged between vehicles (through the wireless network) includes time, position and speed values. This information is relevant for the platooning control application, since it will know where to position each of the other cars on a virtual map and hence what to do regarding its own speed. Clocks are assumed to be synchronized through GPS receivers, and accelerations (positive and negative) are bounded. In this way, worst-case scenarios can be considered when determining the actuation commands. Every car in the platoon periodically retrieves the relevant information from local sensors (through the wormhole interface), disseminates this information, and hopefully receives the same kind of information from the other cars. In the platooning application case, failures in the communication will not have serious consequences. In fact, if a car does not receive information from the preceding car, it will use old information and will “see” that car closer than it is in reality. The consequence is that the follower car will stop, even if this is not necessary. Given the periodic nature of the payload message exchanges, the asynchronous control tasks may become aware of lost or very delayed messages (if timeouts are used) and refrain from sending actuation commands to the wormhole.
In this case, or if the payload becomes too slow (remember that this is a general purpose computing environment), the actuation commands expected by the wormhole will not be received or will arrive too late, and meanwhile the safety task is activated to take over the control of the car. From the platooning application perspective, the proposed implementation provides some clear improvements over a traditional implementation. The latter is pessimistic in the sense that it must ensure larger safety distances between cars, in particular at high speeds, since no information is available about the surrounding environment and in particular about the speed of the preceding car. In contrast, in the prototype we implemented it is possible to observe that, independently of the platoon speed, the distance between every two cars is kept constant, because follower cars are able to know the distance to the preceding car, and also its speed. We implemented a prototype of this platooning application, which was demonstrated using emulators for the physical environment and for the wireless network. Figure 4 illustrates some of the hardware and a graphical view of the platoon in the physically emulated reality. The interested reader can refer to [12], which provides additional details about this demonstration.
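The fail-safe effect of stale data described above can be made concrete with a small sketch: the follower positions the preceding car at its last reported position, so old information shrinks the perceived gap and the follower brakes pessimistically. All numeric values and names below are illustrative assumptions of ours, not parameters of the actual prototype.

```python
# Follower-gap sketch: decide the follower's speed from the last
# received state of the preceding car. With stale data the perceived
# gap is smaller than the real one, so the follower errs on the safe
# side and stops. SAFE_GAP and all values are illustrative.

SAFE_GAP = 10.0  # minimum admissible inter-vehicle distance (metres)

def follower_speed(own_pos, last_leader_pos, leader_speed, cruise_speed):
    gap = last_leader_pos - own_pos
    if gap < SAFE_GAP:
        return 0.0   # perceived gap too small: stop, even if unnecessary
    # Keep close at the maximum possible speed, never outrunning the leader.
    return min(cruise_speed, leader_speed)
```

With a fresh report placing the leader 20 m ahead, the follower simply matches the leader's speed; if the last report is stale and places the leader only 8 m ahead, the follower stops, even though the leader has in reality moved further away.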
A. Casimiro et al.
Fig. 4. Platooning demonstration
7 Conclusions
The possibility of using wireless networks for car-to-car or car-to-infrastructure interactions is very appealing. The availability of multiple sources of information can be used to improve the quality of control functions and, indirectly, safety or fuel consumption. The problem we addressed in this paper concerns the potential lack of timeliness and the unreliability of wireless networks, which make it difficult to consider their use when implementing real-time applications or safety-critical systems. We propose an approach based on the use of a hybrid system model and architecture. The general idea is to allow applications to be developed on a general-purpose, typically asynchronous, part of the system, while providing the means to ensure that safety-critical properties are always secured. Since we focus on applications for the vehicular domain, typically control applications, we first explain why the considered hybrid approach is very reasonable in this context. We then provide guidelines for designing asynchronous control applications, explaining in particular how the interactions between the payload and the wormhole subsystems should be programmed. From the experience we gained in developing the platooning example application and from the observations we made while executing our demonstration system, we conclude that the proposed approach constitutes a potentially interesting alternative for the implementation of optimized safety-critical systems in wireless environments.
References

1. IEEE P802.11p/D3.0, Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Amendment: Wireless Access in Vehicular Environments (WAVE), Draft 3.0 (July 2007)
2. Bouroche, M., Hughes, B., Cahill, V.: Building reliable mobile applications with space-elastic adaptation. In: WOWMOM 2006: Proceedings of the 2006 International Symposium on a World of Wireless, Mobile and Multimedia Networks, Washington, DC, USA, pp. 627–632. IEEE Computer Society, Los Alamitos (2006)
Applying Architectural Hybridization in Networked Embedded Systems
3. Bouroche, M., Hughes, B., Cahill, V.: Real-time coordination of autonomous vehicles. In: Proceedings of the IEEE Intelligent Transportation Systems Conference 2006, September 2006, pp. 1232–1239 (2006)
4. Correia, M., Veríssimo, P., Neves, N.F.: The design of a COTS real-time distributed security kernel. In: Bondavalli, A., Thévenod-Fosse, P. (eds.) EDCC 2002. LNCS, vol. 2485, pp. 234–252. Springer, Heidelberg (2002)
5. Elbatt, T., Goel, S.K., Holland, G., Krishnan, H., Parikh, J.: Cooperative collision warning using dedicated short range wireless communications. In: VANET 2006: Proceedings of the 3rd International Workshop on Vehicular Ad Hoc Networks, pp. 1–9. ACM, New York (2006)
6. Ergen, M., Lee, D., Sengupta, R., Varaiya, P.: WTRP - Wireless Token Ring Protocol. IEEE Transactions on Vehicular Technology 53(6), 1863–1881 (2004)
7. Halle, S., Laumonier, J., Chaib-Draa, B.: A decentralized approach to collaborative driving coordination. In: Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems, October 2004, pp. 453–458 (2004)
8. He, T., Vicaire, P., Yan, T., Luo, L., Gu, L., Zhou, G., Stoleru, R., Cao, Q., Stankovic, J.A., Abdelzaher, T.: Achieving real-time target tracking using wireless sensor networks. In: RTAS 2006: Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, Washington, DC, USA, pp. 37–48. IEEE Computer Society, Los Alamitos (2006)
9. HIDENETS, http://www.hidenets.aau.dk/
10. Koubâa, A., Cunha, A., Alves, M., Tovar, E.: TDBS: a time division beacon scheduling mechanism for ZigBee cluster-tree wireless sensor networks. Real-Time Syst. 40(3), 321–354 (2008)
11. Lu, C., Blum, B.M., Abdelzaher, T.F., Stankovic, J.A., He, T.: RAP: A real-time communication architecture for large-scale wireless sensor networks. In: Eighth IEEE Real-Time and Embedded Technology and Applications Symposium, Washington, DC, USA, pp. 55–66. IEEE Computer Society, Los Alamitos (2002)
12. Marques, L., Casimiro, A., Calha, M.: Design and development of a proof-of-concept platooning application using the HIDENETS architecture. In: Proceedings of the 2009 IEEE/IFIP Conference on Dependable Systems and Networks, pp. 223–228. IEEE Computer Society Press, Los Alamitos (2009)
13. Michaud, F., Lepage, P., Frenette, P., Letourneau, D., Gaubert, N.: Coordinated maneuvering of automated vehicles in platoons. IEEE Transactions on Intelligent Transportation Systems 7(4), 437–447 (2006)
14. Misener, J.A., Sengupta, R.: Cooperative collision warning: Enabling crash avoidance with wireless. In: 12th World Congress on ITS, New York, NY, USA (November 2005)
15. Verissimo, P.: Travelling through wormholes: a new look at distributed systems models. SIGACT News 37(1), 66–81 (2006)
16. Veríssimo, P., Casimiro, A.: The Timely Computing Base model and architecture. IEEE Transactions on Computers - Special Issue on Asynchronous Real-Time Systems 51(8) (August 2002); a preliminary version appeared as Technical Report DI/FCUL TR 99-2, Department of Computer Science, University of Lisboa (April 1999)
17. Xu, Q., Mak, T., Ko, J., Sengupta, R.: Vehicle-to-vehicle safety messaging in DSRC. In: VANET 2004: Proceedings of the 1st ACM International Workshop on Vehicular Ad Hoc Networks, pp. 19–28. ACM Press, New York (2004)
Concurrency and Communication: Lessons from the SHIM Project

Stephen A. Edwards
Columbia University, New York, NY, USA
[email protected]
Abstract. Describing parallel hardware and software is difficult, especially in an embedded setting. Five years ago, we started the SHIM project to address this challenge by developing a programming language for hardware/software systems. The resulting language describes asynchronously running processes and has the useful property of scheduling independence: the I/O of a SHIM program is not affected by any scheduling choices. This paper presents a history of the SHIM project with a focus on the key things we have learned along the way.
1 Introduction
SHIM, an acronym for "software/hardware integration medium," started as an attempt to simplify the challenges of passing data across the hardware/software boundary. It has since turned into a language development effort centered around a scheduling-independent (i.e., race-free) concurrency model and static analysis. The purpose of this paper is to lay out the history of the SHIM project with a special focus on what we learned along the way. It is deliberately light on technical details (which can be found in the original publications) and instead tries to contribute intuition and insight. We begin by discussing the original motivations for the project, how it evolved into a study of concurrency models, how we chose a particular model, and how we have added language features to that model. We conclude with a section highlighting the central lessons we have learned, along with open problems.
2 Embryonic SHIM
We started developing SHIM in 2004 after observing the difficulties our students were having building embedded systems [1,2] that communicated across the hardware/software boundary. The central idea was to provide variables that could be accessed equally easily by either hardware processes or software functions, both written in a C-like dialect. Figure 1 shows a simple counter in this dialect of the language. The count function resides in hardware; the other two are in software. When a software function like get_time read the hardware register counter, the compiler would automatically insert code to fetch its value from the hardware and synthesize VHDL that could send the data on a bus when requested.

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 276–287, 2009. © IFIP International Federation for Information Processing 2009

    module timer {
      shared uint:32 counter;   /* Hardware register visible from software */

      hw void count() {         /* Hardware function */
        counter = counter + 1;  /* Direct access to hardware register */
      }

      out void reset_timer() {  /* Software function */
        counter = 0;            /* Accesses register through bus */
      }

      out uint get_time() {     /* Software function */
        return counter;         /* Accesses register through bus */
      }
    }

Fig. 1. An early fragment of SHIM [1]

We wrote a few variants of an I²C bus controller in the language, starting with an all-software version and ending with one that implemented byte-level communication completely in hardware. The central lesson of this work was that the shared-memory model, while simple, was a very low-level way to implement such communication. Invariably, it is necessary to layer another communication protocol (e.g., some form of handshake) over it to ensure coherence. We had not included an effective mechanism for implementing communication libraries that could hide this fussy code, so it was back to the drawing board.
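The kind of handshake that must be layered over raw shared variables can be sketched in C. This is a generic single-writer/single-reader valid-flag protocol of our own devising, not code from the SHIM compiler; a real hardware/software implementation would also need memory barriers or bus-level atomicity.

```c
#include <stdbool.h>
#include <stdint.h>

/* A valid flag layered over a shared data word: the writer raises `valid`
 * after storing the datum, the reader lowers it after consuming, so a
 * value can be neither lost nor read twice. Single writer, single reader,
 * cooperative (non-preemptive) steps assumed. */

static uint32_t shared_data;
static bool     valid;   /* true: data written and not yet consumed */

bool try_put(uint32_t v)       /* writer side: succeeds if slot is free */
{
    if (valid) return false;   /* previous value not yet consumed */
    shared_data = v;
    valid = true;              /* publish only after the data is stored */
    return true;
}

bool try_get(uint32_t *out)    /* reader side: succeeds if data present */
{
    if (!valid) return false;
    *out = shared_data;
    valid = false;             /* release the slot for the next value */
    return true;
}
```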
3
Kahn, Hoare, and the shim Model
We decided we wanted reliable communication, including any across the hardware/software boundary, to be a centerpiece of the next version of SHIM. Erroneous communication is a central source of bugs in hardware designs: our embedded-systems students' favorite mistake was to generate a value in one cycle and read it in another. This rarely produces even a warning in usual hardware simulation, so it can easily go unnoticed. We also found the inherent nondeterminism of the first iteration of SHIM a key drawback. The speed at which software runs on processors is rarely known, let alone controlled. Since software and hardware run in parallel and communicate using shared variables, the resulting system was nondeterministic, making it difficult to test. It also ran counter to what we had learned from Esterel [3]. Table 1 shows our wishlist. We wanted a concurrent, deterministic (i.e., independent of scheduling) model of computation and started looking around. The synchronous model [4] was unsuitable because it generally assumes either a single clock or harmonically related clocks and would not work well with software.
Table 1. The SHIM Wishlist

Trait                                      Motivation
Concurrent                                 Hardware/software systems fundamentally parallel
Mixes synchronous and asynchronous styles  Software slower and less predictable than hardware; need something like multirate dataflow
Only requires bounded resources            Fundamental restriction on hardware
Formal semantics                           No arguments about meaning or behavior
Scheduling-independent                     I/O should not depend on program implementation
Steve Nowick steered us toward the body of work on delay-independent circuits (e.g., van Berkel's handshake circuits [5]). We compared this class of processes to Kahn's networks [6] and found them to be essentially the same [7]. We studied how to characterize such processes [8], finding that we could characterize them as functions that, when presented with more inputs or output opportunities, never produce less or different data. In their classic form, the unbounded buffers of Kahn networks actually make them Turing-complete [9] and difficult to schedule [10], so we decided on a model in which Kahn networks are restricted to CSP-like rendezvous [11]. Others, such as Lin [12], had also proposed using such a model. In 2005, we presented our new SHIM model and a skeleton of the language, "Tiny-SHIM," and its formal semantics [13]. It amounted to read and write operations sewn together with the usual C-like expressions and control-flow statements. We later extended this work with further examples, a single-threaded C implementation, and an outline of a hardware translation [14]. In 2006, we published our first real research result with SHIM: a technique for very efficient single-threaded code generation [15]. The centerpiece of this work was an algorithm that could compile arbitrary groups of processes into a single automaton whose states abstracted the control states of the processes. Our goal was to eliminate synchronization overhead, so the automaton captured which processes were waiting on which channels, but left all other details, such as variable values and details of the program counters, out of the automaton. Figure 2 demonstrates the algorithm from Edwards and Tardieu [15]. The automaton's states are labeled with a number (e.g., S0), the state of each channel in the system (ready "-", blocked reading "R", or blocked writing "W"), and, for each process, whether it is runnable (√) or blocked on a channel (×), together with a set of possible program counters.
From each state, the automaton generator (a.k.a. the scheduler) nondeterministically chooses one of the runnable processes to execute and generates a successor state by considering each possible pc value for the process. The code generated for a state with multiple pc values begins with a C switch statement that splits control depending on the pc value.
    process sink(int32 B) { for (;;) B; }

    process buffer(int32 &B, int32 A) { for (;;) B = A; }

    sink (intermediate code):
      0: PreRead 1
      1: PostRead 1 tmp3
      2: goto 0

Fig. 2. An illustration of the SHIM language and its automaton compilation scheme from Edwards and Tardieu [15]. A source program (a) is dismantled into intermediate code (b), then simulated to produce an automaton (c). Each state is labeled with its name, the state of each channel (blocked on read, blocked on write, or idle), and the state of each process (runnable, and possible program counter values).
At this point, the language fairly closely resembled the Tiny-SHIM language of the EMSOFT paper [13]. A system consisted of a collection of sequential processes, assumed to all start when the system began. It could also contain networks—groups of connected processes that could be instantiated hierarchically. One novel feature of this version, which we later dropped, was the ability to instantiate processes and networks without supplying explicit connections. Instead, the compiler would examine the interface to each instantiated process and make sure its environment supplied such a signal. Connections were made implicitly by name, although this could be overridden. This feature arose from observing that in VHDL it is often necessary to declare and mention each channel many times: once for each process, once for each instantiation of each process, and once in the environment in which it is instantiated. However, in the process of writing more elaborate test cases, such as a JPEG decoder [16], we decided that this connection-centric specification style (which we adopted from hardware description languages) was inadequate for any sort of interesting software. We wanted function calls.
4 Recursion
In 2006, we introduced function calls and recursion to SHIM, making it very C-like [17]. Our main goal was to make basic function calls work, allowing the usual re-use of code, but we also found that recursion, especially bounded recursion, was a useful mechanism for specifying more complex structures.

    void buffer(int i, int &o) {
      for (;;) {
        recv i;
        o = i;
        send o;
      }
    }

    void fifo(int i, int &o, int n) {
      int c;
      int m = n - 1;
      if (m)
        buffer(i, c) par fifo(c, o, m);
      else
        buffer(i, o);
    }

Fig. 3. An n-place FIFO specified using recursion, from Tardieu and Edwards [17]
Figure 3 illustrates this style. The recursive fifo procedure calls itself repeatedly in parallel, effectively instantiating buffer processes as it goes. This recursion runs only once, when the program starts, to set up a chain of single-place buffers.
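The idea of recursion that runs only once, at start-up, to materialize a static structure can be mimicked in plain C, where each recursive step allocates a one-place buffer node instead of spawning a process. This analogue is ours, not from the paper; the names are hypothetical.

```c
#include <stdlib.h>

/* Each recursive call creates one link in a chain, the way fifo() in
 * Figure 3 instantiates one buffer process per level of recursion. */

typedef struct buf {
    int full, val;        /* one-place buffer: occupied flag and datum */
    struct buf *next;     /* the process this buffer feeds */
} buf;

/* Mirrors fifo(i, o, n): build an n-place FIFO as a chain of one-place
 * buffers. The recursion runs once, at construction time. */
buf *make_fifo(int n)
{
    if (n == 0) return NULL;
    buf *b = calloc(1, sizeof *b);
    b->next = make_fifo(n - 1);   /* "par fifo(c, o, n-1)" becomes a link */
    return b;
}

int fifo_depth(const buf *b)      /* capacity of the resulting chain */
{
    int d = 0;
    for (; b; b = b->next) d++;
    return d;
}
```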
5 Exceptions
Next, we added exceptions [18], certainly the most technically difficult addition we have made. Inspired by Esterel [3], where exceptions are used not just for occasional error handling but as widely as, say, if-then-else, we wanted our exceptions to be widely applicable and to be concurrent and scheduling-independent. For sequential code, the semantics of exceptions were clear: throwing an exception immediately sends control to the most-recently-entered handler for the given exception, terminating any functions that were called in between. For concurrently running functions, the right behavior was less obvious. We wanted to terminate everything leading up to the handler, including any concurrently running relatives, but we insisted on maintaining SHIM's scheduling independence, meaning we had to carefully time when the effect of an exception was felt. Simply terminating siblings when one threw an exception would be nondeterministic: the behavior would then depend on the relative execution rates of the processes and thus not be scheduling-independent. Our solution was to piggyback the exception mechanism on the communication system, i.e., a process would only learn of an exception when it attempted to communicate, the only point at which processes agree on the time. To accommodate exceptions, we introduced a new, "poisoned," state for a process that represents when it has been terminated by an exception and is waiting for its relatives to terminate. Any process that attempts to communicate with a poisoned process will itself become poisoned. In Figure 5, the first thread throws
    void main() {
      int i;
      i = 0;
      try {
        i = 1;
        throw T;
        i = i * 2;    // is not executed
      } catch(T) {
        i = i * 3;
      }               // i = 3
    }
    (a)

    void main() {
      int i;
      i = 0;
      try {           // thread 1
        throw T;
      } par {         // thread 2
        for (;;) i = i + 1;   // runs forever
      } catch(T) {}
    }
    (b)

Fig. 4. (a) Sequential exception semantics are classical. (b) Thread 2 never feels the effect of the exception because it never communicates. From Tardieu and Edwards [18].
    void main() {
      chan int i = 0, j = 0;
      try {           // thread 1
        while (i < 5) next i = i + 1;
        throw T;      // poisons itself
      } par {         // thread 2
        for (;;) next j = next i + 1;   // poisoned by thread 1
      } par {         // thread 3
        for (;;) recv j;                // poisoned by thread 2
      } catch (T) {}
    }

Fig. 5. Transitive poisoning: throw T poisons the first process, which poisons the second when the second attempts next i. Finally, the third is poisoned when it attempts recv j and the whole group terminates.
an exception; the second thread is poisoned when it attempts to rendezvous on i, and the third is poisoned by the second when it attempts to rendezvous on j. The idea was simple enough, and the interface it presented to the programmer could certainly be used and explained without much difficulty, but implementing it turned out to be a huge challenge, despite there being a fairly simple set of structural operational semantics rules for it. The real complexity came from having to consider exception scope, which limits how far the poison propagates (it does not propagate outside the scope of the exception), and the behavior of multiple, concurrently thrown exceptions.
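The poison-through-communication rule can be captured in a toy model. The sketch below is ours, not the compiler's implementation: poison spreads only through attempted communication, never spontaneously, mirroring the transitive poisoning of Fig. 5.

```c
#include <stdbool.h>

/* Toy model of poisoning: a process that throws becomes poisoned, and a
 * rendezvous with a poisoned party fails and poisons the other side. */

enum { NPROC = 3 };

static bool poisoned[NPROC];   /* all start healthy (false) */

void throw_exception(int p) { poisoned[p] = true; }

/* Attempted rendezvous between processes a and b: succeeds only if both
 * are healthy; otherwise both end up poisoned (poison propagates at the
 * communication point, the only point at which processes agree on time). */
bool communicate(int a, int b)
{
    if (poisoned[a] || poisoned[b]) {
        poisoned[a] = poisoned[b] = true;
        return false;
    }
    return true;
}
```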
6 Static Analysis
SHIM has always been designed for aggressive compiler analysis. We have attempted to keep its semantics simple and scheduling-independent, and to restrict it to finite-state models. Together, these properties have made it easier to analyze.
We developed a technique for removing bounded recursion from SHIM programs [19]. One goal was to simplify SHIM's translation into hardware, where general recursion would require memory for a stack and choosing a size for it, but the technique has found many other uses. In particular, if a program has only bounded recursion, it is finite-state, which simplifies other analysis steps. The basic idea of our work was to unroll recursive calls by exactly tracking the behavior of the variables that control the recursion. Our insight was that for a recursive function to terminate, the recursive call must be within the scope of a conditional. Therefore, we need to track the predicate of this conditional, see what can affect it, and so forth. Figure 6 illustrates what this procedure does to a simple FIFO. To produce the static version in Figure 6(b), our procedure observes that the n variable controls the predicate around fifo's recursive call of itself. It then notices that n is initially bound to 3 by fifo3 and generates three specialized versions of fifo—one each with n = 3, n = 2, and n = 1—simplifies each, then inlines each function, since each is called only once. Of course, in the worst case our procedure could end up trying to track every variable in the program, which would be impractical, but in many examples we tried, recursion control involved only a few variables, making it easy to resolve. A key hypothesis of the SHIM project has been that scheduling independence should be a property of any practical concurrent language because it greatly simplifies reasoning about a program, both by the programmer and by automated tools. Our work on static deadlock detection reinforces this point. SHIM is not immune to deadlocks (e.g., { recv a; recv b; } par { send b; send a; } is about the simplest example), but they are simpler in SHIM because of its scheduling independence: deadlocks in SHIM cannot occur because of race conditions.
    void fifo3(chan int i, chan int &o) {
      fifo(i, o, 3);
    }

    void fifo(chan int i, chan int &o, int n) {
      if (n > 1) {
        chan int c;
        buf(i, c); par fifo(c, o, n-1);
      } else
        buf(i, o);
    }

    void buf(chan int i, chan int &o) {
      for (;;) next o = next i;
    }
    (a)

    void fifo3(chan int i, chan int &o) {
      chan int c1, c2;
      buf(i, c1); par buf(c1, c2); par buf(c2, o);
    }

    void buf(chan int i, chan int &o) {
      for (;;) next o = next i;
    }
    (b)

Fig. 6. Removing bounded recursion, controlled by the n variable, from (a) gives (b). After Edwards and Zeng [19].

For example, because SHIM does not have races, there are no race-induced deadlocks, such as the "grab locks in opposite order" deadlock present in many other languages. In general, SHIM does not need to be analyzed under an interleaved model of concurrency, since most properties, including deadlock, are the same under any schedule. So all the clever partial-order tricks used by model checkers such as SPIN [20] are unnecessary for SHIM. We first used the synchronous model checker NuSMV [21] to detect deadlocks in SHIM [22]—an interesting choice, since SHIM's concurrency model is fundamentally asynchronous. Our approach was to abstract away data operations and choose a specific schedule in which each communication event takes a single cycle. This reduced a SHIM program to a set of communicating state machines suitable for the NuSMV model checker. We continue to work on deadlock detection in SHIM. Most recently [23], we took a compositional approach in which we build an automaton for a complete system piece by piece. Our insight is that we can usually abstract away internal channels and simplify the automaton without introducing or removing deadlocks. The result is that even though we are doing explicit model checking, we can often do it much faster than a state-of-the-art symbolic model checker such as NuSMV. We have also used model checking to search for situations where buffer memory can be shared [24]. In general, each communication channel needs storage for any data being communicated over it, but in certain cases it is possible to prove that two channels can never be active simultaneously. We use the NuSMV model checker to identify these cases, which allows us to share potentially large buffers across multiple channels. Because this is an optimization, if the model checker becomes overloaded, we can safely analyze the system in smaller pieces.
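The simplest deadlock above ({ recv a; recv b; } par { send b; send a; }) is a cycle in the "waits-for" relation between blocked processes. The following sketch is our own illustration of that textbook view, not the paper's algorithm (which works on channel automata); names are hypothetical.

```c
#include <stdbool.h>

/* waits_for[i] is the process that must move before process i can proceed,
 * or -1 if process i is runnable. A deadlock is a cycle in this relation. */
bool has_deadlock(const int waits_for[], int n)
{
    for (int start = 0; start < n; start++) {
        /* Follow the waits-for chain from `start` for at most n+1 hops.
         * A chain without a cycle reaches a runnable process (-1) within
         * n hops; if we are still on a process after n+1 hops, the chain
         * must have looped back on itself. */
        int p = start, steps = 0;
        while (p >= 0 && steps <= n) {
            p = waits_for[p];
            steps++;
        }
        if (p >= 0) return true;   /* walked n+1 hops: a waiting cycle */
    }
    return false;
}
```

For the two-process example, process 0 (blocked at recv a) waits for process 1, and process 1 (blocked at recv b after send b fails to rendezvous) waits for process 0, so waits_for = {1, 0} and the cycle is reported.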
7 Backends
We have developed a series of backends for the SHIM compiler; each works off a slightly different intermediate representation. First, we developed a code generator that produced single-threaded C [14] for a variant of Tiny-SHIM, which had only point-to-point channels. The runtime system maintained a linked list of runnable processes and, for each channel, tracked which process, if any, was blocked on it. Each process was compiled into a separate C function, which stored its state as a global integer and used a switch statement to restore it. This worked well, although we could improve runtimes by compiling away communication overhead through static scheduling [15]. To handle multi-way rendezvous, exceptions, and recursion on parallel hardware we needed a new technique. Our next backend [25] generated C code that made calls to the POSIX thread library to ask for parallelism. The challenge was to minimize overhead. Each communication action would acquire the lock on a channel, check whether every process connected to it had also blocked (i.e., whether the rendezvous could occur), and then check whether the channel was connected to a poisoned process (i.e., whether a relevant exception had been thrown). All of these checks ran quickly; actual communication and exceptions took longer.
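The switch-on-pc resume style of the single-threaded backend can be illustrated in C. The sketch below is our own, not the compiler's actual output: the process becomes a function whose saved integer program counter selects the resume point, and returning models blocking at a communication.

```c
/* A process that repeatedly "sends" an incrementing value, compiled in the
 * style described above: control state lives in an integer `pc`, and each
 * case label is a point at which the process may block and later resume. */

static int pc;    /* saved program counter of the process (starts at 0) */
static int acc;   /* the process's local variable */

/* Runs the process until it would block on a send; returns the value it
 * hands to the scheduler at that communication point. */
int producer_step(void)
{
    switch (pc) {
    case 0:              /* initial entry: run the prologue */
        acc = 0;
        /* fall through to the communication point */
    case 1:              /* resume point after a completed send */
        acc = acc + 1;
        pc = 1;          /* block at the send; re-enter here next time */
        return acc;      /* "send acc" */
    }
    return -1;           /* unreachable in this sketch */
}
```

Each call advances the process by one loop iteration, so successive calls yield 1, 2, 3, …; a real runtime would interleave such steps for all runnable processes.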
We also developed a backend for IBM's Cell processor [26]. A direct offshoot of the pthreads backend, it allows the user to assign computationally intensive tasks to the Cell's synergistic processing units (SPUs); the remaining tasks run on the Cell's PowerPC core (PPU). Our technique replaces the offloaded functions with wrappers that communicate across the PPU-SPU boundary. Cross-boundary function calls are technically challenging because of data alignment restrictions on function arguments, which we would have preferred to be stack-resident. This, and many other fussy aspects of coding for the Cell, convinced us that such heterogeneous multicore processors demand languages at a higher level than C.
8 Lessons and Open Problems

8.1 Function Calls
Early versions of the language did not support classical software-like function calls. However, these are so useful, even in dataflow-centric descriptions, that they really need to be part of just about any language. We were initially deceived by the rare use of function calls in VHDL and Verilog, but we suspect this is because they do not fit easily into the register-transfer model.

8.2 Two-Way vs. Multi-Way Rendezvous
Initial versions of SHIM used only two-way rendezvous, but after a discussion with Edward Lee, we became convinced that multi-way rendezvous was useful to provide at the language level. Debugging was one motivation: with multi-way rendezvous, it becomes easy to add a monitor that can observe data flowing through a channel; modeling the clock of a synchronous system was another. Unfortunately, implementing multi-way rendezvous is much more complicated than implementing two-way rendezvous, yet we found that most communication in SHIM programs is point-to-point, so we are left with a painful choice: slow down the common case to accommodate the uncommon case, or do aggressive analysis to determine when we can assume point-to-point communication. We would like to return SHIM to point-to-point communication only but provide multi-way rendezvous as a sort of syntactic sugar, e.g., by introducing extra processes responsible for communication on channels. How to do this correctly and elegantly remains an open question, unfortunately.

8.3 Exceptions
Exceptions have been an even more painful feature than multi-way rendezvous. They are extremely convenient from a programming standpoint (e.g., SHIM's rudimentary I/O library wraps each program in an exception to allow it to terminate gracefully; virtually every compiler test case includes at least one exception), but extremely difficult both to implement and to reason about. We have backed away from exceptions for now (all our recent work addresses the exception-free version of SHIM); we see two possibilities for how to proceed.
One is to restrict the use of exceptions so that the complicated case of multiple, concurrent exceptions is simply prohibited. This may preclude some interesting algorithms, but should greatly simplify the implementation, and probably also the analysis, of exceptions. Another alternative is to turn exceptions into syntactic sugar layered on the exception-free SHIM model. We always had this in the back of our minds: an exception would just put a process into an unusual state in which it would communicate its poisoned status to any process that attempts to communicate with it. The problem is that the complexity tends to grow quickly when multiple, concurrent exceptions and scopes are considered. Again, exactly how to translate exceptions into a simpler SHIM model remains an open question.

8.4 Semantics and Static Analysis
We feel we have proven one central hypothesis of the SHIM project: that simple, deterministic semantics helps both programming and automated program analysis. That we have been able to devise truly effective mechanisms for clever code generation (e.g., static scheduling) and analysis (e.g., deadlock detection) that gain deep insight into the behavior of programs vindicates this view. The bottom line: if a programming language does not have simple semantics, it is really hard to analyze its programs quickly or precisely. We have also validated the utility of scheduling independence. Our test suite, which consists of many parallel programs, produces reproducible results that let us sleep at night. We have found few cases where the approach has limited us. Algorithms with a large number of little, variable-sized, but independent pieces of work do not mesh well with SHIM's scheduling-independent philosophy as it currently stands. The obvious way to handle them is to maintain a bucket of tasks and assign a task to a processor whenever it has finished its last one. The order in which the tasks are performed therefore depends on their relative execution rates, but this does not matter if the tasks are independent. It would be possible to add scheduling-independent task distribution and scheduling to SHIM (i.e., provided the tasks are truly independent or, equivalently, confluent); exactly how is an open research question.

8.5 Buffers
That buffering is mandatory for high-performance parallel applications is hardly a revelation; we confirmed it anyway. The SHIM model has always been able to implement FIFO buffers (e.g., Figure 3), but we have come to realize that they are sufficiently fundamental to deserve being a first-class type in the language. We are currently working on a variant of the language that replaces pure rendezvous communication with bounded, buffered communication. Because buffers will be part of the language, they will be easier to map to unusual environments, such as the DMA mechanism for inter-core communication on the Cell processor.
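A bounded, buffered channel of the kind described above can be sketched as a fixed-capacity ring buffer. This is our guess at the flavor of the semantics, not the language's final design: a writer would block only when the buffer is full, a reader only when it is empty (here, "would block" is modeled by returning false).

```c
#include <stdbool.h>

#define CAP 4   /* assumed fixed channel capacity */

/* A bounded channel as a ring buffer: head is the oldest element,
 * count the number of buffered values. */
typedef struct {
    int data[CAP];
    int head, count;
} chan;

bool chan_send(chan *c, int v)      /* fails (would block) when full */
{
    if (c->count == CAP) return false;
    c->data[(c->head + c->count) % CAP] = v;
    c->count++;
    return true;
}

bool chan_recv(chan *c, int *out)   /* fails (would block) when empty */
{
    if (c->count == 0) return false;
    *out = c->data[c->head];
    c->head = (c->head + 1) % CAP;
    c->count--;
    return true;
}
```

Rendezvous is then the degenerate CAP = 1 case in which the sender also waits for the value to be consumed.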
8.6 Other Applications
The most likely future role of SHIM will be as inspiration for other languages. For example, Vasudevan has ported its communication model into the Haskell functional language [27] and proposed a compiler that would impose its scheduling-independent view of the world on arbitrary programs [28]. Certain SHIM ideas, such as scheduling analysis [29], have also been used in IBM's X10 language.
Acknowledgments. Many have contributed to SHIM. Olivier Tardieu created the formal semantics, devised the exception mechanism, and instigated endless (constructive) arguments. Jia Zeng developed the static recursion removal algorithm. Nalini Vasudevan has pushed SHIM in many new directions; Baolin Shao has just started pushing. The NSF has supported the SHIM project under grant 0614799.
References

1. Edwards, S.A.: Experiences teaching an FPGA-based embedded systems class. In: Proceedings of the Workshop on Embedded Systems Education (WESE), Jersey City, New Jersey, September 2005, pp. 52–58 (2005)
2. Edwards, S.A.: SHIM: A language for hardware/software integration. In: Proceedings of SYNCHRON, Schloss Dagstuhl, Germany (December 2004)
3. Berry, G., Gonthier, G.: The Esterel synchronous programming language: Design, semantics, implementation. Science of Computer Programming 19(2), 87–152 (1992)
4. Benveniste, A., Caspi, P., Edwards, S.A., Halbwachs, N., Le Guernic, P., de Simone, R.: The synchronous languages 12 years later. Proceedings of the IEEE 91(1), 64–83 (2003)
5. van Berkel, K.: Handshake Circuits: An Asynchronous Architecture for VLSI Programming. Cambridge University Press, Cambridge (1993)
6. Kahn, G.: The semantics of a simple language for parallel programming. In: Information Processing 74: Proceedings of IFIP Congress 74, Stockholm, Sweden, pp. 471–475. North-Holland, Amsterdam (1974)
7. Edwards, S.A., Tardieu, O.: Deterministic receptive processes are Kahn processes. In: Proceedings of the International Conference on Formal Methods and Models for Codesign (MEMOCODE), Verona, Italy, July 2005, pp. 37–44 (2005)
8. Tardieu, O., Edwards, S.A.: Specifying confluent processes. Technical Report CUCS-037-06, Columbia University, Department of Computer Science, New York, USA (September 2006)
9. Buck, J.T.: Scheduling Dynamic Dataflow Graphs with Bounded Memory Using the Token Flow Model. PhD thesis, University of California, Berkeley (1993); available as UCB/ERL M93/69
10. Parks, T.M.: Bounded Scheduling of Process Networks. PhD thesis, University of California, Berkeley (1995); available as UCB/ERL M95/105
11. Hoare, C.A.R.: Communicating sequential processes. Communications of the ACM 21(8), 666–677 (1978)
12. Lin, B.: Software synthesis of process-based concurrent programs. In: Proceedings of the 35th Design Automation Conference, San Francisco, California, June 1998, pp. 502–505 (1998)
Concurrency and Communication: Lessons from the SHIM Project
Location-Aware Web Service by Utilizing Web Contents Including Location Information YongUk Kim, Chulbum Ahn, Joonwoo Lee, and Yunmook Nah Department of Computer Science and Engineering, Dankook University, 126 Jukjeon-dong, Suji-gu, Yongin-si, Gyeonggi-do, 448-701, Korea {yukim,ahn555,jwlee}@dblab.dankook.ac.kr, [email protected]
Abstract. Traditional search engines are usually based on keyword retrieval, where location information is simply treated as text data, resulting in incorrect search results and a low degree of user satisfaction. In this paper, we propose a location-aware Web Service system that adds location information to web contents, which usually consist of text and multimedia information. For this purpose, we describe the system architecture needed to enable such services, explain how to extend web browsers, and propose the web container and the web search engine. The proposed methods can be implemented on top of traditional Web Service layers. Web contents that include location information can use it as a parameter during the search process and can therefore increase search correctness by using actual location information instead of simple keywords.

Keywords: Location-Based Service, Web Service, GeoRSS, search engine, GeoWEB.
Search engines usually return huge volumes of unnecessary sites, because those sites merely contain the same keyword as the given positional information. In this paper, we propose how to extend web contents so that their own location information can be built into them. The proposed method prevents location information from being treated as simple text and allows only the web contents strongly related to the given location to be retrieved. Here, web contents means the general extension of HTML/CSS contents [2]. The proposed system can collect location information from extended web contents and search for and provide web contents related to the given query location. To support and utilize location-aware web contents, we propose how to extend web browsers and how to build web containers and web search engines.

The remainder of this paper is organized as follows. The common problems of current search systems are described in Section 2. Section 3 shows the overall structure of the location-aware Web Service system and describes the detailed structures and behaviors of the proposed system. Section 4 describes some implementation details and experimental results. Finally, Section 5 concludes the paper.
2 Overview

In location-related web contents retrieval by keyword matching, the location of the contents is included in the search keyword or input form, and the query is processed by simple keyword matching. Consider the query 'Kangnam station BBQ house.' The term 'Kangnam station' is positional information describing the location of the contents, and 'BBQ house' is a normal query term. In current search engines, the terms 'Kangnam station' and 'BBQ house' are both handled as text, and documents including both query terms are included in the search result. The term 'Kangnam station' has a meaning related to position, but it is treated as a simple keyword, without any special consideration of its meaning during search processing. In current web contents services, there is no difference between location-related web contents retrieval and keyword-based retrieval.

The problems become more severe when multiple location-related keywords are randomly contained in keyword-based retrievals. In such cases, it is very difficult to eliminate unrelated documents. For example, documents about Subway Line 2 will contain the names of its 44 subway stations, and such documents can be treated as having 44 location-related keywords even though they are not directly related to 44 specific locations. Therefore, for a query asking about one of the 44 stations of Subway Line 2, search engines will return all documents containing information about the line, resulting in incorrect search results and a low degree of user satisfaction.

Local search services are services provided by content providers that show contents related to a specific location on a map. Portal sites, such as Naver and Daum, provide such services by showing a map on one side of the screen while listing neighboring stores on the other side.
The listed stores are the ones that have direct contracts with the portal sites. When users select a link, summary information containing the store name, address, and telephone number of the selected store appears in a pop-up layer [3, 4].
The local search service of Naver shows the map of 'Kangnam station' on the left side of the screen and displays the store list, with telephone numbers and grades, in alphabetical order with balloon symbols on the right side of the screen. When users select a link, review information is displayed. The search results consist of information provided by contents providers, and only short summary information is provided instead of full web contents. The contents of the search results depend on the contents providers, and, therefore, the information provided by local search services is not sufficient for users in terms of volume and quality. To relieve this problem, some providers also provide links to review information posted by users in blog services. But the main problem of this approach is that it only provides information intentionally prepared by the contents providers, and it shows only a quick summary and reviews instead of full web pages.

Research on the Geospatial Web (or Geoweb), which means the combination of location-based information and the Web, was started in 1994 by the U.S. Army Construction Engineering Research Laboratory [5]. This research can be divided into subareas, such as geo-coding, geo-parsing, GCMS (Geospatial Contents Management Systems), and retrieval. Geo-coding is a technique that lets other tools verify the location of a document by recording the location information within the document, as shown in Figure 1. For geo-coding, the EXIF information of image files, the meta information of web sites, and meta tag information on text or picture data can be used [6, 7].

GPS Latitude  : 57 deg 38' 56.83" N
GPS Longitude : 10 deg 24' 26.79" W
GPS Position  : 57 deg 38' 56.83" N, 10 deg 24' 26.79" W

Fig. 1. Example of EXIF information of a JPEG picture
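As a concrete illustration of the geo-coding step, the EXIF values shown in Figure 1 use degree/minute/second notation, which must be converted to signed decimal degrees (south and west negative) before spatial queries can use them. The following Python sketch is ours, not part of the paper's system; the function name and the regular expression are assumptions about the EXIF string format.

```python
import re

def dms_to_decimal(dms):
    """Convert an EXIF-style string like `57 deg 38' 56.83" N`
    to signed decimal degrees (S and W hemispheres are negative)."""
    m = re.match(r"""(\d+)\s*deg\s*(\d+)'\s*([\d.]+)"\s*([NSEW])""", dms)
    if not m:
        raise ValueError("unrecognized DMS string: %r" % dms)
    deg, minutes, seconds, hemi = m.groups()
    value = int(deg) + int(minutes) / 60.0 + float(seconds) / 3600.0
    return -value if hemi in ("S", "W") else value

# The two coordinates from Figure 1:
lat = dms_to_decimal("57 deg 38' 56.83\" N")   # ~57.6491
lon = dms_to_decimal("10 deg 24' 26.79\" W")   # ~-10.4074
```

In a real pipeline these values would be read from the image's EXIF block (e.g. with an EXIF library) rather than from strings.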
Geo-parsing is a technique that translates location information in text into real spatial coordinates, thus enabling spatial query processing [8, 9]. For example, if a user inputs 'Hannam-dong Richensia North 50m', that location information is parsed into a real coordinate. Geospatial Contents Management Systems support location information by extending traditional CMS. Geospatial Web technologies support coordinate-based and region-based retrieval, and they provide location information for individual documents or domains. However, these technologies depend on specific documents or specific domains, and they focus on region-based retrieval.
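The geo-parsing step described above can be approximated, in its simplest form, by a gazetteer lookup that maps known place names found in text to coordinates. This Python sketch is purely illustrative; the gazetteer entries, coordinates, and function name are our own placeholders, not the paper's data or implementation.

```python
# Toy gazetteer: place name -> (lat, lon) in WGS84. Illustrative values only.
GAZETTEER = {
    "Kangnam station": (37.4979, 127.0276),
    "Hannam-dong": (37.5340, 127.0026),
}

def geo_parse(text):
    """Return (place, (lat, lon)) for the first gazetteer place name
    appearing in `text`, or None if no entry matches."""
    for place, coord in GAZETTEER.items():
        if place in text:
            return place, coord
    return None

hit = geo_parse("Hannam-dong Richensia North 50m")
```

A production geo-parser would additionally resolve offsets such as 'North 50m' and disambiguate place names by context, which this sketch does not attempt.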
3 Structure of Location-Aware Web Service Systems

The overall structure of the system, which changes and serves general web contents according to location information, is shown in Figure 2. The system consists of the Client, the Web Container, and the Search Engine.

The Web Container module manages and transfers web contents with location information in order to provide location-aware web contents to users. It supports both static pages and dynamic web applications. It actively notifies the Search Engine of content updates made by web applications using pingback methods. The correctness of the information returned by the Search Engine is improved by this mechanism.
Location-Aware Web Service by Utilizing Web Contents
291
Fig. 2. The location-aware Web Service system structure
The Search Engine module collects information from the Web Container and allows users to retrieve location-aware web contents. It recognizes changes in dynamic web contents by using the Pingback Retriever and periodically collects static documents by invoking Web Crawlers via the Timer. The information is updated by the Information Updater. Location-based queries are delivered through the Query Handler, and the Information Finder finally processes these queries.

The Client is the module through which users use location-aware web contents, and it provides tools to support easy retrieval. It is an extension of common web browsers and includes modules such as the Map Generator, Location Information Parser, Query Generator, and Location Detector. It can show the location of contents intuitively to users by using the Location Information Parser and the Map Generator. The Query Generator allows users to perform location-aware retrieval more conveniently. The Location Detector is a module that captures location information by utilizing external hardware or Web Services. It can get location information directly by using GPS or devices supporting a Global Navigation Satellite System, such as Beidou [10]. It can also get location information indirectly by using Web Services supporting the Geolocation API specification [13], such as Mozilla Firefox Geode [11] and Yahoo FireEagle [12]. Currently, utilization of the Location Detector is not high because general PCs do not provide location information. However, it will be possible to provide more exact location information when IP and GPS information are more commonly available.

3.1 Web Contents Extension to Support Location-Aware Information

In the specifications of traditional Web contents standards, such as HTML and XHTML [14], markups to specify location information are not included, and the
namespaces for such an extension are not predetermined either. Therefore, the format of web contents needs to be extended to deal with location-aware information. One method is to add new markups to the HTML/XHTML format; another is to include links to external documents. In previous cases, the formats of web contents were usually extended by linking external documents using the <link> tag [15]. However, the standards committees have recently become more conservative, and they are eliminating indiscriminately added markup tags, such as <marquee> and <embed>, in order to manage namespaces cleanly [16]. A method that adds new markup tags would therefore face difficulties in maintaining future web contents. In this paper, we instead extend the <link> markup tag to include location information in GeoRSS [17] format, as shown in Figure 3.

<georss:where>
  <gml:Point>
    <gml:pos>45.256 -71.92</gml:pos>
  </gml:Point>
</georss:where>

Fig. 3. Location information format
The <where> markup tag, in the 'georss' namespace, represents the object containing the location information. The markup tags located within the <where> tag describe the location using geospatial languages, such as GML (Geography Markup Language) [18]. Documents holding location information are referenced within HTML/XHTML documents using the <link> markup tag, and the file type of such documents is 'application/geoweb.' Documents with this file type are interpreted as data in GeoRSS format. We represent coordinates using the WGS84 format [19], which is one of the coordinate formats supported by GeoRSS. The WGS84 format is consistent with the GPS format, so it can be used effectively to support interoperability with mobile devices. It can also be easily transformed into the KATECH TM128 coordinate system developed by the National Geographic Institute.

3.2 Web Client

The Web Client provides facilities for web contents retrieval and visually shows documents stored in the Web Container along with their location information. Figure 4 shows the interaction between the Web Client and the Web Container. The Web Client, an extension of a web browser, consists of the HTML Renderer, the Location Information Parser, and the Map Generator, as shown in Figure 4. When a user provides the URI of contents (1 of Figure 4), the HTML Renderer sends a request to, and receives the required contents from, the Web Container (2 and 3). The Web Client then receives the location information (5 and 6) and generates the appropriate map (7, 8, 9). Steps 5 to 9 are repeated if there are more external links in the web contents.
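The core task of the Location Information Parser, extracting the WGS84 coordinate pair from a GeoRSS <where> element like the one in Figure 3, can be sketched with Python's standard xml.etree module. The namespace URIs below are the standard GeoRSS and GML ones; the helper name is our own, not part of the paper's system.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for the standard GML namespace.
GML = "{http://www.opengis.net/gml}"

# A GeoRSS <where> fragment such as Figure 3's, with namespaces declared.
FRAGMENT = """\
<georss:where xmlns:georss="http://www.georss.org/georss"
              xmlns:gml="http://www.opengis.net/gml">
  <gml:Point><gml:pos>45.256 -71.92</gml:pos></gml:Point>
</georss:where>"""

def extract_point(xml_text):
    """Return the (lat, lon) pair from a GeoRSS-GML <where> element.
    GeoRSS/GML lists latitude first in <gml:pos>."""
    where = ET.fromstring(xml_text)
    pos = where.find(GML + "Point/" + GML + "pos")
    lat, lon = pos.text.split()
    return float(lat), float(lon)
```

In the full system, the fragment would first be located via the 'application/geoweb' link from the HTML page rather than supplied inline.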
Fig. 4. Basic operations of the Web Client (sequence diagram: the User types a URI (1); the HTML Renderer requests and receives the web page from the Web Container (2, 3) and asks for map information (4); for each external link, the Location Information Parser requests and receives the location (5, 6), and the Map Generator is asked for, generates, and returns a map (7, 8, 9); finally, all contents are rendered and returned to the User (10, 11))
Fig. 5. Information collection by the Search Engine (sequence diagram: the Web Container notifies the Pingback Retriever of an update (1); the Web Crawler is asked to crawl (2), requests the web contents (3), receives the HTML page and location information (4), and passes them to the Information Updater (5); in a periodic loop, the crawler requests the changed set (6), receives HTML/XHTML and location information (7), and updates the database (8))
3.3 Search Engine

When information is updated by web applications, the Web Container notifies the Search Engine (1 of Figure 5). The Pingback Retriever then asks the Web Crawler to check the updated web contents. The Web Crawler visits the Web Container to fetch the updated information (2, 3, 4) and updates the information in the database (5). The Web Crawler also continuously visits the Web Container, checks for newly updated information, and downloads such updates according to a timer event (6, 7, 8).
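The two update paths of Figure 5, a pingback-driven update for dynamic contents and a timer-driven crawl of a changed set, can be sketched as follows. The class and method names are illustrative assumptions, not the paper's implementation; the Web Container is simulated by a plain dictionary.

```python
class SearchIndex:
    """Stands in for the Search Engine's database."""
    def __init__(self):
        self.docs = {}  # uri -> (html, (lat, lon))

    def update(self, uri, html, location):
        self.docs[uri] = (html, location)

class WebCrawler:
    def __init__(self, container, index):
        self.container = container  # dict simulating the Web Container
        self.index = index

    def crawl(self, uri):
        # Steps 2-5 of Figure 5: fetch contents plus location, then update.
        html, location = self.container[uri]
        self.index.update(uri, html, location)

    def crawl_changed(self, changed_uris):
        # Steps 6-8: a periodic timer event pulls the changed set.
        for uri in changed_uris:
            self.crawl(uri)

class PingbackRetriever:
    def __init__(self, crawler):
        self.crawler = crawler

    def notify_update(self, uri):
        # Step 1: the Web Container announces an update via pingback.
        self.crawler.crawl(uri)

# Minimal walk-through of the pingback path:
container = {"http://example.org/a": ("<html>A</html>", (37.5, 127.0))}
index = SearchIndex()
pingback = PingbackRetriever(WebCrawler(container, index))
pingback.notify_update("http://example.org/a")
```

The real system would fetch over HTTP and parse the location from the page's GeoRSS link rather than receive it pre-parsed, but the control flow is the same.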
The Search Engine can handle keyword-based queries, coordinate-and-keyword-based queries, location-and-keyword-based queries, and keyword-plus-keyword-based queries.
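A coordinate-and-keyword query of the kind listed above amounts to keeping documents that contain the keyword and lie within some radius of the query point. This sketch, with illustrative documents and coordinates of our own choosing, uses the haversine great-circle distance over WGS84 coordinates:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) WGS84 points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def location_keyword_query(docs, keyword, center, radius_km):
    """Return URIs of documents containing `keyword` within `radius_km`."""
    return [uri for uri, (text, coord) in docs.items()
            if keyword in text and haversine_km(center, coord) <= radius_km]

# Illustrative index: URI -> (text, (lat, lon)).
docs = {
    "A": ("Kangnam station BBQ house page", (37.4985, 127.0280)),
    "B": ("BBQ house on Subway Line 2", (37.5500, 126.9700)),
    "C": ("Kangnam station flower shop", (37.4980, 127.0275)),
}
center = (37.4979, 127.0276)  # query coordinate near 'Kangnam station'
```

Here, document B matches the keyword but is several kilometers away, so a 1 km radius excludes it, which is exactly the filtering the paper's location-aware retrieval performs that plain keyword matching cannot.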
4 Implementation and Experiments

Both the Web Container and the Search Engine are implemented on a single computer with an AMD Athlon 64 Dual Core 4000+, 1 GB of memory, Ubuntu (4.1.2-16) with the Linux 2.6.22-14-server kernel, Apache 2, and MySQL 5.0.45. The Web Client is developed on a computer with a Pentium 4, 1 GB of memory, Windows XP, and Firefox 3.0.8. The proposed functions are implemented using NPAPI [20] and extended features [21]. We compared the results of location-and-keyword-based queries on general search engines and on the proposed Search Engine. The query asks for information related to 'Subway Line 2.' We first obtained results using general search engines and then obtained improved results using the location-aware web contents search engine.

Table 1. Filtering rate and error rate
                  Filtering rate   Error rate
N search engine       66.67%          3.33%
D search engine       47.22%         10.56%
Y search engine       53.33%         15.56%
As shown in Table 1, 66.67% of the unnecessary search results can be eliminated with the N search engine. However, wrong answers (3.33%) still remain that cannot be eliminated by location information, because they contain the given keywords in meaningless contexts. The most common meaningless results were caused by keywords appearing in the titles of bulletin boards included in the web pages.
5 Conclusion

In this paper, we proposed a location-aware Web Service system that adds location information to traditional web contents. We explained the overall structure of the location-aware Web Service system, which consists of the Web Client, the Web Container, and the Search Engine, and described the detailed structures and behaviors of the proposed system. The Web Container module manages and transfers web contents with location information in order to provide location-aware web contents to users. The Search Engine module collects information from the Web Container and allows users to retrieve location-aware web contents. The Web Client provides facilities for web contents retrieval and visually presents documents stored in the Web Container along with their location information. The proposed methods can be implemented on top of traditional Web Service layers. Web contents that include location information can use it as a parameter during the search process and can therefore increase search correctness by using actual location information instead of simple keywords. To show the usefulness of our schemes, some experimental results were briefly provided.
Location-Aware Web Service by Utilizing Web Contents
295
Acknowledgments. This research was supported by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center support program supervised by the Institute of Information Technology Advancement (grant number IITA-2008-C1090-0801-0031). This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant number R01-2007-000-20958-0 funded by the Korea government (MOST). This research was also supported by Korea SW Industry Promotion Agency (KIPA) under the program of Software Engineering Technologies Development and Experts Education.
References

1. Millard, D., Ross, M.: Web 2.0: Hypertext by Any Other Name? In: Proc. ACM Conference on Hypertext and Hypermedia, pp. 22–25. ACM Press, New York (2006)
2. HTML Spec., http://www.w3.org/TR/REC-html40/
3. Daum local information, http://local.daum.net/
4. Naver local information, http://local.naver.com/
5. An Architecture for Cyberspace: Spatialization of the Internet, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.4604
6. geo-extension-nonWGS84, http://microformats.org/wiki/geo-extension-strawman
7. Geographic registration of HTML documents, http://tools.ietf.org/id/draft-daviel-html-geo-tag-08.txt
8. NGA GEOnet Names Server, http://earth-info.nga.mil/gns/html/
9. U.S. Board on Geographic Names, http://geonames.usgs.gov/domestic/index.html
10. Beidou, http://www.globalsecurity.org/space/world/china/beidou.htm
11. Mozilla Firefox Geode, http://labs.mozilla.com/2008/10/introducing-geode/
12. Yahoo FireEagle, http://fireeagle.yahoo.net
13. Geolocation API Spec., http://dev.w3.org/geo/api/spec-source.html
14. XHTML Spec., http://www.w3.org/TR/xhtml11/
15. Important change to the LINK tag, http://diveintomark.org/archives/2002/06/02/important_change_to_the_link_tag
16. How to upgrade markup code in specific cases: <embed>,