SERIES ON SCALABLE COMPUTING - VOL 3
ANNUAL REVIEW OF SCALABLE COMPUTING Editor
Yuen Chung Kwong
SINGAPORE UNIVERSITY PRESS World Scientific
ANNUAL REVIEW OF SCALABLE COMPUTING
SERIES ON SCALABLE COMPUTING
Editor-in-Chief: Yuen Chung Kwong (National University of Singapore)

ANNUAL REVIEW OF SCALABLE COMPUTING
Published:
Vol. 1: ISBN 981-02-4119-4
Vol. 2: ISBN 981-02-4413-4
SERIES ON SCALABLE COMPUTING - VOL. 3
ANNUAL REVIEW OF SCALABLE COMPUTING
Editor
Yuen Chung Kwong School of Computing, NUS
SINGAPORE UNIVERSITY PRESS NATIONAL UNIVERSITY OF SINGAPORE
World Scientific
Singapore • New Jersey • London • Hong Kong
Published by
Singapore University Press
Yusof Ishak House, National University of Singapore
31 Lower Kent Ridge Road, Singapore 119078
and
World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 912805
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
ANNUAL REVIEW OF SCALABLE COMPUTING Series on Scalable Computing — Vol. 3 Copyright © 2001 by Singapore University Press and World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4579-3
Printed in Singapore by Fulsland Offset Printing
EDITORIAL BOARD

Editor-in-Chief
Chung-Kwong Yuen
National University of Singapore

Coeditor-in-Chief
Kai Hwang
Dept of Computer Engineering
University of Southern California
University Park Campus
Los Angeles, California 90089, USA

Editors
Amnon Barak
Dept of Computer Science
Hebrew University of Jerusalem
91905 Jerusalem, Israel

Willy Zwaenepoel
Dept of Computer Science
Rice University, 6100 Main Street
Houston, TX 77005-1892, USA

Jack Dennis
Laboratory for Computer Science
MIT, Cambridge, MA 02139-4307, USA
Proposals to provide articles for forthcoming issues of Annual Review of Scalable Computing should be sent to:
C K Yuen
School of Computing
National University of Singapore
Kent Ridge, Singapore 119260
email:
[email protected]
For the Year 2002 volume, draft versions should be sent by August 2001; final electronic versions (in prescribed LaTeX format) by December 2001.
Preface

As the Annual Review goes into its third volume, I am able to feel increasingly positive that the series has found its own niche in the publishing environment for Scalable Computing. In the busy world of technology, research and development, few authors have the time to write lengthy review articles of the type we are interested in, but there are also rather few venues for such material. Once the Annual Review establishes a presence and an archive of articles, it can have the effect of encouraging authors to seriously consider writing such material, knowing that they do have a suitable outlet. The present volume presents four papers from authors in Europe and the USA, with an additional article of my own to bring the volume to the desired size. In view of the volume's reduced dependence on material from Asian authors and the articles' coverage of less abstract topics, in comparison with the previous two volumes, the Annual Review shows the potential of widening its appeal. With further determined efforts from our authors, editors and publishers, we can make the series a highly valuable resource in the field of Scalable Computing.
Chung-Kwong Yuen
National University of Singapore
January 2001
Contents
1 Anatomy of a Resource Management System for HPC Clusters
1.1 Introduction
1.1.1 Computing Center Software (CCS)
1.1.2 Target Platforms
1.1.3 Scope and Organisation of this Chapter
1.2 CCS Architecture
1.2.1 User Interface
1.2.2 User Access Manager
1.2.3 Scheduling and Partitioning
1.2.4 Access and Job Control
1.2.5 Performance Issues
1.2.6 Fault Tolerance
1.2.7 Modularity
1.3 Resource and Service Description
1.3.1 Graphical Representation
1.3.2 Textual Representation
1.3.3 Dynamic Attributes
1.3.4 Internal Data Representation
1.3.5 RSD Tools in CCS
1.4 Site Management
1.4.1 Center Resource Manager (CRM)
1.4.2 Center Information Server (CIS)
1.5 Related Work
1.6 Summary
1.7 Bibliography

2 On-line OCM-Based Tool Support for Parallel Applications
2.1 Introduction
2.2 OMIS as Basis for Building Tool Environment
2.3 Adapting the OCM to MPI
2.3.1 Handling Applications in MPI vs. PVM
2.3.2 Starting-up MPI Applications
2.3.3 Flow of Information on Application
2.3.4 Detection of Library Calls
2.4 Integrating the Performance Analyzer PATOP with the OCM
2.4.1 PATOP's Starter
2.4.2 Prerequisites for Integration of PATOP with the OCM
2.4.3 Gathering Performance Data with the OCM
2.4.4 New Extension to the OCM - PAEXT
2.4.5 Modifications to the ULIBS Library
2.4.6 Costs and Benefits of using the Performance Analysis Tool
2.5 Adaptation of PATOP to MPI
2.5.1 Changes in the Environment Structure
2.5.2 Extensions to ULIBS
2.5.3 MPI-Specific Enhancements in PATOP
2.5.4 Monitoring Overhead Test
2.6 Interoperability within the OCM-Based Environment
2.6.1 Interoperability
2.6.2 Interoperability Support in the OCM
2.6.3 Interoperability in the OCM-Based Tool Environment
2.6.4 Possible Benefits of DETOP and PATOP Cooperation
2.6.5 Direct Interactions
2.7 A Case Study
2.8 Concluding Remarks
2.9 Bibliography

3 Task Scheduling on NOWs using Lottery-Based Work Stealing
3.1 Introduction
3.2 The Cilk Programming Model and Work Stealing Scheduler
3.2.1 Java Programming Language and the Cilk Programming Model
3.2.2 Lottery Victim Selection Algorithm
3.3 Architecture and Implementation of the Java Runtime System
3.3.1 Architecture of the Java Runtime System
3.3.2 Implementation of the Java Runtime System
3.4 Performance Evaluation
3.4.1 Applications
3.4.2 Results and Discussion
3.5 Conclusions
3.6 Bibliography

4 Transaction Management in a Mobile Data Access System
4.1 Introduction
4.2 Multidatabase Characteristics
4.2.1 Taxonomy of Global Information Sharing Systems
4.2.2 MDBS and Node Autonomy
4.2.3 Issues in Multidatabase Systems
4.2.4 MDAS Characteristics
4.2.5 MDAS Issues
4.3 Concurrency Control and Recovery
4.3.1 Multidatabase Transaction Processing: Basic Definitions
4.3.2 Global Serialisability in Multidatabases
4.3.3 Multidatabase Atomicity/Recoverability
4.3.4 Multidatabase Deadlock
4.3.5 MDAS Concurrency Control Issues
4.4 Solutions to Transaction Management in Multidatabases
4.4.1 Global Serializability under Complete Local Autonomy
4.4.2 Solutions using Weaker Notions of Consistency
4.4.3 Solutions Compromising Local Autonomy
4.4.4 Using Knowledge of Component Databases
4.4.5 Global Serializability Based on Transaction Semantics
4.4.6 Solutions under MDAS
4.4.7 Solutions to Global Atomicity and Recoverability
4.5 Application Based and Advanced Transaction Management
4.5.1 Unconventional Transactions Types
4.5.2 Advanced Transaction Models
4.5.3 Replication
4.5.4 Replication Solutions in MDAS
4.6 Experiments with V-Locking Algorithm
4.6.1 Simulation Studies
4.6.2 System Parameters
4.6.3 Simulation Results
4.7 Conclusion
4.8 Bibliography

5 Architecture Inclusive Parallel Programming
5.1 Introduction
5.1.1 Architecture Independence - The Holy Grail
5.1.2 Shared Memory Versus Distributed Systems
5.1.3 Homogeneous Versus Heterogeneous Systems
5.1.4 Architecture Independence Versus Inclusiveness
5.2 Concurrency
5.2.1 Threads and Processes
5.2.2 Exclusion and Synchronization
5.2.3 Atomicity
5.2.4 Monitors and Semaphores
5.3 Data Parallelism
5.3.1 Vector Processors
5.3.2 Hypercubes
5.3.3 PRAM Algorithms
5.4 Memory Consistency
5.4.1 Shared Memory System
5.4.2 Tuplespace
5.4.3 Distributed Processing
5.4.4 Distributed Shared Memory
5.5 Tuple Locks
5.5.1 The Need for Better Locks
5.5.2 Tuple Locks
5.5.3 Using Tuple Locks
5.5.4 Bucket Location
5.5.5 Homogeneous Systems
5.6 Using Tuple Locks in Parallel Algorithms
5.6.1 Gaussian Elimination
5.6.2 Prime Numbers
5.6.3 Fast Fourier Transform
5.6.4 Heap Sort
5.7 Tuples in Objects
5.7.1 Objects and Buckets
5.7.2 An Example
5.7.3 Reflective Objects
5.8 Towards Architecture Inclusive Parallel Programming
5.8.1 Parallel Tasking
5.8.2 Speculative Processing
5.8.3 Efficient Implementation of Tuple Operations
5.8.4 Tuple and Bucket Programming Styles
5.8.5 Back Towards Architecture Independence
Chapter 1
Anatomy of a Resource Management System for HPC Clusters
AXEL KELLER
Paderborn Center for Parallel Computing (PC2)
Fürstenallee 11, D-33102 Paderborn, Germany
[email protected]
ALEXANDER REINEFELD
Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)
Takustr. 7, D-14195 Berlin-Dahlem, Germany
[email protected]
1.1 Introduction
A resource management system is a portal to the underlying computing resources. It allows users and administrators to access and manage various computing resources like processors, memory, network, and permanent storage. With the current trend towards heterogeneous grid computing [17], it is important to separate the resource management software from the underlying hardware by introducing an abstraction layer between the hardware and the system management. This facilitates the management of distributed resources in grid computing environments as well as in local clusters with heterogeneous components. In Beowulf clusters [34], where multiple sequential jobs are concurrently executed in high-throughput mode, the resource management task may be as simple
as distributing the tasks among the compute nodes such that all processors are about equally loaded. When using clusters as dedicated high-performance computers, however, the resource management task is complicated by the fact that parallel applications should be mapped and scheduled according to their communication characteristics. Here, efficient resource management becomes more important and also more visible to the user than the underlying operating system. The resource management system is the first access point for launching an application. It is responsible for the management of all resources, including setup and cleanup of processes. The operating system comes into play only at runtime, when processes and communication facilities must be started. As a consequence, resource management systems have evolved from the early queuing systems towards complex distributed environments for the management of clusters and high-performance computers. Many of them support space-sharing (i.e. exclusive access) and time-sharing mode, and some additionally provide hooks for WAN metacomputer access, thereby allowing distributed applications to be run in a grid computing environment.
1.1.1 Computing Center Software (CCS)
On the basis of our Computing Center Software CCS [24], [30] we describe the anatomy of a modern resource management system. CCS has been designed for the user-friendly access and system administration of parallel high-performance computers and clusters. It supports a large number of hardware and software platforms and provides a homogeneous, vendor-independent user interface. For system administrators, CCS provides mechanisms for specifying, organizing and managing various high-performance systems that are operated in a computing service center. CCS originates from the transputer world, where massively parallel systems with up to 1024 processors had to be managed [30] by a single resource management software. Later, the design was changed to also support clusters and grid computing. With the fast innovation rate in hardware technology, we also saw the need to encapsulate the technical aspects and to provide a coherent interface to the user and the system administrator. Robustness, portability, extensibility, and the efficient support of space-sharing systems have been among the most important design criteria. CCS is based on three elements:
• a hierarchical structure of autonomous "domains", each of them managed by a dedicated CCS instance,
• a tool for specifying hardware and software components in a (grid-) computing environment,
• a site management layer which coordinates the local CCS domains and supports multi-site applications and grid computing.
Over the years, CCS has been re-implemented several times to improve its structure and implementation. In its current version V4.03 it comprises about 120,000 lines of code. While this may sound like a lot of code, it is necessary because CCS is in itself a distributed software. Its functional units have been kept modular to allow easy adaptation to future environments.
1.1.2 Target Platforms
CCS has been designed for a variety of hardware platforms ranging from massively parallel systems up to heterogeneous clusters. CCS runs either on a frontend node or on the target machine itself. The software is distributed in itself. It runs on a variety of UNIX platforms including AIX, IRIX, Linux, and Solaris. CCS has been in daily use at the Paderborn computing center for almost a decade now. It provides access to three parallel Parsytec computers, which are all operated in space-sharing mode: a 1024 processor transputer system with a 32x32 grid of T805 links, a 64 processor PowerPC 604 system with a fat grid topology, and a 48 node PowerPC system with a mesh of Clos topology of HIC (IEEE Std. 1355-1995) interconnects. CCS is also used for the management of two PC clusters. Both have a fast SCI [21] network with a 2D torus topology. The larger of the two clusters is a Siemens hpcLine [22] with 192 Pentium II processors. It has all the typical features of a dedicated high-performance computer and is therefore operated in multi-user space-sharing mode. The hpcLine is also embedded in the EGrid testbed [14] and is accessible via the Globus software toolkit [16]. For the purposes of this chapter we use our SCI clusters as an example to demonstrate some additional capabilities of CCS. Of course, CCS can also be used for managing any other Beowulf cluster with any kind of network like FE, GbE, Myrinet, or Infiniband.
1.1.3 Scope and Organisation of this Chapter
CCS is used in this chapter as an example to describe the concepts of modern resource management systems. The length of the chapter reflects the complexity of such a software package: each module is described from a user's and an implementor's point of view. The content of this chapter is as follows: in Section 1.2, we present the architecture of a local CCS Domain and focus on scheduling and partitioning, access and job control, scalability, reliability, and modularity. In Section 1.3, we introduce the second key component of CCS, the Resource and Service Description (RSD) facility. Section 1.4 presents the site management tools of CCS. We conclude the chapter with a review of related work and a brief summary.
1.2 CCS Architecture
A CCS Domain (Fig. 1.1) has six components, each containing several modules or daemons. They may be executed asynchronously on different hosts to improve the CCS response time.
Figure 1.1. Interaction between the CCS components.
• The User Interface (UI) provides a single access point to one or more systems via an X-window or ASCII interface.
• The Access Manager (AM) manages the user interfaces and is responsible for authentication, authorization, and accounting.
• The Queue Manager (QM) schedules the user requests onto the machine.
• The Machine Manager (MM) provides an interface to machine-specific features like system partitioning, job controlling, etc.
• The Domain Manager (DM) provides name services and watchdog facilities to keep the domain in a stable condition.
• The Operator Shell (OS) is the main interface for system administrators to control CCS, e.g. by connecting to the system daemons (Figure 1.2).
1.2.1 User Interface
CCS has no separate command window. Instead, the user commands are integrated in a standard UNIX shell like tcsh or bash. Hence, all common UNIX mechanisms for I/O re-direction, piping and shell scripts can be used. All job control signals (ctl-z, ctl-c, ...) are supported and forwarded to the application. The CCS user interface supports six commands:
• ccsalloc for allocating or reserving resources,
• ccsrun for starting jobs on previously reserved resources,
• ccskill for killing jobs and for resetting or releasing resources,
• ccsbind for re-connecting to a lost interactive application,
• ccsinfo for displaying information on the job schedule, users, job status, etc.,
• ccschres for changing resources of already submitted requests.

Figure 1.2. The CCS operator shell (OS).
1.2.2 User Access Manager
The Access Manager (AM) manages the user interfaces, analyzes the user requests, and is responsible for authentication, authorization, and accounting.

Authentication
When first connecting to CCS, users must specify a project name. Users may work in several projects at the same time. In addition to the user input, the UI also determines user-specific information (real and effective username, UNIX user ID) and sends it to the AM. The AM checks its database to determine whether the user is a member of the specified project. If not, the connection request is denied. Otherwise, the AM checks whether the request matches the user's privileges.
Authorization
Privileges can be granted to either a complete project or to specific members, for example:
• access rights (allocate/reserve/change resource requests),
• maximum number of concurrently used resources,
• allowed time of usage (day, night, weekend, etc.).
Jobs may be manipulated by either the owner or project group members (if not prohibited by a privacy flag). This allows working groups to be established. CCS has three time categories: weekdays, weekend, and special days. The latter can be defined by a calendar file. In each category CCS distinguishes day and night. Three limits can be specified: (1) the number of allocatable nodes as a percentage of the machine size during day and night time, (2) the maximum duration a partition may be used, and (3) the maximum time extension which can be added after job submission. These parameters are used to compute an additional (implicit) limit, the "resource integral" (similar to an area in Fig. 1.3). The resource integral is given by maxNodes * maxDuration * factor, where maxNodes is the maximum number of nodes over all project limits. Normally, factor is set to 1, but for projects with many members it should be increased (a small sketch of this check is given at the end of this subsection).

Accounting

The accounting of the machine usage is done at partition allocation time, at release time, and during job runtime. The resulting data can be post-processed by statistical tools. The operator can assign CPU time limits to a project. When the time (wall clock time x nodes) has been overdrawn, CCS locks the project. All of the above is specified in an ASCII file which is read at boot time by the AM. The AM periodically checks this file for changes; additionally, the system operator may force the AM via the operator shell to re-read the file. We currently use an SQL database with a Java interface for administering projects and for creating the ASCII configuration file.
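To make the interaction of these limits concrete, the following small C++ sketch shows how a request could be checked against the node limit, the duration limit, and the resource integral described above. It is an illustration only, not CCS source code; all type and field names are invented, and factor defaults to 1 as in the text.

#include <algorithm>

// Illustrative project limits, as described above (not CCS source code).
struct ProjectLimits {
    int    maxNodesDay;      // allocatable nodes during day time
    int    maxNodesNight;    // allocatable nodes during night time
    int    maxDuration;      // longest partition lifetime, in minutes
    double factor;           // normally 1, larger for projects with many members
};

struct Request {
    int  nodes;
    int  duration;           // minutes
    bool atNight;
};

// A request is admissible if it stays within the node and duration limits and
// does not exceed the implicit resource integral maxNodes * maxDuration * factor,
// where maxNodes is the maximum over all project limits.
bool admissible(const ProjectLimits& p, const Request& r) {
    int nodeLimit = r.atNight ? p.maxNodesNight : p.maxNodesDay;
    if (r.nodes > nodeLimit || r.duration > p.maxDuration) return false;

    int maxNodes = std::max(p.maxNodesDay, p.maxNodesNight);
    double integral = static_cast<double>(maxNodes) * p.maxDuration * p.factor;
    return static_cast<double>(r.nodes) * r.duration <= integral;
}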
1.2.3 Scheduling and Partitioning
In the design of CCS, we have tried to compromise between two conflicting goals: First, to create a system that optimally utilizes processors and interconnects, and second, to keep a high degree of system independence for improved portability. These two goals could only be achieved by splitting the scheduling process into two instances, a hardware-dependent and a hardware-independent part. The Queue Manager (QM) is the hardware-independent part. It has no information on any mapping constraints such as the network topology or the amount or location of I/O nodes.
The hardware-dependent tasks are performed by the Machine Manager (MM). It verifies whether a schedule received from the QM can be mapped onto the hardware at the specified time. The MM checks this by mapping the user specification against the static (e.g., topology) and dynamic (e.g., PE availability) information on the system resources. This kind of information is described by means of the Resource and Service Description facility RSD (Sec. 1.3). In the following sections we discuss the scheduling model in more detail.

Scheduling

CCS requires the user to specify the expected finishing time of his jobs in the resource requests. This is necessary to compute a fair and deterministic schedule. The CCS schedulers distinguish between fixed and time-variable resource requests. A request that has been reserved for a given time interval is fixed, i.e. it cannot be shifted on the time axis. Time-variable requests, in contrast, can be scheduled earlier but not later than requested. Such a shift on the time axis might occur when other users release their resources before the specified estimated finishing time. Figure 1.3 shows the scheduler GUI. Note that both batch and interactive requests are processed in the same scheduler queue.

Figure 1.3. Scheduler GUI showing allocated compute nodes over time.

CCS provides several scheduling strategies: first-come-first-serve, shortest-job-first, or longest-job-first. The integration of new schedulers is easy, because the Queue Manager provides an API to plug in new modules. The system administrator can choose a scheduler at runtime. In addition, the QM may adjust to different request profiles by dynamically switching between the schedulers. For our request profile, we found that an enhanced first-come-first-serve (FCFS) scheduler fits best. Waiting times are minimized by first checking whether a new request fits into a gap of the current schedule (back-filling; a sketch of this gap search follows below).

Resource Reservation. With this scheduling model it is also possible to reserve resources for a given time in the future. This is a convenient feature when planning interactive sessions or online events. As an example, consider a user who wants to run an application on 32 nodes from 9 to 11 am on 13.02.2001. The resource allocation is done with the command: ccsalloc -n 32 -s 9:13.02.01 -t 2h.

Deadline Scheduling. With the deadline scheduling policy, CCS guarantees that a batch job is completed no later than the specified finishing time. A typical scenario for deadline scheduling would be an overnight run that must be finished when the user comes back into his office the next morning. Deadline scheduling gives CCS the flexibility to improve the system utilization by scheduling batch jobs at the latest possible time by which the deadline is still met. More sophisticated policies, like running a batch job as soon as possible but at the latest such that the specified finishing time is still met, can be introduced via the scheduler API.
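The back-filling step mentioned above can be pictured as a search for the earliest gap in the current schedule. The following C++ sketch illustrates the idea under simplifying assumptions (a single pool of identical nodes, no topology constraints, which CCS checks later in the MM); it is not the CCS scheduler code, and all names are illustrative.

#include <vector>
#include <algorithm>

struct Slot { long start, end; int nodes; };     // an already scheduled request

// Earliest admissible start time (>= 'earliest') at which 'nodes' processors
// are free for 'duration' seconds, given the machine size and the current
// schedule; returns -1 if the request is larger than the whole machine.
long earliestStart(const std::vector<Slot>& sched, int machineNodes,
                   int nodes, long duration, long earliest) {
    if (nodes > machineNodes) return -1;
    std::vector<long> candidates{earliest};       // classic back-filling: also try
    for (const Slot& s : sched)                   // the release time of every job
        if (s.end > earliest) candidates.push_back(s.end);
    std::sort(candidates.begin(), candidates.end());

    for (long t : candidates) {
        bool fits = true;
        std::vector<long> points{t};              // usage can only change at job starts
        for (const Slot& s : sched)
            if (s.start > t && s.start < t + duration) points.push_back(s.start);
        for (long p : points) {
            int used = 0;
            for (const Slot& s : sched)
                if (s.start <= p && p < s.end) used += s.nodes;
            if (used + nodes > machineNodes) { fits = false; break; }
        }
        if (fits) return t;                       // candidates are sorted: first hit is earliest
    }
    return earliest;                              // not reached: the last candidate always fits
}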
Balancing Interactive and Batch Jobs. Systems that are primarily used in interactive mode must not be overloaded with batch jobs during daytime. This is taken care of by distinguishing six time slots in CCS: three categories of working days (weekdays, weekend days, exceptional days) and two time categories (day and night time). These six slots are taken into account when scheduling a new request. With the given project limits (CPU time and nodes as a percentage of the total machine size) the scheduler is able to schedule the request to the next suitable time slot depending on the request type (interactive, reservation, deadline, etc.). If the machine size or the limits change later on, the scheduler is notified and automatically adapts to the new scenario. With this mechanism, a user may submit an arbitrary number of jobs without blocking the machine. In a certain sense, this mechanism gives each project its own virtual machine of time-dependent size (e.g., 40% during day time and 100% during night time).

Partitioning
The separation between the hardware-independent QM and the system-specific MM allows system-specific mapping heuristics to be encapsulated in separate modules. With this approach, system-specific requests for I/O nodes, specific partition topologies, or memory constraints can be taken into consideration in the verification process. The MM verifies whether a schedule received from the QM can be mapped onto the hardware at the specified time. In this check, the MM also takes the concurrent usage by other applications into account. If the schedule cannot be mapped onto the machine, the MM returns an alternative schedule to the QM.
Figure 1.4. Example illustrating the system partitioning scheme on a machine with different link speeds.
The QM either accepts the schedule or proposes a new one. Figure 1.4 is one such example, where system-specific characteristics must be taken into consideration in the partitioning process. Here, the 32 compute nodes are interconnected by SCI links of different speeds: the vertical rings have a bandwidth of 500 MByte/s, whereas the horizontal rings are a little slower at 400 MByte/s due to their longer physical cable lengths. The system-specific MM takes such system characteristics into account. In this case, it determines a partition according to the cost functions "minimum network interference by other applications" and "best network bandwidth for the given application". The first function tries to use as few rings as possible (thereby minimizing the number of dimension changes from X- to Y-ringlets), while the second tries to map applications onto single rings to ensure maximum bandwidth. Note that the best choice of policy also depends on the network protocol, e.g., in our case SCI [21]. The API of the MM allows mapping modules to be implemented that are optimally tailored to the specific hardware properties and network topology. As a side effect of this model, it is also possible to migrate partitions when they are not active. This feature is used to improve the utilization of partitionable systems. The user does not notice the migration (unless he runs time-critical benchmarks for testing the communication speed of the interconnects, but in this case the automatic migration facility may be switched off). Another task of the MM is to monitor the utilization of the partitions. If a partition is not used for a certain amount of time, the MM releases the partition and notifies the user via email.
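To illustrate the two cost functions mentioned above, the following C++ sketch scores candidate placements on the 2D torus by the number of X- and Y-ringlets they touch, with a small penalty for placements that need dimension changes. It is a simplified illustration with an invented weighting, not the CCS mapping module.

#include <set>
#include <vector>
#include <utility>
#include <limits>

using Node = std::pair<int,int>;               // (x ring, y ring) coordinates in the torus

// Fewer rings touched means less interference with other applications and,
// for single-ring placements, the full ring bandwidth for this application.
double cost(const std::vector<Node>& partition) {
    std::set<int> xRings, yRings;
    for (const Node& n : partition) { xRings.insert(n.first); yRings.insert(n.second); }
    double c = static_cast<double>(xRings.size() + yRings.size());
    if (xRings.size() > 1 && yRings.size() > 1)
        c += 1.0;                              // penalty for X<->Y dimension changes
    return c;
}

// Pick the cheapest of several candidate placements (the candidates would be
// generated from the currently free nodes).
const std::vector<Node>* best(const std::vector<std::vector<Node>>& candidates) {
    const std::vector<Node>* bestP = nullptr;
    double bestC = std::numeric_limits<double>::max();
    for (const auto& p : candidates) {
        double c = cost(p);
        if (c < bestC) { bestC = c; bestP = &p; }
    }
    return bestP;
}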
Interface to System Area Networks (SAN). In modern workstation clusters, the SAN administration is often done by an autonomous layer. The Siemens hpcLine provides a SAN administration layer which monitors the SCI network and responds to problems by disabling SCI links or by changing the routing scheme. Ideally, such changes should be reported to the resource management system, because the routing affects the quality of the mapping. However, the hardware vendors do not yet support the interface, so we have not been able to implement it into CCS. Currently, CCS just checks the integrity of the SAN at boot time.
1.2.4 Access and Job Control
When configuring the compute nodes for the execution of a user application, the QM sends the resource request to the MM. The MM then initializes the compute nodes, loads and starts the application code, and releases the resources after the last process has terminated. Because the MM also verifies the schedule, which is a polynomial-time problem, a single MM daemon might become a computational bottleneck. We therefore split the MM into two parts (Fig. 1.5), one for the machine administration and one for the job execution. Each part contains several modules and daemons, which may be executed on different hosts to improve the response time.
Figure 1.5. Detailed view of the machine manager (MM).

The machine administration part consists of three separate daemons (MV, SM, CM) that execute asynchronously. A small Dispatcher coordinates them. The Mapping Verifier (MV) checks whether the schedule given by the QM can be realized at the specified time with the specified resources (see 1.2.3). The Configuration Manager (CM) provides the interface to the underlying hardware. It is responsible for booting, partitioning, and shutting down the operating system (resp. middleware) at the target system. Depending on the system's capabilities, the CM may gather consecutive requests and re-organize or combine them
for improving the throughput, analogously to the autonomous optimization of hard disk controllers. In addition, the CM provides external tools with information on the allocated partition, like host names or the partition size. With this information, external tools can create host files, for example to start a PVM application (see 1.2.7).

The Session Manager (SM) interfaces to the job execution level. It sets up the session, including application-specific pre- or post-processing, and it maintains information on the status of the applications. It also synchronizes the nodes with the help of the Node Session Manager (NSM), which runs on each specified node with root privileges. The NSM is responsible for node access and job control. At allocation time, the NSM starts an Execution Manager (EM) which establishes the user environment (UID, shell settings, environment variables, etc.) and starts the application. Before releasing the partition, the NSM cleans up the node. All NSMs are synchronized by the SM. Figure 1.6 illustrates the control and data flow in a CCS domain.

Figure 1.6. Control flow and data flow in a CCS domain.

In clusters, access control is of prime importance, because each node runs a full-fledged operating system and could therefore in principle be used as a stand-alone computer by its owner. When operated in cluster mode, users must be prevented
from starting processes on single nodes, not only because of unpredictable changes in the CPU load, but also because they might create orphan processes which are difficult for the resource management system to clean up. For this reason, the NSM also takes care of access and job control. At allocation time, an NSM modifies several system files (depending on the operating system) to allow exclusive login for the temporary node owner. Before releasing the partition, the NSM removes pending processes and files and locks the node again to prohibit direct access by single users.
1.2.5 Performance Issues
On very large clusters the SM may become a bottleneck, because it has to communicate with all NSMs. Configuring a partition with n nodes requires 2n messages to be sent through one communication link. In our simulations, the initialization of a 256-node partition took about five seconds, most of it spent in communication. This is because we use RPCs (for portability), which implicitly serialize the communication. The time can be reduced by implementing a hierarchy of SMs that communicate in parallel (i.e. a broadcast tree). In addition, transient problems (e.g. network congestion or crashed daemons) are handled by the nearest SM without interrogating the MM. The depth and width of the SM tree can be specified with the Resource and Service Description language (Sec. 1.3).
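The SM hierarchy can be organized as a k-ary broadcast tree. The following C++ sketch (illustrative only; in CCS the depth and width are configured via RSD) computes the children of a session manager in such a tree and gives a rough estimate of the number of parallel communication rounds, which grows logarithmically with the partition size instead of linearly.

#include <vector>

// Children of SM 'id' in a k-ary broadcast tree over n session managers,
// with SM 0 as the root.
std::vector<int> children(int id, int n, int k) {
    std::vector<int> c;
    for (int j = 1; j <= k; ++j) {
        int child = id * k + j;
        if (child < n) c.push_back(child);
    }
    return c;
}

// Rough number of parallel rounds to reach all n SMs, assuming every SM that
// already holds the message forwards it to k new SMs per round. Compare this
// with the n sequential RPCs of the flat scheme described above.
int rounds(int n, int k) {
    int r = 0;
    long reached = 1;
    while (reached < n) { reached += reached * k; ++r; }
    return r;
}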
1.2.6 Fault Tolerance

Virtual terminal concept
With the increasing interactive use via WANs, reliable remote access services become more and more important. Unpredictable network behavior and even temporary breakdowns should ideally be hidden from the user. When the network breaks down, open standard output streams (stdout and stderr) are buffered by the EM. The EM then either sends the output via email to the user or writes it to a file (see Fig. 1.6). A user can re-bind to a lost session, provided that the application is still running. CCS guarantees that no data is lost in the meantime.

Alive checks
Today's parallel systems often comprise independent (workstation-like) nodes, which are more vulnerable to breakdowns than traditional HPC systems. System reliability becomes an even more important issue because a node breakdown has an immediate impact on the user's work flow. To ensure a high reliability, erroneous nodes must be detected and disabled. The resource management system must be able to detect and possibly repair breakdowns at three different levels: the computing nodes, the daemons, and the communication network. Many failures become only apparent when the communication behavior changes over time or when a communication partner does not
answer at all. To cope with this, the Domain Manager (DM) maintains a database on the status of all system components in the domain. Each CCS daemon notifies the DM when starting up or closing down. The DM periodically pings all connected daemons to check whether they are still alive. In addition, when a CCS daemon detects a breakdown while communicating with another daemon (by receiving an error from the CCS communication layer), it closes the connection to this daemon and requests the DM to re-establish the link. In both cases, the DM first tries to reconnect. If this is not possible, the DM has a number of methods to investigate the problem. Which of them to use depends on the type of the target system. If it is a cluster, the DM tries to ping the faulty node. If the node does not answer, it may be in trouble and is marked as faulty in the MM mapping module. The MM then ignores this node in future mappings. The respective jobs are stopped (if specified) and a problem report is sent to the user and the operator. The DM is authorized to stop erroneous daemons, to restart crashed ones, and to migrate daemons to other hosts in case of system overloads or crashes. For this purpose, the DM maintains an address translation table that matches symbolic names to physical network addresses (e.g. host ID and port number). Symbolic names are given by the triple <site, domain, process>. With this feature a cluster may be logically divided into several CCS domains, each of them with a different scheduling or mapping scheme. For recovery, each CCS daemon periodically saves its state to disk. At boot time the daemons read their information and synchronize with their communication partners. This allows CCS daemons (or even the whole domain) to be shut down or killed at any time without the risk of losing scheduled requests.
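The bookkeeping behind such a watchdog can be sketched as a small registry that maps the symbolic triple <site, domain, process> to a physical address and records the last successful ping. The C++ sketch below is illustrative; the types and method names are invented, and the real DM database is more elaborate.

#include <map>
#include <string>
#include <tuple>
#include <ctime>

// Symbolic daemon name <site, domain, process> and its physical address.
struct Name {
    std::string site, domain, process;
    bool operator<(const Name& o) const {
        return std::tie(site, domain, process) < std::tie(o.site, o.domain, o.process);
    }
};
struct Address { std::string host; int port; std::time_t lastSeen; };

class DomainManagerDB {
    std::map<Name, Address> table;          // address translation table
public:
    void registerDaemon(const Name& n, const Address& a) { table[n] = a; }
    void unregisterDaemon(const Name& n)                 { table.erase(n); }
    void heartbeat(const Name& n)                        { table[n].lastSeen = std::time(nullptr); }

    // Daemons that have not answered a ping within 'timeout' seconds are
    // candidates for a restart or for marking their node as faulty.
    bool suspicious(const Name& n, int timeout) const {
        auto it = table.find(n);
        return it == table.end() ||
               std::difftime(std::time(nullptr), it->second.lastSeen) > timeout;
    }
};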
Node checking

In space-sharing mode users want to access "clean" nodes to achieve deterministic behavior of their applications. Therefore, both the node and the network have to be cleaned up after a job terminates. In CCS, this is done in two steps. Before allocating a node, CCS checks its integrity: (1) Is the SCI network interface okay? (2) Is enough memory available? (3) Has the node been cleaned of unauthorized processes? If any of these conditions is not met, CCS tries to fix the problem. If this is not possible, the allocation fails, the user is informed, and the operator gets an email describing what went wrong. Additionally, CCS performs post-processing after each job. For example, after running a ScaMPI [31] application on an SCI cluster, CCS removes remaining shared memory segments. This ensures that all memory is available for the next run.
1.2.7 Modularity
One of the design goals of CCS is extensibility. We have designed CCS in a modular way to allow easy adaptation to new system architectures or different execution modes. In the following, we give some examples of how this design principle is reflected in CCS.
pvm,                                            #name of the worker
%CCS/bin/start_pvmJob -d -r %reqID -m %domain,  #run command
%CCS/bin/start_pvmJob -q -m %domain,            #parse command
%root %CCS/bin/establishPVM %user,              #pre-processing
%user %CCS/bin/cleanPVM %user                   #post-processing
Figure 1.7. Worker definition for job startup in a PVM environment.

Operator shell
The Operator Shell (OS) and the other kernel modules (AM, QM, MM) are all independent modules. When the OS contacts a daemon for the first time, the daemon sends its menu items to the OS, which automatically generates the corresponding pull-down menus and dialog boxes. This allows any menu items and dialogs to be specified on the client side without changing the OS source code.

Worker concept
Cluster systems comprise nodes with full operating system capabilities and software packages like debuggers, performance analyzers, numerical libraries, and runtime environments. Often these software packages require specific pre- and post-processing. CCS supports this with the so-called worker concept. Workers are tools to start jobs under specific runtime environments. They hide specific procedures (e.g. starting of a daemon or setting of environment variables) and provide a convenient way to start and control programs. A worker's behavior is specified by five attributes in a configuration file (Fig. 1.7):
• the name of the worker,
• the command for CCS to start the job,
• the optional parse command for detecting syntax errors,
• the optional pre-processing command (e.g. initializing a parallel file system), and
• the optional post-processing command (e.g. closing a parallel file system).
Section 1.2.
15
CCS Architecture
to change the CCS source code. It is possible to use several configuration files at the same time. The pvm-worker in Fig. 1.7 may serve as an example to illustrate what can be done with a worker: The pvm-worker creates a PVM host file (the host names are provided by the CM) and starts the master-pvmd. The master-pvmd starts, according to the given host file, all other slave-pvmds via the normal r s h or ssh mechanism to establish the virtual machine (VM) on the requested partition. This is possible since the NSMs have granted node access at allocation time. Since the user application cannot be started until all pvmds are running, the pvm-worker starts a special PVM application when the master-pvmd is running. This little program periodically checks how many nodes are connected to the VM until the entire VM is up. Thereafter the user application is started. When the application is done, the worker terminates the master-pvmd which shuts down the VM. In the post processing phase the nodes are cleaned up by removing pending processes and files. Adapting to the local environment
Since CCS is able to manage heterogeneous systems it is possible that the process environment may not always be the same. The automounter tool which automatically mounts file systems (e.g. home directories from remote file serves) may serve as an example to exemplify this problem: Provided a user has a home directory on host x in directory /home/foo, a pwd would result in /homes/x/foo. However, a pwd submitted on host x itself results in /home/foo. If a user starts a job in /home/foo on host x, the EM is not able to change to this directory. Due to the automounter naming conventions the path should be /homes/x/foo. Therefore applications using this path will not work correctly. CCS copes with problems like this by modifying the process environment of an application before starting it. Environment variables like PATH will be explored and modified with respect to the host name.
#EM-Host
SOURCE
DESTINATION
.* .sci-[123]*
/homel /home3
/homes1/psc-master /homes3/psc-master
F i g u r e 1.8. CCS configuration file with path mapping. The operator can specify the mapping in a configuration file as shown in Fig. 1.8. Each line describes a mapping. The first column (which may be a regular expression) specifies the local host. The second and third column describe what will be mapped. For example, the string "/homel/foo" will be replaced by "homesl/psc-
16
Anatomy of a Resource Management System for HPC Clusters
Chapter 1
master/foo". The configuration will be read by the EM at boot time, hence it can be changed during runtime.
Machines
Options
w; . ._ Pj
E=xrt I
f
Windows
Zvalue
Help
Coordinates: 2,6,0
Watch Request ;
No More Jobs f
Machino:
'PSC2
Program:
jscampi - - bandwith
Timci
,
I
Quit
0 days,
"""
5 hours,
Update
j
Start Command ;
• 45 m i n u t e s
^ <
fc*J
-J.J
Figure 1.9. Monitoring a partition with ccsView.
Monitoring tools
A resource management system should provide interfaces to arbitrary external tools to support users with several layers of information. Common users should be given the overall status without burdening them with too many technical details. Application developers, in contrast, need better monitoring tools to be able to see what happens on the nodes and on the network. CCS offers three monitoring tools. The first one, ccsView, provides a logical view of the processor network. It shows the status of the network topology and the nodes (e.g allocated, available, down, etc.). Allocated partitions are highlighted and node specific information is displayed by clicking on a node symbol. Figure 1.9 depicts the ccsView tool. The second monitor, ccsMon [21] shown in Fig. 1.10, gives a physical view on the cluster and the status of the nodes. On each node, a local monitor agent samples data and sends it periodically to a central server daemon. The server makes this information available to the GUI. The CCS Configuration Manager provides the ccsMon server with additional CCS specific information on the node status. This information is then used by the ccsMon user interface to highlight the corresponding frame(s) on the display. Clicking on a node opens an additional window which shows more detailed information. ccsMon is useful to inspect the behavior of parallel applications (e.g. communication patterns or hot spots).
Section 1.3.
Resource and Service Description
17
T h e third tool, SPROF [32], can be used to locate performance bottlenecks at a very low level. This includes not only run time d a t a , but also cache miss ratios and other events. For this purpose, we have integrated a small monitor in the ccsMon tool to show C P U load, MFlop/s, memory bandwidth, bandwidth of the C P U caches and their misses, and the SCI network bandwidth. liiSSEEQESIBHMMMil^^^flMHsJnJSIi ' £Jje Options
Help ;
F i g u r e 1.10. Monitoring a node with ccsMon.
1.3
Resource and Service Description
T h e Resource and Service Description RSD [11] is a versatile tool for specifying any kind of computer resources (computer nodes, interconnects, storage, software services, etc.). It is used at the administrator level for describing the type and topology of the available resources, and at the user level for specifying the required system configuration for a given application. Fi gure 1.11 illustrates the architecture of RSD. There are three basic interfaces: a GUI for specifying simple topologies and attributes, a language interface for specifying more complex and repetitive graphs (mainly intended for system administrators), and an API for access from within an application program. In the following, we describe these components in more detail.
1.3.1
Graphical Representation
T h e RsdEditor [3] provides a and linked together to build environment. System administrators use components in their computer
set of predefined objects (icons) t h a t can be edited a hierarchical graph of the resources in the system RsdEditor to describe the computing and networking center. Figure 1.12 illustrates a typical administrator
18
Anatomy of a Resource Management System for HPC Clusters
Chapter 1
User or Administrator
RsdEditor
ASCII - Editor
Graphical Representation
Textual Representation
.^1
RsdParser
LAN / WAN Resource Management System
Application
Figure 1.11. RSD architecture.
session. The system components of the site are specified in a top-down manner with the interconnection topology as a starting point. With drag-and-drop, the administrator specifies the available machines, their links and the interconnection to the outside world. New resources can be specified by using predefined objects and attributes via pull down menus, radio buttons, and check boxes. In the next step, the structure of the systems is successively refined. The GUI offers a set of generic topologies like ring, grid, or torus. The administrator defines the size and the general attributes of the system. When the system has been specified, a window with a graphical representation of the system opens, in which single nodes can be selected. Attributes like network interface cards, main memory, disk capacity, I/O throughput, CPU load, network traffic, disk space, or the automatic start of daemons, etc. can be assigned. At the user level, RsdEditor is used for specifying resource requests. Several predefined system topologies help in specifying commonly used configurations.
Section 1.3.
19
Resource and Service Description
* u
H
t" " V
V X
tnnis 4x"
Node11 Node21 Node12 Node22 Node31 Node41 Node32 Node42 NodelS Node14 Node15 Node23 Node33 NodB43 Node24 Node34 Node44 Node25 Node35 Node45
£
NODE torus_4x5 PORTtpZ; NODE Node11 PORTxportll {TYPE • "SCI";) PORT yportl 1 £ TYPE - "SCI";) - •; Memory HardDisl
i
CPU
j f.l.-iplni jimpi-tni'*
KSO Aflribiitf^
I'iril
(ilt
Pulls
I tli jus
II (IHiHr
(.mirl
F i g u r e 1.12. The RsdEditor.
1.3.2
Textual Representation
In some cases, the RsdEditor may not be powerful enough to describe complex computer environments with a large number of services and resources. Hence, we devised a language interface that is used to specify irregularly interconnected, attributed structures. Its hierarchical concept allows different dependency graphs to be grouped for building even more complex nodes, i.e., hypernodes. Grammar
In RSD resources and services are described by attributed nodes (keyword NODE) that are interconnected by edges (keyword EDGE) via communication endpoints (keyword PORT). Nodes: Nodes are defined in a hierarchical manner with an arbitrary number of subnodes at the next hierarchy level. Depending on the context a NODE is either a "processor" or a "process". Passive software processes, like CPU performance daemons, can be associated with nodes.
20
Anatomy of a Resource Management System for HPC Clusters
Chapter 1
ATM \
Gigabit Ethernet !<> ATM \
F i g u r e 1.13. Configuration of a 4 x 8 node SCI cluster. Ports: The keyword PORT defines a communication endpoint of a node. It acts as a gateway between the node and remote connections. Ports may be sockets, network interface cards or switches. Edges: An EDGE is used to specify a connection between ports. Edges may be directed. Their characteristics (e.g. bandwidth, latency) are specified by the corresponding attributes. Passive software processes, like network daemons, can be associated with edges. Virtual edges are used to specify links between different levels of the hierarchy in the graph. This allows to establish a link from the described module to the "outside world" by exporting a physical port to the next higher level. These edges are defined by: ASSIGN NameOfVirtualEdge {NODE w PORT x <=> PORT a}. Note, that NODE w and PORT a are the only entities known to the outside world. In addition, the RSD grammar provides keywords and statements for including files (INCLUDE), sequences, variable declarations (VAR), constants (CONST), macro definitions (MACRO), loops (FOR), and control structures (IF).
1.3.3
Dynamic Attributes
In addition to the static attributes of nodes and edges, RSD is also able to handle dynamic attributes. This may be important in heterogeneous environments, where
Section 1.3.
Resource and Service Description
NODE PSC
21
// This SCI cluster is named PSC
{ // DEFINITIONS: CONST X = 4, Y = 8; CONST N = 2; CONST EXCLUSIVE = TRUE;
// dimensions of the system // number of frontends // resources for exclusive use only
// DECLARATIONS: // He have 2 SMP frontends, each Kith 4 processors, an ATM-, and a FastEthernet-port. FOR i=0 TO N-l DO NODE frontend_$i { PORT ATM; PORT ETHERNET; CPU=PentiumlI; MEM0RY=512 MByte; MULTI.PR0C=4;}; OD // The others are dual processor nodes each with an SCI- and a FastEthernet port. FOR i=0 TO X-l DO FOR j=0 TO Y-l DO NODE $i$j { PORT SCI; PORT ETHERNET; CPU=PentiumII; MEM0RY=256 MByte; MULTIJ>R0C=2; IF ((i+1) MOD 2 == 0) *« ((j+1) MOD 2 == 0) PFSnode = TRUE; DISKsize = 50; // This is an I/O node of the parallel file system FI
}; OD OD // All nodes are connected via a FastEthernet switch NODE FEswitch { // The FastEthernet switch FOR i=0 TO X*Y+N DO PORT port$l; // Port 0 and 1 are Gigabit uplinks OD
}: // CONNECTIONS: build the SCI 2D torus FOR i=0 TO X-l DO FOR j=0 TO Y-l DO // the horizontal direction EDGE edge_$i$j.to.$((i+l) MOD X)$j { NODE $i$j PORT SCI => NODE t((i+l) MOD X)$j PORT SCI; HAX.BANDHIDTH = 400 MByte/s; DYNAMIC ACT-BANDWIDTH=FILE:/import/SCI_Bandwith.txt;}; // the vertical direction EDGE edge-$i*j-to_$i$((j+l) MOD Y) { NODE $i$j PORT SCI => NODE $i$((j+l) MOD Y) PORT SCI; MAX-BANDWIDTH = 500 MByte/s; DYNAMIC ACT_BANDWIDTH=FILE:/import/SCI.Bandwith.txt;}; OD OD // CONNECTIONS: build the FastEthernet links between switch and nodes port=0; FOR i=0 TO N-l DO EDGE FEswitch.$port.to_frontend-$i) { NODE FEswitch PORT port.$port <=> NODE frontend_$i PORT ETHERNET; BANDWIDTH = 1 Gbps;}; port=$port+l; OD FOR i=0 TO X-l DO FOR j=0 TO Y-l DO EDGE FEswitch-$port-to.$i$j) { NODE FEswitch PORT port.$port <=> NODE $i$j PORT ETHERNET; MAX-BANDWIDTH = 100 Mbps; DYNAMIC ACT_BANDWIDTH=FILE:/import/ether.bandwith.txt;} ; port=$port+l; OD OD
}i
Figure 1.14. RSD specification of the SCI cluster in Figure 1.13.
22
Anatomy of a Resource Management System for HPC Clusters
Chapter 1
the temporary network load affects the choice of the process mapping. Moreover, dynamic attributes may be used by the application to migrate a job to less loaded nodes during runtime. Dynamic resource attributes are defined by the keyword DYNAMIC with the location of a temporary file that provides up-to-date information at runtime. The RSD parser generates the objects with the appropriate access methods of the dynamical data manager. These methods are used at runtime to retrieve, e.g., the network performance data, which was written asynchronously by a separate daemon (e.g an SNMP agent [33]). In addition to file access, the data manager of the RSD runtime management also provides methods for retrieving data via UDP or TCP. More information on these services can be found in [12]. Example
Figure 1.13 shows the configuration of a 4 x 8 node SCI cluster. The corresponding RSD specification is shown in Figure 1.14. The cluster consists of two frontend computers, an Ethernet switch and 32 compute nodes. The frontend systems have quad-processors with ATM connections for external communication and Gigabit-Ethernet for controlling the cluster and serving I/O to the compute nodes. For each compute node, we have specified the following attributes: type of CPU, amount of memory, and the SCI/FE ports. All nodes are interconnected by unidirectional SCI ringlets in a 2D torus topology. The bandwidth of the horizontal rings is 400 MByte/s and 500 MByte/s in vertical direction. Each node is connected by FastEthernet to the Ethernet switch. Eight of the 32 compute nodes also serve as I/O nodes of a parallel file system. They are accessible via Ethernet as well as via SCI. Therefore we specified two attributes per edge: the maximum bandwidth and the actual bandwidth. The latter one is a dynamical attribute. By means of these two attributes it is possible to determine which network should be used for the parallel file system I/O at runtime.
1.3.4
Internal Data Representation
The internal data representation [12] establishes the link between the graphical and the textual representation of RSD. It is capable of describing hierarchical graph structures with attributed nodes and edges. Sending RSD objects between different sites poses another problem: Which protocol should be used? When we developed RSD, we did not want to commit to either of the standards know at that time (e.g. CORBA, Java, or COM+). Therefore we have only defined the interfaces of the RSD object class but not their private implementation. Today XML [36] seems to be a good choice for this purpose. However, since the textual representation of RSD is more readable (for humans) than XML, we will use it further on. To be compatible with other resource
Section 1.3.
Resource and Service Description
23
description tools we implemented a converter from RSD to XML. The API layer (written in C++) can be divided into two classes: Methods for parsing and creating resource objects, and methods for navigating through the resource graph. In the following, we list some of the more important navigation methods: • RSDnode *parse(char *filename)
parses an RSD text file and returns a handle to the resulting objects. • RSDnode *GetNeighbours(node) returns a list of nodes that are connected to the given node in the same level. • RSDnode *GetSubNodes(node) returns a list of all sub nodes of the given node. • RSDattr * G e t A t t r i b u t e s ( e n t i t y ) returns the attributes of the given object (node or edge). • char *GetValue(attr) returns the value of the given attribute object.
1.3.5
RSD Tools in CCS
The RSD tools are used in the CCS domain management for describing system resources and user requests. At boot time, all CCS components read the RSD specification created by the administrator. Each module extracts only its relevant information (by use of the RsdAPI), e.g., the MM reads the machine topology and attributes, and the QM extracts only the number of processors and the operating system type. The User Interface (UI) generates an RSD description from the user's parameters (or from a given RSD description) and it sends it to the Access Manager (AM). The AM, checks whether the request matches the administrator given limits and forwards it to the QM. The QM extracts the information, determines a schedule and sends both the schedule and the RSD description to the MM. The MM verifies this schedule by mapping the user request against the static (e.g. topology) and dynamic (e.g. processor availability) information on the system resources. Figure 1.15 illustrates the RSD data flow in CCS. Even though all CCS components are based on RSD, we have hidden the RSD mechanisms behind a simple command line interface because in the past, there was no need for a versatile resource description facility. Most systems were homogeneous, their topologies were simple and regular, and nearly all applications ran on only one system. With the trend towards grid computing environments, resource description becomes more important now. In some metacomputer environments, the system (instead of the user) decides autonomously which of the available resources to use.
Hence, the users need a convenient tool to specify their requests, and the applications need an API to negotiate their requirements with the resource management system.
Figure 1.15. RSD data flow in a CCS domain.
1.4
Site Management
The local CCS domains described in Section 1.2 may be coordinated by two higher level tools as shown in Fig. 1.16: A passive instance named Center Information Server (CIS) that maintains up-to-date information on the system structure and state, and an active instance, called Center Resource Manager (CRM), that is responsible for the location and allocation of resources within a center. The CRM also coordinates the concurrent use of several systems, which are administered by different CCS domain instances.
1.4.1
Center Resource Manager (CRM)
The CRM is a tool on top of the CCS domains. It supports the setup and execution of multi-site applications running concurrently on several CCS domains. Here, the term 'multi-site application' can be understood in two ways: It could be just one application that runs on several machines without explicitly being programmed for that execution mode [7], or it could comprise different modules, each of them executing on a machine that is best suited for running that specific piece of code. In the latter case the modules can be implemented in different programming languages using different message passing libraries (e.g., PVM, MPI, or MPL). Multicommunication tools like PLUS [10] are necessary to make this kind of multi-site
application possible.
Figure 1.16. A CCS site with a link for metacomputer access.
For executing multi-site applications three tasks need to be done:
• locating the necessary resources,
• allocating the resources in a synchronized fashion,
• starting and terminating the modules.
For locating the resources, the CRM maps the user request (given in RSD notation) against the static and dynamic information on the available system resources. The static information (e.g. topology of a single machine or the site) has been specified by the system administrator, while the dynamic information (e.g. state of an individual machine, network characteristics) is gathered at runtime. All this information is provided by the CCS domains or the Center Information Server CIS (see next section). Since our resource and service description is able to describe dependency graphs, a user may additionally specify the required communication bandwidth for his application.
After the target resources have been determined, they must be allocated. This can be done in analogy to the two-phase-commit protocol in distributed database management systems: The CRM requests the allocation of all required resources at all involved domains. Since CCS domains allow reservations it is possible to
implement a co-scheduling mechanism [19]. If not all resources are available, the CRM either re-schedules the job or denies the user request. Otherwise, the job can now be started in a synchronized way. Here, machine-specific pre-processing tasks or inter-machine initializations (e.g. starting of special daemons) must be performed. Analogously to the domain level, the CRM is able to migrate user resources between machines to achieve better utilization. Accounting and authorization at the site level can also be implemented at this layer. The CRM does not have to be a single daemon. It could also be implemented as distributed instances like the QM-MM complex at the domain level.
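As a rough illustration of the two-phase allocation idea sketched above, consider the following fragment. The Domain interface with reserve(), commit() and release() is a hypothetical stand-in for the CCS domain interface, not the actual API; it only shows how partial reservations are rolled back when one domain cannot satisfy its part of the request.

// Minimal sketch of two-phase resource allocation across domains (hypothetical
// interface, not the CCS implementation).
#include <cstddef>
#include <vector>

struct Request { /* RSD description of the resources wanted at one domain */ };

struct Domain {
    virtual bool reserve(const Request &r) = 0;  // phase 1: tentative reservation
    virtual void commit() = 0;                   // phase 2: make it firm
    virtual void release() = 0;                  // abort: give the reservation back
    virtual ~Domain() = default;
};

// Returns true if every domain could reserve its part of the multi-site job;
// otherwise all partial reservations are released and the job is re-scheduled
// or denied, as described in the text.
bool allocate_multi_site(std::vector<Domain *> &domains,
                         const std::vector<Request> &parts)
{
    std::vector<Domain *> reserved;
    for (std::size_t i = 0; i < domains.size(); ++i) {
        if (domains[i]->reserve(parts[i])) {
            reserved.push_back(domains[i]);
        } else {
            for (Domain *d : reserved) d->release();   // roll back phase 1
            return false;
        }
    }
    for (Domain *d : reserved) d->commit();            // phase 2
    return true;
}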
1.4.2
Center Information Server (CIS)
The CIS is the 'big brother' of the domain manager (DM) at the site level. Like the well-known Network Information Service NIS, CIS provides up-to-date information on the resources in a site. However, compared to the active DM in the domains, CIS is a passive component. At startup time, or when the configuration has been changed, a domain signs on at the CIS and informs it about the topology of its machines, the available system software, the features of the user interfaces, the communication interfaces and so on. The CIS maintains an RSD based database on the network protocols, the system software (e.g., programming models, and libraries) and the time constraints (e.g. for specific connections). This information is used by the CRM. The CIS also plays the role of a docking station for mobile agents or external users. For higher level metacomputer components, the CIS data must be compatible or easily convertible to the formats used by other resource management systems (e.g. XML).
1.5
Related Work
Much work has been done in the field of resource management in order to optimally utilize the costly high-performance computer systems. However, in contrast to the CCS approach, described here, most of today's resource management systems are either vendor-specific or devoted to the management of workstation clusters in throughput mode. The Network Queuing System NQS [25], developed by NASA Ames for the Cray2 and Cray Y-MP, might be regarded as the ancestor of many modern queuing systems like the Portable Batch System PBS [6] or the Cray Network Queuing Environment NQE [28]. Following another path in the line of ancestors, the IBM Load Leveler is a direct descendant of Condor [26], whereas Codine [13] has its (far away) roots in Condor and DQS. They have been developed to support high-throughput computing on UNIX clusters. In contrast to high-performance computing, the goal is here to run a large number of (mostly sequential) batch jobs on workstation clusters without
affecting interactive use. The Load Sharing Facility LSF [27] is another popular software to utilize LAN-connected workstations for throughput computing. More detailed information on cluster managing software can be found in [2], [23]. Aside from local cluster management, several schemes have been devised for high-throughput computing on a super-national scale. They include the Iowa State University's Batrun [35], the Flock of Condors used in the Dutch Polder initiative [15], the Nimrod project [1], and the object-oriented Legion [20] which proved useful in a nation-wide cluster. While these schemes emphasize mostly the application support on homogeneous systems, the AppLeS project [8] provides application-level scheduling agents on heterogeneous systems, taking into account their actual resource performance. Another approach is to include cluster computing capabilities into the operating system itself. MOSIX (Multicomputer OS for UNIX) [4] is such a software package. It enhances the Linux kernel and allows any size cluster of X86/Pentium based workstations and servers to work cooperatively as if part of a single system. Its main features are preemptive process migration and supervising algorithms for load-balancing and memory ushering. MOSIX operations are transparent to the applications. Sequential and parallel applications can be executed just like on an SMP. MOSIX provides adaptive resource management algorithms that monitor and respond (on-line) to unbalanced work distribution among the nodes in order to improve the overall performance of all the processes. It can be used to define different cluster types and also clusters with different machines or LAN speeds.
1.6
Summary
With the current trend towards cluster computing with heterogeneous (sometimes even WAN-connected) components, the management of computing resources is getting more important and also more complex. Modern resource management systems are regarded as portals to the available computing resources, with the system-specific details hidden from the user. Most often, workstation clusters are not only used for high-throughput computing in time-sharing mode but also for running complex parallel jobs in space-sharing mode. This poses several difficulties to the resource management system. It must be able to reserve computing resources for exclusive use and also to determine an optimal process mapping for a given system topology. On the basis of our CCS environment, we have presented the anatomy of a modern resource management system. CCS is built on three architectural elements: the concept of autonomous domains, the versatile resource description tool RSD, and the site management tools CIS and CRM. The actual implementation has the following features:
• It is modular and autonomous on each layer with customized control facilities. New machines, networks, protocols, schedulers, system software, and metalayers can be added at any point, some of them even without the need to
re-boot the system.
• It is reliable. Recovery is done at the machine layer. Faulty components are restarted and malfunctioning modules are disabled. The Center Information Server (CIS) is passive and can be restarted or mirrored.
• It is scalable. There exists no central instance. The hierarchical approach allows connections to other centers' resources. This concept has been found useful in several industrial projects.
The resource and service description tool RSD is a generic tool for specifying any kind of computing resource like CPU nodes, networks, storage, software, etc. It has a user-friendly graphical interface for specifying simple requests or environments and an additional textual interface for defining complex system environments in more detail. The hierarchical graph structure of RSD allows complex details to be hidden from the user by defining them in the next hierarchy level. The CCS software package is currently being re-engineered so that it can be released under the GNU General Public License.
Acknowledgments
Over the years, many people helped in the design, implementation, debugging, and operation of CCS: Bernard Bauer, Mathias Biere, Matthias Brune, Christoph Drube, Harald Dunkel, Jorn Gehring, Oliver Geisser, Christian Hellmann, Achim Koberstein, Rainer Kottenhoff, Karim Kremers, Fru Ndenge, Friedhelm Ramme, Thomas Romke, Helmut Salmen, Dirk Schirmer, Volker Schnecke, Jorg Varnholt, Leonard Voos, Anke Weber. Also, we would like to thank our colleagues from CNUCE, Pisa, who have designed and implemented the RsdEditor: Domenico Laforenza and Ranieri Baraglia, and their master students Simone Nannetti and Mauro Micheletti.
1.7
Bibliography
[1] D. Abramson, R. Sosic, J. Giddy, and B. Hall. Nimrod: A Tool for Performing Parameterized Simulations using Distributed Workstations. 4th IEEE Symp. on High Performance Distributed Computing, August 1995.
[2] M. Baker, G. Fox, and H. Yau. Cluster Computing Review. Northeast Parallel Architectures Center, Syracuse University, New York, November 1995. http://www.npac.syr.edu/techreports/.
[3] R. Baraglia, D. Laforenza, A. Keller, and A. Reinefeld. RsdEditor: A Graphical User Interface for Specifying Metacomputer Components. Proc. 9th Heterogeneous Computing Workshop HCW 2000 at IPDPS, Cancun, Mexico, 2000, pp. 336-345.
[4] A. Barak, O. La'adan, and A. Shiloh. Scalable Cluster Computing with MOSIX for LINUX. Proc. Linux Expo '99, Raleigh, N.C., May 1999, pp. 95-100.
[5] B. Bauer and F. Ramme. A General Purpose Resource Description Language. In: Grebe, Baumann (eds.): Parallele Datenverarbeitung mit dem Transputer, Springer-Verlag, Berlin, 1991, pp. 68-75.
[6] A. Bayucan, R. Henderson, T. Proett, D. Tweten, and B. Kelly. Portable Batch System: External Reference Specification. Release 1.1.7, NASA Ames Research Center, June 1996.
[7] T. Beisel, E. Gabriel, and M. Resch. An Extension to MPI for Distributed Computing on MPPs. In M. Bubak, J. Dongarra, J. Wasniewski (eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, LNCS, Springer, 1997, pp. 25-33.
[8] F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao. Application-Level Scheduling on Distributed Heterogeneous Networks. Supercomputing, November 1996.
[9] N. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. K. Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, Vol. 15, No. 1, Feb. 1995, pp. 29-36.
[10] M. Brune, J. Gehring, and A. Reinefeld. Heterogeneous Message Passing and a Link to Resource Management. Journal of Supercomputing, Vol. 11, 1997, pp. 355-369.
[11] M. Brune, J. Gehring, A. Keller, and A. Reinefeld. RSD - Resource and Service Description. Intl. Symp. on High Performance Computing Systems and Applications HPCS'98, Edmonton, Canada, Kluwer Academic Press, May 1998.
[12] M. Brune, A. Reinefeld, and J. Varnholt. A Resource Description Environment for Distributed Computing Systems. Proc. 8th Intern. Symp. on High-Performance Distributed Computing HPDC'99, Redondo Beach, 1999, pp. 279-286.
[13] Codine: Computing in Distributed Networked Environments. http://www.gridware.com, November 2000.
[14] EGrid testbed. http://www.egrid.org/, November 2000.
[15] D. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne. A Worldwide Flock of Condors: Load Sharing among Workstation Clusters. FGCS, Vol. 12, 1996, pp. 53-66.
[16] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. Intl. J. Supercomputer Applications, 11(2), 1997, pp. 115-128.
[17] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publ., 1999.
[18] J. Gehring and F. Ramme. Architecture-Independent Request-Scheduling with Tight Waiting-Time Estimations. IPPS'96 Workshop on Scheduling Strategies for Parallel Processing, Hawaii, Springer LNCS 1162, 1996, pp. 41-54.
[19] J. Gehring and T. Preiss. Scheduling a Metacomputer with Uncooperative Subschedulers. Proc. IPPS Workshop on Job Scheduling Strategies for Parallel Processing, Puerto Rico, Springer LNCS 1659, April 1999.
[20] A. Grimshaw, J. Weissman, E. West, and E. Loyot. Metasystems: An Approach Combining Parallel Processing and Heterogeneous Distributed Computing Systems. J. Parallel Distributed Computing, Vol. 21, 1994, pp. 257-270.
[21] H. Hellwagner and A. Reinefeld (eds.). SCI: Scalable Coherent Interface: Architecture and Software for High-Performance Compute Clusters. Springer Lecture Notes in Computer Science 1734, 1999.
[22] HpcLine. http://www.siemens.de/computer/hpc/de/hpcline/, June 2000.
[23] J. Jones and C. Brickell. Second Evaluation of Job Queueing/Scheduling Software: Phase 1 Report. NASA Ames Research Center, NAS Tech. Rep. NAS-97-013, June 1997.
[24] A. Keller and A. Reinefeld. CCS Resource Management in Networked HPC Systems. 7th Heterogeneous Computing Workshop HCW'98 at IPPS, Orlando, Florida, IEEE Comp. Society Press, 1998, pp. 44-56.
[25] B. A. Kinsbury. The Network Queuing System. Cosmic Software, NASA Ames Research Center, 1986.
[26] M. J. Litzkow and M. Livny. Condor - A Hunter of Idle Workstations. Proc. 8th IEEE Int. Conference on Distributed Computing Systems, June 1988, pp. 104-111.
[27] LSF: Product Overview. http://www.platform.com, November 2000.
[28] NQE-Administration. Cray-Soft USA, SG-2150 2.0, May 1995.
[29] Portable Batch System (PBS). http://pbs.mrj.com/, November 2000.
[30] F. Ramme, T. Romke, and K. Kremer. A Distributed Computing Center Software for the Efficient Use of Parallel Computer Systems. HPCN Europe, Springer LNCS 797, Vol. 2, 1994, pp. 129-136.
[31] Scali. http://www.scali.no/, November 2000.
[32] J. Simon, R. Weicker, and M. Vieth. Workload Analysis of Computation Intensive Tasks: Case Study on SPEC CPU95 Benchmarks. Euro-Par'97, Springer LNCS 1300, 1997, pp. 971-983.
[33] Simple Network Management Protocol (SNMP) Specification. http://www.faqs.org/rfcs/rfc1098.html, November 2000.
[34] T. L. Sterling, J. Salmon, D. J. Becker, and D. F. Savarese. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. The MIT Press, Cambridge, MA, 1999.
[35] F. Tandiary, S. C. Kothari, A. Dixit, and E. W. Anderson. Batrun: Utilizing Idle Workstations for Large-Scale Computing. IEEE Parallel and Distributed Techn., 1996, pp. 41-48.
[36] XML 1.0 Recommendation. http://www.w3.org/TR/REC-xml, November 2000.
Chapter 2
On-line OCM-Based Tool Support for Parallel Applications
MARIAN BUBAK, WLODZIMIERZ FUNIKA
Institute of Computer Science, AGH, al. Mickiewicza 30, 30-059 Cracow, Poland
Academic Computer Centre CYFRONET, ul. Nawojki 11, 30-950 Cracow, Poland
BARTOSZ BALIS
Institute of Computer Science, AGH, al. Mickiewicza 30, 30-059 Cracow, Poland
AND ROLAND WISMÜLLER
LRR-TUM, Technische Universität München, D-80290 München, Germany
emails: {bubak,funika}@uci.agh.edu.pl, [email protected], [email protected]
Abstract
This paper presents the motivation and issues concerning the development and adaptation of on-line tools to the OCM, an OMIS Compliant Monitoring system. It covers the general structure of an OCM-based tool environment and other enhancements needed both at low-level parts of the monitoring environment and at the user interface level in order to achieve full tool support for parallel applications. The high demand for tool support for parallel programming, which comprises several classes of tools, e.g. debuggers, performance analyzers, visualizers, etc., reveals a strong need for co-operation among tools, which should therefore constitute a well-designed tool environment. This should be based on the monitoring of parallel applications
developed using different programming models and offer a wide range of tools capable of cooperating with each other. This paper presents the recent development of the OCM-based environment of on-line tools and outlines perspectives for further research.
Keywords: parallel programming, monitoring systems, on-line tools, interoperability.
2.1
Introduction
To support the development of parallel message-passing applications, a number of run-time tools have been designed [4]. These tools make it possible to observe and manipulate the behavior of an executing application and thus support performance analysis, visualization, debugging, etc. They are intended to provide indispensable support during all phases of program development. In particular, a performance measurement tool enables the analysis of an application's execution, especially focusing on such problems as the utilization of system resources (CPU, memory) or inter-process communication. Thus, being aimed at finding potential bottlenecks, performance analysis may help to improve the overall performance of the parallel algorithm. During recent years, a large number of tools have emerged [4]. The bulk of them follow the off-line approach, where the gathered data can only be analysed after the application has terminated. Representative examples are Vampir, Pablo, and ParaGraph. The on-line tools are much fewer, and even fewer are the tools which feature a well-defined interface to a monitoring facility, which could provide prerequisites for interoperability. Practically every on-line tool needs such an interface to a monitoring system via which it can observe and possibly influence the application's state. Since virtually all of the existing tools are supplied with their own proprietary monitoring systems, they are not capable of cooperating with each other. Due to this, the on-line tools suffer from poor interoperability and portability. Among the initiatives to define universal monitoring interfaces are DAMS [10], PARMON [20], and DPCL [19]. However, DAMS does not currently address performance monitoring, while PARMON and DPCL do not offer direct support for applications using de-facto standard message passing environments like PVM [12] and MPI [18]. The On-line Monitoring Interface Specification (OMIS) [16] is aimed at improving this situation. OMIS defines a standard interface to provide unified access to a wide set of services for tool development. In OMIS, there are three categories of services, which allow a tool to retrieve information about the execution of a parallel program, to control the execution, and to detect specific events during execution in order to start proper actions. Owing to the OMIS specification and its implementation, the OCM (the OMIS Compliant Monitoring system), the performance analysis tool PATOP is enabled to function in message passing programming environments and to interoperate with other on-line monitoring-bound, OMIS-compliant tools
like debuggers, visualizers, etc. The OCM, originally developed to monitor PVM applications, has been enhanced to support MPI applications. The next natural step is to adapt this and other tools to the latter programming model. In this paper, we present the issues concerning the enhancement of the tool functionality, the relevant changes to the tools and the monitoring system, and other aspects connected with extending the range of parallel environments to be supported.
2.2
OMIS as Basis for Building Tool Environment
OMIS is intended to meet the goal of building a standardized interface between application development support tools and parallel programming environments. Such an interface must address the issues of versatility, portability, extensibility and independence of the underlying parallel environment. OMIS is not restricted to a single kind of tool; in particular, it supports both debuggers and performance analyzers [6], [8]. This implies that a monitoring system compliant with OMIS needs to be able to control processes from the very beginning of their execution (crucial for debugging) as well as to monitor the parallel programming library calls efficiently (crucial for performance analysis). The key feature of OMIS is its capability to be a basis for supporting a large number of tools on a wide range of platforms under different programming environments. This implies, e.g., that a debugging tool which can co-operate with a performance analyzer in a particular parallel environment can be easily transferred into another environment while preserving its capability of co-operation. Thus a high level of portability and interoperability, so crucial in run-time tools, is provided. Generally, OMIS defines:
1. three categories of monitoring-related services, as mentioned before: information services, manipulation services, and event services,
2. a basic set of services which are available on any platform,
3. an extension interface that allows new services to be added, supporting the operation of tools in particular environments,
4. a uniform string-based syntax for event-driven requests to the services defined,
5. data structures for service replies,
6. six C-level function calls that build the interface to the monitoring system.
For a full specification of OMIS, the reader is referred to [16]. At the beginning, the use of OMIS for parallel programming was rather restricted. Although an OMIS compliant monitoring system, the OCM [26], was available, it could only be used to monitor parallel programs based on PVM [22]. In addition, a number of tools were available, including the parallel debugger DETOP [24]. However, the important task of performance analysis was not yet supported
by OMIS based tools. The project we will report on in the next sections thus had two goals: to extend the applicability of the OCM to other programming libraries, especially MPI, and to extend the OMIS-based tool environment, especially w.r.t. performance analysis.
2.3
Adapting the OCM to MPI
The OCM [26] is a monitoring system implemented in line with OMIS. Originally, the OCM has been implemented for the PVM programming library and could be used to monitor PVM parallel programs only [22].
Figure 2.1. Structure of OCM-based monitoring.
As shown in Fig. 2.1, the OCM is a distributed monitoring environment with a local monitor process on each node of the target system. In addition, there is a central component (the Node Distribution Unit, for short NDU) responsible for distributing requests to the local monitors and re-assembling their replies. The local monitor processes control the application processes via operating system interfaces (ptrace and /proc) and communicate with additional parts of the monitoring system contained in instrumented libraries, using a shared memory segment.
2.3.1
Handling Applications in MPI vs. PVM
The growing use of MPI motivated our involvement in the porting and enhancement of the OCM's functionality so that it could support the development of MPI applications as well. When porting the OCM to MPI, our main concern was to preserve the user interface designed for PVM as much as possible. This would ensure that the user could easily switch between PVM and MPI applications. One of the main points in the design of the OCM concerns the way the user can handle both the execution of an application program with the OCM and the start-up of all of the OCM's components. The usage of OCM-based tools should be as close to that of sequential on-line tools as possible; this implies that one can start a tool without an already running program. The tool then starts the OCM, which in turn starts the application. Another requirement is that a tool should support a simple 'rerun' of the application, without a need to start the tool anew. However, due to significant differences in the PVM and MPI mechanisms [14], in MPI these general features may be realized in a way different from that for PVM, including a different sequence of start-ups, as shown below. For PVM, due to the "virtual machine" that exists independently of any application program, it is possible to realize two things: to start a monitor process on each host, and to create/rerun the application processes: the OCM just has to start the initial process, which will explicitly spawn all other processes. For MPI, there is no virtual machine - the hosts to be monitored are known only after an application has been started. Thus, the start-up must proceed in a reversed sequence: first, an application is created, then a monitor is started on each node which is used by the application. In order to support a consistent interface with the PVM version and to enable a restart of an application, we have introduced an MPI specific service which creates all the processes of an MPI application. The considerations above imply that the bulk of the modifications to the OCM needed to support MPI applications is focused on the program (re)start and initialization mechanisms within the OCM, the application and a tool, as well as the ways of transferring control and application-related data between them.
2.3.2
Starting-up MPI Applications
The first and most important thing an on-line monitoring system must know about an application to be monitored is:
• what are the hosts the application is running on?
• which processes on these hosts belong to the application?
The user should not be forced to specify this information explicitly. Instead, this should be retrieved by the monitoring system automatically. Whereas the PVM version of the OCM can use the existing PVM facilities to access this information, such means do not exist in MPI. So how can the OCM get the necessary data? For MPI applications using mpich [13], the data concerning the application can be retrieved during its startup in the following way: When the application is
started with a proper command line flag passed to mpirun, mpirun will start a single monitor process. Later, the initial (master) process (with global rank 0) is created and controlled by this monitor process. When the master process executes the MPI_Init() function, mpich creates all other processes and stores the relevant information in an internal data structure. When done, MPI_Init() calls a special routine MPIR_Breakpoint(), where the monitor may place a breakpoint to stop the master process. It can then read the necessary information from the following mpich variables, using the UNIX ptrace system call or the /proc file system:
• MPIR_proctable - a pointer to an array of MPIR_PROCDESC structures, which are defined as:

  typedef struct {
      char *host_name;        /* host name */
      char *executable_name;  /* path of the program's executable */
      int   pid;              /* system process identifier */
  } MPIR_PROCDESC;
• MPIR_proctable_size - the size of the array.
The processes created by the master process are stopped by an endless loop inside MPI_Init(). So there is some time to create the other monitor processes on each host involved, which then gain control over the application processes on their local host and force them to exit the loop. The outlined interface in mpich has originally been designed for the TotalView debugger [11], but is not restricted to it.
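To make the mechanism more concrete, the following sketch (not taken from the OCM sources) shows how a monitor might copy the proctable out of the master process once it has stopped in MPIR_Breakpoint(). read_remote() and start_local_monitor() are hypothetical helpers, and the addresses of the two mpich variables are assumed to have been resolved from the executable's symbol table beforehand.

// Sketch only: reading MPIR_proctable from the stopped master process.
// read_remote() stands for copying bytes out of the traced process (e.g. via
// repeated ptrace(PTRACE_PEEKDATA, ...) or the /proc/<pid>/mem file);
// start_local_monitor() stands for launching a monitor on the node in question.
#include <sys/types.h>
#include <cstddef>
#include <cstdlib>

struct MPIR_PROCDESC {
    char *host_name;          // note: a pointer into the master's address space
    char *executable_name;    // likewise a remote pointer
    int   pid;
};

bool read_remote(pid_t master, const void *remote_addr, void *buf, std::size_t len);
void start_local_monitor(pid_t master, const MPIR_PROCDESC &desc);

void attach_to_application(pid_t master, void *proctable_addr, void *size_addr)
{
    int n = 0;
    read_remote(master, size_addr, &n, sizeof n);             // MPIR_proctable_size

    void *base = nullptr;                                     // value of MPIR_proctable
    read_remote(master, proctable_addr, &base, sizeof base);

    MPIR_PROCDESC *table =
        static_cast<MPIR_PROCDESC *>(std::calloc(n, sizeof(MPIR_PROCDESC)));
    read_remote(master, base, table, n * sizeof(MPIR_PROCDESC));

    for (int i = 0; i < n; ++i)
        start_local_monitor(master, table[i]);   // host_name must still be copied
                                                 // out of the master's memory
    std::free(table);
}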
Figure 2.2. Time-lined relationships of a tool with the OCM and an MPI application.
Fig. 2.2 summarizes the temporal relationships between the tool, the OCM, and the MPI program: the tool has to invoke the mpirun script with the -ocm option to start the OCM. mpirun starts the OCM as a debugger process and passes the application's name and parameters to it. When the tool requests the service mpi_proc_create(), the OCM starts the application's 'master' process and gets the information about the application. The user can then manipulate and control the execution of the MPI program using the tool, which issues OMIS requests. The OCM plays the role of a server, while the tool is a client. The first local monitor has to provide a larger functionality due to a sort of functional asymmetry between the local monitors: the initial (master) local monitor must start up the MPI program and the rest of the monitoring system, and must distribute the data about the running program to the other local monitors and the NDU.
2.3.3
Flow of Information on Application
While the PVM oriented version of the OCM can access the application-related data via PVM library calls on each node, this is not possible with MPI, so the data must be explicitly distributed among nodes. In case mpich is used, the information retrieved from the application's master process (see Sect. 2.3.2) is distributed to all local monitor processes by the NDU (Fig.2.3). This information is also accessible for tools through an MPI specific information service.
Figure 2.3. Distribution of application-related information
2.3.4
Detection of Library Calls
In order to support performance measurement of parallel programs using the MPI library, it is necessary that the OCM can detect entries into and exits from MPI functions. The OCM gathers information about the parallel library calls via instrumentation of the parallel library. The instrumentation in the OCM consists in building a wrapper for each library function. The wrapper transfers relevant information on an MPI function call to the OCM prior to and after the invocation of that library function. In mpich [13], a profiling interface as defined by the MPI standard is provided, i.e. each function can be called in the normal mode (MPI_XXX entry points) or by means of the profiling interface (PMPI_XXX entry points). Thus, mpich can be instrumented by replacing the 'standard' mpich library (with MPI_ entry points) with a library consisting of wrapper functions and linking the mpich profiling library (with PMPI_ entry points) to the application. The wrapper functions can then call the original MPI routines using the PMPI_ entry points. Since the wrapper code is quite complex and MPI consists of over a hundred functions, a kit of universal tools for automatic generation of the individual wrappers has been developed. From a list of functions for which wrappers are to be generated, the prototype declarations of these functions, and a list of wrapper definitions, it automatically creates the instrumented MPI library.
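To give a flavour of what such a generated wrapper looks like, here is a hand-written sketch for MPI_Send. It is not taken from the OCM sources: the hook functions ocm_event_enter()/ocm_event_leave() are invented placeholders for the shared-memory transfer described in Section 2.3, and the MPI-1 era signature (non-const buffer, as in mpich of that time) is assumed.

// Sketch of one wrapper around the PMPI profiling interface (hook names are
// assumptions, not part of the OCM).
#include <mpi.h>

extern "C" void ocm_event_enter(const char *fn);
extern "C" void ocm_event_leave(const char *fn, int bytes);

extern "C" int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm)
{
    int type_size = 0;
    PMPI_Type_size(datatype, &type_size);     // use PMPI_ to avoid re-entering wrappers
    ocm_event_enter("MPI_Send");              // "entry" event
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);  // original routine
    ocm_event_leave("MPI_Send", count * type_size);             // "exit" event + volume
    return rc;
}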
2.4
Integrating the Performance Analyzer PATOP with the OCM
Performance measurement tools for parallel applications are generally intended to help in finding and fixing bottlenecks by carrying out various measurements and/or relating their results to some static analysis [17]. During recent years, a considerable number of performance tools have been made available. Unfortunately, as a rule these are off-line tools, which feature such disadvantages as: potentially vast volumes of performance data, poor guidance and insufficient flexibility. In order to provide on-line performance analysis facilities for the de facto message passing standards PVM [12] and MPI [18], we used two existing performance analysis tools, PATOP [24] and TATOO [3]. Below we outline the functionalities of the tools and the changes to the OCM and the tools, necessary to make them work together for PVM applications. The following stage is the adaptation of the tool environment to MPI applications.
2.4.1
PATOP's Starter
PATOP and TATOO, two tools for on-line (PATOP) and off-line (TATOO) performance analysis, have originally been developed at LRR-TUM for the analysis of applications running under the PARIX environment on Parsytec systems. Performance visualisation is provided using an X11/Motif user interface. Due to almost identical user interfaces in both PATOP and TATOO, users do not have to learn twice how to use them. Both tools provide reasonably rich sets of available measurements and display diagrams. They support measurement and visualisation at
various levels of detail: the whole system, individual nodes, particular processes or user functions. The available measurements comprise busy and sojourn time, delay in sending and receiving, data volume sent and received. Performance data can be presented in a wide range of ways, as bar graphs, multi-curves, color bars, curve/matrix graphs, distribution diagrams, and matrix diagrams.
2.4.2
Prerequisites for Integration of PATOP with the OCM
The integration of PATOP with the OCM involved an additional layer, ULIBS, that interfaces to the OCM and provides easier, high-level access to the monitoring services. In the case of PATOP, the task of ULIBS is to transform high-level measurement specifications from PATOP into event/action relations acceptable to the OCM and to transform back the results. Fig. 2.4 presents the structure of the on-line monitoring environment when monitoring a sample parallel application. In order to enable the monitoring of a parallel application by the OCM, some parts of the OCM must reside in the target processes, i.e. monitoring the communication library calls requires that the program be instrumented and therefore linked to some modules of the OCM. The PATOP tool runs on a host machine and communicates with the OCM using the abstraction layer, ULIBS. Both the OCM and ULIBS are extended with modules specific to performance analysis, PAEXT and PERF, respectively. The operation of the TATOO tool is outlined on the right side of Fig. 2.4. While the application is executing, performance data is stored in local trace buffers within the monitored application processes. When the application terminates, or a trace buffer is almost full, the OCM reads its contents. The trace data is then sent to TRACER, a standard OMIS client. TRACER saves the data to a trace file in the SDDF format [1]. The TATOO tool reads this trace file, using a special, off-line version of ULIBS. Adapting the performance measurement tools to the OCM required the definition of a number of additional services for performance analysis in the OCM and a modification of the ULIBS functionality.
2.4.3
Gathering Performance Data with the OCM
In order to be able to provide performance analysis of a parallel program, tools need to acquire information about the program, such as the amount of time spent on waiting for a message from another process, or the message size. The OCM provides all the necessary support for obtaining such information. To make this data available to the performance analysis tools, there are three principal ways.
1. The direct communication method implies that whenever an interesting event is detected, the OCM executes the services that have been associated with that event. The services send their results (i.e. information on the event or the state of the application when the event occurred) to the tool. Fig. 2.5 shows the flow of control and data in this case.
Figure 2.4. Operation of PATOP and TATOO on top of the OCM.
2. Tracing: when a relevant event occurs, some information on the event is stored in a trace buffer local to the application process. It is later read by the OCM on request. Fig. 2.6 shows the flow of control and data in this approach. It is also valid for the integrator/counter approach described below.
3. The counters/integrators method refers to summarizing the most important information on each event in counters and integrators local to the application process, accessible by the OCM on request. The mechanism supporting the counters and integrators is part of PAEXT, a performance analysis extension to the OCM, which is outlined in Section 2.4.4.
When choosing the proper way of gathering performance data, it is indispensable to take into consideration the demand for a particular kind of performance data and the cost of gaining it. Direct communication makes it possible to acquire not only the full information on each event, but also information on the state of the application when the event occurred. However, it also has the greatest cost, due to the fact that a message needs to be sent for each event occurrence. In the tracing and counter/integrator approaches, communication is needed infrequently, i.e. only when the displays of PATOP are updated. While tracing at least allows detailed information on each single event to be stored, counters and integrators only provide summary information. As the original set of event services defined by OMIS did not efficiently
Figure 2.5. Direct communication mechanism.
Figure 2.6. Tracing and counting/integrating mechanisms.
support these more sophisticated performance measurements, some new specialized event services were needed in order to address this problem. Such services are to be added to the OCM using the OMIS extension interface.
2.4.4
New Extension to the OCM - PAEXT
For the purposes of performance analysis using the counter/integrator approach, a new extension (PAEXT) to the OCM has been defined. In PAEXT, two additional types of OMIS objects are defined. Counters are identified by names (so-called tokens) starting with the prefix pa_c. They are plain integer variables that can be used for measuring values like message count and message volume. Integrators are named pa_i... and are used for floating-point values. They actually store an integral of a piecewise constant function over some interval. The integrators can be used for measuring values like the message receive delay. These objects can be either local, i.e. there is a separate instance for each node of the target system, or global, i.e. one instance for the whole system. For both
local and global objects, operations are defined for creating and deleting counters (integrators), and for incrementing and getting the values of counters (integrators). For example, to measure the total delay in pvm_send(), the following requests can be issued:
• event services (the integrator's value is initially 0):

  thread_has_started_lib_call([], "pvm_send"):
      pa_integrator_global_increment(pa_i_1, 1, $time)
  thread_has_ended_lib_call([], "pvm_send"):
      pa_integrator_global_increment(pa_i_1, -1, $time)

  In these requests, the part preceding the colon is an event specification, while the part following the colon defines the actions to be executed when the event occurs. Thus, the first request reads: whenever a process (more exactly a thread; in the case of PVM and MPI, however, each process consists of exactly one thread) starts a call to pvm_send, increment integrator pa_i_1. The result of both requests is that the value of the integrator pa_i_1 is increased by 1 at the start of a pvm_send() call and decreased at the end. The time passed each time is used to calculate the new integral, increasing it by the time spent executing the call.
• once in a while, the current integral containing the time elapsed can be obtained from the pa_i_1 integrator token, with a request like:

  :pa_integrator_global_read([pa_i_1], 0, 1)
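The integrator semantics used in these requests can be summarized by the following small sketch. It is only a model of the behaviour described above, not the PAEXT implementation: increment() raises or lowers the level of a piecewise constant function, and read() returns the integral of that function, e.g. the accumulated time during which a pvm_send() call was in progress.

// Toy model of a PAEXT-style integrator (illustration only).
class Integrator {
public:
    explicit Integrator(double start_time) : last_time_(start_time) {}

    // delta = +1 at the start of a library call, -1 at its end (mirroring the
    // two requests shown above); 'time' is the event timestamp.
    void increment(double delta, double time) {
        integral_ += level_ * (time - last_time_);  // area since the previous event
        level_    += delta;
        last_time_ = time;
    }

    // Integral accumulated up to 'time', e.g. the total time spent inside
    // pvm_send() so far (roughly what the read request above delivers).
    double read(double time) const {
        return integral_ + level_ * (time - last_time_);
    }

private:
    double level_    = 0.0;   // current value of the piecewise constant function
    double integral_ = 0.0;   // accumulated integral
    double last_time_;        // timestamp of the most recent event
};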
2.4.5
Modifications to the ULIBS Library
As already mentioned, ULIBS is a library of high-level performance measurement and debugging functions. It has been developed before the OCM for use within the PARIX parallel environment. The involvement of ULIBS as an intermediate abstraction level implied its modification to comply with the OMIS interface. Now, ULIBS provides the following functionality to the tools: asynchronous requests and callbacks, configuration and current state information (nodes, processes), starting/stopping processes, and debugger-oriented mechanisms like symbol tables, variable information, breakpoints, and single stepping. Performance analysis support in ULIBS comprises several functions:
• define a new measurement based on a specification passed as an argument. Available kinds of measurement include global sum, matrix, distribution, restriction and event trace. Data to be measured includes busy and sojourn time, message passing delays, and data volume. Measurements can be performed at the node, process and function level
• start/stop the gathering of performance data
• get performance data over the period of time passed as an argument; call an asynchronous callback when done
• delete a measurement
While the interface of these functions basically remained unchanged w.r.t. the PARIX version, their implementations had to be completely redesigned for use with the OCM. As outlined in Sect. 2.4.3, in contrast to the highly specialized PARIX monitoring system [24], the OCM does not currently provide direct support for more sophisticated measurement specifications, like the ability to restrict a measurement to some source-level functions.
2.4.6
Costs and Benefits of using the Performance Analysis Tool
The use of any performance tool perturbs the execution of the application; this perturbation is called monitoring overhead. The monitoring overhead in Tab. 2.1 has been measured for the communication-intensive NAS mg benchmark [2], relative to the standalone PVM version, on a network of 3 Sun SparcStation sun4c class machines running Solaris 2.6. Table 2.1 summarizes the pros and cons of each approach.
Table 2.1. Comparison of three ways for gathering performance data

                                        Direct communication   Tracing          Counters/integrators
  Available range of performance data   basically unlimited    wide             only summary information
  Access to state info                  yes                    no               no
  Event filtering                       yes (in ULIBS)         yes (in ULIBS)   no
  Monitoring overhead (NAS mg)          19.9%                  5.7%             3.9%
  Processing in ULIBS                   high                   high             low

For each measurement, we use the least intrusive approach that still meets our goals, based on these costs and benefits. E.g., to carry out a system-wide analysis of the message volume sent, all the necessary support can be provided with counters, with the least possible intrusion. On the other hand, when limiting the analysis e.g. to certain source functions or to the messages sent to individual destination processes only, we have to use traces (all the events are stored in the trace, together with the destination processes' IDs; the uninteresting ones are later filtered out in ULIBS).
2.5
Adaptation of PATOP to MPI
Extending PATOP, the top layer of the tool environment, with the capability to measure the performance of MPI applications implied some indispensable extensions to the existing components of the environment. Fig. 2.7 shows the general architecture of the OCM-based tool environment with the MPI-specific parts and performance analysis modules exposed. The OCM gathers information on application processes and, on request, passes it to PATOP. In fact, PATOP communicates with the OCM indirectly using ULIBS, which translates high-level measurement requests to lower-level monitoring requests and accordingly transforms the resulting replies.
2.5.1
Changes in the Environment Structure
At the beginning of the adaptation, the following modules extended the original functionality of the OCM and ULIBS. The PAEXT extension to the OCM provides efficient mechanisms for storing the performance data. Another OCM extension, PATOPEXT, provides some additional higher level services for PATOP. These extensions are based on a well defined interface of the OCM and do not depend on the OCM's internal implementation. They are linked to the monitoring system mainly for efficiency reasons, to avoid excessive communication between the OCM and ULIBS. PERF is the part of ULIBS responsible for performance analysis services. It contains, among others, definitions of performance analysis measurements, which are expressed in terms of OMIS requests. All these modules were developed or changed while porting PATOP to the OCM (cf. Sect. 2.4). This ensured that PVM applications could be monitored with the OCM-based tools. Adapting PATOP to MPI required some new modules to be added to the environment, as follows. The OCM was extended with a new MPI-bound extension, MPIEXT, which has already been discussed in Sect. 2.3. The PATOPEXT module of the OCM was extended with an MPI-specific part. A new module for MPI was added to ULIBS, while the PERF module within ULIBS was extended with an MPI-specific part. The instrumented MPI library (for details of instrumentation, see Section 2.3.4) is intended to be linked to a parallel application.
2.5.2
Extensions to ULIBS
ULIBS contains the PERF module designed especially for services intended for performance analysis. This part contains information relevant to the performance measurements definitions. These definitions are, in fact, requests to the monitoring system for monitoring parallel library calls. Depending on the type of the measurement, proper data is requested from the OCM. In its turn, the PERF module needed to be extended by the measurement definitions which depend on the MPI communication subroutines semantics. Another important extension to ULIBS is a new MPI-specific module which maintains information on MPI communicators created in an application. This issue is described in Section 2.5.3.
Figure 2.7. Architecture of the monitoring environment
2.5.3
MPI-Specific Enhancements in PATOP
MPI includes some sophisticated mechanisms for point-to-point and collective communication, which were not supported by the existing version of PATOP for PVM. In particular, MPI provides an abstract entity called a communicator, which is a collection of processes together with additional attributes (e.g. a virtual topology). Moreover, MPI even enables groups of processes to communicate with each other (via so-called intercommunicators). Since all these sorts of communication in MPI are to be tracked, PATOP has to be provided with some new functionality. Although it is possible in PATOP to specify a subset of processes to apply a measurement to, it does not support such an abstraction as a group (collection) of processes. Thus PATOP
Adaptation of PATOP to M P I
47
must be supplied with the capability to recognize MPI communicators and enable the user to access them at the user interface level on a basis similar to those in case of processes or nodes. MPI-Oriented measurements in PATOP
In the original version of PATOP, the offered communication measurements related to a particular participant of a communication, i.e. either sender or receiver. However, in case of MPI collective communication, in practice it is not always possible to distinguish between the sender and the receiver. For example, the call MPI_Bcast(. . . , r o o t , comm) provides a collective communication, since it is invoked in multiple processes, i.e. all members of the communicator comm. The process with rank root sends a message to every process within comm, while other processes receive the message. In general, it is hardly possible to distinguish if this is a send or a receive operation. Therefore, two new measurements have been added to PATOP, described as follows: 1. Integrate between the start and the end of a collective communication call, i.e. measure the delay of a global communication call 2. Add the size of messages at the end of a collective call, i.e. measure the volume of data transferred within a collective call These measurements are expressed in terms of the OMIS requests similarly to those related to point-to-point communication. For instance, the message volume transferred within a global MPI_Bcast() call can be measured by means of the following request: thread_has_endedJ.ib_call([] , "MPI_Bcast"): pa_counter_globalincrement (pa_c_l, $len) The semantics of this request is as follows: whenever a MPI_Bcast() function call has ended in any process, increment the value of the associated counter pa_cA by the value of the length of the message just transferred. Groups of processes as a new scope of system objects
Collective communication always involves a communicator, thus it would be convenient to have a possibility to apply a global measurement at once to a collection of processes involved, instead of selecting the processes one-by-one. Therefore, the tool environment was enhanced to provide information on communicators created in an application. The information on communicators is gathered by means of instrumentation of the MPI communicator-related subroutines. Fig. 2.8 shows the PATOP New Measurement dialog extended with new features. In the Type of Measurement window, two new items were added: Delayed in Global and Global Volume. In the window, where the scope of a measurement is
48
On-line OCM-Based Tool Support for Parallel Applications
New measurements for global communication
Chapter 2
New level of detail for measurements, e.g. communicators
Figure 2.8. Changes in the New Measurement dialog in PATOP to be chosen, beside the Node Level and Process Level lists, a new Group Level list containing identifiers of existing process groups (communicators) was added. Once a process group is selected, the Process Level list is automatically activated and all member processes of the selected group are shown. A measurement will apply to the processes selected from the Process Level list, or - if nothing is selected - to the groups selected from the Group Level list.
2.5.4
Monitoring Overhead Test
In the closing stage of the adaptation of PATOP to MPI, we measured the perturbation induced by the tool environment to the execution of a program. The ping-pong benchmark provided with mpich was used for this purpose. In the benchmark, two processes exchange messages using blocking send/receive subroutines MPIJSendQ and MPI-Recv(). The ping-pong benchmark is suitable for evaluating the monitoring overhead, as it exhibits a particularly high frequency of communication calls. In
Section 2.5.
49
Adaptation of PATOP to MPI
the example, 10000 messages, 10000 bytes each were sent from process one to process two and back. That gives a total number of nearly 200 MegaBytes transmitted. Three test runs were performed: 1. Uninstrumented application. In this case, the application linked with an uninstrumented MPI library was executed. 2. Instrumented application. Similarly to test No. 1, but the instrumented library was used to test the overhead of inactive instrumentation code. 3. Application + PATOP. The application was monitored with PATOP using the messages sent and messages received performance measurements to test the overhead of the instrumentation code. In each test, the wall clock time was measured using the MPI_Wtime subroutine. Thus, the overhead not only includes the direct overhead caused by the instrumentation, but also indirect overhead caused by the monitoring processes and PATOP stealing CPU cycles from the monitored application. The tests were repeated 10 times and a mean value of the times was taken. The table below shows a summary of the results obtained on two SUN workstations (300 MHz UltraSparc II), connected via fast Ethernet (100 MBit/s). The indicated error interval is ±cr, i.e. the values' standard deviation. Test Uninstrumented application Instrumented application Application + PATOP
Mean exec, time [s] 41.77 ± 0 . 1 9 41.62 ± 0.14 43.54 ± 0.32
Abs. overhead per event [fis]
Relative overhead [%]
88.5 ± 18.5
4.2 ± 0.89
As one can see, within the limits of statistical variation the execution times of an uninstrumented application and an application with inactive instrumentation code are the same 4 . This is not astonishable as for an inactive event, only one additional memory read and one comparison operation have to be performed. On the other hand, the overhead of full monitoring is significant - the application was running approximately 4.2% slower. However, this overhead could be further reduced by executing PATOP and the OCM's NDU process on a separate node. When we relate the overhead to the number of events having been generated in each application process (i.e. 20000, which is the number of MPI-Send() and MPI_Recv() calls), quite a small overhead of order 90 (is per event is obtained. This is what we expected, as the mechanisms of obtaining performance data are optimized [7]. In fact, the mean execution time for the instrumented application by chance even was slightly smaller than the mean execution time of the uninstrumented application.
50
2.6
On-line OCM-Based Tool Support for Parallel Applications
Chapter 2
Interoperability within the OCM-Based Environment
Each type of tools for parallel programming support has a well-defined functionality, therefore, in order to achieve a complex set of services in a tool environment, the interoperability of tools is highly desirable. For run-time tools, interoperability means the capability to run concurrently and be applied to the same application with a possible synergetic effect [21]. Ideally, one would like to enable interoperability between two tools coming from different vendors. However, such tools are likely to be incompatible with each other and, due to low-level conflicts, it might be even not possible to run them concurrently, or even if the tools are capable of running concurrently, further conflicts may occur at higher levels. In the present section it is shown how interoperability of on-line tools is enabled in the OCM-based tool environment.
2.6.1 Interoperability
The term interoperability, in the context of monitoring, mainly refers to on-line tools and means their capability to run concurrently when applied to the same application [23]. Moreover, cooperation between the tools is supposed to be possible in order to provide additional functionality to the tool environment. For example, if a performance analyzer runs concurrently with a load balancer, and the latter migrates a process, the former should visualize the migration on its displays. Vice versa, a process migration may be forced via the performance analyzer. The first basic requirement for interoperability concerns the possibility of running different tools concurrently. In the case of tools coming from different vendors, supplied with their own monitoring modules, structural conflicts between different portions of the monitoring systems may occur, which may even prevent tools from running concurrently. As multiple tools may request an operation on a single object (e.g. writing into a process address space) at the same time, an infrastructure must be provided to handle multiple requests. For these reasons, unless tools form a monolithic, integrated environment dependent on each other's implementation, the interoperability of tools based on different monitoring systems is hardly attainable, due to likely structural conflicts or conflicts on exclusively accessible objects among the monitoring modules [21]. Further problems may occur at the user level and manifest themselves as logical conflicts. For example, if a debugger and a visualizer work concurrently, and a process is stopped by the former, the latter might not show this on its displays unless it is notified of the event. This results in inconsistencies in the representation of the monitored system's state, which we call consistency problems. The subsections below provide an insight into the interoperability support within the OCM, both existing and newly added, as well as an interoperability-related extension to the OCM realized within this work.
2.6.2 Interoperability Support in the OCM
From the beginning, the OCM has provided some coordination features to address low-level conflicts arising when multiple tools access shared objects [23]:

• requests referring to a single object are mutually exclusive,

• requests operating on more than one node are distributed to the local monitors via an atomic multicast operation, to ensure their execution in the same order on each node,

• requests can be locked to prevent any other request on any node from executing while the locked request is being executed.

Furthermore, the concept of events, as defined by the OMIS specification, allows higher level conflicts to be addressed. These issues are described below.
2.6.3 Interoperability in the OCM-Based Tool Environment
In this subsection, we focus on the interoperability of two tools, DETOP and PATOP. An insight into the structure of an OCM-based tool environment and some of its components is provided. Also covered is the startup "protocol" of the environment. The structure of an OCM-based tool environment composed of DETOP and PATOP is shown in Fig. 2.9. To provide management of the environment, we have developed a new OCM-oriented tool, OCTET, which is described in the next subsection.

The OCTET Tool

The OCTET (OCM-based Tool Environment top-level Tool) tool is intended to work on top of the tool environment. OCTET performs two tasks: the first is the start-up of the tool environment, which includes spawning the application processes and running the tools; the second is to provide the tools with information needed to resolve consistency problems. OCTET is a program that provides a simple interface for setting up a number of parameters, such as the name of the parallel environment (PVM or MPI), the paths to the application and tool executables, and the number of processes to be run (in the case of MPI only). A sample session with OCTET is shown in Fig. 2.10. The set command is used to set up the environment parameters, including the parallel library type (PVM or MPI), the path to the application executable and possibly other parameters. The run command schedules the specified tool to be run. The tools as well as the application are actually run after the go command is invoked. Commands which are not recognized by OCTET are treated as explicit requests to the monitoring system; hence they are sent to the OCM and their replies are printed to the standard output.
Figure 2.9. OCTET in the tool environment (DETOP and PATOP issue OMIS requests and receive replies through their tool libraries and ULIBS; OCTET, the OCM and the parallel application with its instrumented library communicate internally)

Startup Mechanism in the Tool Environment
The startup of the tool environment is managed by OCTET. Time dependencies at the tool environment startup are shown in Fig. 2.11. At first, OCTET establishes communication with the monitoring system, which typically means starting the OCM. Next, based on the information provided by the user, OCTET orders the OCM to start the application. This process may vary depending on the parallel environment (in the current implementation, this pattern is actually reversed in the case of an MPI application: the application itself is started prior to the OCM, see Sect. 2.3). In the next step, OCTET starts the tools and provides them with a list of the application processes' tokens. The tools are supposed to attach to each of the application processes. Finally, OCTET provides the tools with information on the environment to enable possible interactions between them (for more details, see Subsection 2.6.5).
2.6.4 Possible Benefits of DETOP and PATOP Cooperation

When applied to a long-running parallel application, PATOP, as a performance analysis tool, can be used to monitor and visualize the application's execution. When the application reveals unexpected behavior in PATOP's performance displays, one would like to localize the application's point of execution in order to find the cause of the behavior.
octet> set mode MPI
application type set to MPI
octet> set app-path "$HOME/MPI/cpi"
application path set to /home/balis/MPI/cpi
octet> run patop
PATOP has been scheduled to run
octet> run detop
DETOP has been scheduled to run
octet> go
starting the tool environment...
[tools and the application are being started]
octet> :mpi_get_proclist()
Figure 2.10. A sample session with OCTET
Figure 2.11. Time dependencies for the tool environment startup
In this case, DETOP is helpful, as it works at the source code level. After the application's execution has been suspended with DETOP and the relevant variables have been inspected, DETOP can be used to resume the application's execution again.
2.6.5 Direct Interactions

The cooperation of PATOP and DETOP can reveal consistency problems as defined in Section 2.6.1. The incorrect behavior occurs in two cases:
• When the application is started with PATOP, the latter starts reading performance data from the OCM and updates its performance displays to visualize the execution. However, when DETOP suspends the application, PATOP continues to read data and update its displays, whereas one would expect PATOP to suspend monitoring while the application is suspended. This is only possible if PATOP is notified whenever DETOP suspends the application's execution.

• If the application is started with DETOP, PATOP does not start monitoring. Again, a notification that the application processes have been continued is necessary. Similarly, when the application is resumed by DETOP after having been suspended, the same kind of notification is needed.

Fortunately, the notion of events provided by the OMIS specification helps in resolving these problems [16]. Basically, PATOP needs to "program" an action for each event of a thread suspension or continuation. This can be realized by issuing the following two conditional requests to the OCM:

thread_has_been_stopped( [] ) : print( [$proc] )
thread_has_been_continued( [] ) : print( [$proc] )

The semantics of these requests is as follows: whenever a process to which PATOP is attached has been stopped (continued), the process identifier of the stopped or continued process is returned. The events are handled by means of a callback mechanism. The process identifier is passed to the appropriate callback function, which is invoked on every occurrence of the event. This callback function performs the actions needed to stop or resume the measurements. Two further questions about the interactions of the tools remain:

1. PATOP can program reactions to various scenarios of tool cooperation. However, how can PATOP learn of the actual configuration of the tool environment (which tools are running), so that it can perform the appropriate actions?

2. Which part of the software should be responsible for sending the above requests? We might decide to insert the appropriate code directly into PATOP; however, this would be an intrusion into the tool's implementation, which contradicts the principal ideas of a loosely coupled environment, where tools are independent.

In [23], the second problem is resolved by dynamically inserting the necessary code into the tool and calling it via machine-level monitoring techniques such as dynamic instrumentation [15]. A drawback of this approach is its complexity and the resulting poor portability. The approach presented in [23] is currently implemented only for PVM on Sparc/Solaris. For our environment, we thus chose a more high-level approach: for each tool, a specific library is provided in which every probable scenario of tool cooperation is handled.
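The callback handling described above can be made more concrete with a small sketch. The snippet below is not the actual tool code (ULIBS and the OCM are implemented in C); it is an illustrative Java sketch in which OcmConnection, Measurements and the request-submission method are hypothetical stand-ins for the real interface, showing how an interoperability module might register the two conditional requests and react to the resulting events.

// Illustrative sketch only: OcmConnection and Measurements are hypothetical
// stand-ins for the real (C-based) ULIBS/OCM interface.
interface EventCallback {
    void onEvent(String procId);          // invoked on every occurrence of the event
}

interface OcmConnection {
    // Submit a conditional request; the callback receives the printed process identifier.
    void submitConditionalRequest(String request, EventCallback callback);
}

interface Measurements {
    void suspend(String procId);          // stop the measurements for this process
    void resume(String procId);           // restart the measurements for this process
}

class DetopCooperationSupport {
    private final OcmConnection ocm;
    private final Measurements measurements;

    DetopCooperationSupport(OcmConnection ocm, Measurements measurements) {
        this.ocm = ocm;
        this.measurements = measurements;
    }

    /** Program the two conditional requests described in the text. */
    void enable() {
        ocm.submitConditionalRequest(
            "thread_has_been_stopped( [] ) : print( [$proc] )",
            procId -> measurements.suspend(procId));
        ocm.submitConditionalRequest(
            "thread_has_been_continued( [] ) : print( [$proc] )",
            procId -> measurements.resume(procId));
    }
}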
One might get the impression that in this approach the tools actually know about each other, so one could argue whether they remain independent of each other. However, all the interoperability related code is implemented as a new module in ULIBS, which can be considered an independent component of the tool environment (Fig. 2.9), although it is implemented as a library linked to the tools' executables. Thereby the tools themselves are not really affected. It should be stressed that the new module is designed to provide general support for the interoperability of any combination of tools, not only DETOP and PATOP. Though at present only the case of DETOP-PATOP interoperability is implemented, other scenarios can easily be added. Note that with the implementation described above, the interoperability modules within the tools have to be provided with information on which tools are running within the environment. The component which possesses thorough knowledge of the whole system, in particular which tools are running at the moment, is the OCTET tool. OCTET can supply this information to all the tools, which causes the part of the interoperability module appropriate to the given scenario to be activated. For example, if OCTET knows that it will run PATOP and DETOP, it can send PATOP the information that DETOP is running as well. This information is processed by the startup module within ULIBS and passed to the interoperability module, which results in issuing the two requests described above.
2.7 A Case Study
In order to investigate the benefits coming from on-line tools, we studied the behavior of a real-world high performance application, a lattice gas automata (LGA) program [9]. Lattice gas consists of unit mass particles moving with unit velocities on a hexagonal lattice and colliding at lattice sites according to energy and momentum conservation rules. Macroscopic values are obtained by averaging after each specified number of steps. The evolution of the lattice consists of absorption, collision, injection, and free particle streaming. Parallelization of the cellular automata simulation is achieved by dividing the lattice into domains assigned to different processes, which in practice are situated on separate processors or workstations. Only the border rows of the domains are transmitted between neighbouring domains after each timestep. Using multispin coding, one can represent cellular automata with binary arrays assigned to different automaton variables (directions of motion in the case of LGA). In this way, parallelization is introduced at the instruction level. The application under discussion is built as a master/slave program and equipped with a load balancing scheme. The master is intended to decompose the computational box into domains and to send them to the slaves. Additionally, it has to periodically receive data from the slaves and to redistribute the domains, based on the time it takes the slaves to do a subsequent number of calculations. Having done the necessary calculations, the slaves exchange the data on boundary rows with their neighbours and
send the averaged values to the master. The load balancing is performed with a damping factor, which lies between zero and one. Using PATOP we observed the behavior of the program with 4 slaves, each on a separate node. The main scope of the observation was to check that the load balancing mechanism applied in the application works correctly. Since the slaves are periodically synchronized when exchanging the data at domain boundaries, the time spent waiting for receiving messages is a good metric to assess the load balance. We thus defined a measurement "delayed in receiving" for all processes. Fig. 2.12 shows the display window when running the program with PATOP. The input parameters of the LGA program were as follows: the lattice has 5529600 sites, the damping factor is 0.2, the overall number of steps is 500, and the averaging takes place every 20 steps.
[Figure 2.12 display: "Delayed in receiving" curves for the master and slaves 0-3 (resolution 1 s/pixel, start 00s, stop 07m52s); annotations mark the computational domain initialization and the load balancing activity due to an additional load on a node (related area highlighted).]
Figure 2.12. LGA program: "Delayed in receiving" measurement with PATOP

By looking at the times it takes the slaves to generate the initial states of their domains, it is possible to notice that the nodes executing the slave processes have rather different speeds. From top to bottom we used SUN Sparc workstations equipped with a 110 MHz MicroSparc, a 50 MHz SuperSparc, a 140 MHz UltraSparc, and a 296 MHz UltraSparc II processor, respectively. The uppermost slave
is the slowest and, because of this, the faster slaves spend more time idling. An interesting observation can be made if we look at the fragment of the run with the highlighted area (in the curve for slave 1). At the beginning of the highlighted time interval, we started an additional task on the node of slave 1 in order to examine the behavior of the load balancing mechanism. One can see that the other slaves start to spend more time idling, while slave 1 no longer has to wait for messages, since it no longer manages to do its work in time (the short period of excessive waiting time of slave 1 was caused by some incidental background load on the node of slave 3). Due to the changed situation, the load balancing mechanism in the master transferred part of the load to the other slaves, and therefore the situation gradually improved again. The improvement lasted until the moment when we terminated the additional background task, which again caused an imbalance and thus a new redistribution of the load among the slaves.
[Figure 2.13 display: DETOP output of two "print [p1] Nyavg[0:3]" commands, showing the domain size of slave 1 before and after placing the additional task on its node; about half of the data is distributed to the other slaves.]

Figure 2.13. LGA program: Inspecting the domain sizes for each slave with DETOP
To get insight into the actual distribution of the load during this experiment, we used the debugger DETOP simultaneously with PATOP. By examining the appropriate variables, we could verify that when the additional load is placed on the node of slave 1, about half of this slave's domain is distributed to the other slaves (see Fig. 2.13). This proves that the load balancing algorithm works correctly, as the background load effectively halves the node's performance. Moreover, by changing the variables related to the damping factor with DETOP and concurrently observing performance data with PATOP, one can determine which values of the damping factor are better suited for obtaining optimal performance.
A performance experiment like this would be practically impossible with off-line, trace-based tools, or even with non-interoperable on-line tools.
2.8 Concluding Remarks
On-line tool support is one of the indispensable prerequisites of successful development of parallel applications. When undertaking the task of building a universal tool environment, we aimed at using an OMIS-based implementation of the monitoring system to support a set of existing tools designed for performance analysis, debugging, etc. of parallel applications. The work was focused on extending the range of supported parallel environments as well as on extending the set of tools available to the user. The work on porting tools to the OCM comprised not only developing an interface between the OCM and the tools, but also enhancing the OCM functionality. The latter activities are feasible owing to the mechanism of extensibility embedded in OMIS and provided by the OCM, which makes it possible to add the support needed for particular situations, e.g. new programming environments or tools. Our work concentrated on working out a well defined tool-OCM interface based on the functionality of a library of functions, which enables transforming tool requests (e.g. those from a performance analysis tool) into OMIS requests and vice versa. On the one hand, the OCM has been extended to support a set of general purpose performance analysis requests, provided by PAEXT; on the other hand, it was possible to add tool-specific requests (e.g. those for PATOP, embedded in PATOPEXT). In turn, a tool, namely PATOP, was modified to support a new programming environment, MPI.

One of the main concerns within this work was the interoperability of on-line tools for parallel programming support, which is a key feature for building a powerful, easy-to-adapt tool environment. With the interoperability support residing in the environment infrastructure, not in the tools themselves, the user is able to customize his environment by picking the tools which best fit his needs. A system that supports interoperability must meet a number of requirements. First of all, the tools must be able to run concurrently. Next, a way to enable interactions between the tools must be provided. Finally, there must be a control mechanism to coordinate access requests to the target system objects. The OCM monitoring system provides mechanisms that are sufficient to meet these requirements. Tools adapted to the OCM can run concurrently and operate on the same objects. Moreover, tool interactions that lead to effective tool cooperation can be defined without intrusion into the tools' implementation.

Future work will concentrate on further extending the range of parallel environments supported by the tools and on the problem of direct interactions between the tools already involved and those to be added. As for the new environments, our main concern is providing support for OpenMP applications. Special attention will be paid to a well defined naming policy for loops and barriers, based on information from the compiler. Proper enhancements will concern the tool functionality as well.
Further development in the context of interoperability will be focused on extending the role of OCTET in "programming" the tools' interactions. In the future, OCTET may even provide general directives on how to "program" the interactions of tools.

Acknowledgement

This work has been carried out within the Polish-German collaboration scheme and was supported in part by KBN under grant 8 T11C 006 15. The authors are grateful to Kamil Iskra and Radoslaw Gembarowski for their contribution.
2.9 Bibliography
[1] Aydt, R. A.: The Pablo Self-Defining Data Format. Technical Report, Univ. of Illinois, Urbana-Champaign, March 1992. ftp://vibes.cs.uiuc.edu/pub/Pablo.Release.5/SDDF/Documentation/SDDF.ps.gz

[2] Bailey, D., et al.: NAS Parallel Benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, March 1994.

[3] Borgeest, R., Dimke, B. and Hansen, O.: A Trace Based Performance Evaluation Tool for Parallel Real Time Systems. Parallel Computing, 21(4), 1995, 551-564.

[4] Browne, S.: Cross-Platform Parallel Debugging and Performance Tools. In: Alexandrov, V., Dongarra, J., (eds.): Recent Advances in Parallel Virtual Machine and Message Passing Interface, Proc. 5th European PVM/MPI Users' Group Meeting, Liverpool, UK, September 7-9, 1998, Lecture Notes in Computer Science 1497, Springer, 1998, pp. 257-264.

[5] Bubak, M., Funika, W., Gembarowski, R., and Wismüller, R.: OMIS-Compliant Monitoring System for MPI Applications. In: R. Wyrzykowski, B. Mochnacki, H. Piech, J. Szopa (Eds.), PPAM'99 - The 3rd International Conference on Parallel Processing and Applied Mathematics, Kazimierz Dolny, Poland, 14-17 September 1999, pp. 378-386, IMiI Czestochowa, 1999.

[6] Bubak, M., Funika, W., Iskra, K., Maruszewski, R., and Wismüller, R.: OCM-Based Tools for Performance Monitoring of Message Passing Applications. In: Sloot, P., Bubak, M., Hoekstra, A., Hertzberger, B., (eds.): Proc. Int. Conf. High Performance Computing and Networking, Amsterdam, April 12-14, 1999, Lecture Notes in Computer Science 1593, Springer, 1999, pp. 1270-1273.

[7] Bubak, M., Funika, W., Iskra, K., Maruszewski, R., and Wismüller, R.: Enhancing the Functionality of Performance Measurement Tools for Message Passing Applications. In: Dongarra, J., Luque, E., Margalef, T., (Eds.), Recent
Advances in Parallel Virtual Machine and Message Passing Interface. Proceedings of 6th European PVM/MPI Users' Group Meeting, Barcelona, Spain, September 1999, Lecture Notes in Computer Science 1697, Springer, 1999, pp. 67-74.

[8] Bubak, M., Funika, W., Mlynarczyk, G., Sowa, K., and Wismüller, R.: Symbol Table Management in an HPF Debugger. In: Sloot, P., Bubak, M., Hoekstra, A., Hertzberger, B., (eds.): Proc. Int. Conf. High Performance Computing and Networking, Amsterdam, April 12-14, 1999, 1278-1281, Lecture Notes in Computer Science 1593, Springer, 1999.

[9] Bubak, M., Mościński, J., Slota, R.: Implementation of Parallel Lattice Gas Program. In: Dongarra, J., Wasniewski, J. (eds.): Parallel Scientific Computing, Lecture Notes in Computer Science 879, Springer-Verlag, 1994, pp. 136-146.

[10] Cunha, J., Lourenço, J., Vieira, J., Moscão, B., and Pereira, D.: A Framework to Support Parallel and Distributed Debugging. In: Sloot, P., Bubak, M., Hertzberger, B., (eds.): Proc. Int. Conf. High Performance Computing and Networking, Amsterdam, April 21-23, 1998, 708-717, Lecture Notes in Computer Science 1401, Springer, 1998.

[11] Dolphin Interconnect Solutions, Inc.: TotalView Overview. WWW page, http://www.etnus.com/tw/tvover.htm

[12] Geist, A., et al.: PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, Massachusetts, 1994.

[13] Gropp, W., Lusk, E.: User's Guide for mpich, a Portable Implementation of MPI. ANL/MCS-TM-ANL-96/6, 1996.

[14] Gropp, W., Lusk, E.: Why Are PVM and MPI So Different? In: Bubak, M., Dongarra, J., Wasniewski, J. (eds.): Recent Advances in Parallel Virtual Machine and Message Passing Interface, Proc. 4th European PVM/MPI Users' Group Meeting, Cracow, Poland, November 3-5, 1997, Lecture Notes in Computer Science 1332, 3-10, Springer, 1997.

[15] Hollingsworth, J. R., Miller, B. P., Goncalves, M. J. R., Xu, D., Naim, O., Zheng, L.: MDL: A Language and Compiler for Dynamic Program Instrumentation. In: Proc. International Conference on Parallel Architectures and Compilation Techniques, San Francisco, CA, November 1997. ftp://grilled.cs.wisc.edu/technical_papers/mdl.ps.gz

[16] Ludwig, T., Wismüller, R., Sunderam, V., and Bode, A.: OMIS - On-line Monitoring Interface Specification (Version 2.0). Shaker Verlag, Aachen, vol. 9, LRR-TUM Research Report Series, 1997.
http://wwwbode.in.tum.de/~omis/OMIS/Version-2.0/version-2.0.ps.gz

[17] Miller, B.P., Callaghan, M.D., Cargille, J.M., Hollingsworth, J.K., Irvin, R.B., Karavanic, K.L., Kunchithapadam, K., and Newhall, T.: The Paradyn Parallel Performance Measurement Tool. IEEE Computer, vol. 28, No. 11, November 1995, pp. 37-46.

[18] MPI: A Message Passing Interface Standard. In: Int. Journal of Supercomputer Applications, 8, 1994; Message Passing Interface Forum: MPI-2: Extensions to the Message Passing Interface, July 12, 1997. http://www.mpi-forum.org/docs/

[19] Pase, D.: Dynamic Probe Class Library: Tutorial and Reference Guide, Version 0.1. Technical Report, IBM Corp., Poughkeepsie, NY, June 1998. http://www.ptools.org/projects/dpcl/tutref.ps

[20] Rajkumar, B., Krishna Mohan, K.M., Gopal, B.: PARMON: A Comprehensive Cluster Monitoring System. In: Proceedings of High Performance Computing on Hewlett-Packard Systems, Zurich, Switzerland, 298-311, ETH Zurich, 1998; HPCC System Software PARMON Group: PARMON User Manual, CDAC, 1998. http://www.dgs.monash.edu.au/~rajkumar/parmon/ParmonManual.pdf

[21] Trinitis, J., Sunderam, V., Ludwig, T., and Wismüller, R.: Interoperability Support in Distributed On-line Monitoring Systems. In: M. Bubak, H. Afsarmanesh, R. Williams, and B. Hertzberger, editors, High Performance Computing and Networking, 8th International Conference, HPCN Europe 2000, volume 1823 of Lecture Notes in Computer Science, Amsterdam, The Netherlands, May 2000, Springer.

[22] Wismüller, R.: On-Line Monitoring Support in PVM and MPI. In: Recent Advances in Parallel Virtual Machine and Message Passing Interface, Proc. EuroPVM/MPI'98, Lecture Notes in Computer Science, vol. 1497, 312-319, Liverpool, UK, September 1998, Springer, 1998.

[23] Wismüller, R.: Interoperability Support in the Distributed Monitoring System OCM. In: R. Wyrzykowski et al., editor, Proc. 3rd International Conference on Parallel Processing and Applied Mathematics - PPAM'99, pages 77-91, Kazimierz Dolny, Poland, September 1999, Technical University of Czestochowa, Poland.

[24] Wismüller, R., Oberhuber, M., Krammer, J. and Hansen, O.: Interactive Debugging and Performance Analysis of Massively Parallel Applications. Parallel Computing, 22(3), 1996, 415-442. http://wwwbode.in.tum.de/~wismuell/pub/pc95.ps.gz
[25] Wismüller, R., Trinitis, J., and Ludwig, T.: OCM - A Monitoring System for Interoperable Tools. In: Proceedings of the 2nd SIGMETRICS Symposium on Parallel and Distributed Tools SPDT'98, Welches, OR, USA, August 1998.

[26] Wismüller, R., Trinitis, J., Ludwig, T.: A Universal Infrastructure for the Run-time Monitoring of Parallel and Distributed Applications. In: Euro-Par'98, Parallel Processing, Lecture Notes in Computer Science, vol. 1470, 173-180, Southampton, UK, September 1998, Springer, 1998.
Chapter 3
Task Scheduling on NOWs using Lottery-Based Work Stealing

BORIS ROUSSEV
Department of Information Systems, Susquehanna University, 514 University Avenue, Selinsgrove, PA 17870, USA
[email protected]

JIE WU
Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431, USA
[email protected]
Abstract

This paper presents a Java framework for high performance computing (HPC) on networks of workstations (NOWs). We introduce a new lottery-based work stealing algorithm for efficient scheduling of large-scale multithreaded computations on NOWs. In the proposed algorithm, idle workstations actively search out work to do rather than wait for work to be assigned. In the lottery game, each workstation is equipped with a set of tickets, and the number of tickets is proportional to the age of the oldest thread in the ready pool of the workstation. A winning ticket is drawn at random, and the workstation with the winning ticket becomes the victim from which the idle workstation steals work. The proposed selection procedure serves two purposes. First, we try to lower communication costs by stealing large amounts of work, the logic behind this being that old-aged computations are likely to spawn more work than relatively young computations. Second, we would
like to bias the search to obtain favourable results while at the same time avoiding system bottlenecks. Our approach has been implemented and tested on NOWs under the Solaris OS. Several examples have been used to demonstrate the potential performance gain.

Keywords: High Performance Computing (HPC), Java, Network of Workstations (NOWs)
3.1 Introduction
For the past twenty years parallel computing has been used successfully in many applications, such as weather forecasting, molecular modeling, airflow modeling, tax return processing, etc. [20]. Despite some success, and the fact that parallel processing has been conjectured to be the most promising solution to the computing requirements of many problem domains [26], parallel computing is not widely accepted in industry. Parallel computers conjure images of sophisticated and expensive multiprocessor architectures, running obscure operating systems and executing programs written in non-portable, special-purpose languages. The on-going technological convergence of local area networks (LANs) and massively parallel computers augments the effect of the reverse computing food chain law [1], where, in contrast to biology, the smallest fish, the personal computer, is eating the market of workstations, which has consumed the market for minicomputers and is now eating away the market for larger mainframes and supercomputers. The driving force behind this "law" is the better price/performance ratio of networks of workstations (NOWs) over parallel systems. We increasingly find NOWs making inroads into domains once monopolized by supercomputers [11]. We can identify the following motivating factors for using NOWs for high performance computing (HPC): (1) Surveys show that the utilization of CPU cycles of desktop workstations is typically less than 10%. (2) The performance of workstations and PCs is rapidly improving. (3) As performance grows, percent utilization will decrease even further. (4) Organizations are reluctant to buy large supercomputers, due to the large expense and short life span. (5) The communication bandwidth between workstations increases as new networking technologies and protocols are implemented in LANs and wide area networks (WANs). (6) NOWs are easier to integrate into existing networks than special parallel computers. (7) The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers, mainly due to the non-standard nature of many parallel systems. (8) NOWs are a cheap and readily available alternative to specialized HPC platforms. (9) Use of NOWs as a distributed computing resource is very cost effective (incremental system growth). Therefore, one can expect HPC on NOWs to become more and more attractive as time goes on. This gives a new impulse to the field of parallel computing. A model of parallel computation is an abstract machine, providing a set of
primitives to the programming level above. It is designed to separate software development concerns from effective parallel execution concerns. According to the abstraction they provide, models for parallel computing can be classified into five categories [25], see Table 3.1, based on the way decomposition, mapping, communication, and synchronization are done. Table 3.1 also shows some representative languages/libraries for each model. Decomposition of a program into threads (column 1 of Table 3.1) and mapping of threads to processors (column 2 of Table 3.1) are known to be computationally expensive. Communication requires placing the two ends of the communication in the correct threads and at the correct place. Synchronization requires an understanding of the global state of the computation, which is immense for practical purposes. Given the aforementioned reasons for using NOWs for HPC, we might expect NOWs to have rapidly moved into the mainstream of computing. This is clearly not the case. We can identify the following problems and difficulties with using NOWs, which explain why their advantages have not (yet) led to their widespread use: (1) In principle, NOWs have much unused compute power to be exploited. In practice, the large latencies involved in communicating among workstations make them low-performance parallel computers. Typically, larger-grain processes are used to help conceal the latency. The increasing use of optical interconnects and ATM for connecting workstations changes the situation. (2) Models for NOWs include systems such as MPI [22] and PVM [26], which belong to the lowest level of the model hierarchy. Most importantly, these models do not hide much of the decomposition and communication. The developer must specify the implementation in thorough detail, which makes building software extremely difficult. (3) The models used must address the heterogeneity of the processors (architecture, OS, GUI) on typical NOWs. (4) Workstations must trust the programs being executed on their machines. (5) The system must secure the application from spying by participating workstations. (6) The system must be able to mask the intentional (or unintentional) data loss and data corruption caused by inherent partial failures. (7) The system must reward (economic incentive) the participation of workstations in a distributed computing infrastructure. (8) The theory required for parallel computations is immature [25]. Skillicorn and Talia argue that our knowledge about abstract representation of parallel computations and reasoning about them is insufficient and rudimentary. Java [21], an object-oriented language, has become popular because of its platform independence and safety. It has greatly simplified network programming by providing an elegant TCP/IP API, object serialization, network class loading (code mobility), RMI (remote method invocation) [16], Servlets, JSP (Java Server Pages) [18] and built-in concurrency constructs. Java is a shared memory, thread-based language with built-in monitors and binary semaphores as a means of synchronization at the object and class level. Though Java is firmly fixed at the lowest level of the parallel computing model hierarchy, it addresses concerns (3) and (4) above. This, along with its phenomenally growing popularity, entails a rapidly expanding body of projects that use Java as a language for HPC on NOWs and clusters [3], [4], [11],
[12], [13], [14], [17], [20], [23], [24]. Invariably, their aim is to hide one or more of the characteristics of the language, see Table 3.1, that make it ill-suited for parallel programming. Next, we discuss the existing Java-based systems according to the criteria outlined in Table 3.1. JPVM [12], IceT [14] and MPIJ [23] are systems based on message passing, using sends and receives to specify the message to be exchanged, process identifiers and addresses. They implement models from the bottom level of the model abstraction hierarchy. JavaSpaces [17] stands one level higher than the message passing systems. It is a new realization of Linda "tuple spaces" [10]. In essence, JavaSpaces simplifies process communication by using a large pool into which data values are placed by processes, and from which they are retrieved associatively. Charlotte [4] is one of the first systems to use Java for parallel computing. Charlotte programs are written for a virtual parallel machine where the runtime system automates the mapping of threads, called routines, to processors, and communication is done through shared variables allocated from a distributed shared memory. The system implements the fork-join model of parallel programming and introduces fault tolerance through "eager scheduling." With eager scheduling, the manager assigns a job repeatedly until it is executed to completion by at least one worker. The authors introduce a new memory management technique, called "two-phase idempotent execution strategy," to ensure the correct execution of shared memory programs under eager scheduling. What we see as a disadvantage is the centralized association of workers and computations. Next in the model hierarchy comes Atlas [3], a Java realization of the Cilk [5] programming model, which is best suited for tree-like computations, see below. The system automates the placement of computations and communication and achieves near-optimal load balancing. Bayanihan [24] implements a generic set of components that support a master-worker programming style similar to Charlotte through a form of barrier synchronization and eager evaluation. In addition, the generic objects of the runtime system can be changed for performance optimization of different distributed algorithms and even for implementation of new programming paradigms. Javelin [11] is a seminal infrastructure for global computing based on Java-enabled Web technology (applets, Web servers, and HTTP). It achieves load balancing through a distributed task queue (scheduler) using work stealing. The developer is abstracted both from the mapping of threads to processors and from interthread communication. In Javelin's communication layer, communication between applets is routed through their associated servers, which further increases the network latency, making the system suitable for running mainly coarse-grain parallel applications. Further, the projects using Java can be divided into applet-based and standalone. This classification is orthogonal to the classification based on the model abstraction. Both approaches have advantages and disadvantages. The former is severely restricted by the applet security model. Load balancing is difficult to achieve since there is always the bottleneck created by virtue of the centralized node
(the Web server). These systems target the Internet and its unlimited resources. Standalone implementations target both NOWs and the Internet. They require that either the clients (users) have access privileges to the participating machines or the workstation owners download and install the runtime system. The chief problem to overcome in the latter approach is having to convince the owners that their privacy and security will not be violated by executing foreign computations. For a comprehensive discussion of some other considerations to be addressed when choosing between applet-based and standalone implementations, refer to [19]. The aims of this research work are to develop a Java runtime system for efficient scheduling of multithreaded Java applications on NOWs and to improve the random work stealing algorithm used in Cilk-NOW [5], [6], [7], [8]. Cilk-NOW is a runtime system based on a C thread package designed for multiprocessor architectures. The runtime system can be ported to and used on a NOW. However, it suffers from the inherent limitations of the C programming language: non-portability, lack of a reflection API, lack of serialization, and lack of network loading. The former restricts the portability of the runtime system as well as its deployment to homogeneous environments. The lack of network loading decreases the scalability of the system, e.g., the participating workstations should share a common file system in order to load the code of the applications. In our work we use the programming model developed by Robert Blumofe at MIT [5]. This model requires that decisions about the breaking up of available work into threads be made explicit, while relieving the software developer of the ramifications of such decisions: mapping of threads to processors is done automatically and efficiently by the distributed scheduler, which implements a random work stealing algorithm; communication is done implicitly through shared variables; and synchronization is achieved through continuation passing style [2]. In other words, in writing a parallel application in Java, a programmer expresses parallelism by coding instructions in a partial execution order, structuring the code into totally ordered sequences called threads. The programmer need not specify the processor in the system that executes a particular thread, nor exactly when each thread should be executed. These scheduling decisions are made by the runtime system's scheduler. In our work we use Java as an implementation language. The proposed model allows Java to enjoy the benefits of being a member of the family of languages in the second category in Table 3.1. Making the Java programming model more abstract could reap tremendous spin-offs. Parallel applications are easier to design, verify, and debug, while efficient implementation is still possible. The remainder of the paper is structured as follows. In Section 3.2 we review the Cilk language and work stealing scheduler [6], [7], [8] adapted to our needs and introduce a new work stealing algorithm. In Section 3.3 we describe the architecture and the implementation of the proposed Java runtime system. Then, in Section 3.4 we present experimental results on the performance of the runtime system employing the work stealing algorithm described in Section 3.2. In the final section we outline plans for future work and conclude.
     Decomp.    Mapping    Comm.      Sync.      Languages
1.   implicit   implicit   implicit   implicit   Haskell
2.   explicit   implicit   implicit   implicit   Concurr. Prolog, Multilisp, Cilk
3.   explicit   explicit   implicit   implicit   BSP, LogP, Linda
4.   explicit   explicit   explicit   implicit   Emerald, Concurrent Smalltalk
5.   explicit   explicit   explicit   explicit   Java, PVM, MPI, Ada

Table 3.1. Models for parallel computations.

# of proc.   Random   Lottery-based   Impro. in %
5            43.4     35.7            17.74
6            49.25    47.5            3.55
7            63.6     44              30.81
8            82.67    63              23.79
9            83.25    69.25           16.81

Table 3.2. Comparison between the performance of the random work stealing algorithm and the lottery work stealing algorithm (Fibonacci numbers)

Random   Lottery-based   Impro. in %
419      291             30.54
398      230             42.21
324      197             39.2
304      190             37.5
249      154             38.15

Table 3.3. Comparison between the performance of the random work stealing algorithm and the lottery work stealing algorithm (Nqueens problem)
3.2 The Cilk Programming Model and Work Stealing Scheduler
NOWs offer a tremendous processing capacity. However, in order to realize this computing capacity we need a good programming model and an efficient distributed scheduler for redistributing the load among the workstations. In Section 3.2.1 we describe the Cilk programming model as well as its random work stealing algorithm. In Section 3.2.2 we present a new distributed scheduler based on a victim selection algorithm through lottery.
3.2.1 Java Programming Language and the Cilk Programming Model
The Cilk programming model contains a graph of instructions and a tree of threads that unfold dynamically during program execution. A multithreaded computation is composed of a set of threads, each of which is a sequential order of instructions. During the course of execution, a thread may create, or spawn, other threads. The spawning thread can operate concurrently with the spawned one. The spawned threads are considered to be children of the thread that did the spawning, and a thread may spawn as many children as it desires. In this way the threads are organized into a spawn tree. In addition to spawning threads, a multithreaded computation may also contain dependencies between the threads. As an example of a data dependency, consider an instruction in one thread that produces a data value consumed by an instruction in another thread. Dependencies allow threads to synchronize. An execution schedule for a multithreaded computation determines the processor in the system that executes a given instruction at each step. An execution schedule must obey the spawning dependencies, in that no processor may execute an instruction in a spawned child thread until after the spawning instruction in the parent thread has been executed. It must also obey the data and control dependencies among the threads in order to achieve proper thread synchronization. In a strict multithreaded computation, every dependency goes from a thread to one of its ancestor threads. In a fully strict multithreaded computation, every dependency goes from a thread to its parent. Fully strict computations are "well-structured" in that all dependencies from a subcomputation emanate from the subcomputation's root thread. A distinctive feature of strict computations is that once a thread has been spawned, a single processor can complete the execution of the entire subcomputation rooted at this thread even if no other progress is made on the other parts of the computation.

A program in the Cilk programming model consists of one or more classes and objects with one or more threads of control. Threads are nonsuspendable. The runtime system manipulates and schedules the threads. A Java program generates parallelism at runtime by instantiating a runnable object or a subclass of class Thread and executing its run method. After this, the parent and the child may execute concurrently (asynchronous method invocation). After spawning one or more child threads, the parent thread does not wait for its children to return.
Instead, the parent thread additionally spawns a successor thread to wait for the results from the children. Thus, a thread may wait to begin executing, but once it begins execution there is no suspending it [2]. Sending a result to a waiting thread is done via the sendArg method. The Java runtime system implements these primitives using two types of classes: closures and continuations. Closures are classes employed by the runtime system to keep track of and schedule the execution of spawned threads. The runtime system associates one closure object with each spawned thread. The absence of templates in Java does not allow the existence of closures to be hidden from the software developer without an additional preprocessing step. A closure consists of the class name of a runnable object, a slot for each of the arguments specified in the object's constructor, and a join counter indicating the number of missing arguments that need to be supplied before the object is ready to be instantiated and its run method executed in a separate thread. If the closure has received all of its arguments, then it is ready; otherwise, it is waiting. To run a ready closure, the runtime system uses the reflection API to find the object constructor having the same number and types of arguments as specified in the closure and then invokes it. When the run method of the instantiated object dies, the closure is deleted (freed). A continuation is a reference to an empty argument slot of a closure. An executing thread sends a value to a waiting thread by placing the value into an argument slot of the waiting thread's (runnable object's) closure. The executing thread uses the sendArg method of a Continuation object for this purpose. The empty slot of the waiting closure is specified by the argument passed as a parameter to the constructor of the Continuation object.

At runtime, each processor maintains four pools of closures: the ready pool, the waiting pool, the assigned pool, and the pool of stolen closures. The ready pool is a deque (double-ended queue) which contains all of the ready closures. Whenever a closure is created, if its join counter is 0, it is placed at the head of the ready deque; otherwise, it is added to the waiting pool. Whenever sendArg is invoked, the join counter is decremented, and if the join counter reaches 0, the closure is removed from the waiting pool and placed at the head of the ready deque. When a thread finishes, the next closure is taken from the head of the ready deque and instantiated (its thread executed). In Figure 3.1, a worker pushes spawned tasks onto its local ready deque and pops the next task from its head when it finishes the current task. A pop on an empty ready pool triggers a steal request being sent to a victim worker. When the steal request arrives at the victim worker, if its ready deque is not empty, the task at the tail of the deque is removed and sent to the requesting worker. If no closures are available in the ready pool, a processor becomes a thief. In Cilk-NOW [5], to steal work, a processor chooses another processor, called the victim, at random and requests a closure to be sent back (see Figure 3.1). If the victim processor has any closures in its ready deque, one is removed from the tail of its ready deque and sent across the network to the thief, which adds this closure to its own ready deque. The thief may then begin work on the stolen closure.
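As a rough illustration of the bookkeeping just described, the following Java sketch models closures, continuations and the ready/waiting pools. It is a simplified, single-threaded sketch with hypothetical names, not the actual runtime classes (which also handle the assigned and stolen pools, reflection-based instantiation and thread safety).

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Simplified sketch of the closure bookkeeping; not the actual runtime classes.
class Closure {
    final Class<? extends Runnable> threadClass;  // class of the runnable object to instantiate
    final Object[] argSlots;                      // one slot per constructor argument
    int joinCounter;                              // number of argument slots still missing

    Closure(Class<? extends Runnable> threadClass, int missingArgs) {
        this.threadClass = threadClass;
        this.argSlots = new Object[missingArgs];
        this.joinCounter = missingArgs;
    }
}

class Continuation {
    private final ClosurePools pools;
    private final Closure closure;
    private final int slot;                       // index of the empty argument slot

    Continuation(ClosurePools pools, Closure closure, int slot) {
        this.pools = pools; this.closure = closure; this.slot = slot;
    }

    /** Fill the slot; when the join counter reaches 0, the closure becomes ready. */
    void sendArg(Object value) {
        closure.argSlots[slot] = value;
        if (--closure.joinCounter == 0) {
            pools.makeReady(closure);
        }
    }
}

class ClosurePools {
    private final Deque<Closure> readyDeque = new ArrayDeque<>();  // head = most recently readied
    private final Set<Closure> waitingPool = new HashSet<>();

    void add(Closure c) {                         // called when a closure is created
        if (c.joinCounter == 0) readyDeque.addFirst(c); else waitingPool.add(c);
    }

    void makeReady(Closure c) {                   // called by Continuation.sendArg
        waitingPool.remove(c);
        readyDeque.addFirst(c);
    }

    Closure nextLocalWork() { return readyDeque.pollFirst(); }  // the worker pops from the head
    Closure stealWork()     { return readyDeque.pollLast(); }   // a thief steals from the tail
}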
Figure 3.1. A description of the thief and victim algorithm (a thief worker sends a steal request to a victim worker's server; the victim pushes and pops closures at the head of its ready deque, while stolen closures are taken from the tail)

If the victim has no ready closures, it informs the thief, which then tries to steal from another randomly chosen processor until a ready closure is found or program execution completes. Our runtime system consists of several processes, each executing a Java Virtual Machine (JVM), running on several different workstations. One process, called the registry, runs a Java program responsible for keeping track of all the other processes that cooperate on a given job. These other processes are called workers. Each worker registers with the registry by sending it a message containing its own transport address. The registry responds by assigning each worker a unique name. Workers periodically check in with the registry. Every 2 seconds each worker sends a message to the registry containing the level of the closure at the tail of its ready deque. The level of a closure is equal to the height of the root of the multithreaded spawn tree minus the height of the node of the closure in question. Every 2 seconds the registry multicasts a list of the network addresses and ages of all registered workers.
3.2.2 Lottery Victim Selection Algorithm
In order to execute multithreaded programs on NOWs efficiently, we need to construct execution schedules dynamically. We introduce below a new distributed scheduler based on work stealing that builds execution schedules at run time as the computation unfolds.
Figure 3.2. Description of the lottery victim selection algorithm (with ready-deque tail levels 12, 8, 7 and 3, i.e. a total of 30 tickets, and winning ticket 25, the partial sums 12 and 20 do not exceed 25 but 27 does, so the third processor is selected)

As distinguished from Cilk-NOW, what is novel in our runtime system is that when a worker becomes a thief, it does not choose a victim uniformly at random. Instead, it incorporates a lottery scheduler [27], making use of the information about the level of the closure (thread) at the tail of each processor's ready deque. Lottery scheduling has been used successfully to provide efficient and responsive control over the relative execution rates of computations running on a uniprocessor. It has been shown to be efficient and fair even in systems that require rapid, dynamic control over scheduling at a time scale of milliseconds to seconds. Lottery scheduling implements proportional-share resource management, where the resource consumption rates of active computations are proportional to the relative shares they are allocated. In the proposed randomized victim selection algorithm, each processor is associated with a set of tickets, and the number of tickets associated with each processor is proportional to the level of the tail thread of its ready pool. For every thief processor, the victim processor is determined by holding a lottery. The victim is the processor with the winning ticket. For example, if the registry has multicast a list of four processors with levels of their ready deque tail threads 12, 8, 7, and 3, respectively, there is a total of 12 + 8 + 7 + 3 = 30 tickets in the system, see Figure 3.2. Next, assume that a new processor has just joined the computation and has received the multicast message from the registry. Initially, this new processor has an empty ready pool, so it becomes a thief immediately. In order to select a victim, the new processor holds a lottery based on the information in the multicast message. Assume that the 25th ticket is (randomly) selected. The list of processors is searched for the winner. For every processor, the partial sum of tickets from the beginning of the list is computed. If the partial sum is greater than the number of the winning ticket, then the current processor is the winner and the search is aborted; otherwise, the search continues with the next processor in the list. For our four-processor example, the winner is the third processor. Therefore, the new processor will try to steal work from the third processor in the list multicast by the registry. Further, let us assume that another new processor joins the computation.
It will also hold a lottery based on the information in the same multicast message. It is likely that the winner will be the first or the second processor, because of the large number of tickets representing them. In this way, the selection algorithm probabilistically avoids congestion at the busiest nodes in the system while at the same time allowing work to be stolen from them.
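A minimal Java sketch of this selection step, under the assumption that levels[] holds the ready-deque tail levels from the most recent registry multicast (the networking and worker-list details are omitted, and the names are illustrative), could look as follows.

import java.util.Random;

// Illustrative sketch of the lottery victim selection: each worker in the
// multicast list holds as many tickets as the level of its ready-deque tail thread.
class LotteryVictimSelector {
    private final Random random;

    LotteryVictimSelector(long seed) {            // e.g. the worker's unique ID from the registry
        this.random = new Random(seed);
    }

    /** Returns the index of the victim in the worker list, or -1 if nobody holds tickets. */
    int selectVictim(int[] levels) {
        int totalTickets = 0;
        for (int level : levels) totalTickets += level;
        if (totalTickets == 0) return -1;

        int winningTicket = random.nextInt(totalTickets);   // drawn from [0, totalTickets)
        int partialSum = 0;
        for (int i = 0; i < levels.length; i++) {
            partialSum += levels[i];
            if (partialSum > winningTicket) return i;        // this worker holds the winning ticket
        }
        return levels.length - 1;                 // not reached; kept as a safeguard
    }
}

For the example above, selectVictim(new int[]{12, 8, 7, 3}) with winning ticket 25 yields index 2, i.e. the third processor, matching the partial-sum walk in Figure 3.2.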
3.3 Architecture and Implementation of the Java Runtime System
This section presents our prototype runtime system and describes its core components and their interactions.
3.3.1 Architecture of the Java Runtime System
At the highest level, the runtime system implements the following functions:

Threads scheduling. The scheduler distributes tasks from a distributed task queue and manages load balancing through random work stealing, see Section 3.2.

Adaptive parallelism. The system makes use of processors which are idle when the parallel application starts or become idle during the duration of the job. When a given workstation is not being used, it joins the system. When the owner returns to work, that processor automatically leaves the computation. Thus, the set of workers shrinks and expands dynamically throughout the execution of a job.

Macroscheduling. A background daemon process runs on every processor in the network. It monitors the processor state to determine when the processor is idle, so that it can start a worker on that machine.

The three main components of the runtime system are the registry, the workers, and the node managers. The registry is a super server providing the following services, each of which is implemented in a separate server:

• registering/deregistering of workers,

• updating the information about the workers currently involved in the computation, and

• multicasting the list of network addresses and ages of the workers.

Each worker consists of the following components:
• Master object, synchronizing the access to the four pools of closures through guarded suspension and execution state variables.

• Compute server, fetching jobs from the ready deque and executing them. If the ready deque is empty, the worker becomes a thief and triggers the Thief thread.

• Thief runnable object, executed in a separate thread. This object implements the victim selection algorithm and the actual work stealing. A shortcoming of most distributed schedulers is the need for the workstations to share a common file system, such as NFS. The Thief incorporates a network classloader that allows the downloading of executable code on demand. The latter overcomes the requirement for the workstations to have a common file system and improves the scalability of the proposed runtime system.

• Victim server object. This server is contacted by the Thief clients of other workers in the course of their work hunt.

• Result server object. Results from stolen threads are returned to this server, which updates the corresponding closure in the waiting pool.

• Register client, responsible for registering with and periodic updates to the registry.

• Listener, which listens continually for the datagrams multicast by the registry. It writes the information received into a 1-bounded buffer. The information is read from the buffer and used by the victim selection algorithm, which is invoked by the Thief thread.

• VictimSelection object, implementing the victim selection algorithm. We use the library class java.util.Random to generate a stream of pseudo-random numbers. Each worker uses as a seed its unique ID assigned by the registry. A victim worker is selected by holding a lottery. First, a winning ticket is selected at random. Then, the list of workers is searched to locate the victim worker holding that ticket. This requires a random number generation and O(n) operations to traverse a worker list of length n, accumulating a running ticket sum until it reaches the winning value. In [27], various optimizations are suggested to reduce the average number of elements of the worker list to be examined. For example, ordering the workers by decreasing level can substantially reduce the average search length. Since the processors with the largest number of tickets will be selected most frequently, a simple "move to the front" heuristic can be very effective. For large n, a more efficient implementation is to use a tree of partial sums, with clients at the leaves. To locate a client holding a winning ticket, the tree is traversed starting at the root node and ending at the winning ticket's leaf node, requiring only O(lg n) operations (see the sketch following this list).
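The O(lg n) variant mentioned in the last item can be sketched as follows. This is an illustrative implementation only, assuming a fixed worker list whose ticket counts are rebuilt from each multicast; incremental updates and concurrency are omitted.

// Illustrative sketch of a tree of partial sums for ticket lookup in O(lg n) time.
// Leaves hold per-worker ticket counts; each internal node holds the sum of its subtree.
class TicketTree {
    private final int size;     // number of leaves, rounded up to a power of two
    private final int[] tree;   // implicit binary tree; tree[1] is the root, leaves start at 'size'

    TicketTree(int[] tickets) {
        int s = 1;
        while (s < tickets.length) s <<= 1;
        size = s;
        tree = new int[2 * size];
        for (int i = 0; i < tickets.length; i++) tree[size + i] = tickets[i];
        for (int i = size - 1; i >= 1; i--) tree[i] = tree[2 * i] + tree[2 * i + 1];
    }

    int totalTickets() { return tree[1]; }

    /** Walk from the root to the leaf holding the (0-based) winning ticket. */
    int findHolder(int winningTicket) {
        int node = 1;
        while (node < size) {
            int left = 2 * node;
            if (winningTicket < tree[left]) {
                node = left;                       // the ticket lies in the left subtree
            } else {
                winningTicket -= tree[left];       // skip all tickets of the left subtree
                node = left + 1;
            }
        }
        return node - size;                        // leaf index = index in the worker list
    }
}

With the ticket counts 12, 8, 7 and 3 from Section 3.2.2, new TicketTree(new int[]{12, 8, 7, 3}).findHolder(25) again returns index 2, the same victim as the linear scan.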
Figure 3.3. Load distribution on a NOW running Solaris OS (average load versus time of day).
Figure 3.4. The tree grown by the execution of the nqueens program.
Figure 3.5. Parallel speedup of the proposed approach compared with ideal speedup, as a function of the number of processors.
Scheduling by lottery is probabilistically fair. The expected allocation of victims to thieves is proportional to the number of tickets the victims hold. Since the scheduling algorithm is randomized, the actual allocated proportions are not guaranteed to match the expected proportions exactly. However, the disparity between them decreases as the number of allocations increases.
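Stated more precisely (the notation here is ours, introduced only for this remark): if a victim holds t_i of the T outstanding tickets, then over k independent lotteries its expected number of selections is

\[ E[\text{selections of victim } i] = k \cdot \frac{t_i}{T}, \]

and by the law of large numbers the observed fraction of selections converges to t_i / T as k grows, which is the sense in which the disparity just described shrinks.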
3.3.2
Implementation of the Java Runtime System
For efficiency, all communication protocols, except the initial registering of the workers, are implemented over UDP/IP. Some of the protocols add reliability to UDP by incorporating sequence numbers, timeouts, adaptive algorithms for evaluating the next retransmission timeout, and retransmissions [15]. The application protocol used to register new workers with the registry is built over TCP/IP because of the reliability needed during connection establishment and termination. One of the assumptions of this work is that there is a great number of idle CPU cycles. Figure 3.3 plots the average number of jobs in the ready queue of the machines comprising our network.¹ A script was run for two weeks collecting the average load across the workstations at 15-minute intervals. The results were combined to produce an average load over a day. As can be seen from this plot, though more machines are idle at night, a significant number of idle CPU cycles exists at various time slots throughout the day. The results confirm that a network of workstations does indeed provide a valid environment for HPC. It is also possible to estimate the average daily load from Figure 3.3. By rough approximation, the average load of the workstations is around 0.25, indicating that about 75% of the CPU time of each workstation is wasted every day.
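The reliability layer added on top of UDP can be illustrated with a small sketch. The framing, class name, and constants below are assumptions made for the example; the actual protocols also adapt the retransmission timeout as in [15] rather than using the simple doubling shown here.

    import java.io.IOException;
    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.net.SocketTimeoutException;
    import java.nio.ByteBuffer;

    class ReliableUdpClient {
        private static final int MAX_RETRIES = 5;

        /** Sends a request tagged with a sequence number and waits for a matching ack. */
        static byte[] sendWithRetry(DatagramSocket socket, InetAddress host, int port,
                                    int sequenceNumber, byte[] payload) throws IOException {
            // Prepend the sequence number so the receiver can detect duplicates.
            ByteBuffer out = ByteBuffer.allocate(4 + payload.length);
            out.putInt(sequenceNumber).put(payload);
            DatagramPacket request =
                new DatagramPacket(out.array(), out.capacity(), host, port);

            byte[] replyBuffer = new byte[1024];
            DatagramPacket reply = new DatagramPacket(replyBuffer, replyBuffer.length);

            int timeoutMillis = 500;                    // initial retransmission timeout
            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                socket.send(request);
                socket.setSoTimeout(timeoutMillis);
                try {
                    socket.receive(reply);
                    int ackSequence = ByteBuffer.wrap(reply.getData()).getInt();
                    if (ackSequence == sequenceNumber) {
                        return reply.getData();         // matching acknowledgement
                    }
                    // A stale or duplicate reply: fall through and retransmit.
                } catch (SocketTimeoutException timedOut) {
                    timeoutMillis *= 2;                 // back off before retransmitting
                }
            }
            throw new IOException("no acknowledgement after " + MAX_RETRIES + " attempts");
        }
    }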
3.4
Performance Evaluation
In this section we present experimental results on the performance of our prototype runtime system for scheduling multithreaded Java applications on networks of workstations. All experiments were performed, and all measurements taken, on a network of 15 workstations running the Solaris OS. Subsection 3.4.1 describes the implementation of a sample application and Subsection 3.4.2 presents experimental results and their interpretation.
3.4.1
Applications
Consider the following example taken from [5] and rewritten in Java. The Fibonacci function fib(n), for n ≥ 0, is defined as

    fib(n) = n                          if n < 2
    fib(n) = fib(n - 1) + fib(n - 2)    otherwise
¹ This network is in the departmental lab of Computer Science and Engineering, Florida Atlantic University.
Another example that we consider is nqueens, a classical example of search using backtracking. The objective is to find a configuration of n queens on an n x n chess board such that no queen can capture another.

The following shows how the Fibonacci function is written as a Java program. The doubly recursive implementation of the Fibonacci function is a fully strict computation. The Java code, structured in the run methods of the two runnable objects, is given below.

    class Fib implements Runnable {
        Continuation dest;
        int n;

        public Fib( Continuation k, int n ) {
            this.dest = k;
            this.n = n;
        }

        public void run() {
            if ( n < 2 )
                dest.sendArg( n );
            else {
                Continuation x = new Continuation();
                Continuation y = new Continuation();
                ClosureSum s    = new ClosureSum( dest, x, y );
                ClosureFib fib1 = new ClosureFib( x, n - 1 );
                ClosureFib fib2 = new ClosureFib( y, n - 2 );
            }
            return;
        }
    }

    class Sum implements Runnable {
        Continuation dest;
        int x, y;

        public Sum( Continuation k, int x, int y ) {
            this.dest = k;
            this.x = x;
            this.y = y;
        }

        public void run() {
            dest.sendArg( x + y );
        }
    }

The Java code for nqueens is given in the Appendix. The nqueens problem is formulated as a tree search problem [9] and the solution is obtained by exploring this tree. The nodes of the tree are generated starting from the root, which is the empty vector corresponding to zero queens placed on the chess board. The code is structured in the run methods of the classes NQueens, Success, and Failure. On each iteration, a new configuration is constructed, called config in the code, as an extension of a previous safe configuration, thus spawning new parallel work. A configuration is safe if no queen threatens any other queen on the chess board.
The algorithm uses depth-first search to traverse the generated tree. On termination of the for loop of the NQueens.run() method, the variable count contains the number of NQueens closures pushed into the ready pool. This information is used to set the number of missing arguments of the Failure runnable object that is used in the backtracking stage if a dead end is reached. Since we use continuation-passing style for thread synchronization, after spawning one or more children the parent thread cannot then wait for its children to return. Rather, as illustrated in Figure 3.4, the parent thread (Q) additionally spawns two successor threads, namely Failure (F) and Success (S), to wait for the values returned from the children. (In the figure, Qi stands for an NQueens object executed in a separate thread, and Si (Fi) stands for a successor thread of type Success (Failure). The edges creating successor threads are horizontal. Spawn edges are straight, shaded, and point downward. The edges created by sendArgument() are curved and point upward.) The communication between the child threads and the parent thread's successors is done through Continuation objects. We use two different successor threads because failure and success have different semantics. For a thread to return failure, all of its child threads must report failure, while for it to return success, it suffices for only one of its child threads to report success. It is important to note that nqueens spawns off parallel work which it may later find unnecessary. This "speculative work" can be aborted in our runtime system using the abort method of the Master object, which synchronizes access to the four pools of closures of each worker. The abort message is subsequently propagated to all workers currently involved in the computation. The latter allows the nqueens program to terminate as soon as one of its threads finds a solution.
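The join-counter mechanism described here (setJoinCount on the Failure closure, sendArgument from the children) can be sketched as follows. This is an illustrative reconstruction, not the runtime system's actual classes: SketchContinuation, SketchClosure, ReadyPool, and their methods are names introduced for the example.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a closure that becomes ready once all awaited arguments arrive.
    class SketchContinuation {
        private final SketchClosure target;        // closure waiting on this argument
        SketchContinuation(SketchClosure target) { this.target = target; }
        void sendArgument(Object value) { target.argumentArrived(value); }
    }

    class SketchClosure {
        private final Runnable body;               // e.g. a Failure or Success task
        private final List<Object> arguments = new ArrayList<>();
        private int expected = -1;                 // join count; -1 until it is known
        private int arrived = 0;
        private boolean fired = false;

        SketchClosure(Runnable body) { this.body = body; }

        // NQueens.run() calls this once it knows how many children it spawned.
        synchronized void setJoinCount(int joinCount) {
            expected = joinCount;
            maybeFire();
        }

        synchronized void argumentArrived(Object value) {
            arrived++;
            arguments.add(value);
            maybeFire();
        }

        private void maybeFire() {
            if (!fired && expected >= 0 && arrived >= expected) {
                fired = true;
                ReadyPool.push(body);              // all awaited children have reported
            }
        }
    }

    class ReadyPool {                              // stand-in for the worker's ready deque
        static void push(Runnable closure) { new Thread(closure).start(); }
    }

A Success closure would be created with a join count of one, so the first child reporting success makes it ready; a Failure closure's join count is the number of spawned children, so it runs only after every child has reported failure.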
3.4.2
Results and Discussion
The performance of the runtime system was evaluated using the fibonacci and nqueens applications. Even though neither application is a real-life workload, they generate workloads suitable for evaluating the performance of our system. fibonacci is not computationally intensive but spawns a very large number of threads (in the millions), which makes it appropriate for evaluating the synchronization machinery of the runtime system. nqueens exhibits behaviour typical of most search algorithms employing backtracking. First, we present the serial slowdown incurred by the parallel scheduling overhead. The serial slowdown of an application is measured as the ratio of the single-processor execution time of the parallel code to the execution time of the best serial implementation of the same algorithm. The serial slowdown stems from the extra overhead that the distributed scheduler incurs by wrapping threads in closures, reflecting upon closures to find out threads' constructors, and work stealing. The serial slowdowns for fibonacci and nqueens are 6.1 and 1.15, respectively. As expected, fibonacci incurs substantial slowdown because of its tiny grain size. The slowdown of nqueens is insignificant.
Figure 3.5 shows the parallel speedup of the fibonacci application. In all experiments all workstations were started at the same time and therefore took a fair share of the load. The speedup is measured as the ratio of the execution time of the parallel implementation running with one participant to the average execution time of the parallel implementation running with m participants, where m is the number of workstations involved. Tables 3.2 and 3.3 compare the performance of the classical work stealing algorithm, where victims are chosen uniformly at random, to the performance of the proposed work stealing algorithm, which makes use of the information about the levels of the tail closures in the ready pools of the workers. Tables 3.2 and 3.3 show that the lottery-based work stealing algorithm consistently outperforms the random work stealing algorithm for the fibonacci and nqueens applications, respectively. However, we need to run more experiments with applications spawning a range of different subcomputations in order to provide stronger evidence in support of that statement. In Tables 3.2 and 3.3, Columns 2 and 3 display the wall clock time in seconds for the classical work stealing algorithm and the lottery-based work stealing algorithm, respectively, for different numbers of processors.
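In symbols (the notation T_m is ours, introduced only for clarity), the speedup plotted in Figure 3.5 is

\[ \mathrm{speedup}(m) = \frac{T_1}{T_m}, \]

where T_1 is the execution time of the parallel implementation with one participant and T_m is its average execution time with m participating workstations; the ideal curve corresponds to speedup(m) = m.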
3.5
Conclusions
We have devised and implemented a new victim selection algorithm. In the proposed algorithm, each processor is given a set of tickets whose number is proportional to the age of the oldest subcomputation in the ready pool of that processor. The victim processor is determined by holding a lottery, where the victim is the processor holding the winning ticket. The experimental results have shown that the proposed work stealing algorithm outperforms the classical work stealing algorithm, in which victims are selected uniformly at random. We have also designed and implemented a Java runtime system for parallel execution of strict multithreaded Java applications on networks of workstations employing the proposed lottery-based victim selection algorithm. The runtime system features:

• A distributed thread scheduler that efficiently manages load balancing through a variant of work stealing.

• Adaptive parallelism, which allows the utilization of idle CPU cycles without violating the autonomy of the workstations' users.

• A network class loader, which lifts the restriction that all workstations share a common file system and improves the scalability of the runtime system.

Our future plans involve adding fault tolerance to the runtime system through distributed checkpointing, so that the system can survive machine crashes. The challenge of this enterprise stems from the absence of a common file system shared by all workstations.
For the Internet-based version of our runtime system we plan to incorporate into the work-stealing algorithm information about the communication delays among the processors in the system. On a LAN, communication delays do not have a dramatic impact on the performance of the system since they are more or less uniform. However, on a WAN or an internetwork they have to be taken into account in order to achieve efficient scheduling of the subcomputations. To estimate the communication delays between the processors of the network we are going to design and implement a distributed algorithm in which each processor obtains a partial view of the delays in the system through its communications with the rest of the processors. We also plan to justify theoretically the performance of the proposed work stealing algorithm based on lottery victim selection.
3.6
Bibliography
[1] T. Anderson, D. Culler, and D. Patterson, "A case for NOW (networks of workstations)," IEEE Micro, 15(1), 1994.

[2] A. Appel, Compiling with Continuations, Cambridge University Press, New York, 1992.

[3] J.E. Baldeschwieler, R.D. Blumofe, and E.A. Brewer, "ATLAS: An infrastructure for global computing," In Proceedings of the 7th ACM SIGOPS European Workshop on System Support for Worldwide Applications, 1996.

[4] A. Baratloo, M. Karaul, Z. Kedem, and P. Wyckoff, "Charlotte: Metacomputing on the Web," In Proceedings of the 9th International Conference on Parallel and Distributed Computing, 1996.

[5] R. Blumofe, Executing Multithreaded Programs Efficiently, Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.

[6] R. Blumofe and C. Leiserson, "Scheduling multithreaded computations by work stealing," In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), Santa Fe, New Mexico, November 1994.

[7] R. Blumofe and D. Park, "Scheduling large scale parallel computations on networks of workstations," In Proceedings of the Third International Symposium on High Performance Distributed Computing (HPDC), pp. 96-105, San Francisco, California, August 1994.

[8] R. Blumofe and P. Lisiecki, "Adaptive and reliable parallel computing on networks of workstations," In Proceedings of the USENIX 1997 Annual Technical Conference on Unix and Advanced Computing Systems, Anaheim, California, January 6-10, 1997.
[9] G. Brassard and P. Bratley, Fundamentals of Algorithmics, Prentice-Hall, 1996.

[10] N. Carriero and D. Gelernter, "The S/Net's Linda kernel," ACM Transactions on Computer Systems, 4(2), pp. 110-129, 1986.

[11] B. Christiansen, P. Ionescu, M. Neary, K. Schauser, and D. Wu, "Javelin: Internet-based parallel computing using Java," Concurrency: Practice and Experience, 1997.

[12] A.J. Ferrari, "JPVM: Network parallel computing in Java," Technical Report CS-97-29, Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA, http://www.cs.virginia.edu/jpvm/doc/jpvm-9729.ps.gz

[13] "Globus metacomputing infrastructure," http://www.mcs.anl.gov/globus

[14] P. Gray and V. Sunderam, "IceT: Distributed computing using Java," In Proceedings of the ACM 1997 Workshop on Java for Science and Engineering, 1997.

[15] V. Jacobson, "Congestion avoidance and control," Computer Communication Review, 18(4), pp. 314-329, August 1988.

[16] JavaSoft Team, RMI: Java Remote Method Invocation - Distributed Computing for Java, Sun Microsystems, Inc., Palo Alto, CA, 1998.

[17] JavaSoft Team, The JavaSpaces Specification, Sun Microsystems, Inc., Palo Alto, CA, 1999.

[18] JavaSoft Team, The JavaServer Pages 1.0 Specification, Sun Microsystems, Inc., Palo Alto, CA, 1999.

[19] L.F. Lau, A.L. Ananda, G. Tan, and W.F. Wong, "JAVM: Internet-based parallel computing using Java," submitted for publication, 2000.

[20] P. Launay and J. Pazat, "A framework for parallel programming in Java," IRISA Internal Publications, (1154), 1997.

[21] D. Lea, Concurrent Programming in Java: Design Principles and Patterns, Addison-Wesley, 1998.

[22] Message Passing Interface Forum, "MPI: A message passing interface," In Proceedings of Supercomputing '93, pp. 878-883, IEEE Computer Society, 1993.

[23] MPIJ 1.1, http://ccc.cs.byu.edu/OnlineDocs/docs/mpij/MPIJ.html
[24] L. Sarmenta, S. Hirano, and S. Ward, "Towards Bayanihan: Building an extensible framework for volunteer computing using Java," In Proceedings of the ACM 1998 Workshop on Java for High Performance Network Computing, 1998.

[25] D. Skillicorn and D. Talia, "Models and languages for parallel computation," ACM Computing Surveys, June 1998.

[26] V.S. Sunderam, "PVM: A framework for parallel distributed computing," Concurrency: Practice and Experience, 2(4), pp. 315-339, Dec. 1990.

[27] C. Waldspurger and W. Weihl, "Lottery scheduling: Flexible proportional-share resource management," In Proceedings of the First Symposium on Operating Systems Design and Implementation, Usenix Association, November 1994.
Appendix

    public class NQueens implements Runnable {
        Continuation success ;
        Continuation failure ;
        private int n ;        // the total number of queens
        private int i ;        // already placed queens
        private int[] config ; // the current configuration of queens on the chessboard

        public NQueens( Continuation s, Continuation f, int[] a,
                        Integer nQueens, Integer placedQueens ) {
            this.success = s ;
            this.failure = f ;
            this.config  = a ;
            this.n = nQueens.intValue() ;
            this.i = placedQueens.intValue() ;
        }

        public void run() {
            int j = 0 ;
            if ( i == n ) {
                System.out.println( "Done" ) ;
                for ( j = 0; j < i; j++ )
                    System.out.print( "" + config[j] + " " ) ;
                System.out.println( "" ) ;
                success.sendArgument( config ) ;
                return ;
            }
            Continuation x = new Continuation() ; // success
            Continuation y = new Continuation() ; // failure
            ClosureSuccess cSuccess = new ClosureSuccess( this.success, x ) ;
            ClosureFailure cFailure = new ClosureFailure( this.failure, y ) ;

            short count = 0 ;
            for ( j = 0; j < n; j++ ) {
                int[] newConfig = (int[]) config.clone() ;
                if ( safe( newConfig, i, j ) ) {
                    count++ ;
                    newConfig[i] = j ;
                    ClosureNQueens q = new ClosureNQueens( x, y, newConfig, n, i + 1 ) ;
                }
            }
            if ( count == 0 ) {
                failure.sendArgument( new Integer( 0 ) ) ;
            }
            else
                cFailure.setJoinCount( count ) ;
            return ;
        }
        boolean safe( int config[], int n, int j ) {
            int r = 0 ;
            int s = 0 ;
            for ( r = 0; r < i; r++ ) {
                s = config[r] ;
                if ( j == s || i - r == j - s || i - r == s - j ) {
                    return false ;
                }
            }
            return true ;
        }
    }

    public class Failure extends Task {
        Continuation destination ;
        Integer fail ;
        ... // constructor and helper methods

        public void run() {
            // send failure notification to the failure successor of the parent thread
            destination.sendArgument( fail ) ;
        }
    }

    public class Success extends Task {
        Continuation destination ;
        int[] config ;
        ... // constructor and helper methods

        public void run() {
            Master.getWorker().abortReadyClosures() ;
            // send the configuration to the successor of the parent thread
            destination.sendArgument( config ) ;
        }
    }
Chapter 4
Transaction Management in a Mobile Data Access System

K. SEGUN, A.R. HURSON, V. DESAI
Computer Science and Engineering Department
The Pennsylvania State University
University Park, PA 16802

A. SPINK
School of Information Sciences and Technology
The Pennsylvania State University
University Park, PA 16802

L.L. MILLER
Computer Science Department
Iowa State University
Ames, IA 50011
Abstract

Advances in wireless networking technology and portable computing devices have led to the emergence of the mobile computing paradigm. As a result, the traditional notion of timely and reliable access to global information sources in a distributed system or multidatabase system is rapidly changing. Users have become much more demanding in that they desire and sometimes even require access to information anytime, anywhere. The amount and the diversity of information that is accessible to a user are also growing at an exponential rate. Compounding the access to information is the wide variety of technologies with differing memory, network, power, and display requirements. Within the scope of distributed databases
and multidatabases, the issue of concurrency control as a means to provide timely and reliable access to the information sources has been studied in detail. In an MDAS environment, concurrent execution of transactions is more problematic due to the power and resource limitations of computing devices and the lower communication rate of the wireless communication medium. This article is intended to address the issue of concurrency control within a multidatabase and an MDAS environment. Similarities and differences of multidatabase and MDAS environments are discussed. Transaction processing and concurrency control issues are analyzed. Finally, a taxonomy for concurrency control algorithms for both multidatabase and MDAS environments is introduced.

Keywords: Multidatabase, mobile computing environment, mobile data access system, concurrency control, transaction processing.
4.1
Introduction
The need to maintain and manage a large amount of data efficiently has been the driving force for the emergence of database technology and, more recently, E-commerce. Initially, data was stored and managed centrally. As organizations became decentralized, the number of locations, and thus local databases, increased. The need for shared access to multiple databases was inevitable. Geographical distribution of data, demand for highly available systems and autonomy coupled with economic issues, availability of low-cost computers, advances in distributed computing, and the demands of supply-chain-based E-commerce are among the pressing issues behind the transition towards distributed database technology. The design of distributed database management systems (DBMSs) has had an impact on issues such as concurrency control, query processing/optimization and reliability. Traditionally, distributed DBMSs have been built in a top-down fashion, building separate databases and distributing data among them [12]. This approach has the advantage that fixed standards can be set before the databases are built, simplifying the issues of data distribution, query processing, concurrency control and reliability. The local DBMSs are typically homogeneous with respect to the data model implemented and present the same functional interfaces at all levels. The global system has control over local data and processing. Solutions developed in a centralized environment can typically be extended to fit this model, resulting in a tightly coupled global information-sharing environment. This approach to designing a distributed database system is possible only if the design process is started from scratch. The issue is how to effectively distribute the data, having knowledge of resources like machine capacity, network overhead, and semantics of data. The natural extension to the distributed database system came in the form of applications of distributed systems in integrating preexisting databases to make the most use of the data available. Most organizations already have their major databases in place. It would be impractical to move this data into a common database, since it would not only be expensive, but the independence of managing individual databases also would be lost. The alternative is logical integration of
data, so as to provide a view of one logical database. This can be viewed as a bottom-up approach to distributed database design. The databases themselves are loosely coupled and could potentially differ in the data models, and transaction and query processing schemes used; such systems are known as heterogeneous databases or multidatabases [10]. A key feature is the autonomy that individual databases retain to serve their existing customer set. The goal of integrating the databases is to provide users with a uniform access pattern to data in several databases, without modifying the underlying databases and without requiring knowledge of the location or characteristics of the various DBMSs. Solutions from a centralized environment cannot be directly applied in such an environment, autonomy and heterogeneity being restricting factors. Unlike the distributed approach, this could be viewed as being motivated by efficient integration of data instead of efficient distribution of data. Multidatabase systems (MDBS) are not simply distributed implementations of centralized databases, but can be seen as much broader entities that present their own unique characteristics. This in turn raises several interesting research issues over and above those for centralized databases. Some of these issues, like query optimization and transaction management, are rooted in a centralized environment. Others, such as data distribution, have their roots in distributed databases. However, issues such as local autonomy are unique to the multidatabase problem. An important emerging computing paradigm is mobile computing. Thanks to recent advances in computer and telecommunications technology, and the subsequent merging of both technologies, mobile computing, particularly as an important technological infrastructure for E-commerce, is now a reality. A Mobile Data Access System (MDAS) is a multidatabase system that is capable of accessing a large amount of data over a wireless medium. Such a system is realized by superimposing a wireless mobile computing environment on a multidatabase system [38]. Mobility raises a number of additional challenges for multidatabase system design. Current designs are not capable of resolving the difficulties that arise as a result of the inherent limitations of mobile computing: frequent disconnection, high error rates, low bandwidth, high bandwidth variability, limited computational power, and limited battery life. Wireless transmission media across wide-area telecommunication networks are an important element in the technological infrastructure of E-commerce [62]. Effective development of guided and wireless-media networks will enhance delivery of World Wide Web functionality over the Internet. Using mobile technologies will enable the purchase of E-commerce goods and services anywhere and anytime. Multiple users, in general, could access a database, which implies multiple transactions occurring simultaneously. This is especially true in a distributed database system where various users at different sites could access independent databases. In data processing applications (e.g. banking, stock exchange) the need for reducing access times and maintaining availability, reliability, and integrity of data is essential. Effective E-commerce is based on the successful development and implementation of multidatabase systems. E-commerce-based businesses need effective solutions to the integration of Internet front-end systems with diverse data in legacy distributed
databases. A key challenge for E-commerce is the need for real-time concurrent access to distributed databases containing accounting, marketing, inventory, sales, production systems, and vendor information. Concurrent access to data is a natural way to increase throughput and reduce response time. Database operations require extensive I/O operations and, in addition, a distributed environment has to cope with delays in the network. These characteristics motivate interleaving the execution of several transactions. Concurrent transaction processing raises the possibility of interference. It is safe to concurrently access data items as long as they are independent, but in case of related data items, accesses should be coordinated - concurrency control being the activity of coordinating concurrent accesses to shared data [6]. In a multidatabase environment, transactions could span over multiple databases. Concurrency control in such an environment should not only synchronize subtransactions of a transaction at the respective sites, but also the transactions as a whole at the global level. In addition, the coordination of transactions in a multidatabase environment should enforce minimal changes to the local databases, with autonomy of the component databases being a distinctive feature. In the MDAS environment, transactions tend to be long-lived. This is due to frequent disconnection, the limited bandwidth constraints experienced by the mobile user, and the mobility of users. The communication path tends to increase as users move from one administrative domain to another even if physical distances traversed are short. Concurrency control in such an environment should take into account the effects of disconnection, limited bandwidth, mobility and portability. Concurrency control must strive to reduce the communication, reduce computation, and conserve the battery life of the mobile unit. Computer software and hardware are failure prone. Incomplete transactions due to failure can lead to inconsistencies. Failures can even lead to loss of data. This stresses the need for a database to have effective recovery mechanisms or methods to maintain atomicity of transactions. In addition, in a distributed case, failure of some sites should not halt the execution of the whole system, availability being an important consideration. To handle these issues, proper transaction management schemes should be incorporated into a database system. It is the responsibility of transaction management schemes to ensure correctness under all circumstances. By correctness, a transaction should satisfy the following ACID properties [31]: • Atomicity: Either all operations of a transaction happen or none happen. State changes in a transaction are atomic. • Consistency: A transaction produces results consistent with integrity requirements of the database. • Isolation: In spite of the concurrent execution of transactions, each transaction believes it is executing in isolation. Intermediate results of a transaction should be hidden from other concurrently executing transactions.
• Durability: On successful completion of a transaction, the effects of the transaction should survive failures. Extensive research has been done to maintain the ACID properties of transactions in centralized and tightly coupled distributed database environments [31]. The emergence of the need to access preexisting databases, as in a multidatabase environment, imposes newer constraints and difficulties in maintaining the ACID properties. The difficulties stem essentially from the requirement of maintaining the autonomy of the local databases. This implies that the local sites have full control over the data at their respective sites. The consequence is that the local executions are outside the control of the global multidatabase system. Transaction processing in such a fully autonomous environment can give rise to large delays, frequent or unnecessary aborts, possible inconsistencies on failure and hidden deadlocks, just to name a few problems that are encountered. Inevitably, certain assumptions and tradeoffs have to be made, usually compromising the autonomy of the local databases, in order to maintain the goals of transaction processing in general. In the next section, we provide a brief introduction to multidatabase and MDAS systems and their research issues. In Section 4.3, concurrency control and transaction processing issues in MDBS and MDAS are discussed and their differences from traditional distributed systems are addressed. Section 4.4 looks at the existing solutions for transaction management in both environments. Section 4.5 addresses application based and advanced transaction management. Section 4.6 discusses our experiments with the V-locking algorithm in an MDAS environment. Finally, Section 4.7 concludes this article.
4.2
Multidatabase Characteristics
A multidatabase system is a distributed system that acts as a front end to multiple local DBMSs. It provides a structured global system layer on top of existing local DBMSs. The global layer is responsible for providing full database functionality and interacts with the local DBMSs at their external user interface. The end user gets an illusion of a logically integrated database, hiding intricacies of different local DBMSs at the hardware and software levels. Thus, a multidatabase can be viewed as a database system formed by independent databases joined together with a goal of providing uniform access to the local DBMSs. A multidatabase system, in general, can be represented by the architecture shown in Figure 4.1. The primary objective of the multidatabase is to place as few restrictions on the local DBMSs as possible. Another goal (or maybe more of a consequence) of forming a multidatabase system is the recognition of the need for certain basic standards in the development of databases so as to simplify global information sharing in the future. Heterogeneity is a term commonly used in a multidatabase environment. In general, heterogeneity can occur due to differences in hardware, operating systems, data models, communication protocols, to name a few. In a multidatabase environment, to make global information sharing a reality, heterogeneities in data models,
schema, query languages, query processing, and transaction management schemes have to be resolved.

Figure 4.1. Multidatabase System (local database systems LDBS1 through LDBSk, each with its own local database and local transactions, under a global MDBS layer; LDBS: Local Database Management System, MDBS: Multidatabase System).
4.2.1
Taxonomy of Global Information Sharing Systems
There are a wide range of solutions for global information systems in a distributed environment, with terms like distributed databases, federated databases, and multidatabases being among the most commonly used. The distinction arises from the degree of autonomy and the manner in which the global system integrates with the local DBMSs. A tightly coupled system means global functions have access to low-level internal functions of the local DBMSs. In a loosely coupled system, the local DBMS allows global control through external user interfaces only. The amount of control that a local DBMS retains over data at its site after joining the global system is the basis for the following taxonomy. • Distributed Databases: A distributed database is the most tightly coupled global information sharing system. The global manager has control over transactions occurring both globally and locally. Such systems are typically designed in a top down fashion, with global and local functions implemented simultaneously. Logically, distributed databases give the view of centralized databases, with data at multiple sites instead of a single one. • Federated Database: Federated database systems are more loosely coupled than distributed database systems. The participating DBMSs have significantly more control of data at their respective sites. In a federated database system, each of the local DBMSs decides what part of the local data is shared.
They cooperate with other local DBMSs in the federation for global operations. The DBMSs in the federation have typically been designed in a bottom-up manner, but when they join the federation, they give up a certain amount of local freedom. • Multidatabase Systems: Multidatabase systems are the most loosely coupled systems. Here, the local DBMS retains full control over the local data even after joining the global system. In a multidatabase system, it is the responsibility of the global system to extract information and resolve various aspects of heterogeneity for global processing. This classification highlights independence and autonomy of the local databases as two important features of a multidatabase system. The relationship between multidatabases and autonomy merits more attention and is highlighted in the following discussion.
4.2.2
MDBS and Node Autonomy
Node autonomy is one of the key concepts in a distributed system [27]. A MDBS is a distributed system formed to allow uniform access to multiple local DBMSs, wherein local operations have priority and may be more frequent than global operations; thus, enforcing autonomy of the underlying sites becomes important. On the other hand, the local site could contain legacy databases; transforming this data to suit global needs would be too expensive. Thus, economical constraints are also motivating factors in preserving local autonomy. Issues such as isolating some local data from global access magnify the need for site autonomy. This allows the local DBA to restrict the information available to the global user. Autonomy may be a suitable feature from the global standpoint as well; in case of failure, it helps check the effects of a local failure from propagating throughout the system. Autonomy could come in different forms [27]: • Design Autonomy: The local DBMS should not be made to change its software or hardware platform to join a multidatabase system. In short, the local DBMS should remain as is on becoming a part of a global system. The global software can be looked upon as an add-on to the existing system. The primary reason for design autonomy is economics - an organization may have significant capital invested in existing hardware, software, and user training. This is especially relevant for systems designed in a bottom up manner. Heterogeneity arises in distributed systems if they are allowed to retain their design autonomy. • Communication Autonomy: The local DBMS has the freedom to decide what information it is willing to share globally. This can imply that a local DBMS may not inform the global user about transactions occurring locally, which makes the task of coordinating global transaction execution spanning over multiple sites extremely difficult. Synchronization of global transactions
is not the responsibility of the local DBMS. Local databases in federated and distributed database systems do not retain their communication autonomy since they provide information for global transaction coordination [10]. • Execution Autonomy: A local DBMS can execute transactions submitted at the local site in any manner it desires. This implies that global requirements for the execution of a transaction at a local site may not be honored by the local DBMS. The local DBMS has the freedom to unilaterally abort any transaction executing at that local site. This is due to the fact that the local DBMS does not treat the global transaction as a special transaction; it is like any other local transaction executing at that node. In the next subsection, we examine some of the issues that arise due to autonomy and the inherent heterogeneity in the multidatabase environment.
4.2.3
Issues in Multidatabase Systems
The primary issues that are heavily influenced by local database autonomy are outlined in the following: • Schema Integration: The local databases have their own schema; the goal here is to create an integrated schema to give a logical view of an integrated database. Schema integration is difficult when component databases differ in name, format and structure. Briefly, naming differences occur due to difference in naming conventions, wherein semantically equivalent data could be named differently or semantically conflicting data could be named the same. Format differences include differences in data types, domain, scale, precision, and item combinations, whereas structural differences occur due to differences in data structures. Schema integration has been discussed extensively in [35,39]. • Query Languages and Processing: Query translation may be required since query languages used to access the local DBMS may be different. Query processing and optimization is also difficult due to difference in data structures and processing power at each local DBMS. Requirements like conversion of data into standard format and processing queries at nodes with more processing power are some of the issues that merit consideration. Not only do these factors increase the overhead of query processing, but they can also create additional problems like communication bottlenecks and hot spots at servers, especially when available information regarding the local DBMS is inadequate at the global level. Fragmentation of data and incomplete local information can make developing an accurate cost model difficult, increasing the complexity of query optimization [55]. • Transaction Processing: In a MDBS, it is possible that the local DBMS may use different concurrency control and recovery schemes. It could happen that one database follows the two-phase locking protocol while others
use a timestamp-based scheme to serialize accesses. Furthermore, to maintain atomicity and durability of global transactions, the local database should support some atomic commitment protocol. A local DBMS joining a multidatabase environment may not have such a facility. The problem becomes even more severe if the local DBMS does not want to divulge or does not have any information regarding its local concurrency control and recovery schemes. This makes the task even more difficult for maintaining global consistency in a MDBS [9]. In a MDBS where updates are frequent, loss of correctness and inconsistency of data is often unacceptable. The database technology is built around the notion that data will be stored reliably and will be available to multiple users. Thus, maintaining the ACID properties of transactions is of vital importance.
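Two-phase commit (2PC) is the classical example of the atomic commitment protocol mentioned above, and a minimal coordinator sketch makes the voting structure concrete. The Participant interface, its method names, and the error handling below are assumptions made for this illustration, not an API prescribed by the systems surveyed here; a real global layer would have to work through whatever interfaces the autonomous local DBMSs actually expose.

    import java.util.List;

    // Hypothetical view of a local site taking part in a global transaction.
    interface Participant {
        boolean prepare(String txId);   // vote: true means "ready to commit"
        void commit(String txId);
        void abort(String txId);
    }

    class TwoPhaseCommitCoordinator {
        /** Returns true if the global transaction committed at every site. */
        boolean execute(String txId, List<Participant> participants) {
            // Phase 1: ask every participant to prepare and collect the votes.
            for (Participant p : participants) {
                boolean vote;
                try {
                    vote = p.prepare(txId);
                } catch (RuntimeException unreachableSite) {
                    vote = false;               // a silent or failed site counts as "no"
                }
                if (!vote) {
                    // Phase 2, abort path: any "no" vote aborts the transaction everywhere.
                    for (Participant q : participants) {
                        q.abort(txId);
                    }
                    return false;
                }
            }
            // Phase 2, commit path: all votes were "yes", so commit everywhere.
            for (Participant p : participants) {
                p.commit(txId);
            }
            return true;
        }
    }

The sketch also hints at why autonomy is a problem: a local DBMS that exposes no prepare step, or that can unilaterally abort after voting "yes", cannot be driven by such a coordinator without additional assumptions.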
4.2.4
MDAS Characteristics
A mobile data access system (MDAS) is a multidatabase system that is capable of accessing a large amount of data over a wireless medium. The system is realized by superimposing a wireless mobile computing environment over a multidatabase system [38]. The general architecture of the MDAS can be represented as shown in Figure 4.2.
Figure 4.2. Mobile Computing Environment (mobile clients communicate over wireless cells with Mobile Support Stations attached to a wired network; MSS: Mobile Support Station, DB: Database).
Mobile computing environment
The mobile computing environment is composed of two entities: a collection of mobile hosts (MH) and a fixed networking system [15,24,38]. The fixed networking system consists of a collection of fixed hosts connected through a wired network. Certain fixed hosts, called base stations or Mobile Support Stations (MSS) are
equipped with wireless communication capability. Each MSS can communicate with MHs that are within its coverage area (called a cell). A cell could either be a cellular connection, satellite connection, or a wireless local area network. A MH can communicate with a MSS if it is located within the cell governed by the MSS. MHs can move within a cell or between cells, effectively disconnecting from one MSS and connecting to another. At any point in time a MH can be connected to only one MSS. MHs are portable computers that vary in size, processing power, memory, etc. Three essential properties pose difficulties in the design of applications for the mobile computing environment: wireless communication, mobility, and portability [24]: • Wireless Communication: mobile computers rely heavily on wireless network access for communication. Lower bandwidths, higher error rates, and more frequent spurious disconnections often characterize wireless communication. These factors can in turn lead to an increase in communication latency arising from retransmission, retransmission time-out delays, error control protocol processing, and short disconnections. Mobility can also cause wireless connections to be lost or degraded. A mobile user may travel beyond the coverage area or may enter an area of high interference. Thus, wireless communication leads to challenges in the areas of: 1. Disconnection: Wireless networks are inherently more prone to disconnection. Since computer applications that rely heavily on the network may cease to function during network failures, proper management of disconnection is of vital importance in mobile computing. Autonomy is a desirable property that allows the mobile client to deal with disconnection. The more autonomous a mobile computer is, the better it can tolerate network disconnection. Autonomy allows the mobile unit to run applications locally. Thus, in environments with frequent disconnections, it might be better for a mobile device to operate as a stand-alone device. In order to manage disconnection, a number of techniques such as caching, asynchronous operation, and other software techniques may be applied. Maintaining cache consistency is difficult however, since disconnection and mobility severely inhibit cache consistency. Cache consistency techniques employed in traditional architectures designed for fixed hosts may not be suitable for the mobile computing environment. Asynchronous operation can be used to mask round-trip latency and short disconnections. Software techniques such as prefetching and delayed writeback can also be used to minimize communication, thus allowing an application to proceed during disconnection by decoupling the communication time from the computation time of a program [2]. Delayed write back takes advantage of the fact that data to be written may undergo further modification. Operation queuing can also help; operations that cannot be carried out while disconnected can be queued and done when reconnection occurs.
2. Limited Bandwidth: Wireless networks deliver lower bandwidth than wired networks. Cutting-edge products for portable wireless communication achieve only 1 megabit per second for infrared communication, 2 Mbps for radio communication, and 9-14 kbps for cellular telephony. On the other hand, Ethernet provides 10 Mbps, fast Ethernet and FDDI 100 Mbps, and ATM (Asynchronous Transfer Mode) 155 Mbps [24]. Available bandwidth is often divided among users sharing a cell. Thus, bandwidth utilization is of vital importance. Software techniques such as compression, filtering, and buffering before data transmission can be used to cope with low bandwidth. Other software techniques such as prefetching and delayed write-back that are used to cope with disconnection can also help to cope with low bandwidth. A large, dynamically changing number of mobile clients is characteristic of a mobile computing environment. Thus, bandwidth contention is a problem. Caching can help to reduce bandwidth contention, which also helps to support disconnected operation. 3. High Bandwidth Variability: Bandwidth may vary by many orders of magnitude depending on whether a mobile client is plugged in or communicates via wireless means. Bandwidth variability is treated by traditional systems as an exception or failure [2]. However, this is the normal mode of operation for mobile computing. Applications must therefore have the ability to adapt to the available bandwidth and should be designed to run on full bandwidth or minimum bandwidth. • Mobility: The ability to change location while retaining network connection is the key motivation for mobile computing. As mobile computers move, they encounter heterogeneous networks with different features. A mobile computer may need to switch interfaces and protocols; for example, a mobile computer may need to switch from a cellular mode of operation to a satellite mode as it moves from urban to rural areas, or from infrared mode to radio mode as it moves from outdoors to indoors. Traditional computers do not move; therefore, certain data that are considered static for stationary computing become dynamic for mobile computing. For example, a stationary computer can be configured to print from a certain printer attached to a particular print server, but a mobile computer needs a mechanism to determine which print server to use. A mobile computer's network address changes dynamically; its current location affects configuration parameters as well as answers to user queries. If mobile computers are to serve as guides, location-sensitive information may need to be accessed. Thus, mobile computers need to be aware of their surroundings and have the ability to find location-dependent information automatically and intelligently while maintaining system privacy [2]. Mobility can also lead to increased network latency and increased risk of disconnection. Cells may be serviced by different network providers and may employ different protocols. The physical distance may not reflect the true
network distance, and therefore a small movement may result in a much longer path if a cell or network boundary is crossed. Transferring the service connection to the nearest server is desirable, but this may not be possible if load balancing is a key priority. Security considerations exist because a wireless connection is easily compromised. Appropriate security measures must be taken to prevent unauthorized disclosure of information. Encryption is necessary to ensure secure wireless communication; data stored on disks and removable memory cards should also be encrypted. The amount of data stored locally should be minimal; backup copies must be propagated to stationary servers as soon as possible, as is done in replicated systems. • Portability: Designers of desktops take a liberal approach to space, power, cabling, and heat dissipation in stationary computers that are not to be carried about. However, designers of mobile computers face far more stringent constraints. Mobile computers are meant to be small, light, durable, operational under wide environmental conditions, and require minimal power usage for long battery life. Concessions have to be made in each of these areas to enhance functionality [24]. Some of the design pressures that result from portability constraints include: 1. Low Power: Batteries are the largest single source of weight in portable computers. Reducing battery weight is important; however, too small a battery can undermine the value of portability, leading to: i) frequent recharging, ii) the need to carry spare batteries, or iii) less use of the mobile computer. Minimizing power consumption can improve portability by reducing battery weight and lengthening the life of the battery charge. Chips can be designed to operate at lower voltages. Individual components can be powered down when they become idle. Applications should be designed to require less communication and computation. Preference should be given to listening rather than transmitting, since reception consumes a fraction of the power it takes to transmit. 2. Limited User Interface: Display and keyboard sizes are usually limited in mobile computers as a consequence of size constraints. The amount of information that may be displayed at a time is limited as a result. Present windowing techniques may prove inadequate for mobile devices. The size constraint has also resulted in designers abandoning buttons in favor of analog input devices for communicating user commands. For instance, pens are now the standard input devices for PDAs because of their ease of use while mobile, their versatility, and their ability to supplant the keyboard. 3. Limited Storage Capacity: Physical size and power requirements effectively limit storage space on portable computers. Disk drives, which are an asset in stationary computers, are a liability in mobile computers because they consume more power than memory chips. This restricts
Table 1. Characteristics of Mobile Environment and their effect on Database.

    Mobile Characteristics    Resulting Issues
    Wireless Connection       Disconnection
                              Communication Channel
                                - High Cost
                                - Network Measurement
                                - Low Data Rate
    Mobility                  Motion Management
                              Location-Dependent Data
                              Heterogeneous Networks
                                - Interfacing
                                - Data-Rate Variability
                              Security
                                - Eavesdropping
                                - Privacy
                                - Vandalism
    Portability               Limited Resources
                              Limited Energy Sources
                              User Interface
the amount of data that can be stored on mobile devices. Most PDA products on the market do not have disk drives. Flash EPROM is commonly employed; it is a dense, non-volatile solid-state technology with a read latency close to that of DRAM, a write latency close to that of a disk, and the ability to withstand only a limited number of writes over its lifetime. Solutions include compressing file systems, accessing remote storage over the network, sharing code libraries, and compressing virtual memory [2]. Table 1 summarizes these issues and their effect on the traditional issues of concern in a database environment.
4.2.5
MDAS Issues
The MDAS is a multidatabase system that has been augmented to provide support for wireless access to shared data. Issues that affect multidatabases are therefore applicable to the MDAS. Multidatabase issues have received a lot of attention in the literature (see Section 4.2.3). Mobile computing raises additional issues, over and above those outlined, in the design of an MDAS. These issues are a consequence of the properties and inherent limitations of the mobile computing environment. In this section, we examine the effects of these properties on the issues of query processing and optimization, and transaction processing.
• Query Processing and Optimization: The reliance of the mobile client on battery power, the limited wireless bandwidth, frequent disconnection, and the mobility of the mobile client have an effect on how queries are processed. Query processing needs to take bandwidth limitations and communication costs into account. Existing query processing algorithms have focused mainly on resource costs; the fact that local area networks have become commonplace, and the resultant lessening in importance of communication costs in that environment, has led to this focus. Bandwidth limitation will motivate changes to query processing and optimization algorithms. The financial cost of wireless communication may lead to the design of query processing and optimization algorithms that focus on reducing the financial cost of transactions, and to query processing strategies for long-lived transactions that rely on fewer, longer communications rather than frequent short ones. Query optimization algorithms may also be designed to select plans based on their energy consumption, to limit the effects of database operations on the limited battery power. Approximate answers will be more acceptable in mobile databases than in traditional databases due to the frequent disconnection and the long latency time of transaction execution [2]. The issue of location-dependent queries was discussed in Section 4.2.4.
• Transaction Processing: Since disconnection is a common mode of operation in mobile computing, transaction processing must provide support for disconnected operation. Temporary disconnection should be tolerated with a minimum disruption of transaction processing, and suspending of transactions on either stationary or mobile hosts. In order for users to work effectively during periods of disconnection, mobile computers will require a substantial degree of autonomy [2,38,57]. Local autonomy is required to allow transactions to be processed and committed on the mobile client. Effects of mobile transactions committed during a disconnection would be incorporated into the database while guaranteeing data and transaction correctness upon reconnection [57]. Atomic transactions are the normal mode of access to shared data in traditional databases. Mobile transactions that access shared data cannot be structured using atomic transactions. Atomic transactions execute in isolation and are prevented from splitting their computations and sharing their state and partial results. However, mobile computations need to be organized as a set of transactions, some of which execute on mobile hosts and others that execute on the mobile support hosts. The transaction model will need to include aspects of long transaction models and Sagas. Mobile transactions are expected to be lengthy due to the mobility of the data consumers and/or data producers and their interactive nature. Atomic transactions cannot satisfy the ability to handle partial failures and provide different recovery strategies, minimizing the effects of failure [2,14,61].
• Transaction Failure and Recovery: Disconnection and bandwidth limitations, and the mobile user dropping the mobile unit are some of the possible sources of failure in mobile environments. In a mobile unit, it is often the case that an impending disconnection and a drop in available bandwidth is predictable. Special action can be taken on behalf of active transactions at the time a disconnection is predicted. For example, transaction processes may be migrated to a stationary computer, particularly if no further user interaction is required. Remote data may be downloaded in advance of the predicted disconnection in support of interactive transactions that should continue to execute locally on the mobile machine after disconnection. Log records needed for recovery may be transferred from the mobile host to a stationary host; this is very important since stable storage is very vulnerable to failure due to the user dropping the machine [2].
4.3
Concurrency Control and Recovery
We begin by reviewing the meaning of a transaction and its role in concurrency control and recovery. It should be noted that this paper is not intended to address recovery issues in detail. A transaction is essentially a program that manipulates resources in a shared database or files. A transaction Ti consists of read r(x) and write w(x) operations and terminates either with a commit operation ci, making the effects of the transaction permanent, or with an abort operation ai, erasing the effects of the transaction. A classical example of a transaction is the transfer of funds in a bank; e.g., a transaction may involve withdrawing funds from a savings account and subsequently depositing the funds into a checking account.

In a multi-user environment, more than one transaction can access shared data simultaneously. In such an environment, synchronization is required to prevent undesired interference that can cause data inconsistencies. A simple example of an inconsistency due to interference is the lost update: assume transactions T1 and T2 both read data item x, followed by writes to x by T1 and then T2; without synchronization, the update to x by T1 is lost. Another problem of interference is inconsistent retrieval. This occurs when a transaction reads one data item before another transaction updates it and reads some other data item after that same transaction has updated it. In this scenario only some updates are visible, causing inconsistent retrievals or dirty reads [5].

The simplest solution to the interference problem is not to allow transactions to execute simultaneously, but this implies low throughput and poor utilization of resources, especially when transactions rarely access shared data simultaneously. Alternatively, one can allow concurrent execution of transactions and use algorithms that synchronize accesses to shared data such that the final result is equivalent to some serial execution order of the transactions, i.e., serializability [6]. Serializability is widely used as a correctness criterion for concurrency control, since it is relatively simple to reason about serial executions compared to concurrent executions.
In general, when referring to serializability, we are concerned with a special case called conflict serializability. Conflict serializability means that conflicting operations of transactions are ordered in a serial fashion. Two operations are said to conflict if both of them access the same data item and at least one of the operations is a write. This in turn can give rise to direct or indirect conflicts between transactions.

• Direct Conflict: Two transactions Ti, Tj are said to be in direct conflict with each other if one or more of their operations conflict, denoted Ti → Tj.

• Indirect Conflict: Two transactions Ti, Tj are said to conflict indirectly if there exist transactions T1, T2, ..., Tn such that Ti → T1 → T2 ... → Tn → Tj. If n = 0, this reduces to a direct conflict. This type of conflict will be of particular importance later, when the serializability issues in multidatabases are discussed.

The most common method used to maintain serializability in a centralized database is the two-phase locking protocol (2PL) [6]. Locking is a pessimistic technique, since it assumes that transactions will interfere with each other and hence takes measures to synchronize accesses. Alternative schemes such as timestamp ordering, serialization graph testing [6], and optimistic concurrency control schemes performing commit-time validation [36] have also been addressed in the literature. Table 2 summarizes the various concurrency control schemes.

Within the scope of transaction management, the recoverability of a database after failure should also be discussed. The atomicity property dictates that either all or none of a transaction's effects are made permanent. In general, a transaction aborts if:

• the database is functional and it detects a bad input that could violate database consistency requirements,
• the transaction runs into a problem detected by the system, such as deadlock or time-out, or
• a system crash occurs, causing any active transaction to be rolled back during recovery.

The basic requirement for recoverable execution is that a transaction can commit only after all previously active transactions that modified the values read by this transaction are guaranteed to commit. Recovery and atomicity issues have been dealt with by maintaining a log of the active and committed transactions, which is used to undo the effects of uncommitted transactions and redo the effects of committed ones [6].

Finally, the problem of deadlock exists when resource conflicts occur, especially when some form of locking scheme is used for concurrency control. A deadlock usually arises when a cyclic wait for resources occurs among transactions. Deadlocks are usually dealt with by using time-outs to abort transactions [31].
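To make the notion of conflict concrete, the following sketch (our illustration, not part of the original formulation; the operation encoding and transaction names are assumptions) tests whether two operations conflict and collects the resulting serialization-graph edges from a history, using the lost-update scenario described above.

from collections import defaultdict

def conflicts(p, q):
    """Two operations conflict if they belong to different transactions,
    access the same data item, and at least one of them is a write."""
    return (p["txn"] != q["txn"]
            and p["item"] == q["item"]
            and "w" in (p["kind"], q["kind"]))

def serialization_graph(history):
    """Add an edge Ti -> Tj for every pair of conflicting operations
    in which Ti's operation appears before Tj's in the history."""
    edges = defaultdict(set)
    for i, p in enumerate(history):
        for q in history[i + 1:]:
            if conflicts(p, q):
                edges[p["txn"]].add(q["txn"])
    return edges

# Lost-update scenario: T1 and T2 both read x, then both write x.
history = [
    {"txn": "T1", "kind": "r", "item": "x"},
    {"txn": "T2", "kind": "r", "item": "x"},
    {"txn": "T1", "kind": "w", "item": "x"},
    {"txn": "T2", "kind": "w", "item": "x"},
]
print(dict(serialization_graph(history)))
# {'T1': {'T2'}, 'T2': {'T1'}}: a cycle, so the history is not conflict serializable.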
Table 2. Concurrency Control Schemes.

Two-Phase Locking [5]
  Description: Two phases, a growing phase during which locks are acquired and a shrinking phase during which locks are released.
  Advantages and disadvantages: Pessimistic; blocks transactions, so deadlocks can occur; the most widely used scheme in DBMSs.

Timestamp Ordering [9]
  Description: Serialization is enforced using timestamps.
  Advantages and disadvantages: May involve more restarts if the serialization order is assumed a priori; memory requirement is usually greater than for locking methods.

Serialization Graph Testing [5]
  Description: Transactions are serialized by maintaining an execution history graph and ensuring that this serialization graph is acyclic.
  Advantages and disadvantages: Large memory overhead to maintain the read-write sets of transactions used to detect conflicts.

Optimistic Concurrency Control [30]
  Description: At commit time, transactions are validated to ensure serializability; data conflicts are resolved by aborting transactions.
  Advantages and disadvantages: Provides good performance for high data-contention systems when hardware resources are available; may not be suitable for long transactions, since it depends on transaction restart.
An optimistic way of dealing with deadlocks is to break a deadlock when it occurs. In this case, a directed graph of transactions waiting for a particular resource has to be maintained. This graph is commonly called the waits-for graph (WFG) [6]. The WFG contains an edge Ti → Tj if and only if Ti is waiting for Tj to release some lock. If a cycle is detected, some active transactions involved in the deadlock are aborted so that the deadlock can be broken. Having discussed the principles of serializability, atomicity/recoverability, and deadlock, we are now in a position to look at how these issues translate to a multidatabase environment. The requirements in a distributed environment for global serializability, atomicity, and deadlock detection will thus become apparent.
4.3.1 Multidatabase Transaction Processing: Basic Definitions
Unlike centralized databases, in a distributed database system there are two types of transactions: local and global. Local transactions are submitted at a local DBMS and executed locally, whereas global transactions are submitted through the global interface and can potentially require execution at multiple local sites. The distinction is more relevant in a multidatabase system than in a tightly coupled distributed database system. In a MDBS, local and global transactions are executed independently, whereas in a tightly coupled system the global manager has control over both local and global transactions, so there is no logical distinction between the two types. In a multidatabase system, local transactions and global transactions generate three types of histories: local history, global subtransaction history, and global history [4].

• Local History: The local history (LH) is the history at a particular local site, consisting of the local transactions and the global subtransactions executing at that site. Formally, a local history is a partial order with an ordering relation that contains all operations of the local transactions and global subtransactions at that site, preserves the order of operations within each transaction, and orders any two conflicting operations p, q in LH (either p precedes q or q precedes p in LH).
4.3.2 Global Serializability in Multidatabases
The problem of maintaining serializability in a multidatabase is complicated by the presence of local transactions that are invisible at the global level. These invisible local transactions can cause indirect conflicts between global transactions that do not conflict based on global information. Furthermore, ensuring that transactions are serializable at each local site does not ensure global serializability. To see the issue in more detail, consider a multidatabase made up of two local sites, LDBS1 containing data items a and b, and LDBS2 containing data items c and d. Assume two global transactions G1 and G2,

G1: rG1(a) wG1(c)
G2: wG2(a) rG2(d)

and two local transactions run at the local databases,

L1: rL1(a) rL1(b)
L2: rL2(c) wL2(d)

It is assumed that the local transaction manager at each local site guarantees serializability of the transactions occurring at that site. The local histories are:

LDBS1: rG1(a) rL1(a) rL1(b) wG2(a)
LDBS2: rL2(c) wG1(c) rG2(d) wL2(d)
At LDBS1, the subtransactions of G1 and G2 are in direct conflict with each other; as a result, G1 is serialized before G2, giving the serialization graph shown in Figure 4.3. At LDBS2, local transaction L2 conflicts directly with the subtransactions of G1 and G2, giving rise to an indirect conflict between them. The serialization graph is shown in Figure 4.4. The projections of the local histories onto the global transactions result in the order G1 → G2 at LDBS1 and the order G2 → G1 at LDBS2. Thus, the global history contains a cycle and is not serializable. This emphasizes the need to order multidatabase transactions not only at local DBMSs where they directly conflict but also at other sites where they apparently do not conflict, but where hidden conflicts may arise due to the autonomy of the local sites [9]. Formally, the necessary and sufficient conditions for a global history to be serializable are:
Figure 4.3. Serialization Graph at LDBS1.
Figure 4.4. Serialization Graph at LDBS2.

1. Every local history LH in which the global subtransactions take place is serializable.

2. There exists a total order O on the global transactions such that at each local site the serialization order of the global subtransactions is consistent with O. For example, if O for G1 and G2 is G1 → G2, then at every local site the subtransactions will have the same order.

In our running example, both local histories are serializable, but a total order on the global transactions at the local sites does not exist. As a result, the global schedule is non-serializable.
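As an illustration of condition 2, the following sketch (ours, under the assumption that the per-site serialization orders of the global transactions are known) checks whether those orders admit a single total order; for the running example it reports that none exists.

from itertools import permutations

def consistent_total_order(site_orders, global_txns):
    """site_orders maps each site to the serialization order of global
    transactions observed there. Return a compatible total order, or None."""
    def respects(total, order):
        pos = {g: i for i, g in enumerate(total)}
        return all(pos[a] < pos[b] for a, b in zip(order, order[1:]))

    for candidate in permutations(global_txns):     # acceptable for a handful of globals
        if all(respects(candidate, order) for order in site_orders.values()):
            return list(candidate)
    return None

# Running example: G1 -> G2 at LDBS1 but G2 -> G1 at LDBS2.
orders = {"LDBS1": ["G1", "G2"], "LDBS2": ["G2", "G1"]}
print(consistent_total_order(orders, ["G1", "G2"]))   # None: globally non-serializable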
4.3.3 Multidatabase Atomicity/Recoverability
In a distributed environment, a global transaction is said to be committed if and only if it is committed at all the sites at which its subtransactions execute. Traditionally, in a distributed database, a protocol such as the two-phase commit [12] is used to ensure atomic commitment. Briefly, the two-phase commit involves a global coordinator that sends a prepare-to-commit message to all the DBMSs participating in the transaction. Each participant in turn responds with READY if it is ready to commit or ABORT if it is not; this constitutes the first phase. In the second phase, the coordinator sends a COMMIT or an ABORT as the global decision, based on the responses obtained in the first phase. Multidatabase systems do not enjoy the luxury of inter-DBMS communication and synchronization, due to the autonomy of the local sites. Execution autonomy of the local DBMSs implies that local DBMSs may implement only a single-phase commit at their respective sites. The prepared-to-commit (READY) state is not explicitly available, nor can it be trivially provided without modifying the underlying DBMS, a violation of local autonomy that may not be possible or desirable.
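The following sketch illustrates the two-phase commit decision logic just described. It is a simplified assumption of ours: the participant objects and their prepare/commit/abort methods are hypothetical, and failure handling is reduced to treating an unreachable participant as an ABORT vote.

def two_phase_commit(participants):
    """Phase 1: ask every participant to prepare; phase 2: commit only if all voted READY."""
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())     # expected to return "READY" or "ABORT"
        except Exception:                 # a crashed or unreachable participant counts as ABORT
            votes.append("ABORT")

    decision = "COMMIT" if all(v == "READY" for v in votes) else "ABORT"

    for p in participants:                # phase 2: broadcast the global decision
        if decision == "COMMIT":
            p.commit()
        else:
            p.abort()
    return decision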
The consequence of having only a single-phase commit is illustrated by the following example. Consider a MDBS consisting of two LDBSs: LDBS1 containing data item a, and LDBS2 containing data item b. Let a global transaction G be initiated, consisting of G: r(a) w(a) r(b) w(b). Suppose the subtransactions of G at LDBS1 and LDBS2 are completed, after which a decision is made by the global transaction manager (GTM) to globally commit G. Assume that the commit is issued globally, but before LDBS1 can actually commit, a failure occurs at LDBS1, whereas LDBS2 successfully commits its subtransaction of G. The subtransaction of G is thus incomplete at LDBS1, and during the recovery phase the local DBMS will undo the effects of the active subtransaction of G at its site. We now have a global state in which one subtransaction of G is aborted while another is committed. This inconsistency is a direct consequence of the communication and execution autonomies of the underlying DBMSs. The committed subtransaction cannot be rolled back, since its effects have been made permanent at LDBS2. As an alternative, the GTM could redo the effects of the subtransaction of G at LDBS1. While this looks straightforward, trouble may occur. Consider the scenario where, after recovery of LDBS1, a local transaction T1 (rT1(a) wT1(a)) is allowed to execute before the subtransaction of G can be resubmitted. In a MDBS system, this transaction is not visible to the GTM. The GTM, unaware of the local transaction that took place, will redo the subtransaction of G in order to complete the missing write wG(a). This in turn can lead to the following local history:
LH1: rT1(a) wT1(a) commitT1 wG(a);

had the DBMS failure not occurred, the history would have been:

LH1: rG(a) wG(a) commitG rT1(a) wT1(a) commitT1.
Thus the net effect renders the MDBS schedule non-serializable, since the effect of the local write on a is not considered by the multidatabase recovery transaction. This occurs primarily because the local database does not know that, on recovery, some global transactions are pending. A local DBMS treats all transactions in the same manner and hence, on recovery, it simply undoes incomplete transactions. If the local DBMS followed a two-phase commit protocol, it would have realized that the subtransaction of G at its site was in a prepared state; it would then obtain the global decision (in this case, commit) and try to redo G, disallowing local transactions from modifying data items accessed by G. It has been shown that, in general, it is impossible to perform atomic commitment without violating local autonomy [42]. Furthermore, atomic commitment (even under assumptions such as a strict 2PL protocol at the local DBMS sites) is impossible
if the component DBMSs are autonomous and there exist cyclic functional (commit, abort) dependencies among the subtransactions. Dependencies specify the effect of transactions on each other. In a multidatabase environment, there can be dependencies between the subtransactions of the same global transaction. Commit and abort dependencies are the two types of dependencies that may exist between subtransactions and that can make atomic commitment difficult in a MDBS. The dependencies that can occur between subtransactions of a global transaction are [13]:

• Commit Dependency: Subtransaction GST1 is said to have a commit dependency on GST2 if GST1 cannot commit until GST2 either commits or aborts.

• Abort Dependency: Subtransaction GST1 is said to have an abort dependency on GST2 if aborting GST2 forces GST1 to abort.

Thus, it is clear that atomicity is a problem in a MDBS environment, especially if there exists a dependency cycle among transactions and the local DBMSs do not support some form of atomic commitment protocol.
4.3.4 Multidatabase Deadlock
If the component databases in a multidatabase use a blocking protocol such as two-phase locking to ensure global serializability, then global deadlock can occur. The problem is even more severe because the occurrence of a deadlock cannot be detected easily, due to the lack of communication between the local and global transactions. The situation can be illustrated by the following example. Consider a multidatabase composed of two local DBMSs: LDBS1 with data items a and b, and LDBS2 with data items c and d. Assume that the local DBMSs use two-phase locking to maintain serializability at their respective sites. Let there be two global transactions

G1: wG1(a) rG1(c)
G2: wG2(d) rG2(b)

and two local transactions

L1: wL1(b) rL1(a) at LDBS1
L2: wL2(c) rL2(d) at LDBS2

Consider the following sequence of events:

• G1 submits wG1(a) to LDBS1 and acquires a lock on data item a.

• G2 submits wG2(d) to LDBS2 and locks d.

• Local transaction L1 submits wL1(b) and acquires a lock on b; however, for operation rL1(a) it has to wait for G1 to release the lock.
• Local transaction L2 submits wL2(c) and acquires a lock on c, but for rL2(d) it has to wait until G2 releases d.

• Now, if G1 and G2 submit rG1(c) and rG2(b), respectively, they will not be able to obtain read locks on c and b, respectively.

What we have here is a cyclic wait, causing deadlock. To detect the deadlock, we would need to maintain a waits-for graph involving local and global transactions, but due to the autonomy requirements, the local DBMSs may not divulge information regarding local transactions.
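The cyclic wait in this example (G1 waits for L2, which waits for G2, which waits for L1, which waits for G1) is exactly what a waits-for-graph test would detect if the waiting information were available. The following sketch (ours, assuming the complete graph is known) performs that test with a depth-first search.

def find_cycle(wfg):
    """Return a list of transactions forming a cycle in the waits-for graph, or None.
    An edge Ti -> Tj means Ti is waiting for a lock held by Tj."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    stack = []

    def dfs(t):
        color[t] = GRAY
        stack.append(t)
        for u in wfg.get(t, ()):
            if color.get(u, WHITE) == GRAY:          # back edge: a cycle was found
                return stack[stack.index(u):] + [u]
            if color.get(u, WHITE) == WHITE:
                cycle = dfs(u)
                if cycle:
                    return cycle
        stack.pop()
        color[t] = BLACK
        return None

    for t in list(wfg):
        if color.get(t, WHITE) == WHITE:
            cycle = dfs(t)
            if cycle:
                return cycle
    return None

wfg = {"G1": ["L2"], "L2": ["G2"], "G2": ["L1"], "L1": ["G1"]}
print(find_cycle(wfg))   # ['G1', 'L2', 'G2', 'L1', 'G1']: abort a victim to break the deadlock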
4.3.5 MDAS Concurrency Control Issues
The MDAS is based on a multidatabase; therefore, the basic concurrency issues that affect a multidatabase, outlined in the preceding subsections (see Sections 4.3.3 and 4.3.4), apply equally to the MDAS. In a multidatabase there are two types of transactions (global transactions and local transactions); the MDAS introduces an additional type: mobile transactions. A mobile transaction can be thought of as a global transaction, but there are compelling reasons why mobile transactions should be viewed as a separate transaction type. A mobile transaction differs from other transaction types in the following ways:

• Mobile transactions might have to split their computations into sets of operations, some of which operate on a mobile host while others operate on a stationary host. Frequent disconnection and mobility result in mobile transactions sharing their states and partial results, violating the principles of atomicity and isolation to which traditional transactions adhere [40,44].

• Mobile transactions require computations and communications to be supported by stationary hosts [17,40,44]. Transaction execution may have to be migrated to a stationary host if disconnection is predicted, in order to prevent the transaction from being aborted. The stationary host behaves like a proxy and executes the transaction on behalf of the disconnected mobile client. The mobile client may either fully delegate authority to the proxy to commit or abort the transaction as it sees fit, or partially delegate authority, in which case the final decision to commit or abort the transaction is made by the mobile client upon reconnection.

• The states of transactions, the states of accessed objects, and location information move with the mobile host as it moves from cell to cell [17,40,44]. An interactive mobile transaction may be initiated at one site and completed at another when the mobile unit moves. It is important that changes made prior to movement be visible at the new location; otherwise, out-of-date information could be read by the mobile transaction at the new site, particularly if the earlier part of the transaction had written to the data item in question.
• Mobile transactions tend to be long-lived due to the mobility of data consumers and/or producers and due to frequent disconnections [17,40,44]. If locking schemes are employed, this increases the likelihood of deadlocks and aborted transactions; if optimistic schemes are employed, it increases the likelihood of conflicts and transaction restarts. The result is a decline in system throughput and an increase in transaction response time. Concurrency control must therefore minimize blocking and aborted transaction executions.

Transaction processing models must address the limitations of mobile computing. Concurrency control in the MDAS must strive to minimize aborts due to disconnection. Operations on shared data must ensure correctness of transactions executed on both stationary and mobile hosts. Blocking or restarting of transactions must be minimized to reduce communication costs as well as to increase concurrency. Local autonomy must be supported to allow transactions to be processed on the mobile host despite temporary disconnection. We have now seen the major issues related to transaction processing that exist in multidatabase and MDAS environments. In the next section, we look at how solutions to these issues have been addressed in the literature. The complexity and difficulty of the problems have forced researchers to make some assumptions, even compromising autonomy in some cases. The solutions try to extend approaches from tightly coupled distributed database systems to fit a loosely coupled environment; the approaches have their roots in the fact that a tightly coupled system is a special case of a multidatabase system.
4.4 Solutions to Transaction Management in Multidatabases
Solutions to transaction management problems in multidatabases can be broadly divided into two categories: those that ensure concurrency control in a failure-free environment and those that take failure into consideration. The first category reduces the problem to that of ensuring database consistency in the presence of concurrently executing transactions. For solutions in this category, an assumption has to be made to ensure atomicity: the existence of an atomic commitment protocol such as the two-phase commit protocol. The second category deals with correctness in the presence of failures, i.e., issues relating to the atomicity, recoverability, and durability of multidatabase transactions. In the previous section, we looked into the problems that arise in a multidatabase environment when attempting to maintain global serializability in the absence of knowledge of the transaction management schemes used by the component databases. Various solutions have been proposed in the literature to deal with this form of the transaction management problem. Some of these solutions use global serializability with conflict serializability as the correctness criterion. In environments where serializability is too strong and restrictive a criterion to maintain, the use of weaker or relaxed criteria has been proposed.
Before discussing the existing solutions for generating serializable schedules in a multidatabase environment, we look at some possible classifications of approaches to the concurrency control problem. As has been discussed, one of the important goals of a multidatabase system is to maintain the autonomy of the local databases, and it is precisely this goal that creates the problem of maintaining global serializability. One school of thought holds that it is unrealistic to expect global transaction correctness with no knowledge of the underlying databases. With maintaining correctness of data being an important goal in any database system, it may be reasonable to assume that local databases joining a multidatabase system are willing to cooperate as long as they are not affected adversely. As a result, when a local database joins a multidatabase system, the LDBS allows the global system to make certain assumptions or compromises to local autonomy in order to aid global concurrency control. Another approach is to exploit the characteristics of the multidatabase environment. This knowledge can be used to extract transaction semantics that can be exploited for concurrency control. The use of transaction semantics can allow interleavings of transactions that lead to non-serializable executions, with the correctness criterion being the semantic consistency of the underlying databases. Serializability is a syntactic correctness criterion, whereas semantics-based correctness criteria exploit transaction characteristics and hence can also increase the degree of concurrency. Semantics-based methods are of particular relevance when dealing with atomicity, since non-blocking commit protocols can be formulated; atomicity then requires one to maintain the semantic atomicity of the transactions. Thus, we have five categories of solutions that deal with the global serializability problem, as illustrated in Figure 4.5. It should be noted that global concurrency methods have also been categorized broadly as bottom-up and top-down methods [19].

• Bottom-Up Approach: In this approach, global serializability is verified by collecting local information from the LDBSs and validating serialization orders at the global level. The LDBSs independently determine their own serialization orders, and the global scheduler detects and resolves incompatibilities between global transactions and local orders.

• Top-Down Approach: In this approach, the global scheduler is allowed to determine a serialization order for global transactions before they are submitted to the local sites. It is the responsibility of the LDBSs to enforce this order at their respective sites, either by controlling the submission of global subtransactions if the underlying local concurrency control scheme is known, or by modifying the local schedulers. The latter is a violation of local autonomy; therefore, the top-down approach can be applied only when the LDBSs can tolerate a certain degree of autonomy violation.
The top-down approach is generally pessimistic in the sense that a global order is forced on the local databases. It should be noted that the top-down/bottom-up distinction classifies the scheduling policy in a generic way, without emphasizing any particular correctness criterion or method used in forming the MDBS. Hence, the solutions in the initial classification (Figure 4.5) could use either a bottom-up or a top-down policy.
Figure 4.5. Solutions to Multidatabase Serializability: under full autonomy, under restricted autonomy, under weaker correctness criteria, exploiting knowledge of the local DBMS, using transaction semantics, and under MDAS.
4.4.1 Global Serializability under Complete Local Autonomy
Definition 1: Full autonomy, from the concurrency control point of view, implies that the global scheduler does not have:

• any knowledge of local transactions occurring at the local databases, which in turn implies that the global scheduler has no knowledge of the indirect conflicts that could possibly occur;

• any knowledge of the concurrency control scheme used by the local databases;

• any authority to enforce modifications on the local schedulers in order to facilitate global concurrency control.

Under full autonomy, all we can expect from the local databases is that they will maintain serializability of the transactions occurring at their respective sites. It is the responsibility of the global transaction manager to ensure serializability of the global transactions. Hence, full-autonomy solutions fall under the bottom-up approach to concurrency control. We will look at the solutions in this category using the example employed to illustrate the global serializability problem; the following recaps the example given in the previous section. The problem occurred because the projection of the global serialization order at one local site resulted in G1 → G2, while at the second local site it resulted in G2 → G1, indicating a cycle in the global serialization graph. If we look at the local transaction history at LDBS2 we see:

LDBS2: rL2(c) wG1(c) rG2(d) wL2(d)
LDBS1: data items a, b
LDBS2: data items c, d
LH1: wG1(a) rG2(a); serialization graph at LDBS1: G1 → G2
LH2: wG2(c) rG1(c); serialization graph at LDBS2: G2 → G1
Figure 4.6. Global-Global Conflicts.

Although the transactions G1 and G2 were executed in the desired serialization order G1 → G2, an indirect conflict caused by the local transaction L2 reversed the serialization order at LDBS2. The global transaction manager has information regarding the global transactions only; the local transactions are, in effect, phantom transactions. Another problem, not discussed in the previous section, is that of non-serializable schedules caused by direct conflicts between global transactions, as illustrated in Figure 4.6. The only way to serialize transactions under full autonomy in the presence of indirect conflicts is to assume that global transactions conflict at every local site at which they execute concurrently. It is easier to deal with non-serializable executions caused only by direct conflicts between global transactions, since the global transaction manager has full knowledge of all global operations. Solutions under full autonomy are given below and summarized in Table 3.

Site Graph Method: The site graph method, proposed in [8], is a pessimistic method for global concurrency control in multidatabase systems. A site graph is an undirected graph whose nodes are the sites containing the data items accessed by the global transactions and whose edges correspond to the sites spanned by the global transactions. Initially, there are no edges between nodes in the graph. When a global transaction issues a subtransaction, undirected edges are added between the nodes of the LDBSs that participate in the execution of the global transaction. As long as the site graph is acyclic, serializability of the global transactions is guaranteed. A cycle in the site graph indicates the possibility of a non-serializable schedule, and hence the global transaction has to be aborted. In our running example, when G1 executes at local sites 1 and 2, an edge is inserted between nodes 1 and 2. When G2 then submits its operations at sites 1 and 2, another edge is inserted between the two nodes, causing a cycle in the site graph; as a result, G2 is aborted. The site graph method is an example of a bottom-up concurrency control scheme, because the site graph is a global data structure used by the GTM, which is the only unit responsible for global concurrency control in this approach. The site graph method preserves the autonomy of the local databases but suffers from low concurrency, since a cycle in the site graph does not necessarily imply a non-serializable global schedule. Throughput is also reduced, since only one transaction can execute concurrently at the same sites.
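A minimal sketch of the site graph test described above (our illustration, not the implementation from [8]): each submitted global transaction adds undirected edges among the sites it spans, and a transaction whose new edges would close a cycle is aborted.

class SiteGraph:
    def __init__(self):
        self.edges = []                              # (site_a, site_b, txn) triples

    def _has_cycle(self, edges):
        """Union-find over an undirected multigraph: a cycle appears when an edge
        joins two sites that are already connected."""
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for a, b, _ in edges:
            ra, rb = find(a), find(b)
            if ra == rb:
                return True
            parent[ra] = rb
        return False

    def submit(self, txn, sites):
        """Add edges connecting the sites spanned by txn; abort if a cycle would form."""
        new = [(sites[i], sites[i + 1], txn) for i in range(len(sites) - 1)]
        if self._has_cycle(self.edges + new):
            return "ABORT"
        self.edges.extend(new)
        return "EXECUTE"

sg = SiteGraph()
print(sg.submit("G1", ["LDBS1", "LDBS2"]))   # EXECUTE
print(sg.submit("G2", ["LDBS1", "LDBS2"]))   # ABORT: a second edge between the same sites closes a cycle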
Another disadvantage concerns the method used to safely remove an edge from the site graph. At first it seems that once a transaction commits, the edges corresponding to that transaction can be removed from the site graph. But conflicts can occur even after a transaction commits, resulting in non-serializable executions, as illustrated in [6]. Knowledge of the local concurrency control mechanisms, or weaker notions of correctness such as quasi serializability [16], can be used to develop a safe policy for edge removal. The site graph method has been used for concurrency control in the Amoco Distributed Database System (ADDS) [10].

Forced Conflict Method: In the forced conflict method [30], the global scheduler generates globally serializable schedules by assuming only that the local databases generate locally serializable schedules: serializability among global transactions is ensured by using a special data item, called a ticket, at every local DBMS where they execute. As we have seen, global transactions that appear globally serializable may generate globally non-serializable schedules due to indirect conflicts caused by local transactions. If all concurrent global transactions that span the same sites are forced to conflict directly, then the problem of indirect conflicts does not arise, since the induced conflicts serialize the global transactions. Each global subtransaction is modified to read and write the ticket data item at each local database it accesses. The ticket operation forces a direct conflict between global subtransactions, irrespective of the other operations contained in the global subtransactions; the ticket acts as a (logical) timestamp and is stored as a regular data item in each LDBS (Figure 4.7).
LH1: rG1(t1) wG1(t1) rG1(a) rL1(a) rL1(b) rG2(t1) wG2(t1) wG2(a)
LH2: rL2(c) rG1(t2) wG1(t2) wG1(c) rG2(t2) wG2(t2) rG2(d) wL2(d)
Figure 4.7. Example of using tickets.

In our running example, the additional operations that modify the ticket are required in the local schedules as shown in Figure 4.7, where t1 and t2 are the tickets for LDBS1 and LDBS2, respectively. The history at LDBS2 contains a cycle. Since the local schedules are serializable at their respective sites, the schedule at LDBS2 is not allowed by the local
concurrency control mechanism. As a consequence, to maintain local serializability, one of the transactions will be aborted or blocked. Furthermore, the ticket values read are used to determine the relative serialization order among global subtransactions. It could happen that at LDBS1, GST1 → GST2, whereas at LDBS2, GST2 → GST1. Both execution orders are valid at their respective sites: at LDBS1, GST1 executed before GST2 and the ticket operations created the conflict GST1 → GST2, whereas at LDBS2, GST2 executed before GST1, giving GST2 → GST1. Globally, however, the relative serialization orders indicated by the ticket values read at the respective sites are not consistent, and hence one of the global transactions is aborted. The autonomy of the local DBMSs is preserved, since no change is required in the local DBMS to support the ticket operation. Local transactions are also not affected by the ticket operations, since only subtransactions of global transactions have to read and manipulate the ticket values. The disadvantage of this method is that the ticket operation causes conflicts between global transactions even when they do not actually conflict. This can lead to numerous aborts, especially when optimistic scheduling is employed. Also, the ticket data item can become a hot spot [6] in the local database if several transactions try to access it simultaneously. A rare (but not too severe) problem relates to the autonomy of the local databases if a local DBMS does not support the creation of the ticket data object. This problem can easily be resolved by adding additional operations to the transactions that induce direct conflicts among them. However, this solution could result in more aborts than the pure ticket method, since the newly added operations reference a data item accessible by both global and local transactions. The forced conflict method is an important algorithm for multidatabase concurrency control, since it demonstrates that global serialization can be achieved by relying only on the fact that the LDBS concurrency control method will serialize transactions at its own site. However, strict serializability is very restrictive and may even be inappropriate under certain circumstances [47]. Therefore, we will look at solutions that maintain autonomy of the component databases under relaxed (serializability) correctness criteria.
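The following sketch illustrates the ticket idea under simplifying assumptions of ours (a single-threaded model with no real locking, and hypothetical names): each global subtransaction reads and increments the ticket at the LDBS it visits, so any two global subtransactions at the same site conflict directly, and the ticket values read give their relative serialization order there.

class LocalDatabase:
    def __init__(self, name):
        self.name = name
        self.ticket = 0                   # the ticket, stored as a regular data item

    def take_ticket(self):
        value = self.ticket               # r(ticket)
        self.ticket = value + 1           # w(ticket): forces a direct conflict
        return value

def run_subtransaction(gst, ldbs, body):
    """Wrap a global subtransaction with the ticket operations and report the
    ticket value, which fixes its serialization position at that site."""
    ticket_value = ldbs.take_ticket()
    body()                                # the subtransaction's own reads and writes
    return (ldbs.name, gst, ticket_value)

ldbs1 = LocalDatabase("LDBS1")
print(run_subtransaction("G1", ldbs1, lambda: None))   # ('LDBS1', 'G1', 0)
print(run_subtransaction("G2", ldbs1, lambda: None))   # ('LDBS1', 'G2', 1): G1 before G2 at this site

The global transaction manager would then compare the ticket-derived orders across sites, much as in the total-order check sketched earlier, and abort a global transaction whose orders are inconsistent.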
4.4.2 Solutions using Weaker Notions of Consistency
Serializability can be too strong a criterion, especially in a multidatabase environment where the underlying databases are autonomous. The low degree of concurrency obtained can significantly increase transaction response time and the delays experienced by transactions. As a consequence, weaker notions of consistency, such as quasi serializability [16] (or multidatabase serializability [4]) and two-level serializability [41], have been proposed. Generalizations of serializability such as epsilon serializability [49], which allows a bounded amount of inconsistency, can result in a higher degree of concurrency.
Table 3. Concurrency Control Under Full Autonomy.

Site Graph Method [8]
  • Maintains a site graph to serialize global transactions.
  • If more than one global transaction spans the same sites, only one of the transactions is allowed to execute.

Forced Conflict Method [30]
  • Global transactions are made to conflict directly at all local sites.
  • Local databases maintain a special data item, the ticket, which is used to serialize global transactions.
Before proceeding further, we briefly explain the notions of quasi-serial histories, two-level serializable histories, and epsilon serializability.

Two-Level Serializability (2LSR):

Definition 2: A schedule is two-level serializable if:

• the schedule restricted to each local site is serializable, and

• the projection of the global schedule onto the set of global transactions is serializable, i.e., the schedule excluding the local transactions is serializable.

Two-level serializable schedules are a superset of conflict serializable schedules (2LSR ⊃ CSR). It has been shown in [41] that 2LSR schedules preserve database consistency only for certain applications. The basic idea is to exploit knowledge of the inter-site data dependencies to relax restrictions on the global transactions and to find schedules that use 2LSR and maintain database consistency. Consistency is preserved by capturing inter-site integrity constraints and by partitioning the data into global and local data items. A data item is a global data item if there is an inter-site integrity constraint between it and a data item at a different site. It should be noted that partitioning the data imposes restrictions on the data items read and written by local and global transactions. Quasi-serializable schedules are a subset of 2LSR schedules; similar to 2LSR, quasi-serializable schedules also rely on relaxing the serializability criterion.

Quasi Serializability:
Definition 3: A global history is quasi serial if all the local histories are conflict serializable and there exists a total order on all global transactions such that, for any two global transactions Gi and Gj, if Gi precedes Gj in the global order, then all of Gi's operations precede all of Gj's operations in every local history in which both of them appear. A global history is quasi serializable if it is equivalent to a quasi-serial history.

Quasi serializability preserves database consistency if (i) there are no value dependencies between data items stored at different databases, i.e., if X = Y + Z then X and {Y, Z} are not allowed to be at different databases, and (ii) integrity and consistency constraints are defined locally for each local database. These conditions may be reasonable if we consider the manner in which a MDBS system is typically developed: as a collection of independent and autonomous databases. In fact, in [28] it has been argued that having a global constraint could itself be a violation of local autonomy; by agreeing to enforce a global constraint, a local database is at the mercy of the global transaction manager or other nodes in the MDBS system for control over its own data items. Thus, in a MDBS system, it is highly unlikely that two databases that join the system will have any inter-site integrity constraints or value dependencies (such as referential integrity constraints). Multidatabase serializability (introduced in [4]) is equivalent to quasi serializability and hence is not described here.

Comparing the different correctness criteria (conflict serializability, quasi serializability, and two-level serializability): quasi serializability is a subset of two-level serializability and is more restrictive, while conflict serializability is a subset of both two-level and quasi serializability and is the most restrictive. The relationship between the correctness criteria is illustrated in Figure 4.8. Using our running example, it is worth discussing the use of quasi serializability as the correctness criterion. In the schedule at LDBS2, if the subtransaction of G2 is submitted after the subtransaction of G1 completes, then the total order (G1 preceding G2) among the global transactions is maintained. The indirect conflict does not affect global database consistency due to the absence of value dependencies. Thus, if we were using the site graph method at the global level, as soon as G1 completes, the edge corresponding to G1 could be removed and G2 could be submitted; the weaker notion of consistency enables early removal of edges from the site graph. Concurrency control using methods such as quasi serializability is advantageous if the local databases are truly independent of each other since, under such circumstances, no restrictions are imposed on the underlying local databases and local autonomy is preserved.
4.4.3 Solutions Compromising Local Autonomy
Until now, we have looked at solutions that try to preserve local autonomy, either by using a strict correctness criterion or by using weaker consistency notions. Weaker notions of consistency do need a few assumptions, but in general they can be applied without imposing any design changes on the local databases. The schemes that
compromise local autonomy impose some modifications on local databases when they join a multidatabase system, in order to aid global concurrency control. One of the early efforts that compromised the design autonomy of local databases is given in [46]. This approach requires strict serializability as the correctness criterion for local as well as global concurrency control. The scheme uses a data structure, the order element (O-element), to represent the serialization order of subtransactions in the component databases. The order element is determined by the concurrency control method used by the underlying local database; for example, if transactions are serialized by timestamps at a local site, then the timestamps can serve as the O-elements. If the local serialization order is Ti → Tj, then according to the local concurrency control mechanism, O-element(Ti) → O-element(Tj). A data structure called an order vector is formed by concatenating the order elements from the component databases. Having formed the order vector, it is possible to analyze the vector to maintain serializability. This method, implemented in the Harmony multidatabase project at Columbia University [45], is an example of a bottom-up approach to concurrency control, since the global transaction manager is responsible for enforcing global concurrency control. A top-down approach to concurrency control that also modifies the local DBMSs is described in [19]. A subprocess that serves as an interface between the multidatabase and the underlying local DBMSs is used to serialize transactions. The order of global transactions is maintained by controlled submission of global subtransactions to the LDBSs. The approach utilizes the local concurrency control schemes in order to enforce serializability. Alternatively, a top-down concurrency control scheme can be employed by modifying the local schedulers, so that the modified local schedulers serialize both global and local transactions. The top-down approach does not achieve maximum concurrency, since the serialization order is predetermined at the global level: because runtime ordering of global transactions is disallowed, only the local histories compatible with the predetermined global order are allowed.
Figure 4.8. Relationship between Different Serializability Schemes: CSR ⊂ QSR ⊂ 2LSR, where 2LSR denotes two-level serializability, QSR quasi serializability, and CSR conflict serializability.
4.4.4 Using Knowledge of Component Databases
To this point, we have discussed concurrency control schemes in which no knowledge of the local concurrency control mechanisms was available. If a local database joining the MDBS can make its local concurrency control scheme known to the global manager, this information can be exploited for global concurrency control with few or no additional compromises to local autonomy. In the remainder of this subsection, we look at how the local concurrency control scheme can be exploited to maintain global serializability; timestamp ordering and strict two-phase locking are used to motivate the discussion.

Timestamp Ordering: A timestamp ordering (TO) scheduler orders conflicting operations according to their timestamps. A TO scheduler assigns a unique timestamp, ts(Ti), to each transaction Ti and serializes transactions using the TO rule: if Ti and Tj have conflicting operations, then the operation of Ti is processed before that of Tj if, and only if, ts(Ti) < ts(Tj). In our running example, a problem is caused by the schedule at LDBS2 because of the indirect conflict caused by local transaction L2. If timestamp ordering were followed at LDBS2, then transactions would be assigned timestamps as they were submitted. In our example, L2 was the first transaction at LDBS2 and hence would be assigned the lowest timestamp, followed by G1 and G2. The problem caused by the last operation of L2 would not arise at LDBS2, since that operation would be disallowed by the TO scheduler: conflicting operations of G1 and G2 with ts(G2) > ts(G1) > ts(L2) have already been scheduled, making the operation too late [6].

Strict Two-Phase Locking: A scheduler is said to use two-phase locking if each transaction can be divided into two phases: a growing phase, during which the transaction obtains locks, and a shrinking phase, during which the locks are released. Strict two-phase locking is a variant of two-phase locking in which a transaction holds its locks until it terminates; all the locks held by a transaction are released on termination (commit or abort). In our running example, the problem was caused by the history at LDBS2, where the schedule was LH2: rL2(c) wG1(c) rG2(d) wL2(d). With a strict two-phase locking scheduler at LDBS2, L2 will obtain a read lock on data item c when it submits the operation rL2(c). The strict two-phase locking protocol ensures that L2 will not release the read lock on c until the transaction terminates; therefore, when G1 submits w(c), it is unable to obtain a write lock on data item c. To serialize global transactions, if the global scheduler assumes that an indirect conflict can always occur due to local transactions, it will submit global subtransactions in a serial fashion. Since subtransactions of G1 and G2 both access LDBS1 and LDBS2, a conservative global scheduler would delay the subtransactions of G2 until G1 terminates (commits or aborts). The fact that G1 has completed, together with the knowledge that the local sites use the strict two-phase locking protocol, guarantees serializable transactions. Properties of local histories and the possibility of achieving global concurrency
control based on such properties are summarized in Table 4. We have seen that knowledge of the underlying local concurrency control scheme can aid in global concurrency control. A more detailed discussion of the use of local concurrency control schemes for global concurrency control can be found in [9].
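As an illustration of the TO rule used above, the following sketch (ours, with simplified bookkeeping and no restart logic) tracks the largest read and write timestamps per data item and rejects operations that arrive too late. Applied to the schedule at LDBS2 with ts(L2) < ts(G1) < ts(G2), it rejects L2's late write on d.

class TOScheduler:
    def __init__(self):
        self.read_ts = {}    # data item -> largest timestamp that has read it
        self.write_ts = {}   # data item -> largest timestamp that has written it

    def read(self, ts, item):
        if ts < self.write_ts.get(item, 0):
            return "REJECT"                       # a younger write is already scheduled
        self.read_ts[item] = max(self.read_ts.get(item, 0), ts)
        return "OK"

    def write(self, ts, item):
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            return "REJECT"                       # a conflicting operation with a larger timestamp exists
        self.write_ts[item] = ts
        return "OK"

# LH2 with ts(L2) = 1, ts(G1) = 2, ts(G2) = 3: rL2(c) wG1(c) rG2(d) wL2(d)
s = TOScheduler()
print(s.read(1, "c"), s.write(2, "c"), s.read(3, "d"), s.write(1, "d"))   # OK OK OK REJECT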
4.4.5 Global Serializability Based on Transaction Semantics
Non-serializable executions may be allowed if transaction semantics are taken into consideration. The idea is to exploit transaction semantics to specify compatibility between operations. Transactions are divided into a collection of disjoint classes such that transactions belonging to the same class are compatible; in other words, they can interleave arbitrarily. Transactions that belong to different classes are incompatible, that is, they cannot interleave [22]. Consequently, the history produced by the execution of the transactions is equivalent to a correct schedule or has an equivalent semantic effect; it is semantically serializable. An example is the case of two deposits: the order in which the amounts are deposited is immaterial as long as the effects of both transactions are reflected. Knowledge of the semantics of the operations and of the manner in which they commute can be specified in terms of the forward and backward commutativity of transaction operations. The requirement for semantic concurrency control is that a compatibility table be provided by the local database administrator so that semantic conflict tests can be performed. This may be considered a violation of local autonomy, but it can be implemented as an abstraction above the existing concurrency control schemes; if the required information is unavailable, no harm is done, since concurrency control can still be performed using traditional schemes based on strict serializability.
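A minimal sketch of a compatibility table of the kind mentioned above; the operation names and the particular compatibility choices are our own illustration, not a prescription from [22]. Two deposits commute and may interleave, while a deposit and a withdrawal on the same account are treated here as incompatible.

COMPATIBLE = {
    ("deposit", "deposit"): True,      # the order of two deposits is immaterial
    ("deposit", "withdraw"): False,    # a withdrawal may depend on the observed balance
    ("withdraw", "deposit"): False,
    ("withdraw", "withdraw"): False,
}

def may_interleave(op1, op2, same_account=True):
    """Operations on different accounts never conflict; otherwise consult the table."""
    if not same_account:
        return True
    return COMPATIBLE.get((op1, op2), False)

print(may_interleave("deposit", "deposit"))    # True: a semantically serializable interleaving
print(may_interleave("deposit", "withdraw"))   # False: the operations must be ordered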
4.4.6 Solutions under MDAS
MDAS transaction management deserves a classification of its own due to the peculiar nature of the environment. Although most of the solutions fall somewhere under the general classifications above (see Figure 4.5), the uniqueness of the MDAS warrants a separate treatment.

Multidatabase Transaction Processing Manager (MDSTPM): The model proposed in [50] augments the software components of the multidatabase system by adding a Global Interface Manager (GIM) that coordinates the exchange of requests and replies between the MDSTPM and the local database manager. The approach makes the mobile unit part of the multidatabase during its connection with its coordinator node, the MSS (see Figure 4.2). Upon submission of a global transaction, the MSS can schedule and coordinate its execution on behalf of the mobile unit. This is based on the following rationale:

1. The user may disconnect from the network and perform some other task without having to wait for the global transaction to complete, and
2. The host computers are connected via reliable communication networks and are thus less prone to network failures.

A Message and Queuing Facility (MQF) approach is proposed to facilitate the implementation of this strategy. Traditionally, a Remote Procedure Call (RPC) mechanism has been used for interprocess communication in distributed computing environments. In the RPC paradigm, an application program requests services from another application executing on a remote node, similar to a subroutine call in a program. This implies that events occur synchronously, since the caller has to wait for control to be returned before continuing; the disadvantage is that the mobile unit cannot disconnect until the results are returned to it, otherwise the transaction would be aborted. MQF instead allows asynchronous operation: messages are handled asynchronously, allowing the mobile unit to disconnect from the network and perform other tasks while the coordinating node manages the execution of the global transaction submitted on its behalf. In MQF, a mobile unit sends a request message, together with the information required for processing, to its pre-assigned coordinating node. The MQF strategy is deemed most appropriate because of the following advantages:

1. It is simple to manage the delivery and recovery of messages.

2. It is time independent, since mobile units may be disconnected for an unbounded period of time while the global transactions submitted by these mobile units are being coordinated and executed by their respective coordinating nodes.

3. Each workstation is able to query the status of its global transactions at its convenience.

This approach falls under the classification of solutions under full autonomy, since no knowledge of the underlying database management systems is required or assumed by the method.

Kangaroo Transaction Model: The kangaroo transaction (KT) model defines mobile transactions based upon global transactions in a multidatabase system and split transactions [17,48]. The mobile DBMS is viewed as an extension of a distributed system. The model derives its name from the migrating, "hopping" nature of mobile transactions. It requires a software component at the MSS, called a Data Access Agent (DAA), that acts as the coordinator of global transactions for the mobile computers connected to that MSS. The goals of the model are:

1. Build on existing multidatabase systems and do not duplicate support provided by a source system.

2. Capture the movement of a mobile transaction as well as its data access; move transaction control as the mobile unit moves.
Table 4. Relevance of Local Properties for Global Concurrency Control.

Recoverability [6]
  Description: A transaction commits only after the commitment of all transactions from which it read.
  Relevance to global serializability: This property alone cannot be used to maintain global serializability.

Avoiding Cascading Aborts [6]
  Description: A transaction may read only those values that are written by committed transactions.
  Relevance to global serializability: This property alone cannot be used to maintain global serializability.

Strict [6]
  Description: No data item may be read or overwritten until the transaction that previously wrote into it has terminated (committed or aborted).
  Relevance to global serializability: This property alone cannot be used to maintain global serializability.

Rigorous [7]
  Description: No data item may be read or overwritten until the transaction that previously read or wrote that item has terminated (committed or aborted).
  Relevance to global serializability: With proper scheduling, global serializability can be maintained.
3. Provide flexibility in terms of the atomicity feature.

4. Support long-lived transactions.

The KT model is built on traditional transactions: sequences of operations executed under the control of one DBMS. The view of global transactions is broader than what is normally assumed. Two types of global transactions are considered: the limited view, where a global transaction is composed of subtransactions that can be viewed as local transactions of some existing LDBMS, and the broader view, where subtransactions may themselves be global transactions of another multidatabase system. Upon receipt of a request from the mobile user at the associated MSS, the DAA creates a mobile transaction, called a Kangaroo Transaction (KT). Each subtransaction represents the unit of execution at one MSS and is called a Joey Transaction (JT). A JT is part of a Kangaroo Transaction and must be coordinated by a DAA at some base site. When the mobile unit migrates from one cell to another, control of the KT passes to the new DAA at the new MSS, which subsequently creates a new JT as part of the handoff process.
The creation of the new JT is accomplished by a split operation; the old JT is committed independently of the new JT. The failure of a JT may cause the entire KT to be undone at any time; this is accomplished by compensating any previously completed JTs, since the autonomy of the local databases must be assured. Two processing modes are available for KTs: compensating mode and split mode. Under the compensating mode, the failure of any JT causes the current JT and any preceding or following JTs to be undone, and previously committed JTs to be compensated for. Operating in this mode requires that the user provide the information needed to create compensating transactions. The split mode is the default mode; under this mode, when a JT fails, no new global or local transactions are requested as part of the KT, and any previously committed JTs are not compensated for. Neither mode guarantees serializability of kangaroo transactions. Although the compensating mode ensures atomicity, isolation may be violated because locks are obtained and released at the local transaction level; the JTs themselves, however, are serializable. This approach thus falls under the classification of solutions that use transaction semantics.

V-locking Transaction Model: The V-locking algorithm, proposed in [38], uses a global locking scheme to serialize the conflicting operations of global transactions. Global locking tables are used to lock the data items involved in a global transaction according to two-phase locking (2PL) rules. In typical multidatabase systems, maintaining a global locking table would require the local sites to communicate information about locked data items to the global transaction manager, which is impractical due to the delay and the amount of communication involved. Under this method, the MDAS is a collection of summary schema nodes (SSM) [11] and local databases distributed among local sites. The MDAS software is distributed in a hierarchical structure similar to the hierarchical structure of the SSM, and transaction management is performed at the global level in a hierarchical, distributed manner. A global transaction can be submitted at any node in the hierarchy. The transaction is resolved and mapped into subtransactions by the summary schema structure. The resolution of the transaction also includes the determination of the coordinating node within the SSM structure, the coordinating node being the lowest summary schema node that semantically contains the information space manipulated by the global transaction. The MDAS coordinates the execution of global transactions without any control information from the local DBMSs; the only information required by the algorithm is the type of concurrency control performed at the local sites. The semantic information contained in the summary schemas is used to maintain the global locking tables. The locking tables can be used in an aggressive manner, in which the information is used only to detect potential deadlocks, or in a more conservative manner, in which operations are actually delayed until a global lock request is granted. The global locking tables are used to create a global wait-for graph that is used to detect and resolve potential global deadlocks. The accuracy of the "waiting information" contained in the graph depends on the amount of communication
overhead that is required. The algorithm can dynamically adjust the frequency of communications (acknowledgment signals) between the GTM and the local sites based on network traffic and/or a threshold value. If an aggressive approach is used, the algorithm provides higher reliability at the expense of lower throughput, since it relies on semantic contents rather than exact contents. A decrease in communication between the local and global systems comes at the expense of an increase in the number of potential false aborts, but this must be weighed against the cost of communication.

To this point, we have discussed the correctness of concurrent transactions in a failure-free environment. The atomicity and recoverability issue is equally important; in fact, global serializability is affected by transaction aborts and failures. In the following discussion, we look at solutions to the global atomicity and recoverability problem in multidatabases.
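To make the two uses of the global locking table concrete, the following sketch contrasts the aggressive mode, which lets an operation proceed and merely records wait-for edges for later deadlock detection, with the conservative mode, which delays the operation until the lock is granted. The class and method names, and the single lock mode (read/write distinctions are omitted), are assumptions of this illustration rather than part of the algorithm in [38].

```python
from collections import defaultdict
from enum import Enum

class Mode(Enum):
    AGGRESSIVE = 1    # record conflicts only; rely on deadlock detection
    CONSERVATIVE = 2  # delay the operation until the lock is granted

class GlobalLockTable:
    """Minimal sketch of a GTM-level lock table keyed by (possibly hypernym) data items."""

    def __init__(self, mode=Mode.AGGRESSIVE):
        self.mode = mode
        self.holders = defaultdict(set)   # item -> transactions holding a lock
        self.waiting = defaultdict(list)  # item -> transactions waiting (conservative mode)
        self.wait_for = defaultdict(set)  # txn -> txns it waits for (wait-for graph edges)

    def request_lock(self, txn, item):
        conflicting = self.holders[item] - {txn}
        if not conflicting:
            self.holders[item].add(txn)
            return True                    # lock granted; the operation may proceed
        # Record wait-for edges in either mode; they feed global deadlock detection.
        self.wait_for[txn] |= conflicting
        if self.mode is Mode.AGGRESSIVE:
            self.holders[item].add(txn)    # let the operation proceed anyway
            return True
        self.waiting[item].append(txn)     # conservative: block until release
        return False

    def release(self, txn, item):
        self.holders[item].discard(txn)
        if self.waiting[item] and not self.holders[item]:
            nxt = self.waiting[item].pop(0)
            self.holders[item].add(nxt)
            self.wait_for[nxt].clear()     # simplification: drop all of the waiter's edges
```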
4.4.7
Solutions to Global Atomicity and Recoverability
Maintaining atomicity of global transactions is difficult in a multidatabase environment due to the autonomy of the local databases. The problem arises because the individual local databases may not follow an atomic commitment protocol such as the two-phase commit protocol. As a result, the local databases can unilaterally abort a global subtransaction. This independence is of particular concern when the global decision was to commit a global transaction but a failure occurs before the decision reaches the local databases. Since the local database does not follow an atomic commitment protocol, the active global subtransaction is aborted, in contradiction to the global decision. This could lead to a situation where some of the global subtransactions are committed while other subtransactions of the same global transaction have been aborted.

Traditional approaches to recovery follow a redo or undo approach to maintain atomicity of transactions [6]. A redo approach to multidatabase recovery would be to resubmit the global subtransaction, as a corrective action, at the local site where the transaction was aborted. An undo approach to global recovery would be to abort the global transaction, which would involve rolling back the already committed global subtransactions. However, this is not possible due to the semantics of the commit operation [6]. Several approaches have been proposed in the literature to deal with the global recovery problem. Some of the solutions impose restrictions on the local DBMS, while others put restrictions on the transactions. We look at the various solutions to this problem in the remainder of this section. We begin with a method that attempts to maintain atomicity of the global transaction using an approach similar to existing two-phase commit protocols.

Simulated Two-Phase Commit: Atomic commitment in a distributed DBMS is critical. Two-phase commit (2PC) is the most commonly used atomic commitment protocol in distributed databases. However, the 2PC scheme is suitable for a tightly integrated distributed database system where the underlying DBMSs are
willing to communicate with the GTM. In a MDBS system, the local DBMSs generally follow a single-phase commit due to the inherent autonomy of the local databases; i.e., local databases do not support an explicit prepared state. Therefore, the challenge in a MDBS environment is to simulate a prepared-to-commit or READY state similar to that in the traditional 2PC protocol. After completing their operations at a local site, the global subtransactions should wait until a global decision to commit or abort the transaction arrives. The basic idea behind simulation of a prepared state is that at some point during execution, a transaction will test a condition to see if the GST should proceed or abort [3]. Thus, to simulate a 2PC protocol, an operation is added to the global subtransaction that forces it to send a message to the GTM after completion of all its operations. After that, the global subtransaction waits for a reply from the GTM. This state can be viewed as the equivalent of the prepared-to-commit state. A simulated 2PC protocol can maintain atomicity as long as the data items accessed by the global subtransaction are not modified by other transactions before the global transaction has committed. To avoid interference, the underlying database should not allow any other transaction to access the data items required by the global transaction; that is, while a global transaction is in the prepared state, the local DBMS should deny access to the data items it has accessed until that transaction commits.

The simulated 2PC method maintains atomicity in the absence of failures. However, in the presence of failures, the problem created by the lack of knowledge of the global subtransactions at the local DBMSs still exists. Even if the global subtransaction were in a prepared state, the local DBMS rolls it back like any other active local transaction at the time of failure. The recovery issue after failure can be dealt with by:

• Putting restrictions on the data in the local DBMS,

• Giving the multidatabase system control over recovery, or

• Using methods that lead to eventual consistency - retrying the transaction or using compensating semantic actions.

Partitioning Local Data: To preserve multidatabase consistency, one could divide the database into two mutually exclusive sets of data items: local and global [56,60]. This partitioning approach relies on what is called the denied local updates property: as a result of partitioning, local transactions do not update objects read or written by a global subtransaction. Similar to the simulated 2PC method, the denied updates property, along with the two-phase agent method (with the additional restriction that local histories must be strict), can be used to preserve multidatabase consistency [60]. To enable recovery, a log is maintained at the global level for global subtransactions that are in a (simulated) prepared state. If a site failure occurs, this log is used to resubmit global transactions in the prepared state. As a result, consistency is maintained at the expense of violating local autonomy and imposing severe
limitations on the types of global transactions - existing data cannot be accessed if it is not in the global set.

Using MDBS Control on Recovery: One problem encountered during recovery in multidatabases is that local transactions can start executing even before the global recovery actions have been completed, causing non-serializable global executions. Intuitively, the problem could be circumvented if the local transactions are not allowed to execute until global recovery actions have been completed. On completion of global recovery, a handshake between the global and local managers is exchanged to enable the local database for local/global transactions. This approach (giving the MDBS control over recovery) has been followed in [3] and [29]. In addition to the local log at each local site, a global log containing information on global subtransactions is maintained by the GTM. A simulated prepared state is used to record the state of a global subtransaction. During recovery, exclusive access is provided to the MDBS. The global system resubmits all subtransactions belonging to committed global transactions that were recorded to be in a prepared state at the time of failure. This approach violates local autonomy, since a local DBMS is unavailable to local users until the global recovery manager completes recovery. However, it has been argued that this loss of autonomy may not be of concern in most cases, since many commercial DBMSs allow their administrators to control access to their databases. The authority of the database administrator to choose the appropriate time to open the local database to users can be employed in developing a protocol for the handshake between the global and local administrators upon recovery or startup of the local databases. The disadvantage of this solution lies in the fact that local transactions are delayed until the handshake is completed.

The REDO Approach: In cases where the local DBMSs do not provide a prepare-to-commit interface, a global transaction may be aborted by the local DBMSs at any time, even after the GTM has voted to commit the transaction. If the global subtransaction is aborted by the local DBMS after the GTM has voted to commit the transaction, the server at the site at which the subtransaction was aborted submits a redo transaction, consisting of all the writes performed by the subtransaction, to the local DBMS for execution. The agent must maintain a server log in which it records the updates of global subtransactions. If a redo transaction fails, it is resubmitted to the server until it commits. This approach requires that the schedules produced by the local DBMS be cascadeless: a transaction must read only data that is already committed. However, the problem with the redo approach is that the LDBS views resubmitted transactions as new transactions; this could result in a non-serializable schedule.

Semantic Based Recovery: As noted earlier, transaction semantics can be used for concurrency control. The correctness criterion is that the execution of transactions results in a database state that is semantically consistent [25]. If a transaction spans more than one site and is aborted at some sites and committed at others, two directions could be taken for recovery. First, an aborted global subtransaction could be retried until it commits. The second
approach would be to use a compensating transaction to undo the effects of the committed global subtransactions. The compensating transaction has to preserve database consistency as well. Compensation achieves semantic atomicity. Specifically, within a history, an operation x is compensated by x^-1 if all subsequent operations are executed as though the operations x and x^-1 had never taken place. Along with compensating transactions, compatibility can be specified in terms of the commutativity of the transactions. In that sense, a subsequent operation may have returned different values had strict serializability been followed; these values should be acceptable from the semantic point of view. Compensating transactions may not be applicable in all cases. For example, transactions leading to real-world actions are not candidates for compensation. The idea of using compensating transactions was used in Sagas [26]. The concept of sagas is explained in the next section.

Semantic-based transaction management is important in the development of future multidatabase systems. The knowledge of the environment for which the multidatabase is designed can be used effectively to resolve the problems encountered in transaction management. Advanced transaction models that use the semantics as well as the context can be beneficial. As a result, newer applications and environments can be integrated effectively into a multidatabase environment. These, in our opinion, deserve more attention. The next section is devoted to the discussion of some more specific applications and the relevance of advanced transaction models in such environments.

The O2PC Method: The optimistic commit protocol (O2PC) is based on the concept of semantic atomicity [25]. When a transaction completes, the GTM sends "prepare" messages to the agents at each site. Upon reception of the prepare message, the agents optimistically try to commit their subtransactions. The result is reported to the GTM. If all the subtransactions are committed, the global transaction is declared committed. Otherwise, the transaction is declared aborted and compensating transactions are executed at all sites where the subtransactions were committed. The problem here is that transactions that commit at some sites and abort at others may violate database integrity; global inter-site integrity constraints are therefore not allowed. Transactions should be prevented from seeing the effects of failed (or compensated-for) and successful subtransactions of global transactions.
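As a rough sketch of the optimistic commit idea, the coordinator below commits the subtransaction at each site and falls back to compensating the already-committed ones when any site reports an abort. The names SiteAgent and o2pc_commit are invented for this illustration, and the sites are contacted sequentially rather than in parallel, so this is not the protocol of [25] verbatim.

```python
class SiteAgent:
    """Per-site agent: tries to commit its subtransaction, and can compensate it later."""

    def __init__(self, name, commit_fn, compensate_fn):
        self.name = name
        self.commit_fn = commit_fn          # runs the subtransaction's commit at the local DBMS
        self.compensate_fn = compensate_fn  # semantically undoes a committed subtransaction

    def prepare_and_commit(self):
        try:
            self.commit_fn()
            return True   # subtransaction committed locally
        except Exception:
            return False  # the local DBMS aborted the subtransaction

def o2pc_commit(agents):
    """Optimistic commit: commit everywhere; if any site fails, compensate the committed ones."""
    committed = []
    for agent in agents:
        if agent.prepare_and_commit():
            committed.append(agent)
        else:
            # Global abort: semantically undo every subtransaction that already committed.
            for done in committed:
                done.compensate_fn()
            return False
    return True  # all subtransactions committed; the global transaction is committed
```

In a real deployment the prepare messages would be sent in parallel, and failures of the compensating transactions themselves would also have to be handled.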
4.5
Application Based and Advanced Transaction Management
Serializability requires that the execution of each transaction must appear to every other transaction as a single atomic step. This requirement can be too rigid and difficult to implement, particularly when dealing with CAD/CAM applications, long-lived transactions, transactions in cooperative environments, engineering applications, and so forth. These applications reveal the need for extended transaction models that exploit transaction context and semantics. In general, these models relax the ACID properties in favor of weaker guarantees; nested transactions and Sagas [18] are examples.
4.5.1
Unconventional Transaction Types
Application specific extensions to transaction processing give rise to transactions having different requirements. Some examples are: Long-lived transactions: These transactions usually consist of many steps and the transaction as a whole is characterized by its long duration. Aborting the transaction would mean a significant loss of work; hence, relaxed atomicity criteria are more appropriate. The transactions should support resumable computation, so that even in case of a system crash, a suspended transaction is not necessarily aborted. Challenges faced while handling long-lived transactions involve: • Formulating isolation requirements such that short transactions executing concurrently are affected minimally, and • Handling deadlocks due to the increase in the transaction duration. Cooperating Transactions: Cooperating transactions form a transaction group in which partial results of the operations in progress can be shared. Such an environment is characterized by relaxed atomicity and relaxed isolation among transactions. To provide interactive control in a cooperating group of transactions, the transactions should be such that operations can be retracted, i.e., compensatable. The eventual requirement of correctness among cooperating transactions is to maintain group consistency and isolation.
4.5.2
Advanced Transaction Models
In an environment supporting long-lived transactions, it would be undesirable to lose a large amount of work because a transaction aborts or the system crashes. Sagas [26], based on compensating transactions and semantic atomicity, have been proposed to deal with long-lived transactions.

Sagas [26]: A saga is a long-lived transaction that consists of relatively independent subtransactions that can be interleaved in any way with other transactions. Associated with the subtransactions T1, T2, ..., Tn defined in a saga are compensating transactions C1, C2, ..., Cn. The system then guarantees that either the sequence T1, T2, ..., Tn or a sequence T1, T2, ..., Tj, C1, C2, ..., Cj for some 1 ≤ j < n will be executed. Thus, in a saga, either all the subtransactions are completed or partial executions are undone by compensating transactions. Sagas relax the property of isolation by allowing partial results to be visible to other transactions before completion. Sagas require that either all subtransactions successfully complete or none complete; thus, sagas preserve the atomicity and durability properties. Sagas are applicable only in an environment in which subtransactions are relatively independent and each subtransaction is compensatable. If the databases joining a MDBS satisfy this condition, then this approach may be helpful in alleviating the atomic commitment problem that arises due to the blocking behavior of 2PC. An approach that utilizes sagas to maintain concurrency among global procedures in a federated DBMS environment is illustrated in [1].
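A minimal sketch of the saga guarantee, assuming each step is given as a subtransaction paired with its compensating transaction; the function names are invented for this illustration, and the compensations are applied here in reverse order, a common choice.

```python
def run_saga(steps):
    """Execute a saga given as a list of (subtransaction, compensation) pairs.

    Each subtransaction Ti is committed independently as it completes. If some
    Tj fails, the compensations of the already-completed subtransactions are
    run to semantically undo them, and the saga terminates.
    """
    completed = []  # compensations for subtransactions that have committed
    for subtxn, compensation in steps:
        try:
            subtxn()                     # Ti runs and commits on its own
            completed.append(compensation)
        except Exception:
            for comp in reversed(completed):
                comp()                   # undo the partial execution
            return False                 # saga amended by compensation
    return True                          # T1, ..., Tn all committed
```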
Sagas do not support resumable computation because the ACID properties of the individual subtransactions imply that all active transactions must be aborted on a system crash. This raises the question: what if forward recovery is desired? The ConTract model proposed by Reuter serves just such a purpose.

ConTract Model [51]: This model was proposed for dealing with long-duration, complex-computation transactions in non-standard applications such as office automation, CAD, and manufacturing control. A ConTract is a predefined set of actions with an explicit specification of control flow among these actions. The main emphasis of this model is that a ConTract should be forward recoverable, implying that a computation should be resumable. To be able to do so, all state information, including the database state, the program variables of each step, and the global state of the ConTract, must be made recoverable. Similar to sagas, a ConTract is allowed to externalize its partial results before the ConTract actually completes, again relying on compensating transactions to undo the effects of committed transactions.

Multilevel Transactions [60]: In this approach, the operations of a transaction are allowed to execute at different levels in the system. For example, a global transaction is executed at the global level and a local transaction is executed at the local level. Level-specific conflict relations between operations provide an increasing abstraction from the bottom to the top level. In a multilevel transaction, operations that conflict at a lower level may commute at a higher level if the semantics of the operations are considered, making conflicts at the lower level pseudo conflicts. This model can be applied to a MDBS by viewing global transactions at the topmost level (L2), the global subtransactions and the local transactions (LTs) at the intermediate level (L1), and the data accesses of GSTs and LTs at the lowest level (L0). Concurrency control is maintained using semantic and context information at L1. The COSMOS project at ETH Zurich employs the open-nested transaction model, which is a generalization of the multilevel transaction model [59]. An open-nested transaction is a tree of subtransactions. Nodes at the same depth are at the same level of abstraction in the MDBS, and edges represent caller-callee relationships. Concurrent executions in an open-nested model are semantically serializable. The open-nested transaction model has also been used in the implementation of VODAK [43], a distributed object-oriented database management system that allows transactions across heterogeneous and autonomous databases.

Flex Transaction Model [21]: The flex transaction model, used in the InterBase project at Purdue University, relaxes the atomicity and isolation properties of subtransactions to provide users with increased flexibility in specifying transactions. This model was proposed as an extended transaction model for multidatabase transaction management. The main features of this model are:

• It allows the user to give a set of acceptable states, i.e., it allows specification of a set of functionally equivalent subtransactions. This provides failure tolerance, since it takes advantage of the fact that a given function can be executed in more than one way.
• Users can define the execution order of transactions in terms of the internal and external dependencies of the transactions.

• It allows the concept of mixed transactions, which lets compensatable and non-compensatable transactions coexist.

• It allows the user to control the isolation granularity of a transaction through the use of compensating transactions.

The above features contribute to the flexibility of transaction management. This extended model is particularly useful in a multidatabase environment where local autonomy is of concern.

Reservable Transactions [20]: So far, we have concentrated on approaches where using the semantics of the transactions can increase concurrency - the effects of the transaction are rolled back using compensating transactions. This is an optimistic way of using the semantics of transactions. Another approach would be to use the data semantics to find out beforehand whether the transaction can commit, without completely blocking the data that the transaction accesses. This phase is called the reservation phase, and such transactions are called reservable transactions. This is a more pessimistic approach to transaction management. A reservable transaction consists of a reservation phase followed by the actual transaction, provided the reservation phase is successful. Once the actual transaction completes, an unreservation phase is executed to undo the effects of the reservation. A simple example would be booking a room in a hotel; the reservation phase would consist of making sure that a room is available, followed by ensuring that any transaction that would make the reserved room unavailable is aborted.

Dynamic Restructuring of Transactions [33]: For long-duration transactions, it may be useful to reconfigure transactions while they are in progress. Two new operations, split transaction and join transaction, were introduced [33] for this purpose.

Definition 4 A split transaction divides an ongoing transaction into two new serializable transactions by dividing the actions and the resources between the new transactions.

The splitting of a transaction generally results from new information about the dynamic access patterns of the transaction - for example, the transaction no longer needs certain resources. Therefore, one of the new transactions can be committed in order to release its resources so that other transactions can access them.

Definition 5 A join transaction merges the ongoing work of two or more independent transactions as if it had always been a single transaction.

This transaction model is especially useful in dealing with open-ended activities that have uncertain durations and unpredictable developments as they progress, e.g., VLSI design.
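A small sketch of a reservable transaction along the lines of the hotel example above; all function names, including the commented hold_room/book_room/release_hold helpers, are hypothetical.

```python
def run_reservable(reserve, execute, unreserve):
    """Sketch of a reservable transaction: reserve, then execute, then unreserve.

    The reservation phase checks and protects the semantic precondition without
    locking the data outright; the actual transaction runs only if the
    reservation succeeds, and the reservation is released afterwards.
    """
    if not reserve():              # e.g., confirm that a room can be held
        return False               # reservation failed; nothing to undo
    try:
        execute()                  # the actual transaction
        committed = True
    except Exception:
        committed = False          # the actual transaction failed
    unreserve()                    # undo the effects of the reservation phase either way
    return committed

# Hypothetical usage for the hotel example:
# run_reservable(reserve=lambda: hold_room("R101"),
#                execute=lambda: book_room("R101", guest="A. Smith"),
#                unreserve=lambda: release_hold("R101"))
```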
4.5.3
Replication
A replicated database is a distributed database in which multiple copies of some data items are stored at multiple sites. Replicated data has the following advantages
[6]:

1. Increased availability: by storing critical data at multiple sites, the DBS can operate even though some sites have failed.

2. Increased performance: since there are many copies of each data item, a transaction is more likely to find the data it needs close by, as compared to a single-copy database.

3. Maximized system throughput: storing multiple copies of each data item at different sites has the potential to elevate the degree of concurrency and to improve the system's overall utilization [37,50].

However, these benefits come at the price of having to update all copies of each data item. Thus, reads may run faster at the expense of slower writes. A DBS managing a replicated database should behave like a DBS managing a one-copy (non-replicated) database. The interleaved execution of concurrent transactions on the replicated database should be equivalent to a serial execution of those transactions on a one-copy database; this is often referred to as one-copy serializability (1SR). Thus, the goal of concurrency control in a replicated database should be to achieve one-copy serializability. This serves as the basis of consistency for replicated data.
Classification of replication techniques

Replication techniques can be classified into two categories based on the approach used to maintain the consistency of the replicated database [6]:

Write-All Approach: under this approach a DBS translates each read operation r(x) into r(xA), where xA is any copy of data item x, and it translates each write operation w(x) into {w(xA), ..., w(xZ)}, where {xA, ..., xZ} are all the copies of x. Any serializable concurrency control algorithm is used to synchronize access to the copies. The problem with the write-all approach is that it assumes the ideal case where sites never fail. However, in reality sites can fail and recover. Since there will be times when some copies of x are down, the DBS will not always be able to write into all the copies of x at the time it receives w(x). Thus, it would have to delay processing w(x) until it could write all copies. The more copies of x that exist, the higher the probability that one of them is down. As a result, more replication of data actually makes the system less available to update transactions, making the write-all approach unsatisfactory.

Write-All-Available Approach: this approach relaxes the constraint that requires all copies of each data item x to be written into by the DBS when a write operation w(x) is received. Instead, a DBS should write into all available copies; it can ignore any copies that are down while still producing a serializable execution. The write-all-available approach solves the availability problem, but may lead to problems of correctness. There will be times when some copies of x do not reflect the most up-to-date value of x. A transaction that reads an out-of-date copy can create a non-1SR execution that is incorrect. For example, consider the following
execution history H1 of three transactions, T0, T1, and T2, issued at sites A, B, and C, respectively:

H1 = wT0(xA) wT0(xB) wT0(yC) cT0 rT1(yC) wT1(xA) cT1 rT2(xB) wT2(yC) cT2
T2 reads the copy xB of x written by T0, even though T1 was the last transaction to write x before it; thus, T2 read an out-of-date copy. One reason this could have occurred is a failure at site B: in keeping with the write-all-available approach, T1 would ignore the failed copy xB. The execution history is therefore non-1SR, as a result of T2 reading an out-of-date copy of x once B had recovered. This problem can be resolved by preventing transactions from reading copies at sites that have failed and recovered until these copies have been brought up-to-date, though this alone is not enough.
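The following toy sketch (class and method names invented) shows the read-one/write-all-available translation for a single logical item and why a recovered copy must be refreshed before it may serve reads, which is exactly the hazard exhibited by H1.

```python
class WriteAllAvailable:
    """Toy write-all-available translator: one logical item x, several physical copies."""

    def __init__(self, sites):
        self.copies = {s: None for s in sites}   # site -> stored value of x
        self.up = set(sites)                     # sites currently available
        self.stale = set()                       # recovered copies not yet refreshed

    def write(self, value):
        # w(x) is applied to every *available* copy; down copies are simply skipped.
        for site in self.up:
            self.copies[site] = value
            self.stale.discard(site)

    def read(self, preferred):
        # r(x) goes to any single available copy -- but a recovered, unrefreshed
        # copy must not be read, or a non-1SR execution like H1 can result.
        candidates = self.up - self.stale
        if preferred in candidates:
            return self.copies[preferred]
        for site in candidates:
            return self.copies[site]
        raise RuntimeError("no up-to-date copy available")

    def fail(self, site):
        self.up.discard(site)

    def recover(self, site):
        self.up.add(site)
        self.stale.add(site)   # must be brought up to date before serving reads
```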
Update propagation

When a transaction issues a write on a data item x, the DBS is responsible for eventually updating the set of copies of x. In eager replication, a write on a data item is distributed immediately to all the replicas [6]. With lazy replication, the DBS delays the distribution of writes to other copies until the transaction has terminated and is ready to commit. A DBS that uses lazy replication puts all writes destined for the same site in a single message. This tends to minimize the number of messages required to execute a transaction. In contrast, in eager replication the DBS sends writes to replicated copies while the transaction executes, using in essence one message per write. With lazy replication, aborts often cost less than with eager replication: when a transaction aborts under eager replication, all the writes that have already been distributed to replicated copies are wasted and must be undone, whereas lazy replication defers the writes until the transaction terminates, so the abortion of a transaction before its termination is less costly. On the other hand, lazy replication may delay the commitment of a transaction more than eager replication, because the delayed writes have to be processed at commit time. An atomic commitment protocol (ACP) is used in order to assure consistency of the replicated data at each site, and the write processing delays the response to the vote request that the ACP employs. Eager replication can issue a vote right away, since it has already performed the writes while the transaction was executing. Lazy replication also tends to delay the detection of conflicts between operations. If multiple transactions write to the same data item while using different copies, one or more of the transactions may be rejected at commit time and will have to be aborted after all the processing has been performed. This can be remedied by requiring the DBS to use the same copy of each data item (called the primary copy) to execute each transaction, so that conflicts may be detected earlier.
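A minimal sketch of the difference in message traffic, assuming a purely illustrative send(site, batch) primitive and a list of writes given as (item, value, replica_sites) tuples:

```python
def eager_propagate(writes, send):
    """Eager replication: each write is shipped to every replica as it is executed."""
    for item, value, replica_sites in writes:
        for site in replica_sites:
            send(site, [(item, value)])      # essentially one message per write

def lazy_propagate(writes, send):
    """Lazy replication: defer propagation to commit time, one batched message per site."""
    per_site = {}
    for item, value, replica_sites in writes:
        for site in replica_sites:
            per_site.setdefault(site, []).append((item, value))
    for site, batch in per_site.items():
        send(site, batch)                    # fewer, larger messages; delays the ACP vote
```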
Replica control protocols

Replication of data should be transparent to the user. To achieve this, a synchronization layer called a Replica Control Protocol is employed. The protocol provides
a set of rules to regulate access (reading and writing) to replicas. These rules help to determine the actual value of a data item in the presence of conflicting replicas. A number of replica control protocols have been developed and addressed in the literature [37,50]. These can be categorized into three groups: the Primary-Copy Method, the Quorum-Consensus Method, and the Available-Copies Method. Other methods proposed in the literature are extensions of these approaches.

Primary-Copy Method: this method assumes that all replicated data in the system have a predefined order. The copy directly maintained by the data owner, or primary copy, takes the first place in the order; the other replicas are secondary copies. To maintain 1SR, write operations are always performed on the primary copy first and then propagated to the other replicas, either synchronously or asynchronously, while a read operation can choose from any available copy. The method is also called the Read-One-Write-All method. If a transaction reads multiple data copies and eventually finds out that updates are still in propagation and some replicas are not fully up-to-date, the transaction will either be aborted or directed to the primary copy of the data item to ensure the correctness of the data. In case of a failure of the node hosting the primary copy, the copy with the next highest order takes over and acts as the primary copy. The primary-copy method has the advantage of being easy to implement. It is straightforward and capable of ensuring 1SR, but is vulnerable to communication failures. The overhead of propagating updates to all copies may eventually become a bottleneck, particularly when the number of replicas is large.

Quorum-Consensus Method: in this method an operation is allowed to execute if it can obtain voting permission from a group, or quorum, of nodes containing replicas of the targeted data item. Consequently, this method is also referred to in the literature as the Voting method. In general, two quorum sets must be defined in a system, one for reads and the other for writes. A read operation should get permission from a read quorum group and a write operation should get permission from a write quorum group. Quorums are formed according to the following rules:

1. Any two write quorum groups must have a common member, which ensures that no two writes can execute concurrently.

2. A read quorum group and a write quorum group for a data item must have at least one node in common. This ensures that conflicting read and write operations do not access a specific replica of the item concurrently.

Usually, when inconsistencies are detected within a quorum during voting, reconciliation processes are started to synchronize the values of the copies. A transaction that fails the voting test will either be delayed until after a reconciliation phase or aborted. This method is resilient to communication failures because of its redundant data access, but the performance is relatively poor due to the high communication overhead. The assignment of quorums lacks flexibility and is difficult to optimize automatically. A new copy of the data cannot join the system until the quorums are rearranged, which effectively limits the scalability of the system.
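When quorums are chosen by size, the two membership rules above reduce to simple arithmetic over quorum sizes. The sketch below uses this common size-based formulation, which is an assumption of the sketch, since the text phrases the rules in terms of group membership.

```python
def valid_quorums(n_replicas, read_quorum, write_quorum):
    """Check the two quorum-intersection rules for one data item.

    With quorums chosen by size, the rules in the text reduce to:
      * any two write quorums intersect       -> 2 * write_quorum > n_replicas
      * every read quorum meets every write   -> read_quorum + write_quorum > n_replicas
    """
    writes_intersect = 2 * write_quorum > n_replicas
    read_sees_write = read_quorum + write_quorum > n_replicas
    return writes_intersect and read_sees_write

# Example: with 5 replicas, r = 3 and w = 3 is a valid assignment; r = 2 and w = 2 is not.
assert valid_quorums(5, 3, 3)
assert not valid_quorums(5, 2, 2)
```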
Available-Copies Method: this is also known as optimistic replication. All available copies of a data item are updated during a write operation, while reads can access any available replica. This method does not limit data availability in order to maintain data consistency. Transactions are allowed to execute even during network partitioning, when some of the copies may be unavailable and thus inconsistent. When the sites recover, the inconsistencies are detected and convergence operations are launched to synchronize the data values. Transactions that accessed inconsistent data have to be undone. Single-copy serializability is abandoned, and other mechanisms such as timestamps or version numbers are used to enforce data convergence. The available-copies method can achieve high performance for both read and write operations. However, the reconciliation and validation tests require additional message communication. When the number of data copies or the duration of a partition increases, efficiency is compromised due to the growing number of data convergence operations and the increase in communication.
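A minimal sketch of version-number-based convergence after a partition heals; the highest-version-wins policy and the data layout are simplifying assumptions of this illustration, not the method of any particular protocol cited here.

```python
def reconcile(replicas):
    """Converge the replicas of one data item after a partition heals.

    `replicas` maps site -> (version, value). Divergence introduced during the
    partition is detected by comparing version numbers; here the highest version
    simply wins, a deliberate simplification of real convergence rules.
    """
    latest_version, latest_value = max(replicas.values(), key=lambda vv: vv[0])
    stale_sites = [site for site, (ver, _) in replicas.items() if ver < latest_version]
    for site in stale_sites:
        # Overwrite the stale copy; transactions that read it while it was stale
        # would have to be undone or compensated for, as described above.
        replicas[site] = (latest_version, latest_value)
    return stale_sites  # sites whose transactions may need rollback or compensation
```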
Multidatabase replication issues
The nature of the multidatabase makes providing and maintaining replicas of data items a non-trivial issue. Two identifying features characterize a multidatabase: heterogeneity and local autonomy. Traditionally, replication has been designed for and implemented on homogeneous distributed systems. These systems cooperate with each other, follow the same locking policies, concurrency control protocols, and recovery algorithms, have similar system architectures and, more importantly, are not autonomous. Replica control protocols in a multidatabase should be designed to work in the presence of heterogeneous, autonomous nodes. They should function in the presence of various transaction processing and locking schemes.

Replication often requires that the participating sites implement a global atomic commitment protocol (ACP) in order to ensure the consistency of the replicated data. This implies that the databases at the sites are not autonomous. In contrast, the local databases at the local sites in a multidatabase are autonomous. Providing a global atomic commitment protocol in the multidatabase is practically impossible without violating the local autonomy of the databases. Furthermore, local autonomy implies that a local user may read and write a replicated data item x at the local site without the global layer being aware of the transaction. Such an execution, though correct at the local site, is a violation of global consistency. A simple solution might disallow local users from accessing the replicas, but this again is a violation of local autonomy.

In a multidatabase, each local site is free to join or leave the MDBS at any time. Thus, the replica control protocol should be designed to operate with a minimum of disruption when a node leaves the MDBS. This is similar to the failure of a node in a traditional replicated system, with the difference that the local node may or may not rejoin the MDBS.

Replication should be transparent to the database user. The approach used in
the design of the database will affect how transparent replication is to the user. The global schema approach is well suited to replication, whereas the multidatabase language approach is not [10]. The multidatabase language approach instead provides users with control over replication. Thus, the user is aware of replication and has the responsibility of managing replicas of data items.

Consistency of data in replicated systems is often maintained via one-copy serializability, which has been accepted as the de facto standard. However, maintaining serializability without global control, as is the norm in multidatabases, is difficult and costly to implement. This difficulty has led to weaker notions of correctness and consistency in multidatabases, which could have a significant effect on replica control protocols: they might have to be redesigned to accommodate the different notions of consistency.
MDAS replication issues

In the previous subsection we discussed the issues that affect replication within a multidatabase environment. This subsection discusses the issues that affect replication in a MDAS environment. The key issues in a multidatabase are local autonomy and heterogeneity; the MDAS, in addition, introduces the issues of frequent disconnection, limited wireless bandwidth, higher error rates, limited computational resources (CPU, memory, storage), and limited battery life. Bringing the data closer to the mobile client would result in quicker response times, particularly given the bandwidth limitations experienced by the mobile user. Replicating data at the mobile unit would also allow the mobile client to continue processing, to some extent, in the likely event of a disconnection (isolation). These are some of the advantages that make replication a very attractive proposition for the MDAS. However, any attempt at maintaining replicas must first determine how to overcome the inherent limitations of the mobile computing environment.

Replica control protocols differ in the distribution of replication functions between the mobile client and the server, and in the allocation of update capabilities at the client. Some systems allow updates of offline replicas while others do not. Data can be replicated at three levels in the mobile environment. The first level is at the source system, where multiple copies are maintained at different base stations. Dynamic replication can be used at this level to redistribute data closer to the access activities; classical replication protocols are sufficient to handle this. The second level involves caching data at entities referred to as data access agents (DAAs) located at the base stations. When a user first requests data, the DAA holds it in its cache. As the user moves, the DAA's cache is replicated to DAAs at other base stations. The third level involves caching the DAA's cache at the mobile unit. Maintaining the consistency of the cache and the replicas at the three levels is the focus of replication management [50].

In the previous subsection we pointed out that, in order to ensure correctness, the multidatabase must implement some form of atomic commitment protocol. In reality, this is difficult to do while preserving the autonomy of the local database.
Frequent disconnection and mobility often result in long-lived transactions, complicating the matter further. In the MDAS environment, this would allow a mobile unit to hold on to locks for a relatively long time if locking schemes are being used. Replication in traditional databases has mainly tried to maintain the 1SR property. The mobility of the mobile client, however, makes maintaining global 1SR very costly. The high overhead is due to the fact that mobile transactions tend to share states and results of partial transaction execution as the mobile unit migrates from one MSS to another. Frequent disconnection also implies that replicas will often be inconsistent, particularly if the mobile client is allowed to perform operations on replicated data items while in the disconnected mode. The mobile transaction may have to be aborted upon reconnection, and any other transactions that had accessed the data item may have to be rolled back and undone, or compensated for, in order to maintain consistency of the global database. Thus, maintaining the 1SR property may not be suitable within the MDAS environment, as the cost may prove prohibitive. An overview of how location information, power restrictions, and orderly connections/disconnections influence the various replica control protocols can be found in [23].
Replication solutions in multidatabases

Two protocols for replicated data management are presented in [32]. These protocols employ deferred propagation for local applications and immediate propagation for global applications. The first protocol ensures replication consistency (with regard to 1SR) through the use of a global certification protocol, which verifies the consistency of replicated copies accessed by both local and global applications. The second protocol is an extension of the first and uses propagation locks on primary copy sites. The propagation locks permit the consistency check to be performed locally for single-site query applications, thereby improving the performance of transaction processing. The protocols are described as follows:

Two-Phase Certification Protocol: In this protocol, a global transaction is submitted to the GTM, where it is decomposed and sent to the local sites. A primary-update subtransaction is submitted locally. After the primary-update subtransaction finishes its execution, it is immediately committed at the local DBMS. A propagation transaction is then formed and sent to the GTM. The GTM monitors and executes the propagation transaction just like a regular global transaction. The approach has the advantage that it maintains replication consistency without violating local autonomy. The protocol, however, requires that an application that accesses copies logically belonging to different sites be processed as a global transaction. A consequence of this is that each LDBS is assumed to provide a visible prepared-to-commit state for its transactions, which implies that a two-phase commit protocol must be used to control commitment. Even a single-site query that only reads non-primary copies must be treated as a global transaction; otherwise, 1SR cannot be guaranteed globally. The ability to process single-site queries as
local transactions, without the involvement of the GTM, is critical to data availability and system performance. This protocol relaxes the requirement that single-site queries be processed as global transactions by relaxing the control of the GTM over single-site queries through the use of a weak consistency requirement [32].

Propagation Lock Protocol: this protocol is an extension of the Two-Phase Certification Protocol. It provides an alternative approach to processing single-site queries as local transactions. Instead of relaxing the consistency requirement, this approach adopts a propagation lock to prevent possible inconsistent access. The propagation lock blocks the execution of single-site queries at the primary copy site until the conclusion of other propagation transactions. The lock is set on the primary copy site when a primary-update subtransaction is submitted to the site; it is released after the propagation transaction globally commits at the remote sites. The server layer at the primary copy site is responsible for granting and releasing the locks. Two propagation locks are compatible and can be granted at the same time on the primary copy site. The lock blocks only single-site query transactions, leaving other transactions unaffected. The GTM monitors the executions of propagation transactions and informs the local server of the commitment of a propagation transaction. A single-site query can be submitted and committed as a local transaction if no propagation lock is set on the local site. If a propagation lock is set between the submission and the commitment of a single-site query, the query is aborted and rerun after the lock is released. Thus, the single-site query is executed and committed locally without any coordination by the GTM. The propagation lock approach ensures 1SR; it does not require global control over the execution of a primary-update subtransaction, as the lock is set only on the primary copy site where the primary-update transaction is executed. The approach may, however, delay some single-site queries on primary copy sites. It also imposes some restrictions on local concurrency control, in the form of rigorous schedules. The assumption of rigorous schedules is, in general, practically acceptable, since most commercial DBMS products use strict 2PL [32].
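A small sketch of the server layer at a primary-copy site showing how propagation locks block only single-site queries; the class and method names are invented and the read/write details are omitted.

```python
class PrimaryCopySiteServer:
    """Sketch of the server layer at a primary-copy site managing propagation locks."""

    def __init__(self):
        self.propagation_locks = 0   # propagation locks are mutually compatible

    def begin_primary_update(self):
        # Set when a primary-update subtransaction is submitted to this site.
        self.propagation_locks += 1

    def propagation_committed(self):
        # Released when the GTM reports that the propagation transaction committed globally.
        self.propagation_locks -= 1

    def try_commit_single_site_query(self):
        # A single-site query commits locally only if no propagation lock is set;
        # otherwise it is aborted and rerun after the lock is released.
        return self.propagation_locks == 0
```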
4.5.4
Replication Solutions in MDAS
This section discusses two replication control protocols for the MDAS: the Virtual Primary Copy protocol and the Adjustable Replication Protocol.

Virtual Primary Copy (VPC) Protocol: this was proposed in [23] as a modification of the primary copy method, adapted to suit the mobile environment with its distinctive features including frequent disconnections, power limitations, and mobility. It is based on a number of assumptions:

1. The Home Base Node (HBN) - the MSS - knows the location of the VPC of the Mobile Primary Copy (MPC) that has connected to it.

2. Whenever the MPC connects to any base node, the base node it connects to is able to contact the VPC.
3. Whenever the MPC connects to another HBN and executes some transaction from the new base node, the base node to which it is connected becomes the new VPC, and the HBN keeps a pointer to this VPC.

At any instant in time, there exists one VPC for every MPC. Read requests for the MPC are handled in the same way as in the primary copy method. The actual location of a primary copy is transparent to global transactions, which access the VPC. The consistency of the VPC is maintained by the HBN. The proposed VPC method differs from a classic primary copy method in that the mobility of hosts is considered, disconnection of mobile hosts is handled and monitored by the HBN, and a multilayered approach is adopted by the HBN, which is transparent to other sites.

Adjustable Replication Protocol: this is an optimistic replication protocol, proposed in [37], that either rolls back an incorrectly committed transaction or sends a warning message to a user that retrieves outdated data. The local transactions of a site hosting a non-primary copy replica are limited to reading the secondary copies of the data, in order to avoid problems associated with arbitrary data updates. All updates to secondary copies are issued as global transactions; update requests submitted by local users are redirected to, and processed by, the global system. The protocol assumes the MDAS is organized using the Summary Schemas Model (SSM) structure. Consistency is maintained at the global level by a new locking structure known as the Horizontal V-lock, an extension of the V-lock algorithm that has the capability of horizontally directing subtransactions in the SSM hierarchy. Replicated data is stored at a fixed host. According to the application domain requirements and the characteristics of the data, a consistency requirement (CR) table records the consistency degree demanded for each data item, as set by the data owner or the MDAS replication manager. A simple cache database (CD) is reserved and used to hold the latest replicated data values at the MSS. The data cached in the CD is refreshed whenever the data is updated. While offline, operations are performed based on the data cached in the CD, and the transactions that are processed are logged in the history. The storage space of the CD is adjusted in accordance with the resources available at the MSS. On reconnection, a reconciliation process is launched to eliminate the data inconsistency caused by offline processing, and rollback/recovery of the falsely committed operations is carried out.
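A minimal sketch of how an HBN might track the current VPC of a mobile primary copy under the assumptions listed above; the class and method names are invented for illustration.

```python
class HomeBaseNode:
    """Sketch of how an HBN tracks the virtual primary copy (VPC) of a mobile primary copy."""

    def __init__(self, name):
        self.name = name
        self.vpc_of = {}   # mobile primary copy id -> base node currently acting as its VPC

    def mpc_connected(self, mpc_id, base_node, executed_transaction):
        # Assumption 3 above: if the MPC runs a transaction from a new base node,
        # that base node becomes the new VPC and the HBN records a pointer to it.
        if executed_transaction:
            self.vpc_of[mpc_id] = base_node

    def locate_vpc(self, mpc_id):
        # Global transactions access the VPC; the MPC's actual location stays transparent.
        return self.vpc_of.get(mpc_id, self.name)   # default: the HBN itself
```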
4.6
Experiments with V-Locking Algorithm
As noted before, within the MDAS environment a proper transaction model should address issues such as the limited network bandwidth and frequent disconnections. The so-called V-locking model is a hierarchical concurrency control algorithm that uses global locking tables maintained with the semantic information contained in the summary schema hierarchy. Since each summary schema node contains the semantic content of its children's schemas, the "data" item being locked is reflected either exactly or as a
hypernym term in the summary schema of the GTM. The locking tables can be used in an aggressive manner, where the information is used only to detect potential global deadlocks. A more conservative approach can be used, where the operations in a transaction are actually delayed at the GTM until a global lock request is granted. In either case, the global locking table is used to create a global wait-for graph, which is subsequently used to detect and resolve potential global deadlocks. The accuracy of the "waiting information" contained in the graph is dependent upon the amount of communication overhead that is required. The proposed algorithm can dynamically adjust the frequency of the communications (acknowledgment signals) between the GTM and local sites based on the network traffic and/or a threshold value. The number of acknowledgments performed varies from one per operation down to a single acknowledgment of the final commit/abort of the transaction. Naturally, the decrease in communication between the local and global systems comes at the expense of an increase in the number of potential false aborts.

The algorithm handles three different levels of communication: 1) each operation in the transaction is individually acknowledged, 2) only write operations are acknowledged, and 3) only the commit or abort of the transaction is acknowledged. In the first case, based upon the semantic contents of the summary schema node, an edge inserted into the wait-for graph is marked as representing an exact or imprecise data item; for each acknowledgement signal received, the corresponding edge in the graph is marked as exact. In the second case, where each write operation generates an acknowledgement signal, for each signal only the edges preceding the last known acknowledgement are marked as exact; other edges that have been submitted but not yet acknowledged are marked as pending. As in the previous two cases, in the third case the edges are marked as representing exact or imprecise data; however, all edges are marked as pending until the commit or abort signal is received. Keeping this information about the data and the status of the acknowledgement signals enables us to detect cycles in the wait-for graph.

The algorithm detects cycles in the wait-for graph based on a depth-first search (DFS) policy. The graph is checked for cycles after a time threshold for each transaction. For all of the transactions involved in a cycle, if the exact data items are known and all of the acknowledgements have been received, then a deadlock is precisely detected and broken. When imprecise data items are present within a cycle, the algorithm considers the cycle a deadlock only after a longer time threshold has passed. Similarly, a pending acknowledgement of a transaction is only used to break a deadlock in a cycle after an even longer time threshold has passed. The time thresholds can be selected and adjusted dynamically to prevent as many false deadlocks as possible.

A potential deadlock situation may also occur due to the presence of indirect conflicts. By adding site information to the global locking tables, an implied wait-for graph can be constructed using a technique similar to the potential conflict graph algorithm [7]. A potential wait-for graph is a directed graph with transactions as nodes. Edges are inserted between two transactions for each site where there are both active and waiting transactions; the edges are then removed when
a transaction aborts or commits. A cycle in the graph indicates the possibility that a deadlock has occurred.
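A minimal sketch of the cycle check with per-edge precision levels is given below. The threshold values, data structures, and names such as WaitForGraph are assumptions of this sketch, not constants or interfaces from [38].

```python
import time
from collections import defaultdict

# Illustrative waiting-time thresholds (seconds): a cycle is only treated as a
# deadlock once every edge has waited past the threshold for its precision level.
THRESHOLDS = {"exact": 1.0, "imprecise": 5.0, "pending": 15.0}

class WaitForGraph:
    def __init__(self):
        # waiter -> {holder: (precision, time_inserted)}
        self.edges = defaultdict(dict)

    def add_edge(self, waiter, holder, precision):
        self.edges[waiter][holder] = (precision, time.time())

    def acknowledge(self, waiter, holder):
        # An acknowledgement signal upgrades the corresponding edge to "exact".
        if holder in self.edges[waiter]:
            _, inserted = self.edges[waiter][holder]
            self.edges[waiter][holder] = ("exact", inserted)

    def remove_transaction(self, txn):
        # Edges are removed when a transaction commits or aborts.
        self.edges.pop(txn, None)
        for successors in self.edges.values():
            successors.pop(txn, None)

    def find_cycle(self, start):
        """Depth-first search from `start`; return the edges of a cycle, or None."""
        on_path, edge_path = [start], []

        def dfs(node):
            for nxt, info in list(self.edges[node].items()):
                if nxt in on_path:                      # cycle closed
                    i = on_path.index(nxt)
                    return edge_path[i:] + [(node, nxt, info)]
                on_path.append(nxt)
                edge_path.append((node, nxt, info))
                found = dfs(nxt)
                if found:
                    return found
                on_path.pop()
                edge_path.pop()
            return None

        return dfs(start)

def is_deadlock(cycle_edges, now=None):
    """Declare a deadlock only after every edge has exceeded its precision threshold."""
    now = time.time() if now is None else now
    return all(now - inserted >= THRESHOLDS[precision]
               for _, _, (precision, inserted) in cycle_edges)
```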
4.6.1
Simulation Studies
The performance of the proposed algorithm was evaluated through a simulator written in C++ using CSIM. The simulator measures performance in terms of global transaction throughput, response time, and CPU, disk I/O, and network utilization. In addition, the simulator was extended to compare and contrast the behavior of our algorithm against the site-graph, potential conflict graph, and forced conflict algorithms.

The MDAS consists of both local and global components. The local component comprises local DBMSs, each performing local transactions outside the control of the MDAS. The global component consists of the hierarchical global structure, performing global transactions that execute under the control of the MDAS. There is a fixed number of active global transactions present in the system at any given time. An active transaction is defined as being in the active, CPU, I/O, communication, or restart queue. A global transaction is first generated and subsequently enters the active queue. The global scheduler acquires the necessary global virtual locks and processes the operation. The operations then use the CPU and I/O resources and are communicated to the local system based upon the available bandwidth. When acknowledgements or commit/abort signals are received from the local site, the algorithm determines whether the transaction should proceed, commit, or abort. If a global commit is possible, a new global transaction is generated and placed in the ready queue. However, if a deadlock is detected, or an abort message is received from a local site, the transaction is aborted at all sites and the global transaction is placed in the restart queue. After a specified time has elapsed, the transaction is again placed in the active queue.

Each local site has a fixed number of active local transactions, comprising both local transactions and global sub-transactions; the local system does not differentiate between the two types. An active transaction is defined as being in the active, CPU, I/O, communication, blocked, or restart queue. Transactions enter the active queue and are subsequently scheduled by acquiring the necessary lock on a data item. If the lock is granted, the operation proceeds through the CPU and I/O queues and, for global sub-transactions, is communicated back to the GLS. The acknowledgement for these transactions is communicated back based upon the available communication bandwidth. If a lock is not granted, the system checks for deadlocks and will place the transaction in either the blocked queue or the restart queue. A local transaction goes into the restart queue if it is aborted, and it will subsequently be restarted. Upon a commit, a new local transaction is generated and placed in the ready queue. For global sub-transactions, an abort or commit signal is communicated back to the GLS and the sub-transaction terminates.
4.6.2
System Parameters
The underlying global information sharing process is composed of ten local sites. The size of the local databases at each site can be varied and has a direct effect on the overall performance of the system. The global workload consists of randomly generated global queries, each spanning a random number of sites. Each operation of a sub-transaction (read, write, commit, or abort) may require data and/or acknowledgements to be sent from the local DBMS. The frequency of messages depends upon the quality of the network link. In order to determine the effectiveness of the proposed algorithm, several parameters are varied across the simulation runs.

The local systems perform two different types of transactions, local and global. Global sub-transactions are submitted to the local DBMS and appear as local transactions. Local transactions are generated at the local sites and consist of a random number of read/write operations. The number of local transactions, which can be varied, affects the performance of the global system. In addition, the local system may abort a transaction, global or local, at any time. If a global subtransaction is aborted locally, the abort is communicated to the global system and the global transaction is aborted at all sites.
4.6.3
Simulation Results
The performance of the algorithm (V-Lock) is evaluated based on performance metrics such as the number of completed global transactions, the average response time, and the communication utilization at each local site. In addition, the simulator was extended to compare and contrast the proposed algorithm against the potential conflict graph method [7], the site graph method [30], and the forced conflict method [9]. Figure 4.9 shows the results. As can be concluded, the V-Lock algorithm completes the most transactions. This result is consistent with the fact that the V-Lock algorithm is better able to detect global conflicts and thus achieves higher concurrency than the other algorithms. As can be seen, the maximum occurs at a multiprogramming level of approximately ten. As expected, as the number of concurrent global transactions increases, the number of completed global transactions decreases due to the increase in the number of conflicts.

Figure 4.10 shows the relationship between the global throughput and the number of sites in the MDAS environment, where the number of sites was varied from 10 to 40. The throughput decreases as the number of sites is increased due to data fragmentation across more sites, which increases the likelihood of conflicts in the system.

The simulator also measured the percentage of transactions completed during a certain period of time for the V-Lock, PCG, forced conflict, and site-graph schemes. Figure 4.11 shows the results. In general, for all schemes, the number of completed transactions decreases as the number of concurrent transactions increases, due to more conflicts among the transactions. However, the performance of both the forced conflict and site-graph algorithms decreases at a faster rate. This is due to the increase in the number of false aborts detected by these algorithms.
Figure 4.9. Comparative analysis of different Concurrency Control Algorithms.
Figure 4.10. Global Throughput Varying the Number of Local Sites.

Finally, in a separate set of simulation runs, the simulator measured and compared the response times of the four schemes. Figure 4.12 shows the results. The two locking algorithms have a much better response time than the forced-conflict and site-graph algorithms. The V-Lock algorithm has the best response time and performs better than PCG, particularly for a large number of users. As expected, as the number of concurrent users increases, the response time increases.

Figure 4.11. Comparison of the Percent of Completed Global Transactions (percent completed versus the number of active global transactions, for V-Lock, PCG, Forced Conflicts, and Site-Graph).

Figure 4.12. Average Response Time (versus the number of active global transactions).

4.7 Conclusion
The need to maintain local autonomy is the distinguishing factor in transaction management for multidatabase systems. The problems associated with transaction management in multidatabases have been examined, and solutions proposed in the literature to deal with the concurrency control and recovery problems in multidatabase systems have been studied.
With the emergence of new applications for E-commerce and advanced transaction models, many interesting issues arise. The challenge is to combine these models with traditional transaction management models. A multidatabase system is primarily designed to integrate different existing database systems; the need to combine different transaction models is hence very relevant for multidatabase systems of the future. The goal is to develop extensible transaction management schemes that meet the application-specific needs of different local database systems in the most efficient manner. As designers try to meet these needs, a number of decisions on the properties of multidatabase systems will have to be made. Since the most severe restriction is the autonomy of the local databases, it is likely that designers will look there for relief. One interesting approach is to allow the local database to retain full control over what transactions are performed on the data and how, while requiring the local database systems to provide information to the multidatabase system. A variety of information can be used to support transaction management.
The simplest type of useful information is for the local database system to supply the multidatabase system with a complete local schema. More difficult to obtain, but perhaps most useful, is to have the local database system provide the global transaction system with information on the transactions performed on the local database. The typical multidatabase system requires some kind of "global wrapper" on each system supporting local databases. By including a scheduler in the global wrapper, the wrapper can make use of the information on the transactions to verify the global order. Once the global order is verified, this information can be communicated to the global transaction scheduler and used to ensure that global serializability is eventually maintained. By incorporating such approaches, future transaction management systems for multidatabases will become more practical. Effective E-commerce technologies are in their formative stage. As E-commerce moves from a largely business-to-business model to include more retail selling channels, the demand for more effective multidatabase systems and mobile data access systems will proliferate. The problems of effective multidatabase systems need to be resolved to allow the development of electronic markets.
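One way to picture the "global wrapper" idea is as a small per-site interface through which the local scheduler reports observed transaction orderings to the global scheduler. The sketch below is purely hypothetical; the structure and callback names are assumptions used for illustration, not an interface defined in this chapter.

/* Hypothetical per-site "global wrapper" interface: the names and the
 * division of responsibilities are assumptions of this sketch only.       */
#include <stdbool.h>

typedef struct {
    int site_id;
    /* invoked by the wrapper when it has verified that global subtransaction
     * before_txn precedes after_txn at this site                              */
    void (*report_order)(int site_id, int before_txn, int after_txn);
    /* invoked by the global scheduler to check a proposed serialization order
     * of n transactions against what this site has observed so far            */
    bool (*order_consistent)(int site_id, const int *txn_order, int n);
} site_wrapper;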
4.8 Bibliography
[1] R. Alonso, H. Garcia-Molina, and K. Salem. Concurrency Control and Recovery for Global Procedures in Federated Database Systems. Quarterly Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 10(3), September 1987.
[2] R. Alonso and H. F. Korth. Database System Issues in Nomadic Computing. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 388-392, 1993.
[3] K. Barker and M. T. Ozsu. Reliable Transaction Execution in Multidatabase Systems. In Proceedings of the First International Workshop on Interoperability in Multidatabase Systems, April 1991.
[4] K. Barker. Transaction Management on Multidatabase Systems. Ph.D. thesis, Department of Computing Science, University of Alberta, 1990.
[5] D. A. Bell. Distributed Database Systems. Addison-Wesley Publishing Company, 1992.
[6] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley Publishing Company, 1987.
[7] Y. Breitbart, D. Georgakopoulos, M. Rusinkiewicz, and A. Silberschatz. On Rigorous Transaction Scheduling. IEEE Transactions on Software Engineering, 17(9): 954-960, September 1991.
[8] Y. Breitbart and A. Silberschatz. Multidatabase Update Issues. In Proceedings of the ACM SIGMOD Intl. Conference on Management of Data, pp. 135-142, 1988.
[9] Y. Breitbart, H. Garcia-Molina, and A. Silberschatz. Overview of Multidatabase Transaction Management. VLDB Journal, 1(2): 181-239, 1992.
[10] M. W. Bright, A. R. Hurson, and S. H. Pakzad. A Taxonomy and Current Issues in Multidatabase Systems. IEEE Computer, 25(3): 50-60, 1992.
[11] M. Bright, A. R. Hurson, and S. Pakzad. Automated Resolution of Semantic Heterogeneity in Multidatabases. ACM Transactions on Database Systems, 19(2), pp. 212-253, 1994.
[12] S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill Book Company, 1984.
[13] P. K. Chrysanthis and K. Ramamritham. ACTA: A Framework for Specifying and Reasoning about Transaction Structure and Behavior. SIGMOD Record (ACM Special Interest Group on Management of Data), 19(2): 194-203, 1990.
[14] P. Chrysanthis. Transaction Processing in Mobile Computing Environment. In Proceedings of the IEEE Workshop on Advances in Parallel and Distributed Systems, pp. 77-82, 1993.
[15] R. A. Dirckze and L. Gruenwald. Nomadic Transaction Management. IEEE Potentials, 17(2), pp. 31-33, April-May 1998.
[16] W. Du and A. Elmagarmid. Quasi Serializability: A Correctness Criterion for Global Concurrency Control in InterBase. In Proc. of the 15th Int. Conf. on Very Large Data Bases, pp. 347-355, 1989.
[17] M. H. Dunham, A. Helal, and S. Balakrishnan. A Mobile Transaction Model That Captures Both the Data and Movement Behavior. Mobile Networks and Applications, 2(2), pp. 149-162, October 1997.
[18] A. Elmagarmid, editor. Database Transaction Models for Advanced Applications. Morgan Kaufmann, 1992.
[19] A. Elmagarmid and W. Du. A Paradigm for Concurrency Control in Heterogeneous Distributed Database Systems. In Proc. Sixth Intl. Conf. on Data Engineering, pp. 37-46, IEEE, 1990.
[20] A. Elmagarmid, J. Jing, J. G. Mullen, and J. Sharif-Askary. Reservable Transactions: An Approach for Reliable Multidatabase Transaction Management. Technical Report SERC-TR-114-P, Software Engineering Research Centre, April 1992.
[21] A. K. Elmagarmid, Y. Leu, W. Litwin, and M. Rusinkiewicz. A Multidatabase Transaction Model for InterBase. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB), pp. 507-518, 1990.
[22] A. A. Farrag and M. T. Ozsu. Using Semantic Knowledge of Transactions to Increase Concurrency. ACM Transactions on Database Systems, 14(4): 503-525, December 1989.
[23] M. Faiz and A. Zaslavsky. Database Replica Management Strategies in Multidatabase Systems with Mobile Hosts.
[24] G. H. Forman and J. Zahorjan. The Challenges of Mobile Computing. IEEE Computer, 27(4), pp. 38-47, April 1994.
[25] H. Garcia-Molina. Using Semantic Knowledge for Transaction Processing in a Distributed Database. ACM Transactions on Database Systems, 8(2): 186-213, 1983.
[26] H. Garcia-Molina. Sagas. In Proc. of ACM-SIGMOD 1987 Intl. Conf. on Management of Data, pp. 249-259, 1987.
[27] H. Garcia-Molina. Node Autonomy in Distributed Systems. In Proc. Intl. Symp. on Databases in Parallel and Distributed Systems, pp. 158-166, IEEE, 1988.
[28] H. Garcia-Molina. Global Consistency Constraints Considered Harmful for Heterogeneous Database Systems. In Proc. of the First International Workshop on Interoperability in Multidatabase Systems, pp. 248-250, 1991.
[29] D. Georgakopoulos. Multidatabase Recoverability and Recovery. In Proceedings of the First International Workshop on Interoperability in Multidatabase Systems, April 1991.
[30] D. Georgakopoulos, M. Rusinkiewicz, and A. Sheth. Using Tickets to Enforce the Serializability of Multidatabase Transactions. IEEE Transactions on Knowledge and Data Engineering, 6(1): 166-180, 1994.
[31] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[32] J. Jing, W. Du, A. Elmagarmid, and O. Bukhres. Maintaining Consistency of Replicated Data in Multidatabase Systems. In Proceedings of the 14th International Conference on Distributed Computing Systems, pp. 552-559, 1994.
[33] G. E. Kaiser and C. Pu. Dynamic Restructuring of Transactions. In Database Transaction Models for Advanced Applications, Morgan Kaufmann, pp. 265-295, 1992.
[34] J. J. Kistler and M. Satyanarayanan. Disconnected Operation in the Coda File System. ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 3-25, February 1992.
[35] R. Krishnamurthy, W. Litwin, and W. Kent. Interoperability of Heterogeneous Databases with Schematic Discrepancies. In Proc. First Intl. Workshop on Interoperability in Multidatabase Systems, pp. 144-152, IEEE, 1991.
[36] H. T. Kung and J. T. Robinson. On Optimistic Methods for Concurrency Control. ACM Transactions on Database Systems, 6(2), 1981.
[37] C. Li. Replication Protocol for Mobile Data Access System - An Approach Under Summary Schemas Model. M.S. thesis, Department of Computer Science and Engineering, The Pennsylvania State University, May 2000.
[38] J. B. Lim, A. R. Hurson, and K. M. Kavi. Concurrent Data Access in a Mobile Heterogeneous System. In Proceedings of the 32nd Annual Hawaii International Conference on System Sciences, 1999.
[39] W. Litwin and A. Abdellatif. Multidatabase Interoperability. IEEE Computer, 19(12): 10-18, 1986.
[40] S. K. Madria and B. Bhargava. A Transaction Model for Mobile Computing. In Proc. of the International Database Engineering and Applications Symposium, pp. 92-102, 1998.
[41] S. Mehrotra, R. Rastogi, H. F. Korth, and A. Silberschatz. Non-Serializable Executions in Heterogeneous Distributed Database Systems. In Proc. of the First International Conference on Parallel and Distributed Information Systems, pp. 245-252, 1991.
[42] J. G. Mullen, A. K. Elmagarmid, Won Kim, and J. Sharif-Askary. On the Impossibility of Atomic Commitment in Multidatabase Systems. Technical Report SERC-TR-113-P, Software Engineering Research Centre, April 1992.
[43] P. Muth, J. Veijalainen, and E. J. Neuhold. Extending Multi-Level Transactions for Heterogeneous and Autonomous Database Systems. GMD Technical Report No. 739, Sankt Augustin, March 1993.
[44] E. Pitoura and B. Bhargava. Revising Transaction Concepts for Mobile Computing. In Proceedings of the Workshop on Mobile Computing Systems and Applications, pp. 164-168, 1995.
[45] C. Pu, A. Leff, and S. F. Chen. Heterogeneous and Autonomous Transaction Processing. IEEE Computer, 24(12): 64-72, 1991.
[46] C. Pu. Superdatabases for Composition of Heterogeneous Databases. In Proceedings of the Fourth International Conference on Data Engineering, pp. 548-555, 1988.
[47] C. Pu and Shu-Wie Chen. ACID Properties Need Fast Relief: Relaxing Consistency Using Epsilon Serializability. In Proceedings of the Fifth International Workshop on High Performance Transaction Systems, September 1993.
[48] C. Pu, G. Kaiser, and N. Hutchinson. Split-Transactions for Open-ended Activities. In Proc. of the 14th VLDB Conference, 1988.
[49] K. Ramamritham and C. Pu. A Formal Characterization of Epsilon Serializability. IEEE Transactions on Knowledge and Data Engineering, December 1995.
[50] R. Ramasubramanian. A Survey of Replication Issues and Strategies in Mobile and Multidatabase Environments. M.Eng. Technical Paper, Department of Computer Science and Engineering, The Pennsylvania State University, August 1998.
[51] A. Reuter and H. Wachter. The ConTract Model. Data Engineering Bulletin, 14(1): 39-43, 1991.
[52] B. Roberto and S. Silvio. Deadlock Detection in Multidatabase Systems: a Performance Analysis. Technical Report RR-2668, INRIA, The French National Institute for Research in Computer Science and Control, September 1995.
[53] M. Satyanarayanan. Mobile Information Access. IEEE Personal Communications, pp. 26-33, February 1996.
[54] M. Satyanarayanan. Mobile Computing. IEEE Computer, 26(9), pp. 81-82, September 1993.
[55] A. P. Sheth and J. A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22(3): 183-236, 1990.
[56] N. Soparkar, H. F. Korth, and A. Silberschatz. Failure Resilient Transaction Management in Multidatabases. IEEE Computer, 24(12): 28-36, 1991.
[57] G. D. Walborn and P. K. Chrysanthis. Supporting Semantics-Based Transaction Processing in Mobile Database Applications. In Proceedings of the 14th Symposium on Reliable Distributed Systems, pp. 31-40, 1995.
[58] G. Weikum and H.-J. Schek. Multi-Level Transactions and Open-Nested Transactions. Data Engineering Bulletin, 14(1): 60-64, 1991.
[59] G. Weikum, A. Deacon, W. Schaad, and H.-J. Schek. Open Nested Transactions in Federated Database Systems. IEEE Data Engineering Bulletin, Vol. 16, No. 2, June 1993.
[60] A. Wolski and J. Veijalainen. 2PC Agent Method: Achieving Serializability in Presence of Failures in a Heterogeneous Multidatabase. In Proc. of PARBASE-90 Intl. Conf. on Databases, Parallel Architectures, and Their Applications, March 1990.
[61] L. H. Yeo and A. Zaslavsky. Submission of Transactions from Mobile Workstations in a Cooperative Multidatabase Processing Environment. In Proc. of the 14th International Conference on Distributed Computing Systems, pp. 372-379, 1994.
[62] V. Zwass. Structure and Macro-Level Impacts of Electronic Commerce: From Technological Infrastructure to Electronic Marketplaces. Foundations of Information Systems: E-Commerce Paper. [http://www.mhhe.com/business/mis/zwass/ecpaper.html]
Chapter 5
Architecture Inclusive Parallel Programming
C K YUEN
School of Computing
National University of Singapore
Singapore 119260
[email protected]
5.1 Introduction

5.1.1 Architecture Independence - The Holy Grail
Since the very early days of parallel processing, architecture independence has been the ultimate goal of system designers. Why is this problem so important, and at the same time, so difficult to solve? Architecture independence means programs are portable between machines despite differences in architecture, and the same program will run on many machines. It enhances the value of the effort you put into writing a program. When you are developing a large program library for some popular applications, you can do so knowing that there are many users who will benefit from it, and they will pool their efforts to debug and optimize the code. Frequently used code is bound to be good code because of this pooling of effort, and both developers and users acquire confidence in it and would recommend it to other users. Proficiency and expertise are developed. Programs are also stable, at least after an initial phase when bugs are found and optimization is done, after which they do not need to be modified to cater for new machines and architectural updates because of the architecture independence, thus saving users from frequent relearning efforts and developers from code change efforts. The ideal is, however, far from being reached in parallel processing. Instead, it is typical that parallel machines run specialized languages, with each machine offering few languages and each language running on few types of machines. Parallel software
libraries are rarely widely usable and well optimized, and seldom have programmers familiar with them everywhere to support them. Software costs are high because the same algorithm requires new code for new machines, and the difficulty level of writing parallel code is high, with the need to deal with unfamiliar features and cater for machine dependencies, plus troublesome debugging because of low familiarity and the need to deal with hardware peculiarities. While many systems claim to be easy to use, and many parallel processing models exist claiming to be general and architecture independent, real experience does not bear these claims out, and simple, inexpensive general purpose parallel computing does not really exist. One cause of the problem is certainly the wide variety of parallel system features: processing, storage and communication resources can be distributed in many different ways, and very different operating systems and languages can be designed to suit the various structures. This by itself need not, however, preclude architecture independence at the application programming level: if users had been able to agree generally on a particular model for writing parallel programs, and the model made parallel algorithms easy to express and enhanced programmer productivity, then ways could be found to compile and run such programs on most types of machines, though at varying levels of efficiency. Parallel systems that fail to cope with such programs would be rejected by the market, leaving the fittest machines to survive. So why has a standard parallel programming model not emerged, in contrast to the way in which scientists and programmers agreed on architecture independent sequential high level languages, despite the fact that they have lower efficiency than architecture dependent machine code? This is ultimately related to our failure to agree on the very objective and domain of parallel computing. For practitioners of parallel processing, i.e., those who run parallel systems for paying users, the purpose of going parallel is to handle very large problems for which single processor systems are too slow. There are only a small number of such so-called "grand challenge" problems, and most of these involve numerical computation. These practitioners do not consider it important to cater for the general use of parallel machines, and most of them are more interested in getting the highest performance for a single application on a single machine, as we have seen in such well publicized examples as weather forecasting, nuclear modelling or world championship chess playing. Architecture independence and program portability are of little relevance in such situations, and indeed, may be considered to be negative by these practitioners as the requirement can hamper their search for better performance, since it mandates software and hardware structuring to fit a standard model. At the same time, certain basic concepts of parallel computing, such as concurrency, modularity and structured communication, are present in computing systems generally, and all kinds of algorithms and users can take advantage of parallel computing. By extending parallel computing to a larger community, system designers can over time improve the economics of parallel machines through larger scales and wider experience, if only we can agree on a common programming model, but the
lack of interest from the leading practitioners in this venture hardly contributes towards the goal, and it does not appear that the situation will substantially change in the near future. Indeed, the unsuccessful experience of parallel processing over the past decades rather discourages efforts in this direction. Instead of the elusive goal of architecture independence, the present work suggests the more modest goal of architecture inclusiveness: we accept that each group of users only wishes to program a particular type of algorithm for a particular type of machine, and the programs need not, perhaps cannot, be ported to another type of machine, at least not without loss of efficiency; nevertheless, we suggest that all parallel programs, regardless of inherent structures, can be coded using a simple, standard set of notations. By embedding such notations into a particular language, allowing programmers to invoke facilities provided in a parallel tools library, we turn it into a parallel language. This gives a user the facilities within the environment he has already chosen for his particular problem, without hampering his capacity to express the parallelism present in his algorithm. While different versions of programs may have to be written for different types of machines, and portability will be limited to machines with similar architectures, the use of the same notations reduces the relearning effort as well as the implementation cost of the tools, thus attaining some of the benefits of architecture independence. The basic approach for architecture inclusiveness presented here is a rather simple one; it is to use a "content rich" kind of lock to control the accessibility of blocks of memory. The locks are closely connected to the idea of Linda tuples and provide for both the synchronization and data combination requirements contained in different parallel processing models, and include the alternative possibilities of describing shared memory, distributed memory and distributed shared memory requirements, and inter processor communication functions for both heterogeneous and homogeneous systems. Before presenting the ideas, we first explore in the next sections some of the issues that have hampered the effort towards architecture independence, and set the context for our idea of architecture inclusiveness.
5.1.2 Shared Memory Versus Distributed Systems
In a shared memory system, each processor can access any particular memory location by simply presenting its address to the memory interface; normally, this causes a small block of memory content (cache line) which includes the wanted location, to be transferred into the processor cache, so that the information may be repeatedly used. A symmetric multiprocessor (also known as uniform memory access system) allows all locations to be accessed at equal speed, while a non uniform memory access system provides faster access to some parts of the memory, because they are directly attached to a processor, and other parts more slowly via some system interconnect mechanism. In a distributed memory system, remote memory locations cannot be directly read/written; instead of retrieving a remote memory block into its cache, a processor
has to go through some memory to memory transfer step, or a message communication step, to create a copy of the remote memory content in its local memory, before it can perform read/write operations on the individual memory words. Writing programs, compilers and operating systems for shared memory systems is simpler: a programmer can just define code and data in any part of his program where his algorithm requires them, without worrying about whether these will fit into the local memory or not, and a compiler can lay out the data map in the whole memory knowing that everything would be accessible, while the operating system can dispatch a program to execute on any free processor knowing that it can access data and make subroutine calls anywhere. On non uniform memory machines some optimization might still be needed to minimize the more time consuming remote memory accesses, but as long as programs have good cache locality, the cost difference is relatively small and the freedom to suboptimize is present. Parallel languages designed for shared memory systems tend to be fairly simple too: because all the data are visible, one only has to verify that they are "safe" to access, i.e., that they have gone through the expected processing steps before they are used for the intended purpose, and that they are not subjected to conflicting uses. This is normally ensured by undergoing some simple locking steps, so that the same piece of information will be read or modified by different processors one at a time, though more complex concurrent uses are possible. This topic of concurrency will be further discussed in the following section. However, shared memory systems are less scalable: as the number of processors rises, it becomes increasingly difficult to provide fast access to all parts of memory for all processors, and contention from different processors for the same memory module or the same link in the global interconnect becomes frequent, producing serious performance degradations. In distributed memory systems, the parallel languages must contain tools that specify the memory to memory transfers between processors. Such operations are, however, not just one to one between processor pairs, but may also be group operations such as broadcasting the same data to all processors, collecting data from processors into a single block, or vector operations with individual processors contributing different elements and sharing all or part of the result vectors. It is a more complex process to specify an algorithm in terms of such operations. In a shared memory system such operations take place within the same memory system and are much easier to specify, since instead of moving data around, processes only need to make sure that everyone knows the address and status of the data being shared. We see that there is a qualitative difference between parallel programming in shared memory and distributed systems: in the tools provided, the typical sizes of the problems, and in the program design methods. It is therefore understandable that the two types of parallel systems have diverged considerably, and little compatibility exists between them. However, in a relatively recent development, the cluster approach tries to combine the best features of the two types: on a distributed hardware platform, we can implement an operating system that makes the whole system operate like a single
processor with a single virtual space. Addresses of this space are mapped onto physical memory locations of the processor that a program happens to be running on, transparently to the program. Two processes running concurrently would share data by using the same virtual address references, without having to perform their own memory to memory transfers, which will be managed by the operating system, using the memory management facilities built into the hardware. With this, programs written for shared memory systems would, in theory, also be able to run on clusters. Because the distributed systems normally have separate servers for disk storage and other I/O devices, I/O operations also take place without a program having to worry where the devices are located, with the operating system taking care of file accesses, networking operations and other needs. Clusters are particularly interesting for the purpose of building highly reliable systems, since a program can be suspended on one node of a cluster and restarted on another using the saved system image. Load balancing by moving work from a busy node to a less busy one is another possibility. However, while the cluster concept is quite attractive, it does not quite eliminate the shared/distributed systems dichotomy, and actually adds to it by creating a trichotomy of programming differences, which will be discussed in Section 4 Memory Consistency as well as subsequent sections.
5.1.3 Homogeneous Versus Heterogeneous Systems
The early parallel systems were single instruction, multiple data (SIMD) systems, in which a command processor sends a stream of instructions to a row of compute processors, so that the same instructions are executed on different components of a large data structure, usually a set of arrays. A similar effect is achieved in vector machines, which pump streams of array elements through an arithmetic pipeline, so that the same arithmetic operation is performed on all the elements. Both are meant to be useful in the large numerical computations demanded by high performance applications. Because each instruction affects a large number of elements, and because of the nature of the algorithms, programs on this kind of machine have relatively simple execution sequences. On SIMD machines much of the effort of programming is in fact directed towards data distribution: the right data must be in the memory of each compute processor during each cycle, and a clever initial layout and execution sequencing would minimize the amount of data movement and redistribution between processors during execution. On the vector pipeline machines, the analogous effort is in picking the right elements of each array to put into the vector registers in the processors, before they are pumped at high speed through the arithmetic units. Since these early days, some generalizations have taken place. In the single program, multiple data machines, the same program is loaded into the memories of the processors working together on the same problem, but each may jump to a different code sequence during execution so that each processor does not necessarily execute identical instructions. However, code is normally designed in a way that the
processors would come together regularly and exchange information, thus keeping step with each other in a macroscopic though not microscopic fashion. In other words, execution proceeds as a series of compute-communicate-compute-communicate... "supersteps". The massively parallel processors, with the whole system made up of thousands, possibly millions, of processors, may contain different code in each processor, but combine and exchange information across the whole system using system wide functions like broadcast, gather, summation, find maximum/minimum, and other functions involving all the processors' participation, with the hardware providing ways to execute such combinations at high speed. Again, the system operates in a sequence of supersteps because of the regular need for all the processors to come together and participate in a common function. These are examples of "homogeneous" parallel systems. It should be clear that the algorithms that can be programmed this way are somewhat restricted (though still quite various, as will be discussed in Section 3.3 PRAM Algorithms). Proponents of the homogeneous approach believe however that the purpose of parallel systems is to run very large, time consuming algorithms which are beyond the capabilities of sequential machines, and all the large, economically useful applications are homogeneous in structure. Further, it would be beyond programmers' capability to write parallel programs containing thousands or millions of different parts to run on all the processors, and realistically large problems can only be solved if all the processors run the same program or very similarly structured programs. Proponents of heterogeneous parallel systems have a very different mindset. Their starting point is concurrency, with an algorithm being divided into independent parts that can be executed separately, whether on a single processor using time slices, or in parallel by invoking the necessary multitasking/multiprocessing programming tools. In general the structure of algorithms is heterogeneous, such as different workers taking jobs of varying sizes from a common pool, or consumers using results taken from a queue, put there by producers, with each consumer/producer spending an unpredictable amount of time to perform each piece of work. Each piece of execution may generate additional independent execution at any point. Hence, the number, duration, start, stop and other steps of each execution activity are all unpredictable. The communication requirements between the activities are also unpredictable, though most tend to be bilateral, i.e., between a parent process and a child, or two siblings, which are aware of each other and have data to give to or receive from their processing partners. The design objective is to provide parallel systems that support such general parallelism efficiently. It is clear that both schools have some basis of validity; it is also clear that they are quite irreconcilable. In fact, discussions on which is the "right" approach have often taken quite dead-end (though entirely scientific) turns. For example, it is sometimes said that a heterogeneous system can run homogeneous algorithms and so "includes" everything offered in homogeneous systems already, but this has no practical importance because the question is not whether something "can be done", but whether it can be done efficiently and economically. A heterogeneous parallel
system, being designed to support highly unpredictable and variable execution requirements, must accept the higher overheads that come with such uncertainties, and using such facilities for homogeneous problems would inevitably generate inefficiencies. In particular, since a processor can become idle at any time, either because it has already finished its job or because it has to wait for something, a load balancing mechanism must be built into the runtime system to channel work from busy processors to less busy ones. Much of this overhead is not necessary on homogeneous systems. On the reverse side, the idea that "the same parallel program should be able to run on different size systems including a sequential machine without change" is sometimes raised, though it has little significance for the purpose of minimizing parallel programming cost, since it so far only seems to work for simple examples. What we really want to have is "a sequential program should be able to run on a parallel machine without change", but this is even harder to bring about. For example, the following SPMD program for summing an array meets the variable system size requirement:
n = No_of_threads;
i = My_thread_no;
My_sum = 0.0;
For j = i To N By n Do
    My_sum = My_sum + x[j];
Lock (Sum);
Sum = Sum + My_sum;
Unlock (Sum);
That is, each process discovers its own process ID and the total number of processes by calling runtime library routines, and uses this information to pick up a subset of the computation to do. If this program is run on a single processor machine, then it would pick up the whole array and sum it, while with 10 processors, processor i would pick up the elements i, i+10, i+20, etc. However, if we are programming for a single processor machine, we would write a different program, without all the parallel tool extras. Writing a parallel program in this way has not reduced the software development cost in any meaningful way. All it demonstrates is that, since the problem is homogeneous, a homogeneous approach works; that is, it can be easily mapped onto a system where each processor can obtain a comparable slice of the work regardless of problem size. Instead of arguing whether homogeneous or heterogeneous systems are better, the real question is how both can fit into some sort of shared framework, just as shared memory and distributed systems need to do so, despite their differences. We now turn to this issue.
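For comparison only, the SPMD pseudocode above might be rendered with a concrete threads library such as POSIX threads as follows; the array size, thread count and initialization of x are assumptions of this sketch rather than details given in the text.

/* POSIX-threads rendering of the SPMD summation pseudocode (a sketch only). */
#include <pthread.h>

#define N        1000000
#define NTHREADS 10

static double x[N];                 /* assumed to be filled in elsewhere     */
static double sum = 0.0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int i = (int)(long)arg;         /* my thread number, 0..NTHREADS-1        */
    double my_sum = 0.0;
    for (int j = i; j < N; j += NTHREADS)   /* every NTHREADS-th element      */
        my_sum += x[j];
    pthread_mutex_lock(&sum_lock);  /* Lock (Sum)                             */
    sum += my_sum;                  /* Sum = Sum + My_sum                     */
    pthread_mutex_unlock(&sum_lock);/* Unlock (Sum)                           */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}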
5.1.4 Architecture Independence Versus Inclusiveness
The nature of parallel processing makes it inherently difficult to program in an architecture independent way: the programmer has to indicate which parts of the algorithm are concurrent, and what information exchanges take place between them during processing. The way this is done already commits the program to certain architectures: if the parts are uniform and communicate at regular and highly predictable intervals, then a homogeneous parallel program results; without such uniformity, the result is heterogeneous. If the parts must pool their intermediary data and perform further processing that potentially may use any part of the data in some unpredictable way, then a shared memory architecture is necessary, since the cost of information pooling with messages is likely to be very high, whereas a message passing algorithm assumes that each sender has a clear idea of which process/process group requires a particular result so that the information can be sent to the recipient(s) by the process that has it, with the total amount of sharing being limited to manageable proportions. If a parallel program expresses its processing in such a high level way that it can be mapped to either shared memory or distributed machines, and either homogeneous or heterogeneous systems, then it is likely to contain little information about how to structure the processing, and compilers would get little help from the source code about how to execute it efficiently. To provide such help, it is almost always necessary to know some details of the machine or operating system that will carry out the processing steps. The situation is more complex than the old single processor case where a compiler with information about registers, cache and memory can optimize a program by minimizing bottlenecks and avoiding repetitions, though in today's superscalar machines things are already harder. With multiple communicating parties running on different pieces of hardware, one simply has to know how the pieces behave when writing the program; architecture awareness provides helpful information. Instead of independence, we should aim for architecture inclusiveness: Certain algorithms can only be efficiently programmed for a particular type of parallel machine, while others require different versions for different categories, but all such programs, regardless of what machine will run the programs and what language is used to code them, will express the parallel requirements using the same notations. So the algorithms are architecture dependent, and the tools that invoke the work distribution and information exchange are also architecture dependent, but the link between them, as they are expressed in the programs, works for any of the architectures. In other words, we want a parallel programming framework that includes different architectures, without aiming to produce architecture independent programs. For example, the program shown in the previous section uses

lock (sum)
sum = mysum + sum
unlock (sum)
which supposes that sum is stored in shared memory and each process can read/write it; a message passing system might instead have

send (main, sum, add, mysum)

to request that the main node add the received mysum value to sum. A machine providing system wide functions would require each process to execute the function

reduce (sum, processID, mysum)

that combines the values contributed by each node into a common value, the function completing when all the processes have executed it (which is why the processID information is required, for the purpose of knowing that everyone has made its contribution). Suppose that we instead express the operation as

Update (sum: +mysum, +1)

on a record called sum containing two 0s initially. When the second field reaches n, we know the wanted total is available since all the processors have made their contributions. Hence,

Match (sum: ?, ==n)

says we wait until the record contains n in the second field, at which point we ask for the value of the first field. There is no specification of whether sum is in shared memory or not, nor whether updating is done system wide or one value at a time. In other words, these commands are inclusive of the different architectures. While the implementation is architecture dependent, the notations do not need to be used differently for different machines. In general, however, different ways to use the tools are necessary because of the need to know the data distributions and to optimize for particular hardware. For example, it may be that the system is made up of subgroups of processors which have fast links within the groups, and we would add up the subsums within each group, before the groups add up the group totals into an overall total. We would still use Update and Match functions, but with different parameters (and the "group leaders" will perform Update/Match twice, once within and once between groups). Such architecture information however makes the code no longer architecture independent. In some way or other, architectural information like this has to appear in the program to guide the compiler or the runtime system to map the wanted actions onto the available hardware units. Such information may be in the algorithmic steps themselves, or it could be in some hardware declaration that is added to architecture independent source code when it is used for a particular machine. This assumes however that the notation used in the algorithmic description already "includes" different architectures in some way, so that the additional information about each architecture can be applied to produce different running versions.
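As a purely illustrative rendering, and not the realization of the notation developed later in this chapter, Update and Match on a two-field record could be mimicked on a shared memory machine roughly as follows; the struct layout and function names are assumptions of this sketch.

/* Rough shared-memory mimicry of the Update/Match notation (illustrative only;
 * the chapter's own realization of these primitives may differ).              */
#include <pthread.h>

typedef struct {
    double value;                  /* first field, e.g. the running sum        */
    int    count;                  /* second field, e.g. contributor count     */
    pthread_mutex_t m;             /* assumed initialised before use, with     */
    pthread_cond_t  changed;       /* value = 0 and count = 0                  */
} record;

/* Update (rec: +dv, +dc): atomically add to both fields and wake any waiters. */
static void update(record *r, double dv, int dc) {
    pthread_mutex_lock(&r->m);
    r->value += dv;
    r->count += dc;
    pthread_cond_broadcast(&r->changed);
    pthread_mutex_unlock(&r->m);
}

/* Match (rec: ?, ==n): wait until the second field equals n, return the first. */
static double match_eq(record *r, int n) {
    pthread_mutex_lock(&r->m);
    while (r->count != n)
        pthread_cond_wait(&r->changed, &r->m);
    double v = r->value;
    pthread_mutex_unlock(&r->m);
    return v;
}

On a distributed memory or collective-function machine the same two calls would instead be implemented with messages or a system wide reduce, which is precisely the inclusiveness being argued for.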
In short, like ethnic diversity in a society, we must first accept the differences, before we can come together again behaving in an ethnically aware fashion, rather than claiming to be ethnically blind. Accordingly, the next three sections will set out the differences before the subsequent three work to bring things together.
5.2 Concurrency

5.2.1 Threads and Processes
Two execution activities are said to be concurrent if there are no timing constraints between them, so that it does not matter which executes earlier or later, or if they execute at the same time on separate processors. Concurrency is thus closely related to parallelism, since we wish to divide a program's workload into separate execution activities that may be dispatched to different processors, with each executing earlier or later depending on local conditions. The parts of the program would not be completely concurrent however, since they work on the same problem and some parts use results produced by other parts, or different parts would combine their results into a larger aggregate. Since producing a result must occur before using it, some timing constraints would have to exist. On a distributed memory system, there is the additional question of how to move the result from where it is produced to where it is used. What are the units of execution activity? Concepts like thread, task and process need to be brought in. A thread is a sequence of instructions with a clearly identifiable execution environment, which contains the data it needs for its execution (including the results it produced in past execution, the activation record). A thread and its environment together make up an execution package that can be dispatched to a processor. That is, the data which the thread needs are set up in the processor's memory and registers, and its program counter is pointed to the thread's instructions, in order to cause the fetching of the instructions into the execution pipeline of the processor to work on the data. Such a package is variously called a closure, a continuation or a task depending on the particular context, but we will not discuss these concepts further here. For a simple example, all the instructions inside a subroutine may constitute a thread, and its execution environment is just the data defined internally in the subroutine, plus its calling scope, which are the global variables that the subroutine is entitled to have access to, whether statically according to the scope rules of the language, or passed dynamically via addresses as call parameters. An executing program may invoke some parallel language feature or operating system tool to request that a thread be "spawned", i.e., that the instructions and their environment be sent for execution. When the operating system accepts such a package and gives it permission to be an execution activity, we say that "a process has been attached to the thread". A processor may have several such packages
in its memory, or multiple processes, taking turns to execute via interrupts from an interval timer, in order to meet the service requirements of several users. The arrangement is also useful for the purpose of ensuring that if some of the processes have to temporarily suspend execution to wait for data from elsewhere, there will always be some other processes available to execute and the processor is kept busy. When a process makes a procedure call that has a significantly different execution environment, we could say that the process is detached from a thread and passed to another thread, but if the two sequences of code share the same environment, then they may simply be regarded as parts of a single thread when one part calls the other. In a distributed system, processors send each other messages. If one processor executes a receive for a wanted message, but the sending processor has not yet executed the send, then the receiving process would suspend execution, letting another process take over, until the wanted message arrives to allow the suspended process to be resumed. In this way, message passing can have the effect of establishing timing constraints between concurrent execution activities: the code after the receive in one thread must execute later than the code before the send in the other thread. Similarly, when n processes contribute to a common result (similar to the summation program shown in subsection 1.3) which everyone wants to use in further work, the code after the n message receive operations must execute later than the code before the n send operations, and this is said to constitute an n-way barrier. In a shared memory system, it is a little more tricky to establish such timing constraints: we know that results from other processes are written into some shared memory locations for us to access, but how can we tell whether the locations have already received these results or not? Whereas in a distributed system, the receptions of messages and inter processor signals enforce such supply-demand relations, here we require separate synchronization constructs to establish equivalent timing constraints.
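One concrete way to see the timing constraint a receive imposes is a Unix pipe between two processes: the child's read blocks until the parent writes, so everything after the read necessarily executes after everything before the write. This is only an illustration of the ordering argument, not a construct used later in the chapter.

/* Receive-after-send ordering illustrated with a Unix pipe (illustration only). */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    pipe(fd);
    if (fork() == 0) {                 /* child: the "receiver"               */
        char buf[8];
        read(fd[0], buf, sizeof buf);  /* suspends until the parent has sent  */
        printf("child: this runs after the send\n");
        _exit(0);
    }
    /* parent: the "sender" */
    sleep(1);                          /* code "before the send"              */
    write(fd[1], "go", 2);             /* the send                            */
    wait(NULL);
    return 0;
}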
5.2.2 Exclusion and Synchronization
To start the explanation with a simple case: two sections of code in two separate threads may in some way conflict, and they cannot be allowed to execute at the same time. For example, both threads wish to increment a shared variable:

load  R0,x
inc   R0
store R0,x
However, if the second processor loads x immediately after the first has done the load but before the store, then both would obtain the same value, and the second store would overwrite the first, leaving x incremented once only. Similarly, if one process executes

x = x + 1
y = y + 1
x and y being equal at the start, so that they remain equal after the updates. However, if a second process reads the values of x and y, as in "if x==y then ...", between the two updates, the two values would not be equal, leading to incorrect execution subsequently. Similarly, if the second process has started to fetch x and y for a compare, the first process must refrain from changing either variable until the other process has obtained both old values. Hence, in these examples, each code section is "critical" and must be "mutually exclusive": if one process enters its critical section, the other process must not enter its critical section until the first process exits. Basically, the data content created by a process inside a critical section is not guaranteed to be consistent and must not be visible to other processes that use the content in a particular way. Note that the critical section in one thread may or may not contain the same code as the other thread: they do in the first example, but not the second. Mutual exclusion is enforced by implementing a lock construct: before you enter your critical section, you must first pass through a particular door and lock it; the other process, attempting to enter its critical region, would not be able to pass/lock the same door as it is already locked, and the process is stopped. When you finish your critical section, you unlock the door, causing the process waiting at the door to be resumed; it then passes and locks the door to prevent other processes entering their critical sections until it exits. Note that locks must be maintained by the operating system or concurrent language runtime system and can be quite expensive in system overheads, and for the sake of economy, the same lock might be shared for several related or unrelated critical regions. If any one region is in execution, none of the others can be entered, even though not all the regions would conflict. The mutual exclusion of non conflicting regions because they share a lock is termed false exclusion. If all the critical regions are very short and all the processes spend very little time inside them, then false exclusion rarely occurs and causes no practical problem. If however the regions are long, then separate locks will be needed for different groups of conflicting regions and only true exclusion is enforced. If a lock is closed while an item of data is being produced, opened when the item is available, and the data user performs a lock before using the data, we would ensure that data use would not start until after data supply is done - if the user tries the locking operation before the data is ready, its locking fails and the process is suspended. If the data user keeps the lock closed while using the data, it would also prevent further changes to the data until it has finished. However, we shall later see that there are some inadequacies with this arrangement and also suggest how things need to be done better. Now consider the case of two processes each producing part of the information which both need in later processing:

Process A:      Process B:
Produce x       Produce y
Use x & y       Use x & y
If A runs faster than B, then it must wait for B to finish y before starting to use x and y; similarly, if B runs faster than A, it must wait for A to catch up before entering the second step. So the faster one must wait for the slower one to catch up before going further, which is why the execution relation of the processes is called synchronization. Recall the vector summing program: the result sum becomes available only after n processes have all made their contributions. If they all need the sum in later processing, then none of them can start using the sum until all n have finished the earlier step. So the faster n-1 processes must wait for the slowest to catch up before proceeding further. This synchronization among n processes is an n-way barrier. We could consider a barrier to be mutual exclusion between any of the n processes' earlier parts and any of the later parts, an n x n-way exclusion. The theoretical significance of such relations is that one kind of timing constraint can be derived from another, and a system only has to provide a single kind of timing constraint to allow all the derived timing constraints to be programmed, but we will not discuss this issue further here.
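An n-way barrier of this kind can be built from little more than a counter protected by a lock; the following is a hedged sketch using POSIX threads primitives rather than the notation developed in this chapter. The fields are assumed to be initialised with n set to the number of participants, the other counters zeroed, and the mutex and condition variable initialised.

/* Counting n-way barrier: the first n-1 arrivals wait, the n-th releases all.
 * Generation counting lets the same barrier be reused safely (a sketch).      */
#include <pthread.h>

typedef struct {
    int n, arrived, generation;
    pthread_mutex_t m;
    pthread_cond_t  all_here;
} barrier_t;

static void barrier_wait(barrier_t *b) {
    pthread_mutex_lock(&b->m);
    int gen = b->generation;
    if (++b->arrived == b->n) {            /* last one in: release everybody   */
        b->arrived = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        while (gen == b->generation)       /* earlier arrivals wait            */
            pthread_cond_wait(&b->all_here, &b->m);
    }
    pthread_mutex_unlock(&b->m);
}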
5.2.3 Atomicity
How are locks, exclusion and barriers implemented in computer programs? Can we simply have a lock variable which can take on the values Open and Closed, and write:

Lock:   If ALock==Open Then ALock = Closed
        Else { Inc WaitCount; Suspend };

Unlock: If WaitCount>0 Then { Dec WaitCount; Unsuspend someone }
        Else ALock = Open;

But this would produce exactly the same problem which mutual exclusion is meant to solve: after one process discovers the lock to be Open, but before it turned the value to Closed, another process reads the lock value, finding it to be Open, and also decides to turn it to Closed. Both processes would then proceed to execute, instead of only the first one. In effect, the value of the lock should already be Closed when the second process reads it, but was not because it takes two instructions to achieve that; the lock value between the two instructions is inconsistent with what it is meant to be and should not be seen by other processes. We express this situation as "we need atomicity for the lock (also the unlock) operation", i.e., the set of instructions must exclude other operations on the same lock variable. So we have a circular situation: to give atomicity to other programs by providing
them with a lock/unlock facility, we must first have atomicity in the lock/unlock operations themselves. There is a way (called Dekker's algorithm) to achieve this atomicity by clever programming, but more commonly it is done through CPU hardware support: most processors have a Test and Set instruction which copies the value Closed into a lock variable while at the same time bringing back the lock's previous value, i.e., a two way copy in one cycle. If the previous value was Open, it changes to Closed without leaving a gap between seeing the value Open and making it Closed, during which another process could see the Open value. So the lock operation is programmed as

Loop: Test&Set ALock
      If Closed Goto Loop
      ...critical section...
      ALock = Open

so that the second processor trying to lock would execute in a wait loop until the first processor changes the lock value back to Open, at which point processor 2 would be able to continue. It is however undesirable to leave an unsuccessful lock attempt looping, repeating the attempts while waiting for the lock to open, since this keeps the processor busily executing while actually doing nothing. A preferred way is to use operating system support to suspend execution:

Loop:  Test&Set ALock
       If Closed Goto Loop
       If BLock==Closed Then { Inc WaitCount; ALock = Open; Suspend }
       Else { BLock = Closed; ALock = Open }
       ...critical section...
Loop2: Test&Set ALock
       If Closed Goto Loop2
       If WaitCount > 0 Then { Dec WaitCount; Unsuspend someone }
       Else BLock = Open
       ALock = Open

So the wait loop is executed only while another process is testing or changing BLock; whether BLock is Open or Closed, ALock is reset to Open quickly to allow another process to come in to test BLock and decide whether it should suspend or continue. Note that because of the protection provided by ALock, testing BLock and making it Closed need not be atomic, since no second process can come in until ALock is Open again.
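On current processors the Test and Set step is exposed to C programs through atomic operations; the following is a minimal spin-lock sketch using C11 <stdatomic.h>, shown only for comparison with the pseudocode above.

/* Spin lock built on an atomic test-and-set, as provided by C11 <stdatomic.h>. */
#include <stdatomic.h>

static atomic_flag a_lock = ATOMIC_FLAG_INIT;

static void lock(void) {
    /* atomic_flag_test_and_set returns the previous value: true means the flag
     * was already "Closed", so keep trying (compare Loop/Goto Loop above).     */
    while (atomic_flag_test_and_set(&a_lock))
        ;                                  /* busy wait */
}

static void unlock(void) {
    atomic_flag_clear(&a_lock);            /* ALock = Open */
}

A production lock would, as the two-lock pseudocode above does, fall back on the operating system to suspend a waiter instead of spinning indefinitely.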
5.2.4 Monitors and Semaphores
Let us consider a more complex problem: a number of processes share a common data structure like a queue or a stack, which can be used by at most one process at any moment, but only when it is in the right state, e.g., you cannot dequeue from an empty queue. That is, a lock on a critical region to use a stack/queue should only succeed conditionally. Such a control structure is called a monitor:

Monitor: Test&Set ALock
         If Closed Then Goto Monitor;
         If BLock==Open Then { BLock = Closed; ALock = Open }
         Else { Inc Q1Count; ALock = Open; Suspend (Q1) }
         Check data structure;
         If CanExec == False Then { Inc Q2Count; BLock = Open; Suspend (Q2) }
         ...critical region...
Loop:    Test&Set ALock
         If Closed Goto Loop
         If Q2Count > 0 Then
            { Check data structure;
              If CanExec == True Then { Dec Q2Count; New = False; Unsuspend (Q2) }
              Else New = True }
         Else New = True;
         If New Then
            If Q1Count > 0 Then { Dec Q1Count; Unsuspend (Q1) }
            Else BLock = Open;
         ALock = Open;

A monitor allows only one process to use the shared structure at any moment, but a structure like a queue stored in a circular buffer can be safely processed at both ends, provided the buffers are not all full, which means the queue can only be processed at the dequeuing end, nor all empty, when it can only be processed at the enqueuing end. That is to say, dequeuing is always possible when the number of filled buffers is greater than 0, and queuing when the number of empty buffers is not zero. These constraints may be enforced using counting semaphores.
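Before turning to semaphores, it may help to see that POSIX threads package essentially the same pattern as a mutex plus condition variables. The sketch below shows a monitor-protected shared stack in which the remove operation waits until the structure is in the right state; it is an illustration of the idea, not the scheme used in this chapter.

/* Monitor-style conditional critical region with POSIX threads: at most one
 * process inside, and pop blocks while the stack is empty (a sketch).        */
#include <pthread.h>

typedef struct {
    int items[64];
    int count;
    pthread_mutex_t m;         /* plays the role of the monitor lock          */
    pthread_cond_t  can_exec;  /* "CanExec" becomes true: stack not empty     */
} mon_stack;

static int mon_pop(mon_stack *s) {
    pthread_mutex_lock(&s->m);            /* enter the monitor                */
    while (s->count == 0)                 /* check data structure             */
        pthread_cond_wait(&s->can_exec, &s->m);   /* Suspend (Q2)             */
    int v = s->items[--s->count];         /* critical region                  */
    pthread_mutex_unlock(&s->m);          /* leave the monitor                */
    return v;
}

static void mon_push(mon_stack *s, int v) {
    pthread_mutex_lock(&s->m);
    s->items[s->count++] = v;             /* (overflow handling omitted)      */
    pthread_cond_signal(&s->can_exec);    /* Unsuspend (Q2)                   */
    pthread_mutex_unlock(&s->m);
}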
Suppose we use N buffers for the queue elements. Pointer F refers to the first filled buffer, and E to the first empty buffer behind the last queue element. Each time F or E is used to remove/add an element, the pointer is incremented modulo N. There are also two counters, Filled and Empty, giving the number of filled/empty buffers; each time a buffer is removed, Filled is decremented and Empty incremented, and the reverse when a buffer is added. A lock on the buffers as a whole is also needed to prevent multiple producers/consumers accessing them at the same time. Thus, consumers execute

Down (Filled)
BufLock
read from Buffer[F]
F = F + 1 Mod N
BufUnlock
Up (Empty)

and producers execute

Down (Empty)
BufLock
store into Buffer[E]
E = E + 1 Mod N
BufUnlock
Up (Filled)

If Down produces a negative value, then there are no buffers available for processing, and the process executing Down suspends. Hence the routine is coded as follows:

Down: Test&Set DLock
      If Closed Goto Down
      Filled = Filled - 1;
      If Filled < 0 Then { DLock = Open; Suspend (Qfilled) }
      Else DLock = Open

Up:   Test&Set ULock
      If Closed Goto Up
      Empty = Empty + 1;
      If Empty <= 0 Then { ULock = Open; Unsuspend (Qempty) }
      Else ULock = Open

Down (Empty) and Up (Filled) are coded in the same way.
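For comparison, the same Down/Up discipline is available directly as POSIX counting semaphores; a sketch of the producer/consumer pair above in C (the buffer size N and the int element type are illustrative assumptions):

#include <semaphore.h>
#include <pthread.h>

#define N 16

static int buffer[N];
static int F = 0, E = 0;                   /* first filled / first empty  */
static sem_t filled, empty;                /* counting semaphores         */
static pthread_mutex_t buflock = PTHREAD_MUTEX_INITIALIZER;

void init(void)  { sem_init(&filled, 0, 0); sem_init(&empty, 0, N); }

void produce(int v)
{
    sem_wait(&empty);                      /* Down (Empty)  */
    pthread_mutex_lock(&buflock);          /* BufLock       */
    buffer[E] = v;  E = (E + 1) % N;
    pthread_mutex_unlock(&buflock);        /* BufUnlock     */
    sem_post(&filled);                     /* Up (Filled)   */
}

int consume(void)
{
    sem_wait(&filled);                     /* Down (Filled) */
    pthread_mutex_lock(&buflock);
    int v = buffer[F];  F = (F + 1) % N;
    pthread_mutex_unlock(&buflock);
    sem_post(&empty);                      /* Up (Empty)    */
    return v;
}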
To summarize, we have studied the various concurrency constructs, which have generally been designed from the perspective of shared memory, heterogeneous parallelism. Because of this background, efforts to extend their use to distributed or homogeneous systems have usually resulted in inconvenient and inefficient programming. In the following chapter, we shall take a very different starting point and see where this leads us.
5.3  Data Parallelism

5.3.1  Vector Processors
A vector processor achieves a particular form of parallelism by sending long arrays through arithmetic pipelines; e.g., the high level language might provide commands like c[1..N] = a[1..N] + b[1..N] to specify a whole-array operation. When executed through a pipeline, different stages of the pipe process different pairs of elements of a and b concurrently, so that only N+M-1 cycles are taken to process the arrays with an M-stage pipeline: the first result emerges after M cycles, the second in cycle M+1, etc., and the N-th result in cycle M+N-1, compared to M cycles to process a single pair of elements and MN cycles to process N pairs of individually computed elements. Provided N is large enough to "fill the pipe", a substantial speedup of nearly MN/(M+N-1) is achieved at a modest cost, as an M-stage pipeline is usually only slightly more complex and slower than a monolithic arithmetic unit. Note that the beneficial effect of parallelism comes from the sufficiently large size of the data. Similarly, in SIMD and SPMD machines, taking advantage of the parallelism provided by the hardware depends on the parallel nature of the data; hence the term data parallelism, in contrast to the control parallelism that arises from the structure of the algorithm by extracting out independent parts. Suitable choice of algorithms is still necessary in data parallelism, however. For a simple example, consider a matrix to column vector multiplication: c = Ab. If we take one row of matrix A and send it into a multiply pipeline with vector b, producing a column of N products, and then sum them by repeatedly sending them through the add pipeline, we would find ourselves processing a shorter and shorter vector with each pass, finally producing a single value of the sum. This would not take good advantage of data parallelism by filling the pipeline. If instead we take a column of A, multiply it by a single element of b, and accumulate the results, i.e., we use the formula
c = A[1..N,1]b[1] + A[1..N,2]b[2] + ...

which produces c after N passes through each of the multiply and the add pipelines, always filling them with vectors of size N. Further, results from the multiply pipe can go into the add pipe as soon as they start to emerge, without having to wait for the whole multiply output to be ready. This is called the chaining approach, and in effect the method has 2M arithmetic operations going on in parallel in the 2M stages of the two pipelines. So a much better utilization of the arithmetic units results from computing with columns instead of rows. Extending the idea, to multiply two matrices C = AB, we do the above for each column of B to produce the columns of C successively:

C[1..N,1..N] = 0;
For j = 1 To N Do
    For k = 1 To N Do
        C[1..N,j] = C[1..N,j] + A[1..N,k]*B[k,j];

Now consider another example, Gaussian elimination. For each outer iteration, say iteration i, we choose a pivot row with the largest leading element in terms of absolute value, and swap it with row i; a suitable multiple of this row is then subtracted from each of the other rows to make their leading elements zero. After N-1 steps, the last row has only one non-zero coefficient, and x[N] can be computed. It is then used to compute x[N-1], which allows x[N-2] to be computed, etc.

For i = 1 To N-1 Do
  { p = MaxIndex (Abs (A[i..N,i]));
    Swap (A[i,i..N+1], A[p,i..N+1]);
    For j = i+1 To N Do
      { r = A[j,i]/A[i,i];
        A[j,i+1..N+1] = A[j,i+1..N+1] - A[i,i+1..N+1]*r } }
x[1..N] = A[1..N,N+1];
For i = N Downto 2 Do
  { x[i] = x[i]/A[i,i];
    x[1..i-1] = x[1..i-1] - A[1..i-1,i]*x[i] }
x[1] = x[1]/A[1,1];

The program can be easily extended to handle matrix inversion: instead of a single right hand side column, there are N of them making up a unit matrix (i.e., each row of A has 2N rather than N+1 elements, with A[i,i+N] = 1 and A[i,j] = 0 for j from N+1 to 2N except j = i+N). The coding is left as an exercise for the reader. The programs look easy to write, because all the data are in a shared memory, so that fetching a column or a row is equally convenient. In actual vector machines,
array elements first have to be moved into vector registers, and are then pumped through arithmetic pipelines from there and back; programming effort often goes more into getting the right data movement between the memory and the vector registers than into the actual computation. Similarly, single instruction, multiple data stream machines need much effort to redistribute the data as execution proceeds.
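As a plain C rendering of the column-oriented computation (ignoring vector registers and treating the inner loop as the pipelined, vectorizable operation), one might write the following sketch; row-major storage and 0-based indexing are assumptions made here:

void matvec_by_columns(int n, const double A[], const double b[], double c[])
{
    for (int i = 0; i < n; i++)
        c[i] = 0.0;
    for (int k = 0; k < n; k++)        /* one column of A per pass          */
        for (int i = 0; i < n; i++)    /* full-length, vectorizable loop    */
            c[i] += A[i*n + k] * b[k];
}

Each pass sweeps a full column of A against a single element of b, which is what keeps a vector pipeline filled.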
5.3.2  Hypercubes
A hypercube consists of 2**n nodes in which each node is connected to n other nodes, chosen such that the indices of two linked nodes differ in one bit only. Thus, in an 8-node hypercube, node 0 is connected to nodes 1, 2 and 4, while node 7 connects to 6, 5 and 3. With node 6 connecting also to 2 and 4, node 5 to 4 and 1, and node 3 to 2 and 1, we have a complete cube with 12 edges. Generally, a hypercube has n x 2**(n-1) edges, and the longest distance between two nodes is n edges. Both the cube's total complexity and the longest distance between two nodes increase slowly with system size, which is why the structure is commonly adopted. Another advantage of the structure is communication redundancy: if you want to go from node 0 to node 1111..., you can choose any of n directions, which have the same distance n to the final destination; at the next step, you have a choice of n-1 equally distant directions, etc. Only the last step has no equal choice. Hence, if any link is busy or broken, good alternatives to reach the destination can usually be found. The hypercube structure is entirely generic, and can work in different parallel processing contexts; it is possible to have a shared memory system in which a processor at one node can use hypercube links to access memory at other nodes, or a heterogeneous system in which various nodes execute asynchronously, exchanging information with messages sent across the links. But because the systems are normally large, it is more likely that the nodes act with some synchronism, with system wide commands to combine information, and memory is distributed. In a large machine, system wide commands usually require special hardware support to be performed quickly, but hypercubes allow such operations to be done recursively in order n steps on a size 2**n system. Consider the summing of 2**n results contributed by all the nodes, with the final result distributed back to everyone. Each node i performs:

For k = 2**(n-1) Do
  If i >= k Then { Send (i-k, mypart); Exit }
  Else { Receive (i, hispart);
         mypart = mypart + hispart;
         k = k/2;
         If k==0 Then { Sum = mypart; Send (1, Sum); k = 1; Exit } }
If i > 0 Then Receive (i, Sum);
For k = 2*k Do
  If k == 2**n Then Exit
  Else { Send (i+k, Sum); k = k*2 }

Note that all the sends/receives are over one link only. This is because i+k or i-k always differs from i in one bit only, as k is a power of 2 and it is subtracted from i = k, k+1, ..., 2k-1 but added to i smaller than k, i.e., a 1 is added to a 0 digit or removed from a 1 digit. The above is an example of a logN algorithm: to compute the sum of 2**n numbers, up to n iterations are performed; as with the complexity measures of a hypercube, in such an algorithm for a problem of size N, the total number of execution cycles increases with logN. Generally, problems below logN cycles are rather trivial, such as adding two arrays in corresponding positions, which can be done in one cycle if we have enough arithmetic units, while problems much more complex than logN algorithms, say with execution cycles increasing with powers of N or even exp(N), are difficult to solve when N is large. Finding logN and similar complexity algorithms is therefore a highly desired goal. We see however that, though the above problem is conceptually simple (adding up N numbers), the parallel program is, relatively speaking, already not simple. That is, finding algorithms fast enough to execute on parallel machines imposes demanding programming requirements. There are few parallel programming problems that are simple in an all-around sense and still useful.
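For a rough idea of what this looks like with a real message-passing library, the following sketch uses MPI and the symmetric "recursive doubling" variant, in which each node exchanges partial sums with its neighbour across one hypercube dimension per step. It is not the exact reduce-then-broadcast program above, but it also completes in n steps on 2**n nodes and every message crosses a single hypercube link (MPI_Init is assumed to have been called elsewhere):

#include <mpi.h>

double hypercube_sum(double mypart, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);            /* assumed to be a power of two */

    double sum = mypart;
    for (int k = 1; k < size; k *= 2) {
        int partner = rank ^ k;            /* differs from rank in one bit */
        double hispart;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &hispart, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        sum += hispart;                    /* combine the two partial sums */
    }
    return sum;                            /* every node ends with the total */
}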
5.3.3  PRAM Algorithms
To produce parallel algorithms, a problem is analyzed in order to break it into independent or near-independent parts that execute separately with limited communication between them. This design process usually makes some assumptions about what data sharing mechanisms are available, and even about the relative cost of different mechanisms. For example, a large set of common algorithms were designed according to the Parallel Random Access Machine (PRAM) model, which assumes that multiple processors execute in step and share a common memory, though each has some local storage such as registers or explicitly addressed local memory (i.e., not just a cache, since programs do not access data in a cache using cache addresses but using main memory addresses, so that there is no distinct cache address for the program to use). Certain cost figures are assumed for the common classes of instructions, in addition to rules concerning memory read/write operations on the same location from different processors, which require timing constraints to be defined and their costs assumed (more on this topic in the following section). This allows different versions of an algorithm, or the same algorithm under different execution conditions, to be compared in terms of a common cost model, both for the purpose of algorithm
optimization and for hardware design to support the execution of frequently needed algorithms. Despite the very specific architecture assumptions, PRAM algorithms are extremely useful for parallel programming generally, because most of them can be slightly modified to produce versions for other architectures, as we will see later in a commonly used example. Further, the recursive divide-and-conquer technique frequently used in PRAM algorithm design bridges to some extent the gap between homogeneous and heterogeneous systems: if recursion results in the creation of a number of identical subprograms, these can be run on different processors of a homogeneous system; at the same time, the process of recursion itself can be started on a single processor of a heterogeneous system to spawn off multiple processes; in other words, what is "design" on one becomes "implementation" on the other. With the recursively spawned processes themselves spawning further child processes, logN behaviour is achieved for the initial spawning, and for the eventual closing down of processes and returning of results, so that the parallel program's start and end overheads increase slowly with problem size N. The cost of the algorithm steps themselves does not always behave so nicely, however, because of the replicated execution of overlapping parts on the different processors. In the previous section we showed a logN program for summing N values on a hypercube machine. To illustrate the design process, we note that

sum (x[1..N]) = sum (x[1..N/2]) + sum (x[N/2+1..N])

which can be spawned to two different processors to perform; each can further divide the work between two processes, and so on, eventually resulting in a program requiring N/2 processes and 2logN clock cycles, with logN add steps and logN send/receive steps. Here we have a simple recursive process that may be actually executed, or may merely be used to design the algorithm. Let us take a somewhat more complex problem, computing the prefix sums of an array:

s[1] = x[1]
s[i] = s[i-1] + x[i], i = 2..N.

That is, each s contains successively more elements of x. Applying recursion, let us assume that we have already produced the prefix sums of x[1..N/2] and x[N/2+1..N] separately; that is, we already have s[1..N/2] as well as s'[N/2+1..N] with

s'[N/2+1] = x[N/2+1]
s'[i] = s'[i-1] + x[i], i = N/2+2..N

Clearly, s is produced from s' by adding s[N/2] to every element of s':
s[i] = s'[i] + s[N/2], i = N/2+1..N

This is a simple recursive process, and it also produces a logN algorithm. However, for most cases it is not a good algorithm. On a shared memory system, we would have N/2 processors all trying to read s[N/2] from the same memory location in order to add it to N/2 separate locations; in a hardware implementation, the arithmetic unit producing s[N/2] has to send it to N/2 other arithmetic units, which requires the signal to be amplified to supply a strong enough input to so many recipients (the "large fan-out" problem). On a distributed machine, the processor holding s[N/2] has to broadcast it to N/2 other processors, which can usually be done efficiently by recursively distributing the work in a hypercube structure. However, in all these situations, distributed communication would be preferable to start with. Instead of a single source sending to multiple destinations, it would be better to have multiple sources send to multiple neighbouring destinations, with senders and recipients close to each other. Instead of passing information from one section to another all at once, we use a different design in which information is passed a little bit at a time: first add x[i] to x[i+1] for every i (x[i] for i < 1 is assumed to be 0), producing an array containing sums of two; then add each value to the value two steps away, producing sums of four; then add each value to the value four steps away, producing sums of eight; etc. That is, in each clock cycle we go twice as far as before, to find twice as much information. In a shared memory system with 2**n = N processors we have the following program running on each processor, with index i:

For k = 1 To N/2 Do
  { If i > k Then x[i] = x[i] + x[i-k]
    Else Exit;
    k = 2*k }

while in a distributed machine
For k = 1 To N/2 Do
  { If i <= N-k Then Send (i+k, x);
    If i > k Then { Receive (i, y); x = x + y };
    k = 2*k }

Note that though the send/receive nodes have indices differing by 1, 2, 4, ..., they are not necessarily separated by one edge in a hypercube, since that depends on whether the 1 digit in k is added to a 0 in i.
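A sequential C simulation of the doubling scheme may make the data movement clearer. The inner loop is the conceptually parallel step, and a second array is used so that every "processor" reads values from the previous round, as the lockstep model assumes (0-based indexing and a caller-supplied scratch array are assumptions of this sketch):

#include <string.h>

void prefix_sums(int n, double x[], double tmp[])
{
    for (int k = 1; k < n; k *= 2) {
        for (int i = 0; i < n; i++)              /* conceptually parallel   */
            tmp[i] = (i >= k) ? x[i] + x[i - k] : x[i];
        memcpy(x, tmp, n * sizeof(double));      /* end of one round        */
    }
}

After ceil(log2 n) rounds, x[i] holds the inclusive prefix sum of the original x[0..i].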
We see that while the recursive program itself may not be practical, it provides us with ideas to produce a better PRAM version as well as other versions suited for homogeneous, distributed systems, including one that could be used on a hypercube machine. There are also a number of further algorithms that can be "derived" from the prefix algorithm. For example, suppose we have a list in which each element has a pointer to the next element only, but we want to know how far each element is from the end of the list, i.e., a list element enumeration. The following is the program:

If Null (next) Then i = 0 Else i = 1;
For k = 1 Do
  { If Null (next) Then Exit
    Else { i = i + next.i; next = next.next } }

The further away an element is from the end, the longer it executes before reaching the null next pointer, and the larger is the accumulated value i. With each iteration, one goes to an element twice as far away as the one reached last time, and obtains its accumulated value i showing how far it is from the end; by accumulating this information, each element discovers how far it itself is from the end. Another example: remove all blanks from a string of text, moving the non-blanks leftward to take up the room. The program is

If x[i] == " " Then v[i] = 0 Else v[i] = 1;
For k = 1 Do
  { If i > k Then v[i] = v[i] + v[i-k]
    Else Exit;
    k = 2*k }
If x[i] <> " " Then x[v[i]] = x[i]

That is, since blanks are given value 0, v[i] will be the sum of all the 1's, i.e., the number of non-blanks up to and including x[i], and indicates where x[i] should go if all the blanks are removed. Somewhat more elaborate problems like "move all lower case letters to the right end and upper case letters to the left end maintaining the order within each group" or "remove all but one blank from each string of consecutive blanks" are left as an
exercise for the reader. Distributed memory versions of these algorithms are also possible but rather more complex. In fact the overheads of data sharing become so high that one would not consider parallelism worthwhile for such cases at all. Even with such small examples, we already see the divergence between shared memory and distributed systems, just as the gap between homogeneous and heterogeneous parallelism is wide.
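Returning to the blank-removal example for a moment, a sequential C rendering of the shared-memory version is sketched below; the inner loops correspond to the parallel steps, and the in-place prefix pass works only because it sweeps indices downwards (0-based indexing and a caller-supplied v array are assumptions of this sketch):

void remove_blanks(int n, char x[], int v[])
{
    for (int i = 0; i < n; i++)
        v[i] = (x[i] == ' ') ? 0 : 1;
    for (int k = 1; k < n; k *= 2)            /* doubling prefix sums       */
        for (int i = n - 1; i >= k; i--)      /* downward sweep: v[i-k] still */
            v[i] += v[i - k];                 /* holds the previous round value */
    for (int i = 0; i < n; i++)
        if (x[i] != ' ')
            x[v[i] - 1] = x[i];               /* v[i] is the 1-based target  */
}

The final overwrite is safe because v[i]-1 never exceeds i and each x[i] is read only at step i.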
5.4  Memory Consistency

5.4.1  Shared Memory System
In a shared memory system, two processors can attempt to read/write the same memory location concurrently. This raises the question: what are the timing constraints between the concurrent accesses, and how are these constraints enforced? It might seem simple to say "first come, first served", but things are not so simple. First, processors have caches, and an item of shared data may have multiple copies being accessed separately at any time. What timing constraints are enforced by the system between these copies? Second, a piece of data may need to be processed in a sequence of steps before the final result is used by others, and it would not be useful, and indeed would be wasteful or harmful, to let others share the intermediate results. Finally, the processing may involve several related items of data, which must either be seen together or not seen at all, i.e., they must be put into a consistent state before they are shared. Therefore, the write results of a processor may be hidden from other processors until a convenient moment when the new data are allowed to become visible. In the meantime, the other processors would only see old data unaffected by the new writes. But there must be some clearly defined rules about when the results become available; in other words, we need system-enforced timing constraints between the concurrent read/write operations. These rules constitute the memory consistency model. Virtually all shared memory machines make it possible to provide the sequential consistency model at the hardware level: a read by any processor sees the results of the writes of all the processors up to that point in time, regardless of where the reads and the writes occur. This is achieved by enforcing what is called a cache coherence protocol between the processors, e.g.,

1. Write Broadcast: Whenever a processor writes into its own cache, the same information is sent to all other processors so that, if they have a copy of the same data, their copy is modified.

2. Writethrough Invalidate: When a processor writes into its own cache, it asks all other processors sharing the data to delete their copy, so that if they access the data in the future, the cache hardware will obtain the new copy. Note that this reduces the wastage of updating copies which may or may not get used, or of updating repeatedly when only the last value is used, but if you do use the new result, then you pay the cost of a cache miss. Also, the method assumes a
write-through cache, i.e., new cache values are immediately transmitted to memory so that the processors with invalidated copies can get new copies from the memory.

3. Writeback Invalidate: A cache write invalidates other copies without immediately generating a memory write, but when another processor has a cache miss on the same data, the processor holding the modified copy prevents a memory access on that address until the copy has been written back, so that the new copy is retrieved by the other processor. This requires each processor's cache coherence hardware to watch out for reads by other processors and to determine from the address whether a read is on a memory location it has modified earlier but not written back. In contrast, in protocol 2 each processor only has to watch out for writes by other processors and determine whether these invalidate something in its own cache.

We see that in the first method, new information is "eagerly" sent to others, while the last method "lazily" keeps the information until it is actually needed, and the second method tries something in between. Many other minor variations are possible, all resulting in the same behaviour of sequential consistency but with different cost behaviours, one of which might be best for a particular program behaviour. Overall, the Writethrough Invalidate protocol is probably the most common. Sequential consistency is sometimes described as a "strong" memory model, though actually the "strength" is quite limited: the system cannot guarantee that the write results seen by your read are from the "right" writes; the particular write you want may not have occurred because another processor was slow, or a write you do not want to see may have already occurred because another processor happens to be fast and has already overwritten the content you wanted to read. Also, the past writes may or may not have occurred in the right order because of unpredictable timing differences in the different processors. The cache coherence hardware has no control over how fast or slow all the processors execute, and the guarantee is merely that you do see all the writes that have occurred so far, i.e., that the other processors' changes are not hidden from you in their caches. However, once we have cache coherence and sequential consistency, we can provide a stronger guarantee by implementing locks and atomic regions. For example, if reads and writes take place in a critical region, then if you attempt to read something being written at around the same time, but just before the write has occurred, you will see the new value, because your read will be delayed until the writer exits to allow you to enter your critical region, though this is still only a weak guarantee. Another possibility is to adopt synchronous processing: all the processors use a common clock to drive their execution in a series of individual time periods, and at the end of each period the writes performed within the period will be shared, so that they are guaranteed to be seen by other processors from the next period onward. In other words, the consistency model only guarantees that a read will see all the writes up to the end of the previous step, not the writes within the current step. In a cache system, this requires all the writes to be saved within the individual caches until the end of each synchronized step, at the end of which memory updates take place system wide. This is called the weak consistency model, designed for synchronous systems working in so-called "supersteps"
of compute-communicate-compute-communicate... The reason sequential consistency is "not strong" lies in the fact that data are identified by their memory addresses, which contain no information about the state of the content; a stronger memory consistency requires read operations to describe data in a more specific fashion, but this implies some form of search and verification process, with read and write operations matching data supplies with the corresponding data demands. This makes read/write operations more costly; hence, a compromise must be struck whereby search and verification is done for blocks of data rather than individual locations. Pages of memory would be natural units for such memory sharing, but it is also possible to share and move around whole objects. By defining a set of rules on how memory content is shared and when copying over takes place, we define a new consistency model. Given a particular memory consistency model, we write our parallel programs in a particular way to ensure correct results on a system offering that model. Moving to another system offering a different model would normally require reprogramming. Multiple consistency models can be offered on the same hardware platform, in order to try different versions of the same algorithm and discover the one with the best performance. Until a single consistency model is widely accepted as the standard, there is no real program portability.
5.4.2  Tuplespace
The Linda tuplespace scheme provides a stronger memory model than sequential consistency. Tuples are multi-field records emitted by executing processes into a shared memory pool, and may be retrieved by presenting a matching search key. To produce a tuple, a process executes

Out (expr1, expr2, ..., exprN)

where each exprX is an expression producing a value, and the N values expr1 to exprN are stored together as a record in the shared space, while

In (expr1, ..., exprM, ?var1, ?var2, ..., ?varN-M)

or

Rd (expr1, ..., exprM, ?var1, ?var2, ..., ?varN-M)

retrieves a tuple containing the M values on the left, and then stores the tuple's remaining N-M values into the variables var1 to varN-M. An In removes the tuple, while a Rd reads its content without removing it, so that other processes can access the same tuple subsequently. Tuplespaces are easily implemented on shared memory systems by building a monitor that manages the tuple pool. A process calls the In/Rd/Out functions in the monitor to search/modify the pool, with In/Rd causing a search for an existing tuple containing a given set of values, which make up the search key; if no matching
tuple is found in the pool, the tuple request is added to the pool and the process is suspended. For an Out operation, if a matching request is found in the pool, then the waiting process is resumed after the data on the right of the new tuple have been copied to the variables listed on the right of the In/Rd request; otherwise the new tuple is simply added to the pool to wait for future requests; after either alternative the process exits from the Out and returns to its previous execution sequence. An In/Rd specifies the key of a tuple it requires, and waits for the tuple if the corresponding Out comes later than the In/Rd. The state of the data can be included as part of the search key, so that an In/Rd specifying a later state succeeds after an In/Rd that specifies an earlier state, regardless of the execution time of each request. In other words, the tuple scheme provides stronger timing constraints on related memory accesses. This simplifies application programming, but requires more costly system support to manage the tuples. In previous implementations, the number of tuples in a pool is usually variable; hence, the pool has to be organized as a hash table, with searching carried out in hash buckets. If the number of tuples is fixed instead, each can be given a specific location and the address used as part of the search key, with data state information given as additional key fields to be matched, so that a tuple retrieval is successful only if the tuple at the addressed location has the particular wanted content. That is, an In/Rd operation would now require a content match rather than a hash bucket search, and is generally faster. However, a number of unsatisfied requests may be queued at a tuple location waiting for its content to change, so an Out operation might require a search of queued requests, and it is necessary to avoid "overusing" a single tuple by sharing it among many concurrent users, as this might produce inefficiency. More desirably, a tuple should provide a "bilateral" form of sharing with a single producer and a single consumer, with the tuple's buffering smoothing out unpredictable timing between the two parties. Fixed-location tuples can be made to behave like locks: modifying the content of a tuple causes other processes to suspend or resume depending on their key match success, and normally memory is partitioned into separate areas for the different processes plus the tuple pool, perhaps with a data exchange area controlled by relevant tuple locks. If each tuple operation is followed by a number of data accesses on the exchange area, then the tuple access overhead is less significant relative to the total volume of access. In the following two sections we shall see a number of examples using tuples in this way. But first some more discussion of memory consistency.
5.4.3  Distributed Processing
As explained in subsection 1.2, a distributed system may allow a processor to copy over a block of remote memory through the system interconnect, or to invoke a message transfer to achieve the same effect. The two methods cause multiple copies of the same information to be created in different nodes and need to meet the same memory consistency requirements. However, there are some significant differences. As explained earlier, by acquiring a lock and entering a critical region or
monitor, a concurrent process can verify that a shared data block is in the correct state before accessing the content of the block. A shared memory machine with cache coherence support can implement such locks correctly, usually through Test and Set instructions executed by the concurrent processes on a shared lock variable. But if any processor can obtain a copy of the lock variable without going through a coherence maintenance procedure, then critical regions cannot be correctly implemented. That is, while cache coherence hardware enforces sequential consistency between multiple copies of data in caches, multiple copies in distributed memories require other mechanisms to maintain consistency. To illustrate: while processor A copies a block of memory containing the lock variable x and finds it "Open", another processor B may have done the same operation; after making its own copy, it too would find the lock variable Open, but it would be incorrect for both of them to set the value to "Closed" and then copy the blocks back, since this would cause both to be in their critical regions at the same time. Unlike the Test and Set instruction, which copies over the previous value and stores a new value in the lock variable in one step, in a distributed system the copying over, testing of the previous value and returning of the new value are separate steps, allowing an inconsistent operation from another processor to occur in between. The solution lies in some form of system support, similar to a tuplespace-style "In" operation which copies over a block of memory while removing it from access by others, turning accessibility back on only after the lock variable has been set to "Closed". More commonly, locking/unlocking is done by performing a remote supervisor call on the processor whose memory holds the lock variable. If a number of such calls arrive at a processor, they are responded to one at a time, each resulting in an atomic sequence of steps, namely: read the lock value; if Open then set it to Closed, else... This results in consistent operations between the different processors. The equivalent solution is message passing: a processor requiring data on another processor, including operations on a shared lock, sends a request message and waits for a reply; if a number of competing messages arrive from different processors, they are picked up one at a time, and the complete, consistent processing steps needed by each request are carried out before the next message is handled. A processor may not respond to a request immediately, since it does not usually simply stay idle while waiting for messages from others; instead, it has work waiting to be done in its task queue and tries to keep busy all the time. To make sure that messages get quick attention, the processor has to be interrupted, either by a communication interface interrupt mechanism, or by setting the interval timer to make the processor regularly look around for other matters to attend to. In either case, interrupts are disabled during an earlier interrupt so that the earlier message is processed to completion before the next interrupt is allowed. In this way, a locking operation is executed as an atomic sequence during which other requests are turned off. There is much system cost arising from the need to maintain correct memory consistency between multiple copies of the same data, with programs inserting
the relevant locking and other synchronization steps before and after accesses on shared data, as well as copying over the shared data and returning them after modification. Normally, one would try to minimize overhead by careful planning, so that remote copying is done in blocks which are neither too big nor too small, and the same block of local memory is reused for different remote data at different times. That is, increased programming effort is needed to maintain efficiency.
5.4.4  Distributed Shared Memory
In a distributed shared memory (DSM) system, a large virtual space is defined and its pages may be mapped to physical pages on different machines. Multiple programs executing at the same time are simply allocated different parts of this virtual space, and multiple threads within the same program may execute on different processors, each mapping the virtual pages the thread uses into its local physical memory, possibly including multiple copies of the same virtual page. Explicit memory-to-memory copying and message passing are not required, and programming is simpler. A program may be dispatched to any processor to execute, and we can consider such possibilities as restarting a crashed program on another node using information saved from an earlier checkpoint, or stopping a program to move it to another, less busy node to achieve better system throughput. With these advantages, it is understandable that distributed shared memory is a highly attractive idea. But an efficient memory consistency system is critical, because multiple copies of the same information are now used more commonly: simply by referring to a variable that another process also refers to, we cause a second copy of the page containing that variable to be loaded into our processor's memory. Not only must consistency be achieved, it will also be invoked frequently. This is where the idea of lazy consistency comes in: we try to delay the work needed to achieve consistency as much as possible. When one processor changes its copy, others may not have to know, because they may not need the new value, or may need only the final result after a set of changes. Hence, new memory content need only be delivered to a processor when it actually declares its need for a piece of shared information, by requesting a lock. A lock request is broadcast to all processors that share the same information and may have changed it; if no one is sharing the information or no one has changed it, then nothing in particular needs to be done, since the copy the processor already has is still correct. But if one or more processors have a modified copy, then the "lock successful" reply also tells the requesting processor that it should invalidate its own page containing the shared information. When the processor does access the information, a page fault occurs, causing the new content of the page to be delivered to it after all the changes made by the other processors have been incorporated. In turn, when some other processor tries to lock the information, it would have to invalidate its copy if this processor has made a change during its locked processing. The merits of this method are, first, that memory content transfers between
processors are kept to the minimum necessary, and second, that it makes use of the normal virtual memory support mechanisms. There is however some debate about various details. For example, how much information should be broadcast with a lock request? By specifying which page, or even which particular address, a processor wishes to lock, one minimizes false sharing, where modified pages are forwarded even though the recipient uses something else in the page, but increases communication and status recording costs. Also, should each shared page have an "owner" that holds its complete new content, or should all the users have separate copies with separately made changes, until someone locks it, at which point the changes made by all the other users are sent to that processor individually so that it can work out the latest complete content, which will be passed on to the next user doing a lock? Because of the various possibilities, there are many slightly different DSM systems, including some that allow a program to choose a combination of features to optimize the performance of a particular run. Further, though programming on a DSM system largely proceeds like programming on a shared memory system, shared data have multiple copies and not all the program behaviours are identical, as some examples in the next section will show. Furthermore, one needs to have some awareness of the way the system handles the shared data operations of a program, in order to maximize efficiency. Distributed and shared memory systems remain different systems. DSM programming is also likely to be less efficient than the equivalent distributed memory algorithms, which minimize memory usage by re-using the same memory block and the send/receive buffers for different data needed at different times, and by grouping transmitted data into blocks of the right sizes, at the cost of greater programming complexity. While DSM "papers over" the differences, it does not provide the architecture independence we would like to enjoy. In later sections where tuple programming examples are shown, we shall see that each system architecture requires slightly different code.
5.5  Tuple Locks

5.5.1  The Need for Better Locks
In the previous section we discussed weak memory consistency models. With the relaxation of the consistency requirements, the implementation cost of distributed shared memory is reduced, and efficient DSM systems can run on clusters while behaving in ways that are close to shared memory systems. Concurrency constructs like monitors and semaphores can be built on top of simple locks to allow shared memory application algorithms to be ported over. However, weaker memory consistency imposes its own requirements on program sequentiality: programs need to carry out certain synchronization and locking steps before and after accesses on shared memory, producing effects that are not quite identical to true shared memory systems. Such steps affect the structure of the programs and may make them less efficient, compared to what ought to be possible
if we could take full advantage of the memory sharing behaviour. The memory sharing and locking requirements also link closely with the heterogeneous concurrency ideas, and thus make DSM systems less suited to homogeneous machines and applications. We shall see that a stronger memory consistency model based on more content-laden locks improves matters on both fronts. Locks may be very basic in form; for example, the system may provide just a simple lock command with no call parameters; that is, the lock provides no information on what is being locked nor the particular reason for locking it, in fact no information to choose any particular lock, so that the same lock is used for all critical regions rather than different locks for different regions. Process X being inside region A would then exclude process Y from entering region B, even though there is no conflict. In other cases, parameters are provided in the lock request but the information available may be insufficient to determine whether a conflict exists, as when two processes wish to lock the same page even though they use different data in the page and it is in fact safe for them to execute at the same time, i.e., false sharing. In some memory models locks are required to do more and must specify the particular item of shared information involved, or the compiler might work this out for itself from the execution region and anticipated runtime conditions, resulting in new forms of consistency such as entry consistency, scope consistency and location consistency. Each of these techniques reduces false sharing, but generates overheads in requiring the system to maintain different lock indicators as well as their current lock holding conditions. The desirable goal is to get high benefit from the more sophisticated locks while minimizing the cost. To illustrate that the program sequences imposed by the consistency models through the locking requirements may introduce process inefficiencies, consider an example: process P1 repeatedly writes into a shared location X for process P2 to read, so that P1 repeatedly executes

lock; write X; unlock

and P2 repeatedly executes

lock; read X; unlock

But we know that if we are running on a DSM system, then a second copy of X would have been given to P2 when it acquired the lock, while P1 still has its own copy, so that there is actually no need for P1 to wait for P2 to finish reading before starting a new write into X. Yet, it is necessary for P1 to go through the
lock/write/unlock process to ensure that P2 does not start reading the next value until P1 has finished writing it. The program sequencing imposed by the lock hampers programmability. The use of more complex synchronization constructs, such as named locks, semaphores and barriers, can improve programmability. For example we can have

P1                     P2
write X                read X
    ...2-process barrier...
write X                read X

where, with each barrier, as in the case of a lock, the modifications made by P1 on its copy of the shared page are given to P2, so that P2 can read the previously written value on its copy while P1 proceeds to make another modification on its own copy, which will be seen by P2 after the next barrier. Hence, the use of a more complex synchronization operation has simplified programming. This is the arrangement for problems in which the two processes wish to share every value of X, which is achieved by the synchronized relation between P1 and P2: the process that finishes first must wait for the other process to finish before both can start a new step together. The synchronized processing would also be correct for multiple consumers, where every new data page must be seen by every consumer while another new page is being produced in the meantime. However, if we are programming something like a web server, then the data consumers do not necessarily want to access every new value produced, but only the latest value at the time a read is done. Further, the number of consumers is variable as they come and go, and it would not be correct to require all the processes to pass a barrier after the supply of each new value. The system can provide a way of defining varying subgroups of processes as synchronization groups, so that a barrier need not include all processes but only those that are taking part in sharing a particular item of data. However, though this method could work for some problems, generally things are not so predictable and there is a risk of program errors, such as a process forgetting to take part in a synchronization, or being late in doing so, preventing the barrier function from completing and stopping others from passing the barrier. That is, the above arrangement requires each participant in a synchronization group to refrain from starting the next read/write until the whole group is ready, and to "stick around" until the whole group is done; participants cannot start or end any time they like. This complicates as well as delays individual programs. But simple mutual exclusion is also not the right structure, as usually multiple readers can read concurrently as long as the data are not currently being written. To achieve the ability to come and go as one chooses in web-like behaviour, one requires different kinds of read and write locks, such that a write lock excludes subsequent reads until a write unlock, but a read lock does not exclude other readers, who can join
in to read any time they like as long as the page is not being changed. This is the well-known reader-writer problem from operating systems courses. It is therefore desirable that a lock specifies "why" as well as "what", in order to simplify programming and reduce inefficiency in the sharing processes, since a successful lock then indicates that certain conditions for execution are met, at the cost of increased lock complexity. Consider another example. Suppose X is a dynamic data structure such as a queue or stack. After a process acquires the lock guarding X, it might discover that X's condition (e.g., stack empty) does not permit the operation it wants to perform (in this case a pop). The process is therefore forced to release the lock in the hope that some other process will rectify the condition, but it, as well as any other processes wishing to carry out the same operation, will have to repeatedly enter the same critical region and obtain new copies of X while waiting for the rectification, just to see whether the situation has changed. This repeated entry/exit can however be avoided if we use a more complex lock arrangement such as a second lock or a semaphore. This will be further discussed below. What we really need is a kind of conditional lock, which is successfully acquired only if a specific requirement is met. So the direction we are moving in seems to be:

(a) a lock acquire specifies what data to lock;
(b) a lock acquire specifies what action is intended;
(c) a lock acquire specifies the condition for the lock to succeed;

and it would seem logical to go a step further and have

(d) a lock acquire also bring over the data being locked, or alternatively invalidate the existing copy so that new pages would be fetched as required,

which produces something akin to the idea of tuplespace. In other words, programmability can be improved by the adoption of more sophisticated locks, which however tends to increase implementation cost. In the present study we suggest that, by imposing some restrictive structures on the tuplespace, we can produce a system for sharing data with significantly enhanced programmability at a modest cost, contrary to previous experience that tuplespaces are expensive to implement in a distributed environment: by "tying down" tuples to specific virtual addresses and replacing a search process by a matching process, tuple operations become little more complex than simple lock - read/write - unlock operations.
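For reference, the reader-writer behaviour described above is provided directly by a POSIX read-write lock, in which readers may overlap freely and a writer excludes everyone; a minimal C sketch (shared_data and its type are illustrative), shown only as a point of comparison before the tuple-lock proposal:

#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int shared_data;

int reader(void)
{
    pthread_rwlock_rdlock(&rw);     /* does not block other readers  */
    int v = shared_data;
    pthread_rwlock_unlock(&rw);
    return v;
}

void writer(int v)
{
    pthread_rwlock_wrlock(&rw);     /* waits for readers and writers */
    shared_data = v;
    pthread_rwlock_unlock(&rw);
}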
5.5.2  Tuple Locks
Two shortcomings of the tuplespace method are that the search process is time consuming, especially in a large space, and that there are at present no simple rules governing the scope of the tuple operations, defining the visibility of one process's emitted tuples to another process, which would help to reduce the search space. The present proposal is to replace the search process by a matching process: tuples are fixed in number, location and field attributes, and are not dynamically created or destroyed, but merely change in content with tuple operations performed by processes. Further, they are mainly used to control access to blocks of data, which would otherwise require a large set of tuples and costly tuple operations. A process declares one or more "buckets", each guarded by one or more tuple locks. Locking and unlocking are both achieved by a RdTuple operation:

RdTuple (BucketID.TupleID: match | modify, ...)

with the alternative format

RdTuple (BucketID.TupleID: | modify | match, ...)

i.e., change the current tuple content first and then wait for or retrieve some expected value, which may be produced later by other processes. It is also possible to have modify without match, or match without modify. When match|modify appears in one field and |modify|match in another, the modify operations are synchronized, with the matches on the left occurring earlier and those on the right taking place later. To specify a match condition, the value currently in the tuple field is denoted by ?, and the match may be any Boolean expression involving ?, e.g.,

RdTuple (Bucket.Tuple: ?==x, ...)

succeeds if the value in the tuple equals the search key x. A simpler way to denote this is

RdTuple (Bucket.Tuple: x, ...)

and x may be an expression instead of a variable or constant. If multiple match conditions are given for several tuple fields, RdTuple succeeds if all of them are true. If a successful match is followed by a |modify, then the same RdTuple operation, performed by the same or another process, will no longer succeed until some other RdTuple operation changes the tuple fields being matched back to their original values. This corresponds to a lock/unlock. For example, a simple binary lock contains just a T/F value, and locking is achieved by

RdTuple (BucketID.TupleID: T | = F)
which succeeds if the tuple currently contains T (match) and changes its content to F (modify), so that another process trying to perform the same operation will suspend until some process executes

RdTuple (BucketID.TupleID: F | = T)

to change the value back to T. This is a simple lock and unlock. The modify part may contain any operator expression (e.g., +1) instead of an assignment (i.e., = ?+1). A match specification ?var retrieves a value from a tuple into a local variable as before. Note that ?var can have a modify on its left or right. Two operations involve copying the bucket guarded by a tuple:

InBucket (BucketID.TupleID: match | modify, ...)

causes the bucket to be copied to the local memory (which could be achieved on a DSM system by invalidating any current copy so that a new copy would be fetched upon access) if the tuple match fields equate. Similarly,

OutBucket (BucketID.TupleID: match | modify, ...)

makes the modified local copy visible to others, again provided that the tuple lock contains matching values, which indicates that the shared bucket is "write enabled". The Out could modify the tuple content to indicate that the new bucket content is now "read enabled". That is, by modifying the content of the tuple in a Rd/In, one prevents other processes from retrieving the bucket using the same match key, until an OutBucket or some other tuple operation restores the matching values, indicating that the condition to release the data to others is met. OutBucket may mean "overwrite existing content" or "incorporate my changes along with changes made by others", which is an implementation detail varying between systems. Details like this affect program behaviour, and we shall from time to time present different code versions for different assumptions of platform and implementation details. However, the ultimate objective is to standardize on a particular system behaviour regardless of whether the platform is distributed or shared memory, as will be shown in the final part of this work.
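RdTuple, InBucket and OutBucket are the notation proposed here, not an existing API. As a rough indication of what a shared-memory realization of a one-field tuple with the match | modify pattern might involve, the following C sketch uses a mutex and a condition variable; waiters are woken whenever the field changes and re-test their match condition:

#include <pthread.h>

typedef struct {
    int value;                         /* the single tuple field           */
    pthread_mutex_t m;
    pthread_cond_t  changed;
} Tuple;

static Tuple t = { 1, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };
                                       /* starts at 1, standing for T      */

/* Wait until the field equals `match`, then set it to `modify`
   (the T | =F pattern of the binary lock above). */
void rdtuple_match_modify(Tuple *tp, int match, int modify)
{
    pthread_mutex_lock(&tp->m);
    while (tp->value != match)                 /* no match: suspend        */
        pthread_cond_wait(&tp->changed, &tp->m);
    tp->value = modify;                        /* the modify part          */
    pthread_cond_broadcast(&tp->changed);      /* let waiters re-test      */
    pthread_mutex_unlock(&tp->m);
}

With 1 standing for T and 0 for F, the binary lock of the text corresponds to rdtuple_match_modify(&t, 1, 0) for locking and rdtuple_match_modify(&t, 0, 1) for unlocking; multi-field tuples and the bucket-copying operations would add fields and data transfer around the same waiting structure.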
5.5.3  Using Tuple Locks
For our first example, consider the multiple read/exclusive write problem in a distributed system. We require a tuple with two fields: a Boolean read-enable flag and a reader count. We assume a distributed system in which the writer can update its own copy at the same time as existing readers read their copies, but must prevent later readers/writers from starting until a new, consistent copy has been released; hence we have

Readers:  InBucket (BucketID.TupleID: T, | +1)
          ..read..
          RdTuple (BucketID.TupleID: -, | -1)

Writer:   InBucket (BucketID.TupleID: T | =F, -)
          ..write..
          OutBucket (BucketID.TupleID: | =T, 0)
Note that each reader increments the reader count by 1 before reading and decrements it after; the writer can start writing even when there are readers reading, but does not put back the new copy until the reader count goes to 0. (- indicates that we do not care about the current value or already know it.) Multiple writers can also be accommodated by replacing the read-enable flag by the number of writers:

Readers:  InBucket (Bucket.Tuple: 0, | +1)
          RdTuple (Bucket.Tuple: -, | -1)

Writers:  InBucket (Bucket.Tuple: | +1, -)
          OutBucket (Bucket.Tuple: | -1, 0)
with the assumption that they write to different parts of the bucket so that no conflict arises. Note also that this arrangement has the problem that readers will suffer starvation if new writers keep starting before earlier writers end. Tuple locks can also be implemented on shared memory systems, but the programs generally cannot be ported between shared memory and DSM systems without change; for example, concurrent read and write would not work, and in the reader/writer problem the writer must wait for the reader count to go to 0 before starting to write:

RdTuple (Bucket.Tuple: | +1, 0)
..write..
RdTuple (Bucket.Tuple: | -1, -)

Note that because the bucket in the shared memory is accessible to all processes, we can use RdTuple rather than InBucket and OutBucket, which would bring back/send out a local copy rather than use a common shared block, and this is not required for the shared memory reader/writer problem. Comparing a distributed system with DSM, the program behaviour also differs, because on a distributed system the local copy is obtained all at once, rather than only the actually used pages of a bucket being transferred on demand. To keep things efficient, buckets are usually smaller, and may be used as buffer spaces for different purposes as execution proceeds. An example approximating the multiple reader/single writer problem is Gaussian elimination with each process taking care of one row of the coefficient matrix: the
first row is not changed in the elimination, while the Nth row is modified N-1 times; in fact the process in charge of row 1 does no computation until the back substitution stage, and completes last since x[1] cannot be computed until x[2..N] are available. The processes that manage the lower rows perform progressively more work, but finish earlier the lower the row. Hence, the N processes are not well synchronized, and constitute an asynchronous structure with each process i providing the ith pivot row as writer in iteration i, and processes i+1 to N reading the pivot row, during the elimination stage; then, in the reverse order during the back substitution stage, each process i writes x[i] into the shared location so that processes 1 to i-1 can act as readers to subtract x[i] from their right hand sides. For a multiple writer example, consider the producer/consumer problem with the bucket containing a circular buffer guarded by two tuples, which hold the head and tail pointers respectively, in addition to the enqueue/dequeue enable flags and the empty/filled buffer counts. To proceed:

Consumer: InBucket (Queue.Head: T | =F, ?Filled, ?CIndex)
          remove Bucket[CIndex]
          OutBucket (Queue.Head: | = Filled>1, | -1, | +1 mod N)
          RdTuple (Queue.Tail: | =T, | +1, -)

Producer: InBucket (Queue.Tail: T | =F, ?Empty, ?PIndex)
          add Bucket[PIndex]
          OutBucket (Queue.Tail: | = Empty>1, | -1, | +1 mod N)
          RdTuple (Queue.Head: | =T, | +1, -)

Assuming that there are both filled and empty buffers, one producer and one consumer can both be modifying the bucket, but while they are doing so, the enable flags are F and so other producers/consumers must wait. Afterwards, the consumer decrements the filled buffer count and increments the empty buffer count, while the producer does the reverse. Each forwards its pointer by 1. If the buffer used is the last empty/filled buffer, then the enqueue/dequeue enable flag in the relevant tuple remains F until a new empty/filled buffer is created by another process. The program assumes that concurrent OutBucket operations are processed by incorporating the changes each makes into the home bucket, rather than by a simple overwrite. In a shared memory system RdTuple would be used in place of InBucket/OutBucket. The use of the tuple locks and buckets produces a number of advantages:

1. Absence of false sharing: even when two small buckets share a page, the tuple operations indicate which bucket is required, so that if the wanted bucket has not been modified, the current copy of the page is not invalidated even if other buckets have changed.

2. Absence of false exclusion: concurrent reads/writes on the same bucket are possible with appropriate locking, and reads/writes on different buckets would not
compete for the same lock.

3. Prefetching: by appropriately structuring bucket contents, one can cause related items to be fetched together, so that access on one part prefetches others. A bucket can contain one or more objects, or in reverse, an object can enclose several tuples guarding related buckets.

4. Scoping rules: Buckets and their guard tuples are declared like other data structures, and normal scoping rules apply. Where multiple readers and writers share a bucket, they go through the locking operations, so that the process managing the home bucket need only broadcast invalidate signals lazily to the actual users of the bucket. Hence, there are simple rules on both the scope of visibility and the scope of sharing.

5. Limited memory sharing support: Only tuples and buckets are shared between processes, limiting the system resources devoted to the purpose. However, in order to produce identical program behaviour on shared memory and distributed systems, bucket duplication may be needed in certain programs to run on shared memory systems; this too will be discussed in the final part of the article.
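As a point of comparison with the tuple-guarded circular buffer above, here is a minimal shared-memory sketch of the same producer/consumer scheme in Python, using a mutex and condition variable per guard tuple. The class and field names are invented for illustration, and only the behaviour described in the text (one producer and one consumer active at a time, waiting on the empty/filled counts) is approximated.

import threading

class RingBucket:
    def __init__(self, n):
        self.buf = [None] * n
        self.n = n
        self.head = 0                              # next slot to consume
        self.tail = 0                              # next slot to fill
        self.filled = 0
        self.empty = n
        self.head_guard = threading.Condition()    # plays the role of Queue.Head
        self.tail_guard = threading.Condition()    # plays the role of Queue.Tail

    def produce(self, item):
        with self.tail_guard:                      # InBucket on the Tail tuple
            while self.empty == 0:
                self.tail_guard.wait()
            self.buf[self.tail] = item             # add Bucket[PIndex]
            self.tail = (self.tail + 1) % self.n
            self.empty -= 1
        with self.head_guard:                      # RdTuple on the Head tuple
            self.filled += 1
            self.head_guard.notify()

    def consume(self):
        with self.head_guard:                      # InBucket on the Head tuple
            while self.filled == 0:
                self.head_guard.wait()
            item = self.buf[self.head]             # remove Bucket[CIndex]
            self.head = (self.head + 1) % self.n
            self.filled -= 1
        with self.tail_guard:                      # RdTuple on the Tail tuple
            self.empty += 1
            self.tail_guard.notify()
        return item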
5.5.4 Bucket Location
A bucket and its tuples may be defined globally, where it is visible to all the processes which will share it according to the scope rules, e.g.,

Bucket Queue:
  Tuple Head : Boolean, Integer, Integer;
  Tuple Tail : Boolean, Integer, Integer;
  Array[0..N-1] of QElement;
End;
Procedure Consumer (Integer) End;
Procedure Producer (Integer) End;
For i = 1 To n Exec Consumer (i);
For i = 1 to m Exec Producer (i);
On a shared memory system, the producers and consumers can directly refer to the bucket Queue and the tuples Queue.Head and Queue.Tail, as well as its content, a simple array which does not require a separate name and whose elements are referred to as Queue[Index]. On a distributed machine, each consumer/producer must define a local duplicate of Queue
Duplicate Queue;
to allocate local storage for the bucket, to copy the home bucket content over with InBucket, and to return modified content with OutBucket. On a DSM system both are achieved simply by RdTuple, which causes any changes made to the local copy (diffs) to be incorporated into the home bucket, and, if the local copy is not identical to the home bucket, the invalidation of the local copy so that future accesses would cause the home bucket content to be transferred over. Note that on a shared memory system the Duplicate declaration causes no local space allocation, since the home bucket is already accessible. (In subsection 8.4 we will discuss situations where a separate copy is needed.) Buckets may also be locally defined as a group, with each process holding a separate one identified by the process ID:

Process Bucket A [];
  Tuple ..
  Array [1..N]

A process may refer to another process's bucket/tuple by

RdTuple (A[processID].tupleID : ...)

and use InBucket/OutBucket to transfer bucket contents. It refers to its own bucket by the bucket name alone; the meaning of A[i] is therefore context dependent: within the bucket identification part of bucket/tuple operations it means bucket A of process i, while elsewhere it means the ith element of array bucket A.
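The home-bucket/duplicate distinction can be modelled crudely in conventional terms. The sketch below assumes whole-bucket copy semantics rather than page-level diffs, and every name in it (HomeBucket, LocalDuplicate, in_bucket, out_bucket) is invented for illustration; it is not the chapter's runtime.

import copy
import threading

class HomeBucket:
    def __init__(self, content):
        self.content = content
        self.lock = threading.Lock()

class LocalDuplicate:
    """Per-process copy of a globally defined bucket."""
    def __init__(self, home):
        self.home = home
        self.copy = None

    def in_bucket(self):
        # bring the current home content over as a private copy
        with self.home.lock:
            self.copy = copy.deepcopy(self.home.content)
        return self.copy

    def out_bucket(self, merge=lambda home, local: local):
        # return the modified content; 'merge' stands in for diff incorporation
        with self.home.lock:
            self.home.content = merge(self.home.content, self.copy)

# Usage: a worker makes a duplicate, edits its copy, then checks it back in.
home = HomeBucket({"Head": 0, "Tail": 0, "Q": [None] * 8})
dup = LocalDuplicate(home)
q = dup.in_bucket()
q["Q"][q["Tail"]] = "item"; q["Tail"] += 1
dup.out_bucket()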
5.5.5 Homogeneous Systems
For homogeneous parallel systems that go through a set of compute-communicate supersteps, tuple locks can be used in the communicate step. For example

...compute...
OutBucket (Bucket.Tuple: | +1)
InBucket (Bucket.Tuple: N)
...compute...
with OutBucket causing incorporation of the process's changes (diffs) into the home bucket. That is, each process contributes its results to the shared bucket and increments the arrive count, and when all N processes have arrived, the InBucket succeeds to give everyone the collected new information. (In a repetitive process there is the question of how to reset the arrive count to 0 for the next communicate step. This will arise in a number of programming examples and will also be discussed in section 8.) However, often we need not invoke bucket operations, but cause inter process communication using tuples only. A homogeneous parallel system usually requires a number of system wide functions in which all processors participate, such as Reduce to compute the total of N contributed values, Broadcast, Scatter, Gather, PrefixSum, etc. The RdTuple operation can be used to specify such functions, e.g.

RdTuple (Bucket.Tuple: ?x)

broadcasts the same value to local locations x in each process, after it has been put into the tuple by a process via RdTuple(.., | =x).

RdTuple (Bucket.Tuple: | +1 | N, | +x | ?result)

sums x from N processes and assigns the total to each process,

RdTuple (Bucket.Tuple: | +1 | N, | = If ?<x => x | ?result)

extracts the maximum of an array distributed across nodes, and

RdTuple (Bucket: [i] | =x)

places x received from the ith process into the ith position of the bucket. Note that the absence of a tuple ID indicates operation on the bucket content, and the index field [i] matches a particular element of the bucket. Like other tuples, the bucket content tuple can have additional fields as locking or content status signals, though none are used in this operation. Then

RdTuple (Bucket: [i] | ?x)

does the reverse. Finally the more obscure

RdTuple (Bucket.Tuple: i | +1, | +x | ?result)

successively adds each x contributed by process i, and hands over the sum up to that point, thus performing the prefix sum operation. Note that +1 is synchronized with +x; waiting for i comes earlier, and retrieving the current prefix sum value comes later.
To perform a matrix multiplication C = AB with each node holding a row of each array, we need a shared Array bucket with an array access control Tuple and an additional Result tuple; in each outer iteration j, node j initiates the scatter of the j-th row of matrix A, A[j, 1..N], which is stored at that node as a vector A, across the N nodes: it executes

Array = A

to put its row of A into the local copy of the Array bucket and then executes

OutBucket (Array: -, | =j)

sending it to the home bucket so that everyone can share its content, and each node i executes

RdTuple (Array: [i] | ?x, j)

to receive a particular element. In the inner iteration k each node i performs

RdTuple (Array.Result: | +1, | +x*B[k])

with node j performing in addition

RdTuple (Array.Result: N, ? C[k])

to receive the result and

RdTuple (Array.Result: N | =0, | =0.0)

to reset the result tuple for the next iteration. So the whole program is

If i==1 Then RdTuple (Array.Result: | =0, | =0.0, N+1);
For j = 1 To N Do
{ If i==j Then
  { Array = A;
    RdTuple (Array.Result: 0, 0.0, N+1 | =1);
    OutBucket (Array: -, | =j) }
  RdTuple (Array: [i] | ?x, j);
  For k = 1 To N Do
  { RdTuple (Array.Result: | +1, | +x*B[k], k)
    If i==j Then RdTuple (Array.Result: N | =0, ? C[k] | =0.0, | =k+1)
  }
}
Note that there is an extra field in the Result tuple to control the inner iteration over k and trigger the outer iteration when k exceeds N. In this program reinitialization of the Result tuple is simple because only one process uses the result; when it sees that all N processes have made their contributions, it retrieves the sum of products and re-initializes. When the result is shared by others, reinitialization must wait until everyone has read the result, and more care needs to be taken. We are now ready to look at some larger examples using buckets and tuples.
5.6 Using Tuple Locks in Parallel Algorithms

5.6.1 Gaussian Elimination
First a small but complete algorithm: Gaussian elimination on a multiprocessor system with each node holding one row of the matrix plus one element of the right hand side. This is named vector A with A[i] for i = 1 to N+1, and each node has a bucket Pivot to receive the pivot row for each step of elimination, with the pivot chosen by the size of the leftmost element. The bucket has tuples Value and Index to broadcast pivot and root information to other nodes. Each node j executes:

If j==1 Then RdTuple (Pivot.Value: | =1, | =0, | =0.0);
For i = 1 To N-1 Do
{ x = Abs (A[i]);
  RdTuple (Pivot.Value: i, | +1 | N, | = If ?<x => x | ?result);
  If i==j Then RdTuple (Pivot.Index: | =i, | =i-1, -);
  RdTuple (Pivot.Index: i, | +1 | N, | = If result==x => j | ?p);
  If j==p Then
  { j = i; Pivot = A;
    OutBucket (Pivot.Value: | =i+1, | =i, | =0.0);
    Exit }
  Else
  { If j==i Then j = p;
    InBucket (Pivot.Value: i+1, -, -);
    x = A[i]/Pivot[i];
    { For k = i+1 To N+1 Do A[k] = A[k] - Pivot[k]*x }
    Continue }
}
r = A[N+1]
For k = N Downto j+1 Do
{ RdTuple (Pivot.Value: k, | -1, ?x);
  r = r - A[k]*x }
x = r/A[j];
RdTuple (Pivot.Value: | =j, 0 | =j-1, x)
The program can be readily extended to perform a matrix inversion; instead of Ax = a where x and a are N-element column vectors, we solve AX = U where U is an NxN unit matrix. Thus, the A vector at each node contains 2N elements: N for a row of the A matrix, and N for a row of the unit matrix, and the back substitution requires each process to OutBucket a column of roots for the other nodes to use, instead of a single root in a tuple.
Each node j executes:

If j==1 Then RdTuple (Pivot.Value: | =1, | =0, | =0.0);
For i = 1 To N-1 Do
{ x = Abs (A[i]);
  RdTuple (Pivot.Value: i, | +1 | N, | = If ?<x => x | ?result);
  If j==i Then RdTuple (Pivot.Index: | =i, | =i-1, -);
  RdTuple (Pivot.Index: i, | +1 | N, | = If result==x => j | ?p)
  If j==p Then
  { j = i; Pivot = A;
    OutBucket (Pivot.Value: | =i+1, N | =i, | =0.0);
    Exit }
  Else
  { If j==i Then j=p;
    InBucket (Pivot.Value: i+1, -, -);
    x = A[i]/Pivot[i];
    For k = i+1 To N*2 Do A[k] = A[k] - Pivot[k]*x;
    Continue }
}
r = A[N+1..N*2]
For k = N Downto j+1 Do
{ InBucket (Pivot.Value: k, | -1);
  r = r - A[k]*Pivot }
r = r/A[j];
Pivot = r;
OutBucket (Pivot.Value: | =j, 0 | =j-1)
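For reference, the computation that both programs distribute can be written as a short sequential routine: Gaussian elimination with partial pivoting followed by back substitution on a dense system Ax = a. This is only a plain-Python sketch of the underlying arithmetic, not of the tuple-based coordination.

def solve(A, a):
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, a)]   # augmented matrix
    for i in range(n - 1):
        # choose the pivot row by the size of the leading element
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for k in range(i, n + 1):
                M[r][k] -= f * M[i][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                    # back substitution
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

# Example: solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]) gives approximately [0.8, 1.4]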
5.6.2 Prime Numbers
It is simple to write a sequential program to find prime numbers using the sieve:

For i = 2 To N Do Prime[i] = T;
For i = 2 To Sqrt(N) Do
  If Not Prime[i] Then Continue
  Else For k = i*i To N By i Do Prime[k] = F;

but now suppose we have up to N processors each performing the sieve on one section of N integers. On processor j, Prime[i] indicates whether jN+i is a prime number. Processor 0 has the responsibility of supplying the primes up to N for the others to use in their sieve, sending the numbers through a shared bucket; that is, it acts as the writer while the others are readers. The bucket size n needs to be large enough to contain all the primes up to N, plus an extra 0; it increases with N at a rate close to, but slower than, linear.
Processor 0:

For i = 1 To n Do Bucket[i] = 0;
RdTuple (Bucket.Tuple: 0); Ready = 0;
For i = 2 To N Do Prime[i] = T;
For i = 2 To Sqrt(N) Do
  If Not Prime[i] Then Continue
  Else
  { Ready = Ready+1; Bucket[Ready] = i;
    RdTuple (Bucket.Tuple: |=Ready);
    For k = i*i To N By i Do Prime[k] = F }
For i = Sqrt(N)+1 To N Do
  If Not Prime[i] Then Continue
  Else { Ready = Ready+1; Bucket[Ready] = i; }
RdTuple (Bucket.Tuple: |=Ready)
Processor j, j = 1..N-1:

For i = 1 To N Do Prime[i] = T;
Done = 0;
For i = 1 Do
{ InBucket (Bucket.Tuple: ?>Done);
  For ii = 1 Do
  { Done = Done + 1; d = Bucket[Done];
    If d==0 Then Exit
    Else For k = If d*d > j*N Then d*d - j*N Else d - Mod(j*N,d)
           To N By d Do Prime[k] = F }
}
In other words, the bucket is retrieved if the number of available primes shown in the tuple exceeds the number already processed by the node, and the node keeps obtaining new primes from the bucket until it gets a 0, meaning it has run out of small primes to sieve with, at which point it tries to get a fresh copy of the bucket. While Processor 0 is still doing its own sieving, it supplies each new prime it finds to the others without delay, but because the others are busy they may not access the bucket every time it changes. After its sieving is finished, Processor 0 would put all the primes between Sqrt(N)+1 and N into the bucket in one go. Note that in this problem the data structure in the bucket is simple and the usual reader-writer exclusion need not be enforced.
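The division of labour between processor 0 and the other processors amounts to a segmented sieve, which can be sketched sequentially as follows; the function and variable names are illustrative only, and the hand-over through the shared bucket is not modelled.

def segmented_primes(N, segments):
    base = []                                  # primes found in segment 0: 2..N
    prime = [True] * (N + 1)
    for i in range(2, N + 1):
        if prime[i]:
            base.append(i)
            for k in range(i * i, N + 1, i):
                prime[k] = False
    result = list(base)
    for j in range(1, segments):               # what "processor j" does
        lo = j * N + 1
        seg = [True] * N                       # seg[t] stands for the number lo+t
        for d in base:
            start = ((lo + d - 1) // d) * d    # first multiple of d in the segment
            for m in range(start, lo + N, d):
                seg[m - lo] = False
        result += [lo + t for t in range(N) if seg[t]]
    return result

# segmented_primes(10, 3) lists the primes up to 30: [2, 3, 5, 7, 11, 13, ..., 29]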
5.6.3 Fast Fourier Transform
The FFT algorithm factorizes an NxN Fourier transform matrix into logN sparse matrices each containing just two non-zero elements per row, thus reducing the number of multiplications to 2NlogN and additions to NlogN. However, to keep the structure of computation simple, the algorithm would produce the results in "bit reverse" order, requiring a re-ordering of the output vector by moving the element in position i = abc... to position i' = ...cba, where a, b, c, etc. are the binary digits of index i. Suppose the elements are stored in m buckets of size N; then computation proceeds with each bucket j combining its elements with bucket j+m/2, j+m/4, etc., followed by in-bucket computation; after which bucket j exchanges elements with bucket j+m/2, j+m/4, etc., in groups of size 1, 2, etc., with odd groups from the upper bucket changing position with even groups of the lower bucket, while group size < sqrt(m). There are altogether log m compute steps and (log m)/2 reorder steps. Note that each process owns one bucket but shares several buckets of other processes according to different index relations at various steps of computation and element exchanges.

% compute step between buckets:
gap = m/2; same = F; group = 0;
For k = 2*m Do
{ If k >= m Then
  { If Not (same) Then Bucket = A;
    OutBucket (Bucket[j-gap].tuple: |=gap);
    InBucket (Bucket[j-gap].tuple: -gap);
    k = k-m; same = T;
    group = group*2 + 1; }
  Else
  { RdTuple (Bucket[j].tuple: gap);
    minrow = j*N; E = e(group*gap/m);
    For row = 0 To N-1 Do
    { x = A[row] + E*Bucket[row];
      Bucket[row] = A[row] - E*Bucket[row];
      A[row] = x }
    same = F;
    RdTuple (Bucket[j].tuple, |=-gap);
    group = group*2 }
  k = 2*k;
  If gap==1 Then Exit;
  gap = gap/2 }

% compute steps within bucket;
For gap = N/2 Do
{ For l = 0 To N-1 By gap Do
  { E = e(2*(j+l/N)/m);
    For row = l To l+gap-1 Do
    { other = row+gap;
      x = A[row] + A[other]*E;
      A[other] = A[row] - A[other]*E;
      A[row] = x } }
  If gap==2 Then Exit;
  gap = gap/2 }

% reorder steps between buckets in groups;
gap = m/2; group = 1; same = F;
For k = 2*j Do
{ If k >= m Then
  { If Not (same) Then Bucket = A;
    OutBucket (Bucket[j-gap].tuple: | =gap);
    InBucket (Bucket[j-gap].tuple: -gap);
    A = Bucket; same = T; k = k-m }
  Else
  { RdTuple (Bucket[j].tuple: gap);
    For l = group To N-1 By 2*group Do
      For row = l To l+group-1 Do
      { x = A[row];
        A[row] = Bucket[row-group];
        Bucket[row-group] = x }
    RdTuple (Bucket[j].tuple: |=-gap);
    same = F } }
  group = group*2; gap = gap/2;

% reorder within bucket - this is for m
  gap = gap/2; group = group*2;
  If group >= gap Then Exit; }

% reorder whole buckets - this is for m >= N;
If group >= M Then
{ For ij = mod (j, gap*2), same = F Do
  { If ij <= 1 Then Exit;
    If ij >= gap Then
    { If Even (ij) Then
      { If Not (same) Then Bucket = A;
        OutBucket (Bucket[j-gap]+1.tuple: | =gap);
        InBucket (Bucket[j].tuple: gap);
        A = Bucket; same = T }
        ij = (ij-gap)/2 }
    Else
    { If Odd (ij) Then
      { RdTuple (Bucket[j].tuple: gap);
        For l = 0 To N-1 Do
        { x = Bucket[l]; Bucket[l] = A[l]; A[l] = x }
        RdTuple (Bucket[j+gap-1].tuple: | =gap);
        same = F }
      ij = ij/2 } }
  gap = gap/2 } }
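The "bit reverse" reordering that these exchange steps implement can be stated directly; a small sketch for a power-of-two length follows (plain Python, names invented).

def bit_reverse_permute(a):
    n = len(a)                                     # assumed to be a power of two
    bits = n.bit_length() - 1
    out = [None] * n
    for i in range(n):
        j = int(format(i, f'0{bits}b')[::-1], 2)   # reverse the binary digits of i
        out[j] = a[i]
    return out

# bit_reverse_permute([0, 1, 2, 3, 4, 5, 6, 7]) == [0, 4, 2, 6, 1, 5, 3, 7]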
5.6.4 Heap Sort
A heap sort starts by organizing N elements into a binary tree, in which each element is larger than its left and right successors; hence, each element is larger than all elements below it, and the largest element is at the root. This is the heap making phase. Afterwards, the heap is successively reduced in size by removing the root and replacing it with the element at position k, k = N-1, N-2, ..., each time moving the new root down and larger elements up until the heap property is restored. Since the removed element is always the largest in the heap, the final result is a sorted array with the largest element at N-1 and the smallest at 0. When m buckets of size N each are used, bucket 1 contains the top part of the heap, with each leaf element linking to two sub heaps in two successor buckets; the interior elements have successors within bucket 1 itself. In buckets 2 to m, the leaf elements have no successors. In the Make Heap phase, each recursion calls the subroutine MH to make the left and right sub heaps, and compares the two sub heap roots with the parent element, putting the largest at the parent node and, if necessary, calling the Remake Heap function for the subheap whose root has been swapped with the parent element. Whenever we reach an element which is a leaf of bucket 1, the tuples for the left and right buckets are used to obtain the two subheap roots as well as to return the new subheap roots. In the heap reduction phase, elements are exchanged between bucket 1 and the tail bucket, and the new element at the bucket 1 root causes the Remake Heap function to be repeatedly invoked; when the tail bucket has performed N swaps with bucket 1, the bucket Tail-1 becomes the next tail bucket for further exchanges, until all the non-root buckets contain sorted elements and exchanges occur within the root bucket.

MH (i):
x = A[i]; ii = i+i; ii1 = ii+1;
If i <= interior Then
{ MH (ii); MH (ii1);
  If A[ii] > A[ii1] Then
  { If x < A[ii] Then
    { A[i] = A[ii]; A[ii] = x; RH (ii); } }
  Else If x < A[ii1] Then
  { A[i] = A[ii1]; A[ii1] = x; RH (ii1) } }
Else If bucket==0 Then
{ ii = ii-N; ii1 = ii1-N;
  RdTuple (A[ii].Bucket: T, ?y);
  RdTuple (A[ii1].Bucket: T, ?z);
  If y > z Then
  { If x < y Then
    { A[i] = y; RdTuple (A[ii].Bucket, |=F, |=x) } }
  Else If x < z Then
  { A[i] = z; RdTuple (A[ii1].Bucket, |=F, |=x) } }

RH:

x = A[i]; ii = i+i; ii1 = ii+1;
If i <= interior Then
{ If ii==NR Then
  { If x < A[ii] Then
    { A[i] = A[ii]; A[ii] = x; RH (ii)
    } }
  Else If A[ii] > A[ii1] Then
  { If x < A[ii] Then
    { A[i] = A[ii]; A[ii] = x; RH (ii); } }
  Else If x < A[ii1] Then
  { A[i] = A[ii1]; A[ii1] = x; RH (ii1) } }
Else If bucket==1 Then
{ ii = ii-N; ii1 = ii1-N;
  If ii1 <= Tail Then
  { RdTuple (A[ii].Bucket, T, ?y);
    RdTuple (A[ii1].Bucket, T, ?z);
    If y > z Then
    { If x < y Then
      { A[i] = y;
        RdTuple (A[ii].Bucket: | =F, | =x);
        Return (ii) } }
    Else If x < z Then
    { A[i] = z;
      RdTuple (A[ii1].Bucket: | =F, | =x);
      Return (ii1) } }
  Else If ii==Tail Then
  { RdTuple (A[ii].Bucket: T, ?y);
    If x < y Then
    { A[i] = y;
      RdTuple (A[ii].Bucket, | =F, | =x);
      Return (ii) }
  } }

bucket > 0:

interior = N/2; Tail = Nil; NR = Nil;
MH (1);
For loop = 0 Do
{ RdTuple (A[1].Bucket: | =T | F, | =A[1] | ?x);
  If Not (Null (x)) Then { A[1] = x; RH (1); }
  If Null (x) OR Not (Null (NR)) Then
  { If Null (NR) => NR = N;
    x = A[NR];
    If NR==1 Then
    { RdTuple (A[root].Bucket: T | =F, ?A[1] | =x, | =Tail-1);
      Exit }
    Else
    { RdTuple (A[root].Bucket: T | =F, ?A[NR] | =x, -)
      NR = NR-1; interior = NR/2 } } }

bucket==0:

interior = N/2; Tail = N+1;
MH (1);
For loop = 0 Do
{ RdTuple (A[Tail].Bucket: | =T | F, | =A[1] | ?A[1], | ?Tail);
  j = RH (1);
  If Tail==0 => Exit;
  If j <> t
  Then RdTuple (A[Tail].Bucket: | =T, | =Nil);
  t = Tail }
For NR = N Downto 2 Do
{ x = A[1]; A[1] = A[NR]; A[NR] = x;
  interior = NR/2; RH (1) }
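For comparison, the make-heap and heap-reduction phases that the bucketed program distributes correspond to the following compact sequential heap sort (0-based indices here, whereas the text numbers elements from 1).

def heap_sort(a):
    n = len(a)

    def sift_down(i, limit):
        # move a[i] down until the heap property holds within a[0:limit]
        while True:
            left, right = 2 * i + 1, 2 * i + 2
            largest = i
            if left < limit and a[left] > a[largest]:
                largest = left
            if right < limit and a[right] > a[largest]:
                largest = right
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]
            i = largest

    for i in range(n // 2 - 1, -1, -1):   # Make Heap phase
        sift_down(i, n)
    for k in range(n - 1, 0, -1):         # heap reduction phase
        a[0], a[k] = a[k], a[0]           # move the current largest to the tail
        sift_down(0, k)                   # Remake Heap on the shortened heap
    return a

# heap_sort([5, 1, 4, 2, 3]) == [1, 2, 3, 4, 5]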
5.7 Tuples in Objects

5.7.1 Objects and Buckets
A bucket is simply a block of memory and can be used to store any data structures including objects, and a large object may spread over a number of buckets. Hence, the concepts of object and bucket may be treated as orthogonal, and for execution efficiency most systems would see page size buckets as most natural. However, there is also merit in imposing a connection between buckets and objects, because this
leads to simple rules of bucket and tuple visibility: if a bucket must be an object, then the tuple locks of a bucket are visible only to threads whose scopes include the object, provided the tuples have been declared as public; otherwise, only processes executing inside the object can see the tuple. In any case, two threads which can both see the same tuple can synchronize with each other using the tuple, and rules on what inter process communication is permitted can then be naturally derived from object scoping rules. Since objects can be defined in complex parent child hierarchies, matching the stages of parallelism to the hierarchical object-bucket structure is likely to be an important part of the algorithm design, further complicated by the need to have shared storage and communication efficiency during runtime. By checking the way objects engage in information exchange via a particular bucket/tuple, the compiler can help the runtime system to distribute objects to the right nodes so that remote storage operations and message transfers are minimized. A number of related issues arise in using objects in parallel programming, such as object atomicity requiring a shared object to behave like a monitor, and the connection between object and process spawning (active versus passive objects). There are also issues of how the manner of using objects changes between objects in shared memory and distributed objects, and with homogeneity and heterogeneity. For example, a homogeneous algorithm is likely to require arrays of objects of the same class, so that processors execute the same or similar steps on similarly structured data content, whereas a heterogeneous algorithm is more likely to have a hierarchical object structure with more varied inter-object relations and objects created and deleted in a more dynamic fashion. Again the basic thinking on this issue needs to be sorted out in light of further developmental work. To illustrate the use of tuple locks attached to objects, we present below an example, the medical Reception Room, which is one of the four so-called Salishan problems posed by a parallel programming group meeting held at a resort of that name. Each of these problems tests the capability of a parallel programming language to capture a particular aspect of parallelism. Problem 1, Hamming Numbers, requires three processes producing results at different rates but maintaining a loose synchronism with each other; problem 3 involves breaking up a large molecular bonding structure into smaller chunks and then recombining the results; while problem 4, Skyline Matrix, is a Gaussian elimination program that takes sparseness into consideration. Each of the other three can be programmed using tuples as the communication mechanism but without involving objects. Problem 2 is however clearly object oriented.
5.7.2 An Example
In the Reception Room problem, a set of patient objects and doctor objects interact via a reception room queue: when a patient arrives, if all doctors are busy, the patient waits in the queue; when a doctor becomes free a patient is taken off the queue for treatment; if a free doctor finds no waiting patient, he/she waits in the queue until a patient arrives.
{ DefClass Reception;
  Public Tuple Queue : Boolean, Integer, Integer, Pointer;
  ... }

{ DefClass Patient;
  RdTuple (Lobby.Queue: T | =F, ? Doc, -, -);
  If Doc==0 Then
  { RdTuple (Lobby.Queue: | =T, -, | +1, -)
    RdTuple (Lobby.Queue: -, -1 | =-2, -, ?MyDoctor | =.); }
  Else
  { RdTuple (Lobby.Queue: -, -, | =-1, | =.);
    RdTuple (Lobby.Queue: | =T, | -1, -2 | =0, ?MyDoctor) }
  ..speak to MyDoctor..
  Exit }

{ DefClass Doctor;
  For i=0 Do
  { RdTuple (Lobby.Queue: T | =F, -, ?Pat, -);
    If Pat==0 Then
    { RdTuple (Lobby.Queue: | =T, | +1, -, -);
      RdTuple (Lobby.Queue: -, -, -1 | =-2, ?MyPatient | =.) }
    Else
    { RdTuple (Lobby.Queue: -, | =-1, -, | =.);
      RdTuple (Lobby.Queue: | =T, -2 | =0, | -1, ?MyPatient) }
    ..speak to MyPatient.. } }

Lobby = Reception;
For j = 1 To n Do Exec Room[j] = Doctor;
For j = 1 Do { Delay (Random); Exec Patient }
A call on a class returns a pointer to a new instance, i.e., after Lobby = Reception; Lobby points to a reception object. This contains a queue tuple made up of a read enable flag, a doctor count, a patient count, and a doctor or patient object pointer. A total of n doctor objects are generated at the start, pointed to from n Room pointers, while patient objects are generated at random intervals without
their pointers being retained as variables (i.e., only they themselves know where they are), but EXEC attaches a process to each object so that it would not be garbage collected. If upon checking the Lobby tuple, a doctor object finds that the patient number is 0, it waits for patient arrival, which is signalled by the patient number being set to -1; this comes associated with a patient pointer. The dequeued doctor object then passes its own pointer, denoted by ".", through the tuple with the signal -2, after which the tuple access flag is returned to T. After a patient finishes talking to MyDoctor, the object exits so that its process terminates and its space may be garbage collected, while the freed doctor object goes back to Lobby to look for another patient. If it finds the patient number to be non-zero, it passes its own pointer "." to a waiting patient with the -1 signal, and waits for the -2 signal which comes with the patient pointer. The tuple operations of the patient are similar. The program makes use of the tuple request queue maintained by the tuple management routine, accepting its queuing policy. It is not difficult to provide one's own Enqueue and Dequeue functions so that there is greater control over queuing policies. The Pointer field of the Reception.Queue tuple would now contain the queue pointer. A patient object finding the Doctor number to be 0 would execute Enqueue (Pointer, .) before suspending via a RdTuple on its own tuple. Later a newly free doctor object would acquire a patient object pointer with a Dequeue and resume the patient via a RdTuple(MyPatient.Tuple: ...), passing over its own pointer to the patient object to enable it to take actions requiring the MyDoctor information. The Salishan problem specification does not prescribe how a patient object would speak with a doctor object. It is possible for a patient to take a passive stance:

RdTuple (.Tuple: | =F | T); Exit

while the doctor object takes various readings

a = MyPatient.Temperature;
b = MyPatient.Bloodpressure;
RdTuple (MyPatient.Tuple: | =T)
with the tuple operation waking up the patient object so that it could exit. It is also possible to envisage that the doctor object might delve into more complex patient information stored in a MyPatient.History object. Alternatively, the two objects could engage in a dynamic interaction via the MyDoctor.Tuple (which is .Tuple to the doctor object) and MyPatient.Tuple tuples. Note that the two tuples have different scopes: the doctor object can place information from within itself into .Tuple for the patient to retrieve, and it can retrieve patient information placed into MyPatient.Tuple by the patient object. Neither object can directly retrieve non-public information from the other, but can call a method in the other object to extract information.
We have seen that tuples can be used to control storage access on buckets and the execution of processes that use the buckets; we have also seen that tuples can be used to control the execution of objects. It is also meaningful to use tuples to control access to objects, with the appropriate RdTuple before and after access on object content, and, on a distributed system, using InBucket and OutBucket to transfer object content between the home process, which originally created an object instance, and processes holding local duplicates. Note that since one refers to an object instance using a pointer, sharing processes must have declared duplicates of the same pointer. Where, as in the case of the patient object example, an active object without a pointer to it is created, then no duplicates are possible. What if a process modifies the structure, rather than just the content, of a shared object, even perhaps reassigning a shared pointer to an object of another class? Theoretically, this can be handled: when a process executes a RdTuple on the object whose home copy has been modified, it is told not just to invalidate its own copy, but also to re-allocate the space for the object to the new object template. Such operations are likely to be expensive, and also assume that the process anticipates an object change and knows what tuples the new object contains. The issue is analogous to the need to make sure that all processes using a tuple have already accessed it before re-initializing, but technically more complex.
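The pairing logic of the Reception Room can also be modelled with conventional primitives; the sketch below uses a lock, events and two explicit wait queues in place of the Queue tuple, and all class and method names are invented for illustration.

import threading
from collections import deque

class Reception:
    def __init__(self):
        self.lock = threading.Lock()
        self.waiting_doctors = deque()    # entries: (event, handoff, doctor)
        self.waiting_patients = deque()   # entries: (event, handoff, patient)

    def patient_arrives(self, patient):
        with self.lock:
            if self.waiting_doctors:
                event, handoff, doctor = self.waiting_doctors.popleft()
                handoff.append(patient)   # give the doctor our pointer
                event.set()
                return doctor             # MyDoctor
            event, handoff = threading.Event(), []
            self.waiting_patients.append((event, handoff, patient))
        event.wait()                      # wait in the queue for a doctor
        return handoff[0]                 # MyDoctor

    def doctor_free(self, doctor):
        with self.lock:
            if self.waiting_patients:
                event, handoff, patient = self.waiting_patients.popleft()
                handoff.append(doctor)    # give the patient our pointer
                event.set()
                return patient            # MyPatient
            event, handoff = threading.Event(), []
            self.waiting_doctors.append((event, handoff, doctor))
        event.wait()                      # wait in the queue for a patient
        return handoff[0]                 # MyPatient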
5.7.3 Reflective Objects
An object contains a packet of data together with methods that operate on them, and a method call may, in addition to returning a result to the caller, also modify the internal data, i.e., change the object state. This could change the object's response to future calls, and a call on the same method with identical arguments may return a different result. The behaviour of other methods can similarly be modified. We can therefore design objects to provide multiple public methods sharing information that evolves with calls from outside threads, resulting in reflective function families. Further, such objects may enclose sub objects, which may display context dependency, with sub object behaviour changing with its external environment. For a simple example of the kind of problem that requires such object design, we look at state space control theory: a control system has a vector of state variables x and input vector y, which determine the next state variables and the output vector z via the linear equations

x(t) = Ax(t-1) + By(t)
z(t) = Cx(t-1) + Dy(t)

There are direct ways to deduce the differential equation equivalents of these matrix formalisms, including various special cases, but we will not go into this here. During execution, the reception of the input values and the production of the next state variables and output values must be synchronized, and this can be achieved using
tuple locks attached to the object. The methods that receive the input signals and modify the states are spawned to execute concurrently, using tuples to form iteration barriers. Another useful formalism arises from the Kalman filter, which is designed to produce an output that deviates from the desired values in accordance with a least-squares error criterion. This too involves the use of a set of matrix equations operating on the input and the state, but again no details will be discussed here.
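A sequential sketch of one synchronized step of the state-space object described above may make the data flow concrete; the matrix names follow the text, everything else is invented, and the concurrency and tuple barriers are not modelled.

def mat_vec(M, v):
    return [sum(m * u for m, u in zip(row, v)) for row in M]

def vec_add(a, b):
    return [p + q for p, q in zip(a, b)]

class StateSpace:
    def __init__(self, A, B, C, D, x0):
        self.A, self.B, self.C, self.D = A, B, C, D
        self.x = x0

    def step(self, y):
        # z(t) = C x(t-1) + D y(t), computed before the state is advanced
        z = vec_add(mat_vec(self.C, self.x), mat_vec(self.D, y))
        # x(t) = A x(t-1) + B y(t)
        self.x = vec_add(mat_vec(self.A, self.x), mat_vec(self.B, y))
        return z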
5.8 Towards Architecture Inclusive Parallel Programming

5.8.1 Parallel Tasking
One would think that putting parallel tasking features into a language is a simple, even trivial job: just annotate your program to indicate parts that can be executed independently, and get compilers to carry these intentions out. This is architecture inclusive, since if a number of identical or similar parts are so indicated, e.g., a loop body, then homogeneous parallelism results, while a more arbitrary structure would lead to heterogeneous parallelism. If the parts annotated as parallel share data globally, then a shared memory structure is needed; otherwise a distributed architecture would be suitable. It should be possible to agree on a standard, widely usable parallel tasking construct. But as things turned out, parallel tasking constructs have been as varied as inter process communication constructs, because people wished to combine parallel tasking with some related ideas; depending on which ideas were chosen and how they were added to parallel tasking, all kinds of constructs were invented, usually with some level of architecture dependence so that they are no longer inclusive. Take for example the widely used idea of future: when a parallel thread is spawned, a pointer to its result is created, and another thread that attempts to retrieve the result via the future pointer suspends until the result is available. This appears to offer a synchronization mechanism in addition to a parallel tasking mechanism, since a number of threads can suspend on a commonly required future. However, futures can only be used to provide inter process communication in very restricted ways, and cannot replace IPC features generally. Hence, more conventional IPC mechanisms still have to be provided, and one gets embroiled in issues of how to combine their use with futures. In the meantime, the need to support future pointers efficiently restricts architectural and runtime system design choices: futures are more natural on shared memory systems, while on distributed systems a future-to-message interface has to be constructed to support them. For another example, the traditional Linda tuplespace systems use the Eval command to spawn a number of parallel tasks, each returning a result that becomes one field of a tuple, which becomes ready when all these tasks end. By executing an In/Rd on this result tuple, other parallel threads achieve synchronization, like tasks suspending on a shared future. While this multiple tasking tool is more flexible than the single tasking future construct, it generates new questions like: are the
multiple tasks returning results to the same tuple homogeneous or heterogeneous? Since they start and stop together, there is homogeneity, but only a small number of parallel tasks are fired up with a tuple and so the method is actually more suited to supporting heterogeneous parallelism. The same issue arises with the parbegin...parend construct used in CSP and OpenMP, since one can only specify a limited number of parallel blocks between the two boundaries; thus, though the blocks start and end together homogeneously, they cannot specify homogeneous parallelism on the scale it requires. Ideas of performance and optimization would often creep into parallel tasking, with the parallel tasking construct giving the programmer room to specify the exact mapping of threads of code onto a particularly linked set of processors, matching a similar set of data mapping constructs used in another, declarative part of the program. Such facilities are however rarely architecture inclusive, and indeed are usually difficult to recode for different machines. We have instead expressed a preference for the more inclusive idea of specifying parallel parts of the code by simple annotations, declaring the architecture separately, and putting the optimization expertise into the particular compiler. Even ideas of making programs "architecture independent" could actually commit the programmer to particular architectures at the same time. For example the summation program of subsection 1.3, designed to allow variable numbers of threads, and hence independent of the number of processors, can only work on shared memory machines, since each processor i must be able to obtain elements x[i+j*n] from the memory for any j and any n. So in trying to achieve one kind of architecture independence, one simultaneously commits to a particular architecture in another direction. In short, in considering parallel tasking, we need to resist the temptation to entangle it with other ideas, however nice. It simply remains to use a parallel annotation (Spawn, Par, or our own past choice, Exec) to indicate that a statement or block in a program may be executed as a separate thread:
Exec <statement> Sync
means that <statement> and the following code may be executed in parallel, with Sync marking the point where they synchronize. Hence

For i = 1 To N Do Exec { ... } Sync

spawns N threads to execute the body of the loop, but all threads synchronize at the end of each loop execution. Hence, both homogeneous and heterogeneous tasks can be accommodated.
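A rough Python analogue of the annotation, assuming threads as the unit of parallel tasking: Exec corresponds to spawning the loop body and Sync to joining the spawned threads (function name invented).

import threading

def par_for(n, body):
    threads = [threading.Thread(target=body, args=(i,)) for i in range(1, n + 1)]
    for t in threads:       # Exec: spawn each iteration as a separate thread
        t.start()
    for t in threads:       # Sync: all threads synchronize here
        t.join()

# Usage: par_for(4, lambda i: print("iteration", i))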
To determine the data distribution, the normal scoping rules of the language would show what data the parallel threads are entitled to access, so that these may be packaged with the instructions of the thread into a dispatchable task. If the thread corresponds to an object or a subroutine defined at the outermost level of the program, then its data scope is self contained and is satisfiable locally. In this way, the shared memory/distributed memory requirements of the program can be determined. Note that no architecture information is invoked in the program directly. Such information may, however, be separately declared in another part of the program for each machine running the program, so that the compiler may work out the actual mapping of the threads onto processors and data onto the memory modules of the system.
5.8.2 Speculative Processing
Parallel tasks may be mandatory, meaning that we are certain their results are needed for the continuing execution of the program, or speculative, when we spawn off work that may or may not be necessary, in the hope that if a result is shown to be needed, we need not wait for it to be computed since it has already been produced speculatively. While wasted processing from unsuccessful speculation may arise, this is not a problem if a system has idle capacity, which may be used for computing results in advance and thus reduce the overall elapsed time of the program. A scheme for assigning decreasing priority to more speculative tasks is needed so that if the system is busy, only the mandatory tasks will be selected for execution. A simple way to do this is to use the If .. Then .. Else .. structure as the speculative construct:

If ... Exec Then ...
       Exec Else ...

which says the Then and Else parts will execute without waiting for the If part to finish first. Since we do not know whether the Then or Else part will be needed until the If part returns its result, spawning the Then and Else parts in parallel with the If part would produce speculative processing; the Then and Else parts would have lower priority than the If part, while more deeply nested parts would have successively lower priorities still. In this way, the programmer can cause different levels of speculation that meet his priority requirements and adapt to runtime conditions. A question arises regarding the assignments and tuple operations of speculative tasks: they should not become visible to other threads until speculation is confirmed, but in order to determine whether to confirm a speculative task, we may want to first see what its results are; in other words, the If part may need to see the results of the Then and Else parts before deciding whether to choose one or the other by returning a True or False. This dilemma is resolved by requiring each speculative
which says the Then and Else parts will execute without waiting for the If part to finish first. Since we do not know whether the Then or Else part will be needed until the If part returns its result, spawning the Then and Else parts in parallel with the If part would produce speculative processing; the Then and Else parts would have lower priority than the If part, while more deeply nested parts would have successively lower priorities still. In this way, the programmer can cause different levels of speculation that meets his priority requirements and adapts to runtime conditions. A question arises regarding the assignments and tuple operation of speculative tasks: they should not become visible to other threads until speculation is confirmed, but in order to determine whether to confirm a speculative task, we may want to first see what its results are; in other words, the If part may need to see the results of the Then and Else parts before deciding whether to choose one or the other by returning a True or False. This dilemma is resolved by requiring each speculative
task to correspond to an object, and allowing the If part to see the public portions of the Then and Else objects:

If ... Exec Then A (...)
       Exec Else B (...)

where A and B are defined classes, so that invoking them creates an instance of each type, which may be referred to from the If part using the object pointers Then and Else; therefore, the If part may inspect Then.x to examine a public variable defined in class A, or call a public method y defined in class B using Else.y in order to see what the Else part is doing, before deciding whether to return a T or F. If we execute

z = If ... Exec Then A (...)
           Exec Else B (...)
we would have two objects executing in parallel with the Boolean expression in If, and z will point to one of them depending on whether the Boolean returns T or F, possibly after looking into the two objects before making the choice. Speculative processing is inherently heterogeneous: it assumes that some processors have less work to do than others and become idle, so that their capacity can be used speculatively. The need to make objects out of speculative tasks, with limited visibility into their contents, also makes them more amenable to distributed architectures. Thus, the class of speculative algorithms is not architecture inclusive; we have merely deployed the architecture inclusive annotation of Then and Else parts for parallel tasking in order to invoke architecture related speculative processing. In short, the same philosophy has been followed.
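The speculative If construct can be imitated with futures: both branches are submitted in parallel with the condition and the losing result is simply discarded. This sketch ignores priorities and cancellation, and all function names in the usage comment are hypothetical.

from concurrent.futures import ThreadPoolExecutor

def speculative_if(cond, then_branch, else_branch):
    with ThreadPoolExecutor(max_workers=3) as pool:
        then_f = pool.submit(then_branch)       # speculative, lower priority
        else_f = pool.submit(else_branch)       # speculative, lower priority
        take_then = pool.submit(cond).result()  # mandatory: the If part
        return then_f.result() if take_then else else_f.result()

# Usage (hypothetical functions):
# speculative_if(lambda: expensive_test(),
#                lambda: compute_plan_a(),
#                lambda: compute_plan_b())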
5.8.3 Efficient Implementation of Tuple Operations
In both shared memory and distributed systems, the cost of acquiring and modifying a tuple is little more than doing a lock/unlock: either one enters a monitor in charge of a group of buckets in order to test and modify one of its tuples, or sends a message to the node for the home bucket after figuring out the node ID from the bucket/tuple information supplied in the tuple operation, and waits for a reply. By distributing control of all the home buckets among a group of processors, we can minimize the overhead due to the sequentiality of tuple processing. Similarly, the cost of transferring buckets via In/Out operations is no greater than block transfers on a distributed system or page transfers on a DSM system. There is however a special kind of tuple operation overhead; it lies in the queuing up of tuple requests before they succeed in acquiring a tuple. Not only is it necessary to provide space with the bucket for the queued requests; whenever the tuple is modified, its new content must be compared with all the search keys in the
queued requests to determine if a previously unmet match is now satisfied. The tuple content must be delivered to all the successful requests, and if a successful request modifies the tuple, the new content must once again be matched against the remaining requests. This is why "overusing" a tuple is highly undesirable: a tuple getting many different requests is bound to have a long wait queue, requiring an extended search process with each tuple modification, and once a long queue arises, it would take considerable time and processing before they can all be satisfied and removed. For example, when a tuple is used for barrier synchronization, with each process executing

RdTuple (Bucket.Tuple: | +1 | N)

there will in the end be N-1 tuple requests waiting for the value N to appear. This is however not a serious problem, as all the requests have the same match key, and with each increment a single comparison is sufficient; when the key finally matches, all the requests can be removed one after another, as none of them modifies the tuple, so that they all match and can be satisfied with the same tuple. The more difficult problem occurs when using a tuple for the prefix sum

RdTuple (Bucket.Tuple: i | +1, | +x | ?sum)

with each request waiting for its own index to appear in the tuple, before incrementing the index, adding its own element to the sum in the tuple and copying back the new sum. Each waiting request would have a different match key and a long queue search would be needed, repeatedly. By organizing all the requests in a heap (priority queue), the overhead could be minimized, but it remains high in comparison with most tuple operations. One way to avoid this is to distribute the process, perhaps using recursive divide and conquer to find a logN algorithm, and in the particular case of a hypercube machine, with communication steps between neighbouring nodes only. For the above example, we could divide the N processes into m groups of size n, and have each member node execute

RdTuple (groupID.Tuple: i | +1, | +x | ?sum)

after which the leader of each group, which controls the group home bucket and holds the rightmost element of the group prefix sum, executes

RdTuple (Global.Tuple: Group | +1, | +sum | ?total)
RdTuple (groupID.Tuple: | =0, | =total)

to bring back the inter group prefix sum, while each member executes

RdTuple (groupID.Tuple: 0, ?total)
sum = sum + total
to add the prefix sum from all the groups to the left to its own intra group prefix sum. The logN algorithm for a hypercube is more complex, but can be produced in a fairly straightforward fashion. We claim that, while some reprogramming is needed to adapt the tuple and bucket code to different architectures such as shared memory or distributed, as well as to different interconnections, the adaptation is not difficult to carry out. Further, the process could also be automated as a form of compiler optimization, with the application programmer providing architecture inclusive statements like the above RdTuple command, the user who runs the program on a particular machine providing architecture descriptions, and the compiler modifying the code to fit the system. On systems where system wide functions like barrier or prefix sum are provided as hardware operations, the compiler can recognize the tuple equivalent of such functions and compile the tuple operations given in the program accordingly. Discussing the details of such compiler optimization would take us rather far from the themes of this book and into such issues as architecture specification notations, system programming and compiler design; so we would only briefly point out that some relevant past work is likely to be found in the data mapping declarations used in High Performance Fortran and similar work, as the notation for such declarations reflects memory as well as processing capacity distributions in a system. There is also a need to design the right bucket configuration to optimize data distribution and runtime processing in tuple and bucket programming, and possibly part of this can also be automated with compilers utilizing architecture declarations and data mapping statements. Some developmental experience will be needed before more definitive ideas can be formulated for general cases.
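Returning to the grouped prefix-sum scheme above, it can be checked against a sequential rendering: intra-group prefix sums first, then a prefix over the group totals, then the offsets added back (function name invented; no tuples or communication are modelled).

def grouped_prefix_sum(values, n):
    groups = [values[i:i + n] for i in range(0, len(values), n)]
    intra = []
    for g in groups:                       # what each member's RdTuple builds
        sums, s = [], 0
        for x in g:
            s += x
            sums.append(s)
        intra.append(sums)
    offset, result = 0, []
    for sums in intra:                     # what the group leaders exchange
        result += [offset + s for s in sums]
        offset += sums[-1]
    return result

# grouped_prefix_sum([1, 2, 3, 4, 5, 6], 3) == [1, 3, 6, 10, 15, 21]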
5.8.4 Tuple and Bucket Programming Styles
The provision of tuples and buckets permits information sharing in a standard way: tuples are accessed using RdTuple commands to verify that the shared bucket is in the right condition for processing, and InBucket/OutBucket commands are used to bring back private copies of a block and to return results produced in these copies; the same mechanism works whether the copies exist in a shared memory with the home bucket, or come across a communication channel of a distributed memory. By changing the content of the control tuple with the Rd/In/Out commands, one also controls whether the tuple/bucket commands of other processes would succeed in accessing the shared information, achieving exclusion, synchronization, monitor and other inter process relations. However, though the same notations are used, the programming styles are not independent of architecture. In a shared memory system, private copies are not usually necessary; read only processing can be performed directly on the home bucket, and even writes can be done this way with the appropriate locks. Only in rather special situations would it be necessary to obtain a private copy, either to work on an old copy while the home bucket is being modified by others, or to carry out processing that makes changes to the bucket content for private purposes only
so that the changes need not be given to others, or to reduce memory contention for a heavily shared home bucket. It would be rare for two processes to have two private copies, both modifying them, and to try to merge the changes, since it is not trivial to define what the final result ought to be when changes overlap; if all changes are meant to be shared, then both should simply work on the shared copy, setting the tuple values in such a way that non-conflicting changes (such as queuing and dequeuing operations on a queue neither filled nor empty) take place concurrently, while conflicting changes are done sequentially. In a distributed system, one has to obtain a private copy before processing can occur. Hence, InBucket/OutBucket operations are performed, whereas in a shared memory system just RdTuple would be sufficient for an exclusive read and modify operation. Concurrent writes on a shared bucket are done only for quite specific algorithms with proven results. In a distributed shared memory system things are different yet again. Each processor would have a local copy of a shared bucket, but this is not a "private" copy, since the operating system would reflect all changes performed on the local copy in the home bucket whenever another process performs a successful RdTuple or InBucket operation. That is, all writes are shared, but they are not shared immediately as in a shared memory system. A process that has an old, unmodified copy can continue using its content without being affected by writes elsewhere as long as it does not perform any Rd/In operations on the bucket which may cause invalidation of the local copy. It is not necessary to create another local copy in order to preserve old contents if such care is taken. To illustrate the differences, consider the simple case of exclusive bucket processing. On the shared memory system

RdTuple (Bucket.Lock: Open | = Closed)
..critical region..
RdTuple (Bucket.Lock: | = Open)

and on a distributed system

InBucket (Bucket.Lock: Open | = Closed)
..critical region..
OutBucket (Bucket.Lock: | = Open)

On a DSM system, one would also use Rd/Rd instead of In/Out, but things are different from the shared memory system in that another process can still access its own copy containing the previous content of the bucket, at the same time the process in the critical region is putting in the new content, or even after the bucket has been unlocked, as long as it does not perform a Rd/In on the bucket. On a shared memory system the new content is immediately visible to others if they do not attempt to lock the bucket for exclusive access. If they do acquire the lock, then they can only access the version after the unlock, and thus cannot see the intermediary changes. In any case, the old content is not guaranteed.
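In conventional terms the two idioms differ only in whether the critical region updates the home data in place or a private copy that is written back before the lock is released; a toy Python model, not the chapter's runtime:

import copy
import threading

bucket = {"count": 0}
bucket_lock = threading.Lock()

def shared_memory_update():
    with bucket_lock:                    # RdTuple: Open |= Closed ... |= Open
        bucket["count"] += 1             # critical region, in place

def distributed_update():
    with bucket_lock:                    # InBucket (Bucket.Lock: Open |= Closed)
        local = copy.deepcopy(bucket)    # the private copy comes over with the lock
        local["count"] += 1              # critical region on the local copy
        bucket.update(local)             # OutBucket returns the modified content
                                         # leaving the block re-opens the lock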
Now take the more complex example of multiple synchronized readers/writers. Whereas the asynchronous reader/writer problem of section 5 has readers which wish to read the latest data only, now the readers wish to receive every new block produced by the writers, and all the reads/writes must be synchronized so that while the writers are producing the next block the readers are reading the previous block, and would switch together after all finish. For the shared memory case:

reader
Bucket = 0;
For j = 1 To ... Do
{ RdTuple (Bucket.Tuple: j, | +1 | N)
  read Bucket;
  complement (Bucket) }

except that for reader 1 the tuple operation is different:

RdTuple (0.Tuple: | =0, -, -)
RdTuple (1.Tuple: | =0, -, -)
For j = 1 ...
{ RdTuple (Bucket.Tuple: | =j, | =1 | N)
i.e., it has the job of initializing the tuple before waiting for all the processes to synchronize using it. Note that we need two tuples in two buckets used for synchronization: if I have passed one barrier, then the tuple used for the previous barrier must be free, as everyone must have passed the previous barrier already, and it is safe to re-initialize it, while the tuple for the current barrier should not be re-initialized immediately after I pass it, as I cannot be sure whether others have passed it too. Note that two tuples .Value and .Index were used in Gaussian elimination, with successful access of one followed by re-initialization of the other within each iteration.

writer
Bucket = 0;
For j = 1 To ... Do
{ write to Bucket;
  RdTuple (Bucket.Tuple: j, | +1 | N)
  complement (Bucket) }

That is, there are two buffers alternately used for writing and reading, and at the end of each cycle readers switch to the block written by the writers, and writers switch to the block just read by the readers. N is the total number of processes,
whether reader or writer, and when a barrier is passed, it means the writers have finished writing and the readers are ready to read. Now consider the same problem on distributed systems; it is necessary to use only one bucket, as everyone must make a local copy to process. First suppose that each OutBucket overwrites the home bucket, so that writers must execute exclusively, bringing over the current home bucket first. This is achieved by each writer setting the index in the tuple to 0 before writing and putting it back to j afterwards. Before starting, they also need to know that all readers have taken a copy of the previous block, in which case the last reader, noting that the reader count is R, would have re-initialized the tuple. Before starting to read, readers need to wait for all the writers to have done their writing and incremented the write count, so that the tuple contains the number W. Hence:

Reader
For j = 1 Do
{ InBucket (Bucket.Tuple: j, W, | +1 | ?n);
  If n==R Then RdTuple (Bucket.Tuple: | +1, | =0, | =0);
  read (Bucket) }

Writer
For j = 1 Do
{ InBucket (Bucket.Tuple: j | =0, -, 0);
  write (Bucket);
  OutBucket (Bucket.Tuple: | =j, | +1, 0) }

Next we assume that the system processes OutBucket by merging all the changes a writer made on its own copy into the home bucket, and if several writers OutBucket at the same time, the home bucket incorporates all their diffs. Readers need to wait for all the writers to do this before executing InBucket; hence

Reader
For j = 1 Do
{ InBucket (Bucket.Tuple: j, W, | +1 | ?n);
  If n==R Then RdTuple (Bucket.Tuple: | +1, | =0, | =0);
  read (Bucket) }

Writer
For j = 1 Do
{ InBucket (Bucket.Tuple: j, -, 0);
  write (Bucket);
  OutBucket (Bucket.Tuple: j, | +1, 0) }

That is, simply by not modifying the tuple, the writers' InBucket operations all succeed on the same bucket content, and the changes made by each will be incorporated during the OutBucket. The DSM version is similar to the shared memory version, except that on a DSM system it is again only necessary to have one bucket, and after a RdTuple, a node would be able to access the latest bucket content by referring to its virtual address, while a writer can modify the copy at its node without affecting the others, as the new content is not passed to the home node until a tuple operation is performed by a node. However, it is still necessary to use two tuples to handle re-initialization correctly.
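A shared-memory sketch of the synchronized scheme, with a reusable barrier over all N participants standing in for the pair of counting tuples and two buffers alternating between the writers and the readers; class and method names are invented, and all parties are assumed to run the same number of rounds.

import threading

class DoubleBuffer:
    def __init__(self, n_parties, size):
        self.buffers = [[None] * size, [None] * size]
        self.barrier = threading.Barrier(n_parties)   # readers + writers

    def writer_loop(self, rounds, produce):
        current = 0
        for j in range(rounds):
            produce(self.buffers[current])   # write this round's block
            self.barrier.wait()              # RdTuple (Bucket.Tuple: j, |+1 | N)
            current ^= 1                     # complement (Bucket)

    def reader_loop(self, rounds, consume):
        current = 0
        for j in range(rounds):
            self.barrier.wait()              # wait until the writers have finished
            consume(self.buffers[current])   # read the block just written
            current ^= 1                     # complement (Bucket)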
5.8.5 Back Towards Architecture Independence
We have thus illustrated the need to program in an architecture specific way, taking into consideration not only hardware differences, but also differences in bucket operations. This occurs because we used the tuples and buckets as parallel programming tools and tried to program in the most efficient way for an architecture, rather than as the basis of a common model. Now that we know the differences, we should try to bring everything together again. One difference between distributed and DSM systems is that the former requires the use of InBucket/OutBucket to move data from or to the home node of a globally defined bucket, but the equivalent effect is achieved if a RdTuple causes changes made on the local duplicate to be incorporated into the home bucket (OutBucket), and invalidates the local copy if it is different from the home bucket (InBucket), with the difference that, if the home bucket is modified before its pages are brought to the local node by misses, the content received would differ from the content at the time of the RdTuple. To ensure the equivalence between Rd and In/Out, it is necessary to impose the programming convention that the owner process of the home bucket does not write into it, which is usually satisfied because the home copy of the globally defined bucket is with the parent process that generates child processes to do the work on their local copies while it waits for them to complete so that the results can be incorporated into the home copy; alternatively, home buckets can be defined in separate, dummy processes that only act as data repositories. With locally defined buckets, every process has a distinct bucket and In/Out are used to transfer information between them, regardless of whether the system is shared memory or distributed. We also saw that a major difference between DSM and shared memory systems is that when several writers concurrently modify a common bucket, they do not see other processors' writes on a DSM system until there have been tuple operations that cause incorporation of changes into the home bucket and invalidation of local
If we impose the programming convention that concurrent writers write to separate areas of the shared bucket, and that readers read only old data until tuple operations are used to share the new information among all the parties, then again the distinction is covered up. Note that while other processes can request the incorporation of their writes into the home bucket with a RdTuple, the change may be prevented by the tuple being set to write-disable until the current readers finish acquiring the current content and reset the tuple.

However, for some algorithms the above programming convention might negate the basic advantage of distributed shared memory, which gives readers and writers the freedom to access their own copies. An alternative solution is to provide the same freedom in a shared memory system: a writer may declare

Own Duplicate ..name..;

which causes a separate copy of the shared bucket/object to be allocated in the process's local virtual space; content transfer between this copy and the home copy is performed by the tuple system with each RdTuple. Each writer is then free to modify its own copy until it causes its changes to be incorporated into the home copy with a RdTuple, provided the home copy is write-enabled. On a distributed system, Own Duplicate and Duplicate have the same effect, since each process must have its own copy of shared content. Now the multiple reader-writer program may be coded as:

Writer
For j = 1 To Do
{ write to Bucket;
  RdTuple(Bucket.Tuple: j, |+1 |N) }

Reader
For j = 1 To Do
{ RdTuple(Bucket.Tuple: j, |+1 |N);
  read Bucket }

since on both shared memory and distributed systems a RdTuple causes new content to be incorporated and distributed when its match succeeds, and each writer has its own copy. The same program now runs on both.
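To illustrate what Own Duplicate buys the writers, the following Python sketch gives each writer a private copy of the home bucket and folds back only the entries it actually changed (its diff) at the point corresponding to the RdTuple. It is an illustration only, not the tuple system's implementation; the names HomeBucket, own_duplicate and merge_diff are invented for the example. Because the two writers touch disjoint keys, their diffs merge cleanly, which is exactly the non-overlapping-writes convention discussed earlier.

import threading

class HomeBucket:
    """Home copy of a shared bucket; writers work on private duplicates."""
    def __init__(self, data):
        self._data = dict(data)
        self._lock = threading.Lock()

    def own_duplicate(self):
        """Hand out a private copy the caller may modify freely."""
        with self._lock:
            return dict(self._data)

    def merge_diff(self, original, modified):
        """At the RdTuple point: fold back only the entries the writer actually changed."""
        with self._lock:
            for k, v in modified.items():
                if original.get(k) != v:
                    self._data[k] = v

home = HomeBucket({"a": 0, "b": 0})

def writer(key, value):
    snapshot = home.own_duplicate()     # private copy in the writer's own space
    local = dict(snapshot)
    local[key] = value                  # writes go to the private copy only
    home.merge_diff(snapshot, local)    # changes become visible at the RdTuple point

t1 = threading.Thread(target=writer, args=("a", 1))
t2 = threading.Thread(target=writer, args=("b", 2))
t1.start(); t2.start(); t1.join(); t2.join()
print(home.own_duplicate())             # {'a': 1, 'b': 2}: disjoint diffs merge cleanly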
Another useful convention is to program in a homogeneous manner as far as possible. For example, if we have For i = 1 To N Do procedure; then on an SPMD machine the given procedure is simply started on N processors, while on a hypercube machine or a linked set of workstations the program would recursively spawn the N processes in log N steps. Similarly, a system-wide tuple operation such as prefix sum would be compiled into a hardware function on a massively parallel machine, or into a log N-step recursive operation on machines with some kind of uniform interconnect (a small sketch of such a scan is given at the end of this section). If we have a system with n nodes, each containing a subgroup of m processors, then a two-level distribution, first between groups and then within groups, would result. In each case, the compiler can be assisted by system declarations describing the hardware configuration as well as the memory model features. So, given the tuple and bucket mechanisms, the compiler facilities and the programming conventions, we would have a common, architecture inclusive framework, with the same application program plus an architecture description running on different systems.

The programming conventions do not always work out for a particular algorithm. For example, in some multiple reader/writer problems one cannot be sure that readers confine themselves to reading old parts, or that writes are all non-overlapping, and the algorithm that is guaranteed to work correctly on a shared memory system (using exclusive writers or separate read and write buckets) would not be optimal for DSM systems. Possibly we have to settle for programs made up of architecture independent segments, interspersed with code that has to be rewritten for each system, but even this would be a useful step towards the elusive goal of architecture independence.

In short, by accommodating both system-wide, homogeneous interprocess communication and bilateral, heterogeneous communication in the tuples, using them as locks controlling buckets as part of a stronger memory consistency model, we have a framework that includes the various divergences that have previously prevented progress in parallel programming, and a starting point for a new effort towards architecture independence.
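As promised above, here is a small, purely illustrative Python sketch of the log N-step recursive scan that a prefix-sum tuple operation might compile into on a machine with a uniform interconnect. It is a sequential simulation: each doubling round stands for one communication step across all the nodes, and the function name is an assumption made for the example.

def prefix_sum_log_steps(values):
    """Inclusive prefix sum in ceil(log2 N) doubling rounds, simulated sequentially."""
    x = list(values)
    n = len(x)
    step = 1
    while step < n:
        # round 'step': node i adds the value it receives from node i - step
        x = [x[i] + (x[i - step] if i >= step else 0) for i in range(n)]
        step *= 2
    return x

print(prefix_sum_log_steps([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36] after 3 rounds rather than 7 sequential additions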
Acknowledgement

This article was written during the author's sabbatical leave at Rice University, Houston, Texas, for most of 2000. The support provided by NUS and Rice for this work is gratefully acknowledged.