PARALLEL COMPUTING: Software Technology, Algorithms, Architectures and Applications

ADVANCES IN PARALLEL COMPUTING, VOLUME 13
Series Editor:
Gerhard R. Joubert (Technical University of Clausthal)
Managing Editor:
Aquariuslaan 60, 5632 BD Eindhoven, The Netherlands
2004
ELSEVIER
Amsterdam - Boston - Heidelberg - London - New York - Oxford - Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo
PARALLEL COMPUTING: Software Technology, Algorithms, Architectures and Applications
Edited by
G.R. Joubert (Clausthal, Germany)
W.E. Nagel (Dresden, Germany)
F.J. Peters (Eindhoven, The Netherlands)
W.V. Walter (Dresden, Germany)
2004
ELSEVIER
Amsterdam - Boston - Heidelberg - London - New York - Oxford - Paris - San Diego - San Francisco - Singapore - Sydney - Tokyo
ELSEVIER B.V. Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam, The Netherlands
ELSEVIER Inc. 525 B Street, Suite 1900 San Diego, CA 92101-4495 USA
ELSEVIER Ltd The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
ELSEVIER Ltd 84 Theobalds Road, London WC1X 8RR, UK
© 2004 Elsevier B.V. All rights reserved. This work is protected under copyright by Elsevier B.V., and the following terms and conditions apply to its use:

Photocopying: Single photocopies of single chapters may be made for personal use as allowed by national copyright laws. Permission of the Publisher and payment of a fee is required for all other photocopying, including multiple or systematic copying, copying for advertising or promotional purposes, resale, and all forms of document delivery. Special rates are available for educational institutions that wish to make photocopies for non-profit educational classroom use. Permissions may be sought directly from Elsevier's Rights Department in Oxford, UK: phone (+44) 1865 843830, fax (+44) 1865 853333, e-mail:
[email protected]. Requests may also be completed on-line via the Elsevier homepage (http://www.elsevier.com/locate/permissions). In the USA, users may clear permissions and make payments through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA; phone: (+1) (978) 7508400, fax: (+1) (978) 7504744, and in the UK through the Copyright Licensing Agency Rapid Clearance Service (CLARCS), 90 Tottenham Court Road, London W1P 0LP, UK; phone: (+44) 20 7631 5555; fax: (+44) 20 7631 5500. Other countries may have a local reprographic rights agency for payments. Derivative Works Tables of contents may be reproduced for internal circulation, but permission of the Publisher is required for external resale or distribution of such material. Permission of the Publisher is required for all other derivative works, including compilations and translations. Electronic Storage or Usage Permission of the Publisher is required to store or use electronically any material contained in this work, including any chapter or part of a chapter. Except as outlined above, no part of this work may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the Publisher. Address permissions requests to: Elsevier's Rights Department, at the fax and e-mail addresses noted above. Notice No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 2004 Library of Congress Cataloging in Publication Data A catalog record is available from the Library of Congress. British Library Cataloguing in Publication Data A catalogue record is available from the British Library.
ISBN: 0-444-51689-1
∞ The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
Printed in The Netherlands.
PREFACE

Dresden, a city of science and technology, of fine arts and baroque architecture, of education and invention, location of important research institutes and high-tech firms in IT and biotechnology, and gateway between Western and Eastern Europe, attracted 175 scientists to the international conference on parallel computing ParCo2003 from 2 to 5 September 2003. It was the tenth conference in the biennial ParCo series, the longest running European conference series covering all aspects of parallel and high performance computing. ParCo2003 was once again a milestone in gauging the status quo of research and the state of the art in the development and application of parallel and high performance computing techniques, highlighting both current and future trends.

The conference was hosted by the Center for High Performance Computing (ZHR) of the Technical University of Dresden. Since its foundation in 1828, the TU Dresden has undergone tremendous transformations from engineering school to technical school of higher education to full university. Today, the TU Dresden offers a broad range of subjects and specialisations in a wide variety of fields to about 33,000 students. In the tradition of many inventions in the development of mechanical calculators and early computers, the Center for High Performance Computing was founded in 1997 and has played an important role in the development of modern methods and tools to support high performance computing at the university and beyond.

Nowadays, many aspects of parallel computing have become part of mainstream computing. It is now commonplace to buy commodity off-the-shelf computers for home and office use that incorporate parallel techniques such as superscalarity, hyper-threading, VLIW (Very Long Instruction Word) and even cluster technologies that were considered advanced a mere decade ago. Quite apart from the speed with which new parallel technologies find their way into new products, these developments underline the importance of parallel computing research and development for the advancement of computer science and IT in general. In view of the rapid technology transfer taking place, one could be led to conclude that parallel computing research and development has passed its zenith since it has become standard computing practice. ParCo2003 showed that such a conclusion is invalid and that many complex research issues remain to be investigated. Thus it is clear - and this has been the case for a number of years - that future research in parallel computing will have to concentrate increasingly on all aspects of software engineering. In addition, the development of new architectures, especially those based on new technologies such as nanotechnology and biocomputing, as well as improved methods for performance evaluation, advanced algorithms, etc., must continue to receive appropriate attention.
Historical aspects and current trends were highlighted by three invited talks:
• Friedel Hoßfeld (Germany): Parallel Machines and the "Digital Brain" - An Intricate Extrapolation on Occasion of JvN's 100th Birthday
• Manfred Zorn (USA): Computational Challenges in the Genomics Era
• Charles D. Hansen (USA): High-Performance Visualisation: So Much Data, So Little Time
Furthermore, there were more than 80 contributed papers, broadly grouped under three main topics: Applications, Algorithms, and Software & Technology. Most of the papers received three and sometimes even four reviews, and we want to thank all members of the Programme Committee for their diligent work.

In contrast to previous ParCo conferences, this conference put a strong emphasis on minisymposia, which constituted two entire parallel tracks of the conference. The topics of the seven minisymposia and their organisers were:

Grid Computing - Franz-Josef Pfreundt
Bioinformatics - Manfred Zorn and Craig Stewart
Performance Analysis - Allen Malony
OpenMP - Barbara Chapman
Parallel Applications - Tor Sorevik
Cluster Computing - Anne C. Elster
Mobile Agents - Alessandro Genco

We wish to thank the organisers of these minisymposia for their tremendous work and support in attracting excellent speakers and in reviewing the papers. The invited speakers, the authors of contributed papers and the participants in the industrial session as well as in the various minisymposia highlighted many challenging application areas of parallel computing, such as bioinformatics and genomics, visualisation, image and video processing, modelling and simulation, mobile agents and data mining. Due to the complexity of the problems encountered, parallel computing paradigms often provide the only feasible approach. The overall picture conveyed by the conference was thus one of consolidation of parallel computing technologies and their transfer into off-the-shelf products on the one hand, and of emerging new areas of research and development on the other.

The editors are greatly indebted to the members of the International Programme Committee as well as of the Steering, Organising, Finance and Exhibition Committees for the time they spent in making this conference such a successful event. Many thanks are due to the staff of the Center for High Performance Computing for their enthusiastic support. This was a major factor in making this conference a great success. Our special thanks go to Claudia Schmidt (general organisation), Heike Jagode (overall design), Thomas Blümel (web administration), Stefan Pflüger (exhibition), and Guido Juckeland (proceedings).
Gerhard R. Joubert, Germany
Wolfgang E. Nagel, Germany
Frans J. Peters, The Netherlands
Wolfgang V. Walter, Germany

February 2004
SPONSORS
AMD GmbH
Cray Computer Deutschland GmbH
Hewlett-Packard GmbH, Geschäftsstelle Berlin
IBM Deutschland GmbH, Fachb. Lehre u. Forschung
Megware Computer GmbH
NEC High Performance Computing Europe GmbH
Pallas GmbH
Silicon Graphics GmbH
SUN Microsystems GmbH
EXHIBITORS / PARTICIPANTS IN THE INDUSTRIAL TRACK
Cray Computer Deutschland GmbH
Hewlett-Packard GmbH, Geschäftsstelle Berlin
IBM Deutschland GmbH, Fachb. Lehre u. Forschung
Megware Computer GmbH
NEC High Performance Computing Europe GmbH
Silicon Graphics GmbH
SUN Microsystems GmbH
CONFERENCE COMMITTEE
Gerhard R. Joubert (Conference Chair, Germany/Netherlands) Wolfgang E. Nagel (Germany) Frans J. Peters (Netherlands) Wolfgang V. Walter (Germany)
STEERING COMMITTEE Frans J. Peters (Chair, Netherlands) Friedel Hossfeld (Germany) Paul Messina (USA) Masaaki Shimasaki (Japan) Denis Trystram (France) Marco Vanneschi (Italy)
ORGANISING COMMITTEE Wolfgang V. Walter (Chair, Germany) Thomas Blümel (Germany) Uwe Fladrich (Germany) Heike Jagode (Germany) Guido Juckeland (Germany) Claudia Schmidt (Germany) Bernd Trenkler (Germany) Andrea Walther (Germany)
EXHIBITION SUB-COMMITTEE Stefan Pflüger (Chair, Germany) Norbert Attig (Germany) Hubert Busch (Germany) Wolf-Dietrich Harz (Germany) Matthias Müller (Germany)
FINANCE COMMITTEE Frans J. Peters (Chair, Netherlands)
PROGRAMME COMMITTEE Wolfgang E. Nagel (Chair, Germany) Nikolaus Adams (Germany) Hamid Arabnia (USA) Norbert Attig (Germany) Eduard Ayguadé (Spain) Achim Basermann (Germany) Christian Bischof (Germany) Petter E. Bjørstad (Norway) Arndt Bode (Germany) Thomas Brandes (Germany) Mats Brorsson (Sweden) Helmar Burkhart (Switzerland) Barbara Chapman (USA) Michel Cosnard (France) Pasqua D'Ambra (Italy) Luisa D'Amore (Italy) Erik D'Hollander (Belgium) Koen De Bosschere (Belgium) Luiz DeRose (USA) Andreas Deutsch (Germany) Beniamino Di Martino (Italy) Michael Eiermann (Germany) Rüdiger Esser (Germany) Thomas Fahringer (Austria) Afonso Ferreira (France) Salvatore Filippone (Italy) Michael Gerndt (Germany) Lucio Grandinetti (Italy) Andreas Griewank (Germany) John Gurd (UK) Volker Gülzow (Germany) Bianca Habermann (Germany) Rolf Hempel (Germany) Hans-Christian Hoppe (Germany) Lennart Johnsson (USA) Dieter Kranzlmüller (Austria) Norbert Kroll (Germany) Herbert Kuchen (Germany) Keqin Li (USA) Thomas Lippert (Germany) Thomas Ludwig (Germany) Allen Malony (USA)
Tomas Margalef (Spain) Djordje Maric (Switzerland) Federico Massaioli (Italy) Arndt Meyer (Germany) Bernd Mohr (Germany) Almerico Murli (Italy) Per Öster (Sweden) Jean-Louis Pazat (France) Franz-Josef Pfreundt (Germany) Wilfried Philips (Belgium) Michael J. Quinn (USA) Rolf Rabenseifner (Germany) Thomas Rauber (Germany) Alexander Reinefeld (Germany) Michael Resch (Germany) Richard Reuter (Germany) Jean Roman (France) Mathilde Romberg (Germany) Dirk Roose (Belgium) Hanns Ruder (Germany) Gudula Rünger (Germany) Tor Sorevik (Norway) Jens Simon (Germany) Horst Simon (Germany) Henk J. Sips (Netherlands) Erich Strohmaier (USA) Vaidy Sunderam (USA) Mateo Valero (Spain) Marco Vanneschi (Italy) Jeffrey Vetter (USA) Heinrich Voss (Germany) Martin Walker (Switzerland) Wolfgang Walter (Germany) Helmut Weberpals (Germany) Roland Wismüller (Germany) Gabriel Wittum (Germany) Rüdiger Wolff (Germany) Emilio Zapata (Spain) Hans Zima (Austria) Manfred Zorn (USA)
CONTENTS
Preface  v
Sponsors, Exhibitors / Participants in the Industrial Track  vii
Committees  viii
Invited Papers
Parallel Machines and the "Digital Brain" - An Intricate Extrapolation on Occasion of JvN's 100th Birthday F. Hossfeld
So Much Data, So Little Time... C. Hansen, S. Parker, C. Gribble
Software Technology
13
21
On Compiler Support for Mixed Task and Data Parallelism T. Rauber, R. Reilein, G. Rünger
23
Distributed Process Networks - Using Half FIFO Queues in CORBA A. Amar, P. Boulet, J.-L. Dekeyser, F. Theeuwen
31
An efficient data race detector backend for DIOTA M. Ronsse, B. Stougie, J. Maebe, F. Cornelis, K. De Bosschere
39
Pipelined parallelism for multi-join queries on shared nothing machines M. Bamha, M. Exbrayat
47
Towards the Hierarchical Group Consistency for DSM systems: an efficient way to share data objects L. Lefèvre, A. Bonhomme
An operational semantics for skeletons M. Aldinucci, M. Danelutto
A Programming Model for Tree Structured Parallel and Distributed Algorithms and its Implementation in a Java Environment H. Moritsch
55
63
71
A Rewriting Semantics for an Event-Oriented Functional Parallel Language F. Loulergue
79
RMI-like communication for migratable software components in HARNESS M. Migliardi, R. Podesta
87
Semantics of a Functional BSP Language with Imperative Features F. Gava, F. Loulergue
95
The Use of Parallel Genetic Algorithms for Optimization in the Early Design Phases E. Slaby, W. Funk
An Integrated Annotation and Compilation Framework for Task and Data Parallel Programming in Java H.J. Sips, K. van Reeuwijk
103
111
On The Use of Java Arrays for Sparse Matrix Computations G. Gundersen, T. Steihaug
119
A Calculus of Functional BSP Programs with Explicit Substitution F. Loulergue
127
JToe: a Java API for Object Exchange S. Chaumette, P. Grange, B. Métrot, P. Vignéras
135
A Modular Debugging Infrastructure for Parallel Programs D. Kranzlmüller, Ch. Schaubschläger, M. Scarpa, J. Volkert
143
Toward a Distributed Computational Steering Environment based on CORBA O. Coulaud, M. Dussere, A. Esnard
151
Parallel Decimation of 3D Meshes for Efficient Web-Based Isosurface Extraction A. Clematis, D. D'Agostino, M. Mancini, V. Gianuzzi
159
Parallel Programming
167
MPI on a Virtual Shared Memory F. Baiardi, D. Guerri, P. Mori, L. Ricci, L. Vaglini
169
OpenMP vs. MPI on a Shared Memory Multiprocessor J. Behrens, O. Haan, L. Kornblueh
177
MPI and OpenMP implementations of Branch-and-Bound Skeletons I. Dorta, C. León, C. Rodríguez, A. Rojas
185
Parallel Overlapped Block-Matching Motion Compensation Using MPI and OpenMP E. Pschernig, A. Uhl
193
A comparison of OpenMP and MPI for neural network simulations on a SunFire 6800 A. Strey
201
Comparison of Parallel Implementations of Runge-Kutta Solvers: Message Passing vs. Threads M. Korch, T. Rauber
209
Scheduling
217
Extending the Divisible Task Model for Workload Balancing in Clusters U. Rerrer, O. Kao, F. Drews
219
The generalized diffusion method for the load balancing problem G. Karagiorgos, N. Missirlis, F. Tzaferis
225
Delivering High Performance to Parallel Applications Using Advanced Scheduling N. Drosinos, G. Goumas, M. Athanasaki, N. Koziris
233
Algorithms
241
Multilevel Extended Algorithms in Structural Dynamics on Parallel Computers K. Elssel, H. Voss
243
Parallel Model Reduction of Large-Scale Unstable Systems P. Benner, M. Castillo, E.S. Quintana-Ortí, G. Quintana-Ortí
251
Parallel Decomposition Approaches for Training Support Vector Machines T. Serafini, G. Zanghirati, L. Zanni
259
Fast parallel solvers for fourth-order boundary value problems M. Jung
267
Parallel Solution of Sparse Eigenproblems by Simultaneous Rayleigh Quotient Optimization with FSAI preconditioning L. Bergamaschi, A. Martínez, G. Pini
275
An Accurate and Efficient Selfverifying Solver for Systems with Banded Coefficient Matrix C. Hölbig, W. Krämer, T.A. Diverio
283
3D parallel calculations of dendritic growth with the lattice Boltzmann method W. Miller, F. Pimentel, I. Rasin, U. Rehse
291
Distributed Negative Cycle Detection Algorithms L. Brim, I. Černá, L. Hejtmánek
297
A Framework for Seamlessly Making Object Oriented Applications Distributed S. Chaumette, P. Vignéras
305
Performance Evaluation of Parallel Genetic Algorithms for Optimization Problems of Different Complexity P. Köchel, M. Riedel
313
Extensible and Customizable Just-In-Time Security (JITS) Management of Client-Server Communication in Java S. Chaumette, P. Vignéras
321
Applications & Simulation
329
An Object-Oriented Parallel Multidisciplinary Simulation System - The SimServer U. Tremel, F. Deister, K.A. Sørensen, H. Rieger, N.P. Weatherill
331
Computer Simulation of Action Potential Propagation on Cardiac Tissues: An Efficient and Scalable Parallel Approach J.M. Alonso, J.M. Ferrero (Jr.), V. Hernández, G. Moltó, M. Monserrat, J. Saiz
339
MoDySim - A parallel dynamical UMTS simulator M.J. Fleuren, H. Stüben, G.F. Zegwaard
347
apeNEXT: a Multi-TFlops Computer for Elementary Particle Physics F. Bodin, Ph. Boucaud, N. Cabibbo, F. Di Carlo, R. De Pietri, F. Di Renzo, H. Kaldass, A. Lonardo, M. Lukyanov, S. de Luca, J. Micheli, V. Morenas, N. Paschedag, O. Pène, D. Pleiter, F. Rapuano, L. Sartori, F. Schifano, H. Simma, R. Tripiccione, P. Vicini
355
The Parallel Model System LM-MUSCAT for Chemistry-Transport Simulations: Coupling Scheme, Parallelization and Applications R. Wolke, O. Knoth, O. Hellmuth, W. Schröder, E. Renner
363
Real-time Visualization of Smoke through Parallelizations T. Vik, A.C. Elster, T. Hallgren
371
Parallel Simulation of Cavitated Flows in High Pressure Systems P.A. Adamidis, F. Wrona, U. Iben, R. Rabenseifner, C.-D. Munz
379
Improvements in black hole detection using parallelism F. Almeida, E. Mediavilla, A. Oscoz, F. de Sande
387
High Throughput Computing for Neural Network Simulation J. Culloty, P. Walsh
395
Parallel algorithms and data assimilation for hydraulic models C. Mazauric, V.D. Tran, W. Castaings, D. Froehlich, F.X. Le Dimet
403
Multimedia Applications
413
Parallelization of VQ Codebook Generation using Lazy PNN Algorithm A. Wakatani
415
A Scalable Parallel Video Server Based on Autonomous Network-attached Storage G. Tan, S. Wu, H. Jin, F. Xian
423
Efficient Parallel Search in Video Databases with Dynamic Feature Extraction S. Geisler
431
Architectures
439
Introspection in a Massively Parallel PIM-Based Architecture H.P. Zima
Time-Transparent Inter-Processor Connection Reconfiguration in Parallel Systems Based on Multiple Crossbar Switches E. Laskowski, M. Tudruj
SIMD design to solve partial differential equations R.W. Schulze
441
449
457
Caches
465
Trade-offs for Skewed-Associative Caches H. Vandierendonck, K. De Bosschere
467
Cache Memory Behavior of Advanced PDE Solvers D. Wallin, H. Johansson, S. Holmgren
475
Performance
483
A Comparative Study of MPI Implementations on a Cluster of SMP Workstations G. Rünger, S. Trautmann
485
MARMOT: An MPI Analysis and Checking Tool B. Krammer, K. Bidmon, M.S. Müller, M.M. Resch
493
BenchIT - Performance Measurement and Comparison for Scientific Applications G. Juckeland, S. Börner, M. Kluge, S. Kölling, W.E. Nagel, S. Pflüger, H. Röding, S. Seidl, T. William, R. Wloch
501
Performance Issues in the Implementation of the M-VIA Communication Software Ch. Fearing, D. Hickey, P.A. Wilsey, K. Tomko
509
Performance and performance counters on the Itanium 2 - A benchmarking case study U. Andersson, P. Ekman, P. Öster
517
On the parallel prediction of the RNA secondary structure F. Almeida, R. Andonov, L.M. Moreno, V. Poirriez, M. Pérez, C. Rodríguez
525
Clusters
533
MDICE - a MATLAB Toolbox for Efficient Cluster Computing R. Pfarrhofer, P. Bachhiesl, M. Kelz, H. Stögner, A. Uhl
535
Parallelization of Krylov Subspace Methods in Multiprocessor PC Clusters D. Picinin Jr., A.L. Martinotto, R.V. Dorneles, R.L. Rizzi, C. Hölbig, T.A. Diverio, P.O.A. Navaux
543
First Impressions of Different Parallel Cluster File Systems T.P. Boenisch, P.W. Haas, M. Hess, B. Krischok
551
Fast Parallel I/O on ParaStation Clusters N. Eicker, F. Isaila, T. Lippert, T. Moschny, W.F. Tichy
559
PRFX: a runtime library for high performance programming on clusters of SMP nodes B. Cirou, M.C. Counilh, J. Roman
Grids
569
577
Experiences about Job Migration on a Dynamic Grid Environment R.S. Montero, E. Huedo, I.M. Llorente
579
Security in a Peer-to-Peer Distributed Virtual Environment J. Köhnlein
587
A Grid Environment for Diesel Engine Chamber Optimization G. Aloisio, E. Blasi, M. Cafaro, I. Epicoco, S. Fiore, S. Mocavero
599
A Broker Architecture for Object-Oriented Master/Slave Computing in a Hierarchical Grid System M. Di Santo, N. Ranaldo, E. Zimeo
A framework for experimenting with structured parallel programming environment design M. Aldinucci, S. Campa, P. Ciullo, M. Coppola, M. Danelutto, P. Pesciullesi, R. Ravazzolo, M. Torquati, M. Vanneschi, C. Zoccolo
Minisymposium - Grid Computing
Considerations for Resource Brokerage and Scheduling in Grids R. Yahyapour
Job Description Language and User Interface in a Grid context: The EU DataGrid experience G. Avellino, S. Beco, F. Pacini, A. Maraschini, A. Terracina
609
617
625
627
635
On Pattern Oriented Software Architecture for the Grid H. Prem, N.R. Srinivasa Raghavan
Minisymposium - Bioinformatics
Green Destiny + mpiBLAST = Bioinfomagic W. Feng
643
651 653
Parallel Processing on Large Redundant Biological Data Sets: Protein Structures Classification with CEPAR D. Pekurovsky, I. Shindyalov, P. Bourne
661
MDGRAPE-3: A Petaflops Special-Purpose Computer System for Molecular Dynamics Simulations M. Taiji, T. Narumi, Y. Ohno, A. Konagaya
669
Structural Protein Interactions: From Months to Minutes P. Dafas, J. Gomoluch, A. Kozlenkov, M. Schroeder
677
Spatially Realistic Computational Physiology: Past, Present and Future J.R. Stiles, W.C. Ford, J.M. Pattillo, T.E. Deerinck, M.H. Ellisman, T.M. Bartol, T.J. Sejnowski
685
Cellular automaton modeling of pattern formation in interacting cell systems A. Deutsch, U. Börner, M. Bär
695
Numerical Simulation for eHealth: Grid-enabled Medical Simulation Services S. Benkner, W. Backfrieder, G. Berti, J. Fingberg, G. Kohring, J.G. Schmidt, S.E. Middleton, D. Jones, J. Fenner
705
Parallel computing in biomedical research and the search for peta-scale biomedical applications C.A. Stewart, D. Hart, R. W. Sheppard, H. Li, R. Cruise, V. Moskvin, L. Papiez
Minisymposium- Performance Analysis
719
727
Big Systems and Big Reliability Challenges D. A. Reed, C. Lu, C.L. Mendes
729
Scalable Performance Analysis of Parallel Systems: Concepts and Experiences H. Brunst, W.E. Nagel
737
CrossWalk: A Tool for Performance Profiling Across the User-Kernel Boundary A. V. Mirgorodskiy, B.P. Miller
745
Hardware-Counter Based Automatic Performance Analysis of Parallel Programs F. Wolf, B. Mohr
753
Online Performance Observation of Large-Scale Parallel Applications A.D. Malony, S. Shende, R. Bell
761
Deriving analytical models from a limited number of runs R.M. Badia, G. Rodriguez, J. Labarta
769
Performance Modeling of HPC Applications A. Snavely, X. Gao, C. Lee, L. Carrington, N. Wolter, J. Labarta, J. Gimenez, P. Jones
777
Minisymposium - OpenMP
785
Thread based OpenMP for nested parallelization R. Blikberg, T. Sorevik
787
OpenMP on Distributed Memory via Global Arrays L. Huang, B. Chapman, R.A. Kendall
795
Performance Simulation of a Hybrid OpenMP/MPI Application with HESSE R. Aversa, B. Di Martino, M. Rak, S. Venticinque, U. Villano
803
An environment for OpenMP code parallelization C.S. Ierotheou, H. Jin, G. Matthews, S.P. Johnson, R. Hood
811
Hindrances in OpenMP programming F. Massaioli
819
Wavelet-Based Still Image Coding Standards on SMPs using OpenMP R. Norcen, A. Uhl
827
Minisymposium - Parallel Applications
835
Parallel Solution of the Bidomain Equations with High Resolutions X. Cai, G.T. Lines, A. Tveito
837
Balancing Domain Decomposition Applied to Structural Analysis Problems P.E. Bjørstad, J. Koster
845
Multiperiod Portfolio Management Using Parallel Interior Point Method L. Halada, M. Lucka, I. Melichercik
853
Performance of a parallel split operator method for the time dependent Schrödinger equation T. Matthey, T. Sorevik
861
Minisymposium - Cluster Computing
869
Design and implementation of a 512 CPU cluster for general purpose supercomputing B. Vinter
871
Experiences Parallelizing, Configuring, Monitoring, and Visualizing Applications for Clusters and Multi-Clusters O.J. Anshus, J.M. Bjorndalen, L.A. Bongo
879
Cluster Computing as a Teaching Tool O.J. Anshus, A.C. Elster, B. Vinter
Minisymposium - Mobile Agents
887
895
Mobile Agents Principles of Operation A. Genco
897
Mobile Agent Application Fields F. Agostaro, A. Genco, S. Sorce
905
Mobile Agents and Grid Computing F. Agostaro, A. Chiello, A. Genco, S. Sorce
913
Mobile Agents, Globus and Resource Discovery F. Agostaro, A. Genco, S. Sorce
919
A Mobile Agent Tool for Resource Discovery F. Agostaro, A. Genco, S. Sorce
927
Mobile Agents and Knowledge Discovery in Ubiquitous Computing A. Genco
935
Author & Subject Index  943
Author Index  945
Subject Index  951
Invited Papers
Parallel Computing: Software Technology, Algorithms, Architectures and Applications
G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors)
© 2004 Elsevier B.V. All rights reserved.
Parallel Machines and the "Digital Brain" - An Intricate Extrapolation on Occasion of JvN's 100th Birthday

F. Hossfeld

Chair for Technical Informatics and Computer Sciences, University of Technology (RWTH) Aachen, and Central Institute for Applied Mathematics, Research Centre Juelich, Germany

On 28 December 2003, the scientific community will celebrate the 100th anniversary of John von Neumann's birthday. On this occasion, we are reminded of his achievements as an outstanding mathematician and the creator of game theory, but even more of the fact that he laid the very conceptual foundations of the digital computer. Contrary to Konrad Zuse's concept in Germany, his had the fortune that the transistor was invented at Bell Labs in 1947, just when JvN wrote his famous reports on the digital computer, thus giving rise to the extraordinary technological development of microelectronics, pushed further by other inventions such as photolithography and integrated circuits in the years to come. For decades, the exponential growth of the power of microchips - every 18 months, the integration density of transistors on a chip doubles - has been driving an equally exponential growth of computer power. The top computers have surpassed the teraflops level by far today, targeting 100 teraflops or even petaflops. The computer has become ubiquitous, and protagonists of robotics and artificial intelligence are tempted to attribute omnipotent capabilities to it, leading to autonomous "humanoids" (Moravec, Kurzweil) on the one hand and threatening horror scenarios on the other (Joy). The predictions of the semiconductor industry tell us that "Moore's Law", which describes the exponential evolution of microelectronics, might remain valid for another 10 to 15 years. Beyond Moore's Law, quantum effects will definitely end the orderly functioning of "classical" circuits. In 1982, Richard Feynman pointed out that certain quantum mechanical systems cannot be simulated efficiently on a "classical" digital computer, however powerful it may become. This led to speculations that computation in general could be done, in principle and even more efficiently, if a novel computer could make thorough use of quantum effects, thus providing a challenging option for parallel computing by exploiting the exponential speedup of quantum parallelism. Peter Shor's factorization algorithm of 1994 showed that a quantum computer could solve this well-known, classically hard problem efficiently. Experimentalists are working on different physical concepts to realize quantum computation; for instance, quantum dots, trapped ions, superconducting devices, and NMR technology have been shown to provide principles and technology with which quantum computers could be built.

1. LEGACIES

In the early 1950s, John von Neumann once said that three scientific achievements had changed the view of the world in the 20th century: (1) Einstein's Theory of Relativity, (2)
Heisenberg's Quantum Mechanics, and (3) Gödel's Incompleteness Theorem. In the second half of the 20th century, the history of science and technology added a then unforeseen fourth one: von Neumann's Digital Computer.

John von Neumann was born in December 1903 in Budapest as the son of a Hungarian Jewish banker. JvN was a mathematical infant prodigy with a kind of photographic memory. He was naturally disposed to study mathematics; his father, however, seeking advice from the famous von Kármán, then at the University of Technology in Aachen, decided that JvN was to study chemical engineering, which he dutifully did in Budapest, Berlin, and Zurich. He finished these studies with the diploma in 1926. In the same year, however, he also received his PhD in mathematics with a thesis on set theory, certainly influenced by Hilbert and his programme during visits to Göttingen. After lecturing in Berlin and Hamburg, he received an invitation as visiting professor from Princeton University, which he accepted in 1930. In 1933, after the foundation of the Institute for Advanced Study in Princeton in 1930, he became, together with Alexander and Veblen, one of the first professors of mathematics at the institute, which afterwards attracted Einstein, Gödel, Church, Dyson, Oppenheimer and many other famous scientists as well as, in 1937, Turing as a guest scientist. Hence, JvN was well informed not only about the progress of physics but also about Gödel's deconstruction of the Hilbert programme and about Turing's and Church's ideas on computation and computability. The concept of the Turing machine and of algorithmics definitely had a major impact on his later plans and projects for designing and building the digital computer. Highly honoured, JvN died of bone cancer on February 8, 1957.

Already in 1921, he received the Award of the Best Mathematician in Hungary. In 1927, he published five papers: three on quantum mechanics, which had just been created by Heisenberg and Schrödinger, one on a mathematical problem, and one which is considered the foundation of game theory. In 1932, he published his important monograph - in German - on the "Mathematical Foundations of Quantum Mechanics", in which he set this theory on solid mathematical ground, especially with his ideas on algebras, which are again receiving much attention in modern theoretical physics. From 1940 to 1945, he was deeply involved in the US Government's Manhattan Project, and he continued his involvement as an esteemed advisor in Washington until his death. Simultaneously, however, he focussed his mathematical interests on the grand challenges arising from the stalemate in the analysis of partial differential equations, which again guided him to the definition, design, and construction of the digital computer in order to overcome the stagnation of the analytic mathematical treatment of complex problems such as climate and weather forecasting, to indicate only a few of his broad spectrum of interdisciplinary challenges. These led him to establish computer simulation as the third category of scientific methodology, in addition to theory and experiment, a discipline which was later named Computational Science (& Engineering) by the Nobel Prize winner Kenneth Wilson (1982).
Honoured with the Medal of Freedom, the Albert Einstein Commemorative Award and the Enrico Fermi Award, and with membership in the National Academy of Sciences, the American Academy of Arts and Sciences, the American Philosophical Society and many other academies abroad, he died too early to complete, besides his diverse projects in mathematics and computing, his thorough analysis of the potential of the digital computer and its "organs" - as he called its various components - and of the similarities of the "digital brain" with, or rather its fundamental distinction from, the human brain. He elaborated on these questions in his last publication, the fragmentary analysis "The Computer and the Brain", which he worked out for the invited Silliman Lectures of 1956 that he was never able to deliver at Yale University [1, 2, 3, 4, 5].
2. COMPUTER TECHNOLOGY AND SUPERCOMPUTER ARCHITECTURE
While Konrad Zuse in Germany [6], now accepted - at last also in the US - as the first digital computer pioneer, had struggled since the 1930s with the limited technical and technological resources available, JvN was lucky enough to design his digital computer concept at a time when the transistor was invented by John Bardeen, Walter Brattain and William Shockley at Bell Labs in 1947 - earning them the Nobel Prize in 1956. This gave a tremendous push to the development of digital devices, encouraging the STRETCH program at IBM in 1955 in parallel with the development of the ferrite core memory by Jay Forrester, and stimulating John Backus to create the first compiler for the high-level language Fortran in 1957 to harness the IBM 704 computer with its ferrite core memory of 32,768 36-bit words and a magnetic drum as secondary storage. In 1959, Jack Kilby at Texas Instruments developed the first integrated circuit (then on germanium substrate) and Robert Noyce at Fairchild Camera invented photolithography. These giant technological steps created the incredible explosion of microelectronics, which resulted, in 1970, in the road-paving achievement of the first microprocessor, the Intel 4004. From then on, we can follow an exponential growth of digital computer power in parallel with the exponential growth of the density of transistors on the chips, which is acknowledged today as "Moore's Law", allowing for the doubling of microprocessor power and memory size within about 18 months. Personal computers, workstations, servers, and supercomputers have followed this "law" for decades, enhanced even further by innovations in computer architecture and operating system concepts [7] leading to vector-processing functions and parallelism in modern supercomputers, which nowadays reach far beyond teraflops performance with large microprocessor clusters. Since early 2002 the Japanese "Earth Simulator" - a powerful combination of vector-processing and parallel structures built on NEC technology - has been heading the TOP500 list with more than 40 teraflops peak and about 35 teraflops sustained Linpack performance [8], causing shock waves - "Computnik" - within the political and technocratic circles of the USA similar to those caused by the Sputnik launched by the late Soviet Union. This tremendous explosion of the digital computer traces back, in almost all aspects, to the roots of computer architecture and programming as laid by JvN in the 1940s.

Scanning the history since the very birthday of Computational Science and Engineering - which, as mentioned, may be dated back to 1946, when JvN together with H. H. Goldstine formulated the strategic program in his famous report on the necessity and future of digital computing - complex systems were at that time primarily associated with fluid dynamics. JvN expected that really efficient high-speed digital computers would "break the stalemate created by the failure of the purely analytical approach to nonlinear problems" and suggested fluid dynamics as a source of problems through which a mathematical penetration into the area of nonlinear partial differential equations could be initiated. JvN envisioned computer output as providing scientists with those heuristic hints needed in all parts of mathematics for genuine progress and as breaking the deadlock - the "present stalemate" - in fluid dynamics by giving clues to decisive mathematical ideas. In a sense, his arguments sound very young and familiar.
As far as fluid dynamics is concerned, in his John von Neumann Lecture at the SIAM National Meeting in 1981, Garrett Birkhoff came to the conclusion that computational fluid dynamics (CFD) was unlikely to become a truly mathematical science in the near future, although computers might soon rival wind tunnels in their capabilities; both, however, would remain essential for research [9, 10].
The various strategic position papers of the 1980s [11, 12, 13, 14, 15] and the government technology programs in the USA, in Europe, and in Japan in the early 1990s claimed that the timely provision of supercomputers to science and engineering and the ambitious development of innovative supercomputing hardware and software architectures, as well as of new algorithms and effective programming tools, are an urgent research-strategic response to the grand challenges arising from these huge scientific and technological barriers. The solutions of complex problems are critical to the future of science, technology, and society. Supercomputing will also be a crucial factor for industry in meeting the requirements of international economic competition, especially in the area of high-tech products. Despite remarkable investments in building up supercomputing power and skills in research centres and universities, and some sporadic efforts concerning supercomputing in European industry, it took until the 1990s before the U.S. Government and the European Union, as well as several national European governments, started non-military strategic support programs [16, 17, 18]. Their goals were also to enhance supercomputing by stimulating technology transfer from universities and research institutions into industry and by increasing the fraction of the technical community that gets the opportunity to develop the skills required to access high-performance computing resources efficiently.

In recent years, computer simulation has reached even the highest political level: in 1996, the United Nations voted to adopt the Comprehensive Test-Ban Treaty banning all nuclear testing for military and peaceful purposes. Banning physical nuclear testing created a need for full-physics modelling and high-confidence computer simulation and, hence, for unprecedented steps in supercomputer power. DoE's Accelerated Strategic Computing Initiative (ASCI) [19, 20], aiming to replace physical nuclear-weapons testing by computer simulation, and NSF's Partnerships for Advanced Computational Infrastructures [21] in the US, targeting the advancement of new computing and communication infrastructures for grid computing [22, 23] intended to revolutionize science and engineering - as well as business - are definitely establishing computer simulation as a fundamental methodology in science and engineering. The dedication of the Nobel Prize for Chemistry in 1998 to computational chemistry, in addition, confirmed its significance in the scientific community as well as in industry and politics.

In the early 1990s, it was foreseeable that Cray vector supercomputers with shared memory architecture and proprietary bipolar CPU technology would run into difficulties in providing the giant steps in compute power needed by greedy users, then particularly in physics. Supercomputer architectures with massive parallelism and distributed memory emerged as the future, and also more cost-effective, line for the very top end. During these years, nearly thirty companies were offering massively parallel systems and others were planning to enter the market with new products, although many experts predicted that the market would not be able to sustain that many vendors [24]. In [25], a chronological compilation of high performance computer history illuminates the "Darwinistic" forces affecting the supercomputer evolution lines.
It demonstrates that the expected shake-out in the computer industry took place, questioning the health and the future potential of this industry as a whole. Some companies went out of the parallel computer business for quite different reasons, others simply ended up in mergers. The dramatic survival battle in the supercomputer industry also did severe damage to the users and customers in the supercomputing arena. Quite often, the critical situation of parallel computing has been rigorously analyzed with respect to possible negative impacts on the future perspectives and the progress of this scientific discipline. Goldstine already remarked that the history of computing is
littered with "australopithecines", short computer lines which do not lead anywhere [3]. Following in the wake of DoE's ASCI program, powerful new supercomputers have in recent years been brought onto the market by manufacturers once again participating in the high-performance computing race. It seemed that only those supercomputer manufacturers would have a realistic chance to survive in the years to come who had the potential, the capabilities, and the favour to become involved in ASCI or other significant programs supported by the US Government. In early 2002, however, the Japanese took the lead with the "Earth Simulator", justified by the climate problems arising around the world. Parallel architectures have become quite successful on the TOP500 list, recently extending the parallel computer concept towards clustered symmetric multiprocessor (SMP) nodes, mainly based on commodity components and tying together possibly tens of thousands of processing elements [26]. Hence, the Central Institute for Applied Mathematics at the Research Centre Juelich, which runs the National Supercomputer Centre "John von Neumann Institute for Computing", has just inaugurated its new 8.9 teraflops SMP-based supercomputer "JUMP", now No. 1 in Europe, with 41 IBM p690 nodes of 32 processors each, adding up to 1312 processors and surpassing its MPP system CRAY T3E/1200 by more than a factor of 10 in performance [27, 28]. The simulation of extremely complex systems may also determine future large-scale computing by interconnecting supercomputers of diverse architectures into giant supercomputer complexes via the new paradigm of grid computing. These developments will challenge not only system reliability, availability and serviceability at novel levels, but also the interactivity of concurrent algorithms and, in particular, the adaptivity, accuracy, and stability of parallel numerical methods [29]. In any case, however, a new technology will be necessary to target the petaflops barrier in supercomputing and beyond, since further expansion of the several thousand square metres of floor space needed by today's SMP-clustered systems with only tens of teraflops peak performance will be neither reasonable nor technically feasible. IBM's Blue Gene project may give some guidance and, more aggressively, nanotechnology may provide innovative transistor concepts which will shift the limit of Moore's Law a bit further into the next decades [30].
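To make the growth rates quoted above concrete, the following short Python sketch projects how long an idealized 18-month doubling would take to carry the roughly 35 teraflops of sustained Linpack performance cited for the Earth Simulator to the petaflops level. The doubling period and the starting value are taken from the text; the calculation itself is only an illustrative extrapolation added for this edition, not a statement from the original paper.

import math

def years_to_target(start_flops, target_flops, doubling_months=18.0):
    # Number of doublings needed, times the length of one doubling period (in years).
    doublings = math.log2(target_flops / start_flops)
    return doublings * doubling_months / 12.0

sustained_2002 = 35e12   # ~35 TFlops sustained Linpack (Earth Simulator, 2002)
target = 1e15            # 1 PFlops, the barrier discussed in the text
years = years_to_target(sustained_2002, target)
print("Idealized 18-month doubling reaches 1 PFlops after about "
      "%.1f years, i.e. around %d" % (years, 2002 + round(years)))

Under these idealized assumptions the petaflops level would be reached within roughly seven years of 2002; the point of the passage, of course, is that such a trend cannot simply be extrapolated with existing technology.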
3. FROM BITS TO QUBITS

Nevertheless, the end of Moore's Law has been predicted for around 2020 at the latest, since extrapolation leads into nanoscales where the quantum regime governs completely and thus disturbs the transistor functions with noise. While the applications of "ubiquitous" computing and their specific requirements may be satisfied by currently available and further improved technology, high-performance computing targeting multi-petaflops performance may suffer severely from the end of exponential growth. Therefore, alternative computing concepts breaking those performance barriers should be explored intensively in order to provide the means to deal with complex problems in future science, industry, and society. In 1982, Richard Feynman explained that certain quantum physical systems cannot be simulated efficiently by "classical" computers; he pointed out, however, that quantum mechanics could possibly provide the principles for a fundamentally new computing paradigm: quantum computing, based on technologically and logically different computer architectures [31]. Already in 1984, David Deutsch designed a quantum computer model [32] which has since been elaborated theoretically in remarkable depth [33], yielding, as an expansion
of the Church-Turing computational concept to quantum Turing machines, deep complexity-theoretical results and quantum algorithms which meet the expectation of exponential computational speedup due to the inherent exponential superposition of quantum states which, in analogy to the classical Boolean bit, are represented by "qubits", thus exploiting the so-called quantum parallelism. As in Boolean switching algebra with its NAND and NOR gates, universal sets of gates are also available in quantum computing (single-qubit gates together with the conditional NOT, CNOT), besides further versatile gates for building up logical circuits, and remarkably - as in classical computation - the Fourier transform turned out to be a fundamental algorithmic element for exploiting the exponential speedup in quantum algorithms. On this basis, Peter Shor succeeded in developing a quantum factorization algorithm [34] which not only provides exponential speedup compared with classical sequential algorithms for this well-known computationally hard problem, but also demonstrated that quantum computing can provide efficient algorithms for at least some problems believed to be classically intractable [35]. Although there are diverse physical methods which provide the principal means to establish quantum computer technology and architecture (such as NMR technology, quantum dots, trapped ions, superconducting devices and others), and although some of them have already been demonstrated successfully with small numbers of qubits, it will definitely take many years to establish a sound quantum computer technology, if this succeeds at all. There are many sources of limited stability and of fundamental perturbations of entangled quantum states (decoherence). However, as the history of science has shown elsewhere, once the idea is born it will definitely be pursued, by scientific curiosity about innovative computational concepts as well as by the needs of greedy applications. Thus, quantum computing represents a grand challenge for the years to come; it opens a new space for future intellectual adventures in research and development in physics, mathematics, and computer science.
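The notion of quantum parallelism sketched above can be illustrated with a small classical simulation: the state of n qubits is a complex vector of length 2^n, a Hadamard gate applied to every qubit produces an equal superposition of all 2^n basis states, and a CNOT gate acts on that entire superposition at once. The Python/NumPy sketch below was added for illustration only; being classical, it itself needs memory exponential in n, and the function names are ad hoc rather than taken from any quantum-computing library.

import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)   # Hadamard gate
I2 = np.eye(2, dtype=complex)

def apply_single_qubit_gate(state, gate, target, n):
    # Build the full 2^n x 2^n operator by tensoring the gate into position `target`
    # (qubit 0 is the leftmost, i.e. most significant, bit of the basis index).
    op = np.array([[1]], dtype=complex)
    for q in range(n):
        op = np.kron(op, gate if q == target else I2)
    return op @ state

def apply_cnot(state, control, target, n):
    # Flip the target bit of every basis state whose control bit is 1.
    dim = 2 ** n
    new_state = np.zeros(dim, dtype=complex)
    for idx in range(dim):
        bits = [(idx >> (n - 1 - q)) & 1 for q in range(n)]
        if bits[control] == 1:
            bits[target] ^= 1
        new_idx = int("".join(map(str, bits)), 2)
        new_state[new_idx] += state[idx]
    return new_state

n = 3
state = np.zeros(2 ** n, dtype=complex)
state[0] = 1.0                       # start in |000>
for q in range(n):                   # Hadamard on every qubit: uniform superposition
    state = apply_single_qubit_gate(state, H, q, n)
state = apply_cnot(state, control=0, target=2, n=n)
print("state dimension:", state.size)        # 2**n = 8
print("amplitudes:", np.round(state, 3))     # all equal to 1/sqrt(8)

A real quantum register would hold this 2^n-dimensional state physically, which is precisely why a classical machine cannot track large instances efficiently.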
4. TOWARDS THE "DIGITAL BRAIN"?

Along with considerations on the future of computation and the understanding of the human brain, Roger Penrose in his popular books also addressed quantum mechanics and possibly emerging new theories in physics, such as string theory and its future extensions, as potential scientific vehicles for explaining human brain functions and for understanding mind and consciousness [36, 37, 38]. As indicated above, JvN discussed in his preliminarily sketched Silliman Lectures, published as "The Computer and the Brain" [5], the similarities and differences of the computer, as he defined it, and the brain, as he understood it at his time. Interestingly enough, in his famous reports defining the digital computer in the 1940s he spoke of the major computer components as "organs", showing that he in some sense entertained the long European tradition of ideas about designing and constructing humanoids, a tradition which boomed in the 17th and 18th centuries and gave rise to many appearances that look ridiculous today, such as flute-playing monkeys, chess-playing fakes and other curiosities.

However, as a great mathematician he certainly also stood in the tradition of occidental philosophy, which for centuries up to today has been challenged by the mind-body problem. This tradition reaches back to the ancient philosophers Thales of Miletus ("the four elements"), Anaxagoras of Clazomenae (introducing "nous", the mind), Leucippus, Democritus and, last but not least, Aristotle and Plato, and leads via Locke ("An Essay Concerning Human Understanding"), Leibniz ("Nouveaux essais sur l'entendement humain"), de La Mettrie ("L'homme plus que machine") and Descartes ("Discours de la méthode" and "Meditationes de prima philosophia") towards Eccles ("The Self and Its Brain", together with Popper) and the brain researchers of today such as Edelman, Chalmers, Singer, Roth and many others [39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51], who point out that the human brain is definitely the most complex system in the - known - universe. The more one learns about the brain, the greater the distance to the digital computer seems to grow, although this statement does not match the expectations of Artificial Intelligence (AI). "Strong" Artificial Intelligence still maintains its postulate that the exponentially growing complexity of the digital computer will automatically create mind and (self-)consciousness in the digital machine, and Ray Kurzweil [52] and Hans Moravec [53] draw dramatic pictures of a future of autonomous intelligent robots and humanoids which might take over control already during this century, keeping us humans as pets if we behave. Moravec, who focusses mainly on robotics, predicts that the digital computer has reached, or is not far from, the capability to simulate living beings and - pointing at recent computer chess matches with Kasparov - even to compete with human intelligence. Based on simplistic calculations comparing vertebrate retina performance and robot vision requirements, he states that 100 tera-instructions per second of computer power will be sufficient to match human behaviour by simulation. This computer performance will indeed soon be available, if it is not already today. On the other hand, IT experts such as Bill Joy, philosophers and social scientists as well as brain researchers consider Kurzweil's predictions as either horror scenarios or fundamental impossibilities [54]. However, as long as we do not know what human intelligence and its essence really are, and what the possibly - if not, as a scientist, to say: certainly - materialistic ground of human mind and consciousness is, what they are and how they emerge, there will be no final answer to the mind-body problem and - vice versa - no understanding of the human brain; thus it is not possible to make true statements about the "intelligent" evolution of the digital computer either.

Penrose discussed - in "Shadows of the Mind" [38] - four viewpoints on the question whether computing equals thinking and thinking equals computation:
• Position A: All thinking is computation ("computation" in the sense of Church and Turing, i.e. based on algorithms and representable by Turing machines); in particular, feelings of conscious awareness are evoked merely by executing appropriate computations. ("Strong" AI)
• Position B: Awareness is a feature of the brain's physical action; and whereas any physical action can be simulated computationally, computational simulation cannot by itself evoke awareness. ("Weak" AI)
• Position C: Appropriate physical action of the brain evokes awareness, but this physical action cannot even be properly simulated computationally.
• Position D: Awareness cannot be explained by physical, computational, or any other scientific terms.

It seems quite natural and consistent that Penrose, as a mathematician and scientist rather than a strong-AI promoter or a theologically locked esoteric, discarded Positions A, B, and D as inadequate answers to the question, and that in strengthening Position C he referred to the evolution of physical theories extending quantum mechanics, which will favour and finally
confirm Position C. In this respect, it is interesting that the Nobel Prize winner John Eccles also seems to have pursued, during his last scientific activities, (still vague) ideas of involving quantum mechanical processes in the description of the functions of the human brain concerning the interaction of mind and matter.

Looking back to JvN's first monograph, which defined and rigorously clarified the mathematical foundations of quantum mechanics, and to his last work comparing the computer and the brain, we must say that his early death was a great loss also for this challenging and future-determining field of research. If we listen to the predictions of the Kurzweils and Moravecs, we may be justified in stating that JvN, as one of the last geniuses of the 20th century, would have activated his exceptional intellectual capabilities and sharp theoretical instruments to shape this field in his strong scientific manner, as he did in other areas during his relatively short life, rather than by wild speculation. The scientific community has indeed good reasons to celebrate his 100th birthday!

REFERENCES
[1] von Neumann, J., Collected Works, Vol. I-VI, Pergamon Press, 1961-1963.
[2] Aspray, W., John von Neumann and the Origins of Modern Computing, MIT Press, 1990.
[3] Macrae, N., John von Neumann, Pantheon Books, 1992.
[4] von Neumann, J., Mathematical Foundations of Quantum Mechanics, Princeton University Press, 1955.
[5] von Neumann, J., The Computer and the Brain, 2nd Edition (with a Foreword by Paul M. Churchland and Patricia S. Churchland), Yale University Press, Yale Nota Bene Book, 2000.
[6] Ceruzzi, P. E., The Early Computers of Konrad Zuse, 1935 to 1945, Ann. Hist. Comp. Vol. 3 (1981), No. 3, 241-262; Ritchie, D., The Computer Pioneers, Chapter 3, Simon & Schuster, 1986.
[7] Hwang, K., Advanced Computer Architecture - Parallelism, Scalability, Programmability, McGraw-Hill, 1993.
[8] www.TOP500.org/lists/2003/11.
[9] Birkhoff, G., Numerical Fluid Dynamics, SIAM Review Vol. 25 (1983), 1-34.
[10] Roache, P. J., Fundamentals of Computational Fluid Dynamics, Hermosa Publishers, 1998.
[11] Special Double Issue: Grand Challenges to Computational Science, Future Generation Computer Systems 5 (1989), No. 2/3.
[12] Committee on Physical, Mathematical, and Engineering Sciences, Federal Coordinating Council for Science, Engineering, and Technology, Grand Challenges 1993: High Performance Computing and Communications, The FY 1993 U.S. Research and Development Program, Office of Science and Technology Policy, Washington, 1992.
[13] Board on Mathematical Sciences of the National Research Council (USA), The David II Report: Renewing U.S. Mathematics - A Plan for the 1990s, in: Notices of the American Mathematical Society, May/June 1990, 542-546; September 1990, 813-837; October 1990, 984-1004.
[14] Commission of the European Communities, Report of the EEC Working Group on High-Performance Computing (Chairman: C. Rubbia), February 1991.
[15] Trottenberg, U., et al., Situation und Erfordernisse des wissenschaftlichen Höchstleistungsrechnens in Deutschland - Memorandum zur Initiative High Performance Scientific Computing (HPSC), Februar 1992; published in: Informatik-Spektrum Band 15 (1992), H. 4, 218.
[16] The Congress of the United States, Congressional Budget Office, Promoting High-Performance Computing and Communications, Washington, June 1993.
[17] High-Performance Computing Applications Requirements Group, High-Performance Computing and Networking, Report, European Union, April 1994; High-Performance Networking Requirements Group, Report, European Union, April 1994.
[18] Bundesministerium für Forschung und Technologie, Initiative zur Förderung des parallelen Höchstleistungsrechnens in Wissenschaft und Wirtschaft, BMFT, Bonn, Juni 1993.
[19] www.llnl.gov/asci/.
[20] www.llnl.gov/asci-pathforward/.
[21] Smarr, L., Toward the 21st Century, Comm. ACM Vol. 40 (1997), No. 11, 28-32; Smith, Ph. L., The NSF Partnerships and the Tradition of U.S. Science and Engineering, Comm. ACM Vol. 40 (1997), No. 11, 35-37.
[22] Foster, I., and Kesselman, C. (eds.), The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, 1999.
[23] Atkins, D. E., et al., Revolutionizing Science and Engineering Through Cyberinfrastructure, National Science Foundation Report, Blue Ribbon Advisory Panel on Cyberinfrastructure, USA, January 2003.
[24] The Superperformance Computing Service, Palo Alto Management Group, Inc., Massively Parallel Processing: Computing for the 90s, SPCS Report 25, Second Edition, Mountain View, California, June 1994.
[25] Strohmaier, E., et al., The marketplace of high-performance computing, Special Anniversary Issue (G. Joubert et al., eds.), Parallel Computing Vol. 25 (1999), No. 13/14, 1517-1544.
[26] Hossfeld, F., et al., Gekoppelte SMP-Systeme im wissenschaftlich-technischen Hochleistungsrechnen - Status und Entwicklungsbedarf (GoSMP), Analyse im Auftrag des BMBF, Förderkennzeichen 01 IR 903, Dezember 1999.
[27] www.fz-juelich.de/zam.
[28] http://jumpdoc.fz-juelich.de/.
[29] Hossfeld, F., Teraflops Computing: A Challenge to Parallel Numerics?, in: Zinterhof, P., et al. (eds.), Parallel Computation, Proceedings of the 4th International ACPC Conference 1999, Salzburg, Austria, pp. 1-12; Feeding Greedies on Meager Roadmaps, in: Bubak, M., et al. (eds.), Proceedings of SGI'2000, Krakow, Poland, pp. 11-22.
[30] Semiconductor Industry Association (SIA), International Technology Roadmap for Semiconductors, 2003 Edition (ITRS 2003).
[31] Feynman, R. P., Simulating physics with computers, Int. J. Theor. Phys. Vol. 21 (1982), 467.
[32] Deutsch, D., Quantum Theory, the Church-Turing Principle and the universal quantum computer, Proc. Roy. Soc. Lond. A Vol. 400 (1985), 97.
[33] Nielsen, M. A., and Chuang, I. L., Quantum Computation and Quantum Information, Cambridge University Press, 2000.
[34] Shor, P. W., Algorithms for quantum computation: discrete logarithms and factoring, in: Proceedings 35th Annual Symposium on Foundations of Computer Science, IEEE Press, 1994; Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer, SIAM J. Comp. Vol. 26 (1997), No. 5, 1484-1509.
[35] Moret, B. M. E., and Shapiro, H. D., Algorithms from P to NP - Volume 1: Design & Efficiency, The Benjamin/Cummings Publishing Company, 1991.
[36] Penrose, R., The Emperor's New Mind, Oxford University Press, 1989.
[37] Penrose, R., Shadows of the Mind - A Search for the Missing Science of Consciousness, Oxford University Press, 1994.
[38] Penrose, R., The Large, the Small, and the Human Mind, Cambridge University Press, 1997.
[39] Locke, J., An Essay Concerning Human Understanding, Great Books in Philosophy, Prometheus Books, 1995.
[40] de La Mettrie, J. O., Machine Man (L'homme plus que machine), in: Cambridge Texts in the History of Philosophy (ed. by Ann Thomson): La Mettrie, Machine Man and Other Writings, Cambridge University Press, 1996.
[41] Descartes, R., Discours de la méthode pour bien conduire sa raison, et chercher la vérité dans les sciences; Meditationes de prima philosophia, in: Philosophische Schriften in einem Band, Felix Meiner Verlag, 1996.
[42] Leibniz, G. W., Nouveaux Essais sur L'Entendement Humain, Livre I-IV, Philosophische Schriften Band III, Wissenschaftliche Buchgesellschaft, 1985; Schriften zur Logik und zur philosophischen Grundlegung von Mathematik und Naturwissenschaft (orig. Latin), Philosophische Schriften Band IV, Wissenschaftliche Buchgesellschaft, 1992.
[43] Popper, K., and Eccles, J., The Self and Its Brain - An Argument for Interactionism, Springer Verlag, 1977.
[44] Chalmers, D. J., The Conscious Mind - In Search of a Fundamental Theory, Oxford University Press, 1996.
[45] Edelman, G. M., and Tononi, G., A Universe of Consciousness - How Matter Becomes Imagination, Basic Books, 2000.
[46] Pinker, St., How the Mind Works, Penguin Books, 1998.
[47] Singer, W., Der Beobachter im Gehirn, Essays zur Hirnforschung, Suhrkamp Taschenbuch Wissenschaft Band 1571, Suhrkamp Verlag, 2002.
[48] Roth, G., Das Gehirn und seine Wirklichkeit - Kognitive Neurobiologie und ihre philosophischen Konsequenzen, Suhrkamp Taschenbuch Wissenschaft Band 1275, Suhrkamp Verlag, 1997.
[49] Pauen, M., und Roth, G. (Hrsg.), Neurowissenschaften und Philosophie - Eine Einführung, Wilhelm Fink Verlag, 2001.
[50] Pauen, M., Grundprobleme der Philosophie des Geistes - Eine Einführung, Fischer Taschenbuch Verlag, 2001.
[51] Zoglauer, Th., Geist und Gehirn - Das Leib-Seele-Problem in der aktuellen Diskussion, Vandenhoeck & Ruprecht, 1998.
[52] Kurzweil, R., The Age of Spiritual Machines; in German: Homo s@piens, Kiepenheuer & Witsch, 1999.
[53] Moravec, H. P., Mind Children: The Future of Robots and Human Intelligence, Harvard University Press, 1990; Robot: Mere Machine to Transcendent Mind, Oxford University Press, 2000.
[54] Schirrmacher, F. (Hrsg.), Die Darwin AG - Wie Nanotechnologie, Biotechnologie und Computer den neuen Menschen träumen, Kiepenheuer & Witsch, 2001.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) © 2004 Elsevier B.V. All rights reserved.
So Much Data, So Little Time... C. Hansen a, S. Parker a, and C. Gribble a aScientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT 84112 USA Massively parallel computers have been around for the past decade. With the advent of such powerful resources, scientific computation rapidly expanded the size of computational domains. With the increased amount of data, visualization software strove to keep pace through the implementation of parallel visualization tools and parallel rendering leveraging the computational resources. Tightly coupled ccNUMA parallel processors with attached graphics adapters have shifted visualization research toward leveraging the more unified memory architecture. Our research at the Scientific Computing and Imaging (SCI) Institute at the University of Utah has focused on innovative, scalable techniques for large-scale 3D visualization. Real-time ray tracing for isosurfacing has proven to be the most interactive method for large-scale scientific data. We have also investigated cluster-based volume rendering leveraging multiple nodes of commodity components. 1. INTRODUCTION In recent years, scalable architectures and algorithms have led to unprecedented growth in computational data. The effectiveness of using such advanced hardware that produces large amounts of high-resolution data will hinge upon the ability of human experts to interact with their data and extract useful information. Needless to say, the interactive analysis of such large, and sometimes unwieldy, data sets has become a computation bottleneck in its own right. To address this challenge, the field of visualization and computer graphics has made significant advances in the area of parallel methods for both scientific visualization and image generation. 2. SCALABLE ISOSURFACING Many applications generate scalar fields ρ(x, y, z) which can be viewed by displaying isosurfaces where ρ(x, y, z) = ρ_iso. Ideally, the value for ρ_iso is interactively controlled by the user. When the scalar field is stored as a structured set of point samples, the most common technique for generating a given isosurface is to create an explicit polygonal representation for the surface using a technique such as Marching Cubes [1, 10]. This surface is subsequently rendered with attached graphics hardware accelerators such as the SGI Infinite Reality. Marching Cubes can generate an extraordinary number of polygons, which take time to construct and to render. For very large (i.e., greater than several million polygons) surfaces the isosurface extraction and rendering times limit the interactivity. In this paper, we generate images of isosurfaces directly
14 with no intermediate surface representation through the use of ray tracing. Ray tracing for isosurfaces has been used in the past (e.g. [9, 12, 21 ]), but we apply it to very large datasets in an interactive setting. It is well understood that ray tracing is accelerated through two main techniques [ 18]: accelerating or eliminating ray/voxel intersection tests and parallelization. Acceleration is usually accomplished by a combination of spatial subdivision and early ray termination [8, 5, 20]. Ray tracing has been used for volume visualization in many contexts (e.g., [8, 19, 23]). Ray tracing for volume visualization naturally lends itself towards parallel implementations [11, 14]. The computation for each pixel is independent of all other pixels, and the data structures used for casting rays are usually read-only. These properties have resulted in many parallel implementations. A variety of techniques have been used to make such systems parallel, and many successful systems have been built. These techniques are surveyed by Whitman [24]. Ray tracing naturally lends itself towards parallel implementations. The computation for each pixel is independent of all other pixels, and the data structures used for casting rays are usually read-only. Simply implementations work, but achieving scalability on large parallel resources requires careful attention to synchronization costs and resource assignment. To reduce synchronization overhead we can assign groups of rays to each processor. The larger these groups are, the less synchronization is required. However, as they become larger, more time is potentially lost due to poor load balancing because all processors must wait for the last job of the frame to finish before starting the next frame. We address this through a load balancing scheme that uses a static set of variable size jobs that are dispatched in a queue where jobs linearly decrease in size. The implementation of the work queue assignment uses the hardware fetch and op counters on the Origin architecture. This allows efficient access to the central work queue resource. This approach to dividing the work between processors seems to scale very well. Rendering a scene with a large memory footprint (rendering of isosurfaces from the visible female dataset [ 17]) uses only 2.1 to 8.4 Mb/s of main memory bandwidth. These statistics were gathered using the SGI perfex utility, benchmarked with 60 processors.
2.1. Rectilinear Isosurfacing Our algorithm has three phases: traversing a ray through cells which do not contain an isosurface, analytically computing the isosurface when intersecting a voxel containing the isosurface, and shading the resulting intersection point [16]. This process is repeated for each pixel on the screen. Since each ray is independent, parallelization is straightforward. An additional benefit is that adding incremental features to the rendering has only incremental cost. For example, if one is visualizing multiple isosurfaces with some of them rendered transparently, the correct compositing order is guaranteed since we traverse the volume in a front-to-back order along the rays. Additional shading techniques, such as shadows and specular reflection, can easily be incorporated for enhanced visual cues. Another benefit is the ability to exploit texture maps which are much larger than texture memory (typically 32MB to 256MB). 3. PARALLEL ISOSURFACING RESULTS We applied the ray tracing isosurface extraction to interactively visualize the Visible Woman dataset. The Visible Woman dataset is available through the National Library of Medicine as part of its Visible Human Project [15]. We used the computed tomography (CT) data which
was acquired in 1 mm slices with varying in-slice resolution. This data is composed of 1734 slices of 512x512 images at 16 bits. The complete dataset is 910 MBytes. Rather than downsample the data with a loss of resolution, we utilize the full resolution data in our experiments. As previously described, our algorithm has three phases: traversing a ray through cells that do not contain an isosurface, analytically computing the isosurface when intersecting a voxel containing the isosurface, and shading the resulting intersection point.
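The three phases can be summarized per ray as in the following C++ sketch. The traversal, bracketing test, trilinear root finding, and shading routines are assumed placeholders (declared but not shown); the sketch only illustrates the control flow described above, not the actual implementation.

struct Ray   { float org[3], dir[3]; };
struct Hit   { bool valid; float point[3], normal[3]; };
struct Color { float r, g, b; };

// Assumed helpers: step to the next cell along the ray, test whether a cell's
// corner values bracket the isovalue, solve rho(p(t)) = rho_iso inside the cell,
// and shade the resulting point.
bool  nextCell(const Ray&, int cell[3], float& tEnter, float& tExit);
bool  cellBracketsIsovalue(const int cell[3], float isoValue);
Hit   intersectTrilinear(const Ray&, const int cell[3], float isoValue,
                         float tEnter, float tExit);
Color shade(const Hit&);

Color traceIsosurfaceRay(const Ray& ray, float isoValue, Color background) {
    int   cell[3];
    float tEnter, tExit;
    // Phase 1: walk front to back through cells that cannot contain the surface.
    while (nextCell(ray, cell, tEnter, tExit)) {
        if (!cellBracketsIsovalue(cell, isoValue)) continue;
        // Phase 2: analytic ray/isosurface intersection inside this voxel.
        Hit hit = intersectTrilinear(ray, cell, isoValue, tEnter, tExit);
        if (!hit.valid) continue;            // grazing case: keep traversing
        // Phase 3: shade the first (front-most) intersection and stop.
        return shade(hit);
    }
    return background;                        // ray left the volume without a hit
}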
# processors    Frame rate / Speedup
   1    0.427 / 1.00     0.084 / 1.00     0.155 / 1.00     0.304 / 1.00     0.568 / 1.00
   2    0.84  / 1.97     0.17  / 1.99     0.31  / 2.00     0.60  / 1.98     1.13  / 1.98
   3    1.26  / 2.94     0.25  / 2.95     0.46  / 2.96     0.89  / 2.93     1.68  / 2.96
   4    1.67  / 3.91     0.33  / 3.96     0.62  / 3.97     1.19  / 3.92     2.24  / 3.94
   6    2.45  / 5.73     0.50  / 5.97     0.93  / 5.96     1.76  / 5.77     3.29  / 5.80
   8    3.20  / 7.50     0.67  / 7.94     1.23  / 7.93     2.32  / 7.61     4.36  / 7.67
  12    4.81  / 11.26    1.00  / 11.89    1.84  / 11.88    3.44  / 11.30    6.51  / 11.47
  16    6.38  / 14.93    1.33  / 15.84    2.45  / 15.80    4.59  / 15.08    8.64  / 15.21
  24    9.54  / 22.33    1.98  / 23.54    3.65  / 23.49    6.84  / 22.48    12.92 / 22.76
  32   12.65  / 29.61    2.63  / 31.38    4.88  / 31.47    9.12  / 29.96    17.09 / 30.10
  48   18.85  / 44.13    3.92  / 46.72    7.30  / 47.02    13.52 / 44.39    25.27 / 44.50
  64   24.73  / 57.90    5.18  / 61.78    9.64  / 62.14    17.72 / 58.19    32.25 / 56.80
  96   35.38  / 82.82    7.67  / 91.38    14.28 / 92.02    25.04 / 82.23    45.50 / 80.14
 124   43.06  / 100.79   9.73  / 115.88   18.17 / 117.08   30.28 / 99.45    57.70 / 101.63
Figure 1. Variation in framerate as the viewpoint and isovalue changes.
The interactivity of our system allows exploration of the data by interactively changing either the isovalue or the viewpoint. For example, one could view the entire skeleton and interactively zoom in and modify the isovalue to examine the detail in the toes, all at about ten FPS. The variation in framerate is shown in Figure 1. 4. INTERACTIVE VOLUME RENDERING WITH SIMIAN Simian is a scientific visualization tool that utilizes the texture processing capabilities of consumer graphics accelerators to produce direct volume rendered images of scientific datasets. The true power of Simian is its rich user interface. Simian employs direct manipulation widgets for defining multi-dimensional transfer functions; for probing, clipping, and classifying the data; and for shading and coloring the resulting visualization [6]. A complete discussion of direct volume rendering using commodity graphics hardware is given in [2]. For more on using multi-dimensional transfer functions in interactive volume visualization, see [7]. All of the direct manipulation widgets provided by Simian are described thoroughly in [6]. The size of a volumetric dataset that Simian can visualize interactively is largely dependent on the size of the texture memory provided by the local graphics hardware. For typical commodity graphics accelerators, the size of this memory ranges from 32MB to 256MB. However, even small scientific datasets can consume hundreds of megabytes, and these datasets are rapidly growing larger as time progresses. Although Simian provides mechanisms for swapping smaller portions of a large dataset between the available texture memory and the system's main memory (a process that is similar to virtual memory paging), performance drops significantly and interactivity disappears. Moreover, because the size of the texture memory on commodity graphics hardware is not growing as quickly as the size of scientific datasets, using the graphics accelerators of many nodes in a cluster-based system is necessary to interactively visualize large-scale datasets. Naturally, cluster-based visualization introduces many challenges that are of little or no concern when rendering a dataset on a single node, and there are many techniques for dealing with the problems that arise. Our goal was to create an interactive volume rendering tool that provides a full-featured interface for navigating and visualizing large-scale scientific datasets. Using Simian, we examine two approaches to cluster-based interactive volume rendering: (1) a "cluster-aware" version of the application that makes explicit use of remote nodes through a message-passing interface (MPI), and (2) the unmodified application running atop the Chromium clustered rendering framework. 5. CHROMIUM CLUSTERED RENDERING FRAMEWORK Chromium is a system for manipulating streams of OpenGL graphics commands on commodity-based visualization clusters [3]. For Linux-based systems, Chromium is implemented as a set of shared-object libraries that export a large subset of the OpenGL application programming interface (API). Extensions for parallel synchronization are also included [4]. Chromium's crappfaker libraries operate as the client-side stub in a simple client-server model. The stub intercepts graphics calls made by an OpenGL application, filters them through a user-defined chain of stream processing units (SPUs) that may modify the command stream, and finally redirects the stream to designated rendering servers. The Chromium rendering servers,
or crservers, process the OpenGL commands using the locally available graphics hardware, the results of which may be delivered to a tiled display wall, returned to the client for further processing, or composited and displayed using specialized image composition hardware such as Lightning-2 [22]. Depending on the particular system configuration, it is possible to implement a wide variety of parallel rendering systems, including the common sort-first and sort-last architectures [13]. For the details of the Chromium clustered rendering framework, see [3]. In principle, Chromium provides a very simple mechanism for hiding the distributed nature of clustered rendering from OpenGL applications. By simply loading the Chromium libraries rather than the system's native OpenGL implementation, graphics commands can be processed by remote hardware without modifying the calling application. However, for the Simian volume rendering application, Chromium does not currently provide features that sufficiently mask the underlying distributed operation and still enable the application to realize the extended functionality that we seek. OpenGL applications may still require significant modifications to effectively utilize a cluster's resources, even when employing the Chromium framework. It was necessary to implement some OpenGL functionality that was not supported by Chromium. First, we implemented the subset of OpenGL calls related to 3D textures, including glTexImage3D, which is the workhorse of the Simian volume rendering application. In addition, we also added limited support for the NVIDIA GL_NV_texture_shader and GL_NV_texture_shader2 extensions, implementing only those particular features of the extensions that are explicitly used by Simian. 6. CLUSTER-BASED VOLUME RENDERING RESULTS We experimented with two C-SAFE fire-spread datasets simulating a heptane pool fire, h300_0075 and h300_0130, that were produced by LES codes. Both of the original datasets were 302x302x302 byte volumes storing scalar values as unsigned chars. Simian is capable of rendering volumes using gradient and Hessian components in addition to scalar values. Therefore, each of the fire-spread datasets was pre-processed to include these additional components and then padded to a power of 2, resulting in two 512x512x512x3 byte volumes that store the value, gradient, and Hessian components ("vgh") as unsigned chars.1 We first attempted to visualize both the original and vgh versions of each fire-spread dataset using the stand-alone version of Simian and a fairly powerful workstation. This workstation was equipped with an Intel Xeon 2.66GHz processor, 1GB of physical memory, and an NVIDIA GeForce4 Quadro4 900 XGL graphics accelerator with 128MB of texture memory. On this machine, Simian was able to load and render the original datasets using the swapping mechanisms inherent in OpenGL. As expected, the need to swap severely penalized the application's performance, resulting in rates of 0.55 frames per second. However, the OpenGL drivers could not properly handle the large-scale vgh datasets, crashing the application as a result. Having firmly established the need for a cluster-based approach, we tested both the cluster-aware version and the Simian/Chromium combination with the vgh datasets using 8, 16, and 32 of the C-SAFE cluster nodes. The results are summarized in Table 1.
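A rough calculation, based only on the sizes quoted above, shows why a single card cannot hold the vgh volumes:

\[
512^3 \ \text{voxels} \times 3 \ \text{bytes/voxel} \approx 384\ \text{MB}
\qquad\text{versus}\qquad
302^3 \ \text{voxels} \times 1 \ \text{byte/voxel} \approx 26\ \text{MB},
\]

so each padded vgh volume is roughly three times the workstation's 128 MB of texture memory, while the original scalar volumes fit comfortably.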
Note that, with 8 and 16 rendering nodes, the vgh subvolumes distributed by the cluster-aware version are 256x256x256 bytes and 256x256x128 bytes, respectively. These subvol-
1 For the remainder of this discussion, it is assumed that the number of bytes consumed by a raw volumetric dataset is the given size multiplied by three, accounting for each of the three 1-byte components in a vgh dataset.
Rendering approach             Number of nodes    h300_0075 vgh dataset    h300_0130 vgh dataset
Cluster-aware Simian           8                  0.52                     0.58
                               16                 2.15                     1.47
                               32                 3.44                     2.87
Simian/Chromium combination    8                  0.05                     0.07
                               16                 0.04                     0.05
                               32                 0.02                     0.03
Table 1. Average frame rates (in frames per second) using various cluster configurations
umes exceed the maximum "no-swap" volume size permitted by the cluster's graphics hardware (128x128x256 bytes), so in either of these configurations, even the cluster-aware Simian must invoke its texture swapping facilities. However, with 16 rendering nodes, the swapping is much less frequent, resulting in reasonably interactive frame rates. With 32 rendering nodes, the subvolume size is reduced to 128x128x256 bytes, so no texture swapping occurs and interactive frame rates are restored. Chromium is not able to distribute subvolumes among the rendering nodes. As a result, Simian must rely on its texture swapping mechanism. When the application calls for a new block to be swapped into texture memory, Chromium must transmit the block to the appropriate rendering nodes. The resulting delays impose severe performance penalties that grow with the number of rendering nodes. This behavior is reflected in the low frame rates given in Table 1. 7. CONCLUSIONS The architecture of the parallel machine plays an important role in the success of visualization techniques. In parallel isosurfacing since any processor can randomly access the entire dataset, the dataset must be available to each processor. Nonetheless, there is fairly high locality in the dataset for any particular processor. As a result, a shared memory or distributed shared memory machine, such as the SGI Origin, is ideally suited for this application. The load balancing mechanism also requires a fine-grained low-latency communication mechanism for synchronizing work assignments and returning completed image tiles. With an attached graphics engine, we can display images at high frame rates without network bottlenecks. We have implemented a similar technique on a distributed memory machine that proved challenging. Frame-rates were, of course, lower and scalability was reduced. We have shown that ray tracing can be a practical alternative to explicit isosurface extraction for very large datasets. As data sets get larger, and as general purpose processing hardware becomes more powerful, we expect this to become a very attractive method for visualizing large scale scalar data both in terms of speed and rendering accuracy. For cluster-based volume rendering, we have demonstrated that a cluster-aware MPI version performs far better than simply replacing the underlying shared graphics library with a parallelized library. There are several advantages of developing cluster-aware visualization tools. First, these tools can exploit application-specific knowledge to reduce overhead and enhance performance and efficiency when running in a clustered environment. Second, cluster-aware applications that are built upon open standards such as MPI are readily ported to a wide variety
of hardware and software platforms. While frameworks that mask the clustered environment may provide certain advantages, using standard interfaces allows an application's components to be reused or combined in new, more flexible ways. Third, cluster-aware applications are not dependent upon the functionality provided by a clustered rendering package. Finally, because they do not access remote resources via a lower-level framework, cluster-aware applications can exploit the capabilities of the underlying system directly.
REFERENCES
[1] B. Wyvill, G. Wyvill, C. McPheeters. Data structures for soft objects. The Visual Computer, 2:227-234, 1986.
[2] M. Hadwiger et al. High-quality volume graphics on consumer PC hardware. IEEE Visualization 2002 Course Notes.
[3] G. Humphreys, Mike Houston, Yi-Ren Ng, Randall Frank, Sean Ahern, Peter Kirchner, and James T. Klosowski. Chromium: A stream-processing framework for interactive graphics on clusters. In Proceedings of SIGGRAPH, 2002.
[4] H. Igehy, G. Stoll, and P. Hanrahan. The design of a parallel graphics interface. In SIGGRAPH Computer Graphics, 1998.
[5] Arie Kaufman. Volume Visualization. IEEE CS Press, 1991.
[6] J. Kniss, G. Kindlmann, and C. Hansen. Interactive volume rendering using multi-dimensional transfer functions and direct manipulation widgets. In Proceedings of IEEE Visualization, 2001.
[7] J. Kniss, G. Kindlmann, and C. Hansen. Multi-dimensional transfer functions for interactive volume rendering. IEEE Transactions on Visualization and Computer Graphics, pages 270-285, July 2002.
[8] Mark Levoy. Display of surfaces from volume data. IEEE Computer Graphics & Applications, 8(3):29-37, 1988.
[9] Chyi-Cheng Lin and Yu-Tai Ching. An efficient volume-rendering algorithm with an analytic approach. The Visual Computer, 12(10):515-526, 1996.
[10] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. Computer Graphics, 21(4):163-169, July 1987. ACM Siggraph '87 Conference Proceedings.
[11] K.L. Ma, J.S. Painter, C.D. Hansen, and M.F. Krogh. Parallel volume rendering using binary-swap compositing. IEEE Computer Graphics and Applications, 14(4):59-68, July 1993.
[12] Stephen Marschner and Richard Lobb. An evaluation of reconstruction filters for volume rendering. In Proceedings of Visualization '94, pages 100-107, October 1994.
[13] S. Molnar, M. Cox, D. Ellsworth, and H. Fuchs. A sorting classification of parallel rendering. IEEE Computer Graphics and Applications, July 1994.
[14] Michael J. Muuss. Rt and remrt - shared memory parallel and network distributed raytracing programs. In USENIX: Proceedings of the Fourth Computer Graphics Workshop, October 1987.
[15] National Library of Medicine (U.S.) Board of Regents. Electronic imaging: Report of the Board of Regents. U.S. Department of Health and Human Services, Public Health Service, National Institutes of Health, NIH Publication 90-2197, 1990.
[16] Steven Parker, Michael Parker, Yarden Livnat, Peter-Pike Sloan, Charles Hansen, and Peter Shirley. Interactive ray tracing for volume visualization. IEEE Transactions on Visualization and Computer Graphics, 5(3):238-250, July 1999.
[17] Steven Parker, Peter Shirley, Yarden Livnat, Charles Hansen, and Peter-Pike Sloan. Interactive ray tracing for isosurface rendering. In Proceedings of Visualization '98, October 1998.
[18] E. Reinhard, A.G. Chalmers, and F.W. Jansen. Overview of parallel photo-realistic graphics. In Eurographics '98, 1998.
[19] Paolo Sabella. A rendering algorithm for visualizing 3D scalar fields. Computer Graphics, 22(4):51-58, July 1988. ACM Siggraph '88 Conference Proceedings.
[20] Lisa Sobierajski and Arie Kaufman. Volumetric ray tracing. 1994 Workshop on Volume Visualization, pages 11-18, October 1994.
[21] Milos Sramek. Fast surface rendering from raster data by voxel traversal using chessboard distance. In Proceedings of Visualization '94, pages 188-195, October 1994.
[22] G. Stoll, M. Eldridge, D. Patterson, A. Webb, S. Berman, R. Levy, C. Caywood, M. Taveira, S. Hunt, and P. Hanrahan. Lightning-2: A high-performance display subsystem for PC clusters. In SIGGRAPH Computer Graphics, 2001.
[23] Craig Upson and Michael Keeler. V-buffer: Visible volume rendering. Computer Graphics, 22(4):59-64, July 1988. ACM Siggraph '88 Conference Proceedings.
[24] Scott Whitman. A survey of parallel algorithms for graphics and visualization. In High Performance Computing for Computer Graphics and Visualization, pages 3-22, Swansea, July 3-4, 1995.
Software Technology
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) © 2004 Elsevier B.V. All rights reserved.
On Compiler Support for Mixed Task and Data Parallelism T. Rauber a, R. Reilein b, and G. Rünger b aDepartment of Mathematics, Physics, and Computer Science, University of Bayreuth, E-mail: [email protected] bDepartment of Computer Science, Chemnitz University of Technology, E-mail: {reilein,ruenger}@cs.tu-chemnitz.de The combination of task and data parallelism can lead to an improvement of speedup and scalability for parallel applications on distributed memory machines. To support a systematic design of mixed task and data parallel programs the TwoL model has been introduced. A key feature of this model is the development support for applications using multiprocessor tasks on top of data parallel modules. In this paper we discuss implementation issues of the TwoL model as an open framework. We focus on the design of the framework and its internal algorithms and data structures. As examples, fast parallel matrix multiplication algorithms are presented to illustrate the applicability of our approach. 1. INTRODUCTION Parallel applications in the area of scientific computing are often designed in a data parallel SPMD (Single Program Multiple Data) style based on the MPI standard. The advantage of this method is a clear programming model, but on large parallel platforms or cluster systems the speedup and scalability can be limited, especially when collective communication operations are used frequently. The combination of task and data parallelism can improve the scalability of many applications but requires a more intricate program development. The adaptation of complex program code to the characteristics of a specific parallel machine may be quite time consuming and often results in a code structure which causes a high reprogramming effort when porting the software to another parallel system. To support the systematic development of mixed task and data parallel programs the TwoL model has been introduced. The model provides a stepwise development process which is subdivided into several phases. Applications are hierarchically composed of predefined or user-supplied basic data parallel modules. The development starts with a specification of the parallelism inherent in the algorithm to be implemented. The specification is transformed stepwise into a coordination program by applying scheduling, static load balancing, and data distribution algorithms. During this transformation process the code is adapted to the characteristics of a specific parallel system. In this paper we present an implementation of the TwoL model as an open compiler framework (TwoL-OF). The open framework implements the core concepts of the TwoL model and additionally provides several intermediate program representations resulting within the trans-
formation process. The intermediate representations are produced in specific TwoL-OF formats to provide access interfaces for compiler users. The advantage is that the framework can be used by both application programmers and algorithm developers. Application programmers can use the framework as a transformation tool by providing a specification of the problem which is transformed into an efficient parallel program. Developers can test new scheduling and load balancing techniques by exploiting the access interfaces of the intermediate formats. The remainder of this paper is structured as follows. Section 2 gives a short overview of the TwoL model. In Section 3 the TwoL-OF compiler is introduced and the compilation process is illustrated. Section 4 presents first results with fast matrix multiplication algorithms and Section 5 concludes. 2. THE TwoL MODEL The TwoL (Two Level) model has been developed to exploit the combination of task and data parallelism [6, 7]. It defines a transformation process from a specification of the parallelism inherent in an algorithm into a coordination program for a specific parallel system. The parallelism is exploited on two levels, an upper task parallel level consisting of hierarchically structured multiprocessor tasks and a lower data parallel level. On the lower level the user provides module specifications and function implementations which declare and realize multiprocessor tasks. On the upper level consecutive transformation steps generate a coordination program from the specification of potential task parallelism. Figure 1 illustrates the derivation of the coordination program.
Figure 1. Derivation of the parallel coordination program. The specification program declares multiprocessor tasks as modules and defines their data and control dependencies. It uses a special type concept to declare data types and data distribution types. The transformation applies scheduling, load balancing, and data distribution techniques, and the decisions made are included as annotations. The final coordination program contains all information about execution order, processor group sizes, and data distributions. The entire transformation process is accompanied by a parallel cost model to guide the transformations and to obtain an efficient parallel program for a specific parallel machine. 3. TwoL-OF COMPILER STRUCTURE AND IMPLEMENTATION The TwoL open framework compiler implements the core concepts of the TwoL model, supports the guided development of mixed task and data parallel programs, and the porting to
different parallel machines. The emphasis lies on fast and easy coding support utilizing the activation of basic modules written in different imperative languages, especially C or Fortran. It offers specific interfaces to control the transformation process and to manually revise decisions made by the transformation algorithms. Therefore the intermediate code representation is explicitly stored and open for transformations and annotations.
3.1. Short overview The compilation process starts with a specification language file describing the inherent parallelism of an algorithm. The basic components of a specification are modules which can either be declared (basic modules) or defined (coordination modules). For basic modules a module name and three different types for each parameter to be passed have to be declared. Each module parameter has a data type which corresponds to a data type in the high level target language and a data distribution type which declares how the data are distributed over a processor group and which is used by redistribution functions to establish the data distribution required before starting the execution of a basic module. The third type is the I/O access mode which defines the kind of access to the parameters. Basic modules can be implemented in C or Fortran and are linked to the generated program code in the final compilation step. Coordination modules have to be declared before they can be defined by a coordinating expression which consists of module calls, data dependence operators, and control statements. These expressions are translated into syntax trees which are extended and transformed in the following compilation phases. Based on the coordinating expressions the coordination program is generated by the compiler framework as C code augmented with MPI functions and calls to a data redistribution library interface. The resulting code can be compiled and linked with the basic module implementations using a standard C compiler. The TwoL-OF compiler consists of two consecutive compiler stages. The first stage reads a specification program, generates the intermediate code representation, and outputs two control files for the next stage, one for type definitions and a second file to control tree transformations and annotations. The intermediate code representation comprises different kinds of symbol tables, specification syntax trees for the coordinating expressions, and parallel data dependence graphs to attach data flow information to the syntax trees. Between both stages the control files are augmented with additional specifications supplied by automatic tools or by the user. In the second stage after the intermediate code is read, the type definition and transformation control files are processed. The contents are used to amend symbol table entries and to transform the specification syntax trees into coordination syntax trees. C code with calls to MPI and a data redistribution library is generated after compiler passes to update the data flow information and to introduce variable copying and data redistribution. The compilation process is accompanied by the output of internal data structures as program files which can be postprocessed using the graphviz package by AT&T to generate a graphical visualization. In the next subsections we introduce the specification language and give an example for a parallel specification to illustrate the compilation process.
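To give a feel for the generated output, the following fragment sketches what a piece of coordination code of the kind described above might look like. It is an illustration only: the actual code emitted by TwoL-OF, the redistribution library interface, and all names shown here are assumptions. The sketch shows a communicator split into two concurrent processor groups followed by a redistribution call; it assumes MPI_Init has already been called.

#include <mpi.h>

// Assumed interfaces: a redistribution library call and two basic modules
// (named after the dmm/mma modules used in the example of Section 3.2).
void redistribute_block_to_cyclic(double* a, int n, MPI_Comm from, MPI_Comm to);
void dmm(double* a, double* b, double* c, int n, MPI_Comm group);
void mma(double* c, double* d, double* e, int n, MPI_Comm group);

void tmm_coordination(double* a, double* b, double* c, double* d, int n) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Fork into two processor groups, as decided by the scheduling step.
    int color = (rank < size / 2) ? 0 : 1;
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group);

    // Each group executes its multiprocessor task concurrently.
    if (color == 0) dmm(a, b, c, n / 2, group);
    else            dmm(a, b, d, n / 2, group);

    // Re-establish the data distribution required by the next module.
    redistribute_block_to_cyclic(c, n / 2, group, MPI_COMM_WORLD);

    MPI_Comm_free(&group);
}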
3.2. Specification language for multiprocessor task programs The specification language allows the expression of data dependencies between module calls by two operators, the >- operator for data dependence and the || operator for data independence. Both operators can be used in infix form or as k-ary operators. Furthermore, there are several statements to control the program flow, especially loops and conditions. The following grammar
defines the essential parts for expressing data dependencies and control flow.

cexpr      →  cexpr >- cexpr | cexpr || cexpr | || (cexpr_list) | >- (cexpr_list)
           |  stat ( cond ) { cexpr } | call
stat       →  loop | while | if
cexpr_list →  cexpr_list, cexpr | cexpr
A coordination module defines one coordinating expression (cexpr) which is recursively built up of sub-expressions. To illustrate the compilation process the standard four block matrix multiplication is used here as an example. The algorithm is defined by the following equation:
\[
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
=
\begin{pmatrix}
a_{11} b_{11} + a_{12} b_{21} & a_{11} b_{12} + a_{12} b_{22} \\
a_{21} b_{11} + a_{22} b_{21} & a_{21} b_{12} + a_{22} b_{22}
\end{pmatrix}
\]
To specify the inherent parallelism a specification program containing the following definition of a coordination module (tmm) has to be supplied by the user. The declarations of the parameter types for the basic modules used are omitted here.

compmod tmm (a, b, c, n) {
  mpart(a, a11, a12, a21, a22, n) >- mpart(b, b11, b12, b21, b22, n)
  >- || (dmm(a11, b11, c11, n/2), dmm(a12, b21, d11, n/2),
         dmm(a11, b12, c12, n/2), dmm(a12, b22, d12, n/2),
         dmm(a21, b11, c21, n/2), dmm(a22, b21, d21, n/2),
         dmm(a21, b12, c22, n/2), dmm(a22, b22, d22, n/2))
  >- || (mma(c11, d11, c11, n/2), mma(c12, d12, c12, n/2),
         mma(c21, d21, c21, n/2), mma(c22, d22, c22, n/2))
  >- mjoin(c11, c12, c21, c22, c, n)
The module mpart subdivides a matrix into four quadratic submatrices and mjoin reverses this process. Addition and multiplication of submatrices are performed by mma and dmm, respectively. This specification is processed by the first compiler stage and the coordinating expression is stored as a specification syntax tree with a linked parallel data dependence graph for expressing data flow. Figure 2 illustrates the parallel data dependence graph for the specification example as output from the framework. Together with the intermediate representation, which is stored as a binary file by the first stage, two text files for type definitions and to control transformations are generated. The type definition file is used to specify a C or Fortran data type for the abstract types used in the specification program. If memory allocation is required for a specific data type the corresponding statements can be supplied. Within the second file the user can control transformations and annotations of the specification syntax trees. In its unmodified form the transformation control file expresses the maximum degree of parallelism specified. The following code fragment shows a part of the transformation control file for the block matrix multiplication example:

par[ID34][4] {
  [0]: ID30   # mma(c11,d11,c11,n/2)
  [1]: ID31   # mma(c12,d12,c12,n/2)
  [2]: ID32   # mma(c21,d21,c21,n/2)
  [3]: ID33   # mma(c22,d22,c22,n/2)
}
Figure 2. Part of the parallel data dependence graph for block matrix multiplication. The labels enclosed in brackets and prefixed with 'ID' are unique identifiers for the nodes of the corresponding specification syntax tree. They are also displayed in the visualization of the trees and data dependence graphs. The second parameter of the par statement specifies the number of processor groups or multiprocessor tasks, respectively. The task statement groups tasks together to be executed consecutively on a specific processor group given as first parameter. Tasks would be separated by commas in that case. The number sign (#) marks the start of a comment; the comments are generated here for the sake of clarity. The following subsection presents the internal structure of the compiler framework in more detail. 3.3. Algorithms and data structures The first compiler stage of the TwoL open framework comprises a syntax-oriented translation scheme to build up specification syntax trees, the construction of symbol tables for data types, data distribution types, variables, and modules, and the creation of parallel data dependence graphs as representations of the data flow. The second stage updates the symbol tables and transforms the syntax trees using information in specific control files provided by automatic tools or directly by the user. After the required data redistribution operations have been inserted, the final coordination code is generated. A specification program which contains definitions of coordination modules by coordinating expressions is translated into syntax trees which are realized as C data structures. A tree node comprises a type, a unique identifier, pointers to child nodes, and a union to maintain type-specific data. Leaf nodes, which represent module calls, have an additional pointer to a data structure for the parallel data dependence graph. The nodes of this graph represent parameters of module calls, and an edge of the graph defines a true data dependence between two accesses to a specific variable. A parallel data dependence graph is constructed by a depth-first sweep over the corresponding specification syntax tree. If the algorithm detects an input parameter reading a particular variable it creates a new node for the parallel data dependence graph and traverses backwards a list of previously visited nodes to find an appropriate output for that variable. The search procedure takes the data dependence information stored in the specification syntax tree into account. When an output is found which does not already have a graph node, a new one is constructed and linked with the current one by pointers. These links define a true data dependency between two accesses to a variable. Specification syntax trees are also used to generate a text representation of transformation
control structures that is output to the transformation control file. For each ||-operator node the file contains a specific program structure; see, for example, the control file in Section 3.2. According to these structures the following transformation of the syntax trees is applied by the translation scheme of the second compiler stage. Each ||-operator node is replaced by a fork node that is used to insert activations of multiprocessor tasks. The data independence expressed by the ||-operator is then replaced by a fork into multiprocessor tasks. A fork node stores the number of processor groups and has new task nodes as children to define multiprocessor tasks. The former children of the ||-operator node are linked to the particular task nodes representing the processor group which will execute them in the final code. After this transformation the syntax tree defines the parallel execution order of a coordination function. The second compiler stage maintains variables and introduces copies if a variable is accessed in parallel. Based on the information contained in the parallel data dependence graphs, data redistribution operations are inserted in the trees as additional nodes. The resulting trees comprise all information needed to generate the final coordination code. Figure 3 illustrates the internal structure of the TwoL-OF compiler and the main compilation steps.
1a Parsing the specification program to build up specification syntax trees and symbol tables.
[Figure 3, left: specification language program; symbol tables (modules, data types, distribution types, variables); specification syntax tree with linked parallel data dependence graph; type definition, intermediate code representation, and transformation control programs; updated symbol tables; coordination syntax tree with linked parallel data dependence graph; C code coordination function.]
1b Construction of the parallel data dependence graphs.
1c Generation of control files and writing out of the intermediate code.
2a Processing of the intermediate code and parsing of the transformation control and type definition files to transform syntax trees and to update symbol table entries.
2b Maintenance of variable copies and data exchange operations.
2c Generation of the coordination program as C code.
Figure 3. TwoL-OF structure (left) and main compilation steps (right).
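To make the intermediate representation of Section 3.3 more concrete, the following C-style sketch shows one possible layout of a syntax tree node with the fields described there. All type and field names are illustrative assumptions, not the actual TwoL-OF declarations.

// Kinds of nodes appearing in specification and coordination syntax trees.
enum NodeType { NODE_CALL, NODE_DEP, NODE_INDEP, NODE_LOOP, NODE_IF,
                NODE_FORK, NODE_TASK };

struct DepGraphNode;          // node of the parallel data dependence graph

struct SyntaxNode {
    NodeType     type;        // module call, >- , || , loop, if, fork, task
    int          id;          // unique identifier, e.g. the ID34 labels in the control file
    SyntaxNode** children;    // pointers to child nodes
    int          numChildren;
    union {                   // type-specific data
        struct {              // leaf node: a module call with its parameters,
            const char*    moduleName;        // each linked to a dependence graph node
            DepGraphNode** params;
            int            numParams;
        } call;
        struct { int numGroups; } fork;       // number of processor groups
        struct { int group; } task;           // group that executes this task
    } u;
};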
4. EXPERIMENTS
To evaluate the TwoL-OF compiler, two versions of the four-block matrix multiplication C = A × B have been implemented: the standard algorithm with 8 matrix multiplies and 4 additions, and the Strassen scheme with 7 multiplies and 18 additions [8]. The inner multiplications of two block matrices are done using the Fox algorithm. To determine the overhead, both schemes have been implemented by hand and as a specification program supplemented with appropriate basic modules. All four versions have been measured on a Linux cluster (CLiC) built of 800 MHz
Pentium III PCs interconnected with switched Fast Ethernet. Figure 4 shows the overhead of the compiler-generated version, which is given as the ratio between its runtime and the runtime of the hand-tuned version.
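For reference, the Strassen scheme mentioned above uses the standard seven block products (a well-known formulation, reproduced here for convenience):

\[
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}), & M_2 &= (A_{21}+A_{22})\,B_{11}, & M_3 &= A_{11}(B_{12}-B_{22}),\\
M_4 &= A_{22}(B_{21}-B_{11}), & M_5 &= (A_{11}+A_{12})\,B_{22}, & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12}),\\
M_7 &= (A_{12}-A_{22})(B_{21}+B_{22}), & &
\end{aligned}
\]
\[
C_{11} = M_1+M_4-M_5+M_7,\qquad C_{12} = M_3+M_5,\qquad C_{21} = M_2+M_4,\qquad C_{22} = M_1-M_2+M_3+M_6,
\]

which accounts for the 7 block multiplies and the 18 block additions and subtractions (10 to form the operands of the products and 8 to combine them).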
[Plots: overhead factor versus processor number (16 to 256 processors) for matrix sizes 2400, 3600, and 4800; left panel: "Overhead factor for block matrix multiplication on CLiC", right panel: "Overhead factor for Strassen matrix multiplication on CLiC".]
processor number
Figure 4. Overhead factors for the standard scheme (left) and the Strassen scheme (fight) on CLiC. The results show a maximum overhead factor of 1.12 for this examples. This corresponds to first experiences made with other codes and applications. It has been observed that handtuned programs usually require less variable copies and can organize data redistribution more compact. But on the other hand they require a higher development effort and cannot benefit from the automatic detection and insertion of data redistribution operations.
5. RELATED WORK AND CONCLUSION Many research groups have developed compiler tools and parallel environments which can integrate task and data parallelism. Most of them extend Fortran or High Performance Fortran (HPF) with task parallel constructs [3, 9, 2]; see [1] for a good overview. Newer approaches include the coordination of data parallel programs written in skeleton-like or data parallel languages [5], parallel Haskell programming [10], or task and data parallel programming in Java
[4].
In this paper the TwoL open framework compiler, an implementation of the core concepts of the TwoL model, was introduced as an open platform for algorithm and application developers. We took a closer look at the structure of the framework and presented early compilation results. The average overhead factor of compiler-generated programs is reasonably small and is caused by a higher number of required variable copies in combination with a slight communication overhead. This factor is expected to be reduced by the development of optimization algorithms for message-packing and for an efficient placement of data redistribution operations. Future work will include the definition of an interface to the binary intermediate representation and support for runtime prediction.
REFERENCES
[1] H. Bal and M. Haines. Approaches for Integrating Task and Data Parallelism. IEEE Concurrency, 6(3):74-84, 1998.
[2] P. Banerjee, J. Chandy, M. Gupta, E. Hodge, J. Holm, A. Lain, D. Palermo, S. Ramaswamy, and E. Su. The Paradigm Compiler for Distributed-Memory Multicomputers. IEEE Computer, 28(10):37-47, 1995.
[3] I. Foster and K.M. Chandy. Fortran M: A Language for Modular Parallel Programming. Journal of Parallel and Distributed Computing, 25(1):24-35, 1995.
[4] F. Kuijlman, H.J. Sips, C. van Reeuwijk, and W.J.A. Denissen. A Unified Compiler Framework for Work and Data Placement. In Proc. of ASCI 2002, pages 109-115, 2002.
[5] S. Pelagatti and D.S. Skillicorn. Coordinating Programs in the Network of Tasks Model. Journal of Systems Integration, 10(2):107-126, 2001.
[6] T. Rauber and G. Rünger. The Compiler TwoL for the Design of Parallel Implementations. In Proc. of the 4th International Conference on Parallel Architectures and Compilation Techniques (PACT'96), pages 292-301. IEEE Computer Society Press, 1996.
[7] T. Rauber and G. Rünger. A Transformation Approach to Derive Efficient Parallel Implementations. IEEE Transactions on Software Engineering, 26(4):315-339, 2000.
[8] V. Strassen. Gaussian Elimination is not Optimal. Numerische Mathematik, 13:354-356, 1969.
[9] J. Subhlok and B. Yang. A New Model for Integrated Nested Task and Data Parallel Programming. In Proc. of ACM SIGPLAN PPoPP '97, pages 1-12, 1997.
[10] P.W. Trinder, H-W. Loidl, and R.F. Pointon. Parallel and Distributed Haskells. Journal of Functional Programming, 2(4/5):469-510, 2002.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) © 2004 Elsevier B.V. All rights reserved.
Distributed Process Networks - Using Half FIFO Queues in CORBA A. Amar a*, P. Boulet a, J.-L. Dekeyser a, and F. Theeuwen b aLaboratoire d'Informatique Fondamentale de Lille, Lille 1, Cité Scientifique, Bât. M3, 59655 Villeneuve d'Ascq cedex, France bPhilips ED&T / Synthesis, WAY 3.13, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands Process networks are networks of sequential processes connected by channels behaving like FIFO queues. These are used in signal and image processing applications that need to run in bounded memory for infinitely long periods of time dealing with possibly infinite streams of data. This paper is about a distributed implementation of this computation model. We present the implementation of a distributed process network by using distributed FIFOs to build the distributed application. The platform used to support this is the CORBA middleware. 1. INTRODUCTION Kahn Process Networks [6, 7] are well adapted to model many parallel applications, especially dataflow applications (signal processing, image processing). In this model, processes communicate only via unbounded first-in first-out (FIFO) queues. This model has a dataflow flavor and can express a high degree of concurrency, which makes it particularly well suited to model intensive signal processing applications or complex scientific applications. This model makes no assumption on the computation load of the different processes and thus is heterogeneous by nature. Distributed architectures provide an attractive alternative to supercomputers in terms of computation power and cost to execute such complex and computation intensive applications. The two main weak points of these architectures are their communication capabilities (relatively high latency) and often the heterogeneity of their hardware. We present in this paper a distributed implementation of the process network model on heterogeneous distributed hardware. The different processing power of the connected computers is a good support for the different computation needs of the networked processes. We have chosen to use the Common Object Request Broker Architecture (CORBA) [9] middleware to handle the communications for its interoperability properties. Indeed, each process of the process network can be written in a different language and run on different hardware, provided that these are supported by the chosen Object Request Broker (ORB). In addition to the heterogeneity, our implementation presents the following characteristics: *This work has been supported by the ITEA 99038 project, Sophocles.
• automation of data transfer between distributed processes
• dynamic and interactive linking of the processes to form the data flow
• hybrid data-driven, demand-driven data transfer protocol, with thresholds for load balancing
• the implementation was carried out so as to enable a distributed or local execution without any change to the program source.
This paper is organized as follows. In Section 2, we motivate our approach and present our implementation. Section 3 describes process network deployment and distributed execution. The transfer strategies (demand- and data-driven) are detailed in Section 4. Finally, we outline our conclusions and plans for future work in Section 5. 2. DESIGN AND IMPLEMENTATION 2.1. Related work
The Kahn process network model has been proposed by Kahn and MacQueen [6, 7] to easily express concurrent applications. Processes communicate only through unidirectional FIFO queues. Read operations are blocking. The number of tokens produced and their values are completely determined by the definition of the network and do not depend on the scheduling of the processes. The choice of a scheduling of a process network only determines if the computation terminates and the sizes of the FIFO queues. Some networks do not allow a bounded execution. Parks [ 10] studies these scheduling problems in depth. He compares three classes of dynamic scheduling: data-driven, demand-driven or a combination of both with respect to two requirements: 1. Complete execution (the application should execute completely, in particular if the program is non-terminating, it should execute forever). 2. Bounded execution (only a bounded number of tokens should accumulate on any of the queues). These two properties are shown undecidable by Buck [4] on boolean dataflow graph which are a special case of process networks. Thus they are also undecidable for the general case of process networks. Data-driven schedules respect the first requirement, but not always the second one. Demand-driven schedules may cause artificial deadlocks. A combination of the two is proposed by Parks [ 10] to allow a complete, unbounded execution of process networks when possible. In the context of a distributed execution of a process network, the process execution is inherently asynchronous. We have thus chosen a completely asynchronous scheduling: each process runs in its own thread that is scheduled by the underlying operating system. As explained by Parks and Roberts in [ 11 ], who use a similar scheduling technique, using bounded communication queues allow for a fair execution of the process network. This blocking write when the output queue is full can lead to deadlocks. Determining a priori if the queue length is large enough to avoid such deadlocks is undecidable. We provide a way for the user to modify this length at runtime.
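To make the scheduling discussion concrete, the following self-contained C++ sketch shows a bounded FIFO with a blocking read and a blocking write, and a capacity that can be changed at runtime, matching the semantics described above. It is a generic illustration, not the actual Yapi or CORBA implementation.

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Bounded FIFO with blocking read and blocking write; a full queue blocks the
// producer, which is the source of the potential artificial deadlocks discussed above.
template <typename Token>
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t capacity) : capacity_(capacity) {}

    void write(const Token& t) {                       // blocks while the queue is full
        std::unique_lock<std::mutex> lock(m_);
        notFull_.wait(lock, [&] { return q_.size() < capacity_; });
        q_.push_back(t);
        notEmpty_.notify_one();
    }

    Token read() {                                     // blocks while the queue is empty
        std::unique_lock<std::mutex> lock(m_);
        notEmpty_.wait(lock, [&] { return !q_.empty(); });
        Token t = q_.front();
        q_.pop_front();
        notFull_.notify_one();
        return t;
    }

    void setCapacity(std::size_t c) {                  // user-adjustable length at runtime
        std::lock_guard<std::mutex> lock(m_);
        capacity_ = c;
        notFull_.notify_all();                         // a larger bound may unblock writers
    }

private:
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::deque<Token> q_;
    std::size_t capacity_;
};

Running each process in its own thread on top of such queues gives exactly the fully asynchronous, operating-system-scheduled execution chosen here; enlarging a queue at runtime is the escape hatch when a bound turns out to be too small.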
Several implementations of process networks are used for different purposes: for heterogeneous modeling with PtolemyII [8], for signal processing application modeling with YAPI [5], and for metacomputing in the domain of Geographical Information Systems with Jade/PAGIS [13]. To our knowledge, only the Jade/PAGIS implementation and the one by Parks and Roberts [11] are distributed. Parks and Roberts use Java Object Serialization to automate the distribution of the network processes, while we use a central console to deploy the processes. In Jade, all communications proceed through a central communication manager, while in Parks and Roberts' implementation and in ours the processes communicate directly. This allows a greater scalability. Only our implementation allows the coupling of processes written in different languages. 2.2. Design directions The design of our distributed process network implementation was done so as to:
• enable users to simulate their network model quickly and effectively,
• keep source compatibility with the Yapi library: this library developed by Philips [5] implements the process network model for a local execution; Yapi is a C++ library which focuses on signal processing applications,
• enable the distributed and the local execution without any change to the program source.
The idea of the Yapi syntax is to group processes into process networks. The processes communicate via ports. These ports are linked by point-to-point unidirectional FIFO queues. A process network can be seen as a process and used in the same way. This hierarchical construction allows an easy modeling of complex applications. For this study, we have completely reengineered Yapi to be able to distribute any application over a CORBA bus without any change to the application code. In our implementation the communications are hidden from the programmer, who can nevertheless configure the data transfer parameters. The reader can refer to [3] for more implementation details. 3. PROCESS NETWORK DISTRIBUTION 3.1. Deployment To control the distributed process networks, a console has been developed. It consists of a program which controls the process connection and execution by the use of a simple language. It also provides a frontend used for monitoring. The presence of a manager program is contrary to the peer-to-peer character of component systems. However, the console is minimal and serves only as collaboration control. All the communications between the components are done without involving the console, through the distributed FIFO queues presented in Section 3.2. The FIFO links that form the process network are made interactively (or via a script) by this console. The use of a console allows more flexibility in the connection choice and a dynamic control of the components and the communication parameters. The use of an interactive console and the fact that the FIFOs are bounded also allow for an incremental development where computations can start even if the application is not complete. When all the output queues are full, the computation is suspended and can resume as soon as a consuming component is attached to the not-yet-connected output queues.
3.2. Distributed FIFOs
The FIFO queues are completely distributed, and distributed process networks communicate directly, without a central point, contrary to what is done in Jade [13]. These queues implement the blocking read needed by the process network model but, as they are bounded, writes may also block when the FIFO queue is full. A deadlock can appear but, as the execution is fully distributed, deadlock detection is difficult and has not been implemented. To guarantee code reuse with our implementation, the distributed FIFOs must be implemented without programmer intervention or code change. This was done by encapsulating the distributed part (the CORBA object) inside the FIFO objects. Figure 1 shows the structure of the FIFO objects. For the programmer, no difference exists between a local and a distributed FIFO queue. To determine whether a FIFO is distributed or not, the runtime uses its ports. When a FIFO is local to the program, it has both an input and an output port. On the other hand, a distributed FIFO is represented by two half FIFOs: the output FIFO queue (producer side) and the input FIFO queue (consumer side). Each half FIFO has one port (an input port for the input FIFO, and an output port for the output FIFO) and must be linked to the other half FIFO. The runtime uses this property to activate the CORBA object only on the distributed FIFOs, and thus the FIFO distribution is transparent for the programmer.
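The following simplified C++ sketch (our own illustration, not the actual implementation) shows the idea: the programmer-visible FIFO exposes the same write operation whether both ports are local or the FIFO is only the producer half of a distributed FIFO holding a reference to the remote consumer half; in the real system that reference is a CORBA object reference.

    #include <deque>
    #include <memory>

    // Sketch only: a FIFO that is either fully local or the producer half of a
    // distributed FIFO. In the real implementation the remote half is reached
    // through a CORBA object reference; here it is an abstract interface.
    struct RemoteHalfFifo {
        virtual void offer(const std::deque<int>& tokens) = 0;
        virtual ~RemoteHalfFifo() = default;
    };

    class Fifo {
    public:
        // Same call for the programmer in both cases.
        void write(int token) {
            if (!remote_) {
                local_.push_back(token);      // local FIFO: keep the token here
            } else {
                std::deque<int> one{token};   // distributed FIFO: forward to the
                remote_->offer(one);          // linked input half (remote call)
            }
        }

        // Called by the runtime (console) when the output port is linked to a
        // remote input half instead of a local input port.
        void link(std::shared_ptr<RemoteHalfFifo> consumerHalf) {
            remote_ = std::move(consumerHalf);
        }

    private:
        std::deque<int> local_;
        std::shared_ptr<RemoteHalfFifo> remote_;  // null for a purely local FIFO
    };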
Figure 1. Distributed FIFOs implementation
To exchange data between distributed FIFOs, we implement a communication protocol based on thresholds. This protocol, detailed in section 4, is a hybrid of data-driven and demand-driven transfers.
4. TRANSFER STRATEGIES
In this section, we are interested in the data transmission between FIFOs. The transfer strategies were implemented with three goals in mind: minimizing the communications between FIFOs, overlapping communications with computations, and balancing the load.
4.1. Threshold based protocol
The communication protocol governs the exchange of data between the FIFOs. A communication can be triggered by either of the two FIFOs in a hybrid data-driven, demand-driven protocol. This protocol is completely distributed: no central authority directs the communication. Each FIFO queue handles its data transmission independently of the rest of the process network. To manage the communications, two thresholds on the number of elements in the FIFO queue have been defined:
• a maximum threshold (for the producer FIFO queues) which indicates that offering a part of its tokens is necessary to avoid overflowing,
• a minimum threshold (for the consumer FIFO queues) which indicates that it is necessary to ask the linked producer half-queue for some tokens to avoid idle time.
The distributed FIFO IDL interface described below defines the methods used for initialization, for the interaction with the other FIFOs, and for the retrieval of information. This interface, which is implemented by the distributedFifo class, is not accessible to the user; it is used implicitly by a distributed FIFO queue to communicate with the other distributed FIFO queue and with the manager.

    interface fifo_base_interface {
        void link(in string refFifo);
        void unlink();
        unsigned long getLength();
    };

    interface fifo_int : fifo_base_interface {
        oneway void ask();
        oneway void full();
        boolean offer(in eltSequence buffer);
        oneway void satisfyRequest(in eltSequence buffer);
        void empty();
        void sync_ask(inout eltSequence buffer);
        boolean notify4Send();
    };

The link and unlink methods are used to create or remove a link between two FIFOs. The ask and satisfyRequest methods are used respectively by the input FIFO to ask for data, and by the output FIFO to respond to an ask. sync_ask is a synchronous version of the ask method. The full and empty methods are used to indicate the FIFO state, while notify4Send and offer are used by the output FIFO to send data to the following FIFO, respectively with and without event notification. It should be noticed that the communications are vectorized: we transfer sequences of tokens rather than one token at a time.
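As an illustration of how the two thresholds drive the protocol, the sketch below is our own pseudo-implementation in C++, not the runtime code; the Link interface stands for the remote half reached through the IDL interface above, and the threshold values are arbitrary. It shows the pre-read and post-write checks performed by the two half FIFOs.

    #include <cstddef>
    #include <deque>

    // Illustrative only: threshold-driven triggering inside the two half FIFOs.
    struct Link {                                  // stands for the remote half's stub
        virtual void ask() = 0;                                 // demand-driven request
        virtual bool offer(const std::deque<int>& tokens) = 0;  // data-driven transfer
        virtual ~Link() = default;
    };

    class InputHalfFifo {                          // consumer side
    public:
        explicit InputHalfFifo(Link& producer) : producer_(producer) {}

        int read() {
            // Pre-read processing: below the minimum threshold, ask for more data.
            if (tokens_.size() < minThreshold_) producer_.ask();
            // (The real implementation blocks here until offer()/satisfyRequest()
            //  has delivered tokens; this sketch assumes one is available.)
            int t = tokens_.front();
            tokens_.pop_front();
            return t;
        }

    private:
        Link& producer_;
        std::deque<int> tokens_;
        std::size_t minThreshold_ = 16;            // configurable by the programmer
    };

    class OutputHalfFifo {                         // producer side
    public:
        explicit OutputHalfFifo(Link& consumer) : consumer_(consumer) {}

        void write(int token) {
            tokens_.push_back(token);
            // Post-write processing: above the maximum threshold, push tokens.
            if (tokens_.size() > maxThreshold_ && peerCanReceive_) {
                peerCanReceive_ = consumer_.offer(tokens_);
                tokens_.clear();
            }
        }

    private:
        Link& consumer_;
        std::deque<int> tokens_;
        std::size_t maxThreshold_ = 256;           // configurable by the programmer
        bool peerCanReceive_ = true;               // reset when the peer asks again
    };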
4.2. Transfer policies
The next two paragraphs present the two transfer policies: demand driven with event notification, and data driven without event notification. The other two combinations are implemented but not used by the runtime (demand driven without event notification through the sync_ask method, and data driven with event notification through the notify4Send/sync_ask methods). However, the programmer can modify the transfer strategy by calling the changePolicy method. Below, we suppose that the two FIFOs F_out and F_in are linked, the first one being the output FIFO and the second one the input FIFO.
Demand driven
When the transfer is demand driven, it uses event notification. When the number of tokens becomes lower than the minimum threshold, the F_in FIFO notifies (ask) the F_out FIFO, which responds by sending data (satisfyRequest). The demand driven protocol without event notification was not used because such requests are synchronous and the process would be blocked until the requested data becomes available. With event notification, the process can continue computing if there are still some tokens in its input FIFO (F_in).
Data driven
When the transfer is data driven, it is without event notification. When the number of tokens in the FIFO F_out exceeds its maximum threshold, the FIFO sends data (offer) to the F_in FIFO. To avoid overloading the F_in FIFO, the request returns a result which indicates whether the receiver FIFO can accept further requests from the F_out FIFO. If the F_in FIFO is full, the F_out FIFO will not send any data until it receives a request for data from F_in. Since the offer requests return a result, a separate thread manages this communication mode. A data driven protocol with event notification would involve two remote requests, and using this mode together with the demand driven protocol with event notification does not guarantee the order of data reception. All the communications are hidden from the programmer; only the links between the distributed FIFOs determine the data exchange between processes. For the distributed FIFOs, the communications can be triggered in two ways:
1. Inside the process: depending on whether the distributed FIFO is an input or an output FIFO, the communication is triggered in two ways:
• when a read operation is executed on an input FIFO (figure 2), a pre-read processing is started to ask for data if needed (if the number of tokens in the FIFO is below the minimum threshold),
• when a write operation is executed on an output FIFO (figure 3), a post-write processing is started to send data to the following FIFO if needed (if the number of tokens in the FIFO is above the maximum threshold).
2. Outside the process: this is done directly by the remote invocations from the linked FIFOs, which ask for data or send data.
However, to adapt the communications to particular applications, some methods have been implemented to give the programmer the possibility to configure the communications by setting some communication parameters: the minimum and maximum thresholds, the maximal FIFO capacity, and the maximal size of the exchange buffer.
Figure 2. Read and pre-read operations in the distributed FIFO
Figure 3. Write and post-write operations in the distributed FIFO
5. CONCLUSION
We have described here our distributed implementation of process networks using CORBA. This implementation is based on the assembly of software components to form a distributed application. Each software component represents a process network, which can be hierarchical, and thus exploits the parallelism of the process network model. The choice of the CORBA architecture is motivated by the handling of hardware and software heterogeneity. The mapping and the control of the scheduling of the processes are carried out explicitly via a console interface. This makes it feasible to start an application respecting the process network model before its complete implementation. There are several important issues we plan to investigate in the near future. The most important is the dynamicity of the network. Our previous implementation of a subcase of the process network model [1, 2] provides more dynamicity: it is possible to migrate a process or to replace it by another one, and our goal is to support these two dynamicity aspects in the general model. To achieve this, two difficulties must be overcome:
• how to retrieve the internal state of the processes, and how to resume the computation,
• how to retrieve the contents of the local FIFO queues.
Moreover, we are interested in the use of the process network model as an execution model for multidimensional signal processing applications based on Array-OL [12], which is a programming language dedicated to systematic signal processing.
REFERENCES
[1] Abdelkader Amar, Pierre Boulet, and Jean-Luc Dekeyser. Assembling dynamic components for metacomputing using CORBA. In Parallel Computing 2001, Naples, Italy, September 2001. Lecture Notes in Computer Science.
[2] Abdelkader Amar, Pierre Boulet, and Jean-Luc Dekeyser. Towards distributed process networks with CORBA. Parallel and Distributed Computing Practice on Algorithms, 2003. Special issue, to appear.
[3] Abdelkader Amar, Pierre Boulet, Jean-Luc Dekeyser, and Frans Theeuwen. Distributed process networks using half FIFO queues in CORBA. Research Report RR-4765, INRIA, March 2003.
[4] Joseph T. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory Using the Token Flow Model. PhD thesis, University of California at Berkeley, 1993.
[5] E. A. de Kock, G. Essink, W. J. M. Smits, P. van der Wolf, J.-Y. Brunel, W. M. Kruijtzer, P. Lieverse, and K. A. Vissers. YAPI: Application modeling for signal processing systems. In 37th Design Automation Conference, Los Angeles, CA, June 2000. ACM Press.
[6] Gilles Kahn. The semantics of a simple language for parallel programming. In Jack L. Rosenfeld, editor, Information Processing 74: Proceedings of the IFIP Congress 74, pages 471-475. IFIP, North-Holland Publishing Co., August 1974.
[7] Gilles Kahn and David B. MacQueen. Coroutines and networks of parallel processes. In B. Gilchrist, editor, Information Processing 77, pages 993-998. North-Holland, 1977. Proc. IFIP Congress.
[8] Edward A. Lee. Overview of the Ptolemy Project. University of California, Berkeley, March 2001.
[9] Object Management Group, Inc., editor. Common Object Request Broker Architecture (CORBA), Version 2.6. http://www.omg.org/technology/documents/formal/corba_iiop.htm, December 2001.
[10] Thomas M. Parks. Bounded Scheduling of Process Networks. PhD thesis, University of California at Berkeley, 1995.
[11] Thomas M. Parks and David Roberts. Distributed process networks in Java. In International Workshop on Java for Parallel and Distributed Computing, Nice, April 2003.
[12] Julien Soula, Philippe Marquet, Jean-Luc Dekeyser, and Alain Demeure. Compilation principle of a specification language dedicated to signal processing. In Sixth International Conference on Parallel Computing Technologies, PaCT 2001, pages 358-370, Novosibirsk, Russia, September 2001. Lecture Notes in Computer Science vol. 2127.
[13] Darren Webb, Andrew Wendelborn, and Kevin Maciunas. Process networks as a high-level notation for metacomputing. In Workshop on Java for Parallel and Distributed Computing (IPPS), Puerto Rico, April 1999.
An efficient data race detector backend for DIOTA
M. Ronsse a, B. Stougie b, J. Maebe a, F. Cornelis a, and K. De Bosschere a
aELIS Department, Ghent University, Belgium
bParallel and Distributed Systems, Delft University of Technology, The Netherlands
In this paper, we describe a data race backend we developed for DIOTA. DIOTA (Dynamic Instrumentation, Optimisation and Transformation of Applications) is our generic instrumentation tool, and this tool uses so-called backends to exploit the information gathered by the instrumentation. Our novel data race backend uses innovative technologies like multilevel bitmaps, snooped matrix clocks and segment merging in order to limit the amount of memory used. The tool was implemented for Linux systems running on IA32 processors and is fully operational.
1. INTRODUCTION
The never ending urge for faster and more robust computers, combined with the existence of cheap processors, causes a proliferation of inexpensive multiprocessor machines. Multithreaded applications are needed to exploit the full processing power of these machines, causing a widespread use of parallel applications. Even most contemporary applications that are not CPU-intensive are multithreaded, because the multithreaded paradigm makes it easier to develop servers and applications with an MDI (multiple document interface) such as Windows programs, to improve responsiveness to user actions, etc. However, developing multithreaded programs for these machines is not easy, as it is harder to get a good view on the state of a parallel program. This is caused by the fact that there are a number of threads¹ running simultaneously. Moreover, the very fact that the computation is split into simultaneously executing parts can introduce errors that do not exist in sequential programs. These synchronization errors show up because parallel programs are developed in order to let a number of threads work together on the same problem; hence they will share data. It is of paramount importance that the accesses performed by the different threads are properly synchronised. A lack of synchronisation will lead to data races. More specifically, such a data race occurs when two or more concurrently executing threads access the same shared memory location in an unsynchronised way, and at least one of the threads modifies the contents of the location. As data races are (most of the time) considered bugs, they should be removed. Unfortunately, data races are difficult to find because their occurrence depends on small timing variations. Although it is possible to detect data races using a static approach, this is not feasible for nontrivial programs. Therefore, most data race detection tools detect data races dynamically during an actual program execution.
¹In this paper we will consider an execution of a parallel program as being a process consisting of N threads executing on a machine with N processors.
2. DATA RACE DETECTION
As mentioned before, a dynamic data race detector finds data races that occur during a particular execution. In order to detect the conflicting memory operations, one should collect these operations during a particular execution, together with information about their concurrency. The concurrency depends on the (order of the) executed synchronisation operations. Basically, two (dual) methods exist for detecting data races using the collected memory and concurrency information:
• for each access to a global variable, the access is compared against previous accesses by other threads to the same variable. This requires, for each global variable, information about past accesses. It is obvious that this will lead to a huge memory consumption, especially if each memory location is a potential global variable with a lifetime equal to that of the program itself, as is the case for e.g. programs written in the C language. This method can however be applied to programs with a rigorous object model that is adhered to, such as Java programs, as is shown in [4].
• for all parallel pieces of code, the memory operations are collected and compared. This method is based on the fact that all memory operations between two successive synchronisation operations (called segments from now on) satisfy the same concurrency relation: they are either all concurrent or all not concurrent with a given operation, and therefore with the segment containing the latter operation. Given the sets L(i) and S(i) containing the addresses used by the load and store operations in segment i, the concurrent segments i and j contain racing operations if and only if ((L(i) ∪ S(i)) ∩ S(j)) ∪ ((L(j) ∪ S(j)) ∩ S(i)) ≠ ∅. Therefore, data race detection basically boils down to collecting the sets L(i) and S(i) for all segments executed and comparing parallel segments. Note that (contrary to the first method) a data race will be detected at the end of the segment containing the second racing instruction, possibly a long time (e.g. at the end of the program) after the actual occurrence of the data race.
It is clear that the second method is better suited for programming languages with unconstrained lifetime of and access to shared variables. As we wanted to develop a tool that is programming language independent, we opted for this method. The remainder of this paper is organised as follows: first we will give a general overview of the infrastructure that supports our implementation of the data race detection tool that is being discussed here. Next, we will explain the paradigms and techniques that we use to detect data races and to make sure that the resulting tool can be used on a wide range of real-world applications with a reasonable slowdown factor. Finally, we will present some experimental evaluation results and the conclusions of this paper, along with our future plans.
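For concreteness, the race test between two concurrent segments can be written as follows. This is a sketch using plain address sets with illustrative names; the actual backend described in section 3.2 uses bitmaps instead.

    #include <cstdint>
    #include <set>

    // Racing-operation test between two concurrent segments i and j:
    // ((L(i) ∪ S(i)) ∩ S(j)) ∪ ((L(j) ∪ S(j)) ∩ S(i)) ≠ ∅, which reduces to
    // L(i)∩S(j) ≠ ∅  or  S(i)∩S(j) ≠ ∅  or  L(j)∩S(i) ≠ ∅.
    struct Segment {
        std::set<std::uint32_t> loads;   // L(i)
        std::set<std::uint32_t> stores;  // S(i)
    };

    static bool intersects(const std::set<std::uint32_t>& a,
                           const std::set<std::uint32_t>& b) {
        for (std::uint32_t x : a)
            if (b.count(x)) return true;
        return false;
    }

    bool race(const Segment& i, const Segment& j) {
        return intersects(i.loads, j.stores) ||
               intersects(i.stores, j.stores) ||
               intersects(j.loads, i.stores);
    }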
3. DIOTA
3.1. General overview
DIOTA (Dynamic Instrumentation, Optimization and Transformation of Applications) is implemented as a shared library for the Linux/80x86 platform. It instruments programs at the machine code level, so it is completely compiler- and language-agnostic and can also cope with
hand-written assembler code. It has support for extensions to the 80x86 ISA such as MMX, 3DNow! and SSE, and is written in an extensible and modular way so that adding support for new instructions and new types of instrumentation is easy. An environment variable is used to tell the dynamic linker to load the DIOTA library whenever a dynamically linked application is started². An init routine allows DIOTA to be activated before the main program is started, after which it can gain full control and start instrumenting. The instrumentation happens gradually as more code of the program (and of the libraries that it uses) is executed, so there is no need for complex analysis of the machine code to construct control-flow graphs or to detect code-in-data. The instrumented version of the code is placed in a separate memory region (called the "clone"), so the original program code is not touched and as such neither data-in-code nor the variable-length property of 80x86 instructions pose a problem. Once DIOTA encounters an instruction of which the successor is unknown (such as a jump with a target address that lies in a register, or a return instruction), it inserts a trampoline in the clone. Such a trampoline will pass the actual target address to DIOTA every time the instrumented version of this instruction is executed. This way, DIOTA can check whether the code there is already instrumented and, if not, do it now. Next, the instrumented version of the target code is executed. Since every exit from an instrumented block leads either to a trampoline or to another instrumented block, DIOTA remains in control. What is extremely important in the light of data race detection, however, is that during this process no data is relocated, and as such all addresses used by the program remain the same as in an uninstrumented execution, plus or minus a certain delta due to the memory occupied by the DIOTA library. As such, all accessed memory addresses can easily be recorded without the need for performing any kind of relocation afterwards. The behaviour of DIOTA can be influenced by using so-called backends. These are small (and not so small) dynamic libraries that link against DIOTA and tell it what kind of instrumentation should be performed. They can ask for any dynamically linked routine to be intercepted and replaced with a routine of their own, ask to be notified of each memory access, of each basic block that is executed and of each system call that is performed (both before and after their execution, so their behaviour can be modified as well as analysed). DIOTA can also handle multi-threaded programs and supports the instrumentation of exception handlers. Further technical details and an in-depth discussion of the issues surrounding the interception of dynamically linked routines can be found in [8].
3.2. Data race detection backend
As described above, the data race detection backend is implemented as a shared library that uses the services provided by DIOTA. The backend requests DIOTA to instrument all memory operations and to intercept all pthread synchronisation operations. The memory operations are intercepted to construct the L(i) and S(i) lists for all segments, and the synchronisation operations are intercepted to detect concurrent segments. The main problem faced by most contemporary data race detectors is the huge overhead they introduce. The fact that an execution will slow down considerably is primarily caused by the interception of all memory accesses. Indeed, for each memory access a number of operations
will have to be performed by the backend. There exist tools that try to reduce the number of memory operations that have to be checked; e.g. static analysis could reveal that certain memory regions will only be used by one thread. This is of course not possible if we perform data race detection at the lowest level (the machine instruction level, where we can encounter memory operations with unbound addresses, e.g. as a result of the use of a pointer in a C program), but it is viable for languages where access to variables is strictly controlled and there are no pointers to bypass this protection (e.g. Java [4]). Our tool throws away a small portion of all memory operations: the memory operations executed by the first thread before a second thread is created are not checked, as they cannot be part of a data race. A second reason for the large overhead is the huge memory consumption exhibited by most data race detectors. This is caused by the fact that memory operations have to be compared with memory operations executed in parallel with and in the past of this operation. Keeping track of the old memory operations and the concurrency information requires a lot of memory, and we go to great lengths to lower the memory consumption, as will be explained in detail in the next few paragraphs.
²This is the only prerequisite for DIOTA to be able to instrument a program: the program must be dynamically linked. Fortunately, this is the case for all but a few system-maintenance applications.
Collecting and comparing the addresses of the memory operations
As mentioned before, two lists of addresses are used for every segment i: the load operations are collected in L(i) and the store operations in S(i). These addresses are collected in a bitmap (one for each list). Such a bitmap contains a 1 at position i if address i was used. As a linear bitmap would require 2^32/8 = 512 MiB, a multilevel bitmap is used. At this moment we use a 9-level bitmap: the first 8 levels contain page tables while the last level contains the actual bitmaps. The highest level is addressed using the 3 highest bits of the addresses. This level contains pointers to page tables at the second level. These page tables are indexed using the next 3 bits and contain pointers to the page tables at the next level. This continues for the next 6 levels, each level indexed using the next 3 bits; the entries of the last page-table level point to the actual bitmaps, which are indexed using the remaining 8 bits of the address. Although the use of a bitmap means that some information is lost (two or more accesses to the same address count as one), this is not a problem for data race detection. Indeed, if one of these accesses is involved in a race, the subsequent accesses are also involved in a data race. The use of multilevel bitmaps makes it possible to compare the segments in an efficient way.
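A possible shape of such a multilevel bitmap is sketched below in illustrative C++ (not DIOTA's code): eight 3-bit page-table levels lead to a 256-bit leaf bitmap selected by the remaining 8 address bits, so only the parts of the address space that were actually touched consume memory.

    #include <array>
    #include <cstdint>
    #include <memory>

    // Sketch of a 9-level bitmap over 32-bit addresses.
    class MultiLevelBitmap {
        struct Node {
            std::array<std::unique_ptr<Node>, 8> child; // used on the 8 table levels
            std::array<std::uint8_t, 32> bits{};        // 256-bit leaf (last level only)
        };
        Node root_;

    public:
        // Mark address 'addr' as accessed.
        void set(std::uint32_t addr) {
            Node* n = &root_;
            for (int level = 0; level < 8; ++level) {          // eight 3-bit levels
                unsigned idx = (addr >> (29 - 3 * level)) & 0x7;
                if (!n->child[idx]) n->child[idx] = std::make_unique<Node>();
                n = n->child[idx].get();
            }
            unsigned low = addr & 0xFF;                        // remaining 8 bits
            n->bits[low >> 3] |= static_cast<std::uint8_t>(1u << (low & 7));
        }

        bool test(std::uint32_t addr) const {
            const Node* n = &root_;
            for (int level = 0; level < 8; ++level) {
                unsigned idx = (addr >> (29 - 3 * level)) & 0x7;
                if (!n->child[idx]) return false;              // never touched
                n = n->child[idx].get();
            }
            unsigned low = addr & 0xFF;
            return (n->bits[low >> 3] >> (low & 7)) & 1u;
        }
    };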
Detecting and comparing parallel segments In our race detection tool, we use a classical logical vector clock [9] to detect concurrent segments. As vector clocks are able to represent the happened-before [6] relation (they are strongly consistent), two vector clocks that are not ordered must belong to concurrent segments. This gives us an easy way to detect concurrent segments.
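A sketch of this test is given below (illustrative code, not the backend's actual data structures): with one clock component per thread and clocks of equal length, segment a happened before segment b iff clock(a) is componentwise less than or equal to clock(b) with at least one strict inequality, and two segments are concurrent iff neither happened before the other.

    #include <cstddef>
    #include <vector>

    // One entry per thread; both clocks are assumed to have the same length.
    using VectorClock = std::vector<unsigned>;

    // true if a happened before b (a <= b componentwise, with one strict <).
    bool happenedBefore(const VectorClock& a, const VectorClock& b) {
        bool strictly = false;
        for (std::size_t t = 0; t < a.size(); ++t) {
            if (a[t] > b[t]) return false;
            if (a[t] < b[t]) strictly = true;
        }
        return strictly;
    }

    // Two segments are concurrent iff their clocks are not ordered.
    bool concurrent(const VectorClock& a, const VectorClock& b) {
        return !happenedBefore(a, b) && !happenedBefore(b, a);
    }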
Detecting Obsolete Segments
For each segment i encountered during the program execution, we build two lists: L(i) and S(i). Of course, if we keep all these lists in memory, we will run out of memory. However, it is possible to throw away old information: if it is impossible that a new segment will be parallel with an 'old' segment i, we can throw away the lists L(i) and S(i), effectively freeing memory. Detecting the 'old' segments can be done using matrix clocks.
Logical matrix clocks are traditionally used to discard information in a distributed environment: the componentwise minimum of the columns yields the maximum number of segments per thread that can be discarded [14]. However, in practice we can discard more information than is indicated by logical matrix clocks, as logical clocks capture causality, which is one of the weakest forms of event ordering. In a particular execution, all events are executed in some order (not specified by the program), even if they do not have a causal relationship. Classical logical clocks are not able to capture this kind of additional execution-specific ordering. Our data race detection method uses clock snooping [5]. Instead of maintaining matrix clocks, a snooped matrix clock is built each time a segment ends, using the latest vector clock of the processors. All segments that have a vector clock smaller than the componentwise minimum of the snooped matrix clock can be discarded, as they will not be needed in the future anymore.
Merging Segments
For most parallel applications, logical matrix clocks succeed in throwing away obsolete information. However, there are applications where the segment lists never become obsolete, e.g. because there is little synchronisation between the threads. The lack of synchronisation will preclude segments from becoming causally ordered before new segments, preventing memory from being freed. It is however possible (under strict circumstances) to lower the memory consumption somewhat by combining segments. This technique [3] is based on the fact that, if all new segments will be ordered in the same way (parallel with, or ordered after) with respect to two segments i and j, we can merge the segments i and j. Merging boils down to removing the segments i and j from the list and adding a new segment i' with L(i') = L(i) ∪ L(j) and S(i') = S(i) ∪ S(j). Of course, we lose some information with merging: if a data race is detected, we no longer know if segment i or j is the cause. Although merging segments belonging to two different threads is possible, at this moment DIOTA only merges successive segments belonging to the same thread.
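In terms of the address sets, merging is simply a componentwise union, as in the following sketch (plain sets for readability; the implementation merges the bitmaps instead):

    #include <cstdint>
    #include <set>

    struct Segment {
        std::set<std::uint32_t> loads;   // L(i)
        std::set<std::uint32_t> stores;  // S(i)
    };

    // Merged segment i': L(i') = L(i) ∪ L(j), S(i') = S(i) ∪ S(j).
    // After merging we can no longer tell which original segment caused a race.
    Segment merge(const Segment& a, const Segment& b) {
        Segment m = a;
        m.loads.insert(b.loads.begin(), b.loads.end());
        m.stores.insert(b.stores.begin(), b.stores.end());
        return m;
    }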
4. EXPERIMENTAL EVALUATION
During the development of our data race backend we used a multitude of small parallel programs, and it turned out that our tool was able to find the data races we intentionally introduced. Until now, only a small number of large parallel applications have been benchmarked. One of the benchmarks we used is the Mozilla web browser, a very large, complex and multithreaded application. The table below shows some results we obtained with some benchmarks on a dual Celeron processor (500 MHz, 128 KiB cache) system running Linux kernel 2.4.19. Four different executions were tested: a normal execution, an execution under control of DIOTA but without adding instrumentation, the same execution but with instrumentation of all memory operations, and finally an execution with the full-blown data race detection. Maximum memory overhead was 3.4x. It is clear that the slowdown is rather high. Fortunately, this is not a real problem since the data race detector can run unattended, e.g. overnight. The effort invested in methods for limiting the memory consumption clearly pays off: the memory consumption is moderate, considering the information that is needed to perform data race detection.
program                        normal      DIOTA, no instrum.    DIOTA, memory instrum.   data race detection
                               exec. time   time      slowd.      time      slowd.         time      slowd.
mozilla                           7.50      35.00     4.67x      169.00     22.53x        401.00     53.47x
LU.cont -p4                       8.06       9.59     1.19x       54.15      6.47x         85.74     10.64x
fft-p4-m22                       11.47      27.59     2.41x      200.37     17.48x        393.36     34.31x
radix-p4-n4194304                 6.96      11.74     1.69x      137.39     19.73x        244.18     35.07x
cholesky-p4 inputs/tk29.o        10.43      12.84     1.23x      310.74     29.79x        581.97     55.80x
ocean-p4-n514                    15.59      17.56     1.13x      339.06     21.75x        667.14     42.80x
radiosity -p 4 -batch -room      27.50      90.14     3.28x     1157.45     42.09x       6805.61    247.49x
water-spatial < input4           30.70      52.51     1.71x      742.27     24.18x       1566.04     51.01x

Table 1. Benchmark results.
5. RELATED WORK
Although much theoretical work has been done in the field of data race detection [1, 2], few implementations for general systems have been proposed. Tools proposed in the past had limited capabilities: they were targeted at programs using one semaphore [7], programs using only post/wait synchronisation [11] or programs with nested fork-join parallelism [2, 10]. In previous work, we developed RECPLAY [12] for Solaris running on the SPARC architecture. This tool was geared exclusively at data race detection and contained an instrumentation engine because this was necessary to gather the accessed memory locations. Later, the instrumentation engine was separated under the name JiTI (Just-in-Time Instrumentation), with the data race functionality remaining as a special purpose backend. Recently, JiTI has been ported to the 80x86 architecture and has since been greatly enhanced and extended, so that another name change was advisable. The DIOTA name captures the (both current and potential) capabilities of the new framework quite well. A major difference between DIOTA and JiTI is that DIOTA also instruments code in dynamically linked libraries. Also, the data race backend of DIOTA is much more powerful than the JiTI version, e.g. segment merging was added recently. The backend is still being actively developed and improved regularly. Most of the previous work, and also our tool, is based on Lamport's happened-before relation. This relation is a partial order on all synchronisation events in a particular parallel execution. Therefore, by checking the ordering of all events and by monitoring all memory accesses, data races can be detected for one particular program execution. Another approach is taken by a more recent race detector: Eraser [13]. It goes slightly beyond work based on the happened-before relation. Eraser checks that a locking discipline is used to access shared variables: for each variable it keeps a list of locks that were held while accessing the variable. Each time a variable is accessed, the list attached to the variable is intersected with the list of locks currently held and the intersection is attached to the variable. If this list becomes empty, the locking discipline is violated, meaning that a data race occurred. Unfortunately, Eraser detects many false data races: as Eraser is not based on the happened-before relation, it has no timing information whatsoever. For instance, in theory there is no need to synchronise shared variables before multiple threads are created.
6. CONCLUSIONS AND FUTURE WORK
In this paper, we presented a tool called DIOTA and its data race backend. The tool is able to instrument complex programs and, with the help of the backend, it can do data race detection on them. The data race detection occurs by dividing the execution into segments and by keeping a history of accessed memory locations for each segment. The segments themselves are ordered through the use of logical vector clocks, which are updated by occurrences of synchronisation operations. In the future, we want to improve the speed of the data race backend in several ways. One technique is to remove redundant checks of addresses: once an address has been written to in a segment, all subsequent read and write operations to the same address are irrelevant as far as data race detection is concerned. Therefore, it is possible to greatly optimise the instrumentation. Simple testing has shown that in e.g. the Mozilla web browser, 60% of the memory accesses use the same address as the previous memory access. Also, when looking at loops, it is possible to move a lot of instrumentation code out of the loop body, e.g. the instrumentation of the write operations to loop counters. Another enhancement we propose is to treat exception handlers properly with regard to data race detection. Currently, memory operations inside exception handlers are considered as being part of the thread in which the exception occurred. However, since exceptions are raised asynchronously, they should be regarded as very short-lived separate threads instead. The latest version of DIOTA can be downloaded from http://www.elis.UGent.be/diota/.
REFERENCES
[1] S. V. Adve, M. D. Hill, and R. H. B. Netzer. Detecting data races on weak memory systems. In Proceedings of the 18th Annual Symposium on Computer Architectures, pages 234-243, May 1991.
[2] K. Audenaert and L. Levrouw. Space efficient data race detection for parallel programs with series-parallel task graphs. In Proceedings of the Third Euromicro Workshop on Parallel and Distributed Processing, pages 508-515, San Remo, January 1995. IEEE Computer Society Press.
[3] M. Christiaens, M. Ronsse, and K. De Bosschere. Bounding the number of segment histories during data race detection. Parallel Computing, 28(9):1221-1238, September 2002.
[4] Mark Christiaens and Koen De Bosschere. TRaDe, a topological approach to on-the-fly race detection in Java programs. In Java Virtual Machine Research and Technology Symposium (JVM'01), pages 105-116. USENIX, April 2001.
[5] Koen De Bosschere and Michiel Ronsse. Clock snooping and its application in on-the-fly data race detection. In Proceedings of the 1997 International Symposium on Parallel Algorithms and Networks (I-SPAN'97), pages 324-330, Taipei, December 1997. IEEE Computer Society.
[6] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558-565, July 1978.
[7] H. I. Lu, P. N. Klein, and R. H. B. Netzer. Detecting race conditions in parallel programs that use one semaphore. Technical report, Brown University, 1993.
[8] J. Maebe, M. Ronsse, and K. De Bosschere. DIOTA: Dynamic instrumentation, optimization and transformation of applications. In M. Charney and D. Kaeli, editors, Compendium of Workshops and Tutorials Held in Conjunction with PACT'02: Intl. Conference on Parallel Architectures and Compilation Techniques, Charlottesville, VA, September 2002.
[9] Friedemann Mattern. Virtual time and global states of distributed systems. In Cosnard, Quinton, Raynal, and Roberts, editors, Proceedings of the Intl. Workshop on Parallel and Distributed Algorithms, pages 215-226. Elsevier Science Publishers B.V., North-Holland, 1989.
[10] John M. Mellor-Crummey. On-the-fly detection of data races for programs with nested fork-join parallelism. In Proceedings of Supercomputing '91, pages 24-33, November 1991.
[11] Robert H. B. Netzer and Barton P. Miller. On the complexity of event ordering for shared-memory parallel program executions. Intl. Conference on Parallel Processing, pages 93-97, August 1990.
[12] Michiel Ronsse and Koen De Bosschere. RecPlay: A fully integrated practical record/replay system. ACM Transactions on Computer Systems, 17(2):133-152, May 1999.
[13] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems, 15(4):391-411, November 1997.
[14] G. T. J. Wuu and A. J. Bernstein. Efficient solutions to the replicated log and dictionary problems. In Proc. 3rd ACM Symp. on Principles of Distributed Computing, pages 233-242, New York, 1984. ACM Press.
Pipelined parallelism for multi-join queries on shared nothing machines
M. Bamha a and M. Exbrayat a
aLaboratoire d'Informatique Fondamentale d'Orléans, Université d'Orléans, BP 6759, 45067 Orléans cedex 2, France
{mostafa.bamha,matthieu.exbrayat}@lifo.univ-orleans.fr
The development of scalable parallel database systems requires the design of efficient algorithms, especially for join, which is the most frequent and expensive operation in relational database systems. Join is also the operation most vulnerable to data skew and to the high cost of communication in distributed architectures. Moreover, for multi-join queries, the problem of data skew is more complicated because the imbalance of intermediate results is unknown during static query optimization. In this paper, we show that the join algorithms we presented in our earlier papers can be applied efficiently in various parallel execution strategies, making it possible to exploit not only intra-operator parallelism but also inter-operator parallelism. These algorithms minimize the communication and synchronization costs while guaranteeing a perfect load balancing during each stage of the join computation, even for highly skewed data.
1. INTRODUCTION
Join is the most frequent operation of parallel relational database management systems (PDBMS). It is also the most expensive one, due to its vulnerability to data skew and to the high cost of communications in shared nothing architectures. Research has shown that join is parallelizable with near-linear speed-up on Shared Nothing machines only under ideal balancing conditions. Data skew can have a disastrous effect on performance [1] due to the high costs of communications and synchronizations in this architecture. Many algorithms have been proposed to handle data skew for a simple join operation, but little is known for the case of complex queries leading to multi-joins [2, 3, 4]. In particular, the performance of PDBMS has generally been estimated on queries involving only one or two joins [5]. However, the problem of data skew is more acute with multi-joins because the imbalance of intermediate results is unknown during static query optimization [6]. Such algorithms are not efficient for many reasons. First, they are not scalable (and thus cannot guarantee linear speedup) because their routing decisions are generally performed by a coordinator processor while the others are idle. Second, they cannot solve load imbalance problems as they base their routing decisions on incomplete or statistical information. Finally, they poorly handle data skew because data redistribution is generally based on hashing data into buckets, while hashing is known to be inefficient in the presence of high frequencies [2]. On the contrary, the join algorithms we presented in [7, 8, 9] use total data-distribution information in the form of histograms. In this paper, we show that these join algorithms can be
applied efficiently in various parallel execution strategies, making it possible to exploit not only intra-operator parallelism but also inter-operator parallelism. These algorithms are used in order to avoid the effect of load imbalance due to data skew, and to reduce the communication costs due to the redistribution of intermediate results, which can lead to a significant degradation of performance. The organization of this paper is as follows: we first recall the notions of PDBMS, join operations and load balancing in section 2. We then present parallel execution strategies for multi-join queries in section 3, and propose and discuss the use of pipelining strategies together with the Osfa_join algorithm in section 4. We finally sum up in section 5.
2. PDBMS: JOIN OPERATIONS AND LOAD BALANCING
2.1. PDBMS, join operations and data skew
The join (or equi-join) of two relations R and S on attribute A of R and attribute B of S (A and B of the same domain) is the relation T, written R ⋈ S, containing the pairs of tuples from R and S verifying R.A = S.B. The semi-join of S by R is the relation S ⋉ R composed of the tuples of S which occur in the join of R and S. Semi-join reduces the size of the relations to be joined, and R ⋈ S = R ⋈ (S ⋉ R) = (R ⋉ S) ⋈ (S ⋉ R). In PDBMS, relations are generally partitioned among nodes by horizontal fragmentation according to the values of a given attribute. Three methods are used [10]: hash partitioning, block partitioning (also called range partitioning) and cyclic (also called round-robin) partitioning. The main parallel join algorithms are: Sort-merge join, Simple-hash join, Grace-hash join and Hybrid-hash join [11]. All of them are based on hashing functions which redistribute relations so that tuples having the same attribute value are forwarded to the same node. Local joins are then computed and their union provides the output relation. Their main disadvantage is to be vulnerable to both attribute value skew (imbalance of the output of the redistribution phase) and join product skew (imbalance of the output of local joins) [4, 6]. The former affects immediate performance and the latter affects the efficiency of the output of pipelined operations in the case of a multi-join.
2.2. Load balancing in parallel join operations
The authors of [4] have identified the two best proposed solutions in conventional and sampling-based parallel join algorithms. All are based on the notion of parallel hashing. In the first category, the extended adaptive load balancing parallel hash join [3] sends all tuples with the same attribute value to the same node. As a result, the algorithm may fail to balance the load when few values have a large weight. Moreover, the algorithm ignores the attribute value skew (AVS) of the probe relation and the join product skew (JPS). In the second category of algorithms, virtual process partitioning [2] improves on the previous category but only handles AVS for the build relation. It also ignores JPS. To improve on these methods, the authors of [4] have introduced an algorithm that minimizes the expected AVS by statistically predicting the result of hashing. However, it still fails to correct JPS. The algorithm is based on hashing, and therefore sensitive to the same effect as earlier methods (an important skew will anyway lead to JPS). We conclude that all existing methods are sensitive to imbalance when applied multiple times because of JPS. To address this problem, we introduced in [7] a deterministic data redistribution algorithm
with near-perfect balancing properties, that dynamically computes exact frequency histograms to avoid JPS. A scalable and portable cost analysis was made with the BSP model [12], leading to general predictions about the effect of relation histograms on performance. The analysis suggests a hybrid frequency-adaptive algorithm (the fa_join algorithm) [7], dynamically combining histogram-based balancing with standard hashing methods. The fa_join algorithm avoids the slowdown usually caused by AVS and by the imbalance of the size of the local joins processed by the standard algorithms. It handles skew at the cost of extra processing time. We analyzed this overhead both theoretically and experimentally and concluded that it never penalizes overall performance, even in the absence of skew. However, the performance of fa_join is sub-optimal when computing the join of highly skewed relations, because of unnecessary redistribution and communication costs. We introduced in [9] a new parallel algorithm called Osfa_join (Optimal symmetric frequency adaptive join algorithm) to perform such joins. Osfa_join improves on fa_join by its optimal complexity and by the use of semi-joins. Its predictably low JPS and AVS make it suitable for multi-join queries. Its analysis in the BSP model predicts a linear speedup and an optimal complexity even for highly skewed data.
3. PARALLEL EXECUTION STRATEGIES FOR MULTI-JOIN QUERIES
Several strategies have been proposed for the evaluation of multi-join queries [5]. These strategies generally depend on the parallel query execution plan. They can use both intra-operator and inter-operator parallelism, and mostly differ in the way they allocate simple joins to different processors and the way they choose an appropriate degree of parallelism (i.e. the number of processors used to compute each simple join). They can be divided into four main categories, presented hereafter.
3.1. Sequential parallel execution
Sequential parallel execution is the simplest strategy for evaluating, in parallel, a multi-join query. This strategy does not require inter-operator parallelism, since the simple joins of the query are evaluated one after the other in a parallel way. It does not induce any pipelined parallelism, owing to the fact that a simple join cannot be started until all its operands are entirely available. The execution time of each join is then the execution time of the slowest processor. To reach acceptable performance, the join algorithms used in this strategy should reduce the load imbalance among all the processors. Osfa_join can be used there to reduce the idle time, owing to the fact that it is insensitive to AVS and JPS, and was also proved to guarantee a linear speedup.
3.2. Parallel synchronous execution
In addition to intra-operator parallelism, parallel synchronous execution uses inter-operator parallelism [13]: several simple joins can be computed simultaneously on disjoint sets of processors. The main difficulty then lies in the allocation of simple joins to the available processors and in the choice of an appropriate degree of parallelism for each join, so that i) the global execution time of each operator should be of the same order, to avoid the latencies induced by the imbalance in computations of the different groups of processors, and ii) the execution time of each operator of a given group of processors (i.e. of a given parallel join) must also be of the same order, as the execution time for each join is the execution time of its slowest processor.
The load balancing between the groups of processors depends on the effectiveness of the dynamic balancing strategy used at the time of the query optimization, whereas the load balance of the processors in the same group can be carried out by using the join algorithms we introduced.
3.3. Segmented right-deep execution
Contrary to a parallel synchronous strategy, a segmented right-deep execution [14] uses, in addition to intra-operator parallelism, pipelined inter-operator parallelism. Pipelined inter-operator parallelism is used in the evaluation of the right branches of the query tree. Osfa_join could still be used there. Nevertheless, the initial version of Osfa_join computes joins sequentially, one after the other, and thus has to be modified to handle pipelining efficiently.
3.4. Full parallel execution
Full parallel execution [5] is simply a combination of the preceding strategies, that uses intra-operator, inter-operator, and pipelined inter-operator parallelism. In this strategy, all the simple joins that compose the multi-join query are computed simultaneously in parallel using disjoint sets of processors. Inter-operator parallelism and pipelined inter-operator parallelism are exploited according to the type of the query tree. Its effectiveness depends on the quality of the execution plan. The initial version of Osfa_join is not really adapted to full parallelism (just like it was not really adapted to segmented right-deep execution), and thus has to be modified to handle pipelining. In the following section we will present a detailed approach to introduce pipelining in Osfa_join.
4. PIPELINING OSFA_JOIN: CAN WE PIPELINE A DYNAMIC ALGORITHM?
4.1. Principle
Pipelining has been successfully implemented with many classical join algorithms. Nevertheless, these algorithms are generally based on hash join techniques and are thus sensitive to AVS and JPS, which are very likely to occur in a multi-join query. For this reason we propose to adapt Osfa_join (which is insensitive to data skew) to pipelined joins. This task in fact appears to be difficult. Most pipeline strategies use a static execution plan. The way data is distributed for each operator is fixed at query compilation time. When one processor produces a result tuple, this latter can be immediately transmitted to and used by the next operator, which does not need to have a global view of the whole set of manipulated tuples. This can be repeated over several consecutive join operators. Long pipeline chains can be constituted, their length being mainly limited by processor availability. On the contrary, Osfa_join is a dynamic operator: the definitive distribution of data is determined at execution time, once histograms have been built, i.e. after the former operator has finished. This strongly limits the use of pipelines. We thus propose to use two-operator pipeline chains. The first enhancement consists in parallelizing the creation of histograms for source relations (i.e. using inter-operator parallelism). This means that we organize the execution plan so that as many join operators as possible apply either on two source tables, or on one source table and an intermediate result. In this case we can build the histograms for source tables as soon as the
query execution begins. This strategy is very similar to the parallelized build phase that is used for classical right-deep trees. We then build the histograms for computed (intermediate) relations as their tuples are produced. In other words, we interlace the tuple production phase of a join operator with the histogram building phase of the next. This way, we do not need to re-scan the results from disk. Once the histograms are produced for both tables, we can compute the communication template, then distribute data, and finally compute the join. Unfortunately, computing the communication template is the implicit barrier within the execution flow that prohibits the use of pipeline chains. From the implementation point of view, we must notice that we do not really compute the local histogram on the fly (i.e. each time a tuple is produced), but rather wait for a sufficient number of tuples to be available to compute the corresponding histogram.
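A sketch of this batched construction is given below (illustrative C++; the names and types are ours, not the prototype's): tuples produced by the current join are buffered and folded into the local histogram of the next join once enough of them are available.

    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    // Fold produced tuples into the local histogram of the next join by batches.
    using JoinKey   = long long;
    using Histogram = std::unordered_map<JoinKey, std::size_t>;  // value -> frequency

    class HistogramBuilder {
    public:
        explicit HistogramBuilder(std::size_t batchSize) : batchSize_(batchSize) {}

        // Called each time the current join produces a result tuple.
        void onTupleProduced(JoinKey key) {
            batch_.push_back(key);
            if (batch_.size() >= batchSize_) flush();
        }

        // Called when the current join terminates, before the histogram is
        // broadcast to the other nodes.
        Histogram finish() {
            flush();
            return hist_;
        }

    private:
        void flush() {
            for (JoinKey k : batch_) ++hist_[k];
            batch_.clear();
        }

        std::size_t batchSize_;
        std::vector<JoinKey> batch_;
        Histogram hist_;
    };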
4.2. Detailed algorithm
Let us consider that each join is of the form Ri ⋈ Si, where Ri is a source relation and Si is either a source relation (which should only occur for the first join) or an intermediate result. The execution tree consists of t join operators; the final result is stored as S(t+1). Let us call Rij the fragment of Ri which is stored on processor j. We use n nodes numbered 1 to n. The global algorithm is then:

Algorithm 1: Pipelined Osfa_join
For each join Ji = (Ri ⋈ Si)
  Par (on each node j ∈ [1, n])
    create the local histogram Hist(Rij) for Rij
    broadcast Hist(Rij) to the other nodes, and collect all Hist(Rik), k ≠ j
    build the corresponding part of the global histogram Histj(Ri) for Ri
      (each node being assigned a given set of values)
    If i = 1 (first join) Then
      create the local histogram Hist(Sij) for Sij
    Else
      wait for Hist(Sij) to be completed by the preceding join phase
    Endif
    broadcast Hist(Sij) to the other nodes, and collect all Hist(Sik), k ≠ j
    build the global histogram Histj(Si) for Si
    compute the communication templates according to Histj(Si) and Histj(Ri) (see [7, 8, 9])
    broadcast (and collect) the communication templates
    exchange data across the nodes according to the communication templates
    If i = t (last join) Then
      compute S(i+1)j = Rij ⋈ Sij
    Else
      compute S(i+1)j = Rij ⋈ Sij and build Hist(S(i+1)j) on the fly
    Endif
  Endpar
Endfor
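The step of Algorithm 1 that builds each node's part of the global histogram can be sketched as follows (illustrative C++ with the data exchange abstracted away; the hash-based ownership rule is our assumption for the sketch, not necessarily the rule used by Osfa_join):

    #include <cstddef>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    using JoinKey   = long long;
    using Histogram = std::unordered_map<JoinKey, std::size_t>;

    // Simple ownership rule for the sketch: node j owns key k if hash(k) % n == j.
    bool ownedBy(JoinKey key, std::size_t node, std::size_t n) {
        return std::hash<JoinKey>{}(key) % n == node;
    }

    // Build Histj(Ri) on node 'node' from the n collected local histograms,
    // keeping only the attribute values this node is responsible for.
    Histogram buildGlobalPart(const std::vector<Histogram>& localHists,
                              std::size_t node) {
        Histogram global;
        const std::size_t n = localHists.size();
        for (const Histogram& h : localHists)
            for (const auto& [key, count] : h)
                if (ownedBy(key, node, n))
                    global[key] += count;
        return global;
    }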
4.3. Example
Figures 1 and 2 represent an annotated query execution tree of a 3-join query T1 ⋈ T2 ⋈ T3 ⋈ T4. Source tables are placed on the leaves. Numbers indicate the relative order of operations. Referring to the general algorithm, we can identify T1, T3 and T4 respectively with R1, R2 and R3, and T2 with S1. Figure 1 illustrates an execution where operators are executed sequentially (i.e. a basic use of Osfa_join, where we do not use inter-operator parallelism). We first build the histograms for T1 and T2, then compute the communication templates, distribute data, and execute the join. Once the join is over, we build the histograms for T3 and T1 ⋈ T2, and so on. Figure 2 illustrates a pipelined execution of the query tree. Building the histograms of all source tables is executed in parallel. As soon as histograms are available for T1 and T2, the communication template of the first join is computed, then data is transferred, and finally the join is executed. During this last operation, result tuples are scanned in order to produce the histogram of T1 ⋈ T2. Once this phase is over, we can compute the communication template for the second join, and so on.
Figure 1. Standard execution of a multi-join query using Osfa_join. (Annotated query tree over T1, T2, T3, T4: 1. build Hist(T1) and Hist(T2); 2. compute comm. templates, exchange data, join T1 and T2 into S2; 3. build Hist(T3) and Hist(S2); 4. compute comm. templates, exchange data, join T3 and S2 into S3; 5. build Hist(T4) and Hist(S3); 6. compute comm. templates, exchange data, join T4 and S3 into S4 (result).)
Figure 2. Pipelined execution of a multi-join query using Osfa_join. (Annotated query tree over T1, T2, T3, T4: 1. build Hist(T1), Hist(T2) and Hist(T4); 2. compute comm. templates, exchange data, join T1 and T2 into S2, and build Hist(S2) on the fly; 3. compute comm. templates, exchange data, join T3 and S2 into S3, and build Hist(S3) on the fly; 4. compute comm. templates, exchange data, join T4 and S3 into S4 (result).)
4.4. Discussion
We showed in section 4.1 that Osfa_join could hardly be pipelined in a long chain. Nevertheless, our two-operator pipeline achieves several enhancements. We first use a parallel construction of histograms for source relations. Moreover, we combine the actual computation of joins with the production of the next join's histogram, thus limiting the number of accesses to data (and to disks).
53 Conceming the parallel construction of histograms for source relations, we can notice that the degree of parallelism might be limited by two factors: the total number of nodes available, and the original distribution of data. A simultaneous construction of two histograms on the same node (which occurs when two relations are distributed, at least partially, over the same nodes) would not be really interesting compared to a sequential construction. This inter-processor parallelism does not bring acceleration, but should not induce noticeable slowdown: histograms use to be small, and having several histograms in memory would not necessitate swapping. On the other hand, as relations use to be much bigger than the available memory, we have to access them by blocks. As a consequence, accessing one or several relations does not really matter. Our pipeline strategy will really be efficient if join operators are executed on disjoint (or partially disjoint) sets of processors. Intra-operator parallelism is thus limited, and we have to segment our query trees, similarly to segmented fight-deep trees, each segment (i.e. a set of successive joins) being started when the former is over. These two limitations (degree of parallelism of operators and number of "pipelined" operators) would naturally suppose some minor modifications of the general algorithm presented above. 5. CONCLUSION In this paper we overviewed the existing parallel join algorithms, and the execution strategies of multi-joins queries on shared-nothing machines. We have shown that the algorithms, we introduced in earlier papers can be applied efficiently in different parallel strategies. We proposed a pipelined version of the Os f a _ j o i n algorithm. Our approach consists of two elements: parallelizing the scan of source tables and scanning intermediate results at production time, thus limiting disk accesses. We show that long pipeline chains can hardly be constructed due to the dynamic nature of O s f a _ j o in. We are currently conducting tests that tend to validate the superiority of our approach compared to standard hash-join pipeline algorithms. The pipelined version of O s f a _ j o i n is currently implemented within a parallel database prototype, in the framework of the CARAML ~ Grid project. REFERENCES
¹ CARAML project URL: http://www.caraml.org
Towards the Hierarchical Group Consistency for DSM systems: an efficient way to share data objects

L. Lefèvre and A. Bonhomme

INRIA / LIP (UMR CNRS, INRIA, ENS, UCB), Ecole Normale Supérieure de Lyon, France
[email protected]

We present a formal and graphical comparison of consistency models for Distributed Shared Memory systems, based on the programmer's point of view. We conclude that most consistencies can be categorized into 3 models depending on their degree of flexibility (none, 1 or 2). We propose a new consistency model that provides these 3 flexibility degrees, the Hierarchical Group Consistency (HGC), and present its deployment inside the DOSMOS DSM system.

1. INTRODUCTION
Distributed Shared Memory (DSM) systems are now a well recognized alternative for the deployment of a large class of applications. Their main challenge is to manage data consistency while maintaining good performance. During the last decade, many consistency models have been proposed [2, 3, 7, 10]. However, their definitions are made from the designer's or hardware point of view. Each definition is thereby often dependent on the DSM designer's context. With large scale clusters (hundreds or thousands of nodes), DSM systems have to face the scalability problem: how can we provide scalable solutions for applications needing a virtual shared space with a large number of computing nodes? In particular, one issue that DSM systems have to address concerns minimizing replication (i.e. reducing the number of messages required to keep the memory consistent) and maximizing availability (i.e. increasing the number of local accesses) of shared data (objects). Our goal is to propose a new taxonomy of consistency models that can be used by a programmer who deals with the implementation of a distributed application on top of a cluster of machines. This taxonomy is programmer-centric. For this, we use the framework introduced by Hu et al. [9], and we show that from the programmer's point of view there are only 3 main consistency models; the other ones are just various implementations of those models. Based on this observation, we propose a graphical representation of those models. This representation is based on different degrees of flexibility of the consistency: flexibility about the consistent moment (like Release Consistency) or about the consistent data (like Entry Consistency). This representation outlines that, so far, there is no consistency model with flexibility about the consistent processes. Hence, we suggest the Hierarchical Group Consistency (HGC). Consistency is only maintained inside a group of processes rather than between all processes. Furthermore, for each group, various consistency rules can be applied, depending on the type of sharing performed
inside a group. This group structure is particularly interesting for heterogeneous clusters and allows the programmer to adapt the consistency management to the application (depending on the sharing degree, for example) and to the execution cluster (depending on the communication performance, for example). The Hierarchical Group Consistency has been implemented in the DOSMOS system [5, 12]. Like most DSM systems based on weak consistency, DOSMOS provides Acquire and Release operations to delimit critical sections. However, with HGC, DOSMOS also allows the programmer to specify which data to share between which processes. This paper is organized as follows: Section 2 briefly presents the programmer-centric framework used to define consistency models and introduces a formal comparison of existing models. Section 3 is devoted to a graphical 3D representation of the 3 consistency models. Section 4 presents the Hierarchical Group Consistency model. Section 5 describes the hierarchical group deployment in the DOSMOS system. Section 6 concludes this paper and presents future directions.

2. THREE MEMORY CONSISTENCY MODELS FROM THE PROGRAMMER'S POINT OF VIEW
Research on data consistency has always been a hot topic because of its central position in areas like parallelism, operating systems and distributed systems. However, although many consistencies have been proposed for DSM systems, the formal context of the consistency concept has not been clearly specified. As a consequence, comparing consistencies remains a difficult challenge [1, 8]. In this section, we focus on the consistency concept from the programmer's point of view. That means that a consistency model is defined based on how it is perceived by the programmer rather than how it is managed or implemented by the system. We base our work on the formal framework introduced by Hu et al. [9] in order to compare different models. Section 2.1 introduces this framework and defines the best known consistencies within it. Section 2.2 shows how to compare these models with this framework.

2.1. Formal definition of consistency models
In their article [9], Hu et al. define a memory model as follows:

Definition 1. A memory consistency model M is a pair (C_M, SYN_M) where C_M is the set of possible memory accesses (read, write, synchronization) and SYN_M is an inter-process synchronization mechanism used to order the execution of operations from different processes. The execution order of synchronization accesses determines the order in which memory accesses are perceived by a process. Accordingly, for each program, there are several possible executions. A program execution is defined as follows.
Definition 2. An execution of the program PRG under consistency model M, denoted E_M(PRG), is defined as an ordering of the synchronization operations of the program. With the ordering of the synchronization operations, the execution of all related operations is also ordered. Thus, we define the synchronization order of an execution.

Definition 3. The synchronization order of an execution E_M(PRG) under consistency model M, denoted SO_M(E_M(PRG)), is defined as the set of ordinary operation pairs ordered by the synchronization mechanism SYN_M of M.
Hence, for any consistency model M, we can define C_M and SO_M(E_M(PRG)). C_M deals with how the programmer has to program, and SO_M gives the rules used to generate the result. The basic Atomic Consistency (AC), the Sequential Consistency (SC) [11], the Release Consistency (RC) [7], the Lazy Release Consistency (LRC) [1], the Eager Release Consistency (ERC) [6], the Entry Consistency (EC) [3] and the Scope Consistency (SsC) [10] can all be defined within this framework. Furthermore, based on those definitions, we can easily see that RC, LRC and ERC have the same definition. That means that LRC and ERC are different implementations of the RC model: from the programmer's point of view, the results are the same for both implementations. Similarly, we state that SsC is a particular case of EC. As a result, in the following, we only consider the AC, SC, RC and EC models. Finally, Hu et al. define a correct program as follows:
Definition 4. A program PRG is said to be correct for the consistency model M iff, for any possible execution E_M(PRG), all ordinary conflicting access pairs are ordered either by the program order (PO) or by the synchronization order of the execution SO_M.
2.2. Formal comparison of consistency models
In order to compare consistency models, we define the concept of model equivalence.

Definition 5. M1 and M2 are said to be equivalent iff:
a) C_M1 = C_M2,
b) a correct program PRG for M1 is also correct for M2,
c) and two compatible executions E_M1(PRG) and E_M2(PRG) give the same result. E_M1(PRG) and E_M2(PRG) are said to be compatible executions if there do not exist two synchronization operations (u, v) such that (u, v) ∈ E_M1(PRG) and (v, u) ∈ E_M2(PRG).

Theorem 1. The Atomic Consistency model and the Sequential Consistency model are equivalent.
Proof: By definition, C_AC = C_SC. Furthermore, AC and SC provide a global order for all read/write operations. Consequently, they are both correct for any program. Finally, for any execution E_AC, all accesses are ordered, whereas E_SC only orders operations concerning data accessed several times during the execution. Thus, E_SC is included in E_AC. The remaining operations that are in E_AC but not in E_SC concern distinct data, so the result is the same whatever the execution order. AC and SC are therefore equivalent. More generally speaking, we refer to them as the Strong Consistency model.
Theorem 2. The Release Consistency model and the Entry Consistency model are not equivalent.
Proof: Let us give a counterexample. Consider the program and execution shown here. If, for EC, x is not associated with l1, then PRG is correct for EC (accesses to x are ordered by the synchronization order) and gives the result (a = 0, b = 1). However, for RC, PRG is also correct, but the result is (a = 1, b = 1). Thus EC and RC are not equivalent.
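The program and execution referred to in the proof are not reproduced in this extract. A program of the following shape is consistent with the stated results and only illustrates the argument; the lock API (acquire/release) and the association of y, but not x, with lock l1 are assumptions made for this sketch, not the paper's original code.

    // Two processes sharing x and y, both initially 0. Under EC only the data
    // associated with l1 (here y) is made consistent at acquire time; under RC
    // all shared data is made consistent.
    class Counterexample {
        static int x = 0, y = 0;          // shared variables
        static int a, b;                  // results read by P2

        static void p1() {
            acquire("l1");
            x = 1;                        // x is NOT associated with l1
            y = 1;                        // y IS associated with l1
            release("l1");
        }

        static void p2() {                // runs after P1's release
            acquire("l1");
            a = x;                        // EC: may still read 0; RC: reads 1
            b = y;                        // both models: reads 1
            release("l1");
        }

        // Placeholder lock operations; a real DSM would provide these primitives.
        static void acquire(String lock) {}
        static void release(String lock) {}
    }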
Corollary 1. The Strong Consistency model, the Release Consistency model and the Entry Consistency model are three different consistency models.
Proof: The Strong Consistency model does not define any synchronization variables. It is consequently not equivalent to EC or RC.

3. GRAPHICAL COMPARISON OF THE THREE CONSISTENCY MODELS
Having outlined 3 distinct models, we suggest comparing them with a graphical taxonomy based on the user's point of view. Considering a correct program for a model M, this taxonomy allows us to answer the following question: at each moment t of a program execution, if a given process pi accesses a shared datum d, will it read the last value written in the shared memory by another process pj? In case of a positive answer, we say that at moment t, process pi is consistent with process pj about the datum d. We then represent each model with a 3D visualization of the answer to this question using the parameters (t, d, p). Hence, each model is represented by a volume made up of the triplets (t, d, p) for which the answer is yes. Basically, (t, d, p) is defined with respect to 3 axes:
- When (t): the time intervals of the running application (accesses between barriers, critical accesses between Acquire and Release, the remaining time);
- What (d): the shared memory space (the whole memory space, or only the memory objects linked with a synchronization variable);
- Who (p): the application processes (all processes, the processes linked by synchronization, the other processes).

Figure 1. When, who, what axes.

As shown in Fig. 1, the axis discretization is based on the shared objects' access patterns. Accordingly, if we normalize the axes, a plain cube means that a process, during the whole execution, for all shared data, reads the last value written by any other process. The Strong Consistency model perfectly fits that definition and graphical representation (Fig. 2). Figures 3 and 4 graphically represent the weak consistencies. With Release Consistency, an access performed between Acquire and Release operations is consistent with the other processes on all accessible data. At a barrier, all the processes have the same view of the shared memory. For the remaining execution, there are some conflict risks for any accessible data. The Entry Consistency model gives a slightly different result from the previous one: while the behavior at barriers is the same, it is rather different for an access in a critical section surrounded by Acquire and Release operations. The process is consistent with the other processes only for the shared data associated with this synchronization.
Figure 2. Strong consistency
Figure 3. Release consistency
Figure 4. Entry consistency
4. TOWARDS A NEW CONSISTENCY MODEL? THE HIERARCHICAL GROUP CONSISTENCY (HGC)
Moving from a strong towards a weak consistency implies relaxing the When axis. The Entry Consistency model also relaxes the What axis. At this stage, only one dimension remains fixed: the processes one. It would be pertinent to make this axis flexible as well. In an application, not all processes use all shared data. Moreover, in order to provide scalable solutions, it could be interesting to synchronize with barriers only a subset of the running processes. Thus, we propose a new consistency model which groups together processes working on the same synchronization variables. Such a model has the graphical representation shown in Figure 5.

Figure 5. Graphical representation of a new model with 3 flexible axes.

Basically, the HGC model is similar to EC since data are associated with a synchronization variable. Furthermore, processes are also associated with this synchronization variable. Consider a data modification performed inside a critical section managed by the synchronization variable l: those modifications are forwarded only if they concern data associated with l, and only to processes associated with l. Furthermore, in the HGC model, it is possible to perform a synchronization barrier for only a subset of the processes. Thus, the HGC model can be defined as follows:

Definition 6. The Hierarchical Group Consistency model is defined by:
- C_HGC = {read(x), write(x), Acq(l), Rel(l), Sync(l)};
- (u, v) ∈ SO_HGC(E_HGC(PRG)) iff there exists a synchronization variable l with which u and v are associated such that u is performed before Rel(l) and v is performed after Rel(l), OR u is performed before Sync(l) and v is performed after Sync(l).
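As an illustration of how a programmer might express this model, the following is a hypothetical Java sketch of group creation and synchronization under HGC. The API names (createGroup, acquire, release, syncGroup) are invented for the example; DOSMOS's actual interface is only partially described in this paper.

    // Hypothetical HGC-style API: consistency is maintained only inside the group
    // of processes associated with a synchronization variable.
    interface HgcDsm {
        // Associate a set of shared objects and a set of processes with variable l.
        void createGroup(String l, String[] sharedObjects, int[] processIds);
        void acquire(String l);              // Acq(l): enter critical section
        void release(String l);              // Rel(l): propagate updates of data tied to l,
                                             //         only to processes tied to l
        void syncGroup(String l);            // Sync(l): barrier restricted to l's group
        Object read(String object);
        void write(String object, Object value);
    }

    class HgcExample {
        static void run(HgcDsm dsm, int myRank) {
            // Only processes 0..3 share "matrixBlock" through variable "l1".
            dsm.createGroup("l1", new String[] { "matrixBlock" }, new int[] { 0, 1, 2, 3 });
            dsm.acquire("l1");
            dsm.write("matrixBlock", computeBlock(myRank));
            dsm.release("l1");               // updates reach group members only
            dsm.syncGroup("l1");             // barrier over the group, not all processes
        }
        static Object computeBlock(int rank) { return Integer.valueOf(rank); }
    }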
Theorem 3. The Hierarchical Group Consistency model is equivalent neither to the Release Consistency model nor to the Entry Consistency model.
Proof: HGC introduces a new synchronization operation (the barrier restricted to a synchronization variable). Consequently, C_HGC ≠ C_RC and C_HGC ≠ C_EC. Thus, those models are not equivalent.
HGC limits the coherence management costs of a shared datum to some dedicated and explicitly associated processes. Global communications are restricted to the processes that really need up-to-date copies of the shared data. In this way, the Hierarchical Group Consistency allows high availability of shared data to be combined with weak replication [4].

5. IMPLEMENTATION ASPECTS
We implement the Hierarchical Group Consistency inside the DOSMOS framework (Distributed Shared Objects MemOry System) [5, 12]. DOSMOS is based on 3 kinds of processes: Application Processes (which run the application code), Memory Processes (which manage memory accesses and consistency inside a group) and Link Processes (which manage memory consistency between groups). The policy of the DOSMOS system is to manage the consistency of an object only within the group of processes that frequently use this object. Thus DOSMOS introduces a hierarchical view of the application processes by creating groups of processes that frequently share the same object. We define a group as a pair (G, O) where O is a set of objects and G a set of Application Processes sharing those objects. Each group has a group manager (Link Process, LP) responsible for the inter-group communications. Each group is managed independently of the others. Concerning read and write operations on shared data, we classify the accesses into two categories: intra-group accesses (the accessing process belongs to the object's group) and inter-group accesses between two distinct groups. Figure 6 shows an example of object accesses using group consistency. In this example, we have 2 sites A and B. In each site, processes share the same object (Y for site A and X for site B). The objects X and Y do not have the same consistency management: Release Consistency for Y and Lazy Release Consistency for X. In this context, we describe three different actions. First, one process of site A writes Y, then this process sends the new value of Y to the other members of site A. The second action illustrates an inter-group access: one process of site B wants to read the object Y. Since it does not belong to Y's group, it sends its request to its group manager, which forwards it to the group manager of site A; this one forwards the request to the right process in the group, which sends the value of Y to the requesting process. The last action concerns an acquire of X: since X is managed in Lazy Release Consistency, the acquirer asks the last acquirer for the new value of X. For an intra-group access, we distinguish two cases:
- If it is the first access of the process to that object, the group manager sends it a copy of this object;
- Otherwise, the process already has a local copy. Then, depending on the consistency implementation of the group, the process will either work directly on this copy or not.
For an inter-group access, the accessing process does not have a copy and will never have one during the whole execution of the application. It simply asks for the value of this object (for a read) or sends the new value (for a write) to the group manager.
Figure 6. Three actions supporting the group consistency: an intra-write in Release Consistency (action 1), an inter-read (action 2), and an acquire in Lazy Release Consistency (action 3). Site A shares object Y under Release Consistency; site B shares object X under Lazy Release Consistency.
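The distinction between intra-group and inter-group accesses can be sketched as follows. This is an assumed, simplified rendering of the behaviour described above (the class and method names are illustrative, not DOSMOS's real ones): an intra-group read fetches a copy from the group manager on first access and then works locally, while an inter-group read always goes through the group managers.

    // Simplified sketch of DOSMOS-style access classification (hypothetical names).
    class GroupMember {
        private final java.util.Set<String> groupObjects;           // O: objects of my group
        private final java.util.Map<String, Object> localCopies = new java.util.HashMap<>();
        private final LinkProcess myGroupManager;                    // LP of my group

        GroupMember(java.util.Set<String> groupObjects, LinkProcess lp) {
            this.groupObjects = groupObjects;
            this.myGroupManager = lp;
        }

        Object read(String objectId) {
            if (groupObjects.contains(objectId)) {                   // intra-group access
                // First access: ask the group manager for a copy; afterwards work locally
                // (subject to the consistency protocol chosen for this group).
                return localCopies.computeIfAbsent(objectId, myGroupManager::fetchCopy);
            }
            // Inter-group access: never keep a copy, forward the request to my group
            // manager, which forwards it to the remote group's manager.
            return myGroupManager.remoteRead(objectId);
        }
    }

    interface LinkProcess {
        Object fetchCopy(String objectId);    // intra-group: deliver a replica
        Object remoteRead(String objectId);   // inter-group: forward to the owning group
    }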
6. CONCLUSION
This paper presents an original taxonomy of memory consistencies based on the programmer's approach. This classification shows the interest of a new consistency that relaxes unnecessary memory management costs. The Hierarchical Group Consistency proposes to group together the processes that frequently access the same shared data. This model allows a gain in performance without sacrificing DSM programmability. The Group Consistency model implemented in DOSMOS is an original approach for improving DSM systems. It allows the system to be adapted to the environment in which it runs (type of processors, communication network, application patterns, object size, ...). Depending on the environment, the replication policy for objects will not be the same and consequently the group structure will be different. The Hierarchical Group Consistency is well suited for large scale systems and could perfectly fit multi-cluster and Grid applications. This model fits the requirements of various kinds of networks: depending on the latency, the bandwidth or other criteria, we can design an accurate group structure in order to reach the best compromise between replication and availability for the studied network.

REFERENCES
[1] S. V. Adve, A. L. Cox, S. Dwarkadas, R. Rajamony, and W. Zwaenepoel. A comparison of entry consistency and lazy release consistency implementations. In Proc. of the 2nd IEEE Symp. on High-Performance Computer Architecture (HPCA-2), pages 26-37, February 1996.
[2] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66-76, December 1996.
[3] B. N. Bershad, M. J. Zekauskas, and W. A. Sawdon. The Midway distributed shared memory system. In 38th IEEE International Computer Conference (COMPCON Spring'93), pages 528-537, February 1993.
[4] A. Bonhomme and L. Lefèvre. How to combine strong availability with weak replication of objects? In ECOOP'98: 12th Conference on Object Oriented Programming, Workshop on Mobility and Replication, Bruxelles, Belgium, July 1998.
[5] L. Brunie, L. Lefèvre, and O. Reymann. High performance distributed objects for cluster computing. In 1st IEEE International Workshop on Cluster Computing (IWCC'99), pages 229-236, Melbourne, Australia, December 1999. IEEE Computer Society Press.
[6] J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of MUNIN. ACM Operating Systems Review, 25(5):152-164, 1991.
[7] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In 16th Annual Symposium on Computer Architecture, pages 15-26, May 1989.
[8] L. Higham, J. Kawash, and N. Verwaal. Defining and comparing memory consistency models. In Proc. of the 10th Int'l Conf. on Parallel and Distributed Computing Systems (PDCS-97), October 1997.
[9] W. Hu, W. Shi, and Z. Tang. A framework of memory consistency models. Journal of Computer Science and Technology, 13(2), March 1998.
[10] L. Iftode, J. P. Singh, and K. Li. Scope consistency: A bridge between release consistency and entry consistency. In Proc. of the 8th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA'96), pages 277-287, June 1996.
[11] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. on Computers, C-28(9):690-691, September 1979.
[12] L. Lefèvre and O. Reymann. Combining low-latency communication protocols with multithreading for high performance DSM systems on clusters. In 8th Euromicro Workshop on Parallel and Distributed Processing, pages 333-340, Rhodes, Greece, January 2000. IEEE Computer Society Press.
An operational semantics for skeletons*

M. Aldinucci and M. Danelutto

Institute of Information Science and Technologies (ISTI), National Research Council (CNR), Via Moruzzi 1, I-56124 Pisa, Italy
Department of Computer Science, University of Pisa, Via Buonarroti 2, I-56127 Pisa, Italy

A major weakness of the current programming systems based on skeletons is that the parallel semantics is usually provided in an informal way, thus preventing any formal comparison of program behavior. We describe a schema suitable for the description of both the functional and the parallel semantics of skeletal languages which is aimed at filling this gap. The proposed schema of semantics represents a handy framework to prove the correctness of, and to validate, different rewriting rules. These can be used to transform a skeleton program into a functionally equivalent but possibly faster version.

1. INTRODUCTION
Skeletons were originally conceived by Cole [8] and have since been used by different research groups to design high-performance structured parallel programming environments [5, 6, 12]. A skeleton may be modeled as a higher-order function taking one or more other skeletons or portions of sequential code as parameters, and modeling a parallel computation out of them. A skeletal program is a composition of skeletons. Skeletons can be provided to the programmer either as language constructs [5, 6, 7] or as libraries [3, 9, 11]. The formal description of a parallel, skeletal language involves at least two key issues: 1) the description of the input-output relationship of skeletons (functional semantics); 2) the description of the parallel behavior of skeletons. The functional semantics enables the definition of semantics-preserving program transformations [2, 4, 1, 10]. These transformations can also be driven by some kind of analytical performance model associated with the skeletons [13], in such a way that only those rewritings leading to efficient implementations of the skeleton code are considered [2, 1]. Almost all the frameworks cited above have a formal functional semantics, but none of them provides a complete and uniform description of the parallel semantics. In this work we present a schema of operational semantics suitable for skeletal languages exploiting both data and stream parallel skeletons. The operational semantics is defined in terms of a labeled transition system (LTS). It describes both functional and parallel behavior in a uniform and general way.

* This work has been partially supported by the Italian MIUR Strategic Project "legge 449/97" year 1999 No. 02.00470.ST97 and year 2000 No. 02.00640.ST97, and by the Italian MIUR FIRB Project GRID.it No. RBNE01KNFE.
Figure 1. Lithium operational semantics (rules 1-7 for the seq, farm, pipe, comp, map, d&c and while skeletons, plus the join, join_ε, relabel, context, sp and dp rules). x, y ∈ value; σ, τ ∈ values; u ∈ values ∪ {ε}; E ∈ exp; Γ ∈ exp*; ℓ, ℓi, ... ∈ label; Θ : label × value → label.
We use a subset of the Lithium language as a test-bed to describe the methodology [3].

2. Lithium FORMAL DEFINITION
Lithium extends the Java language by providing the programmer with both task parallel and data parallel skeletons [3]. All the skeletons process a stream (finite or infinite) of input tasks to produce a stream of results. All the skeletons are assumed to be stateless; static variables are forbidden in Lithium code. No concept of "global state" is supported by the implementation, except the ones explicitly programmed by the user². The Lithium skeletons are fully nestable.
² Such as RMI servers encapsulating shared data structures.
Each skeleton has one or more parameters that model the computations encapsulated in the related parallelism exploitation pattern. Lithium manages two new types in addition to the Java types: streams and tuples, denoted by angled brackets ⟨ ⟩ and double angled brackets ⟪ ⟫ respectively:

value   ::= a Java value
values  ::= value | value, values
stream  ::= ⟨values⟩ | ⟨ε⟩
tuple_k ::= ⟪stream_1, ..., stream_k⟫
A stream represents a sequence (finite or infinite) of values of the same type, whereas a tuple is a parametric type that represents a (finite, ordered) set of streams. Actually, the streams appearing in tuples are always singleton streams, i.e. streams holding a single value. The set of skeletons (∆) provided by Lithium is defined as follows:

∆ ::= seq f | farm ∆ | pipe ∆1 ∆2 | comp ∆1 ∆2 | map fc ∆ fd | d&c ftc fc ∆ fd | while ftc ∆

where sequential Java functions (f, g) with no index have type Object → Object, and indexed functions (fc, fd, ftc) have the following types: fc : tuple_k → stream; fd : stream → tuple_k; ftc : value → boolean. In particular, fc and fd represent families of functions that enable the splitting of a stream into k-tuples of singleton streams and vice versa. Lithium skeletons can be considered as pre-defined higher-order functions. Intuitively, the seq skeleton just integrates sequential Java code chunks within the structured parallel framework; the farm and pipe skeletons model embarrassingly parallel and pipeline computations, respectively; comp models pipelines with stages serialized on the same processing element (PE); map models data parallel computations: fd decomposes the input data into a set of possibly overlapping data subsets, the inner skeleton computes a result out of each subset, and the fc function rebuilds a unique result out of these results; d&c models Divide&Conquer computations: the input data is divided into subsets by fd and each subset is computed recursively and concurrently until the ftc condition no longer holds true, at which point the results of the sub-computations are conquered via the fc function; the while skeleton models indefinite iteration. A skeleton applied to a stream is called a skeletal expression. Expressions are defined as follows:

exp ::= ∆ stream | ∆ exp | exp :: exp | fc (α ∆) fd stream

The execution of a Lithium program consists in the evaluation of a ∆ stream expression.
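Before turning to the semantics, a small Java sketch may help to fix the intended use of these constructors. The classes below (Seq, Farm, Pipe and the Skeleton interface) are invented for illustration and give only a sequential, functional rendering; the actual Lithium library API is described in [3] and is not reproduced here.

    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    // Illustrative rendering of the skeleton constructors: each skeleton
    // maps a stream (here a List) of inputs to a stream of results.
    interface Skeleton { List<Object> eval(List<Object> input); }

    class Seq implements Skeleton {                       // seq f
        private final Function<Object, Object> f;
        Seq(Function<Object, Object> f) { this.f = f; }
        public List<Object> eval(List<Object> in) {
            return in.stream().map(f).collect(Collectors.toList());
        }
    }

    class Farm implements Skeleton {                      // farm D (replication; shown functionally)
        private final Skeleton worker;
        Farm(Skeleton worker) { this.worker = worker; }
        public List<Object> eval(List<Object> in) { return worker.eval(in); }
    }

    class Pipe implements Skeleton {                      // pipe D1 D2 (stage composition)
        private final Skeleton s1, s2;
        Pipe(Skeleton s1, Skeleton s2) { this.s1 = s1; this.s2 = s2; }
        public List<Object> eval(List<Object> in) { return s2.eval(s1.eval(in)); }
    }

    class Demo {
        public static void main(String[] args) {
            Function<Object, Object> f1 = x -> (Integer) x + 1;
            Function<Object, Object> f2 = x -> (Integer) x * 2;
            Skeleton program = new Farm(new Pipe(new Seq(f1), new Seq(f2)));
            System.out.println(program.eval(List.<Object>of(1, 2, 3)));   // [4, 6, 8]
        }
    }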
3. Lithium OPERATIONAL SEMANTICS
We describe the Lithium semantics by means of an LTS. We define the label set as the string set augmented with the special label ⊥. We rely on labels to distinguish both streams and transitions. Input streams have no label; output streams are labeled ⊥. Labels on streams describe where data items are mapped within the system, while labels on transitions describe where they are computed. The Lithium operational semantics is described in Figure 1. The rules of the Lithium semantics may be grouped in two main categories, corresponding to the two halves of the figure. Rules 1-7 describe how skeletons behave with respect to a stream. These rules spring from the following common schema:

Skel params ⟨x, τ⟩_ℓ1  --ℓ1-->  f⟨x⟩_ℓ2 :: Skel params ⟨τ⟩_ℓ3
where Skel ∈ {seq, farm, pipe, comp, map, d&c, while}, and the infix stream constructor ⟨σ⟩_ℓi :: ⟨τ⟩_ℓj is a non-strict operator that sequences skeletal expressions. In general, f is a function appearing in the params list. For each of these rules a couple of twin rules exists (not shown in Figure 1):

α)  Skel params ⟨x, τ⟩  --⊥-->  ⟨ε⟩_⊥ :: Skel params ⟨x, τ⟩_⊥
β)  Skel params ⟨x⟩_ℓ1  --ℓ1-->  f⟨x⟩_ℓ2
These rules manage the first and the last element of the stream, respectively. Each triple of rules manages a stream as follows: the stream is first labeled by a rule of the kind α). Then the stream is unfolded into a sequence of singleton streams and the nested skeleton is applied to each item in the sequence. During the unfolding, singleton streams are labeled according to the particular rule policy, while the transition is labeled with the label of the stream before the transition (in this case ℓ1). The last element of the stream is managed by a rule of the kind β). Eventually the resulting skeletal expressions are joined back by means of the :: operator. Let us show how rules 1-7 work with an example. We evaluate farm (seq f) on the input stream ⟨x1, x2, x3⟩. At the very beginning only the 2α rule can be applied. It marks the beginning of the stream by introducing ⟨ε⟩_⊥ (the empty stream) and labels the input stream with ⊥. Then rule 2 can be applied:

⟨ε⟩_⊥ :: farm (seq f) ⟨x1, x2, x3⟩_⊥  --⊥-->  ⟨ε⟩_⊥ :: seq f ⟨x1⟩_0 :: farm (seq f) ⟨x2, x3⟩_0
The head of the stream has been separated from the rest and has been re-labeled (from ⊥ to 0) according to the Θ(⊥, x1) function. The inner skeleton (seq) has been applied to this singleton stream, while the initial skeleton has been applied to the rest of the stream in a recursive fashion. The re-labeling function Θ : label × value → label (namely the oracle) is an external function with respect to the LTS. It represents the (user-defined) data mapping policy. Let us adopt a two-PE round-robin policy: an oracle function for this policy would cyclically return a label from a set of two labels. In this case the repeated application of rule 2 proceeds as follows:

⟨ε⟩_⊥ :: seq f ⟨x1⟩_0 :: farm (seq f) ⟨x2, x3⟩_0
  --0-->  ⟨ε⟩_⊥ :: seq f ⟨x1⟩_0 :: seq f ⟨x2⟩_1 :: farm (seq f) ⟨x3⟩_1
  --1-->  ⟨ε⟩_⊥ :: seq f ⟨x1⟩_0 :: seq f ⟨x2⟩_1 :: seq f ⟨x3⟩_0
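A round-robin oracle of this kind can be sketched as follows; the interface name and the stateful implementation are illustrative assumptions (the paper treats Θ purely as an external function of the LTS).

    // Theta : label x value -> label, modeled as a (possibly stateful) mapping policy.
    interface Oracle {
        String relabel(String currentLabel, Object task);
    }

    // Two-PE round-robin policy: ignores the task value and cycles over {"0", "1"}.
    class RoundRobinOracle implements Oracle {
        private final String[] labels;
        private int next = 0;
        RoundRobinOracle(String... labels) { this.labels = labels; }
        public String relabel(String currentLabel, Object task) {
            String l = labels[next];
            next = (next + 1) % labels.length;
            return l;
        }
    }

    // A "maximally parallel" policy would instead return a fresh label on every call.
    class FreshLabelOracle implements Oracle {
        private int counter = 0;
        public String relabel(String currentLabel, Object task) {
            return "pe" + (counter++);
        }
    }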
The oracle may have an internal state, and it may implement several policies of label transformation. As an example, the oracle might always return a fresh label, or it might make decisions about the label to return on the basis of the value x. As we shall see, in the former case the semantics models the maximally parallel computation of the skeletal expression. Observe that using only rules 1-7 (and their twins) the initial skeletal expression cannot be completely reduced (down to the output stream). Applied in all the possible ways, they lead to an aggregate of expressions (exp) glued together by the :: operator. The rest of the work is carried out by the six rules in the bottom half of Figure 1. There are two main rules (sp and dp) and four auxiliary rules (context, relabel, join_ε and join). Such rules define the order of reduction along aggregates of skeletal expressions. Let us describe each rule:
sp (stream parallel) rule: describes the evaluation order of skeletal expressions along sequences separated by the :: operator. The meaning of the rule is the following: suppose that each skeletal expression in the sequence may be rewritten into another skeletal expression with a certain labeled transformation. Then all such skeletal expressions can be transformed in parallel, provided that they are adjacent, that the first expression of the sequence is a stream of values, and that all the transformation labels involved are pairwise different.

(1) farm (pipe (seq f1) (seq f2)) ⟨x1, x2, x3, x4, x5, x6, x7⟩
(2) ⟨ε⟩_⊥ :: farm (pipe (seq f1) (seq f2)) ⟨x1, x2, x3, x4, x5, x6, x7⟩_⊥
(3) ⟨ε⟩_⊥ :: pipe (seq f1)(seq f2)⟨x1⟩_0 :: pipe (seq f1)(seq f2)⟨x2⟩_1 :: pipe (seq f1)(seq f2)⟨x3⟩_0 :: pipe (seq f1)(seq f2)⟨x4⟩_1 :: pipe (seq f1)(seq f2)⟨x5⟩_0 :: pipe (seq f1)(seq f2)⟨x6⟩_1 :: pipe (seq f1)(seq f2)⟨x7⟩_0
(4) ⟨ε⟩_⊥ :: seq f2 R_02 seq f1 ⟨x1⟩_0 :: seq f2 R_12 seq f1 ⟨x2⟩_1 :: seq f2 R_02 seq f1 ⟨x3⟩_0 :: seq f2 R_12 seq f1 ⟨x4⟩_1 :: seq f2 R_02 seq f1 ⟨x5⟩_0 :: seq f2 R_12 seq f1 ⟨x6⟩_1 :: seq f2 R_02 seq f1 ⟨x7⟩_0
(5) ⟨ε⟩_⊥ :: seq f2 ⟨f1 x1⟩_02 :: seq f2 ⟨f1 x2⟩_12 :: seq f2 R_02 seq f1 ⟨x3⟩_0 :: seq f2 R_12 seq f1 ⟨x4⟩_1 :: seq f2 R_02 seq f1 ⟨x5⟩_0 :: seq f2 R_12 seq f1 ⟨x6⟩_1 :: seq f2 R_02 seq f1 ⟨x7⟩_0
(6) ⟨ε⟩_⊥ :: ⟨f2 (f1 x1)⟩_02 :: ⟨f2 (f1 x2)⟩_12 :: ⟨f2 (f1 x3)⟩_02 :: ⟨f2 (f1 x4)⟩_12 :: ⟨f2 (f1 x5)⟩_02 :: ⟨f2 (f1 x6)⟩_12 :: ⟨f2 (f1 x7)⟩_02
(7) ⟨f2 (f1 x1), f2 (f1 x2), f2 (f1 x3), f2 (f1 x4), f2 (f1 x5), f2 (f1 x6), f2 (f1 x7)⟩_⊥

Figure 2. The semantics of a Lithium program: a complete example.
dp (data parallel) rule: describes the evaluation order for the fc (α ∆) fd stream expression. Basically, the rule creates a tuple by means of the fd function, then requires the evaluation of all the expressions composed by applying ∆ to all elements of the tuple. All such expressions are evaluated in one step by the rule (apply-to-all). Finally, fc gathers all the elements of the evaluated tuple into a singleton stream. Labels are not an issue for this rule.
relabel rule: provides a relabeling facility by evaluating the meta-skeleton R_ℓ. The rule does nothing but change the stream label. Pragmatically, the rule requires a PE to send the result of a computation to another PE (along with the function to compute).
context rule: establishes the evaluation order among nested expressions in all the cases not treated by dp and relabel. The rule imposes a strict evaluation of nested expressions (i.e. the arguments are evaluated first). The rule leaves both the transition and the stream labels of the nested expression unchanged.
join rule: does the housekeeping work, joining back all the singleton streams of values into a single output stream of values; join_ε does the same work on the first element of the stream.

Let us consider a more complex example: the semantics of farm (pipe f1 f2) evaluated on the input stream ⟨x1, x2, x3, x4, x5, x6, x7⟩. Let us suppose that the oracle function returns a label chosen from a set of two labels. Pragmatically, since the farm skeleton
represents the replication paradigm and the pipe skeleton the pipeline paradigm, the nested form farm (pipe f1 f2) basically matches the idea of a multiple pipeline (or a pipeline with multiple independent channels). The oracle function defines the parallelism degree of each paradigm: in our case two pipes, each having two stages. As shown in Figure 2, the initial expression is unfolded by rule 2α ((1) → (2)) and then reduced by many applications of rule 2 ((2) →* (3)). Afterwards the term can be rewritten by rule 3 ((3) →* (4)). At this point, we can reduce the formula using the sp rule. sp requires a sequence of adjacent expressions that can be reduced with differently labeled transitions. In this case we can find just two different labels (0, 1), thus we apply sp to the leftmost pair of the previous expressions:

⟨ε⟩_⊥ :: seq f2 R_02 seq f1 ⟨x1⟩_0 :: seq f2 R_12 seq f1 ⟨x2⟩_1 :: seq f2 R_02 seq f1 ⟨x3⟩_0 :: ...
  -->  ⟨ε⟩_⊥ :: seq f2 ⟨f1 x1⟩_02 :: seq f2 ⟨f1 x2⟩_12 :: seq f2 R_02 seq f1 ⟨x3⟩_0 :: ...

Observe that, due to this reduction ((4) → (5) in Figure 2), two new stream labels appear (02 and 12). Now we can repeat the application of the sp rule as in the previous step. This time it is possible to find four adjacent expressions that can be rewritten with (pairwise) different labels (0, 1, 02, 12). Notice that even on a longer stream this oracle function never produces more than four different labels, thus the maximum number of skeletal expressions reduced in one step by sp is four. Repeating the reasoning, we can completely reduce the formula to a sequence of singleton streams ((5) →* (6)), which in turn can be transformed by many applications of the join rule ((6) →* (7)). Let us analyze the whole reduction process: from the initial expression to the final stream of values we applied the sp rule three times. In the first application of sp two skeletal expressions were transformed in one step; in the second application four skeletal expressions were involved; while in the last one just one expression was reduced³. The succession of sp applications in the transition system matches the expected behavior of the double pipeline paradigm. The first and last applications match the pipeline start-up and end phases; as expected, the parallelism exploited there is reduced with respect to the steady state phase, which is matched by the second application. A longer input stream would raise the number of sp applications, in effect expanding the steady-state phase of the modeled system.

4. PARALLELISM AND LABELS
The first relevant aspect of the proposed schema is that the functional semantics is independent of the labeling function. Changing the oracle function (i.e. how data and computations are distributed across the system) may change the number of transitions needed to reduce the input stream to the output stream, but it cannot change the output stream itself. The second aspect concerns the parallel behavior. It can be completely understood by looking at the application of two rules, dp and sp, which respectively control data and stream parallelism. We call the evaluation of either a dp or an sp rule a par-step. The dp rule acts as an apply-to-all on a tuple of data items. Such data items are generated by partitioning a single task of the stream by means of a user-defined function. The parallelism comes from the reduction of all the elements in a tuple (actually singleton streams) in a single par-step. A single instance of the sp rule enables the parallel evolution of adjacent terms with different labels (i.e. computations running on distinct PEs). The converse implication also holds: many transitions of adjacent terms with different
labels might be reduced with a single sp application. However, notice that the sp rule may be applied in many different ways even to the same term. In particular, the number of expressions reduced in a single par-step may vary from 1 to n (i.e. the maximum number of adjacent terms exploiting different labels). These different applications lead to both different proofs and different parallel (but functionally equivalent) semantics for the same term. This degree of freedom enables the language designer to define a (functionally confluent) family of semantics for the same language covering different aspects of the language. For example, it is possible to define the semantics exploiting the maximum available parallelism, or one that never uses more than k PEs. At any time, the effective parallelism degree in the evaluation of a given term in a par-step can be counted by inducing in a structural way on the proof of the term. The parallelism degree in the conclusion of a rule is the sum of the parallelism degrees of the transitions appearing in the premise (assuming a parallelism degree of one in rules 1-7). The parallelism degree counting may easily be formalized in the LTS by using an additional label on transitions. The LTS proof machinery subsumes a generic skeleton implementation: the input of the skeleton program comes from a single entity (e.g. a channel, a cable, etc.) and at discrete time steps. To exploit parallelism on different tasks of the stream, tasks are spread across PEs following a given discipline. Stream labels trace tasks in their journey, and sp establishes that tasks with the same label, i.e. on the same PE, cannot be computed in parallel. Labels are assigned by the oracle function, which rewrites a label into another one using its own internal policy. The oracle abstracts the mapping of data onto PEs, and it can be viewed as a parameter of the transition system used to model several data mapping policies. As an example, a farm may take items from the stream and spread them in a round-robin fashion over a pool of workers. Alternatively, the farm may manage the pool of workers by divination, always mapping a data task to a free worker (such a policy may be used to establish an upper bound on the parallelism exploited). The label set effectively used in a computation depends on the oracle function: it can be a statically fixed set or it can change cardinality during the evaluation. On this ground, a large class of implementations may be modeled. Labels on transformations are derived from labels on streams; quite intuitively, a processing element must know a data item in order to elaborate it. The relabeling mechanism makes it possible to describe a data item re-mapping. In a par-step, the transformation labels point out which PEs are currently computing the task.

5. CONCLUSION
We propose an operational semantics schema that can be used to describe both the functional and the parallel behavior of skeletal programs in a uniform way. This schema is basically an LTS which is parametric with respect to an oracle function. The oracle function provides the LTS with a configurable label generator that establishes a mapping between data, computations and system resources. We use Lithium, a skeletal language exploiting both task and data parallelism, as a test-bed for the semantics schema.
We show how the semantics (built according to the schema) enables the analysis of several facets of Lithium programs, such as the description of the functional semantics, the comparison in performance and resource usage between functionally equivalent programs, and the analysis of the maximum parallelism achievable with infinite or finite resources. To our knowledge, there is no other work discussing a semantics with the same features in the parallel skeleton language framework (and, more generally, in the structured parallel programming world).

³ The second and third sp applications are not shown in the example.
REFERENCES
[1] M. Aldinucci. Automatic program transformation: The Meta tool for skeleton-based languages. In S. Gorlatch and C. Lengauer, editors, Constructive Methods for Parallel Programming, Advances in Computation: Theory and Practice, chapter 5, pages 59-78. Nova Science Publishers, NY, USA, 2002.
[2] M. Aldinucci and M. Danelutto. Stream parallel skeleton optimization. In Proc. of the 11th IASTED Intl. Conference on Parallel and Distributed Computing and Systems (PDCS'99), pages 955-962, Cambridge, Massachusetts, USA, November 1999. IASTED/ACTA Press.
[3] M. Aldinucci, M. Danelutto, and P. Teti. An advanced environment supporting structured parallel programming in Java. Future Generation Computer Systems, 19(5):611-626, 2003.
[4] M. Aldinucci, S. Gorlatch, C. Lengauer, and S. Pelagatti. Towards parallel programming by transformation: The FAN skeleton framework. Parallel Algorithms and Applications, 16(2-3):87-122, 2001.
[5] P. Au, J. Darlington, M. Ghanem, Y. Guo, H. W. To, and J. Yang. Co-ordinating heterogeneous parallel computation. In L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Proc. of Euro-Par 1996, pages 601-614. Springer-Verlag, 1996.
[6] B. Bacci, M. Danelutto, S. Pelagatti, and M. Vanneschi. SkIE: a heterogeneous environment for HPC applications. Parallel Computing, 25(13-14):1827-1852, December 1999.
[7] G. H. Botorog and H. Kuchen. Efficient high-level parallel programming. Theoretical Computer Science, 196(1-2):71-107, April 1998.
[8] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computations. Research Monographs in Parallel and Distributed Computing. Pitman, 1989.
[9] M. Danelutto and M. Stigliani. SKElib: parallel programming with skeletons in C. In A. Bode, T. Ludwig, W. Karl, and R. Wismüller, editors, Proc. of Euro-Par 2000, number 1900 in LNCS, pages 1175-1184. Springer-Verlag, September 2000.
[10] S. Gorlatch, C. Lengauer, and C. Wedler. Optimization rules for programming with collective operations. In Proc. of the 13th Intl. Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP'99), pages 492-499. IEEE Computer Society Press, 1999.
[11] H. Kuchen. A skeleton library. In B. Monien and R. Feldmann, editors, Proc. of Euro-Par 2002, number 2400 in LNCS, pages 620-629. Springer-Verlag, 2002.
[12] J. Sérot and D. Ginhac. Skeletons for parallel image processing: an overview of the SKIPPER project. Parallel Computing, 28(12):1685-1708, December 2002.
[13] D. B. Skillicorn and W. Cai. A cost calculus for parallel functional programming. Journal of Parallel and Distributed Computing, 28:65-83, 1995.
A Programming Model for Tree Structured Parallel and Distributed Algorithms and its Implementation in a Java Environment*

H. Moritsch

School of Business, Economics, and Computer Science, University of Vienna, Brünner Straße 72, A-1210 Vienna, Austria
[email protected]

* The work described in this paper was supported by the Austrian Science Fund as part of the Special Research Program No. F011 AURORA "Advanced Models, Applications and Software Systems for High Performance Computing".

We describe the Distributed Active Tree, a programming model which allows for an object-oriented, data parallel formulation of a whole class of parallel and distributed tree structured algorithms. It provides collective communication with varying synchronization conditions, and a shared address space through Accessible objects representing the nodes accessed within a communication operation. We developed a Java implementation and employed it in the parallelization of nested Benders decomposition, a solution technique for large stochastic programming problems such as multistage decision problems. The optimization algorithm is part of a high performance decision support tool for asset and liability management, the Aurora Financial Management System. We discuss the architecture and coordination layer of our implementation and present experimental results obtained on a Beowulf SMP-cluster and on a network of workstations, including a comparison of a synchronous and an asynchronous version of the algorithm.

1. INTRODUCTION
Tree structured problems occur in various application areas, for example in portfolio management and reservoir optimization. In fact, they represent an important class of the problems dealt with in finance. The Aurora Financial Management System under development at the University of Vienna is a decision support tool for asset and liability management employing parallel solvers for large multistage stochastic optimization problems [11]. Using decomposition techniques, the whole optimization problem is represented as a set of small problems defined at the nodes of a scenario tree. By iteratively solving their local problems and updating them using information from neighbor nodes, all nodes contribute to the overall solution. The decomposition method has been implemented in parallel as a distributed active tree structure. The underlying programming model allows a wide class of tree structured algorithms to be formulated at a high level, with transparent remote accesses, communication and synchronization. The remainder of this paper is organized as follows. Section 2 describes the distributed active tree model, and section 3 discusses the implementation as a Java class library on top of Remote Method Invocation. Section 4 presents an application in the field of stochastic optimization,
a tree structured decomposition algorithm, and reports on initial experiments conducted on a network of Sun workstations and on a Beowulf SMP-cluster. Section 5 concludes the paper.

2. THE DISTRIBUTED ACTIVE TREE MODEL
The Distributed Active Tree (DAT) is a structure which executes a parallel program. The program is defined as a set of cooperating algorithms associated with the nodes of the tree. The DAT is a distributed data structure containing active tree node objects, i.e. nodes have their own threads of control. Whereas in principle they could execute different programs, in this model they all execute the same one (possibly parameterized with particular properties of the node). As they operate on different data, the DAT programming model is a Single Program, Multiple Data (SPMD) model. The set of tree nodes is distributed onto a set of compute nodes (physical processors), according to a specified mapping defining the distribution of the tree data. The nodes altogether perform a parallel algorithm, thus the whole tree can be seen as a distributed processing unit (an abstract machine). Tree nodes can be mapped onto compute nodes individually or at the level of complete subtrees. With emphasis on iterative tree structured algorithms, the node activity is modeled as a while-loop, which can contain additional nested loops. Conceptually, all nodes are simultaneously active and exchange data with other nodes. Due to synchronization requirements, a node can be in a wait state. For the coordination among tree nodes, the model provides the abstraction Accessible. It represents the set of tree nodes (element nodes) which are involved in a communication operation, e.g. the successors, the predecessor, or the union of these. Data accesses to other nodes are performed by calling methods of the respective Accessible objects. Data transferred to another node is buffered at the target node. The target.put(data) method copies the object data into buffers at the element nodes of the Accessible target. The source.get(allOrAny) method, called at a node, retrieves data (put by the element nodes of the Accessible source) from the buffer and copies it into local data structures of the node. Depending on the value of its parameter, the get method either waits until data has arrived from all element nodes, or waits for data from at least one node; this allows for diverse synchronization constraints. Accessing a set of nodes in a single operation provides a kind of collective communication. The communication primitives are transparent with respect to the distribution of the tree: nodes residing on the same compute node (local nodes) are addressed in the same way as nodes residing on a different compute node (remote nodes); the programmer maintains a shared memory view of the whole tree. For the execution of many node activities on one compute node, different scheduling strategies are possible. If every tree node is associated with a thread of the underlying runtime environment, the scheduling is simply delegated. Following a different strategy, one single thread cyclically executes one iteration on all tree nodes, then the next iteration, and so on. Combinations of these strategies are possible, e.g. there are m threads at a compute node, each of which is responsible for n tree nodes.
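A node's iterative activity with put/get on Accessibles might look as follows. This is a hedged sketch: the localCompute step, the converged test and the exact shape of the Accessible interface are assumptions chosen to mirror the description above, not the library's published API.

    // Sketch of one tree node's main loop in the DAT model (illustrative types).
    class ExampleNodeActivity {
        Accessible successors;     // element nodes: the children of this tree node
        Accessible predecessor;    // element node: the parent of this tree node

        void run() {
            while (!converged()) {
                Object localResult = localCompute();
                // Collective send: copy the result into the buffers of all successors.
                successors.put(localResult);
                // Blocking receive: true = wait for data from ALL element nodes,
                // false = return as soon as data from ANY element node has arrived.
                predecessor.get(true);
            }
        }

        boolean converged() { return true; }           // termination condition of the loop
        Object localCompute() { return new Object(); } // node-specific computation
    }

    // Assumed shape of the Accessible abstraction used above.
    interface Accessible {
        void put(Object data);        // buffer data at every element node
        void get(boolean allOrAny);   // copy buffered data into local structures
    }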
3. IMPLEMENTATION IN JAVA
We developed an implementation in Java because of Java's portability and other features such as its object-oriented programming model, robustness, automatic memory management, and support for multithreading and network programming. In particular, we assume the financial application to be used in a heterogeneous computing environment. The Java implementation also serves as a basis for a grid-enabled version. At the level of the Java language, the DAT model is defined as a set of interfaces, organized in several packages: aurora.dat (specification of the tree structure), aurora.dat.alg (specification of the algorithm), aurora.dat.coord (coordination layer implementation), aurora.dat.dist (tree distribution specification), and aurora.dat.sched (loop scheduling method); the Java classes implementing these interfaces altogether build a concrete DAT. Every tree node is an instance of a class which implements the AlgorithmNode interface and overrides the methods iteration and condition of the LoopActivity interface, thus specifying the body and termination condition of the main loop. Associated with every tree node there is a coordination object (an instance of a class implementing the Coordination interface), which maintains a buffer for incoming data and provides Accessible objects to the tree node. The scheduling of loops is specified in an implementation of the LoopScheduler interface, which operationally defines a mapping from the set of loops to be executed on a compute node onto a set of Java threads. An instance of this class creates the threads, assigns loop iterations to them, and starts them. The tree, including its distribution, is specified in an implementation of the DistributedTreeSpecification interface. Using Java code, arbitrary mappings can be defined. At runtime, the DAT comprises, at each compute node, a set of AlgorithmNode objects with their corresponding Coordination objects, a LoopScheduler object, and a DistributedTreeSpecification object. In the main function at a compute node, these objects are created and combined within a ComputeNode object:

    new ComputeNode(number, AlgorithmNode-class-name, Coordination-class-name,
                    new DistributedTreeSpecification-class(..)).run(new LoopScheduler-class(..));

The implementation of the coordination forms a separate layer in the DAT below the algorithm layer, dealing with all intra- and inter-processor communication and synchronization. It will employ underlying mechanisms such as sockets, RMI, CORBA, and even MPI (via the Java Native Interface). Via the Coordination interface, different coordination implementations can be used by an algorithm. We developed an implementation on top of Java Remote Method Invocation (RMI). The class RDC (RMI-based DAT Coordination) has a synchronized method add, which appends new data to the input buffer. Within the put operation, local nodes are accessed through shared memory via calls of the add method at the coordination objects of the target nodes. Remote nodes are accessed by calling (via RMI) a remote method of the target ComputeNode object, which then locally calls the add method at the target coordination objects. We describe the synchronization of tree nodes and its implementation as a collection of state dependent actions using Java's built-in synchronization features [7]. Every tree node performs its activity in a separate thread. With regard to coordination, the logical state space of a node (more precisely, of the coordination object) consists of the following states.
In the Active state, the node is performing computations of the node algorithm. When it starts waiting for data from other nodes, it enters the Wait state. When new data has arrived, it enters the Transfer state.
[Figure 1a: statechart of the logical states of the coordination object (Active, Wait, Transfer, with substates ActiveNew and TransferNew), with transitions triggered by the add method under guard conditions such as [element] and [element && allOrAny].]

public class RDC implements Coordination {
    ...
    public class RDA implements Accessible {
        ...
        public void get(boolean allOrAny) {
            stateChanged(this, allOrAny);
            synchronized (RDC.this) {
                if (state == WAIT)
                    try { RDC.this.wait(); } catch ...
            }
            do {
                transfer(this);
                stateChanged(this, false);
            } while (state == TRANSFER);
        } // get
    }
}

Figure 1. a) Logical states of the coordination object b) Synchronization within the get operation
In this state, the buffer contents are copied into local data structures. States are left either when the activity associated with a state is finished, or when the add method has been called, i.e., another node has added data to the buffer. The statechart diagram in Figure 1 shows the additional substates ActiveNew and TransferNew, which are entered when new data arrives during the state activity. This allows for optimizations: in the case of Active, Wait can be skipped, and in the case of Transfer, the new data will simply be considered during the current get operation. In addition, two guard conditions control the state transitions. Incoming data is regarded as relevant only if the node that put it into the buffer is an element of the Accessible performing the get operation. In the case of allOrAny == true, data is regarded as available only if it has arrived from all element nodes of the Accessible. The transition diagram is implemented in the synchronized method RDC.stateChanged, which writes the instance variable state; it is called in the get method and, from the source node thread, at the end of the add method. The code excerpt in Figure 1 shows the implementation of the get operation. If the state has changed to Wait, the node thread is made to wait; stateChanged calls notify exactly in the case that Wait has been left. The synchronized method RDC.transfer copies the buffered data objects into the local data structures of the AlgorithmNode object. These are not visible to the coordination object, because they are in the algorithm layer; the copy operations are methods of the algorithm data objects and are called by transfer via a conversion mechanism.
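To make the runtime wiring quoted earlier in this section concrete, the following is a minimal, hypothetical sketch of a compute node's main function following the constructor/run pattern; the class names used for the tree specification and the loop scheduler, and all constructor arguments, are our assumptions and not part of the paper.

// Hypothetical wiring of the DAT runtime objects at one compute node.
// ComputeNode, RDC and BendersNode are names from the paper; the tree
// specification and loop scheduler classes and their arguments are assumed.
public static void main(String[] args) {
    int nodeNumber = Integer.parseInt(args[0]);          // number of this compute node
    new ComputeNode(nodeNumber,
                    "BendersNode",                       // AlgorithmNode implementation (class name)
                    "RDC",                               // Coordination implementation (class name)
                    new ExampleTreeSpecification(/* tree shape and mapping */))
        .run(new ExampleLoopScheduler(/* number of threads */));
}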
4. APPLICATION EXAMPLE: STOCHASTIC OPTIMIZATION

A multistage decision problem, for example in portfolio optimization, is defined as a set of node problems, each of which is associated with a node of a scenario tree describing the possible developments of the environment. At every node, based on the scenario, a node-specific objective function is formulated. The optimization of a combination of all node-specific functions is the ultimate goal:
[Figure 2a: flowchart of the asynchronous node algorithm — when(new cut or new solution): update R, Q; solve LP; branch on [feasible]/[infeasible] and compute a feasibility cut if infeasible; further guards include [objVal > master.objVal], [cardinality = #slaves], [termination] and [else].]

class BendersNode implements AlgorithmNode {
    Solution master;        // solution from master
    Vector Q[], R[];        // cuts from slaves
    Coordination coord;     // coordination object
    Accessible from;        // source of get operation
    boolean synchronous;    // is synchronous version
    boolean forward;        // sweep direction
    LP lp;                  // linear program object

    // loop body of the decomposition algorithm
    public void iteration(int iterationNumber) {
        Solution solution;  // solution of LP
        Cut cut;            // cut for master
        from = (synchronous ?
                (forward ? coord.getPredecessor() : coord.getSuccessors())
                : coord.getAll());
        from.get(synchronous);
        // update master, R, Q ...
        lp = new LP(master, Q, R);
        // create and solve LP ...
        coord.getSuccessors().put(solution);   // send solution ...
        coord.getPredecessor().put(cut);       // send cut
    }
}

Figure 2. a) Node algorithm for nested Benders decomposition b) DAT implementation
Minimize   Σ_{n ∈ N} f_n(x_n)

subject to (∀ n ∈ N):   T_n x_{pred(n)} + A_n x_n = b_n,   x_n ∈ S_n
where N denotes the set of nodes in the tree. A node n ∈ N is associated with a local objective function f_n with respect to the decision variables x_n. A_n, b_n, and T_n describe the constraints representing the dependency on the decision variables at the predecessor node pred(n), and S_n describes constraints local to node n, for example budget constraints. In the case of the root node, T_n = 0. In the nested Benders decomposition method, every node performs an iterative procedure, acting as a master, as a slave, or as both [1]. The master solves a linear program, sends the solution to the slaves (the successor nodes), and receives from them additional constraints (cuts) which will improve the master's solution in the next iteration. In the synchronous version of the method, each node receives the solution from its master, builds and solves the local problem, then sends the solution to its slaves, and waits for cuts from every slave. The solution process is a sequence of forward and backward sweeps over the whole tree. In the asynchronous version, every node waits until it has received data from at least one of its slaves or from the master [9]. Figure 2 shows the asynchronous node algorithm and fragments of the algorithm layer code for both versions. Note that the Accessible variable from is used to express the different cases of getting data from neighbor nodes.

Figure 3 shows the results of initial experiments with a multistage financial portfolio optimization problem, with tree sizes from 127 nodes to 255 nodes. Every tree node runs as a separate thread; the nested Benders decomposition code calls, within a synchronized block and via JNI and C, the E04MFF NAG Fortran LP routine. For the distribution of the tree, the root node and its descendants, up to a specific depth in the tree, are mapped to the "root" processor. The subtrees emanating from the nodes at that depth are each mapped as a whole to the remaining compute nodes; they require communication with the root processor only.
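The depth-based distribution just described could be expressed, for instance, by a mapping function of the following form inside a DistributedTreeSpecification implementation; the TreeNode type and its accessor methods are our assumptions, since the paper only states that arbitrary mappings can be defined in Java code.

// Hypothetical depth-based mapping of scenario tree nodes to compute nodes:
// nodes above depth d go to the root processor 0, and each subtree rooted at
// depth d is assigned, as a whole, to one of the remaining compute nodes.
static int computeNodeOf(TreeNode n, int d, int numComputeNodes) {
    if (n.depth() < d) {
        return 0;                                   // root processor
    }
    TreeNode subtreeRoot = n.ancestorAtDepth(d);    // root of the subtree containing n
    return 1 + (subtreeRoot.indexAtDepth() % (numComputeNodes - 1));
}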
[Figure 3 plots: execution time (y-axis, 0 to 10) versus number of compute nodes (1, 3, 5), with curves for tree sizes of 127, 255, and 511 nodes, each in an asynchronous and a synchronous version.]
Figure 3. Execution times on a) a network of Sun workstations and b) a Beowulf cluster

The initial size of the local constraint matrices was 7 x 6; the values for the depth parameter d of the distribution of nt tree nodes onto nc compute nodes were chosen as (nt|nc, d) = (127|5, 4), ({127|3, 255|5}, 5), and ({255|3, 511|5}, 6). The results expose some properties of the algorithm. In the synchronous version, the tree nodes perform on average fewer iterations than in the asynchronous one (1); this is reflected in shorter execution times on a single compute node. In the asynchronous version, the tree nodes spend less time waiting for new data (2). When running in parallel, a compute node as a whole is idle when all of its tree nodes are waiting. Shorter processor idle times result in larger speedups for the asynchronous version. Still, these are rather small due to the small node problem sizes, resulting in weak computation/communication ratios; we expect better numbers for larger problems (which currently suffer from numerical instabilities). The number of iterations per tree node also increases with the tree size (3). The additional increase in the number of communication operations is seen as one reason for the longer execution times of the asynchronous version with larger trees in parallel. In addition, low-level effects such as the thread scheduling overhead of the particular runtime system (JVM and operating system) have to be taken into account.

5. CONCLUSIONS

In this paper we have presented the Distributed Active Tree programming model, which allows the application programmer to express tree structured iterative algorithms at a high level in a natural way. The model can be implemented with a variety of protocols and communication mechanisms, including web services and grid technology. We described an implementation on top of Java/RMI, with the sole use of Java's multithreading, communication and synchronization mechanisms. As a case study, a parallel decomposition technique for solving large scale stochastic optimization problems has been implemented, in a synchronous and in an asynchronous version. Within the Aurora project, the nested Benders decomposition has been parallelized using OpusJava [6], and the DAT with the coordination implemented on top of JavaSymphony [5]. An alternative optimization algorithm is described in [2], parallel decomposition techniques
and their implementation in [4, 10, 13, 14]. For writing high performance applications in Java, language extensions have been defined. Spar [12] provides extensive support for arrays, such as multidimensional arrays, specialized array representations, and tuples; it supports data parallel programming and allows, via annotations, for an efficient parallelization. Titanium [15] is a Java dialect with support for multidimensional arrays; it provides an explicitly parallel SPMD model with a global address space and global synchronization primitives. HPJava [3] adds SPMD programming and collective communication to Java. DAT does not change the syntax or semantics of Java and is specifically targeted at a high-level formulation of tree structured iterative algorithms and a highly modular architecture. A classification of language extensions, libraries, and JVM modifications for high performance computing in Java is given in [8]. The implementation of the nested Benders decomposition algorithm is subject to optimization along various dimensions, such as the tree distribution (including dynamic pruning and rebalancing of the tree), the loop scheduling strategy, the underlying communication mechanism, LP solution techniques (warm start of the LP solver), and the mapping of scenario tree nodes to local problems. The Distributed Active Tree is a tool to implement and combine variants of all contributing parts, to study the interplay of effects, and to achieve a high performance application.

REFERENCES
[1] J. F. Benders. Partitioning procedures for solving mixed-variable programming problems. Numer. Math., 4:238-252, 1962.
[2] S. Benkner, L. Halada, M. Lucka. Parallelization Strategies of Three-Stage Stochastic Program Based on the BQ Method. In R. Trobec, P. Zinterhof, M. Vajtersic, A. Uhl, editors, Parallel Numerics'02, Theory and Applications, pp. 77-86, October 23-25, 2002.
[3] B. Carpenter, G. Zhang, G. Fox, X. Li, Y. Wen. HPJava: data parallel extensions to Java. Concurrency: Practice and Experience, 10(11-13):873-877, 1998.
[4] M. A. H. Dempster, R. T. Thompson. Parallelization and aggregation of nested Benders decomposition. Annals of Operations Research, 81:163-187, 1998.
[5] T. Fahringer, A. Jugravu, B. Di Martino, S. Venticinque, H. Moritsch. On the Evaluation of JavaSymphony for Cluster Applications. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster2002), Chicago, Illinois, September 2002.
[6] E. Laure, H. Moritsch. Portable Parallel Portfolio Optimization in the Aurora Financial Management System. In Proceedings of SPIE ITCom 2001 Conference: Commercial Applications for High-Performance Computing, Denver, Colorado, August 2001.
[7] D. Lea. Concurrent Programming in Java. Addison-Wesley, Reading, Mass., 1997.
[8] M. Lobosco, C. Amorim, O. Loques. Java for high-performance network based computing: a survey. Concurrency and Computation: Practice and Experience, 14:1-31, 2002.
[9] H. Moritsch, G. Ch. Pflug, M. Siomak. Asynchronous nested optimization algorithms and their parallel implementation. In Proceedings of the International Software Engineering Symposium, Wuhan, China, March 2001.
[10] S. S. Nielsen and S. A. Zenios. Scalable parallel Benders decomposition for stochastic linear programming. Parallel Computing, 23:1069-1088, 1997.
[11] G. Ch. Pflug, A. Świętanowski, E. Dockner, H. Moritsch. The AURORA Financial Management System: Model and Parallel Implementation Design. Annals of Operations Research, 99:189-206, 2000.
[12] C. van Reeuwijk, F. Kuijlman, H. J. Sips. Spar: a set of extensions to Java for scientific computation. Concurrency and Computation: Practice and Experience, 15(3-5):277-297, 2003.
[13] A. Ruszczynski. Parallel decomposition of multistage stochastic programming problems. Math. Programming, 58:201-228, 1993.
[14] H. Vladimirou, S. A. Zenios. Scalable parallel computations for large-scale stochastic programming. Annals of Operations Research, 90:87-129, 2000.
[15] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, A. Aiken. Titanium: a high-performance Java dialect. Concurrency and Computation: Practice and Experience, 10(11-13):825-836, 1998.
A Rewriting Semantics for an Event-Oriented Functional Parallel Language

F. Loulergue, Laboratory of Algorithms, Complexity and Logic, 61, avenue du Général de Gaulle, 94010 Créteil cedex, France

This paper presents the design of the core of a parallel programming language called CDS*. It is based on explicitly-distributed concrete data structures and features compositional semantics, higher-order functions and explicitly distributed objects. The denotational semantics is outlined, the (equivalent) operational semantics is presented, and a new realization of the latter is given as a rewriting system.

1. INTRODUCTION

Resource-aware programming tools and programs with measurable utilization of network capacity, where control parallelism can be mixed with data parallelism, are advocated. In [8, 5] we proposed semantic models for languages whose semantics is functional and whose programs are explicitly parallel. Such languages address the above-stated requirements by expressing data placement, and hence communications, explicitly, allowing higher-order functions to take placement strategies and generic computations as arguments, allowing higher-order functions to monitor communications within other functions, and yet avoiding the complexity of concurrent programming. We also have introduced the elements of an explicitly-parallel functional language CDS*: denotational semantics, operational semantics and a full abstraction result. It is inspired by Berry and Curien's sequential language CDS [1] but uses Brookes and Geva's generalized concrete data structures (gcds) [2], for compatibility with parallel execution. Here we present a detailed rewriting semantics. Unlike Multilisp [6] or CD-Scheme [10], CDS* causes no dynamic process creation, in order to facilitate performance prediction. Moreover all the possible events of a CDS* program are declared statically together with their physical processor (or static process) location. We call this feature explicit processes and share it with Caml-Flight [4] and the proposed BSP library standard [7]. CDS* improves on Caml-Flight and BSP by its compositional semantics. User-defined functions define dependencies between (explicitly-located) events and thus prescribe the communications generated by their application. As observed by Berry and Curien, a program of functional type on concrete structures can observe event dependencies inside a functional argument and thus compare different algorithms before applying them. In our context, this means that a second-order function can compare first-order functions' load-balancing and communication properties before applying them. This goes one step beyond the proposal of Mirani and Hudak [9], which was to use explicit and programmable schedules but to obtain cost information from system calls. Compared with data-flow programming, CDS* is a gener-
alization by its inclusion of higher-order functions, but it is beyond the scope of this paper to make an exact comparison. Let us simply observe that connecting deterministic programs with streams is a special case of programming with sections. The following sections present the language's denotational and operational semantics and then the rewriting semantics.

2. CDS* AND GENERALIZED CONCRETE DATA STRUCTURES WITH INDICES

A CDS* program is made of type definitions, followed by term definitions and a request to evaluate a given term. The definition of a type inference system and of parameterized and polymorphic types is an important open problem. As a result our (strong) typing is explicit and monomorphic. A CDS* type is a gcds with explicit process indices. A gcds is a set of cells with allowed values for each one, seen as a game to be played. Certain cells can be filled at any time and others can only be filled after being enabled by specific finite sets of filling events. A valid gcds configuration is called a state and the object of the game is to compute states monotonically, never erasing or changing the value given to a cell. As such, states are generalized traces/streams. A program of functional type τ → τ' describes a continuous function from states of gcds τ to states of gcds τ'. Program syntax is almost standard for a functional language except for one crucial difference: there is no λ-binding. Elementary terms are simply an enumeration of finite states, as sets of events. Now, unlike standard functional programs which denote abstract functions, a CDS* program denotes a function between concrete domains which can be (concretely) encoded by the state of a special-purpose exponential gcds. As a result, elementary terms of type τ → τ' enumerate so-called exponential events, which are in fact functional dependencies between (sufficient) input events and (necessary) output events.

A generalized concrete data structure is a tuple ((C, ≤), V, E, ⊢) where C is a countable and partially ordered set of cells, V is a countable set of values, and E ⊆ C × V is a set of events. An event (c, v) is often written cv. The enabling relation ⊢ is between finite sets of events and cells. It induces a precedence relation on cells: c << c' iff ∃y, v. y ∪ {cv} ⊢ c'. The enabling relation must be well-founded. The set of events and the enabling relation are upwards-closed with respect to cell ordering, namely cv ∈ E, c ≤ c' ⇒ c'v ∈ E and y ⊢ c, c ≤ c' ⇒ y ⊢ c'. We assume a finite and statically-available set I of indices. A gcds with indices is a gcds together with a total function λ from the set of cells to the set of indices, called the location function. A cell c is called initial if it is enabled by the empty set of events (written ⊢ c). A cell c is filled in a set of events y if ∃v. cv ∈ y. Write F(y) for the set of cells filled in y. If y ⊢ c, then y is an enabling of c. Let E(y) be the set of cells enabled in y, and call A(y) = E(y) − F(y) the set of cells accessible from y. If y ⊢ c and ∀d ∈ F(y), λ(d) = λ(c), then the enabling y ⊢ c is local, else it is synchronizing. Let M, M', N denote gcds from now on. A state of M is a set of events x ⊆ E_M which is functional and safe, namely where cv1, cv2 ∈ x ⇒ v1 = v2 and c ∈ F(x) ⇒ ∃y ⊆ x. y ⊢ c. States must be upwards-closed with respect to cell ordering: cv ∈ x, c ≤ c' ⇒ c'v ∈ x. The generalized concrete domain, or simply domain, associated with M is the poset (D(M), ⊆) where D(M) is the set of states of gcds M.
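As a small illustration of these definitions (ours, not the paper's), consider the gcds B with a single cell c, values V = {tt, ff}, events E = {c tt, c ff} and the empty enabling ⊢ c, so that c is initial. Its states are ∅, {c tt} and {c ff}, and its domain is the flat domain of booleans ordered by inclusion. Taking the product of two indexed copies of B, with the location function mapping the renamed cells c.1 and c.2 to indices 0 and 1, gives a gcds with indices whose two cells are placed on two different processors.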
In the mathematical description we sometimes leave λ implicit, although it is understood that the concrete location of every event is that of its cell. Because all types (domains) are concrete, the usual type constructors product and arrow have
concrete realizations. The product construction is straightforward: given two gcds with indices M1, M2, we define the gcds with indices M1 × M2 by renaming the cells c of Mi into c.i (i = 1, 2) and by taking the union (thanks to the cell renaming this union is a disjoint union) of the cell orderings, values, access relations and location functions of the two gcds with indices. As expected, and easily verified, the domain of M1 × M2 is isomorphic to the cartesian product of the domains of M1 and M2. The definition of a correct exponential is non-trivial and was in fact a central issue in the works of Berry and Curien and later Brookes and Geva. Let two gcds with indices M, M' be given and let us call them the source structure and the target structure. The exponential M → M' is (C, ≤, V, E, ⊢) where C = D_fin(M) × C_M', and D_fin(M) is the set of finite states of M ordered by inclusion. (z, c') will be abbreviated to zc'. Finite states are ordered by inclusion and the cells of the target retain their order: ≤ = ⊆ × ≤_M'.
The application of a functional state a to an argument state y is then given by

a · y = { c'v' | ∃z ⊆ y. zc'v' ∈ a }    (1)
Clearly (1) defines, for a given functional state a and argument state y, a unique set of events. Moreover this set of events can be verified to be functional and safe [2]. This is therefore used as the definition of function application in CDS*. Finally, there is a type constructor specific to gcds, the graft, which adds great flexibility in practical applications; its use is illustrated in [3]. Due to lack of space we do not give a concrete syntax for CDS* and refer instead to [5]. This syntax allows one to define (for the moment only finite) types (gcds with indices) by enumeration of the cells and enablings and by combining types using product, sum or arrow. The values and functions are named states given by enumeration of their events. Terms are function names, application of one term to another, composition of terms, pairs or couples of terms, fix-points, currying or uncurrying of a term (of correct types, of course).
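As a small worked illustration (ours, not the paper's): if a functional state a contains the exponential events ∅c0v0 and {c1v1}c'v', then applying a to the argument state y = {c1v1} yields {c0v0, c'v'} — the first output event requires no input at all, while the second is produced because the finite state {c1v1} is contained in y; applying a to the empty state yields only {c0v0}.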
3. OPERATIONAL SEMANTICS

We now define the formal operational semantics of CDS*, which is fully abstract with respect to the denotational semantics outlined above [5]. The operational semantics manipulates memo terms, terms which memorize parts of their evaluation. A memo term is a CDS* program term some of whose syntactic nodes are tagged with tables. Tables, initially empty, store current approximations of auxiliary states. For example, a function-application node is tagged with a table approximating the state denoted by the right sub-term, i.e. the argument of the application. Let us recursively define the memo terms. States are memo terms. Couples, pairs, currying and uncurrying of memo terms are memo terms. If T and U are two memo terms, and x is a state of the same type as U, then [T.U, x] is a memo term, where x represents the current approximation of U and is called a table. If T is a memo term and x is a state of the same type as T's input, then
[fix(T), x] is a memo term. If T and T' are memo terms and F is a set of pairs of states (x, x') where x (resp. x') has the same type as T's (resp. T''s) input, then [T' ∘ T, F] is a memo term; x is an approximation of the input of T and x' is an approximation of T(x). CDS* expressions are memo terms without the tables x and F. We then define an inference system on a set of requests T?c and answers T!v, where T is a memo term, c a cell of the same type and v a value. The rules are presented in figure 1. In the rules AP1, AP2, the table x represents the current approximation of the input U. The union ⊔ is defined over memo terms whose pure-term parts are equal: its result ⊔_c U_c has the same pure-term part as its arguments U_c and has, as approximation part, the union of the approximation parts of the U_c, and so on recursively. In the rules for composition of T' ∘ T, we have an input state x and a table F of pairs (z, z') where z is a set of input events and z' a set of output events for T.z (i.e. input events for T'). From x and F, two states y, y' are defined as follows: y is the set of input events found in F and relevant w.r.t. x (i.e. elements of x), and y' is the set of output events found in F and relevant w.r.t. x. More precisely: y = ∪{z | ∃(z, z') ∈ F. z ⊆ x} and y' = ∪{z' | ∃(z, z') ∈ F. z ⊆ x}. By construction, for each pair (z, z') of F, z' represents an approximation of the input to T'. Rules AP2, F2, C2 and C3 are the only rules that introduce sub-requests which will be executed in parallel. These sub-requests are determined by a set A(x), the set of cells accessible from x (following the access conditions). Rules CL1 and P1 are for couples and pairing respectively; there exist also symmetrical rules CL2 and P2 (not shown here).
Figure 1. Operational semantics
[The figure lists the inference rules CL1, P1, CUR, UNC, AP1, AP2, F1, F2, C1, C2, C3, TRN and E; the rule bodies are not reproduced here.]
4. TOWARDS AN ABSTRACT MACHINE: REWRITING SEMANTICS

Like the natural semantics of usual functional languages, the operational semantics cannot be used in practice. We therefore investigate a new semantics which is closer to an implementation.
The rewriting semantics is defined by figures 2, 3 and 4 below, representing possible transitions from left columns to right columns. The global data structure is a multi-set of tasks together with a function (a set of pairs) mapping syntactic nodes of the term being evaluated to tables. In the figures the columns contain only the tasks relevant to the rules; it must be understood that the remaining tasks of the multi-set are left unchanged. There are two main forms of tasks. The first component of a task is the name of a cell together with the term whose type contains it. The last component of a task is always a mode: either a value computed for the given cell, a failure marker ⊥ to indicate that the cell could not be filled, or a request mode. A simple request mode "?" means a general request to fill the cell. An indexed request mode ?_c means a request with the added information that the value sought is also that of cell c; as an example of this last kind of mode, see rule CL2 in the first figure. A last kind of mode ?tr indicates that a slice of state (A(x) in the operational semantics) is currently being filled. A task with this mode has a third, middle component, which stores the set of cells remaining to fill and the current result of the attempt at filling the slice; this result is either failure on all cells (⊥) or success with at least one cell (!). A special kind of task is also used (in figure 4) for filling the so-called F-tables when evaluating a composition term. The following definition is needed: V(x, c) = v if cv ∈ x, and V(x, c) = ⊥ if cv ∉ x.
In general, transition rules come in pairs: a call for evaluation and a return with the result, for example CL1 with CL2. The rules in figure 2 merely follow the syntactic structure of terms. For example, rule ST must be read as: if the multi-set of tasks contains the task (c_x, ?), then it will be replaced by the task (c_x, v) if cv ∈ x and by the task (c_x, ⊥) otherwise.
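For instance (our illustration), with x = {c1v1}, the task (c1_x, ?) rewrites to (c1_x, v1), whereas for a cell c2 not filled in x the task (c2_x, ?) rewrites to (c2_x, ⊥), in accordance with the definition of V(x, c) above.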
Figure 2. Rules without table. (Transitions for states (ST), couples (CL1, CL2, CR1, CR2), pairs (PL1, ...), currying (CUR) and uncurrying (UNC); the Tasks/Tasks' columns of the table are not reproduced here.)
Rules in figure 3 implement the definition of concrete function application. The evaluation proceeds in two phases: first AP1, then AP2, then AP3 or AP4. AP2 asks the global system whether the current approximation of U (namely x) is sufficient to fill the result cell in T.U. If it is, the evaluation completes with AP3. If not, AP4 requests to increment the table (rules AP5, AP6, AP7). If this was a success, AP8 resumes; otherwise AP9 aborts.
Figure 3. Rules for application. (Rules AP1-AP9, with their Tasks, Tables, Tasks' and Tables' columns; the transition table is not reproduced here.)
Rules are also needed for the fix-point. They are similar to those of figure 3; the difference is that the table x is an approximation of the fix-point (and that T.U must be replaced by fix(T) everywhere). The rules in figure 4 implement the composition. The evaluation proceeds in three phases. First, rule C2 requests to obtain the approximation y' of T.x from the F-table. Then rule C3 uses this approximation to get the value of the cell y'c'' in term T'. If the value cannot be obtained, then C3 requests to increment the table. Rules C6-C11 aim at incrementing the approximation of the output of T. In case of success, C10 resumes; in case of failure, C11 requests to increment the input part of T with rules C12 to C17. If the incrementation succeeds, C16 requests a new approximation of the output part; otherwise the whole request fails with rule C17.
5. CONCLUSIONS
CDS* integrates concrete higher-order functions with parallel execution, and this combination of features opens the possibility of improving the state of the art in declarative parallel programming. Open questions are of two kinds. Programming convenience would greatly improve if a polymorphic type system could be designed. Efficient implementation remains to be shown, but the rewriting system presented here constitutes the first draft of a distributed abstract machine, all of its actions being event-driven and therefore having explicit locations. We believe that transparency with respect to communications is the key to effective parallel programming. Our main goal is to prove this kind of transparency compatible with compositional semantics, higher-order functions, and reflexivity.
Figure 4. Rules for composition and F-tables. (Rules C1-C17 and FTBL1, FTBL2, with their Tasks, Tbl, Tasks' and Tbl' columns; the transition table is not reproduced here.)
REFERENCES
[1] G. Berry and P.-L. Curien. Theory and practice of sequential algorithms: the kernel of the applicative language CDS. In J. C. Reynolds and M. Nivat, editors, Algebraic Semantics, pages 35-84. Cambridge University Press, 1985.
[2] S. Brookes and S. Geva. Continuous functions and parallel algorithms on concrete data structures. In S. Brookes, M. Main, A. Melton, M. Mislove, and D. Schmidt, editors, MFPS'91, number 598 in LNCS, pages 326-349. Springer, 1991.
[3] P.-L. Curien. Categorical Combinators, Sequential Algorithms and Functional Programming. Birkhäuser, Boston, second edition, 1993.
[4] C. Foisy and E. Chailloux. Caml Flight: a portable SPMD extension of ML for distributed memory multiprocessors. In A. W. Böhm and J. T. Feo, editors, Workshop on High Performance Functional Computing, Denver, Colorado, April 1995. Lawrence Livermore National Laboratory, USA.
[5] G. Hains, F. Loulergue, and J. Mullins. Concrete data structures and functional parallel programming. Theoretical Computer Science, 258(1-2):233-267, 2001.
[6] R. Halstead, Jr. Multilisp: A Language for Concurrent Symbolic Computation. ACM Transactions on Programming Languages and Systems, 7(4):501-538, October 1985.
[7] J. M. D. Hill, W. F. McColl, et al. BSPlib: The BSP Programming Library. Parallel Computing, 24:1947-1980, 1998.
[8] F. Loulergue and G. Hains. Parallel functional programming with explicit processes: Beyond SPMD. In C. Lengauer, M. Griebl, and S. Gorlatch, editors, Euro-Par'97, number 1300 in LNCS, pages 530-537. Springer, 1997.
[9] R. Mirani and P. Hudak. First-class schedules and virtual maps. In 7th Annual Conference on Functional Programming Languages and Computer Architecture (FPCA'95), La Jolla, California, 1995. ACM.
[10] C. Queinnec. A concurrent and distributed extension of Scheme. In D. Étiemble and J.-C. Syre, editors, PARLE'92, number 605 in Lecture Notes in Computer Science, Paris, 1992. Springer.
RMI-like communication for migratable software components in HARNESS

M. Migliardi and R. Podesta, DIST-University of Genoa, Via Opera Pia 13, 16145, Italy, {om, ropode}@dist.unige.it, Tel. +39 010 3536549, Fax +39 010 353 2154

1. INTRODUCTION

The foundation of the HARNESS [1] experimental metacomputing system is the concept of dynamic reconfigurability of networked computing frameworks. This reconfigurability goes beyond simply involving the computers and networks that comprise the virtual machine, and includes the capabilities of the VM itself. These characteristics may be modified under user control by accessing a set of Java based kernels and via an object oriented [2] "plug-in" mechanism that is the central feature of the system. While the usefulness of being able to enroll and release computational resources at runtime has already been proved in the past by metacomputing systems such as PVM [3], the plug-in paradigm appeared in the past only in sequential environments and applications such as Netscape Navigator and Adobe Photoshop. On the contrary, HARNESS extends this concept to a parallel, distributed environment, providing users with the capability to control, monitor and manage the pluggability of a distributed virtual machine as a whole. Recently, the scope of distributed programming has enlarged from clusters of computers connected to a LAN toward dynamically evolving sets of resources interconnected through the Internet. In this new scenario the capability of software components to migrate from one node of the virtual machine to another has become paramount. As a matter of fact, while in an environment where all the nodes belong to and are managed by a single entity the disappearance of a node is relatively infrequent, in an unstructured environment such as the Internet the nodes may exhibit a very short staying time with frequent join and barely announced leave events. For this reason, it is necessary to build applications as sets of interacting migratable components capable of surviving the disruption caused by unpredictable component location. In a Java based environment such as HARNESS, one of the most convenient mechanisms for components to interact is RMI [4]. RMI is a fully object oriented remote procedure call mechanism; however, it does not support component migration as it ties the reference of the remote object to its Internet address. This makes it extremely difficult to use RMI in an environment where components need to migrate often from one node to another. In this paper we describe a user controlled task migration mechanism and an RMI-like method invocation mechanism that we developed for a migratable components support suite in the HARNESS framework. The paper is structured as follows: in section 2 we give an overview of the HARNESS system architecture; in section 3 we describe how HARNESS supports user controlled migration of multi-threaded tasks; in section 4 we describe how we implemented a migration
independent RMI-like method invocation; finally, in section 5, we provide some concluding remarks.

2. OVERVIEW OF THE HARNESS METACOMPUTING SYSTEM

In figure 1 we show the abstract layers of the HARNESS system architecture.
[Figure 1 diagram: the layered abstract model of a HARNESS DVM — heterogeneous computational resources enrolled in the DVM and services plugged in to form the service baseline of the DVM.]
Figure 1. Abstract model of a Harness virtual machine

The fundamental abstraction in the HARNESS metacomputing framework is the Distributed Virtual Machine (DVM) (see figure 1, level 1). Any DVM is associated with a symbolic name that is unique in the HARNESS name space, but has no physical entities connected to it. The HARNESS name space is a mapping of the symbolic DVM name onto an access point that heterogeneous computational resources can contact to enroll into the DVM. There are at present two implementations of the HARNESS name space: the first is based on multicast and the hashing of the DVM name onto port numbers, and the second is based on a HARNESS name server. Heterogeneous computational resources may enroll into a DVM (see figure 1, level 2) at any time. At this level the DVM is only able to provide two services: loading of plug-ins and status tracking and management. To satisfy the needs of users and applications, the heterogeneous computational resources enrolled in a DVM need to load plug-ins (see figure 1, level 3). A plug-in is a software component implementing a specific service. By loading plug-ins a DVM can build a service baseline that is consistent with the needs of the application
over a large set of heterogeneous resources (see figure 1, level 4). Users may reconfigure the DVM at any time (see figure 1, level 4), both in terms of the computational resources enrolled, by having them join or leave the DVM, and in terms of the services available, by loading and unloading plug-ins. The main goal of the Harness metacomputing framework is to achieve the capability to enroll heterogeneous computational resources into a DVM and make them capable of delivering a consistent service baseline to users. This goal requires the programs building up the framework to be as portable as possible over as large a selection of systems as possible. The availability of services to heterogeneous computational resources derives from two different properties of the framework: the portability of plug-ins and the presence of multiple searchable plug-in repositories. Harness implements these properties mainly by leveraging two features of Java technology: the capability to layer a homogeneous architecture such as the Java Virtual Machine (JVM) [5] over a large set of heterogeneous computational resources, and the capability to customize the mechanism adopted to load and link new objects and libraries. However, the adoption of the Java language as the development platform for the Harness metacomputing framework has given us several other advantages: an OO development environment, a clear and consistent boundary for plug-ins (each plug-in has to appear as a Java class), a natural environment for developing additional services both in a passive, library-like flavor and in an active, thread-enabled flavor, a generic methodology to transfer data over the network in a consistent format (Java Object Serialization [6]), and a way to tune the trade-off between portability and efficiency for the different components of the framework (JNI [7]).

3. HARNESS USER CONTROLLED MIGRATION OF MULTI-THREADED TASKS

Our implementation of task migration support in HARNESS has three main requirements:
• keeping strict compatibility with the standard Java 2 platform by requiring no changes to the JVM;
• supporting migration of multi-threaded tasks;
• generating as limited as possible an additional burden for programmers of mobile tasks.
The first constraint is imposed by the fact that one of the main goals of HARNESS is portability over a set of heterogeneous systems that is as large as possible. Thus it was necessary to renounce a strong migration model and to limit our design to a weak migration model. In fact, modifying the JVM to support serialization of the JVM stack would have completely nullified the portability advantage obtained by adopting the Java language, by requiring users to run a special version of the JVM itself. For this reason we implemented a weak migration mechanism that requires applications to adhere to a set of rules in order to be migratable. These rules are:
• tasks must be modeled as one or more finite state machines (one finite state machine per running thread);
• at every instant in which migration is allowed, threads must store their complete status in non-automatic, non-static variables, i.e. instance variables;
• a task must extend the Migratable abstract class;
• every thread that can be migrated must be obtained using the createThread method of the Migratable abstract class, not the constructors of the Thread class.
A task that extends the Migratable abstract class needs an execution environment in the physical nodes of the DVM. Such an environment is provided by the implementation of the functional interface named Arena. This implementation is a HARNESS plug-in that can be loaded onto each node belonging to the DVM according to the general programming model of the framework. One node can host one or more Arenas. However, Arenas are able to manage multiple Migratables, so having more than one Arena in a single node is redundant unless different Migratable tasks need to strictly separate their execution environments. Modeling a thread as a finite state machine does not represent a significant burden for the programmers; on the contrary, a finite state machine seems to be a very natural way to model the evolution of a sequential entity such as a thread. A multi-threaded task can be realized simply by defining a different finite state machine for each type of thread involved in the computation. The restriction of storing the status of tasks in instance variables does not mean that the use of automatic variables is forbidden. On the contrary, this rule only requires that, before acknowledging that it is ready for migration, a thread must transfer all the values that are part of its status from automatic variables to instance variables. Any part of the status that is not stored in instance variables will not migrate with the task. Migration can be requested by any thread in a task at any time during the task execution. However, migration will only actually take place as soon as all the migratable threads (i.e. the ones that have been created with the createThread method) have acknowledged their availability by calling the canIContinue method. Besides, entities outside of the task can request the migration of the migratable task; in fact, the requestMigration method is public. In order to start a migratable task in a HARNESS DVM, it is necessary to have at least one Arena plug-in available in the DVM. More in detail, each node of the DVM that wishes to be capable of hosting migratable tasks needs to load at least one instance of the Arena plug-in. If a node is used mainly for providing other services and does not need to host migratable software components, it does not need to load the Arena plug-in. The process of injecting a multi-threaded task into an Arena plug-in is composed of the following steps (a sketch illustrating these rules and steps is given at the end of this section):
1. get the name of the DVM;
2. get the reference to any Arena plug-in in the DVM;
3. instantiate an object that extends the Migratable abstract class;
4. invoke the start method of the Arena whose reference has been obtained, passing the Migratable instantiated at step 3 as a parameter.
The return value of the start method is an ID that is unique at HARNESS DVM level. This ID allows tracking the task regardless of its migrations from one node of the DVM to another. The Arena plug-ins and the Migratable class leverage the HARNESS event system to keep track of the node where the migratable task is currently hosted. The current implementation of the HARNESS event system is based on a star topology that poses some performance limitations. In fact, the system is not capable of delivering more than five hundred events per second in a DVM
with twenty active nodes. However, the plug-in based architecture of the HARNESS system allows adopting a new, more performant event service if the need arises. An Arena plug-in advertises the reception of a migratable task, while a Migratable advertises its leaving an Arena. These events are distributed by HARNESS as user events with a specific Java type, namely the MigrationEvent type. This binding between event type and Java type makes it extremely easy for applications interested in the movement of migratable tasks to subscribe to this specific kind of event, without the need to transmit additional, protocol specific event type tags. Thus applications can keep track of migration events to follow the movements of migratable tasks and pin-point their location if they need to. Arenas process migration events and provide a location service for migratable tasks by means of a lookup method. An entity can request from any Arena the location of a migratable task, identifying it by means of the unique ID value that is generated as soon as a migratable task is injected into any of the Arena plug-ins loaded onto a HARNESS DVM. The ID value is unique in the whole HARNESS DVM and cannot be changed after it has been set into a migratable task. The lookup method returns the HARNESS handle of the Arena where the migratable task is currently running. Note that a HARNESS handle is a string with a specific format, namely //hostname/DVMname/plug-inName/instance#. This string can be used inside a HARNESS DVM to request services from a plug-in, as it identifies unambiguously the HARNESS plug-in it is associated with. The association between the unique task ID and the Arena handle allows the run-time to discover the node of the HARNESS DVM where a migratable task is currently residing.
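As an illustration of the migration rules and the injection steps described above, the following is a minimal, hypothetical sketch of a migratable task; the paper does not give the exact signatures of Migratable, Arena.start or createThread, so the argument types and helper calls used here are assumptions.

// Hypothetical sketch of a task obeying the migration rules: the worker
// thread is a finite state machine, its complete status is kept in instance
// variables, and threads are obtained via createThread. Signatures of
// createThread, canIContinue and requestMigration beyond what the text
// states are assumptions.
public class CounterTask extends Migratable {
    private int fsmState = 0;     // state of the finite state machine
    private long counter = 0;     // complete task status, in an instance variable

    public void begin() {
        Thread worker = createThread(new Runnable() {   // never use new Thread(...)
            public void run() {
                while (true) {
                    canIContinue();                     // acknowledge readiness to migrate
                    switch (fsmState) {
                        case 0: counter++;
                                if (counter % 100000 == 0) fsmState = 1;
                                break;
                        case 1: requestMigration();     // public, may also be called externally
                                fsmState = 0;
                                break;
                    }
                }
            }
        });
        worker.start();
    }
}

// Injection into a DVM (steps 1-4 of the text); how the DVM name and an
// Arena reference are obtained is not shown here.
//   Arena arena = ...;                       // any Arena plug-in of the DVM
//   arena.start(new CounterTask());          // injects the task; a DVM-wide unique ID identifies it from then on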
4. THE RMI-LIKE METHOD INVOCATION MECHANISM FOR MIGRATABLE TASKS IN HARNESS

An entity can interact with a migratable task in several different ways. In fact, HARNESS does not impose a single mechanism to interact with migratable tasks; on the contrary, migratable tasks are free to support direct, temporally and spatially synchronous interaction such as RMI, indirect interaction such as messages posted to a JavaSpace, or both. However, Arenas provide an additional mechanism for entities to interact with migratable tasks. This mechanism uses reflection together with the locator service provided by Arenas to allow applications to invoke any of the public methods of a migratable task regardless of its current location. The most important part of this mechanism is the public method invoke provided by the Arena plug-in. This method allows the RMI-like method invocation and operates as follows:
• it obtains the current task handle using the task ID as a key in the lookup table where the handles are stored;
• it matches the Arena handle against the task handle;
• if the task and the Arena are residing on the same host, a Migratable object is initialized with the instance of the task obtained by looking up the local database of migratable objects using its ID as a key (this step follows the rule shown in the previous section of the paper); an object of class Method
keeps the method name and the formal parameters that are provided as input to the invoke method, and finally, through Java reflection, the requested method is invoked with the actual parameters;
• if the task is residing in another Arena, loaded onto another node of the HARNESS DVM, the invoke method obtains the RMI reference of the Arena that currently owns the task and then calls its invoke method, which operates as just described.
In figure 2 we show the core code of the implementation of the invoke method of the class ArenaImpl and its abstract interface Arena.
public interface Arena extends java.rmi.Remote {
    public void receive(Object that) throws java.rmi.RemoteException;
    public void start(Object that) throws java.rmi.RemoteException;
    public Object invoke(String ID, String methodName, Class[] formalParameters,
                         Object[] actualParameters) throws java.rmi.RemoteException;
    public java.util.LinkedList ls() throws java.rmi.RemoteException;
    public java.util.LinkedList lsAll() throws java.rmi.RemoteException;
    public void test(java.io.Serializable test) throws java.rmi.RemoteException;
}

public Object invoke(String ID, String methodName, Class[] formal, Object[] actual)
        throws java.rmi.RemoteException {
    /* (...) */
    if (source.equals(myHandle.toString())) {
        Migratable obj = null;
        synchronized (migrantes) {
            obj = (Migratable) migrantes.get(ID);
        }
        /* (...) */
        Class clazz = obj.getClass();
        Method method = clazz.getMethod(methodName, formal);
        /* (...) */
        retval = method.invoke(obj, actual);
    }
    /* (...) */
    else {
        Arena owner = (Arena) H_Score.getRMIReference(source);
        /* (...) */
        retval = owner.invoke(ID, methodName, formal, actual);
        /* (...) */
    }
    /* (...) */
}

Figure 2. The Arena interface and the core code of the invoke method of ArenaImpl
Therefore, an external application can call the invoke method of an Arena, passing as parameters the migratable task ID, the requested task method name, the array of formal parameter types and the array of actual parameters. The Arena checks whether the requested task is locally present; if it is, it generates a local function call and directly returns the obtained results. If the requested migratable task is not locally available, then the Arena checks a lookup table to see if another Arena currently has the requested task. If it finds the requested task, it requests the invocation service from the remote Arena. If the requested task is not found on any of the Arenas in the DVM, then the local Arena assumes that the task is currently migrating and waits for an Arena to signal reception of the requested task. The wait time is limited by a time-out to prevent blocking the caller entity forever. Another important limitation imposed on the waiting time of a caller is the number of redirections performed by Arena plug-ins. In fact, our approach is not capable of preventing a Migratable from leaving an Arena before the actual invocation of the method takes place. To avoid a livelock effect where the invocation message keeps following the Migratable's movements without actually ever reaching it, we introduce a time-to-live concept for all the redirected calls. Following the client-server model, we can call the external entity the client application and the migratable task the server application. Such a server has the particularity of not having a fixed location, as it can migrate from one HARNESS DVM node to another without disrupting its clients. This kind of invocation allows programmers to overcome the limitation of standard RMI invocation: the constraint of having a fixed binding between the remote object and its Internet address. With the introduction of the Arena invocation service, the callers are freed from this problem; in fact, any component that needs to obtain a service from a migratable component can do it regardless of the callee's current location. The task of locating the callee is delegated to the implementation of the Arena abstract class, the ArenaImpl class, by means of its public method invoke. Therefore, the invocation remains transparent to the external application. The package supporting task migration in HARNESS is currently in beta release. A distribution that includes the sources of both the framework and a test application is available. This distribution demonstrates the location-unaware invocation of a public method of a Migratable that stores the history of its random migrations.
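As a usage illustration (ours, not the paper's), a client-side call following the parameter list just described might look as follows; the method name being invoked is an assumption.

// Hypothetical caller (anyArena and taskId assumed to be already available):
// anyArena is a reference to any Arena plug-in of the DVM, taskId is the
// DVM-wide unique ID assigned when the task was injected.
Object history = anyArena.invoke(
        taskId,
        "getMigrationHistory",   // hypothetical public method of the Migratable
        new Class[] {},          // formal parameter types (none)
        new Object[] {});        // actual parameters (none)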
5. CONCLUDING REMARKS

RMI is a very convenient mechanism for leveraging services provided by software components independently of their location. However, its tight binding of service end-points to IP addresses prevents programmers from using it in the presence of migration mechanisms. In this paper we have described how the HARNESS framework supports a mechanism for user controlled migration of multi-threaded tasks and an RMI-like mechanism to obtain services from migratable software components. The modular, object oriented nature of the HARNESS system allows on-demand mixing and matching of components or complete applications adopting programming models that are different from its native one. This process is supported by means of compatibility suites that can be plugged into the system if the need for such a programming model arises. In this paper we described the main features of our migratable software components compatibility suite and, more in detail, its capability to support weak migration for multi-threaded
tasks and a location-free RMI-like communication mechanism. In order to qualify as a migratable component, a task must follow a simple set of rules that pose only a limited burden on the programmer's shoulders. Such a software component is allowed to migrate freely among the HARNESS DVM nodes and is also allowed to have public methods that can be invoked by external entities in an RMI-like fashion regardless of the callee's current location. Our implementation of this mechanism allows the programmer to call the methods belonging to a migratable component with no knowledge whatsoever about its current location; as a matter of fact, the migratable component can reside on any HARNESS DVM node. Our implementation achieved the following results: i) it overcame the inherent limitation of RMI that tightly binds a service end-point to an IP address; ii) it allowed programmers to adopt a convenient, RMI-like communication mechanism between migratable components. The cornerstones of our implementation are the capabilities of the Arena HARNESS plug-in. This plug-in keeps track of the state and location of the migratable components, leveraging the event services provided by the HARNESS system. The current, star based implementation of the HARNESS event service may pose scalability problems for more than five hundred events per second. However, the plug-in based architecture of HARNESS allows substituting the current implementation with a more performant one if the need arises.

REFERENCES
[1] M. Migliardi, V. Sunderam, A. Geist and J. Dongarra. Dynamic Reconfiguration and Virtual Machine Management in the HARNESS Metacomputing Framework. In Proceedings of ISCOPE98, Santa Fe (NM), December 8-11, 1998.
[2] G. Booch. Object Oriented Analysis and Design with Applications, Second Edition. Rational, Santa Clara, California, 1994.
[3] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam. PVM: Parallel Virtual Machine — A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA, 1994.
[4] Sun Microsystems. Remote Method Invocation Specification. Available online at http://java.sun.com/products/jdk/1.2/docs/guide/rmi/index.html, July 1998.
[5] T. Lindholm and F. Yellin. The Java Virtual Machine. Addison Wesley, 1997.
[6] Sun Microsystems. Object Serialization Specification. Available online at http://java.sun.com/products/jdk/1.2/docs/guide/serialization/index.html.
[7] S. Liang. The Java Native Interface: Programming Guide and Reference. Addison Wesley, 1998.
Semantics of a Functional BSP Language with Imperative Features
F. Gava and F. Loulergue
Laboratory of Algorithms, Complexity and Logic, 61, avenue du Général de Gaulle, 94010 Créteil cedex, France

Bulk Synchronous Parallel ML (BSML) is a functional language for Bulk Synchronous Parallel (BSP) programming, on top of the sequential functional language Objective Caml. It is based on an extension of the λ-calculus by parallel operations on a parallel data structure named parallel vector, which is given by intension. The Objective Caml language is a functional language, but it also offers imperative features. This paper presents formal semantics of BSML with references, assignment and dereferencing.

1. INTRODUCTION
Declarative parallel languages are needed to ease the programming of massively parallel architectures. We are exploring thoroughly the intermediate position of the paradigm of algorithmic skeletons [1, 7] in order to obtain universal parallel languages whose execution costs can be easily determined from the source code (in this context, cost means the estimate of parallel execution time). This last requirement forces the use of explicit processes corresponding to the parallel machine's processors. Bulk Synchronous Parallel (BSP) computing [11] is a parallel programming model which uses explicit processes, offers a high degree of abstraction and yet allows portable and predictable performance on a wide variety of architectures. Our BSML [6] can be seen as an algorithmic skeletons language, because only a finite set of operations are parallel, but it differs on two main points: (a) our operations are universal for BSP programming and thus allow the implementation of the more classical algorithmic skeletons. It is also possible for the programmer to implement additional skeletons. Moreover, performance prediction is possible and the associated cost model is the BSP cost model. Those operations are implemented as a library [5] for the functional programming language Objective Caml [8]; (b) the parallel semantics of BSML are formal ones. We have a confluent calculus, a distributed semantics and a parallel abstract machine, and each semantics has been proved correct with respect to the previous one. Our semantics are based on an extension of the λ-calculus, but our BSMLlib library (a partial implementation of the complete BSML language) is for the Objective Caml language, which contains imperative features. Imperative features can be useful to write (sequential) programs in Objective Caml (for example, many interpreters of the λ-calculus written in Objective Caml use an imperative counter in order to generate fresh variable names). Sometimes imperative features are also needed to improve efficiency. Thus, to offer such expressivity in BSML, we have to add imperative features to BSML. In the current version of the BSMLlib library the use of imperative features is unsafe and may lead to runtime errors. This paper explores formal
semantics of BSML with imperative features. We first describe functional bulk synchronous parallel programming and the problems that appear with the use of imperative features (section 2). We then give two formal semantics of BSML with imperative features whose parallel executions are different (section 3) and conclude (section 4).
2. PRELIMINARIES
We assume here some knowledge of Bulk Synchronous Parallelism (BSP) [9]. There is currently no implementation of a full Bulk Synchronous Parallel ML language but rather a partial implementation: a library for Objective Caml. The so-called BSMLlib library is based on the following elements. It gives access to the BSP parameters of the underlying architecture. In particular, it offers the function bsp_p : unit -> int such that the value of bsp_p () is p, the static number of processes of the parallel machine. The value of this variable does not change during execution. There is also an abstract polymorphic type 'a par which represents the type of p-wide parallel vectors of objects of type 'a, one per process. The nesting of par types is prohibited. Our type system enforces this restriction [4]. In our framework, indeterminism and deadlocks are avoided and it is possible to easily prove programs using the Coq proof assistant [2]. The BSML parallel constructs operate on parallel vectors. Those parallel vectors are created by mkpar : (int -> 'a) -> 'a par, so that (mkpar f) stores (f i) on process i, for i between 0 and (p - 1). We usually write f as fun pid -> e to show that the expression e may be different on each processor. This expression e is said to be local. The expression (mkpar f) is a parallel object and it is said to be global. A BSP algorithm is expressed as a combination of asynchronous local computations (first phase of a super-step) and phases of global communication (second phase of a super-step) with global synchronization (third phase of a super-step). Asynchronous phases are programmed with mkpar and with apply : ('a -> 'b) par -> 'a par -> 'b par. The expression apply (mkpar f) (mkpar e) stores (f i) (e i) on process i. Neither the implementation of BSMLlib, nor its semantics, prescribe a synchronization barrier between two successive uses of apply. put : (int -> 'a option) par -> (int -> 'a option) par allows one to express communications and synchronizations, where 'a option is defined by: type 'a option = None | Some of 'a.
Consider the expression: put (mkpar (fun i -> fs_i)) (*). To send a value v from process j to process i, the function fs_j at process j must be such that (fs_j i) evaluates to Some v. To send no value from process j to process i, (fs_j i) must evaluate to None. Expression (*) evaluates to a parallel vector containing, on every process, a function fd_i of delivered messages. At process i, (fd_i j) evaluates to None if process j sent no message to process i, or to Some v if process j sent the value v to process i. The full BSML language would also contain a global synchronous conditional operation. This ifat : (bool par) * int * 'a * 'a -> 'a operation is such that ifat (v, i, v1, v2) evaluates to v1 or v2 depending on the value of v at process i. But Objective Caml is an eager language and this global synchronous conditional operation cannot be defined as a function.
That is why the core BSMLlib library contains the function at : bool par -> int -> bool, to be used only in the global expression: if (at vec pid) then ... else ..., where (vec : bool par) and (pid : int). ifat expresses communication and synchronization phases. The global conditional is necessary to express algorithms like: Repeat Parallel Iteration Until Max of local errors < ε. Without it, the global control cannot take into account data computed locally. Objective Caml offers the programmer an important extension of functional languages: imperative features. They have been added to functional languages to offer more expressiveness. Classically, this modification is added to functional languages through the possibility of assignment and allocation of a variable or a data structure. The idea is to add references. A reference is a cell of memory which can be modified by the program. One creates a reference with the allocation construction ref(e), which gives a new reference in the memory initialized to the value of e. The value kept by the reference is called the stored value. To use and read the stored value (dereferencing), we need an operation, written !, to extract it. Finally, we can modify the content of our reference by replacing this value by another. This operation is called assignment and is written e1 := e2. We use the same notations as Objective Caml. A reference bound to an identifier in a functional language is like a variable in an imperative language. Imperative features are not a trivial extension of a functional language. First, the value of an expression changes with the values of the free variables. If these variables have a known value, the evaluation of the sub-expressions can be done independently. For an imperative language (and also for imperative extensions) this is not the case: the evaluation of a sub-expression can modify a reference by an assignment and thus affects the evaluation of the other sub-expressions which use this reference. A second difficulty comes from shared references, which do not survive the well-known classical β-reduction of functional languages. Take for example: let r = ref 2 in r := !r * !r; (!r + 1). The instances of r all stand for the same reference, allocated by the ref 2 sub-expression. If we apply a naive β-reduction to this expression, we would obtain: (ref 2) := !(ref 2) * !(ref 2); (!(ref 2) + 1), which allocates four different references and does not have the same behavior. To extend the dynamic semantics of functional languages and avoid the problem of shared allocations, locations (written ℓ) and stores (written s) have been added. A store is a partial function from locations to values, and a reference evaluates to a location. In the following, we give the reduction of an expression starting from an empty store. First, the left sub-expression of the let construction is evaluated and a new location ℓ is created in the store. Second, the β-reduction can be applied, and finally the right sub-expression of the let construction is evaluated with the classical rules of a functional language:
        let r = ref(2) in r := !r * !r; (!r + 1)   /   ∅
    ⇀   let r = ℓ in r := !r * !r; (!r + 1)         /   {ℓ ↦ 2}
    ⇀   ℓ := 2 * 2; (!ℓ + 1)                        /   {ℓ ↦ 2}
    ⇀   (!ℓ + 1)                                    /   {ℓ ↦ 4}
    ⇀   (4 + 1)                                     /   {ℓ ↦ 4}
    ⇀   5                                           /   {ℓ ↦ 4}
2.1. BSML with imperative features
BSML is a parallel functional language based on BSP whose architecture model contains a set of processor-memory pairs and a network. Thus in the current implementation each processor
can reach its own memory, and this causes problems. Take for example the expression:

    let a = ref(0) in
    let danger = mkpar (fun pid -> a := pid; pid mod 2 = 0) in
    if (at danger !a) then e1 else e2

First, this expression creates a location a at each processor, initialized to 0 everywhere. For the BSMLlib library, each processor has this value in its memory. Second, a boolean parallel vector danger is created which is trivially true if the processor number is even and false otherwise. Thus, from the BSMLlib point of view, the location a now has a different value at each processor. After the ifat construct, some processors would execute e1 and some others e2. But ifat is a global synchronous operation and all the processors need to execute the same branch of the conditional. If this expression had been evaluated with the BSMLlib library, we would have obtained an incoherent result and a crash of the BSP machine. The goal of our new semantics is to dynamically reject this kind of problem (and to have an exception raised in the implementation).
3. DYNAMIC SEMANTICS OF BSML WITH IMPERATIVE FEATURES
This section introduces the syntax and dynamic semantics of a core language, together with some conventions, definitions and notations that are used in the paper.

3.1. Syntax
The expressions of mini-BSML, written e, have the following abstract syntax:

    e ::= x | (e e) | fun x → e | c | op | let x = e in e | (e, e) | ℓ | if e then e else e | if e at e then e else e

In this grammar, x ranges over a countable set of identifiers. The form (e e') stands for the application of a function or an operator e to an argument e'. The form fun x → e is the lambda-abstraction that defines the function whose parameter is x and whose result is the value of e. Constants c are the integers, the booleans, and we assume a unique value () which has the type unit. This is the result type of assignment (as in Objective Caml). The set of primitive operations op contains arithmetic operations, the fix-point operator fix, the test function isnc of the constant nc (which plays the role of Objective Caml's None), our parallel operations (mkpar, apply, put) and our store operations ref, ! and :=. We write e1 := e2 for :=(e1, e2). Locations are written ℓ and pairs are written (e, e). We also have two conditional constructs: the usual conditional if then else and the global conditional if at then else. We write F(e) for the set of free variables of an expression e; let and fun are the binding operators and the set of free variables of a location is empty. It is defined by trivial structural induction on e. Before presenting the dynamic semantics of the language, i.e., how the expressions of mini-BSML are computed to values, we present the values themselves. There is one semantics per value of p, the number of processes of the parallel machine. In the following, ∀i means ∀i ∈ {0, ..., p - 1}, and the expressions are extended with enumerated parallel vectors: ⟨e, ..., e⟩ (nesting of parallel vectors is prohibited; our type system enforces this restriction [4]). The values of mini-BSML are defined by the following grammar:

    v ::= fun x → e      functional value
        | c              constant
        | op             primitive
        | ⟨v, ..., v⟩    p-wide parallel vector value
        | (v, v)         pair value
        | ℓ              location
3.2. Rules
The dynamic semantics is defined by an evaluation mechanism that relates expressions to values. To express this relation, we use a small-step semantics. It consists of a predicate between an expression and another expression, defined by a set of axioms and rules called steps. The small-step semantics describes all the steps of the calculus from an expression to a value. We suppose that we evaluate only expressions that have been type-checked [4]. Unlike for a sequential computer and a sequential language, a unique store is not sufficient. We need to express the stores of all our processors. We assume a finite set 𝒩 = {0, ..., p - 1} which represents the set of processor names; we write i for these names and N for the whole network. Now we can formalize the locations and the store for each processor and for the network. We write s_i for the store of processor i, with i ∈ 𝒩. We assume that each processor has a store and an infinite set of addresses which are different at each processor (we could distinguish them by the name of the processor). We write S = [s_0, ..., s_{p-1}] for the sequence of all the stores of our parallel machine. The imperative version of the small-step semantics has the following form: e / S ⇀ e' / S'. We will also write e / s ⇀ e' / s' when only one store of the parallel machine can be modified. We write → for the transitive closure of ⇀, and write e_0 / S_0 → v / S for e_0 / S_0 ⇀ e_1 / S_1 ⇀ e_2 / S_2 ⇀ ... ⇀ v / S. We begin the reduction with a set of empty stores {∅_0, ..., ∅_{p-1}}, written ∅_N. To define the relation ⇀, we begin with some axioms for two kinds of reductions:
1. e / s_i ⇀_i e' / s'_i, which can be read as "in the initial store s_i, at processor i, the expression e reduces to e' in the store s'_i".
2. e / S ⇀_N e' / S', which can be read as "in the initial network store S, the expression e reduces to e' in the network store S'".

We write s + {ℓ ↦ v} for the extension of s with the mapping of ℓ to v. If ℓ ∈ Dom(s) before this operation, the value stored at location ℓ is replaced by the new one. To define these relations, we begin with some axioms for the relation of head reduction. We write e1[x ← e2] for the expression obtained by substituting all the free occurrences of x in e1 by e2.

    For a single process i:              (fun x → e) v / s_i   ⇀_i   e[x ← v] / s_i     (β^i_fun)
    For the whole parallel machine:      (fun x → e) v / S     ⇀_N   e[x ← v] / S       (β^N_fun)

Rules (β^i_let) and (β^N_let) are the same but with let x = v in e instead of fun. For primitive operators and constructs we have some axioms, the δ-rules. For each classical δ-rule, we have two new reduction rules: e / s_i ⇀_{δ_i} e' / s'_i and e / S ⇀_{δ_N} e' / S'. Indeed, these reductions do not change the stores and do not depend on the stores (we omit these rules for lack of space and refer to [3]). Naturally, for the parallel operators we also have some δ-rules, but we do not have those δ-rules on a single processor, only for the network (figure 1). A problem appears with the put operator. The put operator is used for the exchange of values, in particular locations. But a location can be seen as a pointer into memory (a location is a memory address). If we send a local allocation to a processor that does not have the location in its store, there is no reduction rule to apply and the program stops with an error (the famous segmentation fault of the C language) if it dereferences this location (if it reads "outside" of its memory). A dynamic solution is to communicate the value contained by the location and to create a new location for this value (as in the Marshal module of Objective Caml). This solution implies the renaming of locations that are communicated to other processes. For this,
we define Loc(v), the set of locations of a value v. It is defined by trivial structural induction on the value. We define how to add a sequence of location-value pairs to a store with: s + ∅ = s and s + [ℓ_0 ↦ v_0, ..., ℓ_n ↦ v_n] = (s + {ℓ_0 ↦ v_0}) + [ℓ_1 ↦ v_1, ..., ℓ_n ↦ v_n]. We write φ = {ℓ_0 ↦ ℓ'_0, ..., ℓ_n ↦ ℓ'_n} for a substitution, i.e., a finite map from locations ℓ_i to other locations ℓ'_i, whose domain is {ℓ_0, ..., ℓ_n}.
    (mkpar v) / S   ⇀_{δ_N}   ⟨(v 0), ..., (v (p-1))⟩ / S                                         (δ_mkpar)

    apply(⟨v_0, ..., v_{p-1}⟩, ⟨v'_0, ..., v'_{p-1}⟩) / S   ⇀_{δ_N}   ⟨(v_0 v'_0), ..., (v_{p-1} v'_{p-1})⟩ / S     (δ_apply)

    if ⟨v_0, ..., v_{p-1}⟩ at v then e_1 else e_2 / S   ⇀_{δ_N}   e_1 / S   if v = n and v_n = true      (δ_ifat)
    if ⟨v_0, ..., v_{p-1}⟩ at v then e_1 else e_2 / S   ⇀_{δ_N}   e_2 / S   if v = n and v_n = false     (δ_ifat)

    put(⟨fun dst → e_0, ..., fun dst → e_{p-1}⟩) / S   ⇀_{δ_N}   ⟨r_0, ..., r_{p-1}⟩ / S'                (δ_put)
      where S = [s_0, ..., s_{p-1}] and S' = [s'_0, ..., s'_{p-1}] with ∀j. s'_j = s_j + h'_0 + ... + h'_{p-1},
      where h_j = {(ℓ_0, v_0), ..., (ℓ_n, v_n)} with ℓ_k ∈ Loc(e_j) and {ℓ_k ↦ v_k} ∈ s_j,
      φ_j = {ℓ_0 ↦ ℓ'_0, ..., ℓ_n ↦ ℓ'_n}, h'_j = [ℓ'_0 ↦ v_0, ..., ℓ'_n ↦ v_n] and e'_j = φ_j(e_j),
      and ∀i. r_i = (let v^i_0 = e'_0[dst ← i] in ... let v^i_{p-1} = e'_{p-1}[dst ← i] in f_i)
      where f_i = fun x → if x = 0 then v^i_0 else ... if x = (p - 1) then v^i_{p-1} else nc ()

Figure 1. Parallel δ-rules

Now we complete our semantics by giving the δ-rules of the operators on the stores and the references. We need two kinds of reductions. First, for a single processor, the δ-rules are (δ^i_ref), (δ^i_!) and (δ^i_:=), given in figure 2. Those operations work on the store of the processor where the operation is executed. The ref operation creates a new allocation in the store of the processor, the ! operation gives the value contained in the location of the store, and the := operation replaces this value by another one.
    For a single processor i:
      ref(v) / s_i     ⇀_{δ_i}   ℓ / s_i + {ℓ ↦ v}      if ℓ ∉ Dom(s_i)                             (δ^i_ref)
      !(ℓ) / s_i       ⇀_{δ_i}   s_i(ℓ) / s_i           if ℓ ∈ Dom(s_i)                             (δ^i_!)
      ℓ := v / s_i     ⇀_{δ_i}   () / s_i + {ℓ ↦ v}     if ℓ ∈ Dom(s_i)                             (δ^i_:=)
      ℓ^N / s_i        ⇀_{δ_i}   ℓ^N_i / s_i            (only inside a mkpar, at process i)         (δ^N_proj)

    For the whole network:
      ref(v) / S       ⇀_{δ_N}   ℓ^N / S'   where S' = [s_0 + {ℓ^N_0 ↦ P_0(v)}, ..., s_{p-1} + {ℓ^N_{p-1} ↦ P_{p-1}(v)}]
                                            and ∀i. ℓ^N_i ∉ Dom(s_i)                                 (δ^N_ref)
      !(ℓ^N) / S       ⇀_{δ_N}   v / S      where ℓ^N_i ∈ Dom(s_i) and either ∃v. ∀i. s_i(ℓ^N_i) = v
                                            or ∃v. v = P^{-1}(s_i(ℓ^N_i))                            (δ^N_!)
      ℓ^N := v / S     ⇀_{δ_N}   () / S'    where S' = [s_0 + {ℓ^N_0 ↦ P_0(v)}, ..., s_{p-1} + {ℓ^N_{p-1} ↦ P_{p-1}(v)}]   (δ^N_:=)

Figure 2. "Store" δ-rules

For the whole network, we have to distinguish between the name of a location created outside
a mkpar, which is used in expressions, and its "projections" in the stores of each process. We write ℓ^N in the first case and ℓ^N_i for its projection in the store of process i. When an expression outside a mkpar creates a new location, each process creates a new location (an address) in its store (rule (δ^N_ref), figure 2), where P_i is a trivial projection function for the parallel vectors contained in a value. With this function, we ensure that a store contains no data belonging to other processes. The assignment of a location ℓ^N (rule (δ^N_:=), figure 2) modifies the values of the locations ℓ^N_i, also using the projection function (which has no effect if the value does not contain any parallel vectors). This rule is only valid outside a mkpar. But a reference created outside a mkpar can be assigned and dereferenced inside a mkpar. For assignment, the value can then be different on each process. To allow this, we introduce a rule (δ^N_proj) (figure 2) which transforms (only inside a mkpar, at process i) the common name ℓ^N into its projection ℓ^N_i. Notice that the assignment or dereferencing of a location ℓ^N cannot be done inside a mkpar with rules (δ^i_:=) and (δ^i_!), since the condition ℓ ∈ Dom(s_i) does not hold; the rule (δ^N_proj) must be used first. The dereferencing of ℓ^N outside a mkpar can only occur if the value held by its projections at each process is the same, or if this value is the projection of a value which contains a parallel vector (a value which cannot be modified by any process, since nesting of parallelism is forbidden [4]). This verification is done by rule (δ^N_!), where P^{-1} is a trivial de-projection function for the values stored on each process which contain projected values. This de-projection does not need any communication. The complete definitions of our reductions are:
    ⇀   =   ⇀_β   ∪   ( ⋃_{i ∈ 𝒩} ⇀_{δ_i} )   ∪   ⇀_{δ_N}

It is easy to see that we cannot always make a head reduction. We have to reduce in depth in the sub-expressions. To define this deep reduction, we need to define two kinds of contexts. We refer to [3] for such definitions.
3.3. Cost model preserving semantics
In order to avoid the comparison of the values held by the projections of an ℓ^N location in rule (δ^N_!), we can forbid the assignment of such a location inside a mkpar. This can be done by suppressing rule (δ^N_proj). But in this case, dereferencing inside a mkpar is no longer allowed. Thus, we need to add a new rule:

    !(ℓ^N) / s_i   ⇀_{δ_i}   s_i(ℓ^N_i) / s_i          if ℓ^N_i ∈ Dom(s_i)     (δ^i_!N)

and modify the rule (δ^N_!) to suppress the comparison:

    !(ℓ^N) / S     ⇀_{δ_N}   P^{-1}(s_i(ℓ^N_i)) / S    if ℓ^N_i ∈ Dom(s_i)     (δ^N_!')
This rule is not deterministic, but since assignment of an ℓ^N location is not allowed inside a mkpar, the projections of an ℓ^N location always contain the same value. The cost model is now compositional, since the new rule (δ^N_!') does not need communications and synchronization.

4. CONCLUSIONS AND FUTURE WORK
Bulk Synchronous Parallel ML allows direct-mode Bulk Synchronous Parallel programming. The semantics of BSML were pure functional semantics. Nevertheless, the current implementation of BSML is the BSMLlib library for Objective Caml, which offers imperative features. We presented in this paper semantics of the interaction of our bulk synchronous operations with imperative features. The safe communication of references has been investigated, and for this particular point, the presented semantics conforms to the implementation. To ensure safety, communications may be needed in case of assignment (but in this case the cost model is no longer compositional), or references may contain additional information used dynamically to ensure that dereferencing of references pointing to local values will give the same value on all processes. We are currently working on the typing of effects [10] to avoid this problem statically.

REFERENCES
[1] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, 1989.
[2] F. Gava. Formal Proofs of Functional BSP Programs. Parallel Processing Letters, 2003. To appear.
[3] F. Gava and F. Loulergue. Semantics of a Functional BSP Language with Imperative Features. Technical Report 2002-14, University of Paris Val-de-Marne, LACL, October 2002.
[4] F. Gava and F. Loulergue. A Polymorphic Type System for Bulk Synchronous Parallel ML. In Parallel Computing Technologies (PACT 2003), LNCS. Springer-Verlag, 2003.
[5] F. Loulergue. Implementation of a Functional Bulk Synchronous Parallel Programming Library. In 14th IASTED PDCS Conference, pages 452-457. ACTA Press, 2002.
[6] F. Loulergue, G. Hains, and C. Foisy. A Calculus of Functional BSP Programs. Science of Computer Programming, 37(1-3):253-277, 2000.
[7] F.A. Rabhi and S. Gorlatch, editors. Patterns and Skeletons for Parallel and Distributed Computing. Springer, 2002.
[8] D. Rémy. Using, Understanding, and Unraveling the OCaml Language. In G. Barthe et al., editors, Applied Semantics, number 2395 in LNCS, pages 413-536. Springer, 2002.
[9] D.B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and Answers about BSP. Scientific Programming, 6(3):249-274, 1997.
[10] Jean-Pierre Talpin and Pierre Jouvelot. The Type and Effect Discipline. Information and Computation, 111(2):245-296, June 1994.
[11] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103, August 1990.
The Use of Parallel Genetic Algorithms for Optimization in the Early Design Phases
E. Slaby and W. Funk
Waldorfer Str. 8, D-50969 Köln, Germany, slaby@unibw-hamburg.de
Institute of Machine Design and Production Technology, University of the Federal Armed Forces, D-22039 Hamburg, Germany

This paper deals with the use of genetic algorithms which are integrated into a commercial CAE-system in such a way that all of the CAE-system's functions can be used for evaluation and data processing. To reduce runtime, a parallelization of the fitness evaluation routine based on CORBA technology is presented.

1. INTRODUCTION
Nowadays, the use of CAE-systems (Computer Aided Engineering) in the design process is common in many development departments. A great number of different computerized support systems for the engineering design process are presently available. In recent years, in particular, more and more computation and dimensioning tools have been integrated into these systems. But most of these tools are applicable only in the later design phases. The early phases like "planning" and "conceptual design" are still relatively unexplored territory for computer-aided systems. Optimization algorithms and technologies from the field of Computational Intelligence can hardly be found as integrated components of these systems [3]. Therefore the use of these algorithms is quite rare in practice. Reasons for this include long training periods for the user, difficult handling, and the high programming effort necessary for each new task. On the other hand, some projects ([5], [7], [2]) show the large potential of these technologies for the design process, even for supporting the engineer in those areas which are considered a domain of human intuition and creativity. Bentley points this out in [1], where evolutionary algorithms are used not only for the optimization of already existing construction units, but also for the generation of new concept variants. A closer look at such projects shows that, besides the engineer as the user, a computer scientist is often needed for the software-technical realization. Such a time-consuming effort with additional personnel expenditure is not practical during the already tightly scheduled design process. For this reason, a software system has been developed that combines an evolutionary algorithm and a commercial 3D-CAE-system (I-DEAS Master Series) in such a way that no programming has to be done by the engineer.
This paper first gives a short introduction to evolutionary algorithms (EA); section 3 considers the disadvantages of existing applications which use EAs. Section 4 points out the intention of the developed software system, section 5 describes its integration into an existing CAE-system, section 6 deals with the parallelization, and section 7 with the use of a neural network for the EA fitness evaluation routine.

2. EVOLUTIONARY ALGORITHMS
Evolutionary strategies, especially genetic algorithms, represent promising tools for the engineering design process. Evolutionary strategies imitate the principles of natural evolution: reproduction, mutation and selection. Starting with an initial random population, the fitness of each individual is evaluated under consideration of the given constraints. The strings representing the individuals are crossed over and mutated. The individuals of such a generation are tested and given a probability to survive according to their fitness. After a sufficient number of generations, optimal or nearly optimal solutions can be found. By randomly mutating some strings, areas of the solution space are examined which may not have been considered with the initial population [4]. Some fundamental differences to conventional deterministic and stochastic search methods are [8]:
- Evolutionary algorithms select in a parallel way from a population of points, not only from one individual point.
- No derivatives of the objective function or other auxiliary information are needed by evolutionary algorithms. Only the objective function value is used as a basis for the search.
- Evolutionary algorithms can offer a number of possible solutions for a problem.
- Evolutionary algorithms can be applied to problems with different representations of the variables (continuous and discrete variables).
- Evolutionary algorithms are simple and flexible in use, and there are no restrictions on the definition of the objective function.
Evolutionary algorithms thus represent versatile, robust and efficient optimization procedures, which can be used with strongly nonlinear, non-continuous objective functions and for problems with variables of different representation (binary, integer, real). It is not recommended to use evolutionary algorithms if the computation of the objective function is very complex or time-consuming, because a large number of objective function computations is necessary [8]. The basis of the implemented optimization module is the software package ECJ (Evolutionary Computing system written in Java [6]). A modified genetic algorithm is used that fits a broad range of design problems and leads to good results without being adapted individually to each problem by the user. Because of the parallel way of stepping through the solution space, EAs are very well suited to parallel fitness evaluation; this will be discussed in section 6.
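To make the generational scheme above concrete, the following Java fragment sketches a minimal genetic-algorithm loop. It is a generic illustration only, not the ECJ-based module used in EasyEvoOpt-3D; the Individual interface and its evaluate, crossover and mutate operations are hypothetical placeholders for a problem-specific encoding and fitness function.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;

    // Minimal generational GA sketch; the problem-specific parts are left abstract.
    public class SimpleGa {
        interface Individual {
            double evaluate();                         // fitness under the given constraints
            Individual crossover(Individual other, Random rnd);
            Individual mutate(Random rnd);
        }

        static Individual run(List<Individual> population, int generations, Random rnd) {
            for (int g = 0; g < generations; g++) {
                // Rank the current generation by fitness (higher is better in this sketch).
                population.sort(Comparator.comparingDouble(Individual::evaluate).reversed());
                List<Individual> next = new ArrayList<>();
                next.add(population.get(0));           // elitism: keep the best individual
                while (next.size() < population.size()) {
                    // Truncation selection: parents are drawn from the better half.
                    Individual p1 = population.get(rnd.nextInt(population.size() / 2));
                    Individual p2 = population.get(rnd.nextInt(population.size() / 2));
                    next.add(p1.crossover(p2, rnd).mutate(rnd));
                }
                population = next;
            }
            population.sort(Comparator.comparingDouble(Individual::evaluate).reversed());
            return population.get(0);                  // best solution found
        }
    }

In practice the fitness values would be cached rather than recomputed during sorting, and the selection, crossover and mutation operators would be tuned to the design problem at hand.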
3. EXISTING APPLICATIONS USING EAs IN THE DESIGN PROCESS
Existing examples which use genetic algorithms in the early design phases show the great potential these algorithms hold for the optimization of design concepts and for the possibility to create new designs from scratch. On the other hand, a lot of programming work is necessary to develop the algorithm, the necessary input/output interfaces and the fitness evaluation routine. In many cases, a software developer has to support the engineer. The resulting software often fits only a narrow class of problems and has to be adapted with a lot of work to a different task. The input of the optimization objectives, as well as the interpretation of the results of an optimization run, is often very difficult because of a missing link to a CAE-system. The EA's output data must usually be converted in an additional step or entered manually. Some applications contain a special geometry model for computation and representation. Such a development requires a great programming effort and does not reach the functionality of a modern CAE-system (e.g. free-form surface modelling). For many of the examined applications, computation routines are specially programmed for the evaluation of the solution variants, instead of using the already existing computation tools of a CAE-system.

4. INTENTION OF "EasyEvoOpt-3D"
The intention of the developed system "EasyEvoOpt-3D" is to offer the engineer a tool which gives him the possibility to use the above mentioned technologies during the product design process via the 3D-CAD-system's graphical user interface, without the need for deeper knowledge of the used algorithms. In detail, the following points have been considered for the development:
- Optimization algorithms have to be integrated into a CAE-system to offer the possibility to input all relevant information with the help of the CAE-system's input functions.
- The computation routines of the CAE-system should be usable by the optimization algorithms, in order to be able to evaluate individual solutions.
- The results of the optimization run have to be available in the CAE-system without further action or data conversion.
- The use of the optimization procedures should be possible for a wide range of different tasks without further adjustment of the optimization algorithms.
- If necessary, there should be an option to specifically adapt the optimization algorithm to the problem.
- Modular extensibility has to be planned for, so that the integration of further external applications for the evaluation of solution variants is possible.
A main point for the development of the system is that an engineer should have the possibility to let the system generate solution variants at an early phase of the design process without programming effort and without a deeper knowledge of the optimization algorithms.
5. INTEGRATION INTO THE CAE-SYSTEM I-DEAS
The geometry model of the CAE-system is regarded as the basic element of the optimization. Thus the computation tools of the CAE-system are available, in addition to its visualization capabilities. The restrictions should be defined not by mathematical descriptions, but by graphic elements (e.g. restricted areas).
Figure 1. User inputs at the CAE-system recorded by the optimization module
All of the points mentioned above, including the input of the restrictions and the objective function, have to be available without programming by the user. With this demand, the communication between the optimization module and the CAE-system is of central importance for the further procedure. Two different cases of communication between the CAE-system and the optimization module can be pointed out. The optimization module has to record the inputs made by the user via the CAE-system, such as variables, optimization objective and restrictions, and convert them in such a way that the optimization module can replay them without any user interaction. This means that the variables have to be set according to the outputs of the evolutionary algorithm, and that the fitness evaluation routine is able to use the CAD-system's tools. The possibilities to enter the problem, the restrictions and the fitness evaluation with user-friendly graphical tools of the CAD-system are shown in figure 1. The inputs are recorded and processed in such a way that the tools of the CAD-system can be used for fitness evaluation without any user interaction, as shown in figure 2. The communication module for communication with the CAE-system I-DEAS is represented by a Java class named OI_Com (Open I-DEAS Communication). The task of this class is to execute the entire internal CORBA communication and to perform the necessary conversions for communication with the CAE-system I-DEAS. Low-level communication with I-DEAS has been developed with the Open I-DEAS programming interface. This CAE-system library is based on CORBA technology. It offers a number of methods in order to access internal data of I-DEAS. In the first phase of the optimization application the user has to define which parameters of
Figure 2. Fitness evaluation without any user interaction

a construction unit are variable. For the user's comfort, this can be done by interactively picking the dimensions or other defining parameters in the graphics region of the CAE-system. The Open I-DEAS API includes a method to get the 3D coordinates of any picked point, but this is not usable for later replay: when some parameters of the object are changed, its position in space changes and the system could not pick the element again afterwards. For this reason the OI_Com class contains several methods which convert the selection of an element in the graphics region of I-DEAS into the system-internal labels. The OI_Com class also includes methods to access all computation results in the CAE-system, even if these routines are not supported by the Open I-DEAS API. This is possible with a specially developed parser which can access and filter all internal result lists and pass them to the optimization module.

6. PARALLEL FITNESS EVALUATION
The disadvantage of using these CAE-system tools lies in an unsatisfying runtime. Additionally, the convergence of genetic algorithms is generally slow because of the huge number of fitness evaluations. Each individual in a generation encodes a possible solution that is independent from the information of other individuals. Therefore the fitness of the individuals can be evaluated in parallel. As fitness evaluation with the tools of the CAE-system is the most time-consuming part of the process, a significant reduction of execution time can be realized through the parallelization. The implemented parallelization scheme had to be applicable on distributed-memory systems with different operating systems (Windows, Unix) connected via a LAN and on multiprocessor supercomputers. Therefore the main focus of this section lies on developing a parallel fitness evaluation routine which is able to use many processes of the CAD-system on a heterogeneous cluster of workstations, as can be found in the development department of an engineering company. The parallel fitness evaluation routine is able to integrate workstations with different operating systems (Windows, Unix) into the cluster, and also multiprocessor systems on which many CAD-system processes can be executed. The parallelization routine uses Java threads and the Common Object Request Broker Architecture
(CORBA). This architecture offers the possibility to communicate with different processes of the CAE-system, even on different operating systems. It is not necessary to change the CAE-system's source code. An advantage of this approach is that the developed system can use all available CAE workstations for the parallelization without having to modify the system or even boot a different operating system. The parallelization takes place in a master thread, which is responsible for the distribution of the individual computations as well as the error handling, and slave threads, which can be instantiated several times and perform the direct communication with the CAE-system. The master thread communicates with the slave threads via messages. Each slave thread has a separate message register. Via this register, defined messages concerning the status (READY, CALCULATION FINISHED, RESULT RECEIVED, ERROR) and for calling subroutines (INIT, CALCULATE FITNESS, SLEEP, DONE) can be passed. To prevent slower computers from slowing down the whole process, load balancing is included: if all individuals of one EA generation have been sent to a slave for evaluation and some slaves are already waiting for new individuals, some of the not yet evaluated individuals are sent again to another slave. If a result has been returned, the other slaves evaluating the same individual are able to stop their run. A sketch of this master/slave scheme is given below.
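The following Java fragment is a minimal sketch of such a master/slave organisation. It is not the authors' implementation: the CORBA call to a CAE-system process is replaced by a placeholder evaluateWithCae method, and all names (ParallelFitnessSketch, CaeSlave) are hypothetical. It only illustrates how a master can hand individuals out to several slave workers and collect the fitness values.

    import java.util.List;
    import java.util.concurrent.CompletionService;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical master/slave sketch; evaluateWithCae stands for the remote (CORBA) call
    // to one CAE-system process and is not part of any real I-DEAS API.
    public class ParallelFitnessSketch {

        public interface CaeSlave {
            double evaluateWithCae(double[] genome) throws Exception;
        }

        private final List<CaeSlave> slaves;
        private final ExecutorService pool;

        public ParallelFitnessSketch(List<CaeSlave> slaves) {
            this.slaves = slaves;
            this.pool = Executors.newFixedThreadPool(slaves.size()); // one thread per slave/CAE process
        }

        /** Master side: distribute one generation over the slaves and collect the fitness values. */
        public double[] evaluateGeneration(double[][] genomes) throws Exception {
            CompletionService<double[]> results = new ExecutorCompletionService<>(pool);
            for (int i = 0; i < genomes.length; i++) {
                final int index = i;
                final CaeSlave slave = slaves.get(i % slaves.size()); // simple round-robin dispatch
                results.submit(() -> new double[] { index, slave.evaluateWithCae(genomes[index]) });
            }
            double[] fitness = new double[genomes.length];
            for (int i = 0; i < genomes.length; i++) {                // collect in completion order
                double[] r = results.take().get();
                fitness[(int) r[0]] = r[1];
            }
            return fitness;
        }
    }

In the authors' system the dispatch is message-based (READY, CALCULATE FITNESS, and so on) and includes the redundant re-dispatch of pending individuals to idle slaves; the executor-based version above only shows the basic distribute-and-collect structure.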
7. EVALUATION OF FITNESS VALUES WITH A NEURAL NETWORK
A second approach to reducing the run-time of an optimization run consists in the use of a neural network for the evaluation of the individuals' fitness values. The basis of this concept is to train a neural network with the data which describe the solution variants and the associated fitness values computed by the CAE-system. Once the neural network has achieved a given quality after many learning steps, the fitness values can be determined by the neural network, which is very much faster than the complex computation of the CAE-system. A neural network must first be trained with example data in order to obtain the desired behavior. Therefore the neural net is trained with the individuals for which the CAE-system has already computed a fitness value. If, for a given number of new data sets, the demanded accuracy is reached by the neural network, the system changes over to the second phase, the evaluation of the fitness values by the neural network. In this second phase there is a mixture of fitness values estimated by the neural network and values computed by the CAE-system for selected individuals. A closer look at the genetic algorithm is necessary at this point: the best individuals of a generation are preferred for reproduction; for the remaining individuals, only their rank within the population matters. The exact fitness value must therefore be available only for the best individuals. Consequently, the fitness values of all individuals of one generation are estimated by the neural network, and additionally the fitness values of the best individuals, as well as of some randomly selected ones, are computed by the CAE-system. The neural network module was developed with MATLAB and converted with the MATLAB/C++ compiler to C++ source code. The CORBA methods were implemented so that the neural network module can be integrated into the optimization system via a network. A sketch of this mixed evaluation strategy is given below. Figure 3 gives an overview of all components of the optimization system and the integration of the CAE-system for input/output and fitness evaluation.
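A minimal sketch of this mixed evaluation strategy follows. It is not the authors' MATLAB/CORBA implementation; the surrogate (the neural network) and the exact evaluator (the CAE-system) are abstracted behind two hypothetical interfaces, and the numbers of individuals re-evaluated exactly are assumed parameters.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;

    // Hypothetical sketch of surrogate-assisted fitness evaluation: estimate everything with the
    // neural network, then recompute the best and a few random individuals with the exact evaluator.
    public class SurrogateEvaluationSketch {
        interface Surrogate { double estimate(double[] genome); void train(double[] genome, double fitness); }
        interface ExactEvaluator { double compute(double[] genome); }   // e.g. the CAE-system tools

        private final Surrogate net;
        private final ExactEvaluator cae;
        private final Random rnd = new Random();

        public SurrogateEvaluationSketch(Surrogate net, ExactEvaluator cae) {
            this.net = net;
            this.cae = cae;
        }

        /** Returns the fitness of each individual; exactBest elite and exactRandom random individuals
         *  are recomputed exactly and fed back into the surrogate as new training data. */
        public double[] evaluate(double[][] genomes, int exactBest, int exactRandom) {
            double[] fitness = new double[genomes.length];
            List<Integer> order = new ArrayList<>();
            for (int i = 0; i < genomes.length; i++) {
                fitness[i] = net.estimate(genomes[i]);                   // cheap surrogate estimate
                order.add(i);
            }
            order.sort(Comparator.comparingDouble(i -> -fitness[i]));    // best first
            for (int k = 0; k < exactBest && k < order.size(); k++) {
                recompute(genomes, fitness, order.get(k));               // exact values for the elite
            }
            for (int k = 0; k < exactRandom; k++) {
                recompute(genomes, fitness, rnd.nextInt(genomes.length)); // random spot checks
            }
            return fitness;
        }

        private void recompute(double[][] genomes, double[] fitness, int i) {
            fitness[i] = cae.compute(genomes[i]);
            net.train(genomes[i], fitness[i]);                           // incremental training data
        }
    }

The proportion of exactly evaluated individuals (the elite plus a few random spot checks) is the main tuning knob of such a scheme: it trades CAE run-time against the risk of the surrogate misranking individuals.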
[Figure 3 shows the components of the system: the graphical user interface, the evolutionary algorithm with data recording and conversion, the CORBA communication module, the multithreaded parallelization module, the CAD/CAE-system processes (including multiprocessor hosts), and the database.]
Figure 3. Overview of all components
8. R E S U L T S A N D C O N C L U S I O N
The optimization system "EasyEvoOpt-3D" has been successfully applied for the search for a non-uniform transmission for a wicket and a ergonomics-study for a bench [9]. The flexibility to user changes and the possibility to follow the optimization steps generation by generation on screen has shown the advantages of the use of the CAE-system. The disadvantages of the quite time-consuming CAE-tools for the fitness evaluation could be compensated by the parallelization and the use of a neural network for the fitness evaluation. The system shows that it is possible to use optimization methods at an early stage of the design process without having to consult a optimization and software specialist with just using the already available hardware. REFERENCES
[1] Bentley, Peter John: Generic Evolutionary Design of Solid Objects using a Genetic Algorithm; Dissertation University of Huddersfield 1996 [2]
Dasgupta, D.; Michalewicz, Z (Hrsg.): Evolutionary Algorithms in Engineering Applica-
110 tions; Berlin, Heidelberg, New York, u.a.: Springer 1997 Figel, Klaus: Optimieren beim Konstruieren: Verfahren, Operatoren und Hinweise ffir die Praxis; Mfinchen, Wien: Hanser 1988 [4] Goldberg, David E.: Genetic Algorithms in Search, Optimization & Machine Learning; Massachusetts, Harlow, Menlo Park, u.a.: Addison Wesley Longmann, 1989 [5] Hafner, S.;Kiendl, H.; Kruse, R.; Schwefel, H.-P.: Computational Intelligence im industriellen Einsatz: Tagung Baden-Baden, Mai 2000 VDINDE-Gesellschaft Mess- und Automatisierungstechnik; Diisseldorf: VDI-Verlag 2000 [6] Luke, Sean: Evolutionary Computing system written in Java. http://www.cs.umd.edu/users/seanl/, University of Maryland [7] Miettinen, K.; Neittaanmdki, P.; Mdkeld, M. M.; Periaux, J. (Hrsg.): Evolutionary Algorithms in Engineering and Computer Science; Chichester, Weinheim, New York u.a.: Wiley 1999 [s] Pohlheim, Hartmut: Evolution~ire Algorithmen: Einsatz von Optimierungsverfahren, CAD und Expertensystemen; Berlin, Heidelberg, New York, u.a.: Springer 2000 [9] Slaby, Emanuel: Einsatz Evolution~irer Algorithmen zur Optimierung im frfihen Konstruktionsprozess. Fortschr.-Ber. VDI Reihe 20 Nr. 361. Dfisseldorf: VDI Verlag 2003
[3]
An Integrated Annotation and Compilation Framework for Task and Data Parallel Programming in Java*
H.J. Sips and K. van Reeuwijk, Delft University of Technology, (sips,reeuwijk)@its.tudelft.nl
1. I N T R O D U C T I O N For most applications Fortran has been replaced by languages such as ANSI C, C++. and Java. Yet Fortran has a number of features that still make it particularly useful for scientific programs, namely multi-dimensional arrays, complex numbers, and, in later versions, array expressions. Moreover, Fortran compilers tend to produce highly efficient code compared to compilers for other languages. We have developed a set of language constructs, called Spar, that augment Java with a set of language constructs for scientific programs. The set consists of multi-dimensional arrays; complex numbers; a 'toolkit' to build specialized array representations such as block, symmetric, or sparse arrays; annotations; and parallelization. In this paper we concentrate on the constructs for parallel programming; the other extensions have been described in [9]. In this paper, we present a unified model to describe data and task parallelism in an object oriented language. The approach allows compile-time and run-time techniques to be combined. 2. THE F O R E A C H C O N S T R U C T In a sequential program the execution order of all statements is specified exactly. This is often an over-specification: the programmer may not care about the execution order of statements, even if the observable results differ. For example, the code for(
int i=O;
i
i++
) a[i]
= Math.sin(
2*Math. PI/i
);
fills array a in the exact order that is specified in the f o r loop. This is an over-specification, since order in which the array is filled is irrelevant for the final result. Since this form of overspecification often prevents parallelization of the program, Spar provides the e a c h and f o r e a c h statements. Given an e a c h statement such as: each
{ sl;
s2;
}
the compiler may choose one ofthe execution orders s l ; s2 ; or s2 ; s l ;. Once a statement has been started, it must complete before another statement is started. *This research was supported by NWO (project 'Automap'), Esprit (LTR Project 'Joses' (#28198)), and Delft University of Technology (DIOC project 'Ubicom').
112 The f o r e a c h statement is a parameterized version of the e a c h statement. For example, we can now write the loop above as: foreach(
i :- O : a . l e n g t h
) a[i]
= Math.sin(
2*Math. PI/i
);
Similar to the e a c h statement, the iterations may be executed in any order, but an iteration must complete before the next one is started. 3. T H E A N N O T A T I O N L A N G U A G E Statements, expressions, types, declarations, formal parameters, and the entire program can be annotated with pragmas. For example: <$ reduction, i t e r a t i o n s = @ n
$> foreach(
i :- O:n
){ sum += i;
}
annotates a f o r e a c h statement with two pragmas: a r e d u c t i o n pragma, and an i t e r a t i o n s pragma. Spar provides a general annotation mechanism. An annotation consists of a list ofpragmas. These pragmas allow the user to give the compiler information about the program, and give hints for efficient compilation. By convention, a pragma does not influence the behavior of a program2; it only improves the efficiency of the program in terms of execution time, memory use, or any other measure. For example, the annotation: <$ i n d e p e n d e n t , b o u n d s c h e c k = f a l s e ,
iterations=42
$>
consists of three pragmas. The first one, i n d e p e n d e n t , does not have a value, and is called a flagpragma. The other two, b o u n d s c h e c k and i t e r a t i o n s , have a value, and are called value pragmas. Identifiers in the pragma name and pragma value are completely independent of those in Spar/Java; they have a different name space. It is possible, however, to refer to variables in the host language by prefixing a name with a '@'. For example, in the code fragment foreach(
i :- O:n
) <$ i t e r a t i o n s = @ n
$>
{ sum += i;
}
the value of the i t e r a t i o n s pragma contains a reference to the variable n. Instead of a single value, a pragma may have a list of values. Such a list is written as a sequence of expressions surrounded by brackets. Since the lists can be nested to an arbitrary depth, this allows expressions of arbitrary complexity. For example: <$cost=(lambda
(i j)
(sum
(prod 5 i i j j)
(prod 3 i j) 42)$>
As a convenience, Spar allows a number of binary operators to be used in pragma expressions. The traditional precedence rules on these operators are obeyed. These expressions are immediately translated to expression lists. For example, the pragma expression 3 * n * n + 5 * n + 1 is translated t o ' ( s u m ( p r o d 3 n n) ( p r o d 5 n) 1 ) ' Furthermore, Spar allows subscript-like expressions. They are immediately translated to a list starting with the identifier' a t ' . For example, the pragma expression 'p [ 1, a ] ' is translated to '(at
p
1 a)'.
2Since the each and f o r e a c h statements are non-deterministic, we consider a program to have the same behavior for all possible execution orders of these statements. However, an annotation may cause another execution order of such a statement.
113
4. P R A G M A S F O R P A R A L L E L I Z A T I O N
Both data declarations and code fragments can be annotated with the on pragma, which specifies the placement of that data, or code. For example" int[*,*] <$ on:(lambda b = new int [50,50] ;
(i j) dsp2D[(block
j 5), all])
$>
annotates an array with a placement pragma in data-parallel fashion, and foreach( i -- 0:I00 ) <$ on=dsplD[ (block @i 25) ] $> { a[i] = a[i] + i; }
annotates the iterations of a foreach with a placement pragma in task-parallel fashion. To help the compiler with the parallelization of a program, the user may annotate a program with pragmas to specify the distribution of data, or the place where a block of code is executed. For this purpose the parallelization engines support the following annotations: The ProcessorType pragma is used to declare names of processor types and to associate them with processors characteristics (e.g. alignment of data structures and endianness of primitive types etc.) and capabilities (e.g. whether it contains a FPU etc.). P r o c e s s o r T y p e pragmas must be global. For example: <$ ProcessorType:( (Gpp "Pentium2")
(Dsp "Trimedia"))
$>
The strings " P e n t i u m 2 " and " T r i m e d i a " refer to processor descriptions that are known to the compiler. A P r o c e s s o r s pragma is used to name the processors in a system, and describe their arrangement to the compiler. The P r o c e s s o r s pragma must be a global pragma. Either a single processor can be declared, or an array of processors. The processor array can have any number of dimensions. For example: <$ Processors:( (Gpp gppl)
(Dsp dsplD[4] ) (Dsp dsp2D[2,3] )) $>
This system consists of a single processor g p p l of type Gpp, a one-dimensional processor array d s p l D , and a two-dimensional array dsp2D, both of type Dsp. Each modeled processor corresponds to a single physical processor. A Spar/Java program can be annotated with on pragmas to place data and work on specific processors. Pragmas can annotate expressions, statements, and member functions. <$ on=Gpp $> <$ on:Dsp[0] $>
Pragmas may be nested; an on pragrna on an inner block or expression overrules the enclosing specification. The special on value _ a l l is used to denote all processors; the value '_' (a single underscore) is used to denote an unspecified distribution ('don't care'). It is also allowed to use three special functions as on expressions: The (b 1 o c k a* i +b m) distribution function places index i onto processor p - (a. i + b ) / m . The value ofp is bounded by the index range in the corresponding dimension of the processor type. If no m is specified, the value is derived from the context, if possible" if there are N elements in the corresponding
114 array dimension, m = N/Pext is assumed. The ( c y c 1 i c a* i +b) distribution function places index i onto processor p = ( a . ~ + b) rood Pext. The ( b l o c k c y c l i c a * i + b m) distribution function places index i onto processor p = ((a. i + b)/m) rood P~xt. For data that is distributed with the b l o c k and c y c l i c functions, the compiler is able to generate highly efficient code to enumeration local elements and to translate global to local array indices; see [8] for details. With these functions, all the data mappings of High-Performance Fortran (HPF) 2.0 [5] can be specified. Annotating declarations By annotating a member function, the user can specify the group of processors allowed to execute the member function. For example: <$ on d s p l D [ _ a l l ] $> int m y f u n c ( int i, S t r i n g
s ) { return
i+l;
Annotating statements A statement annotated with an
}
on
pragma will be executed only on the specified processor(s). In principle arbitrary statements may be annotated, but in practice the annotation is mainly interesting for code blocks. Annotating expressions In principle, any expression can be annotated with an o n pragma. This specifies the place where the expression must be evaluated. The new expression is a special case, since this not only specifies the distribution of the constructor execution (if not overridden by an annotation on the constructor), but also the distribution of the newly constructed class or array instance. For example, the pragma: String
a = <$ o n = g p p l
$> n e w
String();
specifies that a new S t r i n g must be constructed on g p p l . In the case of array new expressions, a slightly extended version of the on pragma is allowed. For example, the pragma: int [*,*] <$ o n = ( l a m b d a (i j) d s p 2 D [ ( b l o c k b = n e w int [50,50] ;
j 5),_all])
$>
specifies that every array element b [ i , j ] is constructed on processors d s p 2 D [ ( b l o c k j 5) , _ a 1 1 ] . The _ a l l expression in the second dimension means that the elements are replicated in that dimension. Since the formal parameter i is not used in the distribution expression, the first dimension of the array does not influence the distribution of elements. Other annotations The Spar/Java programs that we discuss in Sec. 5 use two other annotations. The f o r e a c h statements allows reduction operations such as int s u m = 0; <$ r e d u c t i o n
$> f o r e a c h (
i -- 0 : a . l e n g t h
) sum
+= a[i] ;
If the array that is being reduced is b l o c k or c y c i i c distributed, and if the loop is annotated as a reduction, the parallelization engines are able to generate efficient parallel code for such a reduction. As mentioned in Sec. 2, the e a c h and f o r e a c h statements can only be executed in parallel when there is no observable interference between statements or iterations. Such loops are annotated with the i n d e p e n d e n t annotation.
115 1000
. . . . . . . . . .
loo L
spa; ' ; ' "
,Spar
. . . . . . . . . .
~,
-~-'
'
Spar (unchecked) ---• H P F --- ~---
~..:.::. ...-..
G)
100
.g
E
._
c .o_
.~
u.I
w
10
p,
"rq
1 . . . . . . .
,
,
,
, ,,,,I
,
1'0
10
N u m b e r of p r o c e s s o r s
N u m b e r of p r o c e s s o r s
. . . .
,,, 100
Figure 1. Results for the NAS EP 'A' dataset (228 samples) and for the NAS FT 'W' dataset (128 x 128 x 32 elements), left and right figure, respectively 5. RESULTS To evaluate the practicality of our language extensions, we have implemented a compiler for Spar/Java. This compiler is called 'Timber'. It is available for downloading at [7]. To measure the performance of the experimental compiler, we have implemented three benchmarks from the NASA Numerical Aerospace Simulation group (NAS) parallel benchmark suite Version 2.3 [6, 4]. Guided by the HPF and Fortran 77 versions we have available, the algorithms were reimplemented in Spar/Java. For the benchmarks, the original NAS implementations for Fortran 77/MPI were used as reference. For the FT and MG benchmarks the HPF version was also used as reference. For the EP benchmark no HPF version was available. All measurements were done on the DAS [3] distributed supercomputer. The Fortran/MPI implementation was compiled with the PGI Fortran 90 Version 3.1 compiler. The HPF version was compiled with the PGI HPF Version 3.1 compiler. All versions, including the Spar/Java version, used the communication library Panda [ 10]. The Fortran/MPI and HPF version used the MPI library implemented on top of Panda, the Spar/Java version used Panda directly. Due to their method of implementation, the Fortran/MPI version of FT and MG can only be run for a number of processors that is a power of 2. The HPF and Spar/Java versions do not have this restriction. NAS EP benchmark. The NAS EP benchmark approximates 7c by choosing random points in a square, and counting the percentage that falls in the inscribed circle of the square. To parallelize the code, the calculations are divided in batches. The result for each batch (the percentage of 'hits') is stored in an element of an array; the global result can then be calculated with a reduction operation on this array. The Spar/Java version contains the following annotations: a distribution annotation for the results array; a distribution annotation for the loop that iterates over the batches; and an annotation to label the result reduction loops as reductions. From these annotations only, the compiler is able to generate a parallel program, with the results shown in Fig. 1. Note that both data and task distribution is used. The Spar/Java program scales just as well as the Fortran/MPI version, and is slightly faster. NAS FT benchmark. The NAS FT benchmark performs a 3D Fast Fourier Transform (FFT). The Spar/Java version distributes the FFT array in cyclic fashion in one dimension. Thus, if a 64 x 64 x 64 array is distributed over 2 processors, each processor gets a 64 x 64 x 32 elements
116 1000 .
.
.
.
.
.
.
.
HpF ' ; ' "
.
Spar Spar
100
._~
---•
unchecked F77
---~---
......~a......
!. . . . . " ' - - x .
..........
"--..'-x.
~5 iii ..... N .... ....... [3. .....
.... [] ,
,
i
,
, ,,,1
, 10
Number
,
i
. . . . . 100
of processors
Figure 2. Results for the NAS MG 'W' dataset (64 x 64 • 64 elements).
slice of the array. For 4 processors the slice is 64 x 64 • 16, and so on. In the original NAS benchmark program, The 3D FFT is implemented by doing a 1D FFT on all one-dimensional array sections parallel to the x axis, then parallel to y axis and then the z axis. The Spar/Java version always performs 1D FFT transforms on array sections parallel to the x axis. For the last two phases, it copies and permutes the elements in the original array to a suitably distributed temporary array, performs the 1D transform, and copies the elements back again. Since we distribute the temporary array in a different dimension, all elements for the 1D FFT transform on one array section are always locally available, which allows it to be implemented very efficiently. The Spar/Java version contains the following annotations: distribution annotations for the distributed arrays; a distribution annotation for the iteration over the 1D FFT transform; a r e d u c t i o n annotation on the loop that calculates the verification checksum, and a number of i n d e p e n d e n t annotations. Using these annotations only, the compiler generates a parallel program, with the results shown in Fig. 1. Again, note that both task and data distributions were used. In most cases the Spar/Java results compare favorably with those for Fortran/MPI and HPF. For larger number of processors secondary effects become significant, such as initialization and setup times, and load imbalance due to the distribution 3. NAS MG benchmark. The NAS MG benchmark is a multigrid solver using grids of different coarseness. The Spar/Java version only uses distribution annotations for the different arrays; an annotation to label the loop that calculates the verification checksum as a reduction, and a number of annotations to label loops as 'independent'. Using these annotations only, the compiler generates a parallel program, with the results shown in Fig. 2. As these measurements indicate, the overhead of null pointer checking and bounds checking in the Spar/Java version is somewhat higher than for the other benchmarks. The performance difference with F77/MPI is caused by the larger number of copy operations on the data, and an inefficient use of the available cache hardware.
3For example, if an array of 64 elements is distributed over 24 processors, some processors get 2 elements, and some processors get 3 elements.
117 6. RELATED W O R K
Spar supports the same class of distributions as HPF 2.0, but uses a significantly different notation. In HPF a distribution annotation is used that is more limited than the Spar version. Whereas in Spar the annotation ( c y c 1 i c a* i +b) is allowed, in HPF only the annotation (cyclic i ) is allowed. However, HPF provides alignment statements to specify a distribution relative to another distribution. Through this mechanism the equivalent of distribution ( c y c l i c a* i +b) can be specified. The absence of alignment makes distributions somewhat less convenient to express, but we think that this is a small inconvenience. On the other hand, distributions like ( c y c l i c a* i +b) can be expressed more conveniently than in HPF, since no alignment construct is necessary. The b l o c k , c y c l i c , and b l o c k c y c l i c distributions can also be used for the efficient distribution of loops. Spar is not the only proposal to extend Java for scientific computing. HPJava [2, 12] adds multi-dimensional arrays and data-parallel programming to Java. In contrast to Spar, the programs are explicitly data-parallel. That is, any data transfer between processors has to be explicitly done in the program by a call to the HPJava communication library. Blount, Chatterjee, and Philippsen [1] describe a compiler that extends Java with a f o r a l l statement. To execute the f o r a l l statement, the compiler spawns a Java thread on each processor, and the iterations are evenly distributed over these threads. Synchronization between iterations is done by the user using the standard Java synchronization mechanism. No explicit communication is performed; a shared-memory system is assumed. Due to the dynamic nature of the implementation, they can easily handle irregular data and nested parallelism. Titanium [11] provides vectors, multidimensional arrays, iteration ranges, and a f o r e a c h statement comparable to those in Spar. In most cases the Spar version of these constructs is more general. They explicitly state that their f o r e a c h is not intended for parallelization. Titanium supports iterations over arbitrary sets; moreover these iteration ranges are 'first-class citizens'; they can be handled and modified independent of any loop statements. Unsurprisingly, support for arbitrary iteration sets leads to inefficiencies. Therefore, Titanium provides a separate representation for rectangular iteration. In contrast, Spar supports 'classical' iteration sets that can be expressed as nested iteration ranges with strides. Spar's iteration ranges are more general than the rectangular iteration sets of Titanium, and can be implemented just as efficiently.
REFERENCES
[1] B.Blount, S. Chatterjee, and M. Philippsen. Irregular parallel algorithms in Java. In Irregular'99, Apr. 1999. [2]
[3] [4] [5] [61
B. Carpenter, Y. Chang, G. Fox, D. Leskiw, and X. Li. Experiments with "HPJava". Concurrency, Practice and Experience, 9(6):633-648, June 1997. DAS website, www.asci.tudelft.nl/das/das.shtml. M. Frumkin, H. Jin, and J. Yan. Implementation of NAS parallel benchmarks in high performance fortran. In IPPS, 1999. High Performance Fortran Forum. High Performance Fortran Language Specification, 2.0 edition, Feb. 1997. NAS parallel benchmarks website, www.nas.nasa.gov/Software/NPB.
ll8 [7] [8]
[9] [ 10]
[ 11]
[12]
C. v. Reeuwijk. Timber download page. www.pds.twi.tudelfl.nl/timber/downloading.html. C.v. Reeuwijk, W. Denissen, H. Sips, and E. Paalvast. An implementation framework for HPF distributed arrays on message-passing parallel computer systems. IEEE Transactions on Parallel and Distributed Systems, 7(9):897-914, Sept. 1996. C.v. Reeuwijk, F. Kuijlman, and H. Sips. Spar: an extension of Java for scientific computation. In ACMJava Grande- ISCOPE Conference, pages 58-67, June 2001. T. Riihl, H. Bal, R. Bhoudjang, K. Langendoen, and G. Benson. Experience with a portability layer for implementing parallel programming systems. In Int. Conf. on Parallel and Distributed Processing Techniques and Applications, pages 1477-1488, Aug. 1996. K. Yelick, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. C., and A. Aiken. Titanium: a high-performance Java dialect. In ACM Workshop on Java for High-Performance Network Computing, pages 1-13, Feb. 1998. G. Zhang, B. Carpenter, G. Fox, X. Li, and Y. Wen. Considerations in HPJava language design and implementation. In Proc. of the l lth International Workshop on Languages and Compilers for Parallel Computing, pages 18-33. Springer, Aug. 1998.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
119
O n T h e U s e o f J a v a A r r a y s for S p a r s e M a t r i x C o m p u t a t i o n s G. Gundersen a and T. Steihaug ~ aDepartment of Informatics, University of Bergen, Norway [email protected], [email protected] Each element in the outermost array of a two dimensional native Java array is an object reference while the inner is an array of primitive elements. Each inner array can have its own size. In this paper it is shown how to utilize this flexibility of native Java arrays for sparse matrix computations. We discuss the use of Java arrays for storing matrices and particularly two different storage schemes for sparse matrices: the Compressed Row Storage and a profile or skyline storage structure and the implementations in Java. The storage schemes for sparse matrices are discussed on the basis of performance, memory and flexibility on several basic algorithms. Numerical results show that the efficiency is not lost using the more flexible data structures compared to classical data structure for sparse matrices. This flexibility can be utilized for high performance computing (HPC). 1. I N T R O D U C T I O N Java's considerable impact implies that it will be used for high performance computing and Java is already introduced as programming languages in introductory courses in scientific computation [ 1]. Matrix computation is a large and important area in scientific computation. Developing efficient algorithms for working with matrices are of considerable practical interest. Java's native arrays can be seen as a hybrid between an object and a primitive handled with references. When creating an array of primitive elements, the array holds the actual values for those elements. An array of objects stores references to the actual objects. Since arrays are handled through references, an array element may refer to another array thus creating a multidimensional array. A rectangular array of numbers (matrix) as shown in Figure 1 is implemented as Figure 2. Since each element in the outermost array of a multidimensional array is an object reference, arrays need not be rectangular and each inner array can have its own size as in Figure 3 creating a jagged form. When an object is created and gets heap allocated, the object can be placed anywhere in the memory. This implies that the elements d o u b l e [] of a d o u b l e [] [] may be scattered throughout the memory space. Non-contiguous memory access is slower than contiguous ones in any language on a a cache based memory due to reduced spatial locality. Java's native arrays have not been regarded efficient enough for high performance computing. Replacing the native arrays with multidimensional array [2], and extensions like Spar [3] have been suggested. However, packages like JAMA [4], JAMPACK [5] and Colt [6] use the native arrays as the underlying storage format. Investigating how to utilize the flexibility of Java arrays for creating elegant and efficient algorithms is of great practical interest.
120
II[ll Illll IIIII IIIII IIIII Figure 1. A t r u e 2D array.
II I II
Figure 2. A 2D Java array.
Figure 3. General Java array.
We discuss the use of native Java arrays for storing matrices and particularly two different storage schemes for sparse matrices. The implementation of Compressed Row Storage as Java Sparse Arrays [ 15], and a profile or skyline structure introduced in this paper. The compressed row format and profile format have a static implementation frequently used in languages like C and Fortran, and a dynamic implementation as a jagged array that can be used in languages like Java and c#. For operations like matrix vector and vector matrix multiplications we show that there is no loss in efficiency using the more dynamic and flexible data structure created by jagged arrays for storing sparse matrices compared to the more static format. For operations like matrix multiplication, matrix updates, and matrix factorization where back-fill or fill-ins are created we demonstrate that there is a gain in efficiency using the more dynamic and flexible data structure. The timings for the matrix operations where done on Red Hat Linux with Sun's Java Development Kit (JDK) 1.4.2, processor is Intel Pentium 4, processor speed is 2.26GHz and memory is 500MB. The time is measured in milliseconds (mS) and is an average of three trials.
2. SPARSE MATRICES A sparse matrix is usually described as a matrix where "many" of its elements are equal to zero and we benefit both in time and space by working only on the nonzero elements [7]. The difficulty is that sparse data structures include more overhead since indexes as well as numerical values of nonzero matrix entries are stored. There are several different storage schemes for large and unstructured sparse matrices that are used in languages like Fortran, C and C++. These storage schemes have enjoyed several decades of research and the most commonly used storage schemes for large sparse matrices are the compressed row storage [7, 8, 9, 10]. There are few released packages or separate algorithms implemented in Java for numerical computation on sparse matrices [6, 11, 12, 13, 15].
2.1. Compressed Row Storage scheme The Compressed Row Storage (CRS) format puts the subsequent nonzeros of the matrix rows in consecutive locations. For a sparse matrix we create three vectors" one for the double type ( v a l u e ) and the other two for integers ( c o l u m n i n d e x , r o w p o i n t e r ) . The v a l u e vector stores the values of the nonzero elements of the matrix, as they are traversed in a row-wise fashion. The c o l u m n i n d e x vector stores the column indexes of the elements in the v a l u e vector. The r o w p o i n t e r vector stores the locations in the v a l u e vector that start a row. If v a l u e
[k] - A # then c o l u m n i n d e x
[k] - j and r o w p o i n t e r [ / ]
__ k <
121 rowpointer [i + 1]. By convention r o w p o i n t e r [m] = n n z , where n n z is the number of nonzero elements in the matrix and m is the number of rows. Consider the sparse 6 x 6 matrix A which is a permutation of a matrix in [ 14]. To be consistent with Java first row and column index is 0. -1 13 0-2 3 0 0
A-
2 9 0 5 0
0 4 0 0 8 9 10 0 0 3 9 0 0 3 0 7 8 0 7 7
0 0 0 "
(1)
8
The nonzero structure of the matrix (1) stored in the CRS scheme:
double[] v a l u e = { - I , 2 , 4 , 1 3 , 9 , 8 , 9 , - 2 , 1 0 , 3 , 3 , 9 , 5 , 3 , 7 , 8 , 7 , 7 , 8 } ; int[] c o l u m n i n d e x : {0,1,3,0,1,3,4,1,2,0,2,3,1,2,4,5,3,4,5}; int [] r o w p o i n t e r = { 0 , 3 , 7 , 9 , 1 2 , 1 6 , 1 9 } ;
2.2. Java Sparse Array The Java Sparse Array format (JSA) introduced in [15] is a unique implementation in languages that supports jagged arrays. JSA takes advantage of the feature that Java's native arrays are objects. This concept is illustrated in Figure 4 and uses an array of arrays. There are two arrays, one for storing the references to the value arrays and one for storing the references to the index arrays (one for each row). With the JSA format it is possible to manipulate the rows independently without updating the rest of the structure as would have been necessary with CRS. Each row consist of a value and an index array each with its own unique reference. To store an m - n matrix, JSA use 2nr~z + 2 m storage locations compared to 2 n n z + m + 1 for the CRS format. A J a v a S p a r s e A r r a y "skeleton" class can look like this.
public class JavaSparseArray{ p r i v a t e double[] [] A v a l u e ; p r i v a t e int [] [] A i n d e x ; public JavaSparseArray(double[] this.Avalue = Avalue; this.Aindex = Aindex;
}
public JavaSparseArray
[] A v a l u e ,
int [] [] Aindex) {
times(JavaSparseArray
B) {...}
In this JavaSparseArray class, there are two instance variables, double [] [] and i n t [] []. For each row these arrays are used to store the actual values and the column indexes. The nonzero structure of matrix (1) is stored as follows in JSA.
double[] [] Avalue : {{-1,2,4},{13,9,8,9},{-2,10},{3,3,9},{5,3,7,8},{7,7,8}}; int[] [] Aindex = { {0,1,3}, {0,1,3,4}, { i, 2}, {0,2,3}, {1,2,4,5}, {3,4,5}};
122 int [ ] [ ] Aindex
JavaSparseArray
"-----~
liui
Value Array
Figure 4. Java Sparse Array format.
Figure 5. Jagged Variable Band Storage and Non Symmetric Skyline format.
2.3. Variable Band Storage The Variable Band Storage (VBS) is a row oriented storage scheme defined by the semibandwidths of the matrix. For each row i define upper ui and lower li semibandwidth
li = maxj i - j for nonzero Aij, ui = metxj j - i for nonzero Aij. In the VBS data structure, all matrix elements from the first nonzero in each row to the last nonzero in the row are explicitly stored. >From the semibandwidths ui and li the column indexes are i - li <_ j < i + ui. The number of nonzero elements of an m x n matrix is m--1 Y~'~i=o (li + ui) + m. For rows with only elements in the upper triangular part the corresponding l~ will be negative. The matrix (1) can be stored as double[] v a l u e = { - 1 , 2 , 0 , 4 , 1 3 , 9 , 0 , 8 , 9 , - 2 , 1 0 , 3 , 0 , 3 , 9 , 5 , 3 , 0 , 7 , 8 , 7 , 7 , 8 } ; int[] r o w p o i n t e r = { 0 , 4 , 9 , 1 1 , 1 5 , 2 0 , 2 3 } ; int[] f i r s t c o l u m n i n d e x = {0,0,i,0,i,3};
in a compressed format and in a jagged VBS format as double[] [] Avalue = {{-1,2,0,4},{13,9,0,8,9},{-2,10},{3,0,3,9},{5,3,0,7,8}, int[]
firstcolumnindex
{v,7,8}};
= {0,0,I,0,i,3};
In the VBS in a compressed format the columnindexes of A in row i are firstcolumnindex[i] _<j < firstcolumnindex[i] +rowpointer[i+l] -rowpointer[i],
and in the jagged VBS format firstcolumnindex
[i] _< j < f i r s t c o l u m n i n d e x
[i] + A v a l u e
[i] . l e n g t h .
123 Table 1 The algorithms for c = Ab and c T - bTA. The time is in milliseconds and the measuring is done on Red Hat Linux with Sun's JDK 1.4.2. Sparse Matrix Vector and Vector Matrix Multiplication Type GRE 216A GRE 216B GRE 343 GRE 512 SHERMAN SHERMAN SHERMAN SHERMAN SHERMAN WATT 1 WATT 2
1 2 3 4 5
m - n
nnz(A)
216 216 343 512 1000 1080 5005 1104 3312 1856 1856
876 876 1435 2192 3750 23094 20033 3786 20793 11360 11550
CRS Ab bTA 0 0 0
0 0 0 1 1 5 5 1 5 3 3
JSA Ab [ bTA 0 0 0 0 1 6 5 1 6 3 3
0 0 0 1 1 5 5 0 5 2 2
VBS Ab bTA
NSK Ab bTA
1 1 4 5 4 8 29 7 31 6 7
1 1 3 6 9 12 29 10 32 10 10
1 1 4 4 4 6 19 6 19 6 6
1 2 4 6 9 11 29 10 29 10 10
2.4. The Non Symmetric Skyline format The Non Symmetric Skyline format (NSK) is a standard way of storing variable band matrices [ 14, 16] for direct methods. A nonsymmetric skyline matrix or profile matrix has nonzero elements clustered around the main diagonal. Let li = maxj<_i i - j for nonzero Aij, ui = maxj
The column indexes for the lower part in row i are f i r s t c o l u m n i n d e x [ { ] the row indexes in column j are 2j - f i r s t c o l u m n i n d e x [j] - A v a l u e i<j.
i j i i while [j] . l e n g t h _<
2.5. Numerical results We have investigated VBS and NSK for storing banded matrices in Java. These two formats are compared on the basis of memory, efficiency and flexibility. We have only used square matrices but there is no limitations in concept or implementation using matrices where m --fin for JSA and VBS. We tested the formats on matrix vector multiplication (fight and left multiplication), and matrix multiplication. The sparse matrices used in Table 1 are from Matrix Market
124 [ 17] with banded features and for matrix-vector and vector-matrix operations a vector was full. The sparse pentadiagonal matrices in Table 2 are generated by the authors. There were only minor time differences between VBS and NSK on these operations, as shown in Table 1. VBS was slightly more efficient. The memory consumption between these two formats are nnz + 2n for VBS and nnz + 3n for NSK. So there is no loss in introducing VBS as a storage format for banded matrices in Java. We also compared VBS to CRS and JSA on the same operations, as shown in Table 1 and Table 2. Both JSA and CRS was significantly more efficient than VBS and NSK this is due to the fact that CRS and JSA does not store any nonzero elements as VBS and NSK does.
Table 2 The Matrix Multiplication algorithms for C = A 2. The time is in milliseconds and the measuring is done on Red Hat Linux with Sun's JDK 1.4.2. Sparse Matrix Multiplication I ?TZ=n
100 200 300 500 1000 5000 10000 15000 17000 20000
nnz(A) 493 993 1493 2493 4993 24993 44993 74993 84993 99993
nnz(C) 876 1776 2676 4476 8976 44976 89976 134976 179976 359976
JSA 1 2 3 9 10 19 47 57 77 97
VBS 0 1 1 7 7 16 31 43 48 87
In Table 2 used pentadiagonal matrices, that is three elements below the diagonal and one element above the diagonal. The diagonal elements were all non zero. Then we performed C = A 2, it is shown that VBS is competitive with JSA on Matrix Multiplication with matrices that has structure were we traverse only nonzero elements in the matrices involved. VBS was an average of 1.33 more efficient than JSA. This difference is partially explained by the fact that JSA gets indexes from a heap allocated arrays and uses extensive copying from temporary arrays to resulting arrays. When m = n increases we see that the difference between JSA and VBS is getting smaller, this can be explained by the cost of accessing i n t [ ] [] when the arrays gets large. This effect can maybe be removed by introducing two i n t [] arrays, one for li and one for ui. The performance improves if one-dimensional arrays are accessed instead of two-dimensional arrays. This flexibility (independently row updating) is especially useful for operations that calls for partial updating of the data structure where we with CRS and NSK (lower or upper) have to traverse the whole nonzero structure with extensive copying as shown in [15].
125 3. C O N C L U D I N G R E M A R K S
The VBS format has greater flexibility than NSK since the rows can be manipulated independently of the rest of the structure as we can with the classical data structure JSA. They also have the same memory consumption as NSK and where slightly more efficient on our test problems. But if the bands are mostly nonzero (dense) VBS may even compete with JSA and CRS for the test problems in efficiency, this since we do not have to get index elements from the memory. JSA can be extended to dense blocks in a similar way as CRS is extended to blocks[ 14]. Dense bands in VBS might be extended to a block matrix implementation and operations.
REFERENCES
[1 ]
[2]
[3]
[4]
[5] [6] [7] [8] [9] [ 10]
[ 11 ] [12]
[ 13] [ 14]
C.H. Bischof, H. M. Bticker, J. Henrichs, and B. Lang. Hands-On Training for Undergraduates in High-Performance Computing Using Java. In Applied Parallel Computing: New Paradigms for HPC in Industry and Academia, T. Sorevik, F. Manne, R. Moe, and A. H. Gebremedhin (eds), Proceedings of the 5th International Workshop, PARA 2000, Bergen, Norway, June 2000, pages 306-315, Lecture Notes in Computer Science, Volume 1947, Springer Verlag, 2001. Jos6 E. Moreira, Samuel P. Midkiff, and Manish Gupta. Supporting Multidimensional Arrays in Java. Concurrency and Computation: Practice and Experience, Volume 15, Issue 3-5, pages 317-340, 2003. Kees van Reeuwijk, Frits Kuijlman, and Henk J. Sips: Spar: a set of extensions Java for scientific computation. Concurrency and Computation: Practice and Experience, Volume 15, Issue 3-5, pages 277-299, 2003. Joe Hicklin, Cleve Moler, Peter Webb, Ronald F. Boisvert, Bruce Miller, Roldan Pozo, and Karin Remington. JAMA: A Java Matrix Package. June 1999. http://math.nist.gov/javanumerics/jama. G.W. Stewart. JAMPACK: A Package for Matrix Computations. February 1999. ftp ://math.nist. gov/pub/Jampack/Jampack/AboutJampack.html. The Colt Distribution. Open Source Libraries for High Performance Scientific and Technical Computing in Java. http://hoschek.home.cern.ch/hoschek/colt/. Sergio Pissanetzky. Sparse Matrix Technology. Academic Press, Massachusetts, 1984. Yousef Saad. Iterative Methods for Sparse linear Systems. PWS Publishing Company, Boston, 1996. I.S.Duff, A.M.Erisman, and J.K.Reid. Direct Methods for Sparse Matrices. Oxford University Press, 1986. J. Dongarra, A. Lumsdaine, R. Pozo, K. Remington. A Sparse Matrix Library in C++ for High Performance Architectures. Proceedings of the Second Object Oriented Numerics Conference, pp. 214-218, 1994. Java Numerical Toolkit. http://math.nist.gov/jnt. Rong-Guey Chang, Cheng-Wei Chen, Tyng-Ruey Chuang, and Jenq Kuen Lee. Towards Automatic Support of Parallel Sparse Computation in Java with Continuous Compilation. Concurrency: Practice and Experience, 9(1997) 1101-1111. SciMark 2.0. http://math.nist.gov/scimark2/about.html. R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo,
126 C. Romine, and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM Publications, Philadelphia, 1994. [ 15] Geir Gundersen and Trond Steihaug. Data Structures in Java for Matrix Computation. To appear in Concurrency and Computation: Practice and Experience, 2003. [ 16] Yousef Saad. SPARSKIT: A basic tool kit for sparse matrix computations, Version 2. Technical Report, Computer Science Department, University of Minnesota, June 1994. [ 17] Matrix Market. http://math.nist.gov/MatrixMarket.
Parallel Computing: SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
127
A Calculus of Functional BSP Programs with Explicit Substitution F. Loulergue ~ ~Laboratory of Algorithms, Complexity and Logic, 61, avenue du G6n6ral de Gaulle, 94010 Cr6teil cedex, France The BSA~-calculus is a calculus of functional Bulk Synchronous Parallel (BSP) programs with enumerated parallel vectors and explicit substitution. This calculus is defined and proved confluent. It constitutes the core of a formal design for a Bulk Synchronous Parallel dialect of ML (BSML) as well as a framework for proving parallel abstract machines which can evaluate BSML programs. 1. INTRODUCTION Bulk Synchronous Parallel ML or BSML is an extension of ML for programming direct-mode parallel Bulk Synchronous Parallel algorithms as functional programs. Bulk-Synchronous Parallel (BSP) computing is a parallel programming model introduced by Valiant [24] to offer a high degree of abstraction like PRAM models and yet allow portable and predictable performance on a wide variety of architectures. A BSP algorithm is said to be in direct mode [10] when its physical process structure is made explicit. Such algorithms offer predictable and scalable performances and BSML expresses them with a small set of primitives taken from the confluent BSA calculus [17]: a constructor of parallel vectors, asynchronous parallel function application, synchronous global communications and a synchronous global conditional. Our B S M L I ib library implements the BSML primitives using Objective Caml [ 15] and MPI [23]. It is efficient [ 16] and its performance follows curves predicted by the BSP cost model (the cost model estimates parallel execution times). This library is used as the basis for the C ARAML project, which aims to use Objective Carol for Grid computing with, for example, applications to parallel databases and molecular simulation. In such a context, security is an important issue, but in order to obtain security, safety must be first achieved. An abstract machine was used for the implementation of Caml and is particular easy to prove correct w.r.t, the dynamic semantics [11 ] and the proof can be done using the Coq Proof Assistant [2]. In order to have both simple implementation and cost model that follows the BSP model, nesting of parallel vectors is not allowed. BSML1 i b being a library, the programmer is responsible for this absence of nesting. This breaks the safety of our environment. A polymorphic type system and a type inference has been designed and proved correct w.r.t, a small-steps semantics. It will be included in a full BSML compiler. A parallel abstract machine [19] (and a compilation scheme) for the execution of BSML programs has been designed and proved correct w.r.t, the BSA-calculus [ 17], using an intermediate semantics. Another abstract machine [ 18] has been designed but those machines are not adapted for grid computing and security because the compilation schemes need the static number of pro-
128 cesses (this is not possible for Grid computing) and some instructions are not realistic for real code and a real implementation. [8] is based on the Zinc Abstract Machine (or ZAM) and has not the drawbacks of the previous machines. [ 11 ] used the A,,-calculus - a A-calculus with explicit substitution- to prove several abstract machines for the evaluation of functional languages: the Krivine machine, the SECD machine [13], the Categorical Abstract Machine (CAM) [4], the Functional Abstract Machine (FAM) [3] and it has also been used to prove the correction of the Zinc Abstract Machine (ZAM) [ 14]. For each machine, the corresponding reduction strategy of the Ao-calculus is given. We are particularly interested by the CAM and the ZAM because they are respectively the abstract machine of implementations of Caml and Objective-Caml [ 15, 21 ] (and also Caml-light). In order to ease the proof of correctness of the parallel abstract machines we designed, we need a version of the BSA-calculus with explicit substitution. This paper is about such a calculus. We first presents the BSP model and the A~-calculus (section 2). We then define the BSA~calculus (section 3) which is confluent. We end with conclusions and future work (section 4). 2. P R E L I M I N A R I E S 2.1. The Bulk Synchronous Parallel model Bulk-Synchronous Parallel (BSP) computing is a parallel programming model introduced by Valiant [24, 22] to offer a high degree of abstraction like PRAM models and yet allow portable and predictable performance on a wide variety of architectures. A BSP computer contains a set of processor-memory pairs, a communication network allowing inter-processor delivery of messages and a global synchronization unit which executes collective requests for a synchronization barrier. Its performance is characterized by 3 parameters expressed as multiples of the local processing speed: the number of processor-memory pairs p, the time I required for a global synchronization the time 9 for collectively delivering a 1-relation (communication phase where every processor receives/sends at most one word). The network can deliver an h-relation in time 9h for any arity h. A BSP program is executed as a sequence of super-steps, each one divided into (at most) three successive and logically disjoint phases. In the first phase each processor uses its local data (only) to perform sequential computations and to request data transfers to/from other nodes. In the second phase the network delivers the requested data transfers and in the third phase a global synchronization barrier occurs, making the transferred data available for the next superstep. The execution time of a super-step s is thus the sum of the maximal local processing time, of the data delivery time and of the global synchronization time:
Time(s) where ~69U i(s)
-
max
i:processor
(s)
,wi
+
max
i:processor
(s)
hi
9g + 1
local processing time on processor i during super-step s and hl s)
-
where 'h~ i(s) (resp hl s)) is the number of words transmitted (resp. received) by processor i + during super-step s. The execution time ~ s Time(s) of a BSP program composed of S superA i(s) and H = steps is therefore a sum of 3 terms:W + H 99 + S 9 l where W - ~ s maxi w ~--~smaxi hl s). In general W, H and S are functions of p and of the size of data n, or of more complex parameters like data skew and histogram sizes. To minimize execution time the BSP
129 algorithm design must jointly minimize the number S of super-steps and the total volume h (resp. W) and imbalance h (s) (resp. W (s)) of communication (resp. local computation). 2.2. The A~#-calculus The reader is assumed to have basic knowledge of the implementation of functional languages [5] and of A-calculus [ 1]. The A~,-calculus is a A calculus of explicit substitution which is confluent [6] (for every term and not only closed ones). The sort of terms is given by the following grammar : ct
::=
I
Xt
Aa
term meta-variable Abstraction
n
a[s]
De Bruijn indice Closure
I aa
Application
xt can be considered as a "hole" that can be filled by a term, but this kind a variable cannot be bound. Terms containing such variables correspond much to contexts in usual A-calculus. De Bruijn indices [7] are used instead of variable names. This method was used to avoid the burden of a-conversion (renaming of bound variables in order to avoid the capture of free variables during/3-reduction). They replace a given occurrence of a variable name by the number of A-abstraction which are included between the occurrence and the A-abstraction that binds it. Application and abstraction are the same than application and abstraction in A-calculus. The last kind of term corresponds to the notion of closure found in the implementations of functional languages: a is the "body" of the function and s is an environment collecting the values of free variables. In the formalism presented here, s is called a substitution. The substitutions are given by the following grammar: s ::= xs [ id 1 T I a . s ] s o s ] zs is a substitution meta-variable. There are two special substitutions: id and T. A substitution is either of list of terms a . s or a composition of two substitutions s o s or a "lifted" substitution that is a substitution which was propagated under an abstraction. The rules of the A~ are given in figure 1. About half of the rules have no intuitive meaning but are part of the calculus in order to make it confluent. (Beta) corresponds to the creation of a closure from the application of an abstraction to another term. Rule (App) is for the propagation of the substitutions through application. Rules (FVarCons) and (RVarCons) show how the value of a variable (here a De Bruijn index) is retrieved from the a substitution. The application of a substitution s' to a term Aa[s] will result in applying s' to s and leading to a new substitution s' o s, by rule (Clos). Other rules are used to compose substitution: (AssEnv), (MapEnv), (ShiftCons) and (IdL). Rule (Lambda) is for the propagation through an abstraction and uses the "lifted" substitution construct. Rules (IdR) and (Id) define id as the identity substitution. The other rules make the adjustments of De Bruijn indices explicit.
3. THE BSA~,-CALCULUS The BSA~-calculus is a calculus for functional Bulk Synchronous Parallel programs. It is composed by the A~-calculus and a term rewriting system [12] (TRS) to express bulk synchronous parallelism.
130 (A a ) b - - , a[b. id]
(Beta)
(ab)[s] --~ (a[s])(b[s])
(App) (Lambda)
A a [ s ] - ~ A(a[~ (s)]) (a[s])[t] ~
(Clos)
a[s o t]
n T --~ n+l n[T os] ~ n+l[s] l[a.s] --~ a
(VarShiftl) (VarShift2)
n+l[~ (s)] -+ n[so T]
(RVarliftl)
n+l[~ (s) o t] -+ n[s o (T ot)] (RVarlift2) T o(a. s) ~ s (ShiflCons) T o ~ (s) -+ so T
(ShiflLifll)
T o(~ (s) o t) --+ s o (T or) ~ (s)o ~ (t) - ~ (s o t)
(ShiflLift2) (Liftl)
(FVarCons)~ (s) o (~ (t) o u) -+~ (s o t) o u
(Lift2)
1[~ (s)] --~ 1 1[~ (s) o t] --~ l[t]
(FVarliftl) (FVarlift2)
~ (s) o (a. t) -+ a - ( s o t) id o s -+ s
(LiflEnv) (IdL)
n + l [ a . s ] - + n[s]
(RVarCons)
s o id -+ s
(IdR)
(s o t) o u --+ s o (t o u) ( a . s) o t --~ a[t] . (s o t)
(AssEnv) (MapEnv)
~ (id) -+ id a[id] --~ a
(Liftld) (Id)
Figure 1. The Ao~-calculus
3.1. The term rewriting system BS BS is a term rewriting system which will be used to extend the A~-calculus. This system introduces operations for data-parallel programming but with explicit processes in the spirit of BSP. We now describe its syntax and its reduction with operational motivations. The terms are defined by the following grammar: a "'= c [ | a . . . a I (a --~ a, a) usual operations [ a # a pointwise parallel application [ (a,..., a,..., a) parallel vector [ a ? a get operation [ (a --% a, a) global conditional c is an integer or boolean constant, | C { + , - , • or, and, <, > , = - , n o t } . The integer constants n must not be mixed up with the De Bruijn indices n which are used instead of usual variables. The principal terms represent parallel vectors i.e. tuples of p local values where the i th value is located at the processor with rank i. The width p is fixed, so there is one set of rule for each value ofp. The notation is: ( ao , . . . , ap-1 ) where a o , . . . , ap_l are terms. The forms al # a2 and al ? a2 are called point-wise parallel application and get respectively. Point-wise parallel application represents point-wise application of a vector of functions to a vector of values, i.e. the pure computation phase of a BSP super-step. Get represents the communication phase of a BSP super-step: a collective data exchange with a barrier synchronization. In a l ? a2, the resulting vector field contains values from al taken at processor names defined in a2. The exact meanings of point-wise parallel application and get are defined by the BS rules. The last form of global terms define synchronous conditional expressions. The meaning of (at --~ a2, a3) (not to be confused with (al --~ a2, a3)) is that of a2 (resp. a3) if the vector denoted by a l has value t r u e (resp. false) at the processor name denoted by n.
131 true
a,2)
> a I
(false ~ al, a2)
> a2
---+ a l ,
(ItTrue) (IfFalse)
(ao bo,..., ap-1 bp-1) ' { ano , . . . , anp_, )
(< a , . . . , ap, Z ' " "
(ParApp) (Get)
a>
>
al
(IfAtTrue)
a ) & a l , a2)
>
a2
(IfAtFalse)
i
({ a , . . . ,
fa~,..., i
(Get) is not a rule but a set of rules such that for all i in { 0 , . . . , p - 1}, ni is integer constant between 0 and p - 1.
Figure 2. Rules of the term rewriting system BS
The rules for usual operators are defined by rules like : 3 + 2 > 5 or (true or false) true. The equality is only defined on integer and boolean constants. The value of al ? a2 at processor name i is the value of el at processor name given by the value of e2 at i. Notice that, in practical terms, this represents an operation whereby every processor receives one and only one value from one and only one other processor. This restriction can be lifted by defining a "put" operation (which is also a pure functional operation) but which is not given here for the sake of conciseness. Next, the global conditional is defined by two rules. The above two cases generate the following bulk-synchronous computation: first a pure computation phase where all processors evaluate the local term yielding to n; then processor n evaluates the parallel vector ofbooleans; if at processor n the value is t r u e (resp. false) then processor n broadcasts the order for global evaluation of a 1 (resp. a2); otherwise the computation fails. Those two rules are necessary to express algorithms of the form: Repeat Parallel
Iteration
UntilMax
of
local
errors
< e
because without them, the global control can not take into account data computed locally, ie global control can not depend on data. Additional rules bare also needed to propagate substitutions through the symbols of BS. For every symbol f of BS with arity n we have the rule : f(al,...,an)[s]
, f(al[s],...,an[s])
If n = 0 this rule means
f[s]
,
f.
3.2. Confluence of the calculus We will use the follow theorem due to Pagano [20] :
(f#)
132 Theorem 1. Let be (F, TO) a term rewriting system such that : 9F
and A~ (the set of terms of the Ao#) share at most the application symbol
9 T~ is
confluent
9 T~ is left
linear
9T~ does not contain variable-applicante rules
Then the AF~9-calculus is confluent R e m a r k 1. If F and A# do not share the application symbol then the last condition is not required. In our case (F, 7~) = BS. There are no critical pairs. Rules are left linear. The TRS BS is confluent. By theorem 1, the BSA~9-calculus is confluent. 3.3. Examples The first example shows that parallel vector can be also expressed by an intentional construction as in the BSA-calculus [17]. The 7r or parallel vector constructor of BSA can be defined as : =
A vector usually used is t h i s defined as: 7r(A1). It can be reduced to (using the rules indicated on the fight): this _= (A( 1 0 , , 1 ( p - I) ))AI > < 10,, l ( p - 1 ) ) ) [ A 1 id] >P ( ( 1 0 ) [ A a - i d ] , . . . , (1 ( p - 1))[A1. id] >
>P ( l [ A l . i d ] O [ A l . i d ] , . . . , a [ A l . i d ] ( p - 1 ) [ A l . i d ] ) >P < A1 (O[A1. i d ] ) , . . . , AI ( ( p - 1)[Al. id]) > >P ( .)kl 0 , . . . , A1 (p - 1) ) >P < 1[0. i d ] , . . . , i [ ( p - 1). id] ) >P < 0 , . . . , p - l >
(Beta)
(f~) (App) (FVarCons)
(f#) (Beta) (FVarCons)
The second one is the direct broadcast algorithm which broadcasts the value held at processor given as first argument: b c a s t : = AA1 ? ~-(A2). If applied to 1 (the root of the broadcast) and to an expression e which evaluates to a parallel vector ( a0 , . . . , ap_l ) it can be reduced as follows (we omit the steps similar to the previous example): bcast
1e
>* < a0 , . . . , ap_ 1 > .7 < 1 , . . . ,
1,...,
1 ) by rule (Get) < a 1 , - . - ,
a 1,-..,
al >
4. CONCLUSIONS AND FUTURE WORK
The BSAo~-calculus is a confluent calculus for functional bulk synchronous parallel programs. Being an extension of the A~-calculus it has its advantages such as to be closer to functional languages implementations than our BSA-calculus [ 17]. Another interesting feature of this calculus is the possibility to express weak reduction strategy (ie no reduction under a
133 A-abstraction) by removing rules of the calculus. This was not possible in the A-calculus: removing context rules which allow the reduction under a A-abstraction leads to a non confluent calculus. Thus the ease of expressing reduction strategies used in real functional languages allow to prove the correctness of abstract machines used in the implementations of those languages [ 11 ]. We have designed a Bulk Synchronous Parallel ZINC Abstract Machine [8] (BSP ZAM) which is an extension of the Zinc Abstract Machine used in the Objective Caml implementation The next phases of the project will be: the proof of correctness of this machine with respect to the BSA~-calculus, and the parallel implementation of this abstract machine. This BSP ZAM implementation will be the basis of a parallel programming environment developed from the Caml-light language and environment. It will include our type inference [9] and will thus provide a very safe parallel programming environment.
REFERENCES
[ 1] [2] [3]
[4] [5] [6] [7]
[8]
[9]
[ 10] [11 ] [12]
[13]
H. R Barendregt. Functional programming and lambda calculus. In J. Van Leeuwen, editor, Handbook of Theoretical Computer Science (vol. B), pages 321-364. Elsevier, 1990. S. Boutin. Proving correctness of the translation from mini-ml to the cam with the coq proof development system. Technical Report 2536, INRIA, 1995. L. Cardelli. Compiling a functional language. In Conference Record of the 1984 ACM Symposium on Lisp and Functional Programming, pages 208-217, Austin, Texas, August 1984. ACM. G. Cousineau, P.-L. Curien, and M. Mauny. The categorical abstract machine. Science of Computer Programming, 8:173-202, 1987. G. Cousineau and M. Mauny. The Functional Approach to Programming. Cambridge University Press, 1998. P.-L. Curien, T. Hardin, and J.-J. Ldvy. Confluence properties of weak and strong calculi of explicit substitutions. Journal of the ACM, 1996. N.G. De Bruijn. Lambda-calculus notation with nameless dummies, a tool for automatic formula manipulation, whith application to the Church-Rosser theorem. Indag. Math., 34:381-392, 1972. F. Gava and F. Loulergue. A Parallel Virtual Machine for Bulk Synchronous Parallel ML. In Peter M. A. Sloot and al., editors, International Conference on Computational Science (ICCS 2003), Part I, number 2657 in LNCS. Springer Verlag, june 2003. F. Gava and F. Loulergue. A Polymorphic Type System for Bulk Synchronous Parallel ML. In Seventh International Conference on Parallel Computing Technologies (PACT 2003), LNCS. Springer Verlag, 2003. A.V. Gerbessiotis and L. G. Valiant. Direct Bulk-Synchronous Parallel Algorithms. Journal of Parallel and Distributed Computing, 22:251-267, 1994. T. Hardin, L. Maranget, and L. Pagano. Functional runtime systems within the lambdasigma calculus. Journal of Functional Programming, 8(2): 131-176, 1998. Jan Willem Klop. Term rewriting systems. In S. Abramsky, D. M. Gabbay, and T. S. E. Maibaum, editors, Handbook of Logic in Computer Science, volume 2, chapter 1, pages 1-117. Oxford University Press, Oxford, 1992. P. J. Landin. The mechanical evaluation of expressions. The Computer Journal, 4(6):308-
134
[14] [ 15] [16]
[17] [18] [19]
[20] [21]
[22] [23] [24]
320, 1964. X. Leroy. The ZINC experiment: An economical implementation of the ML language. Rapport Technique 117, 1991. Xavier Leroy. The Objective Caml System 3.07, 2003. web pages at www.ocaml.org. F. Loulergue. Implementation of a Functional Bulk Synchronous Parallel Programming Library. In 14th IASTED International Conference on Parallel and Distributed Computing Systems, pages 452-457. ACTA Press, 2002. F. Loulergue, G. Hains, and C. Foisy. A Calculus of Functional BSP Programs. Science of Computer Programming, 37(1-3):253-277, 2000. A. Merlin and G. Hains. La Machine Abstraite Cat6gorique BSP. In Journdes Francophones des Langages Applicatifs. INRIA, 2002. A. Merlin, G. Hains, and F. Loulergue. A SPMD Environment Machine for Functional BSP Programs. In Proceedings of the Third Scottish Functional Programming Workshop, august 2001. Bruno Pagano. Des calculs de substitution explicite et de leur application g~la compilation des langagesfonctionnel. PhD thesis, universit6 Pierre et Marie Curie, 1997. D. R6my. Using, Understanding, and Unravellling the OCaml Language. In G. Barthe, E Dyjber, L. Pinto, and J. Saraiva, editors, Applied Semantics, number 2395 in LNCS, pages 413-536. Springer, 2002. D. B. Skillicorn, J. M. D. Hill, and W. F. McColl. Questions and Answers about BSP. Scientific Programming, 6(3):249-274, 1997. M. Snir and W. Gropp. MPI the Complete Reference. MIT Press, 1998. Leslie G Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103, August 1990.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
135
JToe: a Java* API for Object Exchange S. Chaumette a, R Grange a, B. Mftrot a, and E Vignfras ~ a Laboratoire Bordelais de Recherche en Informatique, Universit6 Bordeaux 1, 351, cours de la Libfration, 33405 Talence Cedex, France. email: {Serge.Chaumette, Pascal.Grange, Benoit.Metrot, Pierre.Vigneras}@labri.fr This paper presents JToe, an API dedicated to the exchange of Java objects in the context of high performance computing. Even though Java RMI provides a good framework for distributed objects in general, it is known to be quite inefficient, mainly due to the Java serialization process. Many projects have already improved RMI either by redesigning and reimplementing it or by reimplementing the serialization process. We claim that both approaches are missing a clear and high level API for the exchange of objects. JToe proposes a new simple API that focuses on the exchange of objects. This API is flexible enough to allow, for instance, direct copy of byte streams representation of JVM objects over a specialized transport layer such as Myrinet or LAPI. Remote method invocation frameworks such as Java RMI can then be implemented over JToe with good performance enhancement perspectives. 1. INTRODUCTION Whereas remote procedure call and remote method invocation have given developers a good abstraction of the underlying transport layers for client/server communication, high performance computing has not adopted these programming models - message passing interface is still preferred- mainly because of the performance penalty they suffer from. It is acknowledged that the main drawback of RMI is its inefficiency for object serialization and de-serialization. A lot of developments exist that improve this serialization mechanism in the context of high performance computing. Some of them consist in a whole RMI reimplementation [ 1, 2, 4]. In these cases, the problem of the exchange of objects over the network has to be addressed simultaneously with the problems of distributed garbage collection, remote method invocation, threads management, registration management, and so on. Our contribution consists in defining the notion of exchange of objects as simply as possible through the JToe API. This allows one to specifically address this problem separately from the other RMI challenges. Other works (or parts of previously cited works) address only the serialization problem in the context of high performance computing [ 1, 3, 9]. From a technical point of view, they generally rely [I, 3] on the already defined Obj ectOutputStream and Obj ect InputStream *Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. The authors are independent of Sun Microsystems, Inc.
136 classes. We argue in section 2 that these APIs are low level ones. What we propose is a new high level API dedicated to the exchange of Java objects in the context of high performance computing. The rest of this paper is organized as follows. In section 2 we discuss why we introduce a new API. We describe this API in section 3. Existing implementations and performances are presented in section 4. 2. A N E W API
We need an API that allows one to get all the benefits of an efficient object exchange layer that can be used to implement a distributed application or more general frameworks such as RMI. For this purpose, a widely used API already exists in Java: Object (Output/Input)Stream [5]. However we believe that, in the context of high performance computing, a new distinct API for the exchange of objects is needed. First, it is acknowledged that standard Java serialization is not efficient. Faster implementations have been proposed [ 1,2, 3, 4]. Of course, none of them relies on, neither respects, the Java serialization specification: technically speaking, one can inherit Obj e c t O u t p u t S t r e a m and still deeply break the compatibility with the standard serialization process as far as the corresponding Obj e c t I n p u t S t r e a m class is provided. However, when inheriting these classes, depending on how the new classes will be used, we sometimes need to respect the standard serialization process. For instance, how to deal with the u s e P r o t o c o l V e r s i o n method when we do not respect the standard protocol to achieve efficiency? Moreover, these classes, Obj ectOutputStream and Obj ectInputStream, may evolve with the protocol, and they have in the past. Such evolutions may lead to incompatibilities with legacy sub-classes 2. Second, we consider that Obj e c t ( O u t p u t / I n p u t ) S t r e a m is a low level API since it does not hide the stream management of object serialization. Such a stream oriented API is not well suited for transport layers that do not rely on stream based hardware or libraries [6, 7]. Of course one can inherit - and this is the common approach - Obj e c t O u t p u t S t r e a m and give a brand new implementation for non stream based hardware. Even if we believe this is unnatural and error-prone, it is not a technical issue. The real problem is that, in order to receive an object, one must use an instance of the Obj e c t I n p u t S t r e a m class and especially the r e a d O b j e c t method. This implies that a thread must be waiting on this blocking method for an object to be received. This prevents one to get the benefits of using special architectures and/or libraries that allow, for instance, one-sided initiator data-transfers [6]. For all these reasons, we claim that a new non stream oriented API has to be defined. One may argue that RMI could already be implemented to take advantage of one-sided initiator data-transfers. Nevertheless, this implies to re-implement RMI what, in turn and as stated in section 1, implies to deal with a lot of other high level challenges instead of just focusing on the exchange of objects: distributed garbage collection, method invocation, registration management, etc. This is why an object exchange oriented API has to be defined, independently of RMI. The problem of defining adapted APIs for the easy replacement of the transport layer or of the serialization process has already been addressed. However, it generally leads to an unfortunate dividing line between the serialization process and the transport layer. For instance in 2The best example being the writeObjectOverride method that allows one to define a new serialization process. It only exists sincejdk 1.2.
137 KaRMI [ 1] a notion of technology is defined. The serialization is performed by KaRMI and the result of this serialization is sent using a given transport technology. A new technology can be defined to enhance the network transfer or to provide a new type of network support. However, in some improvement approaches [3, 9], the data to be transferred may depend on the remote/local JVM and/or on the available communication mechanisms: for instance the assumption that the two communicating computers are running similar JVMs allows to directly send memory regions. It may also be the case that the transport layer makes it possible not to buffer the data to be transferred. In such a case, separating the serialization process from the transport process prevents one from providing improvements based on these characteristics. This is why we believe that this separation is cumbersome.
3. J T O E : THE API
Program 1 The JToe API
public interface Node { void copy(Serializable object)
}
throws JToeException;
public interface CopyListener { void copied(Serializable object) ;
}
From the previous observations, we defined a simple powerful API: JToe. It is mainly composed of two interfaces: Node and CopyListener (see program 1). When one wants to send a copy of an object to a remote node, he has to get the Node object 3 representing the node he wants to communicate with and then invoke the c o p y method passing it the object to send as an argument. On the server side, the JToe layer will inform the application, by means of a call-back mechanism, that a new object has arrived, using the c o p i e d method of the C o p y L i s t e n e r interface. The action of copying an object to a remote node is a one-sided action. No user thread has to wait for an object to be received. This is the responsibility of the implementation to accept and receive any new object and then signal this arrival to the application using an event driven programming model through C o p y L i s t e n e r . c o p i e d . This approach is not only easy to understand and to use, but also allows to really take advantage of one-sided communication libraries [6]. Moreover, as shown in figure 1, there is no limitation on the way the Node interface can be implemented. The c o p y method is responsible for the whole process of copying an object to another address space, that is the construction of the data to send (serialization) and the transfer of these data through a transport layer. This way, no artificial dividing line between the serialization process and the transport layer is introduced. We claim that this allows any serialization improvement approach to be implemented. 3The problem of retrieving a Node is not managed by JToe since it is not a transfer of object problem but more a registry one. General naming services may be used such as JNDI [8].
138
Application
Application cop~
copy(o)
Node JV
i-
:# i i
: i i
JVM
i
System
System
Figure 1. General JToe behavior.
4. J T O E IMPLEMENTATIONS We have already developed three implementations of JToe [10]. Two of them are totally written in Java, one relying on Java RMI, the other using TCP sockets and the standard Java serialization. These allow any JToe application to be 100% Java compatible and portable. They also serve as reference implementations for regression tests. The third is a JVM level implementation, i.e. at a level where we can directly access the memory representation of objects. The goal of this implementation of JToe is to provide high performance in clusters of homogeneous computers and homogeneous JVMs by directly sending memory data instead of going through the serialization process. This implementation uses the JikesRVM virtual machine [14]. JikesRVM allows to directly access memory and interact with the garbage collector from Java code. It is an interesting experimentation platform and allowed us to implement a prototype in a reasonable time. In this implementation our concern is to perform efficiently and interact smartly with the garbage collector. JikesRVM supports various garbage collection policies. Parts of our implementation are garbage collector dependent. This implementation mainly behaves as follows: when an object is being copied, the corresponding graph of objects is computed. Then the data of the objects of the graph are sent with zero copy 4. On the receiver side, memory is allocated in a garbage collector dependent s p a c e mainly the n u r s e r y - and data are directly written into this area. The pointers are then updated to reflect the original structure. Figure 2 shows the performances of the JikesRVM specific implementation of JToe compared to the 100% Java implementation. Both are using TCP as their communication layer. The values represented by the curves are the average time of one round-trip for the specified object in a ping-pong like application. The RVN_RVIVl curve shows the performances of the JikesRVM dedicated JToe implementation when run with JikesRVM - note that it cannot be run with another JVM. The TCP_RVM curve shows the performances of the 100% Java implementation of JToe running with JikesRVM without any optimization. The TCP_SUN and TCP_IBM curves are for the same implementation running on, respectively, the Sun Java virtual machine version 4When we talk about zero copy we mean that our code does not perform any copy even if the actual communication layer will since we rely on TCE Future releases based on zero copy communication layers will actually achieve real zero copy.
Figure 2. Ping-pong performance: average round-trip time versus object size for the RVM_RVM, TCP_RVM, TCP_SUN and TCP_IBM configurations. (a) Ping-pong with TreeSets (size of the TreeSet). (b) Ping-pong with Vectors (size of the Vector). (c) Ping-pong with arrays of ints (size of the array). (d) Ping-pong with arrays of doubles (size of the array).
All the experiments were carried out using two Linux 2.4.18 computers with Intel 1.7 GHz processors and 512 MB of RAM, connected with 100 Mbps Ethernet. We can see that, generally speaking, JikesRVM does not perform as well as the two other virtual machines for our ping-pong application. However, for the ping-pong of Vectors, our JikesRVM-specific JToe implementation outperforms the 100% Java one by 30% to 40% when both are running on JikesRVM. This is a very promising result, since it suggests that the same sort of enhancement may lead to a similar performance improvement with the other virtual machines. For the TreeSets, our implementation does not give interesting results; it performs even worse than the default serialization. Our code performs a suboptimal graph traversal to collect the objects to send, whereas the ad hoc serialization of the java.util.TreeSet class directly and linearly writes the objects it contains. We believe that this is the reason for this poor performance. Future releases of JToe for JikesRVM will improve the graph traversal algorithm by marking visited objects with JikesRVM-specific techniques. Finally, for arrays of int and double, our implementation totally outperforms the standard serialization on JikesRVM. The round-trip times are close to those obtained with the Sun or IBM virtual machines despite the poor performance of JikesRVM. This suggests that similarly optimized implementations of JToe for the Sun or IBM virtual machines
will lead to comparable improvements.

The issue of fast transfer of objects has already been addressed in other works. Expresso [11] is a framework aimed at transferring Java objects efficiently using zero-copy mechanisms. Expresso relies on the Kaffe virtual machine [12]. Expresso uses the notion of cluster to avoid graph exploration and pointer updates. A cluster is a contiguous memory area where objects can be allocated. To transfer an object to a remote node, one actually transfers the cluster containing that object; all the other objects in the cluster are also transferred. With Expresso, objects must explicitly be allocated in the correct cluster to be transferred, and the references between objects in different clusters are not preserved. Ibis [13] is a grid computing environment for Java. It defines the IPL, an API on top of which higher-level distributed environments such as RMI can be implemented. Ibis differs from our work since the IPL not only defines the way objects are sent between computers but also how to access topology information, monitoring data, etc. Moreover, object serialization optimization in Ibis relies on byte code rewriting, the aim of which is to add the serialization code to classes in order to avoid dynamic type inspection.

5. CONCLUSION

In this paper we have presented a new API, which we call JToe, dedicated to the exchange of objects between JVMs. This API makes it possible to take advantage of the knowledge we have of both the source and target JVMs, of the underlying network and of the associated communication libraries. This leads to very efficient exchange of objects between JVMs. We have three implementations running. Two are 100% Java (one over RMI and one over TCP); the third is a low-level one dedicated to JikesRVM. This last implementation is still a work in progress. We are currently working on enhancing the graph traversal algorithm using JikesRVM-specific mechanisms. We also plan to release a LAPI and a Myrinet version of JToe.

REFERENCES
[1] Christian Nester, Michael Philippsen and Bernhard Haumacher. A More Efficient RMI for Java. In Java Grande, pages 152-159, 1999.
[2] Jason Maassen, Rob Van Nieuwpoort, Ronald Veldema, Henri E. Bal, Thilo Kielmann, Ceriel J. H. Jacobs and Rutger F. H. Hofman. Efficient Java RMI for parallel programming. Programming Languages and Systems, 23(6):747-775, 2001.
[3] Fabian Breg and Constantine D. Polychronopoulos. Java virtual machine support for object serialization. In Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande, pages 173-180. ACM Press, 2001.
[4] Fabian Breg and Dennis Gannon. A Customizable Implementation of RMI for High Performance Computing. In Proc. of Workshop on Java for Parallel and Distributed Computing of IPPS/SPDP99, pages 733-747, 1999.
[5] Sun Microsystems. Serialization specification. http://java.sun.com/
[6] Shah G., Nieplocha J., Mirza C., Harrison R., Govindaraju R.K., Gildea K., DiNicola P. and Bender C. Performance and experience with LAPI: a new high-performance communication library for the IBM RS/6000 SP. In International Parallel Processing Symposium, pages 260-266, 1998.
[7] Myricom. Myrinet software. http://www.myri.com/scs/
[8] Sun Microsystems. Java Naming and Directory Interface. http://java.sun.com/products/jndi/
[9] K. Kono and T. Masuda. Efficient RMI: Dynamic Specialization of Object Serialization. In Proc. of IEEE Int'l Conf. on Distributed Computing Systems (ICDCS), pages 308-315, 2000.
[10] Pascal Grange and Pierre Vignéras. The JToe project. http://jtoe.sf.net
[11] L. Courtrai, Y. Mahéo and F. Raimbault. Expresso: a Library for Fast Transfers of Java Objects. In Myrinet User Group Conference, Lyon, 2000.
[12] The Kaffe Java virtual machine. http://www.kaffe.org/
[13] Rob Van Nieuwpoort, Jason Maassen, Rutger Hofman, Thilo Kielmann and Henri E. Bal. Ibis: an Efficient Java-based Grid Programming Environment. In Joint ACM Java Grande - ISCOPE 2002 Conference, pages 18-27, 2002.
[14] B. Alpern, C. R. Attanasio, A. Cocchi, D. Lieber, S. Smith, T. Ngo, J. J. Barton, S. F. Hummel, J. C. Sheperd, and M. Mergen. Implementing Jalapeño in Java. In Proceedings of the 1999 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages & Applications, OOPSLA 99, Denver, Colorado, November 1-5, 1999, volume 34(10) of ACM SIGPLAN Notices, pages 314-324. ACM Press, Oct. 1999.
A Modular Debugging Infrastructure for Parallel Programs

D. Kranzlmüller a, Ch. Schaubschläger a, M. Scarpa a, and J. Volkert a

a GUP, University of Linz, Altenbergerstr. 69, 4040 Linz, Austria

Debugging parallel and distributed programs is a difficult activity due to multiple concurrently executing and communicating tasks. One major obstacle is the amount of debugging data, which needs to be analyzed for detecting errors and their causes. The debugging tool DeWiz addresses this problem by partitioning the analysis activities into different, independent modules, and distributing these modules on the available, possibly distributed, computing environment. The target of the analysis activities is the event graph, which represents a program's behavior in terms of program state changes and corresponding relations. In DeWiz we distinguish three different kinds of modules, namely Event Graph Generation Modules, which generate an event graph stream from the collected trace data, (Automatic) Analysis Modules, whose task is to perform various analysis techniques on the event graph, and Data Access Modules, which present the results of the event graph analysis to the user in a meaningful way. The interconnection between the different modules is established by a dedicated protocol, which is currently based on TCP/IP or uses shared memory. This allows any user of DeWiz to easily integrate arbitrary software analysis tools with DeWiz.

1. OVERVIEW OF PROGRAM ANALYSIS WITH DEWIZ

1.1. Motivation and related work
Software development for High-Performance Computers (HPC) faces several serious challenges. One example is the possible scale of the programs, determined by long execution times and large numbers of participating processes. As a result, huge amounts of program state data have to be processed during program analysis activities such as testing and tuning. This problem is addressed by the debugging tool DeWiz (Debugging Wizard), which evolved out of the Monitoring and Debugging environment MAD [6] as a possible solution. The original approach of MAD was repeatedly affected by the amount of debugging data. Long waiting times during interactive debugging sessions and the impracticability of certain analysis techniques represented substantial drawbacks, especially for real-world applications. Similar approaches to MAD are provided by commercial products, such as VAMPIR [1], or other related tools in this area, e.g. AIMS [18], Paje [2], Paradyn [11], Paragraph [3], and PROVE [4]. Each of these tools employs some kind of space-time diagram as a representation of program behavior. By analyzing these related approaches, some basic characteristics and differences can be identified:

• Analysis goals: performance analysis vs. error debugging. Most software tools support only one analysis area, focusing either on performance aspects
of a program or on the correctness of its behavior. Some tools provide additional support for other tasks of the software life-cycle, e.g. visual parallel programming.

• Means of communication/execution: shared memory vs. message passing. Depending on the underlying hardware architecture, runtime system, or development environment, software tools usually support only one kind of programming paradigm. An additional distinction is the parallel execution itself, e.g. whether threads or processes are used.

• Levels of abstraction: high-level code vs. machine instructions. Only few tools provide program analysis activities on different levels of abstraction, while most are pre-determined by the chosen level of instrumentation.

• Approaches to instrumentation: static source vs. dynamic binary instrumentation. Instrumentation of programs allows us to distinguish between various approaches, with static instrumentation of source code on one side of the spectrum and dynamic instrumentation of binary machine code on the other.

• Connection between monitor and analysis tool: on-line vs. post-mortem. Another related characteristic of software tools is the connection between the monitoring part, which extracts the behavioral data, and the analysis part, which investigates the program's behavior. In brief, on-line program analysis is required whenever changes to the execution behavior should be applicable on the fly, while post-mortem analysis is chosen if the analysis technique requires information about the complete execution history of a program.

The reasons for all these distinctions are often given by the particular interests of the involved software tool developers. Besides that, there are also some tools supporting characteristics that seem converse at first. For example, the VAMPIR tool supports both shared-memory OpenMP programs [14] and message-passing programs using MPI [13]. This originates from the fact that some of today's architectures are best utilized by using a hybrid MPI/OpenMP programming style [15], and more tools have already been proposed to follow this mixed-mode programming style. During our work on the MAD environment, further evidence emerged: the event manipulation technique originally developed for message-passing programs required only minor changes to be useful for shared-memory codes [5]. With small extensions, most of them in the monitoring code, MAD was also applicable to performance analysis activities. These experiences motivated us to extend the originally conservative approach of using the event graph for message-passing codes only into a more universal solution, whose result is the program analysis tool DeWiz. The ideas of DeWiz, namely

• the representation of a program's behavior as an event graph,
• the modular, hence flexible and extensible approach, and
• the graphical representation of a program's behavior

will be described in more detail in the following sections.
2. EVENT GRAPH

The theoretical foundation of the DeWiz debugging environment is the event graph, which has been defined as follows:
Definition 1 (Event Graph [8]). An event graph is a directed graph G = (E, →), where E is the non-empty set of events e ∈ E, while → is a relation connecting events, such that x → y means that there is an edge from event x to event y in G with the "tail" at event x and the "head" at event y.
The events e ∈ E are the recorded events observed during a program's execution, like for example send or receive events in a message-passing program or semaphore locks and unlocks in a shared memory environment, respectively. The relation connecting the events of an event graph is the so-called happened-before relation, which is defined as follows:
Definition 2 (Happened-before relation [10]). The happened-before relation → is defined as the union of the relations →s and →c, where →s is the sequential order of events relative to a particular responsible object, while →c is the concurrent order relation connecting events on arbitrary responsible objects.

The sequential order relation →s simply states that if two events e_p^i and e_p^j occur on the same process p and event e_p^i occurred before event e_p^j, then e_p^i →s e_p^j.

The concurrent order relation →c defines inter-process relations of events. In the case of message-passing programs it means that if event e_p^i occurs on process p and is a send event, and event e_q^j occurs on process q and is the corresponding receive event, then e_p^i →c e_q^j.

Based on these definitions we have implemented the DeWiz protocol, which allows an event graph stream to be propagated through a DeWiz system.
3. DEWIZ PROTOCOL AND FRAMEWORK

The DeWiz protocol defines the communication of event graph streams between different DeWiz modules. Based on this protocol we use the abstract concepts of interfaces and communication channels. One channel connects exactly two modules, channels are unidirectional, and each module has exactly one interface (incoming or outgoing) per channel. There are two approaches to implementing these channels and interfaces. The first one is to use TCP/IP and BSD sockets [16]. This approach enables the tool to be distributed across different computing resources or even to use grid infrastructures, which are currently deployed all over the world. For example, in a DeWiz debugging session it might be feasible to execute an event graph generation module on the same computer where the application under observation is executed. The event graph generated by this module could be forwarded to a (probably resource intensive) analysis module which is executed on a different computer, in order to provide as much computing power as possible to the observed application (and not having to share resources with analysis tasks). The analysis module could then be connected to
a visualization module, which resides on a third computer, e.g. the workstation of the DeWiz user. However, in some cases this approach might not be useful, e.g. when the amount of trace data, or the size of the event graph respectively, exceeds certain limits. In this case it is not efficient, or even impossible, to send the event graph stream over the network, simply because of its size. In such a case the event graph generation module and the analysis module(s) must reside on the same computer to avoid copying of the event graph. For that purpose DeWiz provides shared memory interfaces in addition to TCP/IP. When two modules are connected via such a shared memory interface, the "producer" module writes the event graph data into a shared memory segment, while possible "consumer" modules can read the data from that segment, thus following the "zero copy" paradigm. To enable the propagation of an event graph with the DeWiz protocol, several data structures have been defined. Simplified, an event graph stream consists of two kinds of such data structures:

event: e_p^i = (p, i, type, data)
concurrent order relation: e_p^i →c e_q^j

The event data structure corresponds to a particular state change observed in a program's execution. The variables p and i identify the responsible object on which the event occurred and its relative sequential order, respectively. The variable type determines the kind of observed event, e.g. in message-passing programs a send or receive. The data field is used for optional attributes, which describe the event in more detail depending on the intended analysis activities. The concurrent order relation represents a subset of the happened-before relation stated above. It is used to mark corresponding events on distinct processes whose operations are somehow connected, e.g. a corresponding pair of send and receive operations. (Note that events on the same host object p are already ordered by their sequential identifier i.) The DeWiz framework offers the required functionality to implement DeWiz modules in a particular programming language. At present, the complete functionality of DeWiz is supported in Java, while smaller fragments of the analysis pipeline are already available in C and C++. Each module must implement the following functionality:

• open the event graph stream interface,
• filter relevant events for processing,
• process the event graph stream as desired,
• close the event graph stream interface.

The functions to open and close interfaces are used to establish and destroy interconnections between modules. The interfaces transparently implement the DeWiz protocol, while filtering and processing events is performed within the main loop of each module.
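To make the two stream records and the module structure concrete, here is a minimal Java sketch. All class and method names (StreamEvent, OrderRelation, the open/close calls, etc.) are assumptions for illustration and do not reproduce the actual DeWiz framework classes.

```java
// Hypothetical sketch of the two kinds of records in an event graph stream
// and of the main loop common to DeWiz modules (open, filter, process, close).
class StreamEvent {
    int p;          // responsible object (e.g. process) on which the event occurred
    int i;          // relative sequential order of the event on p
    int type;       // kind of state change, e.g. a send or a receive
    byte[] data;    // optional attributes, depending on the intended analysis
}

class OrderRelation {
    int p, i;       // "tail" event e_p^i (e.g. a send)
    int q, j;       // "head" event e_q^j (e.g. the corresponding receive)
}

abstract class DeWizModule {
    // Establish and destroy interconnections; the interfaces hide whether
    // TCP/IP sockets or a shared memory segment is used underneath.
    abstract void openInterfaces();
    abstract void closeInterfaces();

    abstract boolean isRelevant(StreamEvent e);   // filtering step
    abstract void process(StreamEvent e);         // analysis step

    // Main loop: filter and process the incoming stream
    // (order relations would be consumed analogously).
    void run(Iterable<StreamEvent> stream) {
        openInterfaces();
        for (StreamEvent e : stream) {
            if (isRelevant(e)) {
                process(e);
            }
        }
        closeInterfaces();
    }
}
```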
Figure 1. DeWiz system during event graph stream processing. Four monitors generate the event graph stream - presumably through on-line observation of 4 processes. A merger module and 2 buffer modules combine and cache the event graph stream. A pattern matching module and a group detection module perform automatic analysis, while the results of these analysis activities are presented in a visualization tool.
4. DEWIZ MODULES

The processing of DeWiz is performed on the above-mentioned data structures within DeWiz modules. The modules are assembled as a DeWiz system, which represents the intended program analysis pipeline for a particular debugging or performance analysis strategy. In Figure 1 an example DeWiz system is shown. Depending on their processing capabilities, different types of modules can be distinguished:
4.1. Event graph generation modules

At least one module is required for generating an event graph stream corresponding to a program's execution. The events can be generated either on-line with a monitoring tool, or post-mortem by reading corresponding trace files. In case DeWiz is utilized as a plug-in for existing program analysis tools, the event graph stream is generated by converting the specific data structures of the host tool into the event graph stream protocol. Example modules already provided by the DeWiz framework include an on-line interface to OCM (OMIS Compliant Monitor) [17] and OPARI (OpenMP Pragma and Region Instrumentor) [12][5], as well as a trace reader for the monitoring tool NOPE [7].

4.2. Automatic analysis modules

The actual operations on the event graph are performed by automatic analysis modules. These modules extract the desired information and try to detect interesting characteristics of the program's behavior. Example modules are already provided for error detection, e.g. to determine communication errors in message-passing programs or race conditions at semaphore operations in shared memory programs. More elaborate examples include a pattern matcher for repeated behavioral patterns and a module for detection of process grouping and subsequently process isolation [9].
4.3. Data access modules
After or during processing of the event graph stream, detected program characteristics can be presented to the user. Different kinds of data access modules support a variety of user interfaces. In most cases, DeWiz forwards the results to a visualization module that represents the event graph as a kind of space-time diagram. Example modules include ATEMPT, the visualization tool of the MAD environment, a failure notification mechanism for cellular phones, and a Java applet for arbitrary web browsers.

5. EXAMPLES

The analysis functionality already implemented in DeWiz is described with the following two examples, namely extraction of communication failures, and pattern matching and loop detection. Communication failures can be detected by pairwise analysis of communication partners. A set of analysis activities for message-passing programs is described in [8]. An example is the detection of different message lengths at send and receive operations, which may formally be defined as follows: two events e_p^i and e_q^j are called events with different message length, if e_p^i →c e_q^j and messageLength(e_p^i) ≠ messageLength(e_q^j). This formal definition can easily be mapped onto a DeWiz module with the available Java framework. During analysis, the module detects every communication pair where the send operation transmits more or fewer bytes than the receive operation expects. A more complex analysis activity compared to the extraction of communication failures is pattern matching and loop detection. The goal of the corresponding DeWiz modules is to identify repeated process interaction patterns in the event graph. Some example event graph patterns are given in Figure 2. The leftmost pattern is called a simple exchange pattern, which describes the situation where two arbitrary processes mutually exchange some kind of data item. The existence of this simple event graph pattern can easily be verified within a DeWiz module. More complex patterns can be specified and provided in a pattern database according to the needs of users and the characteristics of their programs. Vice versa, the user may even specify expected patterns of a program, and DeWiz tries to locate them in the event graph of the observed execution. This allows the complexity of the analysis data to be decreased, if patterns are detected in a parallel program, or incorrect behavior to be detected, if expected patterns are missing. Some more details about pattern matching can be found in [8].
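Returning to the first example, the "different message length" check above could be mapped onto the processing step of a module roughly as follows. This is a sketch only, reusing the hypothetical record classes from Section 3; the way a send is paired with its receive and the use of the data field as the message length are simplified assumptions, not the actual DeWiz implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical processing step for the "different message length" check:
// a send event e_p^i is paired with its matching receive e_q^j via the
// concurrent order relation delivered in the stream.
class MessageLengthChecker {
    // Pending send events, keyed by (p, i) encoded as "p:i".
    private final Map<String, StreamEvent> pendingSends = new HashMap<>();

    void onSend(StreamEvent send) {
        pendingSends.put(send.p + ":" + send.i, send);
    }

    // Called once the relation e_p^i ->c e_q^j and the receive event are known.
    void onReceive(OrderRelation rel, StreamEvent receive) {
        StreamEvent send = pendingSends.remove(rel.p + ":" + rel.i);
        if (send != null && messageLength(send) != messageLength(receive)) {
            report(send, receive);   // detected: different message length
        }
    }

    private int messageLength(StreamEvent e) {
        return e.data == null ? 0 : e.data.length;   // simplified assumption
    }

    private void report(StreamEvent send, StreamEvent receive) {
        System.out.println("Different message length between e_" + send.p + "^" + send.i
                + " and e_" + receive.p + "^" + receive.i);
    }
}
```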
6. SUMMARY

The modular approach of DeWiz allows parallel and distributed program analysis based on a set of independent analysis modules. The target representation of program behavior is the abstract event graph model, which allows a wide variety of analysis activities for different kinds of programs on different levels of abstraction. The analysis modules may be arbitrarily distributed across a set of available resources, which allows even large amounts of program state data to be processed in reasonable time. The latter is especially interesting with respect to grid computing, which may be just the right environment for a future DeWiz grid debugging service.
Figure 2. Process interaction patterns. The leftmost pattern is called a simple exchange pattern. In the middle a round-robin pattern is shown, while the right screenshot shows a tree-like communication pattern.
REFERENCES
[1] H. Brunst, H.-Ch. Hoppe, W.E. Nagel, and M. Winkler. Performance Optimization for Large Scale Computing: The Scalable VAMPIR approach. Proc. ICCS 2001, Intl. Conference on Computational Science, Springer-Verlag, LNCS, Vol. 2074, San Francisco, CA, USA (May 2001).
[2] J. Chassin de Kergommeaux, B. Stein. Paje: An Extensible Environment for Visualizing Multi-Threaded Program Executions. Proc. Euro-Par 2000, Springer-Verlag, LNCS, Vol. 1900, Munich, Germany, pp. 133-144 (2000).
[3] M.T. Heath, J.A. Etheridge. Visualizing the Performance of Parallel Programs. IEEE Software, Vol. 13, No. 6, pp. 77-83 (November 1996).
[4] P. Kacsuk. Performance Visualization in the GRADE Parallel Programming Environment. Proc. HPC Asia 2000, 4th Intl. Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, Peking, China, pp. 446-450 (2000).
[5] R. Kobler, D. Kranzlmüller, and J. Volkert. Debugging OpenMP Programs using Event Manipulation. Proc. WOMPAT 2001, Intl. Workshop on OpenMP Applications and Tools, Springer-Verlag, LNCS, Vol. 2104, West Lafayette, Indiana, USA, pp. 81-89 (July 2001).
[6] D. Kranzlmüller, S. Grabner, and J. Volkert. Debugging with the MAD Environment. Parallel Computing, Vol. 23, No. 1-2, pp. 199-217 (Apr. 1997).
[7] D. Kranzlmüller, J. Volkert. Debugging Point-To-Point Communication in MPI and PVM. Proc. EuroPVM/MPI 98, Intl. Conference, Liverpool, GB, pp. 265-272 (Sept. 1998).
[8] D. Kranzlmüller. Event Graph Analysis for Debugging Massively Parallel Programs. PhD thesis, GUP, Joh. Kepler Univ. Linz (Sept. 2000), http://www.gup.uni-linz.ac.at/~dk/thesis.
[9] D. Kranzlmüller. Scalable Parallel Program Debugging with Process Isolation and Grouping. Proc. IPDPS 2002, 16th International Parallel & Distributed Processing Symposium, Workshop on High-Level Parallel Programming Models & Supportive Environments (HIPS 2002), IEEE Computer Society, Ft. Lauderdale, Florida (April 2002).
[10] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, Vol. 21, No. 7, pp. 558-565 (July 1978).
[11] B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam, and T. Newhall. The Paradyn Parallel Performance Measurement Tool. IEEE Computer, Vol. 28, No. 11, pp. 37-46 (November 1995).
[12] B. Mohr, A.D. Malony, S. Shende, and F. Wolf. Design and Prototype of a Performance Tool Interface for OpenMP. Proc. LACSI Symposium 2001, Los Alamos Computer Science Institute, Santa Fe, New Mexico, USA (October 2001).
[13] Message Passing Interface Forum. MPI: A Message Passing Interface Standard - Version 1.1. http://www.mcs.anl.gov/mpi/ (June 1995).
[14] OpenMP Architecture Review Board. OpenMP C and C++ Application Program Interface - Version 2.0. http://www.openmp.org/ (March 2002).
[15] R. Rabenseifner. Communication and Optimization Aspects on Hybrid Architectures. Proc. EuroPVM/MPI 2002, Springer-Verlag, LNCS, Vol. 2474, Linz, Austria, pp. 410-420 (2002).
[16] W. Richard Stevens. UNIX Network Programming. Prentice Hall (1990).
[17] R. Wismüller, J. Trinitis, and T. Ludwig. OCM - a Monitoring System for Interoperable Tools. Proc. SPDT 98, 2nd SIGMETRICS Symposium on Parallel and Distributed Tools, ACM Press, Welches, Oregon, USA, pp. 1-9 (August 1998).
[18] J.C. Yan, H.H. Jin, and M.A. Schmidt. Performance Data Gathering and Representation from Fixed-Size Statistical Data. Technical Report NAS-98-003, http://www.nas.nasa.gov/Research/Reports/Techreports/1998/nas-98-003.pdf, NAS System Division, NASA Ames Research Center, Moffet Field, California, USA (February 1999).
Toward a Distributed Computational Steering Environment based on CORBA

O. Coulaud a, M. Dussere a, and A. Esnard a

a Projet ScAlApplix, INRIA Futurs et LaBRI UMR CNRS 5800, 351, cours de la Libération, F-33405 Talence, France

This paper presents the first step toward a computational steering environment based on CORBA. This environment, called EPSN¹, allows the control, the data exploration and the data modification of numerical simulations involving an iterative process. In order to be as generic as possible, we introduce an abstract model of steerable simulations. This abstraction allows us to build steering clients independently of a given simulation. This model is described with an XML syntax and is used in the simulation by means of source code annotations. EPSN takes advantage of the CORBA technology to design a communication infrastructure with portability, interoperability and network transparency. In addition, the in-progress parallel CORBA objects will give us a very attractive framework for extending the steering to parallel and distributed simulations.

1. INTRODUCTION

Thanks to the constant evolution of computational capacity, numerical simulations are becoming more and more complex; it is not uncommon to couple different models in different distributed codes running on heterogeneous networks of parallel computers (e.g. multi-physics simulations). For years, the scientific computing community has expressed the need for new computational steering tools to better grasp the complexity of the underlying models. Computational steering is an effort to make the typical simulation work-flow (modeling, computing, analyzing) more efficient, by providing on-line visualization and interactive steering over the on-going computational processes. The on-line visualization appears very useful to monitor and detect possible errors in long-running applications, and the interactive steering allows the researcher to alter simulation parameters on-the-fly and immediately receive feedback on their effects. Thus, the scientist gains a better insight into the simulation regarding the cause-and-effect relationship. A computational steering environment is defined in [1] as a communication infrastructure coupling a simulation with a remote user interface, called a steering system. This interface usually provides on-line visualization and user interaction. Over the last decade, many steering environments have been developed; they distinguish themselves by some critical features such as the simulation integration process, the communication infrastructure and the steering system design. A first solution for the integration is the problem solving environment (PSE) approach, as in SCIRun [2]. This approach allows the scientist to construct a steering application according to a visual programming model. At the opposite end, CAVEStudy [3] only interacts with
¹ The EPSN project (http://www.labri.fr/epsn) is supported by the French ACI-GRID initiative.
the application through its standard input/output. Nevertheless, the majority of the steering environments, such as the well-known CUMULVS [4], are based on the instrumentation of the application source code. We have chosen this approach as it allows fine-grain steering functionalities and achieves good runtime performance. Regarding the communication infrastructure, there are many underlying issues, especially when considering parallel and distributed simulations: heterogeneous data transfer, network communication protocols and data redistributions. In VIPER [5], RPCs and the XDR protocol are used to implement the communication infrastructure. Magellan & Falcon [6] communicates over heterogeneous environments through an event system built upon DataExchange. CUMULVS uses a high-level communication infrastructure based on PVM and makes it possible to collect data from parallel simulations with HPF-like data redistributions. In the EPSN project, we intend to explore the capabilities of the CORBA technology and of the parallel CORBA objects currently under development for the computational steering of parallel and distributed simulations. In this paper, we first describe the basis and the architecture of EPSN. Then, we illustrate the integration process of a simulation. Finally, we present preliminary results on the EPSN prototype, called Epsilon.
2. THE EPSN ENVIRONMENT
2.1. Principles

EPSN is a steering platform for the instrumentation of numerical simulations. In order to be as generic as possible, we introduce an abstract model of steerable simulations. This model intends to clearly identify where, when and how a remote client can safely access data and control the simulation. We consider numerical simulations as iterative processes involving a set of data and a hierarchy of computation loops modifying these data. Each loop is associated with a single counter, and the association of all these counters enables EPSN to precisely follow the global time-step evolution. The "play/stop" control operations imply the definition of breakpoints at some stable states of the simulation. In practice, we can easily steer most simulations by simply instrumenting their main loop. The basic access operations consist in extracting and modifying the data. As these data alternate between coherent and incoherent states during the computation, this implies the definition of restricted areas where data are not accessible. Several data can be logically associated to define a group, which enables the end-user to efficiently manipulate correlated data together. Moreover, we have extended the coherency definition for groups to guarantee that all group members are accessed at the same iteration. This model, which fits parallel applications (SPMD) well, needs to define a global time extension to maintain data coherency over coupled or distributed simulations (MPMD). Such a mechanism implies an explicit association of the loops and breakpoints of the different simulation components. The representation in the abstract model is obtained by pointing out the relevant information on the simulation. First the user describes the simulation elements in an XML file, then he connects the simulation with its representation. To do so, he annotates the source code with the EPSN API in order to locate the elements he has identified and to mark their evolution through the simulation process. The XML description also intends to lighten and clarify this annotation phase.
Figure 1. (a) Detail of the EPSN architecture. (b) EPSN parallel architecture with PaCO++.
2.2. Architecture and communication infrastructure

EPSN is a distributed and dynamic infrastructure. It is based on a client/server relationship between the steering system (the client) and the simulation (the server), which both use EPSN libraries. The clients are not tightly coupled with the simulation. Actually, they can interact on-the-fly with the simulation through asynchronous and concurrent requests. According to this model, different steering systems can concurrently access the same simulation and, reciprocally, a steering system can simultaneously access different simulations. These characteristics make the EPSN environment very flexible and dynamic. The communication infrastructure of EPSN is based on CORBA, but it is completely hidden from the end-user. CORBA enables applications to communicate in a distributed heterogeneous environment with network transparency according to an RPC programming model. It also provides EPSN with interoperability between applications running on different machine architectures. Although CORBA is often criticized for its performance, some implementations are very efficient [7]. The principle of the EPSN infrastructure is to run a permanent thread attached to each simulation process. This thread contains a multi-threaded CORBA server dedicated to the communications between the steering clients and the simulation. As shown in Figure 1(a), the simulation thread consists of different CORBA objects corresponding to EPSN functionalities. To be fully asynchronous, EPSN uses oneway CORBA calls, and the client thread also implements a callback object to receive data from the simulation. In other steering environments, like CUMULVS, the simulation is in charge of the communications, which occur during a single blocking subroutine call. In EPSN, the thread accesses a data item directly through the shared memory of the process, without any copy. Between a process and the EPSN thread, the communications use standard inter-thread synchronization mechanisms based on semaphores and signals. This strategy, combined with the asynchronous CORBA calls, allows the communications to be overlapped. In order to maintain a single representation of the simulation, we use a specific CORBA object, the proxy, running on the first simulation process. This object provides the description of the whole simulation and all the CORBA references needed by both the client and the other simulation processes. As the proxy is registered with the CORBA naming service, remote clients can easily connect to it. In order to achieve coherent steering operations on SPMD simulations, we have developed some protocols to synchronize the simulation processes. This synchronization implies broadcasting the request to all the involved processes and synchronizing on the first
breakpoint before completing the parallel request. To reduce the synchronization cost, the parallel processes can also synchronize once at the beginning and keep going synchronously after that. The parallel extension of CORBA objects, like PaCO++ [8], reduces the synchronization cost thanks to a better use of the parallel infrastructure. PaCO++ typically exploits an internal communication layer based on MPI (Fig. 1(b)). It also greatly eases the extension of EPSN to parallel clients and for the inherent problem of data redistribution. On-going work focuses on the integration of PaCO++ in EPSN and especially on the full support of regular data decompositions.

2.3. Functionalities

EPSN consists of two C/Fortran libraries: the first one provides functions to build a steerable simulation and the second one enables to build a remote steering system.

Control. The insertion of barrier function calls, acting as debugger breakpoints, in the simulation source code allows the user to control the execution flow of the simulation. The breakpoints can be remotely set "up" or "down" (setbarrier command) and they allow classical control commands (play, step, stop). Moreover, the calls to the iterate function point out the evolution of the simulation through the loop hierarchy.

Data Extraction. On the client side, the user can remotely extract data from the simulation by calling get functions. The client manages such a request with wait and test MPI-like functions. On the simulation side, data access is protected by lock/unlock functions, which delimit the "coherence areas" in the source code. Therefore, data sending can be done immediately when receiving a get request, or delayed if the data is not accessible yet. Once the data is received by the client, it can be automatically copied into the client memory within the same lock/unlock areas as for the simulation, or it can trigger a treatment defined by the user through a callback function call. The user can also request a data item permanently, with the getp/cancelp functions, in order to continually receive new data releases and produce "on-line" visualization. In this case, an acknowledgment system automatically regulates the data transfer from the simulation to the client by voluntarily ignoring some data releases to avoid congesting the client. Nevertheless, a flush function can be used to force the data sending at each time-step. Thus, it prevents the client from missing any release, but it can slow down the simulation according to the client load.

Data Modification. In the same way, the client can modify a data item by calling the put function, which transfers data from the client memory to the simulation.
3. BUILDING A STEERABLE APPLICATION

In this section, we present the integration of EPSN steering functionalities into a parallel fluid flow simulation software, FluidBox [9]. This MPI Fortran code is based on a finite volume approximation of the Euler equations on unstructured meshes and simulates a two-fluid spray injection. We detail the XML description of this simulation in the abstract model, the instrumentation of the source code and the different solutions proposed in EPSN to construct steering systems.

3.1. XML description of the simulation

The first phase in the construction of a steerable simulation consists of its description through an XML file. This description is the representation of the simulation shared by the simulation
<!DOCTYPE simulation SYSTEM "Epsilon.dtd">
<simulation name="spray" context="parallel">
  <scalar id="nbNodes" type="long" access="readonly" location="replicated"/>
  <scalar id="nbTri" type="long" access="readonly" location="replicated"/>
  <sequence id="NodesCoord" type="double" location="replicated"> ... </sequence>
  <sequence id="Cells" type="long" location="replicated"> ... </sequence>
  <sequence id="Energy" type="double" location="distributed"> ... </sequence>
</simulation>
Figure 2. FluidBox short XML description.
! --- Simulation initialization ---
CALL ReadMesh(Mesh, MeshFile)
CALL Init(Data, Mesh, Var)
! --- Epsilon initialization ---
CALL epsilon_init('simu.xml', numproc, nbprocs, ierr)
CALL epsilon_publish('nbNodes', Mesh%Npoint, ierr)
CALL epsilon_publish('nbTri', Mesh%Nelemt, ierr)
CALL epsilon_publish('NodesCoord', Mesh%coor(1,1), ierr)
CALL epsilon_publish('Cells', Mesh%nu(1,1), ierr)
CALL epsilon_publish('Energy', Var%Ua(1,1), ierr)
CALL epsilon_publishgroup('mesh', ierr)
IF (numproc.EQ.0) THEN              ! Barrier on master process
   CALL epsilon_barrier('begin', ierr)
END IF
CALL MPI_Barrier(MPI_COMM_WORLD, ierr)
CALL epsilon_unlockall(ierr)
DO kt = kt0+1, ktmax                ! Simulation loop
   CALL Inject(Data, Mesh, Var)        ! Fluid injection
   CALL epsilon_barrier('mid', ierr)   ! Barrier distributed on all processes
   ! --- Modify field values inside locked area ---
   CALL epsilon_lock('Energy', ierr)
   Var%Ua(:,:) = Var%Un(:,:)
   CALL epsilon_unlock('Energy', ierr)
   CALL Post(Mesh, Var)
   CALL epsilon_iterate('loop', ierr)
END DO
! --- Simulation ending ---
CALL WriteResult(Data, Mesh, Var)
CALL epsilon_exit(ierr)
Figure 3. FluidBox instrumented pseudo-code.
thread and by the client thread. On the simulation side, the XML is parsed at the initialization of EPSN to build all the necessary structures and to parameterize the instrumentation. The clients dynamically get this description from the simulation, so they do not need direct access to the XML file. As shown in Figure 2, the XML file mainly contains the simulation name, a description of the simulation scheme with loops and breakpoints (control XML element) and a description of all the published data (data XML element). Scalar data and sequences (arrays) are precisely described with their type, their access permission and their location (on the master process, replicated on all processes or distributed over all processes). The dimensions of a sequence must be detailed with their size, their offset and their decomposition (block, cyclic, etc.) in the distributed case. Moreover, the XML group elements allow the user to logically associate different data (e.g. the FluidBox mesh group).

3.2. Instrumentation

As we already said, the integration of an existing simulation is done through source code annotation with a few function calls. Figure 3 presents the instrumented pseudo-code of FluidBox with the three classical phases: the initialization, the simulation loop and the ending phase. The initialization of the whole EPSN infrastructure is simply done by calling the init function with the XML file name as argument. When considering parallel simulations, each process must also indicate its rank and the number of processes involved in the simulation. Then, each data item described in the XML has to be pointed to in the process memory and published (publish). Within the simulation body itself, one has to mark the loop evolution (iterate) and to locate the breakpoints (barrier) defined in the XML file. One also has to place the data access areas (lock/unlock) and explicitly signal new data releases (release).
Figure 4. EPSN generic client.

Figure 5. Semi-generic client (FluidBox).
In the example, only the energy field is modified during the iteration, so the other mesh components remain accessible during the whole computation. At the end of the process, the epsilon_exit call properly terminates the whole infrastructure. After that, if a client tries to connect or to send a request to the simulation, the EPSN call gets a CORBA exception and returns an error status. Finally, when one runs an instrumented simulation, it starts the EPSN CORBA server, which is accessible through the CORBA naming service and waits for client requests.
3.3. Visualization and steering system
EPSN proposes three different strategies to construct a remote steering system. One could implement a steering system directly with the CORBA interface (IDL), but it is more convenient to use the EPSN client API. The functionalities (see Section 2.3) of this API allow the user to build a specific client precisely adapted to its simulation. One can also use the EPSN generic client, implemented in Java/Swing (Fig. 4). This tool can control and access the data of any EPSN simulation. Data are presented through simple numerical datasheets or can use basic visualization plug-ins. An intermediate way consists of implementing semi-generic clients using generic EPSN modules dedicated to the control, the data access and the visualization of complex objects (unstructured meshes, molecules, regular grids). This approach suits visualization environments well (e.g. AVS/Express, Fig. 5).

4. PRELIMINARY RESULTS

We have implemented a prototype of the EPSN platform, called Epsilon. This prototype is written in C++ and is based on omniORB4 (http://omniorb.sourceforge.net), a high-performance implementation of CORBA, and the associated thread library (omniThread). The results of this section come from experiments realized on two PCs (Pentium IV) linked by a Fast Ethernet network (100 Mbps). Figure 6 presents the mean time needed by Epsilon to send data of different sizes (from 1 Kb to 4 Mb) to both a local and a remote client at each iteration (a getp request), without any computation performed. Remote Epsilon sendings are compared with the TCP/IP communications upon which omniORB communicates, and show that Epsilon's performance on data transfers is almost as good as TCP/IP's. Figure 7 presents the same experiment as the previous one, except that 80 ms of computation are performed per iteration, half of the time being in an unlocked access area.
Figure 6. Epsilon communication benchmark (mean sending time versus data size, in Kb, for an Epsilon local client, an Epsilon remote client and raw TCP/IP).

Figure 7. Data extraction from a simulation (mean time per iteration versus data size: the simulation alone, at 80 ms/iteration, compared with the simulation plus an Epsilon remote client).
These results are compared with the computation time added to the TCP/IP sending time of the data. This figure demonstrates the overlapping capabilities of a fully asynchronous approach, as implemented in Epsilon. Moreover, the remote Epsilon time is still under the sum of the computation time and the TCP/IP sending time. So, the Epsilon steering overhead is fully overlapped by the computation. Finally, we have evaluated the instrumentation cost in the second experiment. It is quite negligible (less than 1%) and does not depend on the clients, since they are clearly decoupled from the simulation.

5. CONCLUSION AND PROSPECTS

As shown in this paper, the EPSN architecture intends to provide a flexible and dynamic approach to computational steering. It proposes an instrumentation of existing simulations at low cost and capitalizes greatly on CORBA features. Moreover, the parallel CORBA objects provide a suitable solution for most SPMD cases (with regular data distributions). The prototype Epsilon reveals itself as a really lightweight and easy-to-use steering platform providing great interaction capabilities. Epsilon validates the model of integration based on both source code annotation and an XML description. It also shows that CORBA features are of great interest for the steering of applications, with good performance. The developments of EPSN are now oriented toward the integration of parallel and distributed simulations with irregular data distributions.

REFERENCES
[1] Jurriaan D. Mulder, Jarke J. van Wijk, and Robert van Liere. A survey of computational steering environments. Future Generation Computer Systems, 15(1):119-129, 1999.
[2] S.G. Parker, M. Miller, C. Hansen, and C.R. Johnson. An integrated problem solving environment: the SCIRun computational steering system. In Hawaii International Conference of System Sciences, pages 147-156, 1998.
[3] Luc Renambot, Henri E. Bal, Desmond Germans, and Hans J. W. Spoelder. CAVEStudy: An infrastructure for computational steering and measuring in virtual reality environments. Cluster Computing, 4(1):79-87, 2001.
[4] J. A. Kohl and P. M. Papadopoulos. CUMULVS: Providing fault-tolerance, visualization, and steering of parallel applications. Int. J. of Supercomputer Applications and High Performance Computing, pages 224-235, 1997.
[5] S. Rathmayer and M. Lenke. A tool for on-line visualization and interactive steering of parallel hpc applications. In Proceedings of the 11th IPPS'97, pages 181-186, 1997.
[6] Weiming Gu, Greg Eisenhauer, Karsten Schwan, and Jeffrey Vetter. Falcon: On-line monitoring for steering parallel programs. Concurrency: Practice and Experience, 10(9):699-736, 1998.
[7] Alexandre Denis, Christian Pérez, and Thierry Priol. Towards high performance CORBA and MPI middlewares for grid computing. In Proceedings of 2nd IWGC, pages 14-25, 2001.
[8] Alexandre Denis, Christian Pérez, and Thierry Priol. Portable parallel CORBA objects: an approach to combine parallel and distributed programming for grid computing. In Proc. of the 7th Intl. Euro-Par'01 Conference (EuroPar'01), pages 835-844, 2001.
[9] B. Nkonga and P. Charrier. Generalized parcel method for dispersed spray and message passing strategy on unstructured meshes. Parallel Computing, 28:369-398, 2002.
Parallel Decimation of 3D Meshes for Efficient Web-Based Isosurface Extraction

A. Clematis a, D. D'Agostino a, M. Mancini a, and V. Gianuzzi b

a IMATI-CNR Genova, {clematis,dago,mancini}@ge.imati.cnr.it
b DISI-University of Genova, [email protected]

Isosurface extraction is a basic operation that makes it possible to implement many types of queries on volumetric data. The result of a query for a particular isovalue is a Triangulated Irregular Network (TIN) that may contain a huge number of points and triangles, depending on the size of the original data set. In a distributed environment, due to the limits of bandwidth availability and/or to the characteristics of the client, it may be necessary to visualize the result of the query at lower resolution. The simplification process is costly, especially for huge data sets. In this paper we address the problem of efficiently parallelizing this process using a cluster of COTS PCs for a Web-based parallel isosurface extraction system.

1. INTRODUCTION

Nowadays large collections of 3D and volumetric data are available, and many people with expertise in different disciplines access these data through the Web. Isosurface extraction [2] is a basic operation that makes it possible to implement many types of queries on volumetric data. The product of an isosurface extraction is a Triangulated Irregular Network (TIN) containing a more or less large number of topological elements (triangles, edges and vertices), depending on the original data set. In order to efficiently transmit results from a Web server to a remote client, it is useful to simplify the TIN and to compress it [1]. In Figure 1 a typical scenario is depicted, where a client accesses a Web server in order to study a 3D data set, as discussed in [9]. Looking at the architecture of the system it is possible to point out three main components: the client, the interconnection network and the server. We assume that the client does not provide large computing resources (it may be, for example, a Personal Digital Assistant), but that it is able to perform basic rendering and visualization operations on 3D data. The interconnection network in turn may have variable characteristics, but often it represents the bottleneck of the system. In order to obtain acceptable performance it is important to reduce the amount of transmitted data or to use suitable strategies, like progressive transmission, which help to reduce the effect of transmission delay on the visualization process. The server is expected to be powerful enough to support multiple requests arriving from potential clients. Here we suppose the server to have a multi-tier organization with a Web front end, which provides the user interface, and one or more local clusters to execute the required process. In this configuration most of the computation is executed on the server side, where
parallel processing is a viable possibility to provide high performance. The computational pipeline executed on the server consists of an isosurface extraction step and possibly of a simplification and compression step. The resulting isosurface can be very large, so a user can decide to obtain a simplification at different levels, depending on the characteristics of the client device. The server may benefit from parallel computing at different levels. In this paper we deal with the parallelization of the simplification process using a cluster of COTS workstations. The remaining part of the paper is organized in the following way: in Section 2 we discuss the problem of mesh simplification; in Section 3 we provide a review of related works; in Section 4 we describe our proposal; in Section 5 conclusions are discussed.

2. SIMPLIFICATION OF TIN

The surface simplification process is an important topic in visualization: meshes are often composed of so many triangles that rendering is very difficult. Isosurfaces are typically no exception, because they are composed of hundreds of thousands (if not millions) of triangles, and even today's common workstations have problems rendering models of this size. Moreover, this process is of particular interest for the interrogation of remote data, since it may contribute to reducing the size of the data that must be transmitted. We can define mesh simplification as the process of reducing the number of primitives in a mesh M, obtaining a mesh M' which is a good approximation of the original surface. To better clarify this concept let us consider the structure of a mesh. A mesh is a polygonal (in our case triangular) model composed of a set of vertices and a set of triangles. It provides a single fixed-resolution representation of one or more objects. A simplification of the mesh M is a mesh M' resulting from the elimination, following an appropriate criterion, of a subset of the initial vertices and triangles. Several algorithms have been proposed; for a survey see [1]. We are interested in algorithms that preserve the original topology and a good approximation
Figure 1. A Web system for 3D data analysis. (The original figure shows the client side, the interconnection network, and the server side with its HTTP server, the parallel extraction system and the volumetric data set repository.)
of the original geometry. For these reasons we initially focused our attention on Schroeder's algorithm, called "Decimation" [3], which was originally applied to isosurfaces. The algorithm reduces the number of triangles by deleting some vertices using local operations on geometry and topology. It is an iterative algorithm, composed of three steps:

• classification of vertices, on the basis of their topology, as simple, non-manifold, or boundary;
• simple and boundary vertices are evaluated and ordered;
• the least important vertex and the triangles that use it are removed, and the resulting hole is patched by forming a new local triangulation.

The process is repeated until some termination condition is met (e.g. the number of remaining triangles is lower than a threshold). In [5] the vertex removal operation is replaced with the edge collapse operation. A weight is assigned to each edge, the edges are ordered and the least important one is removed by contracting it into a point, with the consequent elimination of the triangles that share it. In this manner a lot of expensive consistency checks for the new triangulation resulting from vertex removal can be avoided. Furthermore, a new evaluation criterion, the quadric error metric, is proposed. The cost assigned to each edge represents the amount of error introduced in the mesh after its elimination. This criterion best preserves the quality of the resulting mesh because, at each iteration, the cost assigned to each edge modified by the contraction takes into account the error with respect to the original mesh rather than the current mesh. This error accumulation produces higher quality meshes. For these reasons we based our parallel implementation on Garland's algorithm.

3. RELATED WORKS

Several works have been done with the purpose of efficiently performing mesh simplification in parallel. In [6] a combination of the message passing and shared memory paradigms is used. The algorithm uses the master-worker scheme and a data parallel approach. Masters (called priority queue handle processes) are associated with priority queues related to the connected components. They maintain the topology and the priority queue order, providing the associated set of workers with edges to collapse. In [7] the approach is based on a greedy partition of the mesh made by the master and an independent simplification of each resulting set made by the workers. In this work the problem of the borders resulting from the mesh subdivision among workers is solved by a post-processing step. In [8] a similar scheme is adopted, with a minimization of worker communications. In [4] the concept of super independent set is at the basis of the proposed technique. The parallel removal of a set of vertices is possible only if they are "super independent": two vertices are super independent if they do not share any element of the boundary of the hole that would result from their elimination, so vertex removal operations do not influence each other. This parallel implementation of Schroeder's algorithm is based on a master-worker
162 approach. Worker evaluate the vertices, ordering them and sending the result to the master. The master creates an ordered list of super independent vertices, partitioning it among the workers that remove them. The process is repeated until the target simplification is achieved. Even if we exploit data parallelism and a master-worker scheme, our approach is slightly different. In our system not only the simplification algorithm, but also the isosurface extraction is performed in parallel. In this way data are already partitioned among workers, relying on data subdivision performed in order to extract isosurface in parallel proposed in [ 10]. Our aim is to exploit the data distribution resulting from the previous isosurface extraction avoiding costly redistribution, minimizing master-worker communications and producing an high quality mesh. For this reason we base our parallel computation on the Garland algorithm. 4. PARALLEL DECIMATION OF ISOSURFACES Our parallel algorithm is designed to be executed in pipeline after a parallel isosurface extraction step. Our parallel system uses the master-worker scheme and the pipeline is composed by three steps: the parallel isosurface extraction, the parallel simplification and the creation of an output file. The first and third steps are explained in [ 10].
4.1. Parallel algorithm description At the beginning of the simplification step the representation of the isosurface is stored in the main memory of the cluster workstations. The sequential algorithm at first traverses vertices and evaluates them. A cost is then assigned to each edge, that are sorted using a priority queue. We chose to preserve meshes boundaries, because they could be relevant, e.g. in medical images analysis, so we do not collapse border edges. These two phases are proportional on the number of vertices and edges. With huge meshes the heap construction could require a critical amount of memory, while in the parallel version we significatively reduce memory requirements and we may achieve a linear speed-up on it because it is executed independently by the workers. In fact the only overlap between workers assignments is due to few edges that belong to the boundaries resulting from data partitioning. These edges presently are treated as border edges to make simpler the construction of an unique VRML file of the result, but we plan to treat them as internal edges. The maximum number of triangles to be eliminated is known in advance by the workers, but it may not be reached because of constraints to preserve mesh quality. Considering S the global number of triangles to eliminate, we have to globally delete 7s edges, because the collapse of non-border edges implies that two triangles are deleted. But having the purpose to achieve a good quality of the simplified mesh we don't want to let each worker delete the same percentage of its triangles, because some workers may have less regular parts of the meshes that should be preserved, while others may have large planar regions where an hard simplification is compatible with a very little degradation. For this reason we split the edge removal operation in three parts: the local selection of candidates, done by workers, a global sorting of candidates, done by master, and effective removing, done again by workers. s edges, considering them as candidates, updating the Each workers extracts its lower cost -~ cost of the neighboring edges in the heap. These modification are made over auxiliary structures that does not involve the model, that will be really modified in the effective removing step.
163 During this selection a worker recalculates weight for edges involved in the "simulated" contractions, marking as "candidate" edges and related vertices and faces, rearranging consequently the heap. Workers also create another list, containing pairs "number-cost", that summarizes how many edges have that costs, because for the master is not important the identification of candidate edges. This allows to send to the master a lighter list, ordered on the "cost" value. At this point the master examines the lists until it reaches the target number of edges, sending back to each worker the number of edges to remove. On the basis on master reply each worker collapses the given number of edges and adjusts the topology. In particular, considering that each edge collapse operation means the deletion of a vertex, it renames its vertices considering also the number of collapses made by the other workers, in order to produce a representation of its part of the isosurface that can be directly concatenate with the other by the master in the following output creation step. The pseudo-code for the algorithm is showed in Figure 2. ##Wo rke r s /* Isosurface
extraction step
... */
evaluate_edges () 9 build_heap () ; while (triangles > target_reduction) select_candidate () ; u p d a t e h e a p () ; }
{
build candidate list(); send(master, candidate_list). restore model() ; receive (master, new_target_reduction) while (triangles
> new_target_reduction)
remove_edge () ;
/* Output creation step ... */ ##Master /* Isosurface
extraction
for(i < nu~_workers) sort lists; for(i < n~_workers)
step ... * /
receive(worker[i], send(yorker[i],
/* Output creation step ...
candidate_list[i]) ;
new target_reduction[i])"
*/
Figure 2. The algorithm pseudo-code
4.2. Timing analysis and load balancing The sequential time to simplify a mesh is
Tseq - Theap 4- rcollapse 4- Tupd
(1)
where Theap is the time to evaluate edges and to build the heap, Tcouapseis the sum of the time spent modifying the model and Tupdis the sum of the time spent updating the heap after an edge collapse operation. The parallel time is
Tpar -- Theap 4- rsel 4- rtupd 4- Tcomm 4- Tsorting 4- rdel
(2)
164 where Tsez corresponds to the selection of the candidates, Taez the effective edges removing, T~omm the communication between master and workers and Tsort~ng the lists examination made by the master. T~omm and T~ort~ngare measured on the master, the other on the workers. Considering the sequential time we have that: - for very large data sets the amount of required memory may become critical. In particular Theap and Tupd may grow sharply due to the use of virtual memory; - for data sets that fit in main memory the dominant time is Tcollapse,followed by Th~ap and Tupd. In many cases we may consider the contribution of T~,pdnegligible. The parallel algorithm permits first of all to handle very large data sets because of the availability of a larger memory. Considering Equations 1 and 2 and our experiments we have that: - Tcom~ and T~o~t~ngrepresents an overload with respect to the sequential algorithm but their contribution is very limited; - the time reduction for heap construction is nearly linear (Th~ap vs. T~eap). This is possible because the original mesh is evenly distributed among workers at the end of the isosurface extraction step. In fact, as explained in [ 10], workers receive a part of dataset that will produce quite the same number of vertices (and consequently edges), with respect to the others; the time reduction for edge collapse operations is nearly linear. This time is identified by Tcollapse in the sequential algorithm and Tad in the parallel; there is no time reduction in the update phase, apart for the reduction due to better use of memory hierarchies because of smaller data sets. The update contribution for the sequential algorithm is Tupd and for the parallel algorithm it includes T~d and T~pd. For these considerations the algorithm should scale well over a greater number of processors, but the quality of the result may suffer because of the problem of the border edges introduced by data partitioning. The reduction can produce a (normally) little work unbalance between workers, but the effective edge collapse is not as relevant as in the sequential algorithm, in particular when edges number is huge. The edge removing consists into the deletion of an edge and into the replacement of its vertices with a new vertex, whose position was previously calculated. This unbalancing could be less relevant considering that workers must provide their part of the isosurface to the master moving files into a shared NFS partition. In this manner if workers terminate the operation in different time there is less traffic on the network and so a performance improvement. We have experimented our algorithm using a data set representing a CT scan of a bonsai using a cluster of four PC, equipped with 1.7 GHz Pentium IV processor and 256 MB RAM, connected via Fast Ethernet and running Linux. For isovalue 180 we obtained an isosurface made by 286,954 triangles, that we can reduce to, at minimum, 106,650 triangles with simplification. The sequential version takes 3:32 min. (Theap = 2:10 min. TcoIlapse= 28 sec. Tupd = 44 sec.) while with three workers we obtain a speed-up of about 2.6 times. Results are showed in Figure 3. -
-
5. CONCLUSIONS AND FUTURE W O R K S The main contribution of this paper is a new parallel algorithm for TIN simplification. The algorithm is based over [5] and exploits the knowledge of mesh structure, obtained during the isosurface extraction process. The advantages of this approach are due to the minimization of the communications number and size between processes and the reduction of the sequential part
165
Figure 3. The original (on the left) and simplified isosurfaces (on the right) representing a bonsai pot. of the algorithm, performed by the master. An improvement can be achieved by a progressive send of the candidate ordered list, in order to overlap the time spent by the master examining the lists and the creation of them. Futhermore we would reduce the number of candidates, considering for each of the N workers s edges and we would allow the elimination of edges belonging to borders resulting from f(N) data subdivision. ACKNOLEDGEMENTS
This work has been supported by CNR Agenzia 2000 Programme "An Environment for the Development of Multiplatform and Multilanguage High Performance Applications based on the Object Model and Structured Parallel Programming", by FIRB strategic Project on Enabling Technologies for Information Society, Grid.it, and by MIUR programme L. 449/97-99 SP3 "Grid Computing: enabling Technologies and Applications for eScience". REFERENCES
[i]
C. Gotsman, S. Gumhold, and L. Kobbelt, Simplification and compression of3D-meshes. Tutorials on multiresolution in geometric modeling, A. Iske, E. Quak, M. Floater (eds.), Springer, 2002 [21 W.E. Lorensen, and H.E. Cline, Marching Cubes: a high resolution 3D surface reconstruction algorithm. Computer Graphics, vol. 21, no. 4, 1987, pp. 163-169. [31 W. J. Schroeder, J. A. Zarge, and W. E. Lorensen, Decimation of triangle meshes. Computer Graphics, vol.26 n.2, 1992, pp.65-70. [41 M. Franc, and V. Skala, Parallel Triangular Mesh Reduction. In ALGORITHM 2000 proceedings, pp. 357-367, 2000. [5] M. Garland, and Paul S. Heckbert, Surface Simplification Using Quadric Error Metrics, Computer Graphics, vol. 31, 1997, pp. 209-216.
166 O. Schmidt, and M. Rasch, Parallel Mesh Simplification. In PDPTA 2000 proceedings, pp. 1801-1807. [7] C. Langis, G. Roth, and F. Dehne, Mesh Simplification in Parallel. In ICA3PP 2000 proceedings, pp. 281-290. [8] D. Brodsky, and B.A. Watson, Parallelization, small memories, and model simplification. In 11th Westem Canadian Computer Graphics Symposium proceedings, 2000, pp. 75-83. [9] A. Clematis, D. D'Agostino, W. De Marco and V. Gianuzzi, A Web-Based Isosurface Extraction System for Heterogeneous Clients. In 29th Euromicro Conference proceedings, 2003, pp. 148-156. [10] A. Clematis, D. D'Agostino, V. Gianuzzi, An Online Parallel Algorithm for Remote Visualization oflsosurfaces. In 10th EuroPVM/MPI proceedings, 2003. [6]
Parallel Programming
This Page Intentionally Left Blank
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
169
MPI on a Virtual Shared Memory F. BaiardP, D. Guerri ~, P. MorP, L. RiccP, and L. VaglinP ~Dipartimento di Informatica Universit/t di Pisa Via F.Buonarroti, 56125 - PISA To show the advantages of an implementation of M P I on top of a distributed shared memory, this paper describes MPIs14, an implementation of M P I on top of D V S A , a package to emulate a shared memory on a distributed memory architecture. D V S A structures the shared memory as a set of variable size areas and defines a set of operations each involving a whole area. The various kind of data to implement M P I , i.e. to handle a communicator, a point to point or a collective communication, are mapped onto these areas so that a large degree of concurrency, and hence a good performance, can be achieved. Performance figures show that the proposed approach may achieve better performances than more traditional implementations of collective communication. 1. INTRODUCTION Almost any M P I [6] implementation supports M P I primitives through a low level communication library. In most cases, a first layer implementing point to point operations is implemented on top of proprietary libraries and collective operations are implemented on the top of this layer. A few proposals [8, 7, 9] define a M P I run time support on the top of a shared memory layer. [7] exploits light-weight threads to execute M P I applications. Each M P I node is executed by a distinct thread. Point to point communications are implemented through a message queue shared between the partners of the communication. To guarantee that any M P I node can be safely executed as a thread, [7] defines a set of complex compile-time transformations. Since the correctness of these transformations can be proved only for programs not invoking external functions, the approach cannot be considered completely general. [8] proposes a M P I implementation on a shared memory multiprocessor where, for each pair of applicative processes, a distinct queue implements point-to-point communication between the processes. A key issue in the definition of a shared memory support of M P I is the reduction of the overhead introduced to preserve the consistency of shared data. Any optimised implementation of point to point communication primitives tries to minimise this overhead. For instance, [8] exploits lock-flee buffers to implement point to point communications, [7] assumes one writermultiple readers queues to simplify the lock-flee algorithm. A further challenging issues of a shared memory support of M P I is how a shared memory can simplify the complex protocols to implement M P I collective communications. Furthermore, no existing proposal considers a distributed virtual shared memory architecture where the cost of accessing shared data may be high, because data may be stored in the local memory of a remote node. For this reason, the data are clustered into pages, generally of the
170 same size, and a page is the basic data transfer unit to/from the shared memory. In the case of a virtual shared memory, the definition of M P I poses a new set of problems. The first one is the definition of a proper mapping of the M P I support data into the shared pages. This mapping should minimize the overhead due to synchronizations not required by the semantics of the M P I operations. For instance, data supporting M P I point to point communications between different partners or collective communications executed in different communicators should be mapped into distinct pages to minimize conflicting requests to the same data. Furthermore, any caching strategy to support M P I should be coherent with those to support the virtual memory. This paper present MPIsH, a run time support for M P I developed on the top of DVSA, Distributed Shared Areas, [ 1,3], a distributed shared memory abstract machine currently implemented on a Meiko CS2 architecture and on a cluster of Linux workstations. MPIsH supports M P I communicators as well as point to point and collective communications. D V S A structures the shared space as a set of areas where the size of each area is freely chosen within an architectural dependent range, when the area is declared at program startup. D V S A defines a set of functions to manage the areas. The Notification functions allow a process to declare all and only the areas it is going to share and to notify the termination of its operations on each area. The Synchronization functions set includes operations to acquire exclusive access to an area, i.e. they implement locks. The Access Class includes operations to read/write an area. To enhance the portability of the M P I support across distinct physical architectures supporting DVSA, the implementation of MPIsH exploits the D V S A constructs only. As an example, M P I non blocking communications are defined through D V S A non blocking primitives even if a thread mechanism supported by the architecture might be more efficient. The performance figures of MPIsH show that M P I collective communications can benefit of an implementation on top of a shared memory abstract machine. From another point of view, these figures confirm that one of the major problem of current implementation of M P I , on top of general purpose or special purpose message passing libraries, is an efficient strategy to support both M P I point to point and collective communications. The efficiency of collective communications cannot be neglected [4] because, while data parallel algorithms with static data allocation can be easily implemented through point to point communications only, most complex algorithms require some form of data re-mapping that heavily exploits collective communications. Our results suggest the adoption of a hybrid approach where M P I point to point communications could be directly implemented on top of the communication library of the considered architecture, while M P I collective communications could be implemented on top of a distributed memory system. The additional overhead due to this layer may be recovered if the implementation of each M P I primitive is simplified by properly exploiting the operations of the distributed memory. The overall implementation of MPIxH is presented in Section 2. Section 3 shows the strategy to support MPI communicators. The implementations of point to point and of collective communications are shown, respectively, in Section 4 and in Section 5. 
Section 2 shows some experimental results and draws the conclusions.
171 2. O V E R A L L S T R U C T U R E OF T H E I M P L E M E N T A T I O N An important assumption that has driven our implementation is that an effective M P I implementation should minimize the contention on an area due to synchronization operations issued by distinct processes. Furthermore, it should properly map the areas into the local memories of the processing nodes. Hence, the first step in the implementation of MPIsI4 has defined the overall structure of the areas to implement message exchange and process synchronization. These areas records both the data exchanged among processes and the state of the processes involved in an ongoing communication. According to our initial assumption, the data structures required to implement MPI communications are mapped onto the areas so that: 9 an area is locked through synchronization functions only if this is the only way to preserve the M P I semantics. Hence, data structures that do not require the invocations of synchronization functions should be mapped onto areas distinct from those recording data requiring this operation. 9 the address of an area A should be known to the processes exploiting the data stored in A only. Hence, an area implementing a M P I communicator C should be shared among the processes belonging to C only. In the same way, only the communication partners should access an area storing a point to point message. These principles can be satisfied by a dynamic allocation of the areas to the processes. A static allocation is not possible, because of the M P I semantics that does not support the definition of a static analysis that returns, for each process P, the communications it is involved in and the corresponding communicators. On the other hand, for efficiency reasons, D V S A does not support a dynamic management of the areas and each process defines the areas it is going to share at the beginning of the execution of its program. For this reason, the dynamic management of the areas is explicitly implemented by MPIsH. Before starting the processes execution, MPIsH defines a pool of areas shared by the processes. The size of the pool depends upon the number of the applicative processes and the maximum number of communicators to be supported. M P I s u primitives fetch areas from this pool and assign them to the requesting processes. In this way, an area A is fetched when a process starts a point to point communication and the address of A is notified to the communication partner when it executes the corresponding primitive. The addresses of the areas shared by a process are dynamically stored in local tables. Each process can access only the areas whose addresses are stored in its local tables. MPIsI4 structures the areas into a hierarchy, where each level of the hierarchy is characterized by the number of processes sharing an area of this level. At the top of the hierarchy we find areas always shared among all the processes of the application. These areas record global information, for instance a global counter to assign unique contexts to communicators. They also store a set of pointers to a pool of free areas to be allocated for communicators descriptors. Next we find areas shared by all the processes of the communicator. These areas record either the information on the communicator or to implement collective communications occurring within a communicator. The next levels of the hierarchy include areas to implement point to point communications that are shared by the two partners of the communications only.
172
DVSA Mother Page ............................................................................................... iMPI_COMM WORD IC~176 [ Descriptor ]
I C O ~ NICATOR........ 1
@
........
n
1
II ......II
SYNC_IN
........
SYNCjN
~
AREA_IN
........
AREA_IN
........
SYNC_OUTI
!
@
n
I 1 ..... I ISYNC-~
II
1
n
.....
11 n
I ..... I I
................................................................................................
Figure 1. MPI_COMM_WORLD Environment and Collective Areas
This structuring results in a better memory utilization, because the size of an area can be chosen according to the level it belongs to. Furthermore, the number of processes sharing an area decreases as the level of the area increases and better allocation strategies can be adopted for areas of the highest level. As an example, each area of the highest level is always allocated in the local memory of one partner of the communication. 3. IMPLEMENTATION OF MPI COMMUNICATORS The execution environment of MPIsH is set up by the function MPI_InitsH that allocates a pool of free areas, initially shared by all the processes and that supports the creation of a communicator. To avoid the bottleneck due to a single pool and to concurrently allocate areas to distinct communicators, MPIsH partitions the pool among the applicative processes. Each process stores in a local table the addresses of the free areas it has been assigned. Furthermore, the areas are partitioned according to both the communicator they are associated with and the semantics of the data they record. Each process taking part in the creation of a communicator assigns some of its areas to the new communicator. MPI_Initsu initialises Levelo areas, the top level of the hierarchy, and creates the MPI_COMM_WORD communicator. A further communicator C can be explicitly created through MPI_Comm_CreatesH. The creation of C is implemented through two successive phases. In the first phase, each process informs its partners of its participation in the creation of C by decreasing a counter in the descriptor of C allocated by the first process invoking MPI_Comm_Createsn. This process initialises the descriptor by storing a new context, the number of processes of the communicator and their identifiers. The context is produced through a global context counter allocated in a Levelo area. The descriptor of C also stores pointers to a set of free areas to support the communications within C. As an example, the left part of Fig 1 shows the areas after MPI_InitsH has initialized both the descriptor of the MPI_COMM_WORLD communicator and the execution environment. A communicator descriptor includes a pointer to a pool of free Levela and Level4 areas, which will be dynamically allocated during the communicator lifetime and a set of Level2
173 areas, one for each process in the communicator, shared among all the processes of the communicator during its lifetime. Free areas are used to support point to point communications; other areas are partitioned according to their use. Collective areas support collective communications, Point to Point areas implement point to point synchronizations and Process areas store the address of dynamically allocated channel areas. These areas will be described in more detail in the following together with the implementation of different kinds of communications. Each process invoking MPI_Comm_CreatesH allocates a subset of the areas of the communicator by selecting their address from its local tables. The second phase synchronizes all the processes involved in the creation of the communicator. This synchronization is required by the M P I semantics that states that a communicator can be used only after all the involved processes have completed its creation. This synchronization is implemented through an explicit barrier after MPI_Comm_Creates14. However, processes can be loosely synchronized so that they are delayed only when they execute the first communication that refers to the communicator. When executing this communication, each process accesses the communicator descriptor and compares the number of processes that have completed the creation against that of the communicator processes. This guarantees that any communication starts only after all the areas of the new communicator have been allocated. At the end of the second phase, each process can copy the addresses of the areas shared within the communicator into a set of local tables. 4. POINT TO POINT C O M M U N I C A T I O N S This section describes the implementation of a subset of point to point communications: [2] describes the whole set of primitives including the non deterministic ones. Point to point communications are implemented through Channel, Buffer and Point to Point areas. Channel areas store information about pending communications between two partners and Buffer areas store the corresponding messages. Point to Point areas implement process synchronization. The partition of the areas supports the implementation of several strategies to optimise the allocation and the accesses to the areas. As far as concerns the allocation, MPIsI-I allocates the pool of areas managed by the process P in the local memory of the node executing P. In this way, the channel and the buffer areas are always allocated either in the memory of the sender or in that of the receiver. A proper caching strategy further optimizes the accesses to the areas. When a process P access a channel to receive a message, it copies into its local memory any information regarding any pending communication. The receiver process has been chosen, because the number of pending sends is generally larger than that of the receives. To receive a message, at first a process checks the pending communications in the cache and, only if the cache does not include any matching pending communication, it accesses the possibly remote channel area. When a process accesses a channel area, it copies into the area the updated information from the cache. As shown in section 2 this may largely improve the overall efficiency. 5. C O L L E C T I V E C O M M U N I C A T I O N S Collective communications exploit Collective areas in the communicator descriptor for both message transmission and process synchronization. 
Since distinct areas are used for each communicator, communications involving distinct communicators are concurrently executed. Since M P I collective communications are blocking, but not synchronous, a process may start a new collective communication before the previous one has been completed by all the
174 other processes. Hence, MPIsH further partitions the set of Collective areas so that a different set of areas is associated with each process of the communicator. In this way, collective communications with distinct root processes can be simultaneously active within the same communicator because they exploit different areas. For the moment being, let us assume that the data exchanged in a collective communication always fits in one area. MPIsH pairs two data areas, AREA_IN and AREA_OUT, with each process P, see the right part of Fig. 1 The former is exploited when P is the root receiver of a communication, the latter when P is the root sender. Furthermore, each data area is paired with two areas, SYNC_IN and SYNC_OUT, to synchronize the communicating processes. Each synchronization area includes n binary semaphores, one for each process of the communicator. A semaphore enables a receiver to check if the data it is waiting for is present in the corresponding data area. A sender, instead, can check if the data area is free or if it records data of a previous communication. Let us now consider the implementation of a broadcast. The root process P checks if its AREA_OUT area is free by inspecting the corresponding SYNC_OUT area. When all the semaphores are equal to 0, all the processes involved in a previous operation with the same sender root have fetched the message from the AREA_OUT area, then all the semaphores are set to 0. P writes the message in AREA_OUT and it set all the semaphores to 1. The i-th process involved in the communication checks the presence of the message by inspecting the i-th semaphore of SYNC_OUT. When the value of this semaphore is 1, the data can be read and the value of the semaphore is reset. This implementation of collective communication does not require the invocation of a synchronizion function before accessing the areas. Hence, it exploits at best the MPI semantics and the partitioning of the areas into synchronization and data areas. To handle message that do not fit into one area, the support defines k AREA_OUT and k AREA_IN areas for each process, each one associated with a corresponding synchronization area. This allows the sender to partition the message into n packets, and store each one in a distinct area. This strategy further increases concurrency, because the sender can write the i-th packet of a message while the other processes record the previous one and it can be adopted because synchronization is implemented through distinct areas.
6. E X P E R I M E N T A L RESULTS AND CONCLUSIONS We consider preliminary performance figures of MPIsI4 on a MeikoCS2 and compare them against an implementation developed on top of a native communication library. In Fig. 2 we show the execution time of, respectively, a) scatter, b) broadcast and c) all to all primitives in the case of 8 processes as well as the execution time of a barrier primitive d) for a varying number of processes. The performances of collective communications largely benefits from an implementation on a distributed virtual shared memory. The curve in Fig. 2 e) shows the effectiveness of the caching strategy in MPIsH. In the considered program, at first, two processes sends to each other k messages, the pending messages labelling the x coordinate in the figure. Then, they execute a barrier synchronization and receive the messages. Fig. 2 f) compares the performances of MPIsH point to point communication against those of the MPI implementation on a message passing library. The performance of MPIsH point to point communications is worse than the one that can be achieved by a message passing library. A large amount of the overhead of MPIsu in the implementation of
175
2500
1200 1000
.~
E
a) ~ .9 "5 uJ
MPI - SH , MPI ..........
800
~"
./ "
_J
o
...................................... -.................... "....
400
200
16
64
256
Message
1024
4096
b)i~
2000
15oo
lOOO
.......x. . . . . . . 4
16384
~ ...... . .... 64
16
1200
MPI_SH MPI
(~ 5 0 0
#,/"
8
.~
400
d)i
~oo
m
100
1000
1024
4096
16384
Size (bytes)
,
~................ . ............... . ..............
300
c)~ ~oo .9
600
200 400
4
i
,
i
J
i
J
16
64
256
1024
4096
16384
Message
0
Size (bytes)
2
3
4
5 Number
800
.............. 17...... .............. ;7;7...... ...........]. .....
600 o Ev 500
._<-3
................ 7~%;;.... ................ ~~ ....... M P I S H using cache
400
(msg. 4 bytes) , M P I _ S H without cache ( m s g . 4 b y t e s ) MPI S H using cache ( m s g . 1 0 2 4 b y t e s ........... MPI_S-H w i t h o u t c a c h e (msg. 1024 bytes) ........... E~.......:
._
300
ixi i
200
/
100
.......................................
.........~~~......
m ......................................
41
2
3
Number
of p e n d i n g
15o
vE
messages
4
i:
100
6
7
of P r o c e s s
MPISM bytes , MP! 4 b y t e s M P I S M 1024 b y t e s ............ MPI 1024 b y t e s ....................
200
700
e)~
256
Message
S i z e (bytes)
~IPI..........
1600 1400
m
i
//
600
1800
"~
'
500
2000
"-"
'
,
o
/
/.x"
600
9
MPI S H
,~....... . ......
........................ ii i i i i i i i i i i i
"5 iii
50
2 Number
3 of p e n d i n g
messages
Figure 2. Execution times of a) scatter, b) broadcast, c) alltoall, d) barrier; e) Effectiveness of Caching Strategies; f) Comparison of M P I vs. MPIsh
point to point communications is due to the synchronizations to support M P I non deterministic constructs. Our experiments suggest the adoption of a hybrid approach that exploits a native message passing library for point to point communication and a distributed shared memory for collective communications. Further factors for an efficient implementation are proper caching strategies and the minimization of contention on shared information. REFERENCES
[1] [2]
F. Baiardi, D. Guerri, R Mori, L. Moroni, and L. Ricci. Two layers distributed shared memory. Proceedings of HPCN Europe 2001: Lecture Notes in Computer Science, 2110:302-311, June 2001. L.Vaglini. Implementazione di MPI mediante una memoria virtuale condivisa, Tesi di
176
[3]
[4]
[5] [6]
[7]
[8] [9]
Laurea, University of Pisa, 1999. F.Baiardi, D.Guerri, RMori, and L.Ricci. Evaluation of a virtual shared memory machine by the compilation of data parallel loops. In 8-th Euromicro Work. on Parallel and Distributed Processing,Rhodes, January 2000. G. Folino, G. Spezzano, and D. Talia. Performance evaluation and modelling of MPI communications on the Meiko CS-2. Proceedings of HPCN Europe 1998, LNCS 1401:932936, 1998. M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra. MPI: The Complete Reference. MIT Press, Cambridge, Massachussetts, London, England, 1996. W.Gropp and E.Lusk and N.Doss and A.Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Par. Computing, n.6, vol.22, 1996, pp. 789-828 H.Tang and K.Shen and T.Yang. Program Transformation and Runtime Support for Threaded MPI Execution on Shared Memory Machines, ACM Transactions on Programming Languages and Systems, n.4, vol.22. 2000, pp. 673-700. B.Protopopov and A.Skjellum. Shared Memory Communication Approaches for an MPI Message Passing Library. Concurrency: Practice and Experience, vol 12, 2000, pp.799829. W.Gropp and E.Lusk. A high performance MPI Implementation on a Shared-Memory Vector Supercomputer. Parallel Computing, vol.22, 1997, pp. 1515-1526.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
177
OpenMP vs. MPI on a Shared Memory Multiprocessor J. Behrens a, O. Haan a , and L. Kornblueh b ~Gesellschaft f'tir wissenschaftliche Datenverarbeitung mbH G6ttingen bMax-Planck-Institut ffir Meteorologie, Hamburg Porting a parallel application from a distributed memory system to a shared memory multiprocessor can be done by reusing the existing MPI code or by parallelising the serial version with OpenMP directives. These two routes are compared for the case of the climate model ECHAM5 on a IBM pSeries690 system. It is shown, that in both cases modifications of computation and communication patterns are needed for high parallelisation efficiency. The best OpenMP version is superior to the best MPI version, and has the further advantage of allowing a very efficient loadbalancing. 1. INTRODUCTION Parallel computer design follows a trend which is set more by commercial needs and successes than by technological possibilities and demands from parallel applications. The scaling of the majority of parallel applications to large processor counts depends crucially on a balance between the speeds of computation and communication. But commercial applications drive the technological development to faster CPUs and multiprocessing support without increasing at the same rate the speed of accessing local and shared memory. Also the speed of available communication networks for clustering the SMPs to large scale parallel systems is not adequate. So many parallel systems of today do not reach the communication speed of the Cray T3E, which means that the balance of communication and computation has dropped considerably. Many users have experienced this development when porting their application from a T3E to a modem SMP or cluster of SMPs: the application, which had a reasonable parallelisation speed up on the T3E shows a disappointing low parallelisation efficiency on the SMP systems. A very rough estimate already reveals the origin for this behaviour. The latency for point to point communication with MPI was 5 #s for the T3E and is 9 #s with the shared memory MPI implementation on the IBM pSeries690. On the other hand the computational speed of a single processor gains a factor of 3-4 for typical applications by changing from the T3E to the pSeries690. Assuming a typical dependency of the parallel efficiency of the form e~
1 1 +tlat' r.
where (~ is the relative amount of latency aware work, tlat the latency and r the computational speed, an efficiency of 70% on the T3E will be reduced by a factor of more than 3 on the pSeries690. Or put differently, a speed up of 23 on a 32 processor T3E will be reduced to a speed up of less than 8 on a 32 processor pSeries690.
178 Two routes can be taken to improve this situation: 1. Optimising the communication scheme for the application with regard to the communication granularity, because this diminishes the impact of the latency. 2. Using the shared memory programming model on the shared memory system. This will eliminate the software overhead of the MPI implementation, which is largely responsible for the latency of message passing in shared memory. Of course there are many other effects apart from latency, which will influence the efficiency of the parallelisation in the two programming models. Therefore, it is not guaranteed, that an improvement of the efficiency can be achieved at all nor is it predictable, which method will lead to the best results. In the following we report on the results for the above mentioned two ways of optimising a particular code: the climate simulation program ECHAM5 [1, 2]. We started with a MPI version of ECHAM5 on a 128 x 64 x 19 grid, that achieved a speed up of 30 on a 32 processor T3E. With minor changes the same program on a 32 processor pSeries690, using IBM's shared memory MPI implementation sped up to only 14. Further results concerning the comparison of the parallelisation on IBM RS6000/SP and IBM pSeries690 and the efficiency of hybrid MPI / OpenMP parallelisation on IBM RS6000/SP can be found in [3]. 2. OVERVIEW OF ECHAM5
The climate simulation code ECHAM is based on the second generation weather forecast model of the European Centre for Medium Range Weather Forecasts (ECMWF). Numerous modifications have been applied to this model at the Max Planck Institute for Meteorology to make it suitable for climate forecasts, and it is now a model of the fifth generation. A short characterisation of the ECHAM5 model is given in [3], a detailed description will soon be made available. ECHAM5 calculates the time evolution of the prognostic variables vorticity, divergence, surface pressure, temperature, specific humidity and mixing ratio of total cloud water and ice. One major part of this calculation is the determination of the transport of these quantities in every timestep. The transport module SPITFIRE (SPlit Implementation of Transport using Flux Integral REpresentation) [4] separates the advancement of the variables in horizontal and vertical directions, such that the vertical transport can be calculated independently in the different vertical columns, and the horizontal transport can be calculated independently in the different horizontal layers. SPITFIRE is parallelised by distributing the independent transport tasks to the available processors. In the two phases different subdomains are needed for the independent calculations, therefore between the phases the data have to be redistributed between the processors (fig. 1). In the following, we concentrate on the horizontal part of the SPITFIRE code, which contains the rearrangement of data before and after the horizontal transport calculation. 3. OPTIMISATION OF THE MPI COMMUNICATION The original MPI version of SPITFIRE distributes the work for four variables in 19 horizontal layers, i.e. for a total of 76 data layers, cyclically amongst the processors. Every processor
179
/ Z
Z
T(v--> h) B T(h--> v)
Figure 1. independent subdomains for parallelisation with 4 tasks
does the following three steps for each of its layers: 1. collect the data, which were held by other processors during the vertical transport phase, 2. calculate the horizontal transport, and 3. redistribute the data into the columnwise distribution needed for the vertical transport phase. This kind of communication with a rather fine granularity was adequate on a T3E, which could overlap the communication of one layer with the computation on the next layer. Furthermore due to the low latency the rather small size of the communicated messages did not matter. On the pSeries690 system this kind of interleaving of communication and computation is not beneficial. An analysis with the visualisation tool VAMPIR reveals that there are rather long synchronisation periods after every layer rearrangement. The sources for this are twofold: the absence of the possibility for communication/computation overlap and the above mentioned relatively large communication latency on the pSeries690. The data points labelled (a) in fig. 2 show the speed up of the horizontal transport code in the original MPI version. The peaks at processor counts 16 and 32 reveal a particularity of IBMs MPI implementation. Short messages are exchanged in EAGER MODE, i.e. they are stored in reserved buffers on the receiving processor and do not need a handshake protocol for starting the data transfer. The message sizes, that are handled in this mode, are limited by the environment variable MP_EAGER_LIMIT, which by default is 4 KB for up to 16 processors and 2 KB from 16 to 32 processors. The message sizes for collecting a horizontal layer from or redistribute it to the distributed vertical columns decrease with increasing processor numbers. They drop under 4 KB for 16 processors and under 2 KB for 32 processors, so that only for these two processor numbers the EAGER MODE is used for all packets. In a first step towards a more efficient parallelisation we separated communication from computation by first collecting all 76 layers, then calculating the horizontal transport for all layers and finally redistributing all layers. This scheme needs more intermediate storage but avoids the barrier synchronisation between the data collection and the computation for the individual layers. Furthermore the synchronisation delay has been diminished by using the nonblocking m p i _ i r e e v followed by a final call to m p i _ w a i t a l l instead of the blocking m p i _ r e c v . The data points on the curve labelled (b) in fig. 2 show the effect of this rearrangement: the communication efficiency has improved to the EAGER MODE level for all processor counts.
180 16
I
I
I
................
I
ideal
.I
I
I
I
f"
..................
............................. ~ii~................................ :: :::~ i '~i :~ii (ili ~i:!~"~'~ C ~ :,~!!i!~i"i"~i~::~:!~::!izi: :':i~ Ci .."
-+e.
12
.......
(d) MPi datatypes...." (c) contiguous ... p690 distribution... .///
,'"
i.~
:" 8
-
f
/
~i.~.......... ..............
/:"
f~,~/"
" '
": .."
.
.."
,"'"
..... //'/
.................. ~ ................ (b) separated msg&op cyclic distribution | ^ (a) interleaved msg&op 1
1 1'
4'
13
12'
1'6 20' 2' 4 28' 32 Pn Figure 2. Optimisation steps for the speed up of the horizontal transport on the pSeries690.
In a second step the cyclic distribution of the 76 layers onto the processors has been change into a blockwise distribution of the layers, such that different variables with the same vertical index tend to lie on the same processor. Then each processor only needs the common drift velocity data with the vertical indexes of its variables, which reduces the communication load. Curve (c) shows the positive effect of this change of data distribution. Finally the number of messages between each pair of processors was further decreased to one by combining the data of different layers into a derived MPI datatype. This reduces the impact of the communication latency and improves the efficiency especially for large processor counts, when the message sizes get smaller (cf. curve (d) in fig. 2). At this level of optimisation the problem of load balance from distributing the computational load of 76 layers becomes visible. With 18 processors some processors will have to handle 5 layers, from 19 to 25 processors no processor has a load larger than 4 layers and from 26 processors on the maximal load will be 3 layers. This is reflected in the efficiency jumps at processor counts 20 and 26 in curve (d). Even with infinite communication speed we would see the plateaus following these processor counts. A further rearrangement of the layers can minimise the number of different velocity fields on the processors. This leads to the further improvement of speed up for more than 30 processors (curve (e) in fig. 2). The result of this four step change of the MPI version of the horizontal transport part of ECHAM5 is an improvement in speed up on a 32 processor pSeries690 from 11 to 15.
181
24 simple OpenMP (cache opt)
20
simple OpenMP (local copy) MPI (cache opt)
16
,\
Q. "(3
e12 Q..
cO
\ simple OpenMP I
0
4
I
I
8
12
,,
I
I
i
16 Pn
20
24
,
,,I
28
,,
I
32
Figure 3. Speed up curves for OpenMP and MPI versions of the horizontal transport calculation.
4. OPENMP PARALLELISATION The pSeries690 is a shared memory system with 32 processors. The memory modules are attached to multi cpu units (MCU), each of which contains four dual processor chips. One chip houses two power4 processors with their level 1 caches and a single 1.4 MB level 2 cache which is shared between the two processors. A switch fabric interconnects the four chips on a MCU with each other and with four 32 MB level 3 caches. The level 3 caches act as interfaces to the memory modules. Four MCUs are connected by systems of four parallel buses to a 32 way symmetric multiprocessor. Each processor can address the memory modules of every MCU. Communication of data between processors proceeds implicitly by maintaining cache coherence. Shared memory parallelisation of the SPITFIRE code in ECHAM5 with OpenMP starts from the serial version by inserting p a r a l l e l do directives using a static contiguous distribution of the iteration space and by declaring various global arrays for intermediate data as threadprivate. The result of this "simple OpenMP" parallelisation can be seen in fig. 3. The speed up is far worse than with MPI. The reason for this unexpected behaviour can be found in the number of cache misses, which grows with the number of processors used for the calculation of the horizontal transport. Tools for the access of hardware counters for cache misses and other relevant processor activities are available on the pSeries690 [5]. Fig. 4 shows the number of level 1 cache misses, summed over all participating processors. Apparently data needed for local computations are replaced
182 9.0e+08
i
i
i
i
i
i
i
I
I
8.0e+08
S u m m e d L1 C a c h e L o a d M i s s e s
(h
7.0e+08
o
6.0e+08
o
5.0e+08
.=.
/~
J
I
4.0e+08 3.0e+08 2.0e+08 OpenMP (cache opt.) 0
I
4
I
8
I
12
I
I
I
16 20 24 Pn Figure 4. Level 1 cache load misses in the horizontal transport calculation.
I
28
32
somewhere in the cache hierarchy by global data updated by other processors. A closer analysis of the code shows, that these local computations use a two dimensional array for intermediate results and that a simple rearrangement of loops allows to reduce the size of this array to one dimension. This simple cache optimisation results in the lowest curve in fig. 4, which shows that now the number of cache misses is independent on the number of processors and 50% lower than in the non optimised one processor case. The comparison with the MPI case is very instructive. In the original code the number of level 1 cache misses rises slowly with the number of processors, indicating the growing amount of buffer space used for the message exchanges. With cache optimisation this curve is shifted by a constant amount to fewer cache misses. The improvements for the cache misses translate directly to the speed up values. The top curve for the cache optimised OpenMP version shows nearly ideal speed up for 10 and less processors. The plateaus for larger processor counts from the load imbalance are even more pronounced than in the MPI case. 5. CONCLUSION AND O U T L O O K The optimisation of the SPITFIRE transport part of the climate code ECHAM5 for parallel processing on a IBM pSeries690 showed two main results: 1. The original MPI version could be tuned towards larger messages in order to improve the speed up. 2. After a reduction of the array size for intermediate results, the OpenMP parallelisation is definitely more efficient then MPI.
183 The overall effort to produce the cache optimised OpenMP version from the sequential code was small compared to the amount ofreprogramming needed to optimise the original MPI code. A further gain in parallelisation efficiency can be achieved by improving the load balance. The differences of the physical transport conditions in the different layers cause differences in the times needed for the transport calculations, and these differences are changing slowly. A dynamical scheme can be set up for a layer distribution which minimises the load imbalance. With this additional optimisation we reached a speed up of horizontal part of SPITFIRE of 26.5 with the OpenMP version and of 18.5 with the MPI version on a 32 processor pSeries690 (cf. [3]). Because this part of the code has the lowest parallelisation efficiency, the full ECHAM5 code will scale even better. The results of this case study also show that for using a larger number of processors on a clusters of pSeries690 a hybrid programming model could be profitable, with OpenMP parallelisation on a node and MPI parallelisation between the nodes. In this case a multi-threaded implementation of MPI will be needed in order to enable the parallel use of multiple processors on a node and of multiple interconnects between the nodes for the intemode communication (cf. eg. [6]). REFERENCES
[1] [2]
[3]
[4]
[5] [6]
Roeckner, E. et. al., Simulation of the present-day climate with the ECHAM model : Impact of model physics and resolution. Max-Planck-Institut f'fir Meteorologie Report Nr. 93 (1992) Roeckner, E. et. al., The atmospheric general circulation model ECHAM-4: Model description and simulation of present day climate. Max-Planck-Institut f'tir Meteorologie Report Nr. 218 (1996) Behrens, J., O. Haan, and L. Komblueh, Effizienz verschiedener Parallelisierungsverfahren fiir das Klimamodell ECHAM5 - MPI und OpenMP im Vergleich, in O. Haan (ed.) Erfahrungen mit den IBM-Parallelrechnersystemen RS/6000 SP und pSeries690, GWDGBericht Nr. 60 (2003) Rasch, P., and M. Lawrence, Recent development in transport methods at NCAR, in B. Machenhauer (ed.) MPI Workshop on Conservative Transport Schemes, Max-PlanckInstitut f'tir Meteorologie Report Nr. 265 (1998) DeRose, L., The hardware performance monitor toolkit, in Euro-Par 2001 Parallel Processing, p. 122, August 2001. Rabenseifner, R., and G. Wellein., Communication and optimization aspects of parallel programming models on hybrid architectures. International Journal of High Performance Computing Applications, 17(1):49-62 (2003).
This Page Intentionally Left Blank
Parallel Computing: SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 ElsevierB.V. All rights reserved.
185
MPI and OpenMP implementations of Branch-and-Bound Skeletons I. Dorta, C. Le6n, C. Rodriguez, and A. Rojas a* aDpto. Estaditica, I.O. y Computacirn, Universidad de La Laguna, Edificio de Fisica y Matemfiticas, 38271 La Laguna, Tenerife, Spain This work presents two skeletons to solve optimization problems using the branch-and-bound technique. Sequential and parallel code of the invariant part of the solvers are provided. The implementation of the skeletons are in C++,and is divided into two different parts: one that implements the resolution pattern provided by the library and a second part, which the user has to complete with the particular characteristics of the problem to solve. This second part is used by the resolution pattern and acts as a link. Required classes are used to store the basic data of the algorithm. The solvers are implemented by provided classes. There is one skeleton in the class hierarchy which is implemented using MPI and another one using OpenMP. This is one of the main contributions of this work. Once the user has represented the problem, she obtains two parallel solvers without any additional effort: one using the message passing paradigm and other with the shared memory one. An algorithm for the resolution of the classic 0-1 Knapsack Problem has been implemented using the skeleton proposed. The obtained computational results using an Origin 3000 are presented. 1. I N T R O D U C T I O N Branch-and-Bound (BnB) is a widely used and effective technique for solving hard optimization problems. This work presents two skeletons to solve problems using this technique. The implementation of the skeletons are in C++. Sequential code and parallel code of the invariant part of the solvers are provided for this paradigm. Several studies and implementations of general purpose skeletons using Object Oriented paradigms are available in [1, 2, 5]. This work is focuses on the BnB skeletons. Several parallel tools have been developed for this paradigm. PPBB (Portable Parallel Branch-and-Bound Library) [12] and PUBB (Parallelization Utility for Branch and Bound algorithms) [ 10] propose implementations in the C programming language. BOB [6], PICO (An Object-Oriented Framework for Parallel Branch-and-Bound) [4] and COIN (Common Optimization Interface for Optimization Research) [9] are developed in C++. In all cases a hierarchy of classes is provided and the user has to extend it to solve her particular problem. There are two different parts in our proposal. The first one implements the resolution pattern provided by the library. The second part consists of a set of classes which the user has to complete with the particular characteristics of the problem to solve. These classes will be used *This work was partially supported by the EC (FEDER) and the Spanish MCyT with I+D+I contract numbers: TIC2002-04498-C05-05 and TI2002-04400-C03-03.
186
Required
I
Solution! .... >[Problem I i
[SetUp [<_ _ q SubProblem
i
I
l + 1 o w e r _ b o u n d () | + u p p e r _ b o u n d () ~ + b r a n c h ()
'
" -
~~o~ ~ L~
Provided
I Solver_SeqI
I SolvEr_Lan
I
Solver_Centralized
]
I
[ I
I Solver_DistributedI
Figure 1. UML diagram of the BnB skeleton
by the resolution pattern (see figure 1). Required classes are used to store the basic data of the algorithm. BnB uses the class P r o b l em to define the minimal standard interface to represent a problem, and the class S o l u t i o n to typify a solution. The class S u b p r o b l e m represents the area of unexplored solutions. Its method b r a n c h (pbm, s p s ) generates from the current subproblem the subset of subproblems to be explored. The l o w e r _ b o u n d (pbm, s o l ) and u p p e r _ b o u n d (pbm, s o l ) subproblem methods calculate a lower and upper bound, respectively, of the objective function for a given problem. Furthermore the user must specify in the definition of the P r o b l em class whether the problem to solve is a maximization or minimization problem. Once the user has represented the problem, she obtains different parallel solvers. The solvers are implemented by the provided class S o l v e r . There is one skeleton implemented using MPI [ 11 ] and another using OpenMP [8] in the class hierarchy. This is one of the main contributions of this work. The user obtains different parallel algorithms with the same specification of resolution: one using the message passing paradigm and another with the shared memory one. The rest of the article is organized as follows. The second section shows how to implement a problem using the skeleton. The third section describes the message passing and the shared memory schemes, respectively. Preliminary computational results appear in the fourth section, where both schemes are applied to the resolution of the 0-1 Knapsack Problem. Finally, the fifth section presents some conclusions and future work. 2. AN EXAMPLE" THE 0-1 KNAPSACK PROBLEM The following paragraphs show the interfaces of required classes of the BnB skeleton. The instantiation of the skeleton classes for a concrete problem consists in two steps: (i) choosing the data types for representing the members of the class (in a . hh file) and, (ii) implementing the methods of the class according to the chosen data types (in a . r e q . c c file). The required class Prob i em represents an instance of the problem to be solved. The internal implementation of the class only needs to create, serialize and obtain the direction. Note that the serialization is important when an object of the class has to be sent to other processes in a
187
parallel distributed execution 9 The 0-1 Knapsack Problem [7] is the example chosen. In this problem a subset of N given items has to be introduced in a knapsack of capacity C. Each item has a profit p~ and a weight wi. The problem is to select a subset of items whose total weight does not exceed C and whose total profit is maximum. This is a maximization problem. We need to represent the profit, the capacities and the constraint vector, which can be done as follows: requires class Problem { public: Number C, // capacity N; // number of elements vector p, // profits w; // weights Problem () ; // constructor -Problem () ; // destructor inline Direction direction() const { return Maximize;} 9
;
.
.
friend opacket& operator<< friend ipacket& operator>>
(opacket& os, const Problem& pbm); (ipacket& is, Problem& pbm);
The required class S o l u t i o n represents a feasible solution to the problem. It is represented through a vector containing true or false values depending on whether the object belongs to the solution or not: requires class Solution public : vector s; // Solution () ; // -Solution () ; // 9
};
.
{ solution vector constructor destructor
.
friend opacket& operator<< friend ipacket& operator>>
(opacket& os, const Solution& (ipacket& is, Solution& sol);
sol);
The required class Sub P r o b 1 em represents a partial problem. A subproblem is determined by the following data: the current capacity, the next object to be considered, the current profit and the current solution: requires class SubProblem public: Number CRest, // obj, // profit; // Solution sol; // SubProblem () ; // -SubProblem (); //
{ current capacity next object current profit current solution constructor destructor
friend opacket& operator<< friend ipacket& operator>> void initSubProblem
(opacket& os, const SubProblem& (ipacket& is, SubProblem& sp) ;
(const Problem& pbm) ;
sp) ;
188
;
Bound upper_bound (const P r o b l e m & pbm, S o l u t i o n & is); Bound lower_bound (const P r o b l e m & pbm, S o l u t i o n & us); void branch (const P r o b l e m & pbm, b r a n c h Q u e u e <SubProblem>&
subpbms) ;
This class must also provide the following functionalities: 9 The method i n i t S u b P r o b l e m (), that is, the first subproblem. In the knapsack problem, the value of C R e s t is initialized to the total capacity, and the current p r o f i t to zero because no object has been selected. 9 The b r a n c h () method. Given a subproblem, two new ones are generated based on the decision to include the next object in the knapsack or not: void
S u b P r o b l e m : : b r a n c h ( c o n s t P r o b l e m & pbm, c o n t a i n e r < S u b P r o b l e m > & subpbms ) { SubProblem spNO = SubProblem(CRest,obj+l,profit) ; s u b p b m s , i n s e r t (spNO) ; Number newC = CRest - pbm.w[obj] ; if (newC >= 0) { SubProblem spYES = SubProblem(newC,obj+l, (profit + p b m . p [ o b j ] ) ) s u b p b m s , i n s e r t (spYES) ;
;
}
9The lower_bound (pbm, sol) and u p p e r _ b o u n d (pbm, sol) methods calculate a lower and upper bound, respectively, of the objective function. The lower_bound is calculated including objects in the knapsack until the current capacity is reached: Bound SubProblem: :upper_bound (const P r o b l e m & pbm) { B o u n d upper, w e i g t h , pft; N u m b e r i; for(i=obj,weigth=0,pft=profit; weigth<=CRest; i++) { w e i g t h += p b m . w [ i ] ; p f t += p b m . p [ i ] ;
}
i--; w e i g t h -= p b m . w [ i ] ; p f t -= p b m . p [ i ] ; u p p e r = p f t + (Number) ((pbm.p[i] * (CRest - w e i g t h ) ) / p b m . w [ i ] ) r e t u r n (upper) ;
;
The upper_bound is calculated in a similar way, but in this case a portion of the last object considered is included to meet the capacity: Bound SubProblem: :lower bound B o u n d w e i g t h , pft; N u m b e r i, f o r ( i = obj, w e i g t h = 0, p f t w e i g t h += p b m . w [ i ] ; p f t += m
}
(const P r o b l e m & pbm, S o l u t i o n & us) tmp; us = sol; = p r o f i t ; w e i g t h <= C R e s t ; i++){ pbm.p[i] ; us.s.push_back(true) ;
i--; w e i g t h -= p b m . w [ i ] ; p f t -= p b m . p [ i ] ; u s . s . p o p _ b a c k ( ) trap = p b m . N - u s . s . s i z e ( ) ; for (Number j = 0; j < tmp; j++) u s . s . p u s h _ b a c k ( f a l s e ) ; r e t u r n (pft) ;
;
{
189 3. IMPLEMENTATION OF THE SKELETONS In this section we will study the interfaces of the BnB skeleton provided classes. The provided classes are those which the user has to call when using a skeleton, that is, she only has to use them, not adjust them. Theses classes are: S e t u p and S o l v e r . The S e t u p class groups the configuration parameters of the skeleton. For example, this class specifies whether the search of the solution area will be depth-first, best-first or breadth-first. It also specifies whether the scheme must be distributed or shared memory. The S o l v e r class implements which strategy to follow, and maintains updated information concerning the state of the exploration during the execution: provides class Solver { protected: Direction dir; branchQueue<SubProblem> queue; Setup st; const Problem& pbm; Solution sol; SubProblem sp; Bound bestSol, high, low; public: ...
// // // // // // //
maximum or minimum search space search method: best-first, problem solution partial problems partial values
deep
};
The execution is carried out through a call to the r u n ( ) method. This class is specified by S o l v e r _ S e q , S o l v e r _ L a n and S o l v e r _ S M in the class hierarchy (see figure 1). In order to choose a given resolution pattern, the user must instantiate the corresponding class in the m a i n ( ) method. The following paragraphs describe the design and implementation of the BnB resolution patterns. A more detailed explanation of the code can be founded in [3].
3.1. Message Passing implementation The message passing parallel version uses a Master/Slave scheme. The generation of new subproblems and the evaluation of the results of each are completely separated from the individual processing of each subtask. The Master is responsible for the coordination between subtasks. The Master registers the occupation state of each slave in a data structure. In the beginning all the slaves are registered as idle. The 'initial subproblem', the 'best solution' and the 'best value of the objective function' are sent to an idle slave. As long as there are no idle slaves the Master receives information and decides on the next action to apply depending upon whether the problem has been solved, whether there is a slave request or whether the slave has no work to do. If the problem is solved, the solution is received and stored. When the master receives a request for a certain number of slaves, it is followed by the upper bound value. If the upper bound value is better than the actual value of the best solution, the answer to the slave includes the number of slaves that can help solve its problem. If this is not the case, the answer indicates that it is not necessary to work in this subtree. When the number of idle slaves is equal to the initial value, the search process ends and the Master notifies the slaves to stop working. A slave works bounding the received problem. New subproblems are generated by calling the b r a n c h method provided by the user. The slave asks for help to the Master. If no free slaves are provided, the slave continues working locally. Otherwise, it removes subproblems from its local queue and sends them directly to the allocated slaves.
190 Table 1 Origin 3000. MPI implementation. The sequential average time in seconds is 867.50. Procs
Average
2 3 4 8 16 24 32
933.07 743.65 454.77 251.82 186.18 152.49 151.09
Min 932.67 740.73 399.86 236.94 174.24 144.29 144.69
Max 933.80 745.29 492.43 266.96 192.80 167.92 166.46
Speedup-Av Speedup-Max Speedup-Min 0.93 1.17 1.91 3.44 4.66 5.69 5.74
0.93 1.17 2.17 3.66 4.98 6.01 6.00
0.93 1.16 1.76 3.25 4.50 5.17 5.21
3.2. Shared Memory implementation The algorithm proposed works with a global shared queue of tasks implemented using a linked data structure. For the implementation of the BnB shared memory resolution pattern we have followed a very simple scheme. First, the number of threads, n, is calculated and established. Then n subproblems are removed from the queue and assigned to each thread. Using a parallel region each assigned thread works in its own subproblem. The best solution value and the solution vector must be modified carefully, because only one thread can change the variable at any time, therefore, it must be done inside a critical region. The same special care must be taken into account when a thread tries to insert a new subproblem in the global shared queue. 4. COMPUTATIONAL RESULTS This section analyzes the experimental behavior using the 0-1 Knapsack Problem on sets of randomly generated test problems. Since the difficulty of such problems is affected by the correlation between profits and weights, we considered the strongly correlated ones. The experiments have been done on an Origin 3000, whose configuration is 160 MIPS R14000 processors at 600 MHz, 1 Gbyte of memory each and 900 Gbyte of disk. The software used in the Origin 3000 was the MPISpro CC compiler of C++(version 7.3.1.2m) and IRIX MPI. Table 1 was obtained using MPI, and shows the speedup results of five executions of the 0-1 Knapsack Problem randomly generated for size 100,000. Only the optimal value of the objective function is calculated. The solution vector where this value is obtained is omitted. The first column contains the number of processors. The second column shows the average time in seconds. The column labelled "Speedup-Av" presents the speedup for the average times. The third and sixth columns give the minimum times (seconds) and the associated speedup whereas the fourth and seventh columns correspond to the maximum. Due to the fine grain of the 0-1 Knapsack Problem, there is no linear increase in the speedup when the number of processors increases. For large numbers of processors the speed-up is poor. For those cases there is still the advantage of being able to manage large size problems that can not be solved sequentially. The limited performance for two processor systems is due to the Master/Slave scheme followed in the implementation. In this case, one of the processors is the Master and, the other is the worker and furthermore, communications and work are not overlapped. Table 2 shows the results for the problem using the OpenMP skeleton. Similar behavior can be appreciated. However, when the number of processors increases the speed-up of the
191 Table 2 Origin 3000. OpenMP implementation. The sequential average time in seconds is 869,76. Procs
Average
Min
Max
Speedup-Av
Speedup-Max
Speedup-Min
2 3 4 8 16 24 32
747.85 530.74 417.62 235.10 177.72 157.30 176.43
743.98 517.88 411.02 231.26 162.00 151.18 174.23
751.02 540.76 422.59 237.00 207.15 175.67 179.01
1.16 1.64 2.08 3.70 4.89 5.53 4.93
1.17 1.68 2.12 3.76 5.37 5.75 4.99
1.16 1.61 2.06 3.67 4.20 4.95 4.86
OpenMP version decreases while the MPI remains stable. 5. CONCLUSIONS AND FUTURE WORKS Two generic schemes for the resolution of problems by means of the Branch-and-Bound paradigm have been introduced in this paper. The high level programming offered to the user should be emphasized. Two different parallel solvers are obtained by means of only one specification of the problem interface: one using the message passing paradigm and other using the shared memory one. Furthermore, the user does not need to have any expertise in parallelism. The obtained computational results suggest that for a large number of processors the proposed algorithms are not scalable. One line of work to improve this deficiency is to parameterize the number of subproblems that is solved by each processor before doing any request. Another issue is to improve memory management. In the message passing algorithm presented all the information related with a subproblem is grouped in the same class, "SubProblem", specifically, the decision path taken up from the original problem down to that subproblem. All this information travels with the object whenever there is a communication. There is room for optimization regarding this issue, since the subproblems share some common past. The natural solution necessarily implies the distribution of the decision vector among the different processors and the introduction of mechanisms to recover that distributed information. Finally, an approach that could allow for better management of the computational resources, especially in hybrid share-distributed memory architecture, is to combine data and task parallelism. REFERENCES
[1] J. Anvik, S. MacDonald, D. Szafron, J. Schaeffer, S. Bromling, and K. Tan, Generating
[21
[3]
Parallel Programs from the Wavefront Design Pattern, In Proceedings of the 7th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS'02), Fort Lauderdale, Florida (2002) M. Cole, Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming, submitted to Parallel Computing, (2002) I. Dorta, C. Le6n, C. Rodriguez, Comparison between MPI and OpenMP Branch-andBound Skeletons, In Proceedings of the 8th International Workshop on High-Level Par-
192
[4] [5] [6] [7] [8] [9]
[10] [ 11] [12]
allel Programming Models and Supportive Environments (HIPS'03), Nice, France, 66 73 (2003) J. Eckstein, C.A. Phillips, W.E. Hart, PICO: An Object-Oriented Framework for Parallel Branch and Bound, Rutcor Research Report (2000) H., Kuchen, A Skeleton Library, Euro-Par 2002, 620 629 (2002) B. Le Cun, C. Roucairol, The PNN Team, BOB: a Unified Platform for Implementing Branch-and-Bound like Algorithms, Rapport de Recherche n.95/16 (1999) S. Martello, P. Toth, Knapsack Problems: Algorithms and Computer Implementations, John Wiley & Sons Ltd, (1990) OpenMP: C and C++ Application Program Interface Version 1.0, http://www.openmp.org/, October (1998) T.K., Ralphs, L. Ladfinyi, COIN-OR: Common Optimization Interface for Operations Research, COIN/BCP User's Manual, International Business Machines Corporation Report (2001) Y. Shinano, M. Higaki, R. Hirabayashi, A Generalized Utility for Parallel Branch and Bound Algorithms, IEEE Computer Society Press, 392 401, (1995) M. Snir, S. W. Otto, S. Huss-Lederman, D.W. Walker and J. Dongarra, MPI - the Complete Reference, 2nd Edition, MIT Press, (1998) S. Tsch6ke, S., T. Polzer, Portable Parallel Branch-and-Bound Library, User Manual Library Version 2.0, Paderborn (1995)
Parallel Computing: SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters aridW.V. Walter (Editors) 9 2004 ElsevierB.V. All rights reserved.
193
Parallel Overlapped Block-Matching Motion Compensation Using MPI and OpenMP E. Pschernig a and A. Uhl a* aSalzburg University, Department of Scientific Computing Jakob Haringer-Str. 2, A-5020 Salzburg, Austria Overlapped block-matching motion compensation (OBMC) enhances the prediction results of classical non-overlapped block-matching in the context of video compression significantly and is especially well suited for. wavelet-based video coding. However, this is achieved at a high additional computational cost. We investigate possibilities for parallelization of two classical OBMC algorithms and investigate the performance of implementations using MPI and OpenMP. 1. INTRODUCTION Digital video compression is the computationally most demanding algorithm in the area of multimedia signal processing. Block-matching motion compensation covers about 60-80% of the runtime of all standardized video coding schemes. Therefore, a significant amount of work discusses dedicated hardware for block-matching [3, 2]. Also, software based block-matching approaches on general purpose high performance computing architectures have been investigated [7, 8]. The blocking artifacts as introduced by classical block-matching degrade the compression performance of such schemes, especially at low bit-rates. This may be overcome by using overlapped block-matching, however, this comes at a high additional computational cost. Therefore, algorithms for parallel overlapped block-matching are desirable to move this scheme closer towards real-time processing. In section 2 we introduce the term overlapped block-matching and experimentally evaluate the performance of the corresponding sequential implementation. Section 3 discusses parallelization approaches and presents experimental results comparing MPI and OpenME 2. OVERLAPPED BLOCK-MATCHING Classical block-matching (BM) is a standard technique in video compression. A frame is divided into a grid of pixel blocks, and for each block a motion vector (MV) is computed. The result is a motion vector field (MVF) for each frame. The MVs are used to translate blocks of pixels from the previous frame into the current frame when decoding the video sequence. During encoding, MVs out of a given range are searched, which minimize the error between the current frame, and a frame reconstructed from the previous frame by translating the blocks by *Corresponding author: [email protected]. This work was partially supported by the Austrian Science Fund, project 13903.
194 their MVs. In video compression, only the MVF and the difference to the reconstructed flame have to be encoded for each frame to be able to reconstruct the entire video. It works very well because the most common change between frames in a normal video in fact is motion. Overlapped block-matching motion compensation (OBMC) is an enhancement to the basic BM, which lets the blocks in the grid overlap. The pixel intensity for a pixel in a reconstructed frame is not only derived by translating a single pixel from the reference frame, but is also affected by translating pixels according to MVs of neighboring blocks. Where the blocks overlap, the pixel intensities are combined linearly, with the weights taken from a window function over the block [9]. In our implementation, a pyramidal window function is used over the blocks. With S x S sized blocks, the window is sized 2S x 2S, and there are 8 overlapping neighbors. As error measure, a squared error is used. Let p denote a pixel position inside a frame, and Io(p) and I1 (p) the intensity of the pixel at this position in the reference frame and the frame following it, respectively. The intensity in the predicted frame then is I0(p + v), where v is the motion vector for the block. The error of a block is given by summing the squared difference over all pixels p in the block:
-
Z p
The MV v resulting in the smallest error out of the possible MVs in a certain search range is selected as the optimum one. With OBMC, the intensity of a pixel in the predicted frame also depends on the MVs of overlapping blocks, and the window function w. If b is the relative position of a neighboring block, b E { - 1 , 0, 1} x { - 1 , 0, 1}, E can be obtained as: /
p
b
The optimum MV Vb for b = (0, 0) must be found. Because it is affected by the MVs from neighboring blocks, this is more complicated as compared to BM, where the MV for a block is independent from all the other MVs. Instead, the optimum MV for a block mutually depends on the MVs of neighboring blocks, which means that it can't be found at all with a polynomial time algorithm. The following algorithms therefore try to find a sub-optimum solution by assuming that the MVs for some of the neighboring blocks are already found, and ignore the other neighbors. The first algorithm considered in this work is referred to as Raster Scan Algorithm (RAST)[6, 5]. Like its name says, RAST processes the blocks row by row, and so the neighboring blocks with known MVs for a block are the 3 blocks above it in the previous row and the previous block in the same row (except for blocks on the left and upper edges as shown in figure 1.b). The second algorithm considered is an iterative algorithm, based on the Iterative Conditional Mode (ICM)[6, 1]. It divides the blocks into disjoint subsets in a way so blocks within a subset don't overlap. The version we implemented uses only two checkerboard like subsets, ignoring the overlap at the comers. So the known neighbor vectors (except for border blocks) are always from the 4 neighboring blocks with a different color (figure 1.b). For this to work, an initial MV is first found for each block by performing standard full-search BM. After that, the search is iterated by alternating the two subsets. In one iteration, the MVs of all white blocks are
195 r-
.
.
.
I
r
~
~
L
.
.
.
.
.
.
.
.
"~ I
bl~ck b in previous fr~ne 1
S ~ gck n-. ~ S bl ~ ~
.
~ a pixel p in block b ---- window W over block b (a) Search in OBMC
{-1,-1) (0,-1) (1,-1) I I (-L0) (0,0) I I I '
'
_ (0,-1)_ -- t[-~ I I (-L0) (0,0) 110) I I I t(0,1) I
. . . . . . ~ J . . . . . window over block (0,0) RAST ICM (b) Known neighboring blocks for RAST and ICM
Figure 1. Overlapped Block Matching
refined taking 4 neighbors as known, in the next iteration, the same is done for all black blocks. The iteration terminates after the MVs don't change anymore, or after a maximum number of iterations is reached. For comparison purposes, it is also interesting to look at another algorithm we denote as Windowed Block Matching (WBM). For WBM, no MVs of neighboring blocks are assumed to be known. In this case, the error for a MV is simply computed by summing the weighted errors of all pixels in the window of the current block. The only difference to standard BM is the use of the weighted window. Experimental results were obtained by using 40 consecutive frames of 3 video sequences with different sizes: 'Foreman', 176x144 pixels (22xl 8 8x8 blocks), 'Garden', 352x240 pixels (44x30 8x8 blocks) and 'Mobile', 720x576 pixels (90x72 8x8 blocks). The used settings were a block size of 8 pixels, and a search range of +7 pixels, performing integer-full-search. ICM used a fixed number of 4 iterations, and during iterations, a search range of +3 pixels around vectors found in the previous iteration. A common way to normalize the error is the peak signal to noise ratio (PSNR), which we use to measure the quality difference of the different algorithms. It takes into account the total number of pixels r~: PSNR = 10 log10(2552r~/E) E is the summed squared error as used before for single blocks. Figure 2 compares the PSNRs of the frames resulting from the different algorithms for the Foreman sequence. It shows how well the algorithms estimate the motion, and the similarity of a graph to the other graphs also suggests that an implementation is working properly. ICM and RAST give the best results, with almost the same PSNR - for ICM the average is 0.05 db higher. BM is about 2 dB below this on average. The results of WBM are always better than for BM, on average about 1 dB in the Foreman sequence. In the Garden sequence it is even 6 dB. This shows the advantage of OBMC over BM, even when the overlapping areas are ignored for the encoding. At the same time, the additional information about neighboring blocks gives RAST and ICM the ability to obtain higher PSNRs. Finally, in table 1 the per-frame times needed for the sequential version of the algorithms on an SGI Origin 3800, as well as the average PSNR, are shown. The significanlty increased time demand of the overlapping techniques is clearly shown.
196 foreman
,.... m .'o. rr
42
!
,
41
I-L
,
,
PSNR
, ~
,
,
~~
f~!~::,k
40 1
, BAsTICM . . . . .,. . .
BM
WBM
38 37
3 4 i33 L 31
40
A
',", ..~ , ~i' ',13"
"~ 9
' 45
I 50
~" ',,b .... o.. ~ ', ~
/
........
........D.......
,,.~,."-.,,.,,"
'
'
'
'
'
55
60
65
70
75
"_
80
Frame number
Figure 2. PSNR of motion-estimated frames in the Foreman sequence.
Table 1 Average per-frame time and PSNR Foreman Time[s] PSNR[dB] 34.51 BM 0.48 35.52 WBM 2.60 BAST 2.77 36.56 36.61 ICM 3.31
Garden Time[s] PSNR[dB] 20.29 1.58 8.52 26.59 27.25 9.06 27.28 10.53
Mobile Time[s] PSNR[dB] 7.77 32.90 41.50 33.56 34.73 43.88 52.13 34.74
3. P A R A L L E L I M P L E M E N T A T I O N U S I N G MPI A N D O P E N M P
The algorithms were implemented on 2 different architectures, using MPI and OpenMP: An SGI Origin 3800 with MIPS R12000 400 MHz processors and shared memory, which allowed the use of MPI as well as OpenMP. A Hitachi SRS000 with 8 processors/node, and 1 GFlops peak performance per processor. On the SR8000, OpenMP could only be used with 8 processors inside a node, and MPI was used for inter-node communication as well as inside nodes. Note that for the MPI implementation each communication event needs to be explicitly stated whereas OpenMP facilitates parallelization by simple loop distribution using compiler directives (e.g. # p r a g m a omp p a r a l l e l f o r ) . As a consequence, the implementation effort is significantly higher in the MPI case. On the other hand, OpenMP is restricted to multiprocessors whereas MPI may be used on almost any high performance computing architecture. There exist several different granularity levels for parallel BM [8] and OBMC [4]. In our context, we use intra-frame parallelization (distributing single blocks to the processing elements (PEs)) since this is the approach most suited for applications on a majority of hardware systems. This block based parallelism (BBP) works very well in the case of standard BM and WBM, where ME for a block is independent of other blocks. It also works in the case of ICM, where only MVs of neighboring blocks out of a different subset are needed, and the MVs for blocks inside a subset are independent again. Figure 3 shows the speedup results of BAST and ICM for all three video sequences on the
197
Origin/ICM
Origin/RAST foreman mobile
40
---+---........ ,~
garden
35
.......
.... ,,----
30 25
25 "~
~
20
r/)
~..~...~ -.~mmm~
15
20 ........ ~..~-.~
15
...............
10 5
0
5
10
15
20
25
30
35
0
40
0
5
10
15
35
forem'an . garden .... ,,----
.
.
.
.
.
........
mobile ................
30 .
25
25
30
35
40
Origin/MPI/ICM
Origin/MPI/RAST 40
20
Number of PEs
Number of PEs
....
...
.... ......
40
........ "
foremangardenm................ obile§.... --....
35
............ "
................................... ".......
30 ........
25
...
......
Q.
,~.:~-
~.;~...,~. ,*,
~
x ~
,
~
ffl 15
15
10
10 5
5 0
. l
.
5
.
.
10
.
.
15
.
.~.~"
.
20
25
Number of PEs
30
35
40
0
0
n
i
i
5
10
15
i
a
i
i
20
25
30
35
40
Number of PEs
Figure 3. Speedup with OpenMP (first line) and MPI on the Origin 3800.
Origin. The Mobile sequence is the biggest sequence, and uses the most computation time. This results in a better speedup, because there is a larger amount of time spent in executing parallel code compared to the sequential parts of the code. The speedup for Foreman as the smallest sequence suffers most from the time spent in sequential parts and from added overhead, and has the lowest speedup. For RAST, the error of a block depends on the MVs of neighboring blocks, which are expected to be available in the sequential raster-scan order. This means that simply working on blocks in parallel won't work. Instead, computation for a block can only start after the MVs for the block to the left, and the required blocks above it, have been found. As a result of this, no two blocks in the same line can ever be processed in parallel, because every block requires the left block to be processed first. And blocks in the previous row have to be computed at least to the column one more to the right. As a consequence, fewer of the available PEs can be used in parallel. The maximum number of simultaneously processed blocks is half the number of columns. For example, in 176xl 44 pixel sized test-frames, with 8x8 sized blocks, there are 22xl 8 blocks. This means, only a maximum of 11 PEs could be used in parallel, and only for 8 rows all of them would be busy, with 56 time steps needed in total. In the graphs for RAST it can be seen how the maximum number of usable PEs is reached. In the second line, Figure 3 contains the results of running the MPI version on the Origin. It performs better than OpenMP, which can be explained by the additional overhead caused by the creation of parallel threads in the OpenMP version. MPI has no such overhead, and the MPI
198 SR8000/BM/MPI
SR8000/ICM/MPI
foreman - - - § garden .... ,,---mobile . . . . . . . . . . . . . . . . . . .
foreman - - - § garden .........
"
i I~
................
mobile
25 --,
20
~
15
........"ii............
......i.........i................ 2O
(/)
.......... ,.,...,~...: .... "~,:"~t;:~-,~.:~.:~.,,",,
.......... ..~:'~-~x--x-" "x:lf. -x'" ~,~,r 0
"
......~.;~-
""
......~:~" .+_.~-+-+-~.§ .+..~.~.+_.~..+..~ .;~.;~:~-v"+--r "... ~.+___~._+.
"'x "" "x--~.. x--x. -x.
.+_ +_..F..+..l_._,._+_~ ..v..z..p...F_+_.i...+._+ 5
10
15
20
25
30
35
5
10
Number of PEs
15
20
25
30
Number of PEs
Figure 4. Speedups with MPI on the SR8000.
SR8000/BM/Hybrid foreman garden
SR8000/ICM/Hybrid 35
---§
.... x----
foreman - - - + - - garden ......... mobile ................
30 25
2O (/)
......................... ill
15
............ iiiiiiiiii~;~//-...." ...............
20
iiiiii .
9
........ .....-:~~i~. . . . . . . .
.....:::i.(~::.:S. .-~-
...::~ 0
........... 0
,
i
i
,
i
,
5
10
15
20
25
30
Number of PEs
35
0
..... 0
i
n
n
i
i
,
5
10
15
20
25
30
35
Number of PEs
Figure 5. Speedups with hybrid MPI/OpenMP on 4 nodes of SR8000.
directives can efficiently be implemented with shared memory. Figure 4 shows the speedup results for BM and ICM on up to four nodes of the SR8000, where 8 processors can use shared memory inside one node, and therefore the inter-node communication costs have less impact. Still, the resulting speedups are lower than the ones from the MPI implementation on the Origin. The results for the version where both MPI and OpenMP are used is shown in figure 5, using four MPI nodes. Blocks are first distributed to the nodes with MPI, and then the computation of blocks on one node uses further OpenMP parallelization. With four nodes, there are few communication costs for MPI, and the OpenMP implementation works the same whether it is run on four nodes or on one. This makes it possible for the combined version to result in higher speedups than the MPI only version. 4. CONCLUSION In this work, we examined the parallelization of OBMC algorithms, and compared the results with classical, non-overlapped BM. The OBMC algorithms considered were RAST and ICM, and for comparison purposes also WBM was examined.
199 The algorithms were parallelized using OpenMP and MPI, and we presented speedup results for three different architectures. The MPI implementation on the hpcLine exhibited a communication overhead, which made it perform below the results of the MPI implementations on the Origin which has shared memory, and on the SR8000, with shared memory for 8 processors on a node. On the Origin, the MPI version even outperformed the OpenMP version, which has additional overhead for creating parallel threads. On the SR8000, performance could be improved compared to the MPI version by combining MPI and OpenMR REFERENCES
[1]
C. Auyeung, J. Kosmach, M. Orchard, and T. Kalafatis. Overlapped block motion compensation. In Visual Communications and Image Processing "92, volume 1818, pages 561-572, 1992. [2] S.-C. Cheng and H.-M. Hang. A comparison of block-matching algorithms mapped to systolic-array implementation. IEEE Transactions on Circuits and Systems for Video Technology, 7(5):741-757, October 1997. [3] S.B. Pan, S.S. Chae, and R.H. Park. VLSI architectures for block matching algorithms using systolic arrays. IEEE Transactions on Circuits and Systems for Video Technology, 6(1):67-73, February 1996. [4] E. Pschernig and A. Uhl. Parallel algorithms for overlapped block-matching motion compensation. In R. Trobec, P. Zinterhof, M. Vajter~ic, and A. Uhl, editors, Parallel Numerics "02- Theory and Applications (Proceedings of the International Workshop), pages 221233, Bled, Slovenia, October 2002. [5] J.K. Su and R.M. Mersereau. Non-iterative rate-constrained motion estimation for OBMC. In Proceedings of the IEEE International Conference on Image Processing (ICIP'97), pages II:33-xx, Santa Barbara, CA, USA, October 1997. [6] J.K. Su and R.M. Mersereau. Motion estimation methods for overlapped block motion compensation. IEEE Transactions on Image Processing, 9(9): 1509-1521, September 2000. [71 M. Tan, J. M. Siegel, and H. J. Siegel. Parallel implementation of block-based motion vector estimation for video compression on four parallel processing systems. Internaltional Journal of Parallel Programming, 27(3): 195-225, 1999. [8] F. Tischler and A. Uhl. Granularity levels in parallel block-matching motion compensation. In D. Kranzlmfiller, P. Kacsuk, J. Dongarra, and J. Volkert, editors, Recent advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI) - 9th European PVM/MPI Users Group Meeting, volume 2474 of Lecture Notes on Computer Science, pages 183 - 190. Springer-Verlag, September 2002. [9] H. Watanabe and S. Singhal. Windowed motion compensation. In Visual Communications and Image Processing '91, volume 1605, pages 582-589, 1991.
This Page Intentionally Left Blank
Parallel Computing:SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 ElsevierB.V. All rights reserved.
201
A comparison of OpenMP and MPI for neural network simulations on a SunFire 6800 A. Strey a aDepartment of Neural Information Processing University of Ulm, D-89069 Ulm, Germany This paper discusses several possibilities for the parallel implementation of a two-layer artificial neural network on a Symmetric Multiprocessor (SMP). Thread-parallel implementations based on OpenMP and process-parallel implementations based on the MPI communication library are compared. Different data and work partitioning strategies are investigated and the performance of all implementations is evaluated on a SunFire 6800. 1. MOTIVATION Since the publication of the computation-intensive error back-propagation training algorithm in 1986, many researchers studied the parallel simulation of artificial neural networks (see e.g. [ 11] for a good overview). The focus was first on neuron-parallel implementations that map the calculations of the neurons representing the main building blocks of neural networks onto different processors. Unfortunately, this biologically realistic implementation strategy suffers from a very high communication/computation ratio. That's why later the pattern-parallel simulation was proposed which replicates the neural network in each processor and distributes the training set instead. Here each processor adapts all neural network parameters according to the locally available patterns. Only after each training epoch, the accumulated local updates must be distributed to all processors to compute the new global updates. Thereby the communication/computation ratio can be reduced drastically. However this pattern-parallel implementation requires an epoch-based learning algorithm that modifies the behavior of the original algorithm. Furthermore it is available only for several neural network models. Thus the neuronparallel implementation is favored by many neural network researchers despite of its higher communication costs. Symmetric Multiprocessors (SMPs) represent an important parallel computer class. Here several CPUs and memories are closely coupled by a system bus or by a fast interconnect (e.g. a crossbar). Today SMPs are used as stand-alone parallel systems with up to about 64 CPUs or as high-performance compute nodes of clusters. SMPs offer a short latency (a few #s) and a very high memory bandwidth (several 100 Mbyte/s). Thus they seem to be well suited also for a communication-intensive neuron-parallel neural network implementation. However no performance analysis of neural network simulations on current SMPs has already been published. So in this paper the results of an experimental study about neuron-parallel implementations of a radial basis function (RBF) network on a SunFire 6800 are presented. Furthermore, two different important parallel programming paradigms that are available for current SMPs are compared: OpenMP for the generation of parallel threads [6] and the communication library MPI.
202 RBF training algorithm: U1
Um
_
m
)9
xj Y'~i=l (ui - cij yj = f ( : c j ) = e-X~/2~ " -- e - ~ ' ~' Zk - - E j L 1
(1)
(2) (3) (4)
yjWjk
5~z) = tk -- zk
(6) (7)
8j = 8j -- ?]sXjyj5~ y)
Wjk = Wjk + rlwYj(5~ z) cij = cij + ~7~(ui - c # ) 4u) Z1
y sj
(8)
Zn
Figure 1. Architecture and training algorithm of a m - h - n RBF network (with m input nodes, h RBF nodes and n linear output nodes)
2. T H E
RBF
NETWORK
The RBF network represents a typical artificial neural network model suitable for many approximation or classification tasks [8]. It consists of two neuron layers with different functionality. The RBF neurons in the first layer are fully connected to all input nodes (see Fig. 1). Here neuron j computes the squared Euclidean distance xj between an input vector u and the weight vector cj (represented by the jth column of the weight matrix U). A radial symmetric function f (typically a Gaussian function) is applied to :co and the resulting outputs yj of the RBF neurons are communicated via weighted links wok to the linear neurons of the output layer where the sum zk is calculated. After a proper initialization of the weights, the network is trained by a gradient-descent training algorithm that adapts simultaneously all weights c~0, wok and ~r0 (the center coordinates, heights and widths of the Gaussian bells) according to the error at the network outputs [10]. The complete RBF training algorithm is listed in the fight half of Fig. 1. I to/fromFireplaneAddressrouter
/ 41
Fireplaneaddressrouter _.-----------=-2 . . . . . .
- ~ UltraSparc UltraSparc[ r ~ [UltraSparc] III I III ~ . ~ ' ~ III I
41 ci
256 ....,,..:~2-,. Fireplanedala crossbar]
"1 addressrepeater 1<
17 I,l. i l l T
!1
L2( acheIlj .li iI
IL2cachel L2cacheI I ~il memory
met ory
memory
I L2cache memory
~ ) ~ boarddata switch . ~
A) system architecture (I/O nodes not shown)
I to/fi'omlzireplanedata crossbar B) architecture of a compute node
Figure 2. Architecture of the SunFire 6800 (simplified)
J
203 3. A R C H I T E C T U R E OF THE SUNFIRE 6800
All implementations are evaluated on a SunFire 6800 that represents a fast shared-memory architecture with a broadcast-based cache-coherency scheme. It contains 24 UltraSparc III CPUs operating at 900 MHz. The CPUs are connected by a special hierarchical crossbar-based network called FirePlane built by 256-bit data lines and 41-bit address lines. A fast address router (see Fig. 2A) replicates each address sent by one CPU in up to 15 clock cycles to all other CPUs that compare it with the addresses of the lines in the local cache [3]. Cache coherency is achieved by a MOESI protocol (i.e. a MESI protocol with an additional Owned state). In case of a cache hit the CPU that owns the requested data in its cache sends the data block via the hierarchical data crossbar network to the requesting CPU. The network bandwidth is up to 9.6 Gbyte/s. Each node is built of four CPUs that are arranged in two identical SMPs (see Fig. 2B). Two neighbor CPUs and their local memories are connected by a CPU data switch, two neighbor SMPs are connected by a board data switch. The memory latency (for accessing a 64-byte data block) is about 200ns, the maximum memory bandwidth is 2.4 Gbyte/s. 4. PARTITIONING OF THE RBF N E T W O R K
From a programmer's point a view, the SunFire 6800 can be considered either as a distributedmemory or a shared-memory architecture. In the former case, the user must explicitly map all data structures to the distributed memory. In the latter case, the user is not explicitly concerned with the data distribution. However he must identify all data variables that are private for each thread. As already mentioned in Section 1, a neuron-parallel implementation of the RBF network is preferred and analyzed here. Therefore, the two weight matrices C and W and the vectors x, y, z, 6 (z) and O(Y)describing specific neuron signals should be distributed over all p compute nodes in an appropriate way. Several possibilities have been suggested to distribute the variables of a two-layer neural network: A) The so-called partial sum method (see Fig. 3A) was proposed in 1988 by Pomerleau et al. [9]. Here the rows of both matrices C and W are distributed over all p nodes in the same way: Each node holds an equal number of hip columns from C and nip columns from W. It computes the hip local components o f x and y. For the calculation of the nip local elements of z and ~(z), each node must first gather all components of the distributed vector y by an allgather operation. To adapt the weights of the matrix C, the error ~(Y) in the hidden layer must be calculated first. The i-th element of 6 (y) is basically a dot product of the vector 6 (z) with the row i of the matrix W. However the elements of each row are distributed over all p nodes. To avoid an expensive redistribution of the elements, each node computes first a partial sum }--~ ~z)wjk. Thereafter all total sums 6 (y) can be computed by an allreduce operation. It requires the communication of ph elements, the calculation of h sums and the distribution of the sums to all p nodes. All remaining operations can be implemented locally. B) To eliminate the costly allreduce operation, Kerckhoff proposed the recomputation method [4]. Here the matrix W is stored twice: n/p columns and hip rows of W are mapped onto each node. Thus besides of the matrix W also the transposed matrix W T is distributed over all p nodes (see Fig. 3B). In consequence, the computation of fi(Y) can be
204 performed locally without any communication, if 6 (z) is available in each node. In total, two allgather operations are required here to gather the components of the vectors y and 6 (z) in all nodes. Compared to the partial sum method, the communication requirements are reduced especially for neural networks with large hidden layers. However the update of the weights wjk according to Eq. (7) of Fig. 1 must be calculated for both W and W T, leading to 3hn/p additional arithmetic operations. C) In the simplified method shown in Fig. 3C the rows of the matrix C and the columns of the matrix W are distributed over all nodes. Here only one communication operation is required: The vector z that must be available in each node is computed by an allreduce operation. All other operations can be performed locally. 5. IMPLEMENTATION The RBF network was implemented in C (Sun Forte 7 C compiler) with Sun MPI (HPC Cluster Tools 4). Compiler optimizations were switched on ( - f a s t -xarch-v8plusb -xtarget-ultra3). According to the three partitioning methods described in Section 4, three codes were generated and tested. The OpenMP implementation (Sun OpenMP from Forte Developer 7) of the RBF network was based on a fast sequential C code that was enhanced with OpenMP pragmas. Here also three versions were generated that reflect the different partitioning strategies. Although OpenMP does not allow a manual distribution of data elements, the f o r directives were inserted before the main loop constructs in such a way that the work is shared among the generated threads according to Figures 3A, 3B, and 3C. In addition, the original code was modified to reflect the transposed matrix W T for the simplified and the recomputation method. To reduce the overhead for creating the threads in each iteration, all threads were generated only once at the beginning. Furthermore, the nowa2 t attribute was inserted wherever the implicit barrier synchronization at the end of a parallel f o r loop was superfluous. In both cases the performance was measured on a SunFire 6800 running the operating system Solaris 9. The number of parallel processes or threads and the size of the RBF network were varied.
A) partialsummethod
B) recomputationmethod
U Po
U Po
~<
1:'2 h
"C Irn
allgather~ []~1
!
}<
w
allreduce*~-F + ..~.. ~(Y)y i '~
h p2-~C
allgaiher
U Po
~<
Wr
W T
~(Y)~7~ ~
Pl
h
P2
"]C Im
' I
h Z
C) simplifiedmethod
allreduce
l_! ~ i / ~ ! i I )
~
n
Figure 3. Three different data partitioning methods for an m-h-n RBF network
U z
fi(z~
205 6. P E R F O R M A N C E ANALYSIS For all experiments the speedup related to the optimized sequential version (without any OpenMP directives or MPI function calls) was calculated. Fig. 4 displays the speedup achieved by OpenMP-based and MPI-based parallel implementations for small and large RBF networks. It can be seen that the performance strongly depends on the selected data partitioning strategy: 9 For all OpenMP implementations the recomputation method delivers mostly a higher performance than the simplified method. Only with one or two threads the recomputation method is slower because of its higher computational effort. The partial sum method is extremely slow: the speedup always remains below 1. 9 For MPI implementations, the simplified method usually leads to the best performance. Especially for large neural networks an approximately linear speedup can be gained. The difference between the three methods is by far less significant than with OpenMP. 9 MPI implementations are mostly faster than OpenMP implementations.
12
:: ~o:,ooo::o,'~.~.~o~;B
'.
~.
~.
I
I
I
;. . . . .
; .....
L ....
-'1
',..-
~2
40-1000-10, method C [ [ _j i k1"" J 40--100--10, m e t h o d B l-i . . . . i. . . . . i - ~ . 7 ; i . . . . ;. . . . . 40-100-10, methodC ] "" "" ~.' B 40-1000-10, method A ]i .-7"" i j - ~ ' J . . . ... 4. 0 - .1 0 0.- 1 0 , m e t h o d A, .-" ] i~" "J ' q i J J !, i, i,
8
.....
4
'.
I
.... .... ---
10
i .....
i. . . . .
; .....
',
'~
~ .~,:~::C'~
i. . . . .
. . . . . ! . . . . . i - - . ~ - . ~ 7. ,. . . . .r
A:2.~-<_.
L2,_i
:4o',ooo:,o.m'~.od'C
|1 -I-I l0 |j_._ /I ..... 8 })~ ~ ] 1 . . . . .1
] .....
~
i c
6 + ....
J.. . . . .
!. . . . .
a. . . . . .
.L._.TW,._._
~. . . . . . .
i. . . . . ! . . . . ~ . . . . . ?, . . . . . ~ . . . . - i - - i f
4
',
'.
',
'.
~,..-~.]
: " - q, f.> 1 4 0 - 1 0 0 0 - 1 0 m e t h o d B / i. . i. . i. ' I..-t .... ~ ..... .L . . . . . L"--~---i 40-1000_10, methodA / i i i .-'ifi l 40-lO0-10, methodC / i { .-!~'J! ! / 4 0 - 1 0 0 - 1 0 ' m e t h o d B } - - } - - : ~ - - ' ~ - ,+. . . . . ~ . . . . {. . . . -I 4 0 - 110 0 - 1 01, m e t1h o d Al.]-~r ~.-'~ !' B/ !~ ~ ! -' - @ ~ | ' ! ....
+_>.
.~. . . . .
". . . .
.... i..... i--z ~-~--i-7-
~ ....
!---I
~ ......
q . . . . . t . . . . ~ . . . . . t . . . . . t- . . . . i . . . . v
2
a 1
~ 2
a 3
i 4
5
number
6
i
i
i
7
8
9
of parallel threads
', --i.... 10
11
12
I
~,
~
1
2
3
~ 4
{ 5
number
~ 6
~,
~,
7
8
~, 9
~
~
I
10
11
12
of parallel processes
Figure 4. Speedup measured on a SunFire 6800 for the simulation of an m-h-n RBF network with OpenMP (left) and MPI (fight) The lower performance of the OpenMP implementations was unexpected, because the SunFire 6800 is a symmetric multiprocessor and OpenMP is claimed to be a programming interface for such architectures. Also the advantage of the recomputation method that requires a lot of redundant arithmetic operations was surprising. Therefore, a detailed performance analysis was undertaken to find out the performance bottlenecks. To study the OpenMP overhead for process synchronization and scheduling, the OpenMP microbenchmark was applied [2]. Some results are shown in Fig. 5 (left). It can be seen that the synchronization overhead for must OpenMP directives is low and rather independent of the number of threads. The costly p a r a l l e l directive is applied only once at the beginning. The performance of the collective MPI operations was analyzed by the Pallas MPI benchmark suite [7]. From Fig. 5 (right) it becomes evident that on the SunFire 6800 the MPI function A11 r e d u c e is by far more efficient than its OpenMP counterpart. Furthermore, the function A1 l r e d u c e also allows the reduction of arrays by a
206 single call, whereas the OpenMP r e d u c t • on clause i s - according to the OpenMP specification of C/C++ version 1.0 - not available for arrays [6]. Therefore the recomputation method representing the only data partitioning method that requires absolutely no reduction achieves the highest performance.
20
I !- OpenMP parallel -- OpenMP single -- - - OpenMP for OpenMP, , barrier
~'15
j j l ......
.....
.....
...... i ..... i-~]177:Ji"-"-i ..... i .... i.....
40
I
I-
9 l0
..... i....-~.....i .....L~._.,>_I_~..... ,'.....L....i.... :.....!.....L....i....
. . . .5. . .
2
!
!
!
!
!
!
i
1
1
i
1
1
1
I
,
I
i
. . . .
/,
4
6
Lz-i--l-t-i "--'i---'J''--
-~--'T'--[
8 10 12 n u m b e r of parallel threads
I
i
i
i
i
i i i i i .... i..... ]i ! i i .... f!~[ i ] ....:::! . . . . . . . i .... [~.... i--~-@:--~ ..... '..... , l..... /i--:~F' i i..--,+'--4
..i .... i.-! ..... .il--y
i.--q-?--
!
20
1
1
10
I
14
O;ens176 MPIAllreduce, 4 b y t e " - - MPI Allreduce, 64 byte --- MPI Allreduce, 256 byte ...... MPIAllreduce, 512byte
16
2
!
'
i
i
i
i
i
i i
' i
i ;
' t
' i
' i
i i
4
, , i
i i
i i
! !
6 8 10 12 14 n u m b e r of parallel processes or threads
i l
16
Figure 5. Synchronization time for several OpenMP directives (left) and execution time for a reduction with OpenMP and MPI (right)
To avoid the costly OpenMP r e d u c t i o n clause, several encoding alternatives with other OpenMP directives have been implemented and analyzed. The best performance was achieved by calculating first all local sums in each thread and then adding the local sums to the total sums in one OpenMP c r i t i c a l section. Fig. 6 (left) shows the performance of the accordingly modified methods A' and C'. They are faster than the original versions A and C based on the reduction clause (compare Fig. 4). Now the simplified method C' is for up to 4 threads even slightly faster than the recomputation method B. However for more than 4 threads the recomputation method still gains a higher performance. Replacing the dynamically allocated data arrays by statically declared arrays of known size has lead to some surprising results: Whereas for the sequential reference code and for all MPIbased implementations the performance remained unchanged, the OpenMP-based implementations ran significantly faster (even if only one thread was employed). Even a superlinear speedup can be achieved here (see the dashed curve in Fig. 6 right). This effect is caused by the Sun C compiler that can better exploit the OpenMP directives in loops with static arrays to generate a faster code by using enhanced loop optimizations. As final experiment, OpenMP directives were inserted also in the MPI-based implementation (with static data arrays) and the number of threads per process was varied. The upper curve in Fig. 6 (right) shows that the mixed MPI/OpenMP implementation delivered a higher performance than the pure MPI or OpenMP solutions. However generating two or more threads per process proved to be not advantageous here.
207
R--"
-' 9 " ,,~ne<worKs,ze
i
TI-+
i .
12 10 ----
8
=~ 6
4
.....
.....
. . . . 40z100(~-lO.
methocl B
40-1000-10, 40-1000-10,
method method
i
i
i
C' A'
i
_ . . . . .} ,
i
i
i
i
i
i
~
i
i
;
i
i .....
i. . . . .
;
i
i
>.r
ii. . . . .
~~~ "
-
.
.... j .~q 9" ;
~
!
1
.
. . . . . . . .
' ~
2
I
2
'
_i . . . . .
.
.
'
.i-"
-!../U..-
3
4
5
6
number
,
....
l !
,
5
..... i i
i ;
i i
7
8
9
,
of parallel
,
threads
,
"
.-"
12 ...... .....
i.J-',
i l
i i
I
.
~
--- "U- -!! l
i ;
!
-"'
!
_
.
.
i
i i
i
.
i
. i i
!
<
I
i .... . j _ ~ _ . ~
l
~
.
' -~i~-
i." i i .-'i 5../-t-
i i
.
/f
I
.
"c'
10.....
~g'
.....
i
. . . . .
i ..... !
i
, i
, / ~
~
" .d,
i
i
l ....
i
i . . . . 4' . . . . . ~ ! ! /4",
I
i
!..... ! .... j ..... t ..... ~ . . . . . . .
, i
g~ ~
i
i
i
[MPI.~Op~:~1MP
i
;
.....
.
.
-
. .7 . .~
~ ,'
i
', . - "
..~.
i - - - ~ ' - ~ - - - - % / O p . O r r M P
ii
i l / ~ - i~ / /;
;..---;.... -ri
i .....
t . . . . . . . .
J.... ~ - . - . . L . - . - . i - .
i
i
,
..,:.-
,
l
'
._._iitl_._!_5_)~,;" ..... /d, ..'..".~7:"" ,!
i ..... i i
g .... i i
Z.~.e_i /..~ "~" i
i
,
10
11
12
1
].-*"
i .....
;
- +_:_-: ! -_d-.l =A'-
i;
',
i
/.,
,
! ....
i....--
ii -~i ....
. . . . . MPI+OpenMP. method C, static,. 1 t h r e a d p e r p r o c e s s
,
. . . . 4' . . . . . , , i i
,
i
OpenMP, method i -- - - MPI, method C i OpenMP, method
2
3
4
5
number
....~ "
i _.~--~i..~ . . . . . 'r--.7-
~
.A", ~ - - , + .
i
of parallel
6
7 processes
8
9
B, static B, dynamic
10
11
12
or threads
Figure 6. Performance of several implementation altematives: using the c r i t i c a l directive instead of the r e d u c t i o n clause (left), using static instead of dynamic arrays (right)
7.
CONCLUSION
It was shown that the SunFire 6800 as a typical SMP is basically well suited for the communication-intensive neuron-parallel implementation of artificial neural networks. However non-trivial mappings of the neural network data structures and several resulting code modifications are required to achieve a high performance. Some strange artefacts with OpenMP for the Sun C compiler were presented. The MPI-based implementations turned out to be mostly slightly faster than the OpenMP-based implementations. This corresponds to the results of some related studies, that also compare MPI and OpenMP on shared-memory architectures, but for other applications (e.g. [1 ] [5]). The reason for the lower speedup of OpenMP-based implementation seems to be the unavoidable false cache sharing of data arrays that are frequently modified by several threads. This will be a topic of future investigation. REFERENCES
[1]
[2] [3] [4] [5]
[6]
M. Bane, R. Keller, M. Pettipher, and I. Smith. A Comparison of MPI and OpenMP Implementations of a Finite Element Analysis Code. In Proceedings of Cray User Group Summit (CUG 2000), May 22-26, 2000. J.M. Bull. Measuring Synchronisation and Scheduling Overheads in OpenMP. In Proceedings of First European Workshop on OpenMP, Lund, Sweden, pages 99-105, 1999. A.E. Charlesworth. The Sun Fireplane System Interconnect. In Proceedings of Supercomputing, 2001. E.J.H. Kerckhoffs, F.W. Wedman, and E.E.E. Frietman. Speeding up backpropagation training on a hypercube computer. Neurocomputing, 4:43-63, 1992. G. Krawezik and F. Cappello. Performance Comparison of MPI and three OpenMP Programming Styles on Shared Memory Multiprocessors. In Proceedings of the 15. ACM Symposium on Parallel Algorithms (SPAA 2003), pages 118-127, 2003. OpenMP C/C++ Specification Version 1.0. OpenMP Architecture Review Board, available from
208 h t t p : //www. openmp, o r g , 1998. Pallas MPI Benchmark Suite Version 2.2. Pallas GmbH, Bdihl (Germany), available from http ://www. pallas, com/e/products/pmb. [8] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78:1481-1497, 1990. [9] D.A. Pomerleau, G.L. Gusciora, D.S. Touretzky, and H.T. Kung. Neural network simulation at warp speed: How we got 17 million connections per second. In Proc. 1988 IEEE International Conference on Neural Networks, pages 143-150, San Diego, Ca., 1988. [ 10] F. Schwenker, H.K. Kestler, and G. Palm. Three Learning Phases for Radial Basis Function Networks. Neural Networks, 14:439-458, 2001. [11] N.B. ~erbed~ija. Simulating Artificial Neural Networks on Parallel Architectures. Computer, 29(3):56-63, 1996. [7]
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
209
Comparison of Parallel Implementations of Runge-Kutta Solvers" Message Passing vs. Threads M. Korch a and T. Rauber a ~Faculty of Mathematics, Physics, and Computer Science, University of Bayreuth D-95440 Bayreuth, Germany We investigate the influence of different programming environments on the execution of parallel programs. Particularly, we consider the parallel solution of systems of ordinary differential equations (ODEs) by explicit Runge-Kutta (RK) methods, such as embedded and iterated RK methods. For that purpose, experiments with C implementations using the MPI library and the POSIX thread library and with a Java threads implementation have been performed. 1. INTRODUCTION The different communication requirements of modem parallel computers are supported by different programming environments. On distributed memory machines (DMMs), typically message passing libraries like MPI [11] are employed. Parallel processes on shared memory machines (SMMs) typically consist of several threads that are managed by, for example, the POSIX thread (Pthread) library [3] or OpenMR But it is also possible to use message passing on a SMM or to provide a virtually shared address space on a DMM (see [ 1], for example). Explicit RK methods, e.g., embedded and iterated RK methods, are suitable for the solution of non-stiff initial value problems (IVPs) of ODEs y ' ( t ) - f(t,y(t)),
y ( t o ) = Yo,
y'R~R
n,
f'R•
n - - + R n.
(1)
They execute a number of time steps to compute an approximation to the solution function y(t) in the integration interval [to, to + HI. At each time step n, an approximation ~7~+1to the function value y(t,~+l) is computed. Efficient methods adapt the stepsize h~ depending on the local error, which can be estimated by, e.g., using a second approximation ~)~+l to y(t~+x) of lower order. Embedded RK methods [5] use several stages to compute ~7~+1and ~)~+1" 1-1
vt = f(t. + clh,, ~7~+ h~ E
auvi),
1 - 1 , . . . , s,
i=1
~]~+l = ~]~+ h~ ~ blvl, /=1
(2)
~+l -- ~]~+ hn s
blvl.
l=1
The coefficients aij, ci, bi, and bi are determined by the RK method used. Popular embedded RK methods are the methods of Dormand/Prince and Fehlberg [5]. Embedded RK methods are suitable for a data parallel solution exploiting parallelism across the system.
210 Iterated RK (IRK) methods [12] use a fixed-point iteration scheme based on an implicit RK method with a fixed number m of iterations to produce the approximations #}d)' J = 1 , . . . , m, to the stage vectors v~, l - 1,... , s; the last approximation #(m)l is used to compute the solution function r/~+a" #}o) = f(t~ + czh~, r/,~), o
#lJ) = f (t~ + cth~, rl~ + h~ s
ali#(j_l)),i
1 = 1 , . . . , s, j = 1 , . . . , m,
i=1
r/~+l = rl~ -Jr-h~
bl#(m), /=1
~+1 -- rl~ + h~
(3)
bl#(m-l). /=1
This computation scheme includes an additional source of parallelism since the approximations #}j) can be computed in parallel by separate groups of processors. In the following, we compare different implementations of embedded and iterated RK methods for shared and distributed address space on a shared memory SMP equipped with 8 processors. 2. R E L A T E D W O R K
Several comparisons of parallel programming environments and communication libraries have been performed in the past, leading to partially different conclusions. Among them, Hess et al. [7] compare OpenMP on top of the DSM system SCASH with MPI on a cluster of Pentium III nodes interconnected by a Myrinet network. They observed that, except for small problem sizes, the speed-ups of OpenMP/DSM range between 100 % and 70 % of the MPI speed-ups. Armstrong et al. [2] investigate the performance of MPI and OpenMP on a Sun Ultra Enterprise 4000. In their experiments, both models obtained very similar results. Luecke and Lin [9] compare the scalability of MPI and OpenMP on an SGI Origin 2000. They obtained a better performance using MPI but expect the performance of OpenMP to improve with upcoming improvements of the compiler. Kuhn et al. [8] derive multithreaded parallel implementations of a program called Genehunter using OpenMP and Pthreads. Though they observe a similar performance of both implementations, they favor OpenMP as it permits an easier parallelization of existing sequential code. Van Voorst and Seidel [ 13] compare different MPI implementations on a Sun Enterprise 4500. They discovered significant differences in the performance characteristics of the implementations. Sun's HPC 3.0 implementation offered the best performance compared to LAM and MPICH. 3. I M P L E M E N T A T I O N 3.1. Embedded RK methods
Two C implementations of an embedded RK solver are considered: a message-passing implementation using MPI, and a multithreaded implementation based on Pthreads. Since the evaluation of the fight hand side function f may access all components of the argument vector w, this vector must be made available to all other processors at every stage. MPI provides a multibroadcast operation for this purpose. Typically, the execution time of this operation grows linearly with the number of processors, thus setting limitations to scalability. The Pthreads
211 implementation needs no explicit data transfer since the stage vectors and the argument vector are declared as shared variables. Therefore, barrier synchronization is sufficient to delay faster processors until all components of w have been computed and, later, until the computation of the current stage has been completed. The execution time of the barrier operation also increases with the number of processors. The locality of the embedded RK implementations increases with the number of processors p as the working space of each processor consists of n + (s + 3)nip vector components. 3.2. Iterated RK methods
We exploit the different sources of parallelism available by investigating three implementations: 9 SD (Standard, Delayed) Purely data-parallel implementation. A transformation of (3) allows for storing the argument vectors c~lj) instead of the stage vectors #lJ)" Each processor performs (m + 1)sn/p function evaluations during the iteration and to compute ~7~+1 after the iteration. Further function evaluations are required during the execution of the error control mechanism. The working space of SD consists of sn + (s + 2)nip vector components. The computed parts of the argument vectors crlj) have to be made available to all processors at every iteration step. Using MPI, this leads to the execution of s multibroadcast operations for vectors of size n. In a multithreaded environment, only one barrier operation at every iteration step is necessary. In the MPI implementation, the n components of ~/~+a also need to be made available to all other processors by a multibroadcast because they are required at the beginning of the next time step. 9 GS (Group, Standard) Mixed task- and data-parallel implementation using s processor groups that compute the s stages in parallel. Direct implementation of computation scheme (3). As in SD, the number of function evaluations performed is (m + 1)sn/p, but no additional function evaluations are executed by the error control. The working space consists of n + (s 2 + 3s + 2)nip vector components. The communication costs of implementation GS exceed those of SD because, additionally, the argument vector owned by group G needs to be made available to the other processors in G before it can be used in the function evaluation. In addition, this requires one group multibroadcast of n components or a group-based barrier, respectively. 9 GD (Group, Delayed) Mixed task- and data-parallel implementation similar to implementation GS. But, as in SD, the argument vectors are stored instead of the stage vectors. Therefore, the number of (ms + 1)sn/p function evaluations required during the iteration and to compute ~/~+1 after the iteration is higher than that of GS and SD. As in SD, additional function evaluations are required during the error control. The working space extends to 2sn + 2(s + 1)n/p components. At the end of each iteration step i a multibroadcast of sn components of cr(~) must be performed in a message-passing implementation or a barrier operation in a multithreaded implementation, respectively. Additionally, a multibroadcast of ~7~+1 after the iteration is necessary in a message-passing implementation.
212 x 107 ,
6[-- ......................................................
~--':-i--~MPi, N = i 2 8 . . . . . . . Ii
<
} \ 5l-/ /
--*-+~.:.~ 9 ..... -~---
500
= 4001
~ ?
:9
. . . . . . . . i . . . . . . . . . . . . . . . . . . . . . . !...... ~ : : : ~ ~
~oc ..
2
3
4 5 Number of processors
6
7
8
MPI, N=384 I! Pthreads, N=128 I:: Pthreads, N=256 Pthreads, N=384 i
L~
g
uJ 200
1
i
2
. . . . . . . . . . i. . . . . . . . . . . . i . . . . . . . . . . . i. . . . . . . . . . . . i . . . . . . . . . . . i . . . . . . . . . . . i. . . . . . . . . . . . i
~
..~~_~__o
1
2
.............~ 3
- . - ~ - o ~ ~ . . ~
4 5 Number of processors
6
........... i
7
8
Figure 1. Execution times of the embedded Figure 2. L2 cache misses of the embedded RK methods for the Brusselator equation us- RK methods for the Brusselator equation using ing DOPRI8(7) and H = 4.0. DOPRI8(7) and H = 0.5. 4. RUNTIME EXPERIMENTS AND ANALYSIS We have performed runtime experiments using different implementations of embedded and iterated RK methods on a Sun Fire 3800 SMP equipped with 8 UltraSPARC III processors at 900 MHz that provides a shared address space. The processors are interconnected by the Sun Fireplane system interconnect [4], which produces non-uniform memory access times depending on the memory bank in which the value considered resides. We use the MPI implementation provided by Sun's HPC 4.0. In addition to the experiments on the Sun Fire, we also investigate the behavior of the MPI implementations on a Cray T3E-1200, which provides a native environment for message-passing applications. As a typical sparse ODE, we choose the discretized 2D-Brusselator equation [5] that describes the reaction of two chemical substances. A standard five-point-star discretization of the spatial derivatives on a uniform N x N grid leads to an ODE system of dimension 2N 2. An evaluation of one component of the Brusselator function accesses only five components of the argument vector which are determined by the five-point star. This results in a constant execution time of the function evaluation of one component, independent of the size of the ODE system. The second example we consider is a dense synthetic ODE. Here, the evaluation time of one component increases linearly with the system size. 4.1. Embedded RK methods We have executed experiments with the Brusselator equation using different RK methods and different integration intervals. The results measured with DOPRIS(7) on the Sun Fire are shown in Figure 1. The execution times measured using parameters where many communication operations are executed indicate a performance advantage of the Pthread library. For DOPRI8(7) and H = 4.0, the Pthreads implementation obtains a superlinear speed-up of 10.3 with 8 threads. With these parameters, the MPI implementation obtains a worse, but still very good speed-up of 7.71. For smaller numbers of processors, the speed-ups often also are superlinear. Using the same parameters on a maximum of 8 processors, only a speed-up of 5.66 could be obtained on the Cray T3E.
213 1 0 -1
........
:::.::.::
1 - z ..--:-::: ....... ~
:::::;:..:
:::.:::
;:::::.
..........................
. l o -~, .......... i,~:
:.:.::::.::
.....
i .................
: ::~;~:~i;i!!i!!!iii!i!il
: ....
! .......
~- : :
500
............
450
~'~.-,4~ . . . . . . . . . . . / . . . . . . ! ~%L.~.~~-~ /
9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
400
2/
.
~. . . . . . . . . . . . .
Pthreads,
. . . -e. . . . d_4--..:~.~:..~/~t:*~V,< +
Pthreads,
J. .S. . O
. . . . . . . . . . . . . . . . . . . . .
.
GD GS
Java,Java'GsGO
u~
i!~!!!!!!!!!!i!!!:!!
350 300
......... L~I_
. . . . . . . . . . . . _~---e----
5 '
10-
!/:::i:: /
.
/!!ii : 9: : : : : : : :
10-"1
.
..........
}:i .
::
i....
.
.
.
!: .
.
._ ....... :
!!:.:Z!:!k:!:i!!! .......
!: ::!!!!i:!!! ::
! .........
:::i::::::::
of
!!!:.!
: i
.i
;
-e-
~
mbcast, mbcast, mbcast,
n=l n=lO0 n=lO000
-.~§
mbcast,
n=lO00000
!:i:i! - -'!:
:. . . . . . . .
Number
:
~ - ~ - - ~ .
:
:::.:!!!:..!i.:::.!!!
250 "
processors
Figure 3. Comparison of the execution times of the barrier operation and the multibroadcast operation for different message sizes (n is the number of double values transferred),
200 ~
"
~ :
i
1oo
t00
.
5
'%~--S
10
15
Number
20 of
threads
25
.
.
.
. . . .
30
35
Figure 4. Execution times of the multithreaded implementations of the iterated RK methods for large numbers of threads using the Brusselator equation and Radau IA(5).
The performance advantage of the Pthreads implementations originates from the lower execution time of the barrier operation compared with the multibroadcast operation used in the MPI implementation (see Figure 3), because the execution time of a barrier only depends on the number of processors since no data is transferred. Comparing the overhead of the parallel implementations on one processor in relation to a sequential implementation, an overhead of the MPI implementation of 13.5 % can be observed while the Pthreads implementation only leads to an overhead of 3.2 %. The superlinear speed-ups are due to cache effects. Figure 2 shows the sums of the L2 cache misses over all processors measured for DOPRIS(7) using H = 0.5. A similar behavior can be observed for the L 1 data cache (not shown). Decreasing curves, as those for N = 384, allow for superlinear speed-ups. The Pthreads implementations show a better locality behavior than the MPI implementations, because internal data structures of the MPI library increase the working space. In our experiments with the Brusselator equation, using more threads than processors available leads to a significant deterioration of the performance of the multithreaded implementations. 4.2. Iterated RK methods
In our experiments with the iterated RK methods, we use implementation variants written in C and in Java (JVM version 1.4). To perform communication, we use MPI and the Pthread library for the C implementations, and Java threads [10] for the Java implementations. We use the integration interval H = 4.0 for the Brusselator equation and H = 0.5 for the synthetic equation, and we execute m = 4 iterations. As RK methods, we use Radau IA(5) and Radau IIA(5) with s = 3 stages [6]. Figure 5 shows execution times for both equations measured using Radau IA(5) on the Sun Fire. In Fig. 4, the execution times for the Brusselator equation and Radau IA(5) with large numbers of threads are displayed. The results obtained for Radau IIA(5) are not discussed in this paper because they show only minor differences since they use the same number of stages.
214 looo[ ......................................................... ~. 900I\ ..........
.........
i. . . . . . . . . . .
! ...........
i. . . . . . . . . . . .
!.........
i '. . . . . . . . . . . .
:: :. . . . . . . . . . . .
i :...........
-+~-
i
i
i
i
-e-
i i ...........
,3..... i ! ! ....................~L--i--..J-:-.i.~i..i~i~ . . . . . . . .
~ -. . . .
Pthreads, SD Pthreads, GD Pthreads, GS Java, SD Java, GO
:
i
~\
+
J ....
'~\,
i
i
i '\
i
\
700~- . . . . . .
/
5oot-~
\
\~
\ i
.... !
'"~\.- . . . . ! . . . . . . . . . .
i. . . . . . . . . . . .
i..... ~--i-
140r"
-eMPI GD --*-- MPI, GS
/ \ :: 800~-.. \ . . . . . . . : . . . . . . . . . . . [
~=
~.~-~
..........
..........
i. . . . . . . . . . .
! ...........
I
......................
..........
i. . . . . . . . . . . .
!
i ...................................
i
.........
,00i ~ ~............i..........................",~....12_ ~v~~ ' ~176 I
GS
i
i. . . . . . . . . . .
! w
300
.....
: ........
"~--/~ . . . . . . . . .
~ool.....................~ ~ 1
2
3
4 Number
:.-. :;~ . . . . .
~ of
: ...........
~ 5
processors
:. . . . . . . . . . .
/>
~ 6
(a) Brusselator, N = 384
20
7
8
0
....................................................
2
3
4 Number
5 of processors
.
.
6
7
8
(b) synthetic ODE, n = 10000
Figure 5. Execution times of the iterated RK methods using Radau IA(5). Because the execution times of GS and GD are determined by the groups with the smallest numbers of processors, the execution times of both implementations show a leap at the transition from p with p rood s = s - 1 to a multiple of s. GD is deficiently slow in all experiments because of the higher number of function evaluations. GS can obtain similarly good speed-ups as SD if the number of processors is a multiple of s, because the number of function evaluations performed is similar. However, the communication costs of GS are higher and the working space of GS is larger than that of SD for small numbers of processors. Therefore, the speedups measured for GS with the Brusselator equation on the Sun Fire are often smaller than those of SD. Function evaluations of the synthetic ODE are more expensive, which leads to better execution times of GS for this ODE. On the Cray T3E, the MPI implementations of GS outperform SD using either of the two ODEs. The Pthreads implementations of the IRK methods often perform better than the MPI implementations on the Sun Fire. The main reason is the longer execution time of the multibroadcast compared with the barrier operation (see Figure 3). Using the Brusselator equation with N = 384 and Radau IA(5), the best Pthreads implementation, SD, reaches a speed-up of 5.56. The MPI implementation of SD only obtains 4.01. The group-based implementations are slower. The Pthreads implementation of GS reaches a speed-up of 4.35 using 6 processors, while its MPI counterpart obtains 3.50 on 7 processors. The speed-ups of GD are 2.55 for the Pthreads version and 2.38 for the MPI version. Using one processor, the Pthreads implementation of SD runs 11.8 % slower than a sequential implementation of computation scheme (3). The MPI implementation is slowed down by 17.1%. For the synthetic ODE, the differences in the speed-ups between the different programming environments are less significant, because in our examples the size of the ODE system is smaller and less time steps are executed. Therefore, the execution times are mainly determined by the number of function evaluations; the influence of the communication library used is insignificant. Because Java programs are compiled to intermediate code that is interpreted at runtime, they are slower than C programs. In our experiments, the performance of the Java implementations improves if the system size is increased. For N = 384, the Java implementations are only 1.6 to 2.2 times slower than the C implementations. Considering the Brusselator equation, the speed-
215 ups are similar to the corresponding C/MPI implementations. But the Java implementations often reach their maximum speed-ups with only 6 or 7 threads. This is caused by the Java Virtual Machine as it executes several threads in the background (garbage collector, Just-InTime compiler etc.). Using N = 384 and Radau IA(5), the Java implementation of SD obtains its maximum speed-up of 4.67 using 6 threads. We have also investigated the situation where there are more threads than physical processors. All Java implementations and the C implementations of SD are slowed down by using additional threads. But the group-based C implementations often profit from additional threads because these threads compensate for a load imbalance caused by the fact that the number of physical processors is not a multiple of the number of stages. The best speed-ups for the group-based Pthreads implementations have been measured using 30 threads. For the synthetic ODE, using this many threads even enables the Pthreads implementation of GS to outperform all other implementation variants. Because of the lower clock rate, the execution times measured on the Cray T3E are about twice as long as on the Sun Fire. Also, the speed-ups are lower because of the slower communication network. For example, using the Brusselator equation and N = 384, implementation SD obtains a speed-up of 3.48 on 8 processors of this machine, while a speed-up of 4.01 can be measured on the Sun Fire. 5. C O N C L U S I O N S Our experiments with embedded and iterated RK methods on a Sun Fire SMP have shown that C/Pthreads implementations are often faster than C/MPI implementations because the multibroadcast operations used in the MPI implementations are more expensive than the barrier operations required in the Pthreads implementations. Though the Java implementations are slower than the C implementations, they show similar scalability. Experiments on a Cray T3E have resulted in lower speed-ups due to the slower communication network. Thus, small or medium size SMPs seem to be a more efficient environment for the execution of embedded and iterated RK methods. ACKNOWLEDGMENTS
We thank the University Halle-Wittenberg for providing access to the Sun Fire 3800 and the NIC Jtilich for providing access to the Cray T3E. In particular, we are grateful to Dr. Gerald M6bius at the computer and network center of the University Halle-Wittenberg for his technical support. REFERENCES
[ 1] C. Amza, A. L. Cox, S. Dwarkadas, R Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2): 1828, February 1996. [2] B. Armstrong, S. Wook Kim, and R. Eigenmann. Quantifying differences between OpenMP and MPI using a large-scale application suite. In High Performance Computing: Third International Symposium (LNCS 1940), page 482ff. Springer, October 2000. [3] D.R. Butenhof. Programming with POSIX Threads. Addison Wesley, 1997.
216 [4] [5] [6] [7]
[8] [9]
[10] [ 11 ] [12] [ 13]
A. Charlesworth. The Sun Fireplane system interconnect. In SC2001: High Performance Networking and Computing. ACM Press and IEEE Computer Society Press, 2001. E. Hairer, S. P. Norsett, and G. Wanner. Solving Ordinary Differential Equations I." NonstiffProblems. Springer, Berlin, 2nd edition, 2000. E. Hairer and G. Wanner. Solving Ordinary Differential Equations II." Stiff and DifferentialAlgebraic Problems. Springer, Berlin, 2nd edition, 1996. M. Hess, G. Jost, M. S. Mfiller, and R. Rfihle. Experiences using OpenMP based on compiler directed software DSM on a PC cluster. In WOMPAT2002: Workshop on OpenMP Applications and Tools (LNCS 2716). Springer, 2002. B. Kuhn, P. Petersen, and E. O'Toole. OpenMP versus threading in C/C++. Concurrency and Computation: Practice and Experience, 12(12): 1165-1176, 2000. G.R. Luecke and W.-H. Lin. Scalability and performance of OpenMP and MPI on a 128-processor SGI Origin 2000. Concurrency and Computation: Practice and Experience, 13(10):905-928, August 2001. S. Oaks and H. Wong. Java Threads. O'Reilly, 2nd edition, January 1999. M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra. MP1 the complete reference. MIT Press, Cambridge, Mass., 2nd edition, 1998. P. J. van der Houwen and B. P. Sommeijer. Parallel iteration of high-order Runge-Kutta methods with stepsize control. Journal of Computational and Applied Mathematics, 29:111-127, 1990. B. VanVoorst and S. Seidel. Comparison of MPI implementations on a shared memory machine. In Parallel and Distributed Processing: 15 IPDPS 2000 Workshops (LNCS 1800), page 847ff. Springer, 2000.
Scheduling
This Page Intentionally Left Blank
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
219
E x t e n d i n g the D i v i s i b l e T a s k M o d e l for W o r k l o a d B a l a n c i n g in C l u s t e r s u. Rerrer ~, O. Kao ~, and F. Drews b ~University of Paderborn Faculty of Computer Science, Electrical Engineering and Mathematics Fuerstenallee 11, 33102 Paderborn, Germany Email: {urerrer, okao}@upb.de bOhio University Center for Intelligent, Distributed and Dependable Systems Athens, Ohio 45701, USA Email: [email protected] During the last years considerable effort has been put into developing scheduling and load balancing techniques and models for the processing of tasks in parallel and distributed systems. In this paper we extend an existing scheduling model called the divisible task model to adapt it to cluster architectures and make it applicable to a broader class of applications. To achieve this we present two new modifications which allows recursive modelling and will increase the models flexibility and performance. The goal of this work is to reduce the data volume which has to be transfered over the communication links to get an equally balanced load on the nodes of the system. As test-system we use a cluster-based image retrieval system to check the achieved performance. 1. INTRODUCTION In the last decades, considerable effort has been put into developing scheduling and load balancing techniques and models for the processing of tasks in parallel and distributed systems with the objective of increasing the efficiency of such systems. The divisible task model (DTM) goes back to the work of Cheng and Robertazzi [ 1] and the more recent work of Drozdowski et al. [6],[7]. It was motivated by problems of searching for records in huge databases, the distributed sorting of database files and a broad field of related problems [9]. The divisible task model describes the distribution of data on the nodes of a parallel system, which us described formally by m processing elements (PEs) 7~ = {P1, P 2 , . . . , Pro}. A divisible task is referred to as a computational unit with a size of V E R elements that can be divided into parts of arbitrary size such that the single parts have a fine-grained granularity and can be handled by the processing elements (PE). Since there are no data dependencies, the parts can be processed in parallel. Initially, the whole workload is located at one processing element which is referred to as originator. Let P1 be the originator. It processes locally an amount of ~1 E R data units and sends the remaining V - c~1 data units to its neighbors, simultaneously. Correspondingly, each
220 neighbor PE P~ processes exactly ai c • data units. The time to process ai data units, which depends on the individual speed of PE i (given as constant Ai E I~>~ is aiAi time units. The time to transmit/3 C R data elements from one PE to another over a communication link j is Sj +/3Cj time units. The constant Sj represents the communication overhead and Cj specifies the transmission time. The divisible task schedulingproblem aims at finding an optimal distribution of the amounts of data units ai such that a given objective function is maximized or minimized, respectively. Popular parallel architecutres, like clusters, provide a cost-effective form of parallel computing power for computational science and commercial applications. However, to gain profit from this computing power the load to be processed has to be balanced equally across the nodes. The application of DTM to cluster architectures requires a central master node which poses a bottleneck in many applications (e.g., applications that require the transfer of large amounts of data between the nodes). An overload of both communicating nodes and communication links could result in a long period of time for which processing elements are busy and hence reduce the efficiency of the parallel execution. Geisler [2] examined this behavior for cluster-based multimedia applications. An attempt to directly apply the DTM model to cluster architectures, however, lead to poor results [7]. This fact motivated our new approach. In this work we extend we classical DTM by providing support for different originators. The main goal of our modifications is the reduction of the transfer volume over the communication links and an equally balanced load. The efficiency of these modifications is examined in detail on the cluster-based image database CAIRO [3]. The paper is organized as follows. In section 2 we present and describe several aspects of adaptation of the DTM to cluster architectures. Section 3 describes the application of the modified model to a cluster-based image database together with some relations to other existing strategies. Section 4 describes some conclusions and aspects of our future work. 2. ADAPTION OF DTM TO CLUSTER ARCHITECTURES Cluster architectures provide a cost-effective form of parallel computing power for computational science and commercial applications. However, to gain profit from this computing power the workload has to be balanced equally over the cluster nodes. The application of DTM to cluster architectures requires a central master node which, in many applications, may become a bottleneck (e.g., applications that require the transfer of large amounts of data between the nodes). An overload of both communicating nodes and communication links could result in long periods of time for which PEs are busy and hence reduce the efficiency of the parallel execution. Geisler et al. examined this behavior for cluster-based multimedia applications [2]. An attempt to directly apply the classical DTM model to problems of data distribution in cluster architectures, however, led to disappointing results [7]. This fact motivated our research. 
In this work we present the following extensions of the classical DTM model to be applied to cluster architectures: 9 Provide the possibility for the initial data distribution to be altered ~ Provide support for different originators The main goal of our modifications is to relieve the network bottleneck by reducing the overall volume of data transmitted onto communication links and to equally balance the load on the
221
PE'
Is,
1
[i ~i
commun,c.t,on
.....
iii ii i~ii!i!iiiii !i~ $2 ..........................................................
t!i ~ 9
'~, i iil f :(V ai~)C2 ....................
f iii
~!~iii! !!ii !iiii!iiili!i!ii)~ ii
t
ili ,
i!iii
communication
!iii!
2iii
iiii
i
1 s ,l
m-1
] i~!
1~176176176176176176176 !iiiiiiiiiiiiiill !!i!iiiiiiiiii
0
i!!i time
Figure 1. Classical DTM
PEs. The efficiency of these modifications is examined in detail on the cluster-based image database Cairo [3]. The classical DTM has a great variety of of possible settings to adapt the algorithms to its environment. In its easiest form the DTM consists of processors connected via a linear array network [6]. Each Processor Pi E 7) has a network processor and the originator is the first processor Pa. The part of the load which is not processed by the originator is send to the nearest neighbor. The neighbor divides the received data into a part processed locally and forwards the rest to the next idle processor. This procedure is repeated until the last processor is activated as you can see in figure 1. Furthermore, we assume that there is no returning of data. Thus, all the processors must stop computing a the same moment of time [6]. For this case Blazewicz et al. formulated the following set of equations [5]:
aiAi
=
Si %- (OLi+1%- Cti+2 %-''' %-am) Ci %-cti+lAi+l;
V
--
oL1%- OL2 %- "'" %- a m ;
OL1,a 2 , . . . , a ~ __> 0
i -- 1, 2 , . . . , rn -- 1
(1) (2)
where Si, Ci are the startup time and transmission rate of a link joining Pi and Pi+ 1, and Ai is the processing rate of Pi. The above equation can be solved in O(ra) [5]. In our first approach we relax the restriction to only one originator and provide the support of several originators. It turns out that a broad range of network topologies can be considered by variation of the connection topologies. Moreover, this approach permits to assign a PE not only to one but to several sub-networks (i.e., structures consisting of an originator and different PEs). However, we have to deal with the problem of the communication and coordination of the originator nodes: For a global retum of results we have to compose the results of the sub-networks to provide the user with a unique global result of the overall system.
222 Given an homogenous architecture, one possibility to reduce the transfer volume is to initially provide each processing node with an equal amount of data elements to be processed. This results in our second approach by providing the possibility for an alternated initial data distribution 9 To this end, we represent the amount of data units for a single PE i by (ai + a'i) where ai specifies the initial amount of data units and a'~ gives the additional amount of data units due to the data redistribution 9 It follows that the total amount that a PE i requires to process is given by (ai + a'i)Ai. The main goal of the redistribution is to minimize the overall execution time of the system 9 To adapt the classical DTM model to our requirements we make the following assumptions to make the problem easier 9 We suppose a linear array as a model of the communication structure, store-and-forward data transmission. Moreover, we neglect the return of the results to the originator nodes as usually only a small amount of data has to be communicated. The PEs can be grouped into linear chains 9 Let h be the number of chains consisting of m l, r a g , . . . , m~;i - 1, 2 , . . . , s PEs. In matrix notation the PEs can be described as:
Pll, P21,
P12, P22,
..., ...,
9
Psi,
Plml
P2m2 9
P~2, . . . ,
P~m~
According to this notation, we have more than one originator. The originator nodes are PII, P 2 1 , . . . , Psi form the first column of the matrix 9 Each originator node is the first PE in its chain. The originator nodes contain the overall system load Vtotal -- V1 § V2 + ' ' " § Vs where V/, 1 < i < s describes the initial load of PE i, 1 < i < s. The remaining parameters of the model are modified in terms of the new number of PEs (i.e., PE Pij has to process aij data units, it lasts Aij time units to process one data element, and a communication requires Sij +/3 9C~j time units). For simplicity we assume first that no results are returned to the originator. Furthermore, the model permits data transfer among the originator nodes in order to achieve a balance of the load. Now for the second approach we want to present a solution to the problem of finding an optimal distribution of the initial load V~ for each processing element Pi. This is the basis to alter the initial load distribution.
For the aforementioned problem, the equations (2) can be reformulated as follows:
cti,jAi,j
=
~i,j §
(ozi,j+l § ozi,j+2 § " " -t- oLi,mi)Vi,j -Jr- o L i , j + l A i , j + l i = 1,2,...,s;j = 1,2,...,mi1
(3)
V
=
Vl -qt-V2-~-' 9 §
~ri
=
Cti, 1 § OLi,2 § ' ' " § OLi,mi
(4) (5)
c~i,1, c~i,2,..., c~i,m~ >_ O; i = 1, 2 , . . . , s Given a distribution of the initial load 1/1, V2,..., V~ with }-~i~1 Vi = V, we can formulate our problem as inhomogenous linear equation system E . s - b as we did in a further work [8]. There we also showed that this problem can be solved in polynomial time by means of linear
223 programming. The linear equation system can be solved for each feasible initial distribution of 8 the load V1, V2,..., V~ (i.e., a distribution with Y~4=l Vi = V). 3. A P P L I C A T I O N OF T H E M O D I F I E D D T M
The modified DTM is applied to the cluster-based image database Cairo. This database allows the user to select regions of interest and to search all archived images for a similar objects or persons. The image analysis and comparison require large computational resources, thus we designed a suitable cluster-based architecture. The divisible task model is well suited for the modeling of the query processing, as the processing can be separated into independent components (processing per image) and into any desirable granularity (a task can be a number of images, a single image or a number of image subsections). Initially, all available images are distributed equally over the computing nodes and processed in parallel. However, many queries consider only a small image portion, which is created by a much simpler evaluation of a-priori extracted features. The result is a unbalanced system, where overloaded nodes prolong the response time of the entire system. Thus a workload balancing strategy for this NP-complete problem is necessary. Due to the described modifications of DTM the application of this model for workload balancing is now straight forward. We implemented parts of it and the model showed a considerable increase of performance since we reduced the length of busy time intervals on a single node due to a reduction of the communication transmission times. On the other hand the approach of providing several originators increased the flexibility to use other interconnection topologies. We made some measurements as well as comparisons to existing strategies such as LTF [3] and RBS [4]. The results seem promissing, but more work is to be done. 4. CONCLUSIONS AND FUTURE W O R K We presented two modifications to the classical DTM model to adapt it to cluster based systems and thus make it applicable to a broader class of practical applications. The possibility to use several originators increases the flexibility of our approach by covering a broader class of network topologies and allows a recursive modeling. Our second modification provides an altemate distribution of the initial data volume on the PEs results in a considerable increase in performance since we reduce the length of busy time intervals on the single nodes due to a reduction of the communication transmission times. REFERENCES
I1] [2]
[3] [4]
Y. C. Cheng and T. G. Robertazzi, "Distributed Computation with Communication Delay." In IEEE Transactions on Aerospace and Electronic Systems 24, pp. 700-712, 1988. T. Bretschneider, S. Geisler and O. Kao, "Simulation-based assessment of parallel architectures for image databases." In Proceedings of the International Conference on Parallel Computing (ParCo 2001), pp. 401-408, 2002. O. Kao, G. Steinert and F. Drews, "Scheduling aspects for image retrieval in clusterbased image databases." In IEEE/ACM Symposium on Cluster Computing and Grid, pp. 329-336, IEEE Society Press, 2001. F. Drews and O. Kao, "Randomised block size scheduling strategy for cluster-based image
224
[s] [6] [7]
[8]
[9]
databases" In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pp. 2116-2123,2001. J. Blazewicz, M. Drozdowski. "Scheduling Divisible Jobs with commtmication Startup Costs." Discrete Applied Mathematics 76, pp. 21-24, 1997. J. Blazewicz, M. Drozdowski and M. Markiewicz. "Divisible task scheduling - concept and verification." In Parallel Computing, Volume 25, Number 1, pp. 87-98, 1999. P. Wolniewicz and M. Drozdowski, "Experiments with scheduling divisible tasks in cluster of workstations." Euro-Par 2000, LNCS 1900, Springer, pp. 311-319, 2000. F. Drews, O. Kao, U. Rerrer and K. Ecker, "Extending the Divisible Task Model for Load Balancing in Parallel and Distribited Systems" Proceedings of The 2003 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2003), CSREA Press, Vol. 1, pp. 493-497, 2003. J. Blazewicz and M. Drozdowski and K. Ecker, "Management of Resources in Parallel Systems" in Handbook on Parallel and Distributed Processing, Springer, pp. 264-339, 2000.
Parallel Computing: SoftwareTechnology,Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
225
The generalized diffusion method for the load balancing problem* G. Karagiorgos a, N. Missirlis% and F. Tzaferis ~ aDepartment of Informatics and Telecommunications, University of Athens, Panepistimioupolis 157 84, Athens, Greece This paper defines the Generalized Diffusion (GDF) method by the introduction of a parameter 7. in the Diffusion (DF) method and studies its convergence analysis for a weighted network graph. In particular, it is proved that GDF converges if and only if 7- c (0, 1/[IAII~ ), where A is the adjacency matrix of the network graph G = (V, E). This is a more relaxed condition than the one required by the DF method. Next, we consider a multiparametric version of GDF, which involves a set of parameters 7-i, i = 1, 2 , . . . , IVt, instead of a single parameter 7-. By applying local Fourier analysis we are able to find a closed form formula which produces optimum values for the set of the parameters 7-~ in the sense that the rate of convergence of GDF is maximized for ring and 2D-torus network graphs. 1. INTRODUCTIONS One of the most fundamental problems to solve on a distributed network is to balance the workload among the processors in order to use them efficiently. Load balancing schemes can be classified as either static or dynamic. In static load balancing schemes it is possible to make a priori estimates of load distribution in contrast to the dynamic load balancing, where the workload is distributed among the processors in a time-varying process. The load balancing problem. We consider the following abstract distributed load balancing problem. We are given an arbitrary, undirected, connected graph G = (V, E), where the node set V represents the set of processors and the edge set E describes the connections between processors. Each node vi E V contains a real variable ui which represents the current workload of processor i and a real variable c~j which represents the weight of edge (i,j). The current workload is the sum of the computational work of its tasks. Tasks are independent and can be executed on any other processor. Most of the existing iterative load balancing algorithms [5, 7, 12, 15] involve two steps. The first step calculates a balancing flow. This flow is used in the second step, in which load balancing elements are migrated accordingly. This paper focuses on the first step. The performance of a balancing algorithm can be measured in terms of number of iterations it requires to reach a balanced state and in terms of the amount of load moved over the edges of the underlying processor graph. Our objective. The aim of this paper is to study a generalized form of the DF method by introducing a new set of parameters 7.~,i = 1, 2 , . . . , IV I for edge weighted network graphs, whose role will be to maximize the rate of convergence of the DF method. *Research is supported by the National and Kapodistrian University of Athens (No. 70/4/4917).
226 Related work. In the Diffusion method (DF) Cybenko [4], Boillat [3], a processor simultaneously sends workload to its neighbors with lighter workload and receives from its neighbors with heavier workload. It is assumed that the system is synchronous, homogeneous and the network connections are of unbounded capacity. Under the synchronous assumption, the diffusion method has been proved to converge in polynomial time for any initial workload [3]. If new workload can be generated or existing workload completed during the execution of the algorithm, it has been proved that the variance of the workload is bounded [4]. The convergence of the asynchronous version of the diffusion method has also been proved by Bertsekas and Tsitsiklis [ 1]. Our contribution. We introduce the GDF method, a generalized version of DF. We study the convergence analysis of the GDF method in the case that the edges of the network graph have a weight. In particular, we show that GDF converges under more relaxed conditions than DF. Next, we consider a mutliparametric version of GDF, which involves a set of parameters Ti. By applying local Fourier analysis we are able to find a closed form formula which produces optimum values for the set of the parameters T~ in the sense that the rate of convergence of GDF is maximized for ring and 2D-torus weighted network graphs. In addition, the values of T~ depend only upon local information hence their computation requires only local communication. The rest of the paper is organized as follows. In section 2, we introduce the GDF method. In section 3, we examine the properties of the GDF method such that its work load to be invariant. In section 4, we study the convergence analysis of the GDF method and we intrduce its local version involving a set of parameters ~-~. In section 5, we determine the optimum values of the parameters T~ such that the convergence rate of the local GDF method is maximized. Finally, our conclusions and future work are stated in section 6.
2. THE GENERALIZED DIFFUSION (GDF) METHOD The Generalized diffusion (GDF) method for the load balancing has the form: u(n+l)
i
. (n)
=~i
-T
~
/ (n)
,...., cij~u i
. (n)
-~j
),
(1)
jEA(i)
where T is a parameter that plays an important role in the convergence of the whole system to the equilibrium state, A ( i ) is the set of the nearest neighbors of node i, n is the step index, n = 0, 1, 2 , . . . and u~ ~ (n) (1 _< i _< IV]) is the total workload of processor i at step n. Then, the overall workload distribution at step n, denoted by u (n), is the transpose of the vector (u~n) u~n) ~ (n) , , . . . , Uiyl). u (~ is the initial workload distribution. In matrix form (1) becomes U (n+l) ---
Mu (n),
(2)
where M is called the diffusion matrix. The elements of M, mij, are equal to 7cij, if j E A(i), 1 - - 7" EjEA(i)Cij ' if i = j and 0 otherwise, where cij is the weight of the edge (i,j). With this formulation, the features of diffusive load balancing are fully captured by the iterative process (2) governed by the diffusion matrix M [14, 2]. Also, (2) can be written as u (n+l) = (I - T L ) u (n), where L = B W B T is the weighted Laplacian matrix of the graph, W is a diagonal matrix of size IEI x IEI consisting of the coefficents c# and B is the vertex-edge incident matrix. At this point, we note that if ~- - 1, then we obtain the DF method proposed
227 by Cybenko [4] and Boillat [3], independently. If W = I, then we obtain the special case of the DF method with a single parameter 7- (non weighted Laplacian). In the non weighted case and for network topologies such as chain, 2D-mesh, nD-mesh, ring, 2D-torus, nD-torus and nDhypercube, optimal values for the parameter 7- that maximize the convergence rate have been derived by Xu and Lau [16]. However, there are no analogous results in case of the weighted Laplacian. This paper will attempt to answer the following questions in case of the weighted Laplacian: 1) Under which conditions (2) is convergent and 2) What is the optimum value of ~- such that the convergence rate of (2) is maximized? 3. THE C H A R A C T E R I S T I C S OF THE G D F M A T R I X
The diffusion matrix of GDF can be written as M - I-
TL,
L = D-
A,
(3)
where D - diag(L) and A is the weighted adjacency matrix. Because of (3), (2) becomes u (~+1) - (I - TD) u (n) + ~-Au(") or in component form
u(•+i) i
= (1--T
Z
yeA(i)
u J i + 7- ~ cijuj('~) ,{ -- 1, 2 , . . . , c~u(~) j~A(i)
IVl.
(4)
The diffusion matrix M must have the following properties" nonnegative, stochastic and symmetric, for the work load to be invariant [4, 3, 2]. In the sequel, we examine under which conditions the diffusion matrix M posesses the aforementioned properties. 9 Nonnegative. For the matrix M to be nonnegative we must have M _> 0, or because of (3) I-TD+~-A_>0, or, i t s u f f i c e s I - T D _ > 0 and T A > _ 0 . From T-A _> 0 we have ~- > 0, because A >_ 0 (cij > 0 ) a n d from I - ~-D _> 0 we have 1 - T ~jcA(i)cij >_ 0 Vi E V. Thus, it suffices ~- < 1/rnax~ ~-~-jeA(~)C~j or 7- _< 1/]IAI]~. Therefore, we have proved the following. L e m m a 1. IfO < 7 _< l / l l A I l ~ then M > O. R e m a r k 1. Note that IIA 1 ~ - { Oll~. Corollary 1. I f cij - c (non weighted case) and 0 < r < 1/A(G), where ~- - Tc and A(G) is
the maximum degree of the graph G, then M > O. Proof If qj - c, we have the non weighted Laplacian matrix of the graph G and the inequalities o f L e m m a 1 yield 0 < T _< 1/cdi or 0 < ~? _< 1/di for each i, where ~? - Tc and di is the degree of node i. Clearly, Corollary 1 follows from the last inequalities. 9
v , IYf u~, 9Stochastic. For the matrix M to be stochastic we must have M g - g, where ~2i - ~1 z__~i=l or because of (3), (I - 7-L)~ - g, which is valid since Lg - 0. 9 Symmetric. Due to (3) the matrix M is symmetric since the Laplacian matrix L is symmetric. R e m a r k 2. I f the inequalities of Lemma 1 hold, then the matrix M is doubly stochastic.
228 4. THE C O N V E R G E N C E A N A L Y S I S OF THE G D F M E T H O D
In this section we present the basic convergence theorem for the GDF method. T h e o r e m 1. The GDF method converges to the uniform distribution if and only if the network graph is connected and either (or both) of the following conditions hold: (i) 0 < ~- < x/IIAII~, (ii) the network graph is not bipartite. Proof The diffusion matrix M can take the following form M =
0 /(T
K 0
, where O's
are used to denote square zero block matrices on the diagonal of M and K is a rectangular 1 ViEV. nonnegative matrix, if and only if it is bipartite and I - ~-D = 0 or ~- = EjeA(~)c~ If the above holds, then - 1 is an eigenvalue of the matrix M, hence its convergence factor 7 ( M ) = 1 and the method does not converge [2]. If the graph G is bipartite, then for 1
0<~-<
(5)
~-~jcA(i) Cij 1 or 0 < ~- < m a x i ~jEA(i) Vii ' which is the condition (i), - 1 is not an eigenvalue of M, hence ~, < 1 (i.e the GDF method converges). 9 Note that in the non weighted c a s e - - (1 -- ~-di)ul n) + "7-~jeZ(i)"{zj
(cij -- c) the GDF method, because of (4), becomes
't~i"( n + l )
Corollary 2. For the non weighted Laplacian we have cij = c and the GDF method converges to the uniform distribution if and only if the network graph is connected and either (or both) of the following conditions hold: (i) 0 < ~ < S~c), where ~- = ~-c, (ii) the network graph is not bipartite. Proof If cij = c, then IIAII~ = cA(C) and the condition (i) of Theorem 1 yields (i). Corollary 3. The DF method converges to the unifrom distribution if and only if the network graph is connected and either (or both) of the following conditions hold: (i) ~ cij < 1, (ii) jEA(i) the network graph is not bipartite. Proof Letting ~- = 1 the GDF method degenerates into the DF method. For T = 1 (5) yields (i). Note that Corollary 3 is the basic convergence theorem for the DF method derived by Cybenko [4]. Before, we close this section we consider the following version of GDF, which involves a set of parameters 7-i, i = 0, 1, 2 , . . . , N - 1
i
- ( -
Z jEA(i)
+
Z jEA(i)
(6)
Note that if ~-i = T then (6) yields the GDF method. (6) will be referred to as the local GDF method.
229
5. D E T E R M I N A T I O N OF THE P A R A M E T E R % It is not known yet how the optimum value of r can be determined in case of the weighted Laplacian. However, we are able to compute approximations to the optimum values of the parameters %. In order to compute these optimum values, in the sense described previously, for the local GDF method, we apply Fourier analysis [ 13] in a similar way as in [9, 10, 11 ]. For this reason, we define the local GDF operator for the ring and 2D-torus network graphs and apply Fourier analysis to find their local eigenvalues and the optimum values of ri using only local information of the underlying graph. Our results are presented in the form of two lemmas and two theorems. Their proofs are analogous to the ones in [ 10] and therefore are omitted. Next, we define the local operators for the ring and 2D-torus in order to find their local eigenvalues and in the sequel to determine optimum values of r~, which maximize the convergence rate of the local GDF method.
1) The ring topology , ~ i(n) , where M~ = At a node i, the local GDF scheme (2) can be written as u i(n+l) - Jvliu 1 - r i L i is the local GDF operator. We define Li = di - (Ci+lF, + Ci_lE -1) the local operator of the Laplacian matrix with d~ - c~+a + c~-1. The eigenvalues of the local operator Mi and those of the local operator Li, are related, because of (3), as follows: #i = 1 - fiat, i = 0, 1 , . . . , N - 1, where N is the number of processors, #~ and A~ are the eigenvalues of the operators M~ and L~, respectively. The operators E, E - 1 are defined as Eui = ui+ 1, E - lu~ = ui_ 1, which are the forwardshift and backward-shift operators in the z-direction, respectively. Expressing the opera~ ( n ) - ~ we have e~ ~(~+~) - z,~e~ ~ (~) , n - 0, 1, 2 , . . . tor Li in terms of the error vector ~~(n) - (,~ L e m m a 2. The spectrum o f the operator Li is given by Ai(k) = 2&i(1 - coskh), where ci = ci- 1(= ci+ 1) and h - -~ 1 with N denoting the number o f processors and k - 2rcg, g - 0, +1, + 2 , . . . , + ( N - 1). The optimum values of the parameters wi and the corresponding value of the convergence factor 7 are given by the following theorem. T h e o r e m 2. The optimum values o f the parameters % of the local GDF method for the ring topology o f order N, with N even, are t
riop -
1 2rr
~(3 - cos 7 )
i-O,
1,...,N-
1,
(7) 2rr
1-t-cos -R-
and the value o f the corresponding convergence factor is 7(Mi) - 3-cos 2~. N
The above theorem states that the local GDF method for the ring topology and for any values of the weights Ois (~i >_ 0) attains the same rate of convergence as in the case when ~s are fixed (Oi = c). For this it suffices r~ to be chosen by (7). Moreover, note that if ~ = c, then (7) yields the same value for "?(= 7c) as the one found in [16].
230 2) The 2D-torus topology Using a similar approach as in the previous paragraph, we define Mij as the local GDF operator for the 2D-torus topology. The local GDF scheme at a node i, j can be written o (n+l) = M~j," ( z(n) where Mij ~-- 1 - 7-~jL~j, is the local GDF operator. We define as .(zij ij ~ Lij =
dii-(Ci+l,jE14-Ci-l,jE114-ci,j+1E 2 4 - C i , j - 1 ~ 2 1 )
operator of the Laplacian
the local
matrix, where dii = Ci+l,j 4- C i - l , j 4- ci,j+l 4- c i , j - 1 . The eigenvalues #ij, /~ij of the local operators M~j, L~j, respectively, are related as follows: p~j = 1 - ~-ij)~j, i = 0, 1 , . . . N1 - 1, j = 0, 1 , . . . N2 - 1, where N1, N2 are the number of processors in each direction. The operators El, Ei -1, E2, E~-1 are defined as ElUij = Ui+l,j, F,~luij = Ui-l,j, E2uij = ui,j+l, E21uij = l t i , j - 1 , which are the forward-shift and backward-shift operators in the xl-direction, (x2-direction), respectively. Expressing the operator Lij in terms of the error vector el~ ) --- ul~ )
-
--
~(n+l)
u# we have eij
~(n)
- Lij ~ i j
'
n - 0, 1 ' 2, "'"
L e m m a 3. The spectrum of the operator Lij is given by /~ij(kl, k2) = d i i ( 2 c h c o s k l h l 4. 2CVjCOSk2h2), where dii = 2(c h 4- ci~j), c~ Ci--I(: Ci+l,j), Ci3 -l
Ci,j- l ( - -
Ci,j + l ), h i
=
NI1 ' h2 =
1 ' and N1, N2 are the number of processors in -~2
each direction and kl = 27rgl, gl = 0 , + 1 , 4 . 2 , . . . , + ( N 1 - 1 ) , 0, +~, + 2 , . . . ,
k2 =
27rg2, g2 =
+(N~ - 1)
The optimum values of the parameters ~-~ and the corresponding value of the convergence factor are given by the following theorem. T h e o r e m 3. The optimum values of the parameters Tij o f the local GDF method for the 2D-torus topology of order N = max{N1, N2 }, with N even, are
1
opt
Jij -
v
2.,
2C h 4. 3cVj -- Cij COS N
i=0,1,...N1-
1,
j = O , 1 , . . . N 2 - 1,
(8)
and the corresponding value of the convergence factor is %
v
27r
3c + Cij COS N 27r" + 3cVj + v
(9)
Note that if c h = ci) - c, then (8) yields the same optimum value as the one found in [16]. A closer examination of(9)reveals that 7(Mij) depends upon the ratio rij = ch/ci). 27r
3~+cos ~ 2_~. By examining the behavior of ~/(rij) Indeed, (9) can be written as ~(rij) = 3+2~j+cos N
with respect to rij we find that si9n(~ Orij = +1 which indicates that ~(rij) is an h v increasing function of rij. This means that as the ratio rij = Cij/Cij increases the rate of convergence of local GDF decreases. The best rate of convergence will be achieved when c~ - ci~j a s rij ~ 1 2 for ~- to satisfy the convergence range (0, 1/4). This was also confirmed by our numerical results. 2In fact rij > 1/2(1 + cos(27r/N)
231 6. C O N C L U S I O N S A N D FUTURE W O R K The main advantage of GDF is that it converges for any values of the weights cij if 7- c (0, 1/IIA]]~ ), whereas in DF it is required that c#s must satisfy the conditions ~]~jea(i) c~j < 1. The problem of determining the diffusion parameters c~j such that DF attains its maximum rate of convergence is an active research area [6, 8]. By introducing the set of parameters 7-i, i = 1, 2 , . . . , IV], c#s become the edge weights of the network graph, which are given quantities. Therefore, the problem moves to the determination of the parameters ~-is in terms of c~js. By considering local Fourier analysis we were able to determine good values (near the optimum) for Tis. These good values become optimum in case c#s take a fixed value c (Theorem 2 and 3). This is an encouraging result indicating that our approach is reliable. Finally, we show that the best rate of convergence of local GDF for the ring and toms topology is attained when cij = c. Moreover, the larger the difference between c~ and ci) the worse the rate of convergence. We plan to study the GDF method for other networks topologies using the local Fourier analysis approach. We believe that we can derive results for 2d-regular graphs using the same approach.
REFERENCES
[ 1]
D.P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: Numerical methods, Prentice Hall, 1989. [2] A. Berman and R. J. Plemmons, Nonnegative matrices in the mathematical sciences, Academic Press, 1979. [3] J.E. Boillat, Load balancing and poisson equation in a graph, Concurrency: Practice and Experience 2, 1990, 289-313. [4] G. Cybenko, Dynamic load balancing for distributed memory multi-processors, Journal of Parallel and Distributed Computing, 7, 1989, 279-301. [5] R. Diekmann, A. Frommer, B. Monien, Efficient schemes for nearest neighbour load balancing, Parallel Computing, 25 (1999) 789-812. [6] R. Diekmann, S. Muthukrishnan, M. V. Nayakkankuppam, Engineering diffusive load balancing algorithms using experiments. In G. Bilardi et al., editor, IRREGULAR'97, LNCS 1253, (1997) pages 111-122. [7] Y.F. Hu, R. J. Blake, An improved diffusion algorithm for dynamic load balancing, Parallel Computing, 25 (1999) 417-444. [8] R. Els/isser, B. Monien, S. Schamberger, G. Rote, Toward optimal diffusion matrices, In "International Parallel and Distributed Processing Symposium", IPDPS 2002, Proceedings. IEEE Computer Society Press, (2002). [9] G. Karagiorgos and N. M. Missirlis, Fourier analysis for solving the load balancing problem, Foundations of Computing and Decision Sciences, Vol. 27, No 3, 2002. [ 10] G. Karagiorgos and N. M. Missirlis, Towards the opitmum diffusion parameters, (Submitted). [ 11 ] G. Karagiorgos, Distributed diffusion algorithms for the load balancing problem, Department of Informatics and Telecommunications, Section of Theoretical Computer Science, PhD thesis, 2002 (in Greek).
232 [12] K. Schloegel, G. Karypis, V. Kumar, Parallel multileveldiffusion schemes for repartitioning of adaptive meshes, In Proc.of the Europar 1997, Springer LNCS, 1997. [ 13] A. Terras, Fourier analysis on finite groups and applications, Cambridge University Press, 1999. [ 14] R. Varga, Matrix iterative analysis, Prentice-Hall, Englewood Cliffs, NJ, 1962. [ 15] C. Walshaw, M. Cross, M. Everett, Dynamic load balancing for parallel adaptive unstructured meshes, In Proc.of the 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997. [16] C. Z. Xu and F. C. M. Lau, Load balancing in parallel computers: Theory and Practice, Kluwer Academic Publishers, Dordrecht, 1997.
Parallel Computing: SoftwareTechnology,Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 ElsevierB.V. All rights reserved.
233
Delivering High Performance to Parallel Applications Using Advanced
Scheduling N. Drosinos ~, G. Goumas a, M. Athanasaki ~, and N. Koziris a a National Technical University of Athens School of Electrical and Computer Engineering Computing Systems Laboratory Zografou Campus, Zografou 15773, Athens, Greece e-mail: {ndros, goumas, mafia, nkoziris }@cslab.ece.ntua.gr This paper presents a complete framework for the parallelization of nested loops by applying tiling transformation and automatically generating MPI code that allows for an advanced scheduling scheme. In particular, under advanced scheduling scheme we consider two separate techniques: first, the application of a suitable tiling transformation, and second the overlapping of computation and communication when executing the parallel program. As far as the choice of a scheduling-efficient tiling transformation is concerned, the data dependencies of the initial algorithm are taken into account and an appropriate transformation matrix is automatically generated according to a well-established theory. On the other hand, overlapping computation with communication partly hides the communication overhead and allows for a more efficient processor utilization. We address all issues concerning automatic parallelization of the initial algorithm. More specifically, we have developed a tool that automatically generates MPI code and supports arbitrary tiling transformations, as well as both communication schemes, e.g. the conventional receive-compute-send scheme and the overlapping one. We investigate the performance benefits in the total execution time of the parallel program attained by the use of the advanced scheduling scheme, and experimentally verify with the help of application-kernel benchmarks that the obtained speedup can be significantly improved when overlapping computation with communication and at the same time applying an appropriate (generally non-rectangular) tiling transformation, as opposed to the combination of the standard receive-compute-send scheme with the usual rectangular tiling transformation. 1. I N T R O D U C T I O N - BACKGROUND Tiling or supernode transformation is one of the most common loop transformations discussed in bibliography, proposed to enhance locality in uniprocessors and achieve coarse-grain parallelism in multiprocessors. Tiling groups a number of iterations into a unit (tile), which is executed uninterruptedly. Traditionally, only rectangular tiling has been used for generating SPMD parallel code for distributed memory environments, like clusters. In [ 1], Tang and Xue provided a detailed methodology for generating efficient tiled code for perfectly nested loops, but only used rectangular tiles due to the simplicity of the parallel code, since only division and modulo operations are required in this case. However, recent scientific research has indicated
234 that the performance of the parallel tiled code can be greatly affected by the tile size ([2], [3], [4]), as well as by the tile shape ([5], [6], [7]). The effect of the tile shape on the scheduling of the parallel program is depicted in Figure 1. It is obvious that non-rectangular tiling is more beneficial in this particular case than rectangular one, since it leads to fewer execution steps for the completion of the parallel algorithm. The main problem with arbitrary tile shapes appears to be the complexity of the respective parallel code and the performance overhead incurred by the enumeration of the internal points of a non-rectangular tile. Therefore, an efficient implementation of an arbitrary tiling method and its incorporation in a tool for automatic code generation would be desirable in order to achieve the optimal performance of parallel applications.
J2 t
Original Iteration Space
J2
/
Rectangular Tiling
~,,. . . . . . , .....
,~,
T;,;....
7~~':./L"-~./T./TZ 7~_Z_/~Z_,4:_Z/j"j~__ 7.7~__.~:.zS(_Z"._7.Z,(Z 7.J'.~_,CZ_7_7.J/J~":Z 7_7~._Z_Z...Z.J_7.ZZ..~Z 7.._7.ZT..S':_2".ZF.Z,,T.Z il
Figure 1. Effect of tile shape on overall completion time
Elaborating further more on scheduling, under conventional schemes, the required communication between different processors occurs just before the initiation and after the completion of the computations within a tile. That is, each processor first receives data, then computes all calculations involved with the current tile, and finally sends data produced by the previous calculations. By providing support for an advanced scheduling scheme that uses non-blocking communication primitives, and consequently allows the overlapping of useful computation with burdensome communication, it is expected that the performance of the parallel application will be further improved. This hypothesis is also established by recent scientific work ([8], [9]). More specifically, the blocking communication primitives are substituted with non-blocking communication functions, which only initialize the communication process, and can be tested for completion at a later part of the program. By doing so, after initializing non-blocking communication the processor can go on with useful computation directly related to the user application. The communication completion can be tested as late as possible, when it will most likely have completed, and thus the processor will not have to stall idle, prolonging the total execution time of the application. 2. ALGORITHMIC MODEL
Our model concerns n-dimensional perfectly nested loops with uniform data dependencies of the following form:
235 FOR j ] 9
.
mini
TO max1 DO
.
FOR
Jn--mi~Zn T O rr~aXn D O
Computation
(jl .....
jn) ;
ENDFOR ENDFOR
The loop computation is a calculation generally involving an n-dimensional matrix A, which is indexed by jl, ..., j~. We assume that the loop computation imposes lexicographically positive data dependencies, so that the parallelization of the algorithm with the application of an appropriate tiling transformation is always possible. Also, if the data dependencies are lexicographically positive, an appropriate skewing transformation can eliminate all negative elements of the dependence matrix, so that rectangular tiling can be applied, as well. Furthermore, for the i-th loop bounds mini, maxi it holds that mini = f ( j l , . . . ,ji-1) and maxi = g ( j l , . . . , ji- 1). That is, our model also deals with non-rectangular iteration spaces, under the assumption that they are defined as a finite number of semi-spaces of the n-dimensional space Z ~. 3. AUTOMATIC CODE GENERATION The automatic parallelization process of the sequential program in schematically depicted in Figure 2. The procedure can be divided in three phases, namely the dependence analysis of the algorithm, the application of an appropriate tiling transformation for the generation of intermediate sequential tiled code, and finally the parallelization of the tiled code in terms of computation/data distribution, as well as the implementation of communication primitives.
Initial.. Code"
Dependence
Analysis
Optimal Tiling
Tiling ~ Transformation
Sequential Tiled Code ~ Parallelization
Parallel Tiled Code "~
Figure 2. Automatic Parallelization of Sequential Code The following Subsections elaborate on the automatic parallelization process, emphasizing on the code generation issues. 3.1. Tiled code generation The generation of the sequential tiled code from the initial algorithm mainly implies transforming the n nested loops into 2n new ones, where the n outermost loops scan the tile space and the n innermost ones traverse all iterations associated with a specific tile. This equivalent form of the algorithm code is more convenient for the parallelization process, as the computation distribution can be directly applied to the outermost n loops enumerating the tiles. In case of rectangular tiling and rectangular iteration spaces, the respective sequential tiled code is simple and straightforward, as it is implemented with the aid of integer division and modulo operators ([7]). In the opposite case, if non-rectangular tiling is applied, or a nonrectangular iteration space is considered, the transformation of the initial algorithm into se-
236 quential tiled code is a more intricate task, that requires significant compiler work. In [10] we have proposed an efficient compiler technique based on the Fourier-Motzkin elimination method for calculating the outer loops bounds. The efficiency of the proposed methodology lies in that we managed to construct a compact system of inequalities that allows the generation of tiled code, and thus compensates for the doubly exponential complexity of the Fourier-Motzkin method. The simplified system of inequalities enumerates some redundant tiles, as well, but the run-time overhead proves to be negligible in practice, since the internal points of these tiles are never accessed. As far as the traversing the internal of a tile is concerned, in [ 10] we further propose a method to transform arbitrary shaped tiles into rectangular ones. By doing so, only rectangular tiles need to be traversed and the expressions required in the n innermost loop bounds evaluation are significantly reduced. Formally, the iteration space of a tile (Tile Iteration Space - TIS) is transformed into a new iteration space, the Transformed TIS (TTIS) by using a non-unimodular transformation. The correspondence between the TIS and the TTIS is schematically depicted in Figure 3. It should be intuitively obvious that the TTIS can be more easily traversed in comparison to the TIS, although special care needs to be taken so that only valid points (e.g. black dots in Figure 3) are accessed. 3.2. Parallelization The sequential tiled code is parallelized according to the SPMD model in order to provide portable MPI C++ code. The parallelization process addresses issues such as computation distribution, data distribution and inter-process communication primitives. We will mainly focus on the communication scheme, as the computation and data distribution are more extensively analyzed in [ 10]. Each MPI process assumes the execution of a sequence of tiles along the longest dimension of the tile iteration space, as previous work in the field of UET-UCT graphs ([9]) suggests that this scheduling is optimal. The n outermost loops of the sequential tiled code are reordered, so that the one corresponding to the maximum-length dimension becomes the innermost of the 7z. Each worker process is identified by an n - 1 dimensional pid vector directly derived from its MPI rank, so that it undertakes the execution of all tiles whose n - 1 outermost coordinates match pid. Also, data distribution follows the computer-owns rule, e.g. each worker process owns the
Tile Iteration Space (TIS)
Trans~rmed Tile Iteration Space (TTIS) J2
20
o o o o o o o o o o o o o o o o o o e o o o o o o o o
I0
.................................
15oooooeoooo o o o o o o o o o e o o o o o e o 10 ooooo o o o o o o o o o o e o o o o o e o 5 o o o o o e o o o o o o
i
o o o o
o o o o
o o o o
o o o o o o o
o o e o o o o
o o o o o o o
p'
o o o o o o o o ooo o o o o o o e o o o o o o e
o
5
.
l /
/
5
/
i~
/
O
9
e..
9
9
9
9
9
9 iI
-o-.-e---
/ I0
lO~
0
,
O /
'
o--. ov f ..........
/
l
/o
H'
o o o o o o o o o
o o e o o o o e o o o o o o o o o o
-j ;.- ~-z.~
o o o e o o o o o o
9
9
9
O
~ ~ , ,
9
9
,
J
9
9/
.//
~
9 ,
5
Figure 3. Transformation of arbitrary shaped tile into rectangular
10
Jl
237 data it computes. By adopting the above computation and data distribution, the required SPMD model for the parallelization of the sequential tiled code is relatively simple and efficient, as far as the overall performance is concerned. Finally, in order for the worker processes to be able to exchange data, certain communication primitives need to be supplied to the parallel code. We have implemented two communication patterns, namely one based on blocking MPI primitives (MPI Send, MPI_Recv), and an alternative one based on non-blocking MPI primitives (MP I I s e n d , MPI _ I r e c v ) . In the first case (blocking), each worker process initially receives all non-local data required for the computation of a tile, then computes that tile, and finally sends all computed data required by other processes (Table 1). Note that in this case communication and computation phases are distinct and do not overlap. In the second case (non-blocking), each worker process concurrently computes a tile, receives data required for the next tile and sends data computed at the previous tile (Table 2). This communication scheme allows for the overlapping of computation and communication phases. f o r ( t i l e t){ MPI Recv(t) ; C o m p u t e (t) ; MPI S e n d (t) ; }
Table 1 Blocking communication scheme
f o r ( t i l e t) { MPI Irecv(t+l) MPI_Isend(t-l) C o m p u t e (t) ; MPI_Waitall ;
; ;
}
Table 2 Non-blocking communication scheme
It is obvious that the non-blocking communication scheme allows for overlapping of computation with communication only as long as both the MPI implementation and the underlying hardware infrastructure support it, as well. That is, the MPI implementation should make a distinction between standard and non-blocking communication primitives, so as to exploit the benefits of the advanced communication pattern. On the other hand, the underlying hardware/network infrastructure must also support DMA-driven non-blocking communication. Unfortunately, this is not the case with the used MPICH implementation for ch_p4 ADI-2 device, as indicated by the relative performance of both schemes. In order to evaluate our proposed advanced scheduling scheme also in terms of communication-computation overlapping, we thus resorted to synchronous MPI communication primitives for the blocking scheme (e.g M P I _ S s e n d instead of MPI_Send). By doing so we were able to simulate the relative performance of both communication patterns, despite the implementation/hardware restrictions. 4. EXPERIMENTAL RESULTS In order to evaluate the performance benefits obtained by the proposed advanced scheduling scheme, we have conducted a series of experiments using micro-kernel benchmarks. More specifically, we have automatically parallelized the Gauss Successive Over-relaxation (SOR[11]) and the Texture Smoothing Code (TSC - [12]) micro-kernel benchmarks, and we have experimentally verified the overall execution time for different tiling transformations, blocking and non-blocking communication schemes and various iteration spaces. Our platform is an 8-node dual-SMP cluster interconnected with FastEthernet. Each node has 2 Pentium III
238 CPUs at 800 MHz, 128 MB of RAM and 256 KB of cache, and runs Linux with 2.4.20 kernel. We used g++ compiler version 2.95.4 with -03 optimization level. Finally, we used MPI implementation MPICH v. 1.2.5, configured with the following options: - - w i t h - d e v i c e = c h _ p 4 - -with-comm=shared.
4.1. SOR
[0011 ] E100]
The SOR loop nest involves a computation of the form A[t, i, j] = f ( A [ t , i - 1, j], Air, i, j 1], A [ t - 1, i + 1, j], A [ t - 1, i, j + 1], A [ t - 1, i, j]), while the iteration space is M • N • N. The dependence matrix of the algorithm is D -
1 0 -1 0 0 . Because of the negative 01 0 -10 elements of D, skewing needs to be applied to the original algorithm for the rectangular tiling to be valid. An appropriate skewing matrix is T =
1 1 0 , since T D >_ O, that is the 201 skewed algorithm contains only non-negative dependencies. We will apply both rectangular and non-rectangular tiling to the skewed iteration space, and evaluate both the blocking and the non-blocking communication scheme. More specifically, the rectangular tile is provided by the matrix Pr -
[ 00] 0 y 0 0 0 z
, while the proposed non-rectangular tiling transformation, as
obtained from the algorithm's tiling cone, is described by the matrix Pnr =
[ 00]
0 y 0 . Note x 0 z that in each case, the tile shape can be determined from the column vectors of the respective transformation matrix (P~ or Pn~), while the tile size depends on the values of the integers x, y and z. However, both tiles have equal sizes, since IP~I = IPn, I = ~ y z , s o that the per-tile computation volume is equal in both cases. Moreover, since in both cases tiles will be mapped to processors according to the third dimension, the per-tile communication volume and the number of MPI processes required are the same, as process mapping and inter-process communication are implicitly determined by the outermost two dimensions. Consequently, any differences in the overall execution times should be attributed to the different scheduling that results from the two tiling transformations, as well as to the communication pattern (blocking or non-blocking). Experimental results for the SOR micro-kernel are depicted in Figure 4. In all cases, nonrectangular tiling outperforms the rectangular tiling, while the non-blocking communication pattern is more efficient than the blocking one, at least on a simulation level. In other words, the experimental results comply to the theoretically anticipated performance.
4.2. TSC TSC algorithm can be written as a triply nested loop with a computation of the form b[t, i, j] - f (b[t, i - 1, j - 1], b[t, i - 1, j], b[t, i - 1, j + 1], b[t, i, j - 1], b[t - 1, i, j + 1], b[t 1,i + 1,j - 1 ] , b [ t - 1,i + 1,j],b[t - 1, i + 1 , j + 1]) (iteration space T x N x N). The [0 0 0 0 1 1 1 1 ] dependence matrix of the algorithm i s D 1 1 1 0 0 -1 -1 -1 . SinceD 1 0-1 1 -1 1 0 -1 also contains negative elements, proper skewing needs to be applied for the rectangular tiling
239
4O
Overall Execution Time for SOR (256x128x!2a Iteration Space)
Overall
35
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Non-rectangular
30
.......................................................................................................
ti]:ing
(non:-blocking.)
Execution
65
Rectangular tiling (blocking) ---+--Rectangular tiling (non-blocking) ---~-Non-rectangular tiling (blocking) ~---
Time
for S O R
(128x256x256
Iteration
Rectangular tiling (blocking) Rectangular tiling (non-blocking) Non-rectangular t i l i n g (blocking) Non-rectangular tiling (non-blocking)
......[] ......
Space)
---+-----x-----a.... ._ .....[] ........
55
i ................................................................
so 45
40
..... ::::::::::::::::::::::::::::::::::::::::::::::::::::::::: ........................................................................................................
l0
5
~ 5
I0
15
I 20
Tile Size
25
30
(in K)
20
35
40
60
80 Tile
Size
i00
120
140
160
(in K)
Figure 4. Experimental Results for SOR
1 0 0]
I
transformation to be valid. We consider the skewing matrix T =
tiling transformation tiling ( P~ -
Pn~ -
[ oo] 0
y
0
-x -x
1 1 0 2 1 1
001
y -y
0 z
. We will apply
to the original iteration space, and rectangular
) to the skewed iteration space.
0 0 z
As in the SOR micro-kernel benchmark, in both cases (rectangular and non-rectangular tiling transformation) we have an equal tile size (]P,I = IPn, I = x y z ) , that results to the same per-tile computation volume. Furthermore, tiles are mapped to MPI processes according to the third dimension. Experimental results are depicted in Figure 5. We observe that, as in SOR, the non-blocking communication scheme with the application of non-rectangular tiling delivers the best overall performance. In this case however, both tiling transformations deliver similar performance
Overall
Execution
20
Time
for T S C
(128x128x128
Iteration
R~cta~gul;r ~ ( ~ I ~ Rectangular tiling (non-blocking) Non-rectangular tiling (blocking)
Space
---x----~ ....
................. i................... N ~ : . ~ e . ~ . t . a ~ g u . l ~ . . . ~ ~ ! ! n ~ . . . . ! . ~ o ~ : ~ l g.~.~S.!....=:...~::::: .....
Overall
Execution
Time
for T S C
(128x256x256
Iteration
Space)
601
~ i ' R . . . . n g u l a r t i l i n g (blocking)' , | ~ i R e c t a n g u l a r t i l i n g ( n o n - b l o c k i n g ) -~-~--58 ~-~-i ..................~.... Non-rectangular t i l i n g ( b l o c k i n g ) ---~.... ',i i Non-rectangular t i l i n g ( n o n - b l o c k i n g ) .....s........
56
s4
16
52
-H
14
.~ so
5
I0
15 Tile
Size
20
25
(in K)
Figure 5. Experimental Results for TSC
44
i0
........... t..................... i" 20
30
Tile
40 Size
50 (in K)
60
i
70
80
240 under the blocking communication scheme, while non-rectangular tiling transformation is more beneficial in case of the non-blocking communication scheme. 5. CONCLUSIONS Summarizing, we have combined the notions of arbitrary tiling transformation and overlapping communication and computation and incorporated these aspects into a complete framework to automatically generate parallel MPI code. We have addressed all issues regarding parallelization, such as task allocation, sweeping arbitrary shaped tiles and implementation of appropriate communication primitives. We have experimentally evaluated our work, and supplied simulation results for application-kemel benchmarks that verify the high performance gain obtained by the advanced scheduling scheme. REFERENCES
[1 ] [2]
[3] [4] [5]
[6] [7] [8]
[9]
[10]
[ 11]
[12]
E Tang, J. Xue, Generating Efficient Tiled Code for Distributed Memory Machines, Parallel Computing 26 (11) (2000) 1369-1410. R. Andonov, E Calland, S. Niar, S. Rajopadhye, N. Yanev, First Steps Towards Optimal Oblique Tile Sizing, in: 8th International Workshop on Compilers for Parallel Computers, Aussois, 2000, pp. 351-366. E. Hodzic, W. Shang, On Supernode Transformation with Minimized Total Running Time, IEEE Trans. on Parallel and Distributed Systems 9 (5) (1998) 417-428. J. Xue, W.Cai, Time-minimal Tiling when Rise is Larger than Zero, Parallel Computing 28 (6) (2002) 915-939. E. Hodzic, W. Shang, On Time Optimal Supernode Shape, in: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, CA, 1999. K. Hogstedt, L. Carter, J. Ferrante, Selecting Tile Shape for Minimal Execution time, in: ACM Symposium on Parallel Algorithms and Architectures, 1999, pp. 201-211. J. Xue, Communication-Minimal Tiling of Uniform Dependence Loops, Journal of Parallel and Distributed Computing 42 (1) (1997) 42-59. G. Goumas, A. Sotiropoulos, N. Koziris, Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping, in: Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS'01), San Francisco, 2001. T. Andronikos, N. Koziris, G. Papakonstantinou, P. Tsanakas, Optimal Scheduling for UET/UET-UCT Generalized N-Dimensional Grid Task Graphs, Journal of Parallel and Distributed Computing 57 (2) (1999) 140-165. G. Goumas, N. Drosinos, M. Athanasaki, N. Koziris, Compiling Tiled Iteration Spaces for Clusters, in: Proceedings of the IEEE International Conference on Cluster Computing, Chicago, 2002. G. E. Karniadakis, R. M. Kirby, Parallel Scientific Computing in C++ and MPI : A Seamless Approach to Parallel Algorithms and their Implementation, Cambridge University Press, 2002. S. Pande, T. Bali, A Computation + Communication Load Balanced Loop Partitioning Method for Distributed Memory Systems, Journal of Parallel and Distributed Computing 58 (3) (1999) 515-545.
Algorithms
This Page Intentionally Left Blank
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
243
M u l t i l e v e l E x t e n d e d A l g o r i t h m s in S t r u c t u r a l D y n a m i c s on P a r a l l e l C o m p u t e r s K. ElsseP and H. Voss ~ aTechnical University of Hamburg- Harburg, Section of Mathematics, D - 21071 Hamburg, Germany, A parallelization concept for the adaptive multi-level substructuring (AMLS) method is presented, the idea of which is to hierarchically substructure the system under consideration, and at the same time to use a truncated eigenvalue decomposition on the interfaces to reduce the excessive number of interface degrees of freedom in component mode methods. 1. INTRODUCTION Frequency response analysis of complex structures over broad frequency ranges is very costly, since the stiffness and mass matrices K and M in a finite element model Kx = AMx
(1)
of a structure under consideration are usually very large. The number of degrees of freedom (DoF) is often reduced to manageable size by condensation and component mode synthesis. However, if the substructuring is chosen fine enough a very large number of interface DoFs appears and the problem remains very large (and much less sparse), whereas for a coarse substructuring a very large number of local modes of the substructures is required to obtain sufficiently accurate approximations to eigenvalues and eigenvectors. A way out is the adaptive multi-level substructuring where the eigenmodes and eigenfrequencies of the substructures (and their interfaces as well) are approximated by substructuring and modal truncation, and this principle is used in a recursive fashion. Kropp and Heiserer [6] benchmarked an implementation of AMLS against the shift-invert block Lanczos method, and recommend to use AMLS if the dimension of the problem is very large and not only modes corresponding to small eigenfrequencies are needed. They report on a calculation of nearly 2500 eigenmodes of a FE model from vibro-acoustic with 13.500.000 DoF on a HP-RISC workstation. In this paper we present a parallelization concept for the AMLS method. Our paper is organized as follows. Section 2 briefly sketches condensation with general masters, and Section 3 summarizes the adaptive multi-level substructuring method. In Section 4 we present our parallelization concept using threads and MPI to distribute parts of the problem to computation nodes which operate on separate memory. Section 5 contains a numerical example and discussion. 2. PARALLEL CONDENSATION W I T H GENERAL MASTERS We consider the general eigenvalue problem (1) where K E R ~• and M c R ~• are symmetric and positive definite. In the dynamic analysis of structures K and M are the stiffness and the
244 mass matrix of a finite element model of a structure, and they are usually very large and sparse. To reduce the number of degrees of freedom to manageable size Irons and Guyan independently proposed static condensation, i.e. to choose a small number of degrees of freedom (called masters) which seem to be representative for the dynamic behaviour of the entire structure, and to eliminate the remaining unknowns (called slaves) neglecting inertia terms in some of the equations of (1). After reordering the equations and unknowns problem (1) can be rewritten as
Km~ Kmm
xm
Mm~ Mmm
xm
"
Neglecting the inertia terms in the first equation, solving for the slave unknowns x~, and substituting x8 into the second equation one obtains the statically condensed problem /~0Zm- )~-/~0Zm
(3)
for the master variables xm only, where /(0 := Kmm - Kms K~I Ksm,
(4)
and
/l~0 :-- Mmm - KmsK~slMsm - MmsK~lKsm -Jr-K m s K ~ l M s s K ~ l K s m 9
(5)
Combining condensation with substructuring yields a coarse grained parallel algorithm[ 11 ] based on the master-worker paradigm. Suppose that the structure under consideration has been decomposed into r substructures, and let the masters be chosen as interface degrees of freedom. Assume that the substructures connect to each other through the master variables only. If the slave variables are numbered appropriately, then the stiffness matrix is given by Ks~,. K=
...
0
0
Ksm~
9
"'.
i
i
"
0
...
K~2
0
0
...
0
Kms,.
...
K~m2 Ks~I K,~
Kssl
Km~2 Kmsl
,
(6)
and the mass matrix M has the same block form. It is easily seen that in this case the reduced matrices in (3) are given by -fro = Kmm - ~
K ~ j K-~5}K~mj
and
(7)
j=l
= Mmm - ~3
~
- i - 1 - I M s s j Ks~jK~my). - 1 (KmsjKssjMsm j + MmsjKs~jK~mj - Km~jK~j
(8)
j=l
Hence they can be computed completely in parallel, and the only communication that is needed is one fan-in process to determine the reduced matrices/(0 and/~/0. This type of condensation has the disadvantage that it produces accurate results only for a small part of the lower end of the spectrum. The approximation properties can be improved substantially if general masters[7] are considered.
245 Let Zl,... , z m E ~;~n be independent mastervectors, and let y m + l , . . . , Y n be a complementary basis of { z l , . . . , zm} • With Z := ( Z l , . . . , Zm)and Y := (ym+l,..., Yn)every x E R N can be written as x = Yx, + Zx,~, x, E R ~-m, x,~ E R m. Going with this representation into equation (1) and multiplying with the regular matrix (Y, Z) T from the left one obtains the eigenvalue problem
( yZTTKK Yy
yTKZ Z TKZ
)( ) Xs
_
,~
x~
Z T
MY
Z TMZ
Xm
(9)
This decomposition could serve as a basis for condensation with general masters. However, there is a strong practical objection to this naive approach: For large systems the small number of general masters z l , . . . , z,~ will be accessible whereas the large number of complementary vectors are not. Mackens and the second author[7] proved that condensation with general masters can be performed using Z only. Let
P = K-*Z(ZrK-1Z)-IZrZ.
(10)
Then the projected eigenvalue problem
Kox,~ := pTKPx,~ =/~pTMPxm =: )~Moxm,
(11)
is equivalent to the condensed eigenvalue problem with general masters Z l , . . . , z,~. If the matrix Z has orthogonal columns, then the projection matrix P can be determined from the augmented linear system
(o)
,,2,
Similarly as (6) this linear system obtains block form[8] if substructuring is used, if all interface degrees of freedom are chosen as masters, and if they are complemented by general masters the support of each of which is contained in exactly one substructure. Hence, the parallelization concept from [11 ] applies in this case as well [8]. In particular, choosing a small number of eigenmodes of the clamped substructures (modal masters) additionally to the interface masters (which is equivalent to the component mode method [4]) improves the accuracy of condensation considerably. Pelzer[9] reports for a finite element model of a container ship with ca. 35.000 degrees of freedom the following results. Decomposing the model into 10 substructures und using interface masters only yields a model with 2097 DoF. Approximating the 50 smallest eigenvalues of the structure with this condensed problems yields a maximum relative error of 110%. Adding 170 modal masters of the substructures reduces the maximum relative error to 0.13%. So a small number of additional general masters improves the accuracy considerably. The disadvantage of this parallelization concept however is that we are left with the large number of interface masters to obtain the block structure of system (12). A way out is to reduce the number of interface masters itself using modal informations.
246 3. INTERFACE CONDENSATION AND MULTILEVEL E X T E N S I O N
Partitioning the structure FE model into many substructures on multiple levels, obtaining substructure modes up to a specified cutoff frequency, and projecting the problem onto the substructure modal subspace results in the Adaptive Multi-Level Substruture (AMLS) method introduced by Bennighof and co-authors [1 ], [2], [5]. Using the same reordering as in (6) for the stiffness and mass matrices (K and M) and applying a block-Gauss elimination on K yields a block diagonal matrix
K = diag (Kssl, ..., Kssr, I(o)
(13)
with/(0 from definition (7). The same transformation is also applied to the mass matrix M which keeps its structure
Ms~ 2f/I=
...
0
0
9
...
:
:
9
0 0
... ...
M882 0
0 Mssl
2(I~m2
]~/msr
9 99 ]~ms2
f/i,,,. ,
(14)
]~srnl
]~/msl
Mo
but changes all blocks. ]~/i0 =
Mo .j_ E
( Mmsi ( -K-~,~K~m~) -1 -1 t t [Msm i + Mssi (K-ss~Ksm~)]) _ + (-K-,~Kms~)
(15)
i=1
M~mj =
M~mj + M~j (-KssjKsmj) -1
(16)
Subsequently modal truncation is applied to all substructure pencils (K~sj, Mssj). By applying the same modal truncation to interface variables (K0,/17/0) the dimension of the condensed problem can be reduced considerably. Without the modal truncation of the interface pencil this algorithm would be similar to the component mode synthesis by Craig and Bampton [4] using fixed interface modes and normal modes. The modal truncation for the matrix pencil (/(o, 217/0) leads to some eigenvectors q~0 corresponding to the smallest eigenvalues. The vectors /~,ml ~o =
--~f)~mr ~o
(17)
/no are the solutions to the Steklov-Poincar6 operator applied to the first eigenmodes of the interface. 3.1. Multilevel extension
The just presented algorithm has in some cases still the drawback of high numbers of interface variables. Choosing the number of substructures as high as the number of processors available
247 might increase the number of interface variables dramatically again 9 A high imbalance in the blocksizes of all of the Ks~i to the Kmm block will lead to a high imbalance in computational work (concentrated on the master again) despite the modal truncation on the interface. This effect can be reduced by extending this algorithm recursively. This can be done for all substructure pencils (K~i, M ~ ) . Some operations in equations (7, 15,16) require quantities (M~si, K~i or K~-~) which might be substructured themselves in the recursive algorithm. Affected are the terms --
(-
(18)
in equation (7) and Mp od
--
(-
(19)
) -
in equations (15,16). These terms are implicitly computed in the recursive algorithm. The second term is written as a product with the first term because it only occurs in this form and can safely be computed in advance. The term "Kre~" can be viewed as solving the (itself again) substructured stiffness matrix Kssi for the fight hand side Ksmi (under a congruence transformation)
K~i -
kss~
...
0
0
k~m~
9
"',
*~
i
~
0
"'.
k~2
0
k~m2
0 k,~
... ...
0 ks~l ksml km~2 km~l kmm
K~,~i~
=
"
(20)
Ksmi~ Ksmio
This results in simply passing the right hand side to the next lower level of the recursion (appending the columns of ksmi) for the first r blocks
K~i~ - k s~}K~mij j -
1, ..., r.
(21)
Only the last block is a little more complicated
Kresio __ km m^-IRsmi ~ :=
krnm -~-
k m s j ( - kssjksrnj1)j=l
Ksrnio --~
~msj(-kssjKsrnij)-i
.
j=l
(22) The local/(sm~ columns can always be combined with the higher level Ksmi columns. Similar considerations apply to the term Mproa~. 4. P A R A L L E L I Z A T I O N The AMLS algorithm has been implemented with the MPI and threads. The MPI is used to distribute parts of the problem to computation nodes which operate on separate memory. Threads are used to employ multiple CPUs on the same shared-memory and are created dynamically
248 during the computation. A fixed maximum is defined for each host (usually the number of CPUs). The program sets up a communication and control loop on each node it runs on for MPI communication and thread scheduling. First on the master and later on all other nodes, the algorithm splits the problem into parts. From each of these parts a new node is created and queued on the local host, or transferred to another host. The threads can subsequently process each of these nodes. Processing of a node can have the following meanings. 9 Determine the number of substructures for this node. This step includes the partitioning if more than one substructure is determined. If only one substructure is determined this part is a leaf of the substructuring tree and results are computed immediately and returned to the parent node. 9 Computation of intermediate results. If all results from substructures are present the results of this node are computed and returned to its parent. Computation of the number of partitions is implemented as a function of dimension, number of nonzero elements in the matrix, recursion level and a predefined minimum (1~ leaf) and maximum value for the number of substructures. A different substructuring rule should be applied to the highest level because of the significant impact on memory consumption and computational time. The algorithm is implemented recursively passing the following terms between recursion levels /c~m =
(/C~ml.../C~m~) ~
(23)
l~re s
-~-
( [4~res l . . . t4(resr i~reso ) T
(24)
M~
=
(M~I
(25)
Mmult
--
M
. . .M ~ )
~
, I(re s
(26)
The recursion is implemented in a depth-first manner. This leads to a small number of tree nodes for which intermediate results have been computed and are in memory at the same time. Memory consumption can be less efficient if multiple threads work on different locations in the tree. Computation of intermediate results comprises the computation of the local part of the final condensed generalized eigenvalue problem and the return of the results to the parent node. 5. RESULTS AND DISCUSSION To test the approximation properties of the adaptive multi-level substructuring method we considered the vibrational analysis of a container ship treated already in [ 10]. Usually in the dynamic analysis of a structure one is interested in the response of the structure at particular points to harmonic excitations of typical forcing frequencies. For instance in the analysis of a ship these excitations are caused by the engine and the propeller, and the locations of interest are in the deckshouse where the perception by the crew is particularly strong. The finite element model of the ship (a complicated 3 dimensional structure) is not determined by a tool like ANSYS or NASTRAN since this would result in a much too large model.
249 Since in-plane displacements of the ship's surface do not influence the displacements in the deckshouse very much it suffices to discretize the surface by linear membrane shell elements with additional bar elements to correct warping, and to model only the main engine and the propeller as three dimensional structures. For the ship under consideration this yields a very coarse model with 19106 elements and 12273 nodes resulting in a discretization with 35262 degrees of freedom. We consider the structural deformation caused by an harmonic excitation at a frequency of 4 Hz which is a typical forcing frequency stemming from the engine and the propeller. Since the deformation is small the assumptions of the linear theory apply, and the structural response can be determined by the mode superposition method taking into account eigenfrequencies in the range between 0 and 7.5 Hz (which corresponds to the 50 smallest eigenvalues for the ship under consideration). Since the finite element model is very coarse the accuracy requirements are very modest, and an error of 10 % for the natural frequencies often suffices [3]. To determine these eigenvalues with the required accuracy the AMLS method generated a substructuring with 5 levels where the cutoff frequency for eigenmodes of the substructures and the interfaces was chosen to be 8250. The dimension of the reduced eigenvalue problem was 172, where 19, 64, 61, 28 and 0 eigenmodes of substructures or interface patches of the levels 0 to 4, respectively, contributed to the reduced model. Generating and solving this problem on an Intel XEON machine with two 2.2 GHz processors and 5 GB shared memory required a CPU time of 150 seconds using one processor and 81 seconds with both processors (2 Threads) yielding a speedup of 1.85. For comparison, the same eigenvalue problem was considered in [10] using condensation with general masters. To meet the accuracy requirements there a partition into 10 natural substructures had to be augmented by 50 general masters obtained by reanalysis techniques (yielding a condensed problem with 2147 DoF). Hence, AMLS needs a much smaller reduced model than condensation with general masters and classical substructuring, and it is worth to be studied further. Our future research will concentrate on two issues: First, a uniform cutoff frequency is used for all substructures and all interface patches. Can the approximation quality be improved by some strategy to choose cutoff frequencies individually (for instance, depending on the size or on the level)? Second, the partitioning is generated depending only on the graph of the stiffness matrix. Is it possible to improve the accuracy (and hence reduce the dimension of the reduced model for a given accuracy requirement) taking into account geometric properties of the domain (reentrant comers) or the behaviour of (already approximated) eigenmodes ? ACKNOWLEDGEMENTS Thanks are due to Christian Cabos, Germanischer Lloyd, who provided us with the finite element model of the container ship. The first author gratefully acknowledges financial support of this project by the German Foundation of Research (DFG) within the Graduiertenkolleg "Meerestechnische Konstruktionen". REFERENCES [1]
J.K. Bennighof, M.F. Kaplan, M.B. Muller, M. Kim, Meeting the NVH computational challenge: Automated Multi-Level Substructuring. Proceedings of the 18th International
250 Modal Analysis Conference, San Antonio, Texas, 2000 J.K. Bennighof, R. Lehoucq, An automated multilevel substructuring method for eigenspace computation in linear elastodynamics. Tech.Rep. SAND2001-3279J. Sandia National Laboratory 2001 [3] C. Cabos, Private communication 2001 [4] R.R. Craigh, Jr., and M.C.C. Bampton, Coupling of substructures for dynamic analysis. AIAA Joumal 3, 1313 - 1319 (1968) [5] M.F. Kaplan, Implementation of Automated Multilevel Substructuring for Frequency Response Analysis of Structures. Ph.D. thesis, University of Texas at Austin 2001 [6] A. Kropp and D. Heiserer, Efficient broadband vibro-acoustic analysis of passenger car bodies using an FE-based component mode synthesis approach. In H.A. Mang, F.G. Rammerstorfer, J. Eberhardsteiner (eds.), Proceedings of Fifth World Congress on Computational Mechanics, Vienna, Austria 2002 [7] W. Mackens and H. Voss, Nonnodal condensation of eigenvalue problems, ZAMM 79, 243 - 255 (1999) [8] W. Mackens and H. Voss, General masters in parallel condensation of eigenvalue problems, Parallel Computing 25,893 - 903 (1999) [9] A.M. Pelzer, Systematische Auswahl von Masterfreiheitsgraden fiir die parallele Kondensation von Eigenwertaufgaben. Shaker Verlag, Aachen 2001 [ 10] A. Pelzer and H. Voss, A Parallel Condensation-Based Method for the Structural Dynamic Reanalysis Problem. pp. 3 4 6 - 353 in G.R. Joubert, A. Murli, F.J. Peters, M. Vanneschi (eds.), Parallel Computing: Advances and Current Issues, Proceedings of ParCo2001, Imperial College Press,2002 [ 11] K. Rothe and H. Voss, A fully parallel condensation method for generalized eigenvalue problems on distributed memory computers. Parallel Computing 2 1 , 9 0 7 - 921 (1995) [2]
Parallel Computing: SoftwareTechnology,Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
251
Parallel Model Reduction of Large-Scale Unstable Systems P. Benner a*, M. Castillo bt, E.S. Quintana-Orti bt, and G. Quintana-Orti bt aInstitut f'tir Mathematik, MA 4-5, Technische Universit~it Berlin, D-10623 Berlin, Germany bDepto, de Ingenieria y Ciencia de Computadores, Universidad Jaume I, 12.071-Castell6n, Spain We discuss an efficient algorithm for model reduction of large-scale unstable systems on parallel computers. The major computational step involves the additive decomposition of a transfer function via a block diagonalization. The actual model reduction is then achieved by reducing the stable part using techniques based on state-space truncation. We will see that all core computational steps can be based on the sign function method. Numerical experiments on a cluster of Intel Pentium-IV processors show the efficiency of our methods. 1. INTRODUCTION Consider the transfer function matrix (TFM) G(s) = C ( s I - A ) - I B + D, and the associated, not necessarily minimal, realization of a linear time-invariant (LTI) system,
~(t) y(t)
: :
Ax(t) + s~(t), C~(t) + D~(t),
t>0, t>0,
(1)
with A E IRnxn, B E lRnxm, C E IRp• and D E IRp• For simplicity we assume that the spectrum of A is dichotomic with respect to the imaginary axis, i.e., Re(A) -r 0 for all eigenvalues A of A. The case with eigenvalues on the imaginary axis could be treated as well with the method described in this paper, but this would add some distracting technicalities. Throughout this paper we will denote the spectrum of A by A(A). The number of state variables n is called the order of the system. We are interested in finding a reduced-order LTI system,
~(t) ~)(t) -
A~(t)+ L)~(t), d ~ ( t ) + z)~(t),
t>0,
t >_ 0,
(2)
of order r, r << n, such that the TFM G(s) - (~(sI - A ) - I / ) + / ) approximates G(s). A well-known approach for model reduction is based on state-space truncation or projected dynamics; see, e.g., the recent monographs [1, 2]. Methods based on truncated state-space transformations usually differ in the measurement of the approximation error and the way they *Supported by the DFG Research Center "Mathematics for key technologies" (FZT 86) in Berlin. Supported by the project CTDIA/2002/122 of the Generalitat Valenciana and the spanish MCyT project TIC200204400-C03-01.
252 attempt to minimize this error. Balanced truncation (BT) methods, singular perturbation approximation (SPA) methods, and optimal Hankel-norm approximation (HNA) methods all belong to the family of absolute error methods, which try to minimize tlAal] = ] [ G - GI] for some system norm I[. Jl. All methods mentioned so far assume that the system is stable and therewith cannot be applied directly to unstable systems which occur quite often, in particular if stabilization of the system is the computational task to solve. If a stabilization strategy for a large-scale unstable system is to be designed, but the model is too large to be treated by the stabilization procedure, model reduction of the unstable plant model is needed. Unstable systems can also occur in controller reduction: controllers are often themselves unstable systems and therefore the task of controller reduction leads to model reduction of unstable systems [3]. Usually, unstable poles cannot be neglected when modeling the dynamics of the system, and therefore should be preserved in the reduced-order system in some sense. This is trivially satisfied using the following approach [4]: first, compute an additive decomposition of the transfer function,
G(s) = G_(s) + G+(s) = (C_(sI - A _ ) - I B _ + D_) + (C+(sI - A+)-IB+ + D+) ,
(3)
such that G_ is stable (has all its poles in the open left half plane, C-) and G+ is unstable (has all its poles in the open fight half plane, C+). Then any of the BT, SPA, or HNA model reduction methods can be applied to G_ in order to obtain a reduced-order system G_ and the reduced-order system is synthesized by =
(4)
Hence, the unstable part is preserved in the reduced-order system. This is advantageous in controller reduction where it is needed to guarantee the stabilization property of the controller. Of course, if the number of unstable poles is dominating, the potential for reducing the model is limited. But in most applications, in particular those coming from semi-discretization of parabolic or hyperbolic partial differential equations, the numbers of unstable poles is very low compared to the number of stable poles. The outline of the paper is as follows: in the next section, we will show how an additive decomposition of a transfer matrix can be computed using sign function-based algorithms for block diagonalization of a general square matrix. In Section 3 we review the BT method used to reduce the stable part of the system. All absolute error model reduction are also based on sign function computations and use low-rank factorizations of the system Gramians. Sign functionbased methods allow very efficient and scalable implementations of the proposed algorithms. We will briefly discuss the implementation of the suggested procedure for model reduction of unstable systems in Section 4. The numerical examples reported in Section 5 on a cluster of Intel Pentium-IV processors reveal the performance of the parallel algorithms. 2. ADDITIVE DECOMPOSITION OF A TRANSFER FUNCTION Here we present the method in terms of a generic realization (A, B, C, D) of a transfer function G(s). First, we describe the sign function method as it will serve as the major tool in the computations. Consider a matrix Z C IRn•
with A (Z) N ziP, - 0 and let Z = 5' [~[-- o+/~S_I be its J
'
3
253 Jordan decomposition, where the Jordan blocks in J - E IRk• and J+ E IR(n-k)x(n-k) contain, respectively, the eigenvalues of Z in the open left and right half of the complex plane. The
matrix sign function of Z is defined as sign (Z) "= S [-Ik 0 In-0 k] S-1, where Ik denotes the identity matrix of order k. Many other definitions of the sign function can be given; see [5] for an overview. The matrix sign function has proved useful in many problems involving spectral decomposition as (Is - sign (Z))/2 defines the skew projector onto the stable Z-invariant subspace parallel to the unstable subspace. (By the stable invariant subspace of Z we denote the Z invariant subspace corresponding to the eigenvalues of Z in the open left half plane.) Applying Newton's root-finding iteration to Z 2 = In, where the starting point is chosen as Z, we obtain the Newton iteration for the matrix sign function: ,A
I..
Zo +-- Z,
1
Zj+I +-- -~(Zj + Z?I),
j = 0, 1, 2, . . . .
(5)
Under the given assumptions, the sequence {Zj }~=o converges to sign (Z) - l i m j ~ Zj [6] with a locally quadratic convergence rate. As the initial convergence may be slow, acceleration techniques are often used; e.g., in the determinantal scaling:
zj +-- cjzj,
- I
(zj)l
1,
where det (Z j) denotes the determinant of Zj. For a discussion of several scaling strategies, see [5]. Once we have computed sign (Z), we can obtain an orthogonal basis for the stable invariant subspace of Z by computing a (rank-revealing) QR factorization of In - sign (Z); e.g., let
Rex R12J 0 0
In - sign (Z) -- ~ R P ,
] ' Rll kx '
where P is a permutation matrix. Then
2 "- Q T Z Q -
[Zll L 0
Zi2 Z22
(6)
where A (Zll) : A (Z)I"I C - , Zll E IR hxk, A (222) = A (Z)I"1 C - , Z22 C IRn-kxn-k; i.e., the similarity transformation defined by Q separates the stable and the unstable parts of the eigenspectrum of Z. In a second step, we compute a matrix V such that
9- V - 1 2 V where Y E IRk•
-
I Ik L 0
Y]E zll
I,~_k
0
Z22
0
Yl
I,~_k
'
satisfies the Sylvester equation
Z11Y - Y Z22 + Z12 = O.
As A (Zll) n A (Z22) = O, equation (7) has a unique solution [7].
(7)
254 The Sylvester equation (7) can again be solved using a sign function-based procedure. For an equation
EY-YF+W=O with E and - F stable matrices, this procedure, already derived in [6], can be formulated as follows
"-- I (Ej -I- E;1) , Eo := E, Fo : = i F ,
Ej+I . - ~ (F9 + FJ--1) Fj+I :=5 ' Wj+I:-- ~1 (Wj -Jr-E ; I W j F j 1)
Wo := W,
j
=
0, 1 2,
'
(8)
....
It follows that lirnj_,or Ej = --irk, limj__,o~ Fj - In-k, and Y = 1 liinj__,~ Wj. For an efficient implementation of this iteration on modem computer architectures and numerical experiments reporting efficiency and accuracy, see [8]. Having solved (7) with the scheme in (8) we obtain V. Applying the above steps to the matrix A from the realization of G(s) we now obtain the desired additive decomposition of G(s) = C(sI - A ) - I B + D as follows: perform a statespace transformation
(fI, [3, O, D) "- (V-1QTAQV, V-1QTB, CQV, D), and partition 0
A22
'
B2
'
C=[C1
C2],
(9)
according to the partitioning in (6). Then
G(s)
=
C ( s I - A ) - I B + D = C ( s I - A)-I/) + /)
- - - [ C1 C2 ] [ ( 8 I k - A l l ) - I
(sln-k
-
A22)-1
I [ B1
B2
+ D
~-- {Cl(8Ik - All)-IB1 -Jc-D} + {C2(8In_k -- A22)-lB2} =.
+
where G_(s) is a stable TFM and G+(s) is an unstable TFM. 3. MODEL REDUCTION OF STABLE SYSTEMS
Here we briefly review the BT method for model reduction of stable systems. If the original system is unstable, the procedure in the previous section allows to separate the stable and unstable parts, so that we consider hereafter (A, B, C, D) := (All, B1, C1, D), the realization associated with the stable TFM G_(s). All absolute error methods are strongly related to the controllability Gramian Wc and the observability Gramian Wo of the LTI system. In the continuous-time case, the Gramians are given by the solutions of two coupled Lyapunov equations
AW~ + WcA T + B B T - O, ATWo + WoA + CTC = O.
(lO)
255 As A is assumed to be stable, the Gramians We and Wo are positive semidefinite, and therefore there exist factorizations We = S TS and Wo = RTR. The Lyapunov equations are solved in our model reduction algorithms using the Newton iteration for the matrix sign function introduced by Roberts [6] and closely related to the iteration for Sylvester equations discussed in the previous section. This iteration is specially adapted taking into account that, for large-scale (non-minimal) systems, S and/~ are often of low numerical rank. By not allowing these factors to become rank-deficient, a large amount of workspace and computational cost can be saved. For a detailed description of this technique, see, e.g., [9]. Consider now the singular value decomposition (SVD)
[ 01 ] 0
22
Vf
'
(11)
where the matrices are partitioned at a given dimension r such that E1 = diag ( a l , . . . , cry), E2 = diag (cry+l,..., (7~), crj >_ ~7j+1 > 0 for all j, and (7~ > ~7~+1. The so-called square-root (SR) BT algorithms determine the reduced-order model as fl = T~ATr, OCTr,
& = T~B, D D,
(12)
using the projection matrices Tt = E I ' / 2 V f R
and
T, = sTu1E11/2.
(13)
Due to space limitations, we refer the reader to [10, 11, 12, 13] for a survey of parallel model reduction methods of stable systems based on state-space truncation. Serial implementations of the model reduction algorithms discussed here can be found in the Subroutine Library in Control T h e o r y - SLICOT (available from http : / / w w w . w i n . tue. nl/niconet/NIC2/slicot.html). 4. IMPLEMENTATION DETAILS
The additive decomposition and the model reduction methods basically require matrix operations such as the solution of linear systems and linear matrix equations (Lyapunov and Sylvester), and the computation of matrix products and matrix decompositions (QR, SVD, etc.). The iterative algorithms for efficiently solving linear matrix equations derived from the matrix sign function (see the previous sections and [8, 9]) only require operations like matrix products and matrix inversion. All these operations are basic dense matrix algebra kernels parallelized in ScaLAPACK and PBLAS. Thus, the parallel model reduction routines, integrated into the PLiCMR library (visit h t t p : / / s p i n e . a c t . uj i . e s / ~ p l i c m r ) , heavily rely on the use of the available parallel infrastructure in ScaLAPACK, the serial computational libraries LAPACK and BLAS, and the communication library BLACS. In order to improve the performance of our parallel model reduction routines we have designed, implemented, and employed two specialized parallel kernels that outperform parallel kernels in ScaLAPACK with an analogous purpose: the QR factorization with partial pivoting is computed in our codes by using a parallel BLAS-3 version instead of the traditional BLAS2 approach [14]. Also, our matrix inversion routine is based on a Gauss-Jordan elimination procedure [ 15].
256 Details of the contents and parallelization aspects of the model reduction routines for stable systems are given, e.g., in [ 10]. A standardized version of the library is integrated into the subroutine library PSLICOT, with parallel implementations of a subset of SLICOT. It can be downloaded from the URI f t p : / / f t p . e s a t . k u l e u v e n . a c . b e / p u b / W G S / S L I C O T . However, it is recommended to obtain the version of the library from http-//spine.act.uji.es/~plicmr as it might, at some stages, contain more recent updates than the version integrated into PSLICOT. The library can be installed on any parallel architecture where the above-mentioned computational and communication libraries are available. The efficiency of the parallel routines will depend on the performance of the underlying libraries for matrix computation (BLAS) and communication (usually, MPI or PVM).
5. N U M E R I C A L E X P E R I M E N T S
All the experiments presented in this section were performed on a cluster of 16 nodes using IEEE double-precision floating-point arithmetic (c ~ 2.2204 • 10-16). Each node consists of an Intel Pentium-IV processor at 1.8 GHz with 512 MBytes of RAM running the Linux (SuSE 7.3) operating system. We employ a BLAS library, specially tuned for the Pentium-IV processor, that achieves around 2.8 Gflops (millions of flops per second) for the matrix product (routine DGEMM). The nodes are connected via a Myrinetmultistage network; the communication library BLACS is based on an implementation of the communication library MPI specially developed and tuned for this network. The performance of the interconnection network was measured by a simple loop-back message transfer resulting in a latency of 61 #sec. and a bandwidth of 280 Mbit/sec. We made use of the LAPACK, PBLAS, and ScaLAPACK libraries whenever possible. In this section we report the performance of the parallel routine p a b 0 9ex for the computation of the additive decomposition of a TFM. In order to mimic a real case, we employ a random single-input single-output LTI system with a single unstable pole. Our first experiment reports the execution time of the parallel routine on a system of order n = 2500. This is about the largest size we could evaluate on a single node of our cluster, considering the number of data matrices involved, the amount of workspace necessary for computations, and the size of the RAM per node. The left-hand plot in Figure 1 reports the execution time of the parallel routine using rip=l, 2, 4, 6, 8, and 10 nodes. The execution of the parallel algorithm on a single node is likely to require a higher time than that of a serial implementation of the algorithm (using, e.g., LAPACK and BLAS); however, at least for such large scale problems, we expect this overhead to be negligible compared to the overall execution time. The figure shows reasonable speed-ups when a reduced number of processors is employed. Thus, when rip=4, a speed-up of2.11 is obtained for routine p a b 0 9 e x . As expected, the efficiency decreases as np gets larger (as the system dimension is fixed, the problem size per node is reduced) so that using more than a few processors does not achieve a significant reduction in the execution time for such a small problem. We next evaluate the scalability of the parallel routine when the problem size per node is constant. For that purpose, we fix the problem dimensions to n / ~ = 2500, and report the Gigaflops per node. The fight-hand plot in Figure 1 shows the Gigaflop rate per node of the
257 5o0
A d d i t i v e d e c o m p o s i t i o n of a r a n d o m u n s t a b l e s y s t e m of o r d e r n=2500/sqrt(np) 2
A d d i t i v e d e c o m p o s i t i o n of r a n d o m u n s t a b l e s y s t e m of o r d e r n = 2 5 0 0
Q
450 400 350
i+ol
~= 250
150
0.6
IO0
0.4
5O
0.2
0
2
4
6
N u m b e r of p r o c e s s o r s
8
10
12
00
2
4
6
8
10
N u m b e r of processors
12
14
16
18
Figure 1. Performance of the parallel routine for the computation of the additive decomposition of a TFM.
parallel routine. These results demonstrate the scalability of our parallel kernels, as there is only a minor decrease in the performance of the algorithms when np is increased while the problem dimension per node remains fixed. A thorough analysis of the performance of the model reduction routines for stable systems is given in [ 10]. 6. CONCLUSIONS We have presented an efficient approach for model reduction of large-scale unstable systems on parallel computers. All computational steps in the approach are solved using iterative algorithms derived from the matrix sign function, which has been shown to offer a high degree of parallelism. Numerical experiments on a cluster of Intel Pentium-IV processors confirm the efficiency and scalability of our methods. Analogous routines for model reduction of unstable large-scale discrete-time systems are under current development. REFERENCES
[1] [2]
[3]
A. Antoulas, Lectures on the Approximation of Large-Scale Dynamical Systems, SIAM Publications, Philadelphia, PA, to appear. G. Obinata, B. Anderson, Model Reduction for Control System Design, Communications and Control Engineering Series, Springer-Verlag, London, UK, 2001. A. Varga, Task II.B.1 - selection of software for controller reduction, SLICOT Working Note 1999-18, The Working Group on Software (WGS), available from
http://www, win. tue.nl/niconet/NIC2/reports .html (Dec. 1999).
[4] M. Safonov, E. Jonckheere, M. Verma, D. Limebeer, Synthesis of positive real multivariable feedback systems, Internat. J. Control 45 (3) (1987) 817-842.
[5] C. Kenney, A. Laub, The matrix sign function, IEEE Trans. Automat. Control 40 (8) [6]
(1995) 1330-1348. J. Roberts, Linear model reduction and solution of the algebraic Riccati equation by use of the sign function, Internat. J. Control 32 (1980) 677-687, (Reprint of Technical Report No. TR- 13, CUED/B-Control, Cambridge University, Engineering Department, 1971).
258 [7] [8] [9] [ 10] [ 11]
[12]
[ 13] [14] [15]
R Lancaster, M. Tismenetsky, The Theory of Matrices, 2nd Edition, Academic Press, Orlando, 1985. P. Benner, E. Quintana-Orti, G. Quintana-Orti, Solving linear matrix equations via rational iterative schemes, in preparation. P. Benner, E. Quintana-Orti, Solving stable generalized Lyapunov equations with the matrix sign function, Numer. Algorithms 20 (1) (1999) 75-100. P. Benner, E. Quintana-Orti, G. Quintana-Orti, State-space truncation methods for parallel model reduction of large-scale systems, Parallel Comput. To appear. P. Benner, E. Quintana-Orti, G. Quintana-Orti, Balanced truncation model reduction of large-scale dense systems on parallel computers, Math. Comput. Model. Dyn. Syst. 6 (4) (2000) 383-405. P. Benner, E. Quintana-Orti, G. Quintana-Orti, Singular perturbation approximation of large, dense linear systems, in: Proc. 2000 IEEE Intl. Symp. CACSD, Anchorage, Alaska, USA, September 25-27, 2000, IEEE Press, Piscataway, NJ,, 2000, pp. 255-260. P. Benner, E. Quintana-Orti, G. Quintana-Orti, Efficient numerical algorithms for balanced stochastic truncation, Int. J. Appl. Math. Comp. Sci. 11 (5) (2001) 1123-1150. G. Quintana-Orti, X. Sun, C. Bischof, A BLAS-3 version of the QR factorization with column pivoting, SIAM J. Sci. Comput. 19 (1998) 1486-1494. E. Quintana-Orti, G. Quintana-Orti, X. Sun, R. van de Geijn, A note on parallel matrix inversion, SIAM J. Sci. Comput. 22 (2001) 1762-1771.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
259
Parallel Decomposition Approaches for Training Support Vector Machines* T. Serafini a, G. Zanghirati b, and L. Zanni a aDepartment of Mathematics, University of Modena and Reggio-Emilia, via Campi 213/b, 41100 Modena, Italy. E-mail: serafini, thomas@unimo, it, zanni, luca@unimo, it. bDepartment of Mathematics, University of Ferrara, via Machiavelli 35, 44100 Ferrara, Italy. E-mail: g. zanghirati@unife, it. We consider parallel decomposition techniques for solving the large quadratic programming (QP) problems arising in training support vector machines. A recent technique is improved by introducing an efficient solver for the inner QP subproblems and a preprocessing step useful to hot start the decomposition strategy. The effectiveness of the proposed improvements is evaluated by solving large-scale benchmark problems on different parallel architectures. 1. I N T R O D U C T I O N Support Vector Machines (SVMs) are an effective learning technique [11 ] which received increasing attention in the last years. Given a training set of labelled examples D = {(zi, yi), i - 1 , . . . , n ,
zi E R m, yi E { - 1 , 1}},
the SVM learning methodology performs classification of new examples z c R m by using a decision function F : R m ~ { - 1 , 1}, of the form
F(z)-sign(~x~y~K(z,z~)+b*),i=~
(1)
where K : R'~ • ]R"~ ~ R denotes a special kernel function (linear, polynomial, Gaussian,... ) and x* - (x~,..., x~) T is the solution of the convex quadratic programming (QP) problem rain
]
f ( x ) - - ~ x T G x - xT1 /_,
sub. to
y T x = O,
O <_ xj < C,
j=
l,...,n,
(2)
where G has entries Gij = yiyjK(z~,zj), i,j = 1 , 2 , . . . , n , 1 = ( 1 , . . . , 1 ) T and C is a parameter of the SVM algorithm. Once the vector x* is computed, b* C ]t{ in (1) is easily derived. An example zi is called support vector (SV) if x~ ~ 0 and bound support vector (BSV) if x~ - C. The matrix G is generally dense and in many real-world applications its size *This work was supported by the Italian Education, University and Research Ministry (grants FIRB2001/RBAU01877P and FIRB2001/RBAU01JYPN).
260 ALGORITHM 1 (SVM Decomposition Technique) 1. Let x (~ be a feasible point for (2), let nsp and nc be two integer values such that n >_ n~p >_ nc > 0 and set i = 0. Arbitrarily split the indices { 1 , . . . , n} into the set B of basic variables, with #B = n~p, and the set N = { 1 , . . . , n} \ / 3 of nonbasic variables. Arrange the arrays x (i), y and G with respect to/3 and N:
x(i) :
X(N )
'
Y--
YN
E
'
GNB
GNN
]
"
2. Compute (in parallel) the solution x(~+1) of 1 min fB(XB) -- ~ x T G ~ x B -- xT(1 -- GBNX(iN))
(3)
~BC~B
where ftB -
{XB E R n~p l y T x B
= --YN T X N(i), 0 <_ XB <_ C1}.
S e t x (i+1) -
[ 3. Update (in parallel)the gradient V f ( x (i+1)) -- V f ( x ( i ) ) + [ GBB and terminate if x (i+l) satisfies the KKT conditions.
GBN
]T(x~+I)
4. Change at most nc elements of B. The entering indices are determined by a strategy based on the Zoutendijk's method [5, 12]. Set i +---i + I and go to step 2.
is very large (n >> 104). Thus, strategies suited to exploit the special problem features become a need, since standard QP solvers based on explicit storage of G cannot be used. Among these strategies, decomposition techniques are certainly the most investigated [3, 4, 5, 6, 7]. They consist in splitting the original problem into a sequence of QP subproblems sized nsp << n that can fit into the available memory. An effective parallel decomposition technique is proposed in [12]. It is based on the Joachims' decomposition idea [5] and on a special variable projection method [8, 9] as QP solver for the inner subproblems. The main steps of this decomposition approach are briefly recalled in Algorithm 1. In contrast with other decomposition algorithms, that are tailored for very small-size subproblems (typically less than 102), the proposed strategy is appropriately designed to be effective with subproblems large enough (typically more than 103) to produce few decomposition steps. A wide numerical investigation has shown that this strategy is well comparable with the most effective decomposition approaches on scalar architectures. However, its straightforward parallelization is an innovative feature. In fact, the expensive tasks (kernel evaluations and QP subproblems solution) of the few decomposition steps can be efficiently performed in parallel and promising computational results are reported in [12]. In this paper two improvements to the above approach are introduced: a parallel preprocessing step, which gives a good starting point to the decomposition technique, and a new parallel inner solver that exhibits better convergence rate. The good efficiency and scalability of the
261 improved technique is shown by solving large-scale benchmark problems on different multiprocessor systems. 2. A P R E P R O C E S S I N G
STEP FOR A HOT START
The decomposition approach in [12] uses the null vector as starting point. This is a good choice for problems which have few support vectors, that is many null components in x*. In other cases, where there are many support vectors, the convergence of the decomposition technique could be improved by hot starting the method with a better approximation of x*. Here, we introduce a preprocessing step that produces an estimation of x* by a scalable and not excessively expensive method; this estimation is used as the initial guess x (~ for the decomposition technique [ 12]. A rough approximation ~ of x* can be calculated by solving p independent QP subproblems with the same structure of (2), in the following way: given a partition {Ij }~=1 of { 1 , . . . , n}, compute Vj the solution ~lj of the QP subproblem subproblem
Pj "
min sub. to
] T Gijijxij 2Xlj
-
xT1 Ij
y~jXlj --O,
0 < X b < C1
(4)
where G b b - {G~kli, k c Ij}. The approximation ~ is obtained by rearranging the solutions of the subproblems: ~
- (5/T ~ ' ' ' ~ ~TIp ) " Note that problems/91 ~" \
/
"'~
are
independent and,
on a multiprocessor system, they can be distributed among the available processing elements and managed at the same time. The described procedure is equivalent to partition the training set into p disjoint subsets Dj = { (z~, y~) E D li C Ij } and in training an SVM on each Dj. After the preprocessing step the estimation 5 is taken as initial point of the parallel decomposition technique: x (~ = ~. The indices for the initial working set t3 are randomly chosen among the indices j such that ~j ~ 0 (indices of potential support vectors) and, if these are not enough, the set t3 is filled up to n ~ entries with random indices j such that yj - 0. In comparison with the strategy in [12], which uses x (~ = O, this approach spends an extra time to compute"
GBNx(~ ),
V f ( x (~ -- G x (~ - 1.
(5)
In fact, since G is not into memory, the computations (5) require expensive kernel evaluations when many components of x (~ are nonzero. Numerical results in section 4 will show that this extra time is usually compensated by a decomposition iterations reduction, so that overall time is saved. Another important issue is how to choose the partition I 1 , . . . , Ip. Small values of p lead to more expensive QP subproblems Pj than for large p, but give a better approximation of x* which yields less decomposition iterations. Besides, small values of p generally imply more null components in ~, so the computations (5) are faster. Table 1 shows the numerical behavior of the preprocessing on the UCI Adult data set (available at www. i c s . u c i . e d u / ~ m l e a r n ) when K(zi, Z j ) - - exp(-IIz i - z j l 2 / ( 2 o - 2 ) ) , 0 -2 = 10 and C - 1 are used. The experiments are carried out on a Power4 1.3GHz processor of an IBM SP4. The subproblems Pj, j - 1,...~ p - 1, have size np while Pp is sized n - (p - 1)rip.
262 Table 1 Preprocessing with different subproblem size on UCI p np SVest % corr. time p Up 2 5610 4328 90.7 40.70 33 340 4 2805 4441 85.5 22.72 81 139 9 1247 4780 79.6 16.82 197 57 21 535 5271 73.2 15.88 387 29
Adult data set (n = 11220, SV = 4222) SVest ~/0 corr. time 5620 68.7 16.23 6415 60.9 17.80 7561 52.3 20.53 8674 46.3 23.30
All the subproblems are solved by the gradient projection method described in section 3. We stop the solver when the KKT conditions for problem Pj are satisfied within a tolerance of 0.01 (see section 4 for a motivation of this choice). In the table we report the number S V e s t of nonzero components in ~ (an estimation of the final number of support vectors) and the percentage of correctly estimated SVs (% corr.). The time column reports the seconds needed for both the preprocessing and the computations (5). It can be observed that large subproblems give more accurate estimation of the final set of support vectors. Furthermore, for decreasing np the time reduces until computations (5) become predominant due to the increased number of SVest. This suggests that a sufficiently large np is preferable both in terms of SV estimation and computational time. 3. THE G E N E R A L I Z E D VARIABLE PROJECTION M E T H O D FOR THE INNER SUBPROBLEMS
In this section we discuss the solution of the subproblem (3) by the Generalized Variable Projection Method (GVPM), recently introduced in [ 10]. This iterative scheme is a gradient projection method that uses a limited minimization rule as linesearch technique [2] and a steplength selection based on an adaptive alternation between the two Barzilai-Borwein rules [ 1]. When it is applied to (3), it can be stated as in Algorithm 2. It is interesting to observe that the special variable projection method used in [ 12] for solving (3) can be considered a special case of 3, obtained by setting is = 2 and Tbmi n = Tbma x = 3 ; it will be denoted by AL3-VPM in the sequel. In comparison with AL3-VPM, 3 presents the same low computational cost per iteration (O(n~p)), but it benefits from a more effective selection rule for the steplength c~k+l, that generally yields better convergence rate. We refer to [10] for a deeper analysis of the GVPM behavior on more general QP problems and for comparisons with other gradient projection methods. Here, some numerical evidences of the GVPM-SVM improvements with respect to the AL3VPM are given by comparing their behavior on the subproblems arising when the decomposition techniques [12] is applied to train a Gaussian classifier on the MNIST database of handwritten digits ( y a n n . l e c u n . c o m / e x d b / m n • A classifier for the digit "8" is trained with SVM parameters C = 10 and a - 1800; the corresponding QP problem of size 60000 is solved by decomposing in subproblems sized 3700. Both solvers use the projection onto f2B of the null vector as starting point and a stopping rule based on the fulfillment of the KKT conditions within a tolerance of 0.001. In the GVPM-SVM, the parameters is = 2, O~mi n - - 1 0 - 3 0 , Cemax = 103~ n m i n - - 3, nmax = 10, Ae = 0.1 and A~, = 5 are used. These experiments are carried out on the platform described in section 2 and the results are summarized in Table 2; we report the number of iterations and the computational time in seconds required by the two
263 ALGORITHM 2 (Generalized Variable Projection Method for subproblem (3)) 1. Let w (~ E ~ B , N, 0 < nmin_< nmax,
is E {1,2},
0 < OLmin < OLmax,
set
Ae_< 1_< Au;
n~=l,
OL0 E [OLmin, OLmax],
Ttmin ~Ttrnax E
k=0.
2. Terminate if w (k) satisfies a stopping criterion; otherwise compute* :
+
3. Compute 4. If
W (k+l) --
d(k)TGBBd(k) compute
-
_ (1) ~k+l
then
=
set ak+l
-
Ak -- arg mince[o,1] fB(W (k) + Ad (k))
W (k) --[-Akd (k) with
= 0
1)))
= Olmax;
d(k)Td(k) _ (2) d(k)TGBBd(k ), ~k+l
else
d(k)TGBBd(k) =
d(k)T G2BBd(k)
)kopt
:
arg rain fB(w (k) + Ad (k)) A
If
(n~ or
~ Ttmin) (
and
((n~ _> nmax) (1)
Aopt < Ae and c~k = c~k )
or or
(2) < OLk < rv(1) OLk+ 1 m -- ~ k + l ) (Aop t >
Au
and
then set
is +- mod(i~, 2) + l,
n~=0;
end. Compute c t k + l - min {OZmax,max {Ctmin, C~k+
,
end. Set k ~ k + 1, n~ ~ n~ + 1 and go to step 2. * Here P a . (-) denotes the orthogonal projection onto f~ B.
solvers for solving the subproblems T3, j = 1 , . . . , 6, generated by the decomposition technique. From these numerical results it is evident the advantage in terms of computational time implied by the better GVPM-SVM convergence rate. Finally we remark that, since the main computational task of each iteration is the matrix-vector product GBBd (k), a parallel version of the GVPM-SVM can be easily obtained by a blockwise distribution of the GBB rows among the available processors.
4. N U M E R I C A L E X P E R I M E N T S ON PARALLEL A R C H I T E C T U R E S
In this section we show how the parallel Variable Projection Decomposition Technique (VPDT) [ 12] can be improved by using the preprocessing step and the GVPM-SVM inner QP solver. The new version of the decomposition scheme is called Generalized VPDT (GVPDT) and is
264 Table 2 Test problems from MNIST database (n~p = 3700). AL3-VPM GVPM-SVM test SV BSV it. time it. time test SV BSV T1 545 2 347 3.1 377 3.1 T4 2694 32 T2 1727 75 966 17.6 793 14.0 T5 2963 5 875 19.4 764 T3 2262 95 17.3 T6 2993 0
AL3-VPM GVPM-SVM it. time it. time 978 25.5 769 20.4 1444 42.6 909 26.4 32.6 1290 37.9 1102
coded in standard C with MPI communication routines. In order to balance the workload, the preprocessing step is implemented by dividing the training set among the available processors and then by requiring each processor to solve the same number pe of problems Pj by the GVPMSVM. We determine Pe in such a way that the problems Pj have similar size, as close as possible to 1.Snsp. For the parallel solution of (3) at each decomposition step, we use a parallel GVPMSVM version obtained as suggested in the previous section. We compare VPDT and GVPDT on two different multiprocessor systems of the CINECA Supercomputing Center (Bologna, Italy): an IBM SP4 equipped w i t h 16 nodes of 32 IBM Power4 1.3GHz CPUs each and 64GB virtually shared RAM per node, and an IBM Linux Cluster with 128 Intel Pentium III processors at 1.13GHz, coupled in 64 two-CPUs nodes with 1GB shared RAM per node. The largest test problems arising in training Gaussian SVMs on the MNIST database (digit "8" classifier, C = 10, cr = 1800) and the UCI Adult data set (C = 1, ~2 = 10) are solved. The two decomposition techniques are run with the same parameter setting: nsp = 3700, nc - 1500 for the MNIST set and nsp = 1300, nc = 650 for the UCI Adult set. Furthermore, to avoid expensive recomputations of G entries, caching areas of 500Mb and 300Mb per processor are used for the MNIST set and the UCI Adult set, respectively. The decomposition procedures stop when the KKT conditions are satisfied within a tolerance of 0.001. As a consequence, in the GVPDT preprocessing step, the KKT conditions for each problem Pj are satisfied within 0.01 since, generally, higher accuracies increase the preprocessing computational cost without improving the convergence of the decomposition technique. The results obtained by using different numbers of processors (PEs) are reported in Table 3. The columns sp,. and effi show the relative speedup (spr(q) = tl/tq, where tj is the time needed by the program on j PEs) and the relative efficiency (eff~(q) = sp~(q)/q), respectively. In order to allow a comparison with a widely used (serial) decomposition technique, the number of iterations and the time required by the SVM t~ghtpackage [5] are also reported. We use SVM t~ght with pr_LOQO as inner QP solver, caching areas sized as above, empirical optimal values of nsp and n~ (n~p = 8, n~ = 4 for MNIST and n~p = 20, n~ = 10 for UCI Adult) and default setting for the other parameters. In case of MNIST test problem, the GVPDT results are obtained without preprocessing since, due to the small percentage of support vectors, the starting point x (~ = 0 already yields a very good convergence rate without the extra time typically required by a preprocessing step. This means that the GVPDT performance improvement is due to the effectiveness of GVPM-SVM only. Furthermore, we observe that this improvement is obtained by preserving the very good scalability of the VPDT (the superlinear speedup is due also to the larger caching area available when more processors are used). In case of UCI Adult test problem, where the GVPDT uses the preprocessing step, a remarkable iteration count reduction is obtained. This feature does not only contribute to reduce the computational time,
265 Table 3 VPDT and GVPDT on MNIST (n = 60000) and UCI Adult (n = 32562) test problems. MNIST (SV = 3155, B S V - 160) UCI Adult (SV = 11698, B S V - 10605) VPDT GVPDT VPDT GVPDT PEs it. time SPr effr it. time sPr effr it. time spr effr it. time spr effr IBM SP4 17 224.2 6 825.4 6 782.4 45 255.3 1 2 6 367.2 2.2 1.1 6 345.2 2.3 1.1 46 169.9 1.5 0.8 18 131.0 1.7 0.9 4 6 180.1 4.6 1.1 6 167.7 4.7 1.2 48 116.4 2.2 0.5 18 80.9 2.8 0.7 6 118.9 6.9 0.9 6 114.2 6.9 0.9 47 60.4 4.2 0.5 18 42.2 5.3 0.7 8 82.2 10.0 0.6 6 75.9 10.3 0.6 46 42.3 6.0 0.4 19 26.2 8.6 0.5 16 6 SVMlight: 4407 it. 278.7 sec. S V M light" 9226 it. 901.6 sec. 1
2 4 8 16
6 1373.6 6 657.4 2.1 6 349.9 3.9 6 192.6 7.1 6 114.5 12.0 SVMlight:
1.0 1.0 0.9 0.7 9301
6 6 6 6 6 it.
IBM Linux Cluster 1229.1 45 613.9 611.2 2.0 1.0 45 349.4 1.8 316.2 3.9 1.0 47 230.7 2.7 170.9 7.2 0.9 47 118.3 5.2 101.3 12.1 0.8 46 73.4 8.4 1017.5 sec. SVMZight:
17 525.8 0.9 18 277.5 1.9 1.0 0.7 17 167.8 3.1 0.8 0.7 18 79.4 6.6 0.8 0.5 18 48.4 10.9 0.7 4515 it. 437.9 sec.
but also implies significant improvements to the scalability of the decomposition technique. As a final remark, note that when the number of PEs increases the workload of each processor to solve the inner subproblems decreases, hence the communication time becomes more and more comparable with the computational time. This explains the suboptimal speedup and efficiency shown for 16 PEs. Of course, for larger problems full efficiency can be recovered.
5. C O N C L U S I O N S We presented new developments of the parallel decomposition technique for SVMs training recently introduced in [12]. The improvements concerned the computation of a good starting point for the iterative decomposition technique and the introduction of an efficient parallel solver for the QP subproblem of each iteration. A good starting point is obtained by a preprocessing step in which independent SVMs are trained on disjoint subsets of the original training set. This step is very well suited for parallel implementations and yields significant benefits in the case where the decomposition technique of [12] exhibits unsatisfactory convergence rate. For a parallel solution of the QP subproblem at each decomposition iteration, we proposed a generalized variable projection method based on an adaptive steplength selection. Since this solver has the same cost per iteration and a better convergence rate than the variable projection method used in [ 12], a remarkable time reduction is observed in the subproblems solution. The computational experiments on well known large-scale test problems showed that the new decomposition approach, based on the above improvements, outperforms the technique in [12] both in serial and parallel environments and can further achieve a better scalability.
266 REFERENCES
[1]
J. Barzilai, J.M. Borwein (1988), Two Point Step Size Gradient Methods, IMA J. Numer. Anal 8, 141-148.
[2] [3] [4] [5] [6]
[7] [8] [9] [10]
[ 11] [12]
D.E Bertsekas (1999), Nonlinear Programming, Athena Scientific, Belmont, MA. C.C. Chang, C.J. Lin (2002), LIBSVM: a Library for Support Vector Machines, available at http ://www.csie.ntu.edu.tw/~cj lin/libsvm. R. Collobert, S. Benjo (2001), SVMTorch: Support Vector Machines for Large-Scale Regression Problems, Journal of Machine Learning Research 1, 143-160. T. Joachims (1998), Making Large-Scale SVM Learning Practical, Advances in Kernel Methods, B. Sch61kopf et al., eds., MIT Press, Cambridge, MA. C.J. Lin (2001), On the Convergence of the Decomposition Method for Support Vector Machines, IEEE Transactions on Neural Networks 12(6), 1288-1298. J.C. Platt (1998), Fast Training of Support Vector Machines using Sequential Minimal Optimization, Advances in Kernel Methods, B. Sch61kopf et al., eds., MIT Press, Cambridge, MA. V. Ruggiero, L. Zanni (2000), A Modified Projection Algorithm for Large Strictly Convex Quadratic Programs, J. Optim. Theory AppL 104(2), 281-299. V. Ruggiero, L. Zanni (2000), Variable Projection Methods for Large Convex Quadratic Programs, Recent Trends in NumericalAnalysis, D. Trigiante, ed., Advances in the Theory of Computational Mathematics 3, Nova Science Publ., 299-313. T. Serafini, G. Zanghirati, L. Zanni (2003), Gradient Projection Methods for Large Quadratic Programs and Applications in Training Support Vector Machines, Technical Report 48, Dept. of Mathematics, University of Modena and Reggio Emilia. V.N. Vapnik (1998), Statistical Learning Theory, John Wiley and Sons, New York. G. Zanghirati, L. Zanni (2003), A Parallel Solver for Large Quadratic Programs in Training Support Vector Machines, Parallel Computing 29(4), 535-551.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
267
Fast parallel solvers for fourth-order boundary value problems M. Jung ~ ~Institut ftir Wissenschaftliches Rechnen, Technische Universit~t Dresden D-01062 Dresden, Germany Let us consider the first biharmonic boundary value problem in mixed weak formulation. The finite element discretization of this problem leads to a system of linear algebraic equations with a symmetric indefinite matrix. We shall discuss three possibilities for solving this system of equations efficiently, namely the preconditioned conjugate gradient method for a corresponding Schur complement system, a conjugate gradient method of Bramble-Pasciak type and a multigrid method. Furthermore, we shall describe the implementation of these solvers on a parallel computer with MIMD architecture. The numerical experiments presented show that these solution methods can be parallelized very efficiently. 1. M O D E L P R O B L E M A N D F I N I T E E L E M E N T D I S C R E T I Z A T I O N
The deformation of plates and shells in structural mechanics can be described by boundary value problems of fourth order. Let us consider, as model problem, the first biharmonic problem: Find the function u C C4(f~) ~ C ~(~) that satisfies
A2~-f
inf~
and
0?z
u-~-O
Introducing the new variable w problem (1):
=
on 0 9 .
(1)
Au, one gets the following mixed variational formulation for
Find ~ E 120 = {~z E H1(9) : ~z = 0 on 0f~} and w C 12 = Hl(f~) that satisfy
+ /~V TuV~dx
-
0
V~c12,
(2)
f v TWVV dx
=
-~fvdx
VvE12o.
Problem (2) is now discretized by means of the finite element method. We construct a sequence of nested triangulations Tq, q = t, 2 , . . . , 5, with mesh-size parameters hq (hq_l = 2hq) of the domain f~ and define the corresponding finite element subspaces
{
i=1
}
and
{ mq {=I
}
spanned by piecewise linear functions Pq,i (7Zq = dimVq, mq = dimV0,q, nq > mq). This dis-
268 cretization leads to the sequence of systems of finite element equations: Find U_q E R mq and ~
/
(Mqj~q J~:)O
E R nq that satisfy
W__q):~
fq
(3)
with
Mq --
Pq,jPq,i dx
.Bq --
V ypq,j ~TPq,i dx
i,j= l
,
1
'L -
-
f pq,i dx
, g__q- o.
i=l,j=l i=1 We are interested in the solution of the system of equations on the finest mesh Te. The matrix in the system of equations (3) is symmetric but indefinite. Therefore, special effort must be spent for the construction of efficient solution methods. In the following section we shall give a short overview on fast solvers and discuss, in more detail, their application to the system of equations (3). 2. FAST SOLVERS F O R SYSTEMS OF FINITE E L E M E N T EQUATIONS
2.1. General remarks on preconditioned conjugate gradient and multigrid methods Let us consider a system of linear algebraic equations Aex_e - be. In the preconditioned conjugate gradient (pcg) method we have to perform the following steps:
Iteration: j = 1, 2,...
Initial step: Determine an initial guess x~~ e.g. x~~ -- 0
r_~~
C~w~~
=
=
Aes_~J-1)
T(J)
:
o-(J-1)/(s~J-1) a_~j-l))
r__ij)
=
r__~J-1)--T(J)a_ij-i)
C~w?)
=
~i j)
~(J)
=
(w~J)~2))
or(J)
<
x2or(~ ~ STOP
/~(J)
=
o-(J)/o-(J -1)
= ~_?
s~~ ~(o)
b_e- Aex~ ~
a~j- ~)
w? =
(w~O/~o//
si~/ : w?+g/~/s~ ~-1) If Ae and Ce are symmetric, positive definite matrices, the estimate Ix__e- x_~J)lA~ <_ fly IIx__e_X~~ ~ of the error holds, where ]l-I den~ the Ae-energy norm, ~Tj - 2~J/( 1 + ~o2J), ~o (1 - ~~ + ~o.5), and ~ = 1/n(CelAe) with the condition number n(CelAe) of the matrix
CelAe (see, e.g., [8]). Thus, in order to reduce the initial error I1~ - ~~ (0 < e < 1) in the Ae-energy norm no more than j - [ln(c -1 + (c -2 + 1)~ are required. ([x] denotes the smallest integer greater than or equal to x.)
by the factor c iterations
269 From the algorithm of the pcg method and its convergence properties we get the following requirements on the preconditioning matrix Ce. 9 The condition number ec(C-j-lAe) should be close to 1. This leads to a fast convergence. 9 Systems of equations with the matrix Ce should be solvable with an arithmetical cost proportional to the number of unknowns, i.e. with the same cost as a matrix by vector multiplication. 9 The solver for systems of equations of the type Cew__e = r_e should run in parallel efficiently. Another fast solution method is the multigrid method. This method is defined recursively. Let us suppose that we have a sequence of finite element discretizations of the considered problem. Let us start the description of multigrid methods with the explanation of one iteration of a ((, ( - 1)-two-grid method. It consists of the following steps:
"-d~J)-b_e_ - Aex~ r e j) stricti~
x~j) pre-smoothing ~ j )_,
" ]Ae-lY_~(J)I=d~J)_ - -1
post-smoothing
I interpolation
Here, x_~o) and ~o+a) denote the initial guess and the improved iterate, respectively. Pre- and post-smoothing means the application of a few iterations of an appropriately chosen "simple" iterative method (for more details see Subsection 2.4). The system of equations in l [has the same structure as the system of equations Ae~e = be. Thus, we apply to this system one or two iterations of a ( ( - 1, ( - 2)-two-grid method with the zero vector as initial guess. This idea is continued until we have to solve in [ _ _ ] a system of equations on the coarsest mesh. This can be done by a direct method or an iterative method. In this way we get an g-grid method. Under some assumptions on the boundary value problem considered and on the choice of the smoothing procedure, restriction and interpolation one can show that the number of arithmetical operations for getting ]lZ_e- xfJ)llA e < ellz_e - z__f~ e is of the order O(ne lnr where ne is the number of nodes in the finest triangulation (see, e.g., [7, 14]). We can use multigrid methods to compute an approximate solution of the preconditioning system Ce_~ = r e in the pcg method. This leads to an implicitly defined multigrid preconditioner which fulfils all requirements formulated above, see [ 10, 11 ]. Let us now discuss the application of the pcg method and multigrid methods to solve the system of equations (3)
2.2. Conjugate gradient method applied to the Schur complement system By eliminating w__e in system (3) we get the reduced system:
Be M e 1B[Lze - - f e "
(4)
The condition number of the matrix B e M [ 1 B ] is of the order O( h e-4 ) (he is the discretization parameter), see [2, 12]. Therefore, for getting an efficient iterative solver for system (4) one needs an appropriate preconditioner. We define preconditioning matrices Ce = h -e2 B 2e,0 or Ce = B e , o M [ 1 Be, o, where
Be,o -
V pe,jVpe,~ dx
and i,j=l
Me,o -
Pe,jPe,~dx
. i,j=l
(5)
270 Then, one can prove that the condition number i~,(C~-lBe M ? I B ; ) is of the order O(he 1) (see, e.g., [2, 12]). The application of these preconditioners Ce in the pcg method requires to perform the following three steps: 1. Solve Be,o~ = r_e. / 2. Compute [e - h~?--eor ~ - Me,o?__e./ 3. Solve Be,oW_.e= r-.e. In steps 1 and 3 we have to solve a discrete Poisson equation. These can be solved approximately by means of a multigrid method. This can be interpreted as an implicitly defined preconditioner Ce for which ec(CelBeMelB~) - O(he 1) holds. Consequently, the number of iterations of the pcg method is of the order O(he ~ in e -1) (e a prescribed accuracy), whereas the number of iterations of the cg method without preconditioning is of the order (.9(he 2 in e -1). For the matrix-vector multiplication a__e = BeM[~B-[s_e we have to perform the three steps: 1. Compute -se - B[s--e. / 2. Solve Meg_e = ~ . / 3. Compute a__e - Beg__e. The condition number of the matrix Me is independent of the discretization parameter he. Consequently, the system of equations Meg_e - g--ecan be solved with an arithmetical cost of the order O(ne in c~ 1) by a "simple" iterative method, e.g. the Chebyshev iteration procedure. The accuracy CM should be at least of the order of the discretization error.
2.3. Conjugate gradient method of Bramble-Pasciak type The conjugate gradient method of Bramble-Pasciak type [4, 13] can be used for solving the system of equations (3). We transform this system of equations into the system AeV_e = b_e
with
Me-
(
cl -1
e,M - 1
0 o
'
(6)
where ~, and 5 are two appropriate scaling factors. If we suppose that the following spectral inequalities ~Ce,M < - 1 B ey < -- Me < -- -~Ce,M and/3Ce __ ,B < -- Be C g,M -- ~Ce ,B hold and that the matrix Me - 7Ce,M is positive definite, one can prove that the matrix ,4e is positive definite and symmetric with respect to the inner product (.,.) defined by (2g, y ) -- ( ( m e - ' f f C e , M ) ~ l , Yl ) -~- ( ( ~ - l C e , B ~ 2 , Y2 ) ,
(7)
where z__= (x___l, x__2) , y ~-- (Y--.1,Y--.2) ' x---l, Y-Y-1~ / ~ n e x___2, Y--2 ~ f~mg. F u r t h e r m o r e , the estimate of the condition number ~(Ae) _< 4(~/7)(max{~, 5/3}/min{7_, 5/3}) holds, see [131. Therefore, we can solve the system of equations AeEe = b_e by means of the conjugate gradient method in the special inner product (7). We choose, as preconditioners, the matrices Ce ,B = h-2B 2 g g,0 or Ce,B = Be,oM[~Be,o and Ce,M -- diag(Me) or Ce,M implicitly defined by the application of a Chebyshev iteration. The matrices Be,o and Me,0 are defined in (5). This choice of the matrices Ce,M and Ce,B makes sure that the condition number n(Ae) is of the order (.9(h~-1), i.e. the proposed cg method of Bramble-Pasciak type applied to the system of equations (3) has the same convergence properties as the pcg method applied to the system of equations (4). In each iteration we have to, again, solve two discrete Poisson equations which can be done approximately by means of multigrid methods.
2.4. Multigrid method for the mixed problem Let us now discuss the application of multigrid methods to the system of equations (3). As smoothing procedure let us choose the following iterative process. Let w~j) and u~j) be given
271
initial iterates. Then the new iterates w~O+l) and _u~0+1) will be computed as follows:
G,M(~_~_~j+l)- ~U~j))
-- ~-
Mew_~__~ j)- B[u__.~j)
G,s(U'*~J-+-l) - ~--~J)) -- Be~u)(J+l) - L Ce , M(w(J+I) _~
,/~j~j+l) )
-
_~[(~(J+
(8)
1) __ _,/t~J)),
, < Me < (l+c~)Ce,M and Be C-1 whereCeM e,M1~[ < Cg ,s (see, e.q., [1 , 3, 15, 16]). Let us choose Ce,M -- 3Mdiag(Me) and Ce,s - he25sI with appropriate scaling parameters 5M and 5s. As interpolation procedure we shall use linear interpolation corresponding to the linear finite element ansatz functions. The restriction operator is the adjoint operator of the interpolation operator. The system of equations on the coarsest grid is solved by means of the pcg method for the Schur complement system (see Subsection 2.2). 3. P A R A L L E L I Z A T I O N In this section we shall describe, briefly, the implementation of the solvers on parallel computers with MIMD architecture. We suppose that the domain f~ is decomposed into p nonoverlapping sub-domains f~i. Let us introduce (ne,i x ne) Boolean matrices He,i which map some global vector v e E R ue of nodal variables into the sub-domain vector _V_e,iE R ne,~ of variables associated with the sub-domain f~i only. Then, the stiffness matrices Ae and the load vectors b_e_ of the system of finite element equations Ae_~ - b__ecan be represented as Ae - ~-~v=, H~,~Ae,~He,i and b_e - ~iV=l H~,/b__e,/ with the sub-domain stiffness matrices Ae,i and the sub-domain load vectors b_e,~. The matrix Ae,i and the vector b_e,~ are stored on processor P~. Now we introduce two types of distribution of vectors to the processors Pi (see, e.g., [6, 9]), namely vectors of overlapping type and vectors of adding type. A vector v e is said to be of overlapping type if v__e is stored on processor P~ as v__e ~ = He,~v_e. A vector b_e of adding type is stored on processor/9/ as be,i such that b_e = E P 1 H~,ib__g,i. Using this data structure one needs communication between the processors only for the computation of the scalar products and in the preconditioning step of the pcg algorithm, where a vector of adding type must be converted into a vector of overlapping type. In the pcg algorithm in Subsection 2.1 vectors of overlapping type are denoted by boldface letters. The other vectors are of adding type. In the multigrid method we need communication in the smoothing procedure and in the solver for the system of equations on the coarsest mesh (for more details see, e.g., [5, 10]). 4. N U M E R I C A L RESULTS Let us consider problem (1) in the domain f~ = (0, 1) x (0, 1) with the right-hand side f - 1. The corresponding system of finite element equations (3) is solved by means of the algorithms described in Section 2. Hereby, the ingredients of these solvers are chosen in the following way:
9 pcg algorithm for the Schur complement system (see Subsection 2.2): We have used the preconditioner he2B~,o, where Be,o is defined by means of a multigrid V-Cycle with two pre- and two post-smoothing steps of a lexicographical Gauss-Seidel method (see [9]). Systems of the type Mew__e = d e are solved by means of the Chebyshev method with a relative accuracy of c - 0.000001.
272
cg of Bramble-Pasciak type (see Subsection 2.3): We have used the preconditioner Ce,B = h e- 2 Be,2 o, where Be,o is defined as in the Schur complement algorithm. Ce,M is defined implicitly by a Chebyshev iteration with a relative accuracy of e = 0.01. We have chosen 7 = 0.975 and 6 = 1.8he 9 Multigrid method (see Subsection 2.4): We have used a multigrid F-cycle with six preand six post-smoothing steps of the smoother (8), where Ce,M = 2diag(Me) and Cg,s = 0.434h~-2I. As solver for the system of equations on the coarsest grid, the pcg method applied to the Schur complement system, is applied.
9
All three solvers are stopped when a relative accuracy of 10 . 6 is reached. Hereby, the Euclidian norm of the defect vector is measured. In tables 1 and 2, the numbers of iterations and runtimes needed for solving the system of finite element equations (3) are given. Furthermore, we give the parallel efficiency Eppar = T(1) and the efficiency E(pl,p2) which is defined by ' pT(p)
E(pl, P2) - pl T(pl) N(p2) where N(p) is the total number of nodes on the finest mesh and p2 T(p2) U(pl)' T(p) denotes the time for one iteration on the finest mesh using p processors. All computations were performed on a CRAY T3E. 5. CONCLUSIONS The number of iterations of the Schur complement pcg method and of the cg method of Bramble-Pasciak type grow with a factor of about v ~ which confirms the theoretical result given in Subsections 2.2 and 2.3. The number of iterations of the multigrid method is nearly constant. If we use the parallelization strategy proposed in [5, 6, 9] and if the problem size is large enough, then all methods have a very good parallel efficiency. For large problems the multigrid algorithm is the fastest solver.
REFERENCES
[1]
[2] [3]
[4] [5] [6] [7]
R.E. Bank, B. D. Welfert, and H. Yserentant. A class of iterative methods for solving saddle point problems. Numer. Math, 56:645-666, 1990. D. Braess and P. Peisker. On the numerical solution of the biharmonic equation and the role of squaring matrices for preconditioning. IMA J. Numer. Anal., 6:393--404, 1986. D. Braess and R. Sarazin. An efficient smoother for the Stokes problem. Appl. Numer. Math., 23:3-19, 1997. J. H. Bramble and J. E. Pasciak. A preconditioning technique for indefinite systems resulting from mixed approximations of elliptic problems. Math. Comput., 50(181): 1-17, 1988. G. Haase. Parallelisierung numerischer Algorithmen fiir partielle Differentialgleichungen. B.G. Teubner Stuttgart-Leipzig, 1999. G. Haase, U. Langer, and A. Meyer. The approximate Dirichlet domain decomposition method. Part I: An algebraic approach. Part II: Applications to 2nd-order elliptic boundary value problems. Computing, 47:137-151 (Part I), 153-167 (Part II), 1991. W. Hackbusch. Multi-Grid Methods and Applications, volume 4 of Springer Series in Computational Mathematics. Springer-Verlag, Berlin, 1985.
273 Table 1 Times and iteration numbers needed for the pcg algorithms pcg applied to the Schur complement system .
! processor
4 processors
#nodes
#it
time [sec]
#it
time [sec]
545
25
0.33
25
0.44
16 processors
E4,par
#it
time [sec]
ml6,par
0.19
25
0.70
0.03
2113
33
2.53
31
1.09
0.58
33
1.22
0.13
8321
40
14.47
41
4.50
0.80
42
2.41
0.38
33025
55
83.95
54
22.95
0.91
55
7.60
0.60
131585
75
475.16
74
124.74
0.95
0.87
104
723.82
525313 2099201
E(1, 4) : 0.91
74
34.03
112
194.09
154
1075.87
E(4, 16) = 0.99
cg of Bramble-Pasciak type 1 processor #nodes
4 processors
#it
time [sec]
#it
time [sec]
16 processors
E4,par
#it
time [sec]
E16,par
545
26
0.30
26
0.49
0.15
26
0.99
0.02
2113
34
1.93
33
1.04
0.46
32
1.35
0.09
8321
48
12.78
49
4.25
0.75
48
2.56
0.31
33025
72
80.94
72
23.24
0.87
70
7.68
0.66
103
132.87
106
36.89
150
197.30
131585 525313
E(1, 4) -- 0.87
E(4, !6) - 0.98
Table 2 Times and iteration numbers needed for the multigrid method 1 processor
4 processors
16 processors
#nodes
#it
time [sec]
#it
time [sec]
E4,pa r
#it
time [sec]
E16,par
545
23
0.71
23
0.81
0.22
23
1.18
0.04
2113
26
4.92
26
2.59
0.47
26
3.07
0.10
8321
28
26.83
28
8.56
0.78
28
6.42
0.26
33025
30
129.82
30
34.17
0.95
30
15.27
0.53
131585
31
572.19
31
142.96
0.99
31
43.93
0.81
33
624.86
34
169.27
525313
E(1, 4) = 0.97
E(4' 16) = 0.92
274 [8] [9] [10]
[11 ]
[12] [ 13]
[ 14]
[15] [16]
W. Hackbusch. Iterative L6sung grofler schwachbesetzter Gleichungssysteme. Teubner Studienbficher Mathematik. Teubner-Verlag, Stuttgart, 1991. M. Jung. On the parallelization of multi-grid methods using a non-overlapping domain decomposition data structure. AppL Numer. Math., 23(1): 119-137, 1997. M. Jung. Einige Klassen paralleler iterativer Aufl6sungsverfahren. Habilitationssthrift, Technische Universit~it Chemnitz, Fakult~it f'tir Mathematik, 1999. (also available as Preprint SFB393/99-11, Technische Universit~it Chemnitz, Sonderforschungsbereich 393). M. Jung, U. Langer, A. Meyer, W. Queck, and M. Schneider. Multigrid preconditioners and their applications. In G. Telschow, editor, Third Multigrid Seminar, Biesenthal 1988, pages 11-52, Berlin, 1989. Karl-Weierstral3-Institut. Report R-MATH-03/89. U. Langer. Zur numerischen L6sung des ersten biharmonischen Randwertproblems. Numer. Math., 50:291-310, 1987. A. Meyer and T. Steidten. Improvements and experiments on the Bramble-Pasciak type CG for mixed problems in elasticity. Preprint SFB393/01-13, Technische Universit~it Chemnitz, Sonderforschungsbereich 393,2001. K. Stfiben and U. Trottenberg. Multigrid methods: Fundamental algorithms, model problem analysis and applications. In W. Hackbusch and U. Trottenberg, editors, Multigrid Methods, Proceedings of the Conference held at K61n-Porz, November 23-27, 1981, volume 960 of Lecture Notes in Mathematics, pages 1-176, Berlin-Heidelberg-New York, 1982. Springer-Verlag. W. Zulehner. A class of smoothers for saddle point problems. Computing, 65:227-246, 2000. W. Zulehner. Analysis for iterative methods for saddle point problems: A unified approach. Math. Comput., 71(238):479-505, 2001.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
275
Parallel S o l u t i o n o f Sparse E i g e n p r o b l e m s b y S i m u l t a n e o u s R a y l e i g h Q u o t i e n t Optimization with FSAI preconditioning L. Bergamaschi a, ,/~. Martinez% and G. Pini ~ ~Dipartimento di Metodi e Modelli Matematici per le Scienze Applicate Universit~ di Padova, Via Belzoni, 7, 3 5131 Padova (Italy) e - ma i 1 {berga, acalomar, pini }@dmsa.unipd.it A parallel algorithm based on the S-dimensional minimization of the Rayleigh quotient is proposed to evaluate the s ~ S/2 leftmost eigenpairs of the generalized symmetric positive definite eigenproblem. The minimization is performed via a conjugate gradient-like procedure accelerated by a factorized approximate inverse preconditioner (FSAI). The resulting code attains a high level of parallel efficiency and reveals comparable with the PARPACK package on a set of large matrices. 1. INTRODUCTION The computation of a number of the smallest eigenpairs of the problem A x = )~Bx
(1)
where A and B are large, sparse, matrices, is an important task in many scientific applications. Among the most promising approaches for the important class of symmetric positive definite (SPD) matrices are the implicitly restarted Arnoldi method (equivalent to the Lanczos technique for this type of matrices) [8], the recently proposed Jacobi-Davidson (JD) algorithm [11, 3] and schemes based on preconditioned Conjugate Gradient (CG) minimization of the Rayleigh quotient (RQ) [2]. In this paper, we propose a parallel algorithm based on the simultaneous optimization of the RQ by a preconditioned CG method. A similar procedure, SRQMCG, has been first proposed, in a sequential environment, in [ 10], where the CG is preconditioned by the inverse of the incomplete Cholesky decomposition of A (called IC(0)), yielding a very effective method. However this algorithm is not easily parallelizable since the application of the IC-type preconditioners is intrinsically sequential, and the attempts to parallelize their construction and their use have not been completely successful. Recently, a class of preconditioners, known as the approximate inverse preconditioners, have been extensively studied by many authors. We quote among others the FSAI (Factorized Sparse Approximate Inverse) preconditioner proposed in [7], the AINV preconditioner described in [1] and the SPAI preconditioner [4]. These preconditioners explicitly compute an approximation to A -1 and their application needs only matrix vector products, which are more effectively parallelized than solving two triangular systems, as in the IC preconditioner. Both FSAI and AINV compute the approximation to A -1 in factorized form. In its current formulation AINV
276 offers limited opportunity for parallelization of the preconditioner construction phase, while FSAI can be efficiently constructed in parallel. We parallelized the construction of the FSAI preconditioner in the symmetric case. The FSAI preconditioner was generated by either imposing the same sparsity pattern as the matrix A or that of A 2 ("enlarged" FSAI), also allowing the possibility of"postfiltration", that is dropping out all the elements in the preconditioning factor, whose absolute value is below a fixed threshold. In this paper we show the validity of the parallel SRQMCG algorithm with respect to the PARPACK package [8]. We give numerical evidence that the FSAI preconditioner is comparable to IC(0) for preconditioning the sequential SRQMCG algorithm. We propose an efficient parallelization of the construction of FSAI preconditioner, in which we allow the processors to store only their local part of matrices A, B and the preconditioner factors. The parallel preconditioned SRQMCG has been used in the computation of 10 eigenpairs of a number of large sparse matrices arising from the discretization of the diffusion equation. Numerical results obtained on an SP4 supercomputer show the good parallel performance of the algorithm together with its satisfactory scalability level. preconditioner 2. SIMULTANEOUS OPTIMIZATION OF THE RAYLEIGH QUOTIENT Following [ 10], we sketch the lines of the simultaneous RQ optimization algorithm for the evaluation of the leftmost eigenpairs of (1). Let us denote the s smallest eigenvalues as A1 _< A2 < ... < As and the corresponding xTAx
eigenvectors as vl, v 2 , . . . , vs. The idea is to minimize the RQ: r ( x ) = x T B x of size S > s by the following preconditioned CG-like procedure.
.
m a subspace
SRQMCG ALGORITHM: 1. Start with an initial approximation X (~ - ( x ~ ~ . . . , x(s~ s. t . X ( ~ Compute the preconditioning factor Z. Set k = 0, /3j = 0, j = 1,... S.
(~ = I.
2. REPEAT UNTIL CONVERGENCE 2.1
G (k) = A X (k) - B X (k) diag ( X ( k ) T A x ( k ) )
2.2
(k)Trz, T,v (k) gj [. ~ g j . IF k > 0 THEN flj = (k-l) T T ,.,(k-l)' gj Z Z1t j
2.3
j=I,...S
2.4
p(k) = Z T ZG(k) + p(k-~)diag(fl~,...,/3s); B-orthogonalize the jth column of p(k) against x~k),..., X~l, j -- 2 , . . . , S. (k) X (k+l) = X (k) + P(k)diag (c~1,..., C~s), with c~j minimizing r(x~ k) + c~jpj )
2.5
B-orthonormalization of the columns of X (k) by the Gram-Schmidt algorithm.
2.6
k=k+l
The first s columns of X (k) are the approximations of Vl,... , Vs, while the first s diagonal entries of X ( k ) T A x (k) approximate A1, A2,..., As.
277 In order to make the algorithm converge practically, after a fixed number of iterations a reset operation should be performed. Steps 2.1, 2.2 and 2.3 are thus substituted by a Ritz projection step which consists in solving the initial eigenproblem projected onto the subspace span (X (k)) giving a new B-orthonormal approximation X (k). Then the new search direction p(k) is computed as p(k) = ZT ZG(k). In our numerical tests we perform the Ritz step every 20 iterations. 3. C O N S T R U C T I O N OF AN FSAI P R E C O N D I T I O N E R
Given a symmetric positive definite matrix A and its Cholesky factorization A - LALrA, the FSAI technique gives an approximate inverse of A in the factorized form
H = Z TZ,
(2)
where Z is a sparse nonsingular lower triangular matrix approximating LA 1. To construct Z one must first prescribe a selected sparsity pattern Sc c_ { (i, j) : 1 <_ i r j <_ n}, such that {(i,j) : i < j} c_ SL, then a lower triangular matrix Z is computed by solving the equations ^
(2A)~j = 6~j,
(i, j) r SL.
(3)
The diagonal entries of 2 are all positive. Defining D = [diag(Z)] -1/2 and setting Z = DZ, the preconditioned matrix Z A Z T is SPD and has diagonal entries all equal to 1. A common choice for the sparsity pattern is to allow nonzeros in Z only in positions corresponding to nonzeros in the lower triangular part of A. A slightly more sophisticated and more costly choice is to consider the sparsity pattern of the lower triangle of A k ("enlarged FSAI") where k is a small positive integer, e.g., k = 2 or k = 3; see [5]. In most cases, however, using the pattern of A 2 results in a preconditioning factor that is too dense, which slows down the iterative process. A partial cure to this inefficiency, is represented by postfiltration, which results in dropping out all the elements whose absolute value is below a fixed tolerance (for details see [6]). In the procedure described by (3) the matrix Z is obtained by rows: each row requires the solution of a small SPD dense linear system of size equal to the number of nonzeros allowed in that row. The rows of 2 can be computed independently of each other. In the context of a parallel environment and assuming a row-wise distribution of matrix A (that is, with complete rows assigned to different processors), each processor can independently compute its part of the preconditioner factor Z provided it has access to a (small) number of non local rows of A. Details of the parallel construction of the FSAI preconditioner are given in the next section. 4. PARALLEL I M P L E M E N T A T I O N A parallel preconditioned SRQMCG algorithm was carefully implemented in Fortran 90, exploiting the MPI library for exchanging data among the processors. In the code the FSAI preconditioner is computed in parallel, and each processor holds only its local part of all matrices needed in the algorithm, A, t3, Z and Z T, which allows us to handle very large problems. A block row distribution of these matrices was used, that is, with complete rows assigned to different processors. Each processor stores its local part of these matrices in static data structures in CSR format.
278 As described in the previous section, in the SPD case any row i of matrix Z can be computed independently by solving a small SPD dense linear system of size n~ equal to the number of nonzeros allowed in that row. However, to do so the processor that computes row i must be able to access ni rows of A. On distributed memory systems several of the needed rows may be non local (they might be stored in the local memory of another processor). Therefore a communication algorithm must be devised to provide every processor with the non local rows of A needed to compute its local part of Z. Since the number of non local rows required by each processor is relatively small we chose to temporarily replicate the non local rows on auxiliary data structures. We implemented a routine called get_extra_row to carry out all the row exchanges among the processors, before starting the computation of Z, which proceeds afterwards entirely in parallel. Any processor computes its local rows of Z by solving, for any row, a (small) dense linear system. The dense factorizations are carried out using BLAS3 routines from LAPACK. Once Z is obtained a parallel transposition routine provides every processor with its part of Z T. The SRQMCG algorithm can be straightforwardly decomposed into a number of scalar products, daxpy-like linear combinations of vectors, a v +/3w, and matrix-vector multiplications. We focused on parallelizing these tasks, assuming that the code is to be run on a machine with p identical, powerful processors. Scalar products, v . w, are distributed among the p processors by a classical block mapping. For any vector combination z = a v + / 3 w , our parallelization strategy requires that each processor j computes only the sub-block z(j) = (yv(j) § /~W(j). An efficient matrix-vector product routine was developed, which minimizes the amount of data exchanged among the processors [2].
5. NUMERICAL RESULTS Here we summarize the results of numerical experiments with a set of four sparse matrices arising from the Finite Element (FE), Mixed Finite Element (MFE), and Finite Difference (FD) discretization of the diffusion equation in two and three spatial dimensions. A brief description of the problems we deal with can be found in Table 1. For these matrices we computed the smallest s = 10 eigenpairs of (1) with B - I. In both sequential and parallel computations the dimension S of the subspace X (k) is set to 20. Table 1 Main characteristics o f the sample matrices (n = matrix dimension, elements in the matrix). name type n nnz 152,204 MFE (triangles) 28,600 hyb2d 42,189 602,085 flow3d FE (tetrahedra) 216,000 1,490,400 3d FD fd 60 268,515 3,926,823 FE (tetrahedra) heter
n n z = number o f nonzero )~n / )k l
7.94 4.21 1.50 1.82
• • • •
105 103 103 105
279 5.1. Sequential computations All the sequential numerical experiments are performed on a Compaq DS20 equipped with an alpha-processor "ev6" at 500Mhz, 1.5 GB of core memory, and 8 MB of secondary cache. The CPU times are measured in seconds, and all the codes are written in Fortran 90. We first compare the efficiency of the proposed preconditioner with IC(0) and with the Jacobi preconditioner. We denote as FSAI(A) the FSAI preconditioner with A as the nonzero pattern, while the FSAI with pattern A 2 with filtration is indicated with "enlarged" FSAI. In all the test cases the threshold value was set to 0.05. In Table 2 we report the number of iterations and total CPU time to assess s = 10 eigenpairs with an average relative residual less than 10 -8. More precisely, convergence is achieved when
.am{ - A { 8 i=1
A}k)
mx i I1 < 10-8 -
(4)
-
where with AIk) -- Xi(k)TA -aXi(k) we denote the kth approximation of A~. Table 2 Iteration number and CPU time (in seconds)for the sequential SRQMCG algorithm with various preconditioners. hyb2d
precond IC(0) "enlarged" FSAI FSAI(A) Jacobi
iter 171 273 471 1383
Time 74.08 124.91 174.16 466.18
flow3d
iter 112 129 162 375
Time 135.32 133.80 171.68 327.44
fd 60
iter 136 163 218 494
Time 964.13 1223.40 1409.15 2585.76
heter
iter 111 295 453 1375
Time 1251.38 3250.91 4521.84 9983.45
From Table 2 we see that the FSAI preconditioner is always more efficient than the Jacobi preconditioner, being 1.3 to 3.5 times slower than IC(0). The "enlarged" FSAI preconditioner reduces in all the cases the gap with the Cholesky preconditioner. 5.2. Parallel computations The numerical tests were performed on a SP4 Supercomputer with up to 512 POWER 4 1.3 GHz CPUs with 1088 GB RAM. The current configuration has 48 (virtual) nodes: 32 nodes with 8 processors and 16 GB RAM each, 14 nodes with 16 processors and 32 GB RAM and 2 nodes with 16 processors and 64 GB of RAM each. Each node is connected with 2 interfaces to a dual plane switch. In this section we present the parallel results of our SRQMCG code preconditioned with both FSAI(A) and Jacobi. The parallel code which makes use of the "enlarged" FSAI to accelerate SRQMCG is currently under development. The results concerning up to 8 processors were obtained by running the code on a virtual node of the SP4 machine with 8 processors, in a dedicated mode. We note that in all the runs, IBM/MPI routines were forced to perform communications tasks via the so-called user space protocol without taking advantage of shared memory inside a node.
280 Table 3
Number of iterations, CPU times, Tp, and Speedups, Sp, obtainedfor the computation of the s = 10 lefimost eigenpairs of thefour matrices with FSAIpreconditioning. flow3d
hyb2d 1 2 4 8 16
iter 473 473 473 473 473
Tp 123.21 61.64 36.68 24.15 18.31
Sp 2.0 3.4 5.1 6.7
iter 164 170 168 166 164
Tp 89.04 46.64 28.68 17.97 11.36
fd_60
Sp 1.9 3.1 5.0 7.8
iter 218 218 219 219 216
Tp
Sp
600.50 297.69 166.98 103.52 56.58
2.0 3.6 5.8 10.6
9 heter iter Tp 472 2112.71 462 1096.05 436 592.88 442 367.53 414 176.74
Sp 1.9 3.6 5.8 12.0
Tables 3 and 4 report the results concerning the FSAI and the Jacobi preconditioners, respectively for a number of processors p = 1, 2, 4, 8, 16. Here we report the number of iterations, the elapsed time (in seconds), Tp, and the speedups Sp = Tp/T1 required to evaluate 10 eigenpairs using the exit test (4). From the Tables we observe that the speedups obtained by the two algorithms are quite similar, being higher in the largest test cases. In the h e t e r problem, using the FSAI preconditioner, we find a satisfactory Sp = 12.0 with p = 16. Table 4
Number of iterations, CPU Times, Tp, and Speedups, Sp, obtainedfor the computation of the s = 10 lefimost eigenpairs with diagonalpreconditioning. 1 2 4 8 16
hyb2d iter Tp 1383 281.26 1383 150.60 1383 87.59 1383 59.54 1383 48.59
fd 60
flow3d
Sp 1.9 3.2 4.7 5.8
iter 360 387 382 378 389
. Tp
Sp
144.73 82.36 49.55 33.77 22.85
1.8 2.9 4.3 6.3
iter 627 571 598 604 600
Tp
Sp
1299.46 610.47 2.1 398.34 3.3 228.11 5.7 124.60 10.4
heter iter Tp Sp 1376 4621.85 1252 2215.23 2.1 1242 1287.03 3.6 1375 884.85 5.2 1316 451.19 10.2
5.3. Comparison with PARPACK Now consider PARPACK [9], the parallelized version of ARPACK. The parallelization relies upon a message passing layer (we exploited MPI), and the reverse communication interface demands for a parallel matrix-vector routine together with a parallel linear system solver. We used the same efficient matrix-vector routine for sparse computations which was previously described, together with a parallel PCG solver, preconditioned by FSAI. As in the SRQMCG algorithm we set s = 10 eigenpairs to be computed by PARPACK and S = 20 the size of the Ritz basis. The accuracy in the computation of the eigenpairs is as before ~- = 10 -s. However, the exit test is somewhat different from that of (4) being based on the relative accuracy of Ritz values. The accuracy depends also on the tolerance, T1, required for a suitable norm of the residual in the inner PCG solver. We set ~-1 = 10 -9. In Table 5 it is shown that the relative residuals as computed by (4) are usually larger when using PARPACK (on the average by one order of magnitude). Table 6 provides the number of iterations, the elapsed time Tp and the speedups Sp required to evaluate 10 eigenpairs using PARPACK. Comparing Table 6 with Table 3 we conclude that
281 Table 5
Residuals computed by (4)for the two algorithms. SRQMCG PARPACK
hyb2d 5.703 x 10 -9 4.436 x 10 -s
flow3d 6.487 • 10 -9 2.991 • 10 -9
fd 60 5.221 • 10 -9 1.059 • 10 -7
heter 7.518 • 10 -9 1.044 • 10 -8
the SRQMCG with p = 1 performs better in problems f l o w 3 d and he ter, being a little slower in the other two. The speedups are higher when using the SRQMCG code (on three problems out of four). This fact can be explained by observing that one of the major burdens of the proposed algorithm is orthogonalization while PARPACK relies more on matrix-vector products. It is worth noting that orthogonalization, being implemented as a number of scalar product and d a x p y operations, is almost "perfectly" parallelizable. Table 6
Number of iterations, CPU times, Tp, and Speedups, Sp, obtained for the computation of the s = 10 lefimost eigenpairs of the four matrices with PARPACK and FSAIpreconditioning. hyb2d 1
2 4 8 16
iter 36 36 36 36 36
flow3d
Tp
Sp
110.36 57.87 32.24 18.90 11.48
1.9 3.4 5.8 9.6
iter 47 47 47 47 47
fd 60
Tp
Sp
103.17 56.29 34.51 22.22 14.48
1.8 3.0 4.6 7.1
iter 69 69 69 69 69
heter
Tp
Sp
466.80 262.38 140.28 92.46 58.66
1.8 3.3 5.0 8.0
iter 41 41 41 41 41
Tp
Sp
2314.08 1228.68 730.65 507.39 260.14
1.7 3.2 4.6 8.9
6. C O N C L U S I O N S A N D F U T U R E D E V E L O P M E N T S .
The parallel SRQMCG algorithm, with FSAI as the preconditioner, has been shown to be very efficient in the computation of a small number of eigenpairs of large sparse SPD matrices. The major effort of this work has been in the parallelization of the construction of the FSAI preconditioner. The SRQMCG method, preconditioned with FSAI, has proved competitive with respect to the PARPACK package. The speedups obtained are quite satisfactory, especially with the largest sample problem where the speedup is equal to 12 with 16 processors. Future work will address the parallelization of the construction of the "enlarged" FSAI preconditioner and its use in the acceleration of the SRQMCG algorithm. REFERENCES
[1] [2]
[3]
M. Benzi, C. D. Meyer, and M. Tfima. A sparse approximate inverse preconditioner for the conjugate gradient method. SIAMJ. Sci. Comput., 17(5): 1135-1149, 1996. L. Bergamaschi, G. Pini, and F. Sartoretto. Approximate inverse preconditioning in the parallel solution of sparse eigenproblems. Numer. Lin. Alg. Appl., 7(3):99-116, 2000. L. Bergamaschi, G. Pini, and F. Sartoretto. Computational experience with sequential and parallel preconditioned Jacobi Davidson for large sparse symmetric matrices. J. Comput. Phys., 188(1):318-331, 2003.
282 [4]
M.J. Grote and T. Huckle. Parallel preconditioning with sparse approximate inverses. SIAMJ. Sci. Comput., 18(3):838-853, 1997. [5] I.E. Kaporin. New convergence results and preconditioning strategies for the conjugate gradient method. Numer. Lin. Alg. AppL, 1:179-210, 1994. [6] L. Yu. Kolotilina, A. A. Nikishin, and A. Yu. Yeremin. Factorized sparse approximate inverse preconditioning IV. Simple approaches to rising efficiency. Numer. Lin. Alg. Appl., 6:515-531, 1999. [7] L. Yu. Kolotilina and A. Yu. Yeremin. Factorized sparse approximate inverse preconditioning I. Theory. SIAMJ. Matrix Anal Appl., 14:45-58, 1993. [8] R.B. Lehoucq and D. C. Sorensen. Deflation techniques for an implicit restarted Arnoldi iteration. SlAM J. Matrix Anal Appl., 17(4): 789-821, 1996. [9] K.J. Maschhoff and D. C. Sorensen. A portable implementation of ARPACK for distributed memory parallel architectures. In Proceedings of the Copper Mountain Conference on Iterative Methods, volume 1, April 9-13 1996. [10] F. Sartoretto, G. Pini, and G. Gambolati. Accelerated simultaneous iterations for large finite element eigenproblems. J. Comput. Phys., 81:53-69, 1989. [ 11 ] G.L.G. Sleijpen and H. A. van der Vorst. A Jacobi-Davidson method for linear eigenvalue problems. SIAMJ. Matrix Anal Appl., 17(2):401-425, 1996.
Parallel Computing: SoftwareTechnology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
283
An Accurate and Efficient Selfverifying Solver for Systems with Banded Coefficient Matrix C. H61big a, W. Kr~imerb, and T.A. Diverio c. aUniversidade de Passo Fundo and PPGC-UFRGS, Passo Fundo - RS, Brazil E-mail: [email protected] bUniversity of Wuppertal, Wuppertal, Germany E-mail: [email protected] CInstituto de Inform/ttica and PPGC at UFRGS, Porto Alegre - RS, Brazil E-mail: [email protected] In this paper we discuss a selfverifying solver for systems of linear equations A:c = b with banded matrices A and the future adaptation of the algorithms to cluster computers. We present an implementation of an algorithm to compute efficiently componentwise good enclosures for the solution of a sparse linear system on typical cluster computers. Our implementation works with point as well as interval data (data afflicted with tolerances). The algorithm is implemented using C-XSC library (a C++ class library for extended scientific computing). Our goal is to provide software for validated numerics in high performance environments using C-XSC in connection with the MPICH library. Actually, our solver for linear system with banded matrices runs on two different clusters: ALICE 1 at the University of Wuppertal and L a b T e C 2 at UFRGS. Our preliminary tests with matrix-matrix multiplication show that the C-XSC library needs to be optimized in several ways to be efficient in a high performance environment (up to now the main goal of C-XSC was functionality and portability, not speed). This research is based on a joint research project between German and Brazilian universities (BUGH, UKA, UFRGS and
PUCRS) [5]. 1. I N T R O D U C T I O N For linear systems where the coefficient matrix A has band structure the general algorithm for solving linear systems with dense matrices is not efficient. Since an approximate inverse 1Cluster with 128 Compaq DS10 Workstations, 616 MHz Alpha 21264 processors, 2 MB cache, Myrinet multistage crossbar connectivity, 1 TB disc space and 32 GB memory. 2Cluster with 20 Dual Pentium IIl 1.1 GHz (40 nodes), 1 GB memory RAM, HD SCSI 18 GB and Gigabit Ethemet. Cluster server (front-end) with Dual Pentium IV Xeon 1.8 GHz, 1 GB memoryRAM, HD SCS136 GB and Gigabit Ethemet. LaBTeC Cluster Homapage: http://www.inf.ufrgs.br/LabTeC
284 R is used there this would result in a large overhead of storage and computation time, especially if the bandwidth of A is small compared with its dimension. To reduce this overhead, we will replace the approximate inverse by an approximate LU-decomposition of A which needs memory of the same order of magnitude only as A itself. Then we will have to solve systems with triangular banded matrices (containing point data) in interval arithmetic. This seems to be a trivial task and several methods have been developed using such systems and simple forward and backward substitution in interval arithmetic, see e.g. [8], [ 12]. However, at this point there appears suddenly a very unpleasant effect which makes the computed intervals blow up very rapidly in many cases. This effect is known in literature as wrapping effect (see e.g. [13]) and was recognized first in connection with the verified solution of ordinary initial value problems. However it is a common problem within interval arithmetic and may show up whenever computations in ~ n , n > 1, are performed, e.g. if an interval vector is multiplied repeatedly by matrices or more general if a function f : ~ n ~ ~ n is applied repeatedly to an argument x, x~ -- f ( f ( . . , f ( f ( x ) ) . . . )). It should be noted that the wrapping effect is not related to roundoff errors or any other sources of errors in an algorithm (like truncation errors, discretization errors and the like) but it is introduced solely by interval arithmetic itself (though any additional errors may contribute to an increase of the wrapping effect). 2. THE A L G O R I T H M S The algorithms implemented in our work were described in [9] and can be applied to any system of linear equations which can be stored in the floating point system on the computer. They will, in general, succeed in finding and enclosing a solution or, if they do not succeed, will tell the user so. In the latter case, the user will know that the problem is very ill conditioned or that the matrix A is singular. In the implementation in C-XSC, there is the chance that if the input data contains large numbers or if the inverse of A or the solution itself contain large numbers, an overflow may occur, in which case the algorithms may crash. In practical applications, this has never been observed, however. This could also be avoided by including the floating point exception handling which C-XSC offers for IEEE floating point arithmetic [2]. For this work we implemented interval algorithms for solution of linear systems of equations with dense and sparse matrices. There are numerous methods and algorithms computing approximations to the solution x in floating-point arithmetic. However, usually it is not clear how good these approximations are, or if there exists a unique solution at all. In general, it is not possible to answer these questions with mathematical rigour if only floating-point approximations are used. These problems become especially difficult if the matrix A is ill conditioned. We implemented some algorithms which answer the questions about existence and accuracy automatically once their execution is completed successfully. Even very ill conditioned problems can be solved with these algorithms. Most of the algorithms implemented here can be found in [ 14] and [ 15]. 3. BAND MATRICES
Matrices with band structure and difference equations are closely related. There the difference equations could be rewritten equivalently as a linear system with band matrix which was triangular even. Similarly we can go in the other direction and write a triangular banded matrix as a difference equation.
285 The system al,1 "
Ax =
""
0
am,1
"'.
Xl
bl
x2
b2
"
".
--
"
-- b
(1)
.
0
an,n-m+l
" 9"
an,n
xn
bn
is equivalent with the difference equation ai,i-m+lXi-m+l
-~- " ' "
-J- a i , i x i
=
bi ,
i --
m,...,
(2)
Tt,
of order m - 1 with the starting values x i = (bi -
ai,i-lXi-1
. . . . .
ai,lX
1)/ai,
i ,
i ~- 1,...,
m
-
1 .
(3)
In [9] we saw that the solution of triangular systems by interval forward substitution can result in severe overestimations. Therefore we use the relationship to difference equations and solve triangular banded systems with the method for difference equations which was presented in [9]. For general banded systems we will then apply a LU-decomposition without pivoting to the coefficient matrix A and derive an interval iteration similar to [y]k+~ =
R+(b-
A~:) + + ( I -
RA)[y]k
(4)
Here, however, we will not use a full approximate inverse R but rather the iteration will be performed by solving two systems with banded triangular matrices ( L and U ). A similar approach to banded and sparse linear systems can be found, e.g., in [12]. There, however, the triangular systems were solved by interval forward and backward substitution which often results in gross overestimations as we have seen already. For a different approach to the verified solution of linear systems with large banded or arbitrary sparse coefficient matrix see Rump, [ 15]. The mathematical background for the verified solution of large linear systems with band matrices is exactly the same as it was already in [6] for systems with dense matrices. For dense systems the interval iteration (4) was derived by use of an approximate inverse R of the coefficient matrix A. This is however what we want to avoid for large banded matrices A. Therefore we chose a different approximate inverse, namely R :=
(LU) -~ ~ A -~
(5)
where LU
~ A
(6)
is an approximate LU-decomposition of A without pivoting. Since we do not use pivoting both L and U are banded matrices again, and of course they are lower and upper triangular, resp.
286 The analogue of the iteration (4) now reads in our case
Yk+l = ( L U ) - I ( b - AYe) + ( I - (LU)-IA)yk
(7)
or, multiplying with LU and taking intervals:
LU[y]k+I = ~ ( b - AYe)+ ~ ( L U - A)[y]k.
(8)
Therefore we have to solve two linear systems with triangular, banded coefficient matrices, L and U, in order to compute [y]k+l, i.e. to perform one step of the iteration (8). In each iteration we first compute an enclosure for the solution of
L[z]k+l -
(~(b-
AYc)+ ~ ( L U - A)[y]k
and then [y]k+l from u[y]
+l =
In both systems we do not use just plain interval forward or backward substitution, however, as discussed above, we treat the systems as difference equation and apply the corresponding method. Here again, as in [6], the inclusion test =
c
(9)
has to be checked in the same way and if it is satisfied then the same assertions hold as in the dense case. Remark: If we compute the LU-decomposition with Crout's algorithm, then we can get the matrix ~(LU - A) virtually for free, since the scalar products which are needed here have to be computed in Crout's algorithm anyway.
4. TESTS AND RESULTS A very well known set of ill conditioned test matrices for linear system solvers are the n • n Hilbert matrices Hn with entries (Hn)i,j "= i + j1 - 1 . As a test problem, we report the results of our program for the linear systems Hnx = el, where el is the first canonical unit vector. Thus the solution x is the first column of the inverse H~-I of the Hilbert matrix Hn. We give results for the cases n = 10 and n = 20. Since the elements of these matrices are rational numbers which can not be stored exactly in floating point, we do not solve the given problems directly but rather we multiply the system by the least common multiple lcmn of all denominators in Hn. Then the matrices will have integer entries which makes the problem exactly storable in IEEE floating point arithmetic. For n = 10, we have/cml0 = 232792560 and for n = 20, we have lcm2o = 5342931457063200. For the system (lcmloH~o)X = (lcmloel), the program computes the result showed in (10), which is the exact solution of this ill conditioned system.
287
X2 X3 X4 X5 X6 X7 X8 X9 \Xl(
j
r [ 1.000000000000000E+002, 1.000000000000000E+002] '~ [-4.950000000000000E+003, -4.950000000000000E+003] [ 7.920000000000000E+004, 7.920000000000000E+004] [-6.006000000000000E+005, -6.006000000000000E+005 ] [ 2.522520000000000E+006, 2.522520000000000E+006] [-6.306300000000000E+006, -6.306300000000000E+006] [ 9.609600000000000E+006, 9.609600000000000E+006] [-8.751600000000000E+006, -8.751600000000000E+006] [ 4.375800000000000E+006, 4.375800000000000E+006] \ [-9.237800000000000E+005,-9.237800000000000E+005] 2
(10)
For the system (Icm2oH2o)X = (lcm2oel), the program computes the enclosures (here an obvious short notation for intervals is used) showed in (11), which is an extremely accurate enclosure for the exact solution (the exact solution components are the integers within the computed intervals).
f' Xl '~ X2 X3 X4 375 X6 J;7 X8 Z9 /;1( 1;11 ~1~ /;14
/;17 /;lC~
\ r:c ]
/ [ 3.999999999999999E+002,4.000000000000001E+002] '~ -7.980000000000002E+004, -7.979999999999998E+004] [ 5.266799999999999E+006, 5.266800000000001E+006] [- 1.716099000000001E+008, - 1.716098999999999E+008] [ 3.294910079999999E+009, 3.294910080000001E+009] [-4.118637600000001E+010, -4.118637599999999E+010] [ 3.569485919999999E+011, 3.569485920000001E+011 ] -2.237302782000001E+012, -2.237302781999999E+012] [ 1.044074631599999E+013, 1.044074631600001E+013] [- 3.700664527560001E+013, - 3.700664527559999 E+013 ] [ 1.009272143879999E+014, 1.009272143880001E+014] [-2.133234304110001E+014, -2.133234304109999E+014] [ 3.500692191359999E+014, 3.500692191360001E+014] -4.443186242880001E+014, -4.443186242879999E+014] [ 4.316238064511999E+014, 4.316238064512001E+014] [-3.147256922040001E+014, -3.147256922039999E+014] [ 1.666194841079999E+014, 1.666194841080001E+014] -6.044040109800001E+013, -6.044040109799999E+013] [ 1.343120024399999E+013, 1.343120024400001E+013] \ [-1.378465288200001E+012,-1.378465288199999E+012] 2
(11)
As other example, we compute an enclosure for a very large system. We take the symmetric Toeplitz matrix with five bands having the values 1, 2, 4, 2, 1 and on the right hand side we set all components of b equal to 1. Then the program produces the following output for a system of size n = 200000 (only the first ten and last ten solution components are printed):
288 Dimension
Bandwidths
A
n
= 200000
l,k
= 1 2 4 2 1
change b = =i change
: 2 2
elements
?
(y/n)
elements
?
(y/n)
X
I:
[
1.860146
067479180E-001,
1.860146067479181E-001
[
7.518438 200412189E-002, 1.160876 404875081E-001,
7.518438200412191E-002
9.427129 202687645E-002,
9.427129202687647E-002
]
2:
[
4:
[
6:
[
3: 5:
7: 8: 9:
i0:
[ [
9.037859 550210300E-002,
1.003153
]
1.003153932563722E-001
1.028361 799416204E-001,
1.028361799416205E-001
] ]
]
1.005240450090009E-001
]
[
1.004617 422430963E-001,
1.004617422430964E-001
]
[
9.874921 290539136E-002,
[
199992: 199993:
[ [
199994: 199995:
199996:
199997: 199998: 199999:
200000:
max.
1.160876404875082E-001
1.005240 450090008E-001,
199990:
max. min.
]
[
199991:
max.
932563721E-001,
]
9.037859550210302E-002
rel.
abs. abs. abs.
1 . 0 0 1 953 93 9 3 2 6 1 9 6 E -
9.874921290539138E-002
001,
[
1.004 617422430963E-001,
[
1.028 361799416204E-
[ [
[
[ [
[
9.874 921290539136E-002, 1.005 240450090008E-001,
001,
9.427 129202687645E-002,
1.001953939326197E-001
]
9 .8 7 4 9 2 1 2 9 0 5 3 9 1 3 8 E - 0 0 2 1.005240450090009E-001
] ]
1.004617422430964E-001
1.028361799416205E-001
9 .4 2 7 1 2 9 2 0 2 6 8 7 6 4 7 E - 0 0 2
1.003 153932563721E-001,
1.003153932563722E-001
7.518 438200412189E-002, 9.037 859550210300E-002,
7 .5 1 8 4 3 8 2 0 0 4 1 2 1 9 1 E - 0 0 2 9.037859550210302E-002
1.160 876404875081E-001,
1.860 146067479180E-001,
error
= error = x[3] = [ x[l] = [
]
1.160876404875082E-001
1.860146067479181E-001
i. 8 4 5 8 3 3 8 6 0 4 2 2 4 5 1 E - 0 1 6
2 .7 7 5 5 5 7 5 6 1 5 6 2 8 9 1 E - 0 1 7 7.518438200412189E-002,
i. 8 6 0 1 4 6 0 6 7 4 7 9 1 8 0 E - 0 0 1 ,
]
] ] ] ] ] ] ]
at i = 3 at i = 1 7.518438200412191E-002
1. 8 6 0 1 4 6 0 6 7 4 7 9 1 8 1 E - 0 0 1
]
]
5. INTEGRATION BETWEEN C-XSC AND MPI LIBRARIES As part of our research, we did the integration between C-XSC and MPI libraries on cluster computers. This step is necessary and essential for the adaptation of our solvers to high performance environments. This integration was developed using, with first tests, algorithms for matrix multiplication in parallel environments of cluster computers. Initially, we did some comparations about the time related to the computational gain using parallelization, the parallel program performance depending on the matrix order, and the parallel program performance using a larger number of nodes. We also studied some other information like the memory requirement in each method to verify the performance relation with the execution time and memory. We want to join the high accuracy given by C-XSC with the computational gain provided by parallelization. This parallelization was developed with the tasks division among various nodes
289 on the cluster. These nodes execute the same kind of tasks and the communication between the nodes and between the nodes andthe server uses message passing protocol. Measures and tests were made to compare the routines execution time in C language, C using MPI library, C using C-XSC library and C using C-XSC and MPI libraries. The developed tests show that simple and small changes in the traditional algorithms can provide an important gain in the performance [ 11 ]. We observed the way that the processor pipeline is used, and we notice that it is decisive for the results. Based in these tests, we could also observe that the use of just 16 nodes is enough for this multiplication. In the results obtained with these tests, the execution time of the algorithms using C-XSC library are much larger than the execution time of the algorithms that do not use this library. Even in this tests, it is possible to conclude that the use of high accuracy operations make the program slower. It shows that the C-XSC library need to be optimized to have an efficient use on clusters, and make it possible to obtain high accuracy and high performance in this kind of environment. 6. CONCLUSIONS In our work we provide the development of selfverifying solvers for linear systems of equations with sparse matrices and the integration between C-XSC and MPI libraries on cluster computers. Our preliminary tests with matrix multiplication show that the C-XSC library needs to be optimized to be efficient in a High Performance Environment (up to now the main goal of C-XSC was functionality and portability, not speed). Actually we are working in to finish this integration and in the development of parallel software tools for validated numerics in high performance environments using the C++ class library C-XSC in connection with the MPICH library. ACKNOWLEDGEMENTS We would like to thank the students Bernardo Frederes Kr/imer Alcalde and Paulo S6rgio Morandi JOnior from the Mathematic of Computation and High Performance Computing Group at UFRGS for their help in implementing and testing the C-XSC routines. This work was supported in part by DLR international bureau (BMBF, Germany) and FAPERGS (Brazil). REFERENCES
Albrecht, R., Alefeld, G., Stetter, H. J. (Eds.): Validation Numerics- Theory and Appli, cations. Computing Supplementum 9, Springer-Verlag (1993). [21 American National Standards Institute / Institute of Electrical and Electronics Engineers: A Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std. 754-1985, New York, 1985. [3] Hammer, R., Hocks, M., Kulisch, U., Ratz, D.: C-XSC Toolboxfor Verified Computing I: basic numerical problems. Springer-Verlag, Berlin/Heidelberg/New York, 1995. [4] Hofschuster, W., Kr/imer, W., Wedner, S., Wiethoff, A.: C-XSC 2. O: A C+ + Class Library for Extended Scientific Computing. Universit~it Wuppertal, Preprint BUGHW - WRSWT 2001/1 (2001). [5] H61big, C.A., Diverio, T.A., Claudio, D.M., Kr~imer, W., Bohlender, G.: Automatic [1]
290
[6] [7] [8] [9]
[10]
[ 11 ]
[12]
[13] [14] [15]
Result Verification in the Environment of High Performance Computing In: IMACS/GAMM INTERNATIONAL SYMPOSIUM ON SCIENTIFIC COMPUTING, COMPUTER ARITHMETIC AND VALIDATED NUMERICS, 2002, Paris. Extended abstracts, pg. 54-55 (2002). H61big, C., Kr~imer, W.: Selfverifying Solvers for Dense Systems of Linear Equations Realized in C-XSC. Universit/it Wuppertal, Preprint BUGHW- WRSWT 2003/1, 2003. Klatte, R., Kulisch, U., Lawo, C., Rauch, M.,Wiethoff, A.: C-XSC, A C++ Class Library for Extended Scientific Computing. Springer-Verlag, Berlin/Heidelberg/New York, 1993. Klein, W.: Verified Solution of Linear Systems with Band-Shaped Matrices.DIAMOND Workpaper, Doc. No. 03/3-3/3, January 1987. Kr/imer, W., Kulisch, U., Lohner, R.: Numerical Toolbox for Verified Computing II- Advanced Numerical Problems. University of Karlsruhe (1994), see http ://www. uni-karlsruhe, de/'Rudolf.Lohner/papers/tb 2.ps. gz. Kulisch, U., Miranker, W. L. (Eds.): A New Approach to Scientific Computation. Proceedings of Symposium held at IBM Research Center, Yorktown Heights, N.Y., 1982. Academic Press, New York, 1983. Marquezan, C.C., Contessa, D.F., Alves, R.S., Navaux, P.O.A., Diverio, T.A.: An evolution of simple and efficient optimization techniques for matrix multiplication. In: The 2002 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, PDPTA 2002. Proceedings ..., Las Vegas, 24-27, June, 2002. Las Vegas: CSREA Press (2002). Vol. IV. p. 2029-2034. Miranker, W. L., Toupin, R. A.: Accurate Scientific Computations.Symposium Bad Neuenahr, Germany, 1985.Lecture Notes in Computer Science, No. 235,Springer-Verlag, Berlin, 1986. Neumaier, A.: The Wrapping Effect, Ellipsoid Arithmetic, Stability and Confidence Regions. In [1], pp 175-190, 1993. Rump, S. M.: Solving Algebraic Problems with High Accuracy. Habilitationsschrift. In [ 10], pp 51-120, 1983. Rump, S. M.: Validated Solution of Large Linear Systems. In [1], pp 191-212 (1993).
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
291
3 D p a r a l l e l c a l c u l a t i o n s o f d e n d r i t i c g r o w t h w i t h the lattice B o l t z m a n n m e t h o d w. Miller a, F. PimenteP, I. Rasin ~, and U. Rehse a ~Institut ffir Kristallzfichtung, Max-Born-Str.2, 12489 Berlin, Germany www.ikz-berlin.de
We developed a discrete phase-field model and incorporated the model into an existing lattice Boltzmann code, which computes the heat transport and convection. 3D calculations of dendritic growth were made to test a recently installed Beowulf-style Cluster. 1. I N T R O D U C T I O N During the last years the lattice Boltzmann method (LBM) has been evolved as a powerful numerical tool for flow calculations [ 1, 2], especially regarding its efficiency on parallel platforms [3, 4]. Solid-liquid phase-transitions are often strongly influenced by convection. This was the motivation to develop a lattice phase-field model in the framework of the lattice Boltzmann method [5]. The general outline of the phase-field model will be presented in the next section. As the lattice Boltzmann method is concerned we restrict ourselves to the features, which are of relevance for the parallel computing. More details can be found e.g. in [6, 1, 7]. In the LBM the hydrodynamics is mimicked by a set of quasi-particles on a uniform lattice, which obey a simple kinetic equation of streaming and collision. The collision term is local in space and linear in the non-equilibrium distribution of the quasi-particles. The underlying lattice is shown in Figure 1 (left): it is the projection of a four-dimensional face-centered hypercubic lattice. In total, there are 18 links to next neighbours, along which quasi-particles can move. The additional degree of freedom in the fourth direction is used to solve the transport equation of a scalar [8], in our case the temperature. The limitations of a uniform lattice are counteracted by the simplicity of the evolution equation, which allow a large spatial and temporal resolution at low computational cost. In this paper we will focus on using the lattice Boltzmann code with phase transition to test the scalability of the program on a Linux cluster. The computational domain is a cube and the physical problem is the growth of dendrites in an undercooled melt. 2. LATTICE PHASE-FIELD M O D E L In analogy to the continuum models for the phase transition we define a continuous order parameter - 1 < r _< + 1 with the values + 1 for melt and -1 for solid. The order parameter is defined on a lattice point g', at a discrete time t,. Its evolution is given by a simple rate equation: r
t, + 1) - r
t,) + 7~(g',, t,).
(1)
292
1~
S 10
Odd domain
18 ~60J'~i 4 ~
3
....................l,
14 ..... ]'~--~=:---~ ', "---.2.........../ _...... 22
Even domain 1 4
m
r
1
3nization ~nizatic
-
4
~nizati~
1 4
2 3
~Jze~
:~:: ::
Figure 1. Face-centered hypercubic lattice used for the lattice Boltzmann method (left). Communication between domains as organized in the code (fight). The communication in diagonal is obmitted for better viewing. The rate 7"r is given by the probability for melting and solidification, K + and K - , respectively: ~(~',,t,)
E
=
f, K+(1 - r
-
f,~,2 6 ~
w~ (r
1
+ r
_ K-(1 + r
+ ~ , t,) - r
t,)).
ill (2)
The third term is due to the Gibbs-Thomson effect. The timescale for the phase transition is defined by a frequency factor f, and 6& is the width of the transition region (-0.9 _< r _< +0.9). ...+ d is the vector of the ith link (see Fig. 1). w~ is a weight parameter, taking into account the different distances of links in diagonal (wi = 1/2) and non-diagonal directions (wi = 1). Anisotropy in the surface energy of the crystal will add some contributions to the third term of Eq. 2. They do not change the computational scheme and will be described elsewhere [9]. From the view point of parallel computing the rate equation is non-local in space requiting next neighbour's information on the phase field. Because in phase-field models the interface is diffuse and will not be represented by the boundary of elements, velocity boundary conditions cannot be applied in the common way. In LBM the solid-interaction can be introduced naturally via reflection rules for the quasi-particles. The reflection rules depend on the local value of the phase-field [10]. 3. 3D CALCULATIONS ON A LINUX-CLUSTER
3.1. Architecture of the cluster and programming We use a typical Beowulf-style Linux cluster 1, which consists of 16 dual-boards with 2• GHz XEON processors and 2 GByte ECC-RAM per node. During the runs the hyperthreading of the processors was turnned off by Bios settings except for one run, which will be explicitely mentioned in the next subsection. The compute nodes are linked by Myrinet and the communication with the master node is realized via Fast Ethemet. The (rectangular) computational domain has been divided into subdomains for each processor. This decomposition can be done in one (1D), two (2D) or three (3D) directions. Directions of decomposition can be chosen in different ways as explained in Fig. 2. To avoid congestions in the data traffic between nodes the communication is organized as follows. The subdomains lhttp://beowulf.gsfc.nasa.gov/overview.html
293
....... i:i e"
a
b
c
d
2
..,
4
e
Figure 2. Decomposition types for 1D (a and b), 2D (c and d), and 3D (e) configurations as used for performance calculations. (a): decomposition in z-direction (1D-Z). (b): decomposition in x-direction (1D-X). (c): decomposition in x- and y-direction (2D-XY). (d): decomposition in y- and z-direction (2D-YZ).
~95e.
.
t;.
~i-~
Figure 3. 92 • •
Growing dendrite in a flow field caused by thermal buoyancy. Time: 7000 iterations.
Domain size:
are arranged in chess-like pattern so that we can distinguish between two types of subdomains, which will be called odd and even [3]. There are four subsequent steps for the message passing of planes as it is outlined in Fig. 1. The arrows denote the direction of traffic (sending or receiving) and the numbers at the arrows correlate with the step number. In order to send all data in one package the data from the arrays of quasi-particle and phase-field variable have to be picked up and put into a new array. This process will also consume time depending on the data access. This time is included in the measured time for message passing, which we will present in the next subsection. In 2D and 3D domain decomposition also data from the edges have to be transferred from node to node (directions 1, 2, 3, 4, 7, 9, 10, 11, 12, 13, 16, 18 in Fig. 1 left) but the amount of data is much smaller than those of the previous steps. In total we have four synchronization points.
3.2. Runs and performance We have performed some runs of dendritic growth from an undercooled melt in a cubic domain. The Temperature at the boundaries is fixed and defines the undercooling. We started from a spherical seed in an undercooled melt and studied the qualitative behaviour of the interface shape in time for cases with and without flow. An example for a growing dendrite in a buoyancy
294 Table 1 Comparison of calculation time tcal, message passing time tMp, and total time ttotal for different configurations of runs with 150 x 150 x 150 domain size on two processors. The total number of iterations was 500, times are shown for one iteration. 1D-X 1D-Z tcal/S tMp/S ttotal/S tcal/S tMp/S J ttotal/S nodes on sameboard 3.86 0.084 3.93 3.84 0.016 3.86 nodes on different boards 3.00 0.076 3.08 2.95 0.018 2.97 flow field caused by the temperature difference of the warm crystal surthce due to the release of latent heat and the cold melt is shown in Fig. 3. The light colors at the tips correspond to a temperature lower than on the rest of the surface, which reflects the Gibbs-Thomson effect. We take such a run as example to measure the performance of the code on the cluster. The problem size of all runs mentioned in the paper was 150 x 150 x 150 grid points and 500 iterations - timesteps have been used. The performance on one processor is Vcal = 428 Mflops/s, where the total time is used though in the streaming step no floating point operation is required. The communication between two processors depend on the direction of the 1D domain decomposition, which can be explained by different data access in the packing procedure before a package is sent. In our convention, if decomposition is 1D-Z, packing will be performed in a line by line mode from the arrays, whereas in the 1D-X case data have to be picked up from different location in the array. Consequently, the time for message passing including packing is larger in the second case (see Table 1). In our case with non-periodic boundary conditions neglecting the machine latency (time to set-up the communications) the rate is computed via ~fcom --- 2T~byteTtvarNplane/Tcomwhere nbyte , n v a r , and Npl~ne are the number of bytes per variable (8 in our case), the number of variables (7), and the number of grid points in the plane to be transfered (150 x 150), respectively. In particular we observed the following communication rates: V~om=140 MB/s for 1D-Z and V~om=33 MB/s for 1D-X. These rates are for the case of two processors on different boards. In the case of two processors on the same board the rate is of the same order, which clearly indicates the high performance of the Myrinet. Some runs have been performed for a 2D domain decomposition into 8 subdomains. Two different configurations of domain decomposition have been used according to Figure 2 (see Table 2). The same tendency according to the difference of using two processors on the same or on different boards as in the 1D case can be seen. There is not much difference in message passing time but the calculation time is little bit larger for computation on one board, presumably due to congestions in the data access of the two processors. Both processors use the same bus to the data memory. Conceming the different directions of domain decomposition we observe the same effect as in the 1D case. Decompostion in x-direction causes a slow memory access during packing and the result is a message passing time twice as for the 2D-YZ case. Again this is true for both running the jobs on two processors on one board and for running it on different boards. We also tested the option of hyper-threading by using 16 subdomains and run the job on 4 boards. Thus we use the two real plus the two virtual processors on each board. The measurement indicates that there is no advantage to use the virtual processors in this manner. In contrary, the computation time increases (see Table 2).
295 Table 2 Comparison of calculation time tc~l, message passing time tMp, and total time ttotal for different configurations of runs with 150 • 150 x 150 domain size. The results for different configurations of domain decomposition into 8 subdomains and 16 subdomains (case with hyper-threading) are shown (times for one iteration). The total number of iterations was 500. boards 8 4 4* 2D-YZ decomposition 2D-XY 2D-YZ 2D-XY 0.87 1.15 tcal/S 0.77 0.77 0.90 t~ip/s 0.078 0.036 0.076 0.028 0.27 0.90 1.42 ttotal/S 0.85 0.81 0.98 * In this run hyper-threading was on.
[I
The scalability of the code was checked by running the example on up to 32 processors. The reduction in total time for the 1D-Z configuration can be seen in Figure 4(left). The time for message passing is more or less constant regardless of the number of processors. This confirms that each processor can communicate with any other independently from any other traffic in the Myrinet. The data of calculation times can be transverted into a speedup, which is shown on the righthand side of Figure 4. Clearly, there is a deviation from the ideal curve, mainly because the problem size tends to be too small. For 2D-YZ and 3D configurations we made calculations with more than 16 processors. In these cases the second processor on certain boards have to be used with all the disadvantages of data access. Therefore, it is not astonishing that the curves for 2D-YZ and 3D deviate significantly from the ideal curve.
3250-=.-~
aooo,::i,.~~iii~ aTso~:, ................. 12222~2~2...;;....LL.. asoo..~
',iiiill
_ _
,,,o.~ i!iiii ......................................
2ooo~ 1750~
/
~.,_ ~~
..................................................
. . . . . . . . . . . . . . . . . . . . . . .
/
~20
8.
o~
m
5oo ',i 250
i!
~',
~
Number of processors
~, ,.,..........
/ 5
10 15 20 25 Number of Processors
30
35
Figure 4. Measured calculation (Tcpu) and message passing time (Tmp) as a function of the number of processors for 1D-Z configuration with 150 • 150 • 150 grid points (left). Speedup as a function of the number of processors for 1D-Z, 2D-YZ, and 3D configurations (fight). All times were measured with runs of 500 iterations.
296 4. CONCLUSIONS We developed a phase-field model to calculate the liquid/solid phase transition within a lattice Boltzmann code. The uniform lattice used for both lattice Boltzmann equation to compute flow including scalar transport and for the phase-field equation to compute the liquid/solid phase transition was divided into subdomains with different 1D, 2D, and 3D decompositions. These different configurations were used to test the performance on a Beowulf-style cluster with 16 dual-boards and communication via Myrinet. We found that the communication via Myrinet is about as fast as communication between processors on one board. A good speedup was observed up to 16 processors. Beyond 16 processors the speedup is limited because of the congestion in data access of two processors on one board. ACKNOWLEDGMENT This study was partly supported by the Deutsche Forschungsgemeinschaft under grant MI 678. REFERENCES
S. Succi, The lattice Boltzmann Equation, Numerical Mathematics and Scientific Computation, Clarendon Press, Oxford, 2001. [21 D. A. Wolf-Gladrow, Lattice-Gas cellular automata and lattice Boltzmann models, Vol. 1725 of Lecture Notes in Mathematics, Springer-Verlag, Berlin Heidelberg New York, 2000. [3] W. Miller, S. Succi, Parallel three-dimensional lattice Boltzmann hydrodynamics on the IBM 9076-SP 1 and SP2 scalable parallel computers, in: J.-A,D6sid6ri, C.Hirsch, P. Tallec, M.Pandolfi, J.P6riaux (Eds.), Computational Fluid Dynamics '96, ECCOMAS, John Wiley & Sons, 1996, pp. 1052-1058. [4] G. Punzo, F. Massaioli, S. Succi, High-resolution lattice-Boltzmann computing: on the SP ! scalable parallel computer, Comput. Phys. 5 (1994) 1-7. [5] W. Miller, S. Succi, D. Mansutti, A lattice Boltzmann model for anisotropic liqui~solid phase transition, Phys. Rev. Lett. 86 (16) (2001) 3578-3581. [6] W. Miller, S. Succi, A lattice Boltzmann model for anisotropic cwstal growth from me!t, J. Stat. Phys. 107 (1/2) (2002) 173-186. [7] F. Massaioli, S. Succi, R. Benzi, Exponential tails in Rayleigh-B6nard convection, Europhys. Lett. 21 (1993) 305. [8] A. Call, S. Succi, A. Cancelliere, R. Benzi, M. Gramignani, Diffusion and hydrodynamic dispersion with the lattice Boltzmann method, Phys. Rev. A 45 (1992) 5771-5774. [9] W. Miller, I, Rasin, A discrete phase-field model for dendritic growth from melt, submitted to Phys. Rev. E (2003). [10] W. Miller, Crystal growth kinetics and fluid flow, Int. J. Mod. Phys. B 17 (1/2) (2003) 227-230. [1]
Parallel Computing: SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
297
Distributed Negative Cycle Detection Algorithms* L. Brim ~, I. (~ernfi~, and L. Hejtm~nek ~ ~Faculty of Informatics, Masaryk University, Brno, Czech Republic Several new parallel algorithms for the single source shortest paths and for the negative cycle detection problems on directed graphs with real edge weights and given by adjacency list are developed, analysed, and experimentally compared. The algorithms are to be performed on clusters of workstations that communicate via a message passing mechanism. 1. INTRODUCTION The single source shortest path problem (SSSP) is a fundamental problem with many theoretical and practical applications and with several effective and well-grounded sequential algorithms. The same can be said about the closely related negative cycle detection problem (NCD) which is to find a negative length cycle in a graph or to prove that there are none. In fact, all known algorithms for NCD combine a shortest paths algorithm with some cycle detection strategy. In many applications we have to deal with extremely large graphs (a particular application we have in mind is briefly discussed bellow). Whenever a graph is too large to fit into memory that is randomly accessed a memory that is sequentially accessed has to be employed. This causes a bottleneck in the performance of a sequential algorithm owing to the significant amount of paging involved during its execution. An obvious approach to deal with these practical limitations is to increase the computational power (especially randomly accessed memory) by building a powerful (yet cheap) distributed-memory cluster of computers. The computers are programmed in single-program, multiple-data style, meaning that one program runs concurrently on each processor and its execution is specialised for each processor by using its processor identity (id). The program relies on a communication layer based on Message-Passing Interface standard. Our motivation for this work was to develop a distributed model checking algorithm for linear temporal logic. This problem can be reduced to the negative cycle detection problem as shown in [2]. The resulting graph is not completely given at the beginning of the computation through its adjacency-list or adjacency-matrix representation. Instead, we are given a root vertex together with a function which for every vertex computes its adjacency-list. A possible approach is to generate the graph at first and then to process it with a distributed NCD algorithm. However, this approach is highly non-efficient. If one processes the graph simultaneously with its formation it can happen that a negative cycle is detected even before the whole graph is formated. Moreover, this on-the-fly technique allows to generate the part of the graph reachable from the root vertex only and thus reduces the space requirements. As successors of a vertex *This work has been partially supported by the Grant Agency of Czech Republic grant No.201/03/0509.
298 are determined dynamically there is no need to store any information about edges permanently which brings yet another reduction in space complexity. A natural starting point for building a distributed algorithm is to distribute an efficient sequential algorithm. Because of the aforementioned reasons we have concentrated on algorithms which admit graphs specified with the help of adjacency lists (and omit those presupposing an adjacency matrix representation of the graph). These algorithms (for an excellent survey see [5]), which are based on relaxation of graph's edges, are inherently sequential and their parallel versions are known only for special settings of the problem. For general digraphs with non-negative edge lengths parallel algorithms are presented in [15, 18, 7] (see [12] for a comparative study) together with studies concerning suitable decomposition [ 13]. For special cases of graphs, like planar digraphs [ 19, 14], graphs with separator decomposition [6] or graphs with small tree-width [4] more efficient algorithms are known. Yet none of these known algorithms are applicable to general digraphs with potential negative-length cycles. In this paper we propose several parallel algorithms for the general SSSP and NCD problems on graphs with real edge lengths and given by adjacency lists. We analyse their worst-case complexity and conduct an extensive practical performance study of these algorithms. We study various combinations of distributed shortest path algorithms and distributed cycle detection strategies to determine the best combination measured in terms of their scalability.
2. SERIAL NEGATIVE CYCLE P R O B L E M We are given a triple (G, s, 1), where G - (V, E) is a directed graph with n vertices and m edges, 1 : E ~ R is a length function mapping edges to real-valued lengths, and s E V is the root vertex. The length of the path p = < Vo, vl,..., vk > is the sum of the lengths of its constituent edges, l(p) = ~-~i~1l(Vi_l, vi). Negative cycle is a cycle p - < Vo, v l , . . . , vk, v0 > with length l(p) < 0. The negative cycle detection (NCD) problem is to find a negative cycle in a graph or to prove that there are none. Algorithms for the NCD problem that use the adjacency-list representation of the graph construct a shortest-path tree Gs = (V~, Es), where V~ is the set of all vertices reachable from the root s, E~ c_ E, s is the root of G~, and for every vertex v c V~ the path from s to v in G~ is a shortest path from s to v in G. The labeling method maintains for every vertex v its distance label d(v) and parent p(v). Initially d(v) = ~ and p(v) = null, the method starts by setting d(s) = 0. The method maintains for every vertex its status which is either unreached, labeled, or scanned, initially all vertices but the root are unreached and the root is labeled. The method is based on the scan operation. During scanning a labeled vertex v, all edges (v,) out-going from v are relaxed which means that if d(u) > d(v) + l(v, u) then d(u) is set to d(v) + l(v, u) and p(u) is set to v. The status of v is changed to scanned while the status of u is changed to labeled. During the computation the edges (p(v), v) for all v : p(v) r null induce the parent graph G~. If all vertices are either scanned or unreached then d gives the shortest path lengths and Gp is the shortest-path tree. On the contrary, any cycle in Gp is negative and if the graph contains a negative cycle then after a finite number of scan operations G; always has a cycle [5]. This fact is used for the negative cycle detection.
299
2.1. Scanning strategies Different strategies for selecting a labeled vertex to be scanned next lead to different algorithms. The Bellman-Ford-Moore algorithm [1, 9] uses for selecting the FIFO strategy and runs in O(nm) time. The D'Escopo-Pape algorithm [16] makes use of a priority queue. The next vertex to be scanned is removed from the head of the queue. A vertex that becomes labeled is added to the head of the queue if it has been scanned previously, or to the tail otherwise. The Pallotino's algorithm [ 17] maintains two queues. The next vertex to be scanned is removed from the head of the first queue if it is nonempty and from the second queue otherwise. A vertex that becomes labeled is added to the tail of the first queue if it has been scanned previously, or to the tail of the second queue otherwise. Both last mentioned algorithms favour recently scanned vertices and run in O(n2m) time in the worst case, assuming no negative cycles. The network simplex algorithm [8] maintains the invariant that in the current parent graph all edges have zero reduced cost (the reduced cost of an edge (v, u) is l(v, u) + d(u) - d(v)). Therefore, if the distance label of a vertex u decreases, the algorithm decreases labels of vertices in the subtree rooted at u by the same amount. Then a new edge with negative reduced cost (so called pivot) is found and the process continues. There are several heuristics to find a pivot. One can search the scanned vertices and choose the pivot according to a FIFO strategy or depending on the value of the reduced cost. The algorithm runs in O(n2m) time. There are several other algorithms, like e.g the Goldberg-Radzik and the Goldfarb et.al. [1 O, 11 ], which however make use of topological sorting and leveling of the parent graph respectively and thus are not directly convertable into distributed versions.
2.2. Cycle detection strategies Besides the trivial and non-efficient cycle detection strategies like time out and distance lower bound, the algorithms put to use one of the following strategies: walk to root, subtree traversal and subtree disassembly. The walk to root method tests whether Gp is acyclic. Suppose the parent graph G~ is acyclic and the scanning operation relaxes an edge (v, u). This operation will create a cycle in Gp if and only if u is an ancestor of v in the current parent tree. Before applying the operation, the method follows the parent pointers from v until it reaches u or a vertex with null parent (on this path only the root can have null parent). If the walk stops at u a negative cycle has been found; otherwise, the scanning operation does not create a cycle. The walk to root method gives immediate cycle detection. However, since the path to the root can be long, the relaxation cost becomes O(n) instead of O(1). In order to optimise the overall computational complexity we propose to use amortisation to pay the cost of checking Gp for cycles. More precisely, the parent graph G~ is tested only after the underlying scanning algorithm performs f~(n) work. A drawback of the amortised strategy is the fact that even if the relaxation of an edge (v, u) does not create a cycle in G~, there can be a cycle on the way from v to the root. Therefore the strategy marks every vertex through which it proceeds. A cycle is detected whenever an already marked vertex is reached. If Gp is acyclic, all marks are removed. The running time is thus increased only by a constant factor. The correctness of the amortised strategy is based on the fact that if G contains a negative cycle reachable from s, then after a finite number of scanning operation Gp always has a cycle [5]. The subtree traversal method makes use of a symmetric idea: the relaxation of an edge (v, u) can create a cycle in Gp if and only if v is an ancestor of u in the current tree. This strategy fits
naturally with the network simplex method, as the subtree traversal can be combined with the updating of the pivot subtree. The subtree disassembly method also searches the subtree rooted at u. However, this time, if v is not in the subtree, all vertices of the subtree except u are removed from the parent graph and their status is changed to unreached. The work of subtree disassembly is amortized over the work to build the subtree and the cycle detection is immediate.

3. DISTRIBUTED NEGATIVE CYCLE DETECTION ALGORITHMS
We develop distributed versions of the aforementioned serial algorithms. These parallel algorithms are enriched by several novel ideas. To the best of our knowledge these are the first algorithms for the considered setting of the NCD and SSSP problems. Pseudo-codes of the algorithms can be found in [3]. The distributed algorithms are designed for a cluster of workstations. Each workstation executes the same algorithm. In addition, we consider a distinguished workstation (called the manager) which is responsible for the initialization of the entire computation, termination detection and synchronization. The vertices of the input graph are divided equally and randomly into disjoint parts by a partition function.
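A hash-based partition function of this kind could, for instance, look as follows. This is only a sketch; the concrete partition function used in the implementation is described in [3], and the class and parameter names here are ours.

    // Maps every vertex identifier to the workstation (0 .. workers-1) that owns it.
    // A deterministic pseudo-random assignment spreads the vertices equally over the workers.
    final class Partition {
        private final int workers;
        private final long seed;

        Partition(int workers, long seed) {
            this.workers = workers;
            this.seed = seed;
        }

        int owner(int vertex) {
            long h = vertex * 0x9E3779B97F4A7C15L + seed;  // cheap integer mixing
            h ^= (h >>> 32);
            return (int) Math.floorMod(h, (long) workers);
        }
    }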
3.1. Distributed scanning strategies
The scanning strategies used in the first three serial algorithms can be converted into their distributed counterparts at no cost, while preserving the asymptotic complexity. The correctness of the serial algorithms does not depend on the order in which relaxations are performed. Therefore we can maintain local queues, and each processor scans its vertices in their relative order. The network simplex algorithm chooses the pivot according to the FIFO strategy. We only need to provide a distributed version of the subtree update to maintain the invariant concerning zero reduced costs. Thanks to the fact that the parent graph does not contain any cycle (some cycle detection strategy is employed first), one can traverse the subtree in a breadth-first manner without the necessity to mark visited vertices. The breadth-first search distributes well. The asymptotic complexity is preserved.
3.2. Distributed cycle detection strategies
All three considered cycle detection strategies can be modified while preserving their asymptotic time complexity. The subtree traversal method requires only a minor modification. The breadth-first traversal of the parent tree can be distributed in a natural way. Due to the asynchronous relaxations it can happen that the structure of the subtree is modified before successful completion of the subtree traversal, and thus a "false" negative cycle can be detected. To recognize such a situation it is enough to count the distance from the subtree root to the particular vertices. The subtree disassembly strategy is more involved. When disassembling the subtree we need to maintain distances as in the previous strategy to discover "false" cycles. On top of that, it can happen that a cycle in the subtree spanning several processors is not discovered due to a strictly synchronous sequence of relaxations and subtree disassemblies. In such a situation the cycle is detected using the distance lower bound. The walk to root strategy follows the parent pointers starting from the vertex where the detection has been invoked, marking the vertices through which it proceeds. At the same time
several detections can be invoked (on different processors). Therefore every processor has its own mark and all marks are linearly ordered. If a walk reaches a vertex marked with a lower or a higher mark, it overwrites this mark or stops, respectively. A vertex marked with the same stamp implies the presence of a negative cycle. After finishing the detection, the strategy needs to remove the marks, starting from the vertex where the detection has been initiated. To find all the marked vertices, the parents of the vertices must not be changed in the meantime. This is guaranteed by locking the marked vertices.

4. COMPARISON OF DISTRIBUTED ALGORITHMS
The challenge for distributed algorithms is to beat their (usually very efficient) static counterparts. However, their actual running time may depend on many parameters related to the type of the input, the distribution of the graph, and others. Hence, it is necessary to perform a series of experiments with several algorithms in order to be able to select the most appropriate one for a specific application.
[Two plots: runtime versus number of processors (4 to 20) for the algorithms nts, std, std-palt, std-pape and wtr.]
Figure 1. Real graphs (model of a lift with 14 and 12 levels)
A parallel execution is characterized by the time elapsed from the moment the first processor started working to the moment the last processor completed. We present the average of a filtered set of execution times. Our collection of datasets consists of a mix of real instances (representing verification problems) and generated instances; the instances scale up linearly with the number of processors. The algorithms have been implemented in C++ and the experiments have been performed on a cluster of 20 Pentium PC Linux workstations with 512 MB of RAM each, interconnected by 100 Mbps Ethernet, using the Message Passing Interface (MPI) library. We compared the following combinations of SSSP and NCD algorithms: Network Simplex with Subtree Traversal [nts], Bellman-Ford with Subtree Disassembly and distance lower bound [std], Pallottino with Subtree Disassembly [std-palt], D'Esopo-Pape with Subtree Disassembly [std-pape], and Bellman-Ford with Walk to Root [wtr].
[Plot: runtime versus number of processors (4 to 20) for nts, std, std-palt, std-pape and wtr on a generated graph.]
Figure 2. Generated graph

The experimental results are summarised in Figures 1 to 3. From the experiments we can draw some remarkable conclusions. The first and most important one is that distributing the SSSP and NCD algorithms (even if done in a very straightforward manner) makes it possible to solve these problems for huge graphs within a reasonable amount of time. This is particularly important in
[Plot: runtime versus number of processors (1 to 4) for nts, std, std-palt, std-pape and wtr on the worst-case generated graph.]
Figure 3. Generated graph - the worst case
applications like model checking: more realistic systems can be verified. All the implemented algorithms show good scalability; the exception is the situation which represents the worst case (the negative cycle in a grid-like graph passes through all the processors), where the measurements correspond to the theoretical worst-case complexity (Figure 3). In the distributed environment, the proper choice of the cycle detection strategy is more important than the labelling strategy. As regards cycle detection, the algorithms behave differently depending on the "type" of the input graph. For example, for randomly generated graphs with negative-valued edges and without cycles we could conclude that std-palt is the best choice and that nts has the worst behaviour. For
generated graphs with positive-valued edges, all the algorithms scale well and are reasonably fast. For graphs resulting from model-checking problems the wtr approach has proven to beat all the others, regardless of the presence or absence of a negative cycle. The experiments also demonstrated that splitting the graph into too many small parts does not bring additional speedup of the computation, due to the increase in communication.

5. CONCLUSIONS
We provide and analyse parallel algorithms for the general SSSP and NCD problems for graphs specified with adjacency lists. The algorithms are designed for networks of workstations where the input graph is distributed over individual workstations communicating via a message passing interface. Based on our experiments we conclude that in situations where no a priori information about the graph is given, the best choice is the Subtree Disassembly algorithm (in the sequential version also known as Tarjan's algorithm). For specific applications other algorithms or their combinations can be more suitable, as demonstrated by Walk to Root in the case of the application to model checking.

REFERENCES
[1] R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87-90, 1958.
[2] L. Brim, I. Černá, P. Krčál, and R. Pelánek. Distributed LTL model checking based on negative cycle detection. In FST TCS 2001, number 2245 in LNCS. Springer-Verlag, 2001.
[3] L. Brim, I. Černá, and L. Hejtmánek. Parallel Algorithms for Detection of Negative Cycles. Technical report FIMU-RS-2003-04, Faculty of Informatics, Masaryk University Brno, 2003.
[4] S. Chaudhuri and C. D. Zaroliagis. Shortest path queries in digraphs of small treewidth. In Automata, Languages and Programming, pages 244-255, 1995.
[5] B. V. Cherkassky and A. V. Goldberg. Negative-cycle detection algorithms. Mathematical Programming, Springer-Verlag, 85:277-311, 1999.
[6] E. Cohen. Efficient parallel shortest-paths in digraphs with a separator decomposition. Journal of Algorithms, 21(2):331-357, 1996.
[7] A. Crauser, K. Mehlhorn, U. Meyer, and P. Sanders. A parallelization of Dijkstra's shortest path algorithm. In Proc. 23rd MFCS'98, volume 1450 of LNCS, pages 722-731. Springer-Verlag, 1998.
[8] G. B. Dantzig. Application of the simplex method to a transportation problem. Activity Analysis of Production and Allocation, 1951.
[9] L. R. Ford. Network flow theory. Rand Corp., Santa Monica, Cal., 1956.
[10] A. V. Goldberg and T. Radzik. A heuristic improvement of the Bellman-Ford algorithm. AMLETS: Applied Mathematics Letters, 6:3-6, 1993.
[11] D. Goldfarb, J. Hao, and S. R. Kai. Shortest path algorithms using dynamic breadth-first search. Networks, pages 29-50, 1991.
[12] M. Hribar, V. Taylor, and D. Boyce. Parallel shortest path algorithms: Identifying the factors that affect performance. Technical Report CPDC-TR-9803-015, Center for Parallel and Distributed Computing, Northwestern University, 1998.
[13] M. Hribar, V. Taylor, and D. Boyce. Performance study of parallel shortest path algorithms: Characteristics of good decompositions. In Proc. ISUG '97 Conference, 1997.
[14] D. Kavvadias, G. Pantziou, P. Spirakis, and C. Zaroliagis. Efficient sequential and parallel algorithms for the negative cycle problem. In Proc. 5th ISAAC'94, volume 834 of LNCS, pages 270-278. Springer-Verlag, 1994.
[15] U. Meyer and P. Sanders. Parallel shortest path for arbitrary graphs. In EUROPAR: Parallel Processing, 6th International EURO-PAR Conference, LNCS. Springer-Verlag, 2000.
[16] U. Pape. Implementation and efficiency of Moore-algorithms for the shortest path problem. Mathematical Programming, 7:212-222, 1974.
[17] S. Pallottino. Shortest-path methods: Complexity, interrelations and new propositions. Networks, 14:257-267, 1984.
[18] K. Ramarao and S. Venkatesan. On finding and updating shortest paths distributively. Journal of Algorithms, 13:235-257, 1992.
[19] J. Träff and C. D. Zaroliagis. A simple parallel algorithm for the single-source shortest path problem on planar digraphs. In Parallel algorithms for irregularly structured problems (IRREGULAR-3), volume 1117 of LNCS. Springer-Verlag, 1996.
A Framework for Seamlessly Making Object Oriented Applications Distributed

S. Chaumette and P. Vignéras

LaBRI, Laboratoire Bordelais de Recherche en Informatique, Université Bordeaux 1, 351 Cours de la Libération, 33405 TALENCE CEDEX, FRANCE
email: {Serge.Chaumette, Pierre.Vigneras}@labri.fr

Traditional distributed frameworks such as DCOM, CORBA or EJB do not allow an arbitrary object to become remote dynamically if it was not designed for it. This is a major drawback, solved by the active container concept presented in this article. The Java™¹ implementation of this concept, JACOb, is also presented.

1. INTRODUCTION
Distributed applications are more and more often developed using object oriented languages, even if other programming techniques such as agent-oriented applications [6] or aspect-oriented programming [7] emerge. Three de facto standards exist: DCOM™ [4] (Distributed Component Object Model) by Microsoft™, CORBA [11] (Common Object Request Broker Architecture) by the OMG, and EJB™ [12] (Enterprise JavaBeans) by Sun Microsystems™. Even though these standards offer similar functionality, they share the same shortcoming: objects must be written specifically as remote objects to be used remotely. Moreover, the "remote" behavior depends on the framework used. A DCOM object cannot be used in an EJB environment and vice versa. The protocol used for communication is directly imposed by the underlying framework and cannot be changed, preventing one from choosing between an efficient and a safe or secure protocol. Some implementors try to solve this problem by writing bridges between "naturally incompatible" worlds, such as RMI/IIOP, but we believe this is not the right approach. A new paradigm must be used to allow any object to become remotely accessible dynamically. Several challenges must be solved in order to achieve this goal: remote objects must deal with issues usually not found in the local case, such as latency, memory access, partial failure or concurrency [16]. Moreover, a clear separation between remote object management and remote object communication must be found. This paper presents the active container concept and its implementation in Java: JACOb. JACOb is the server part of the Mandala project [15], which focuses on dynamic distribution of objects using the Java reflection facility. The rest of this article is organized as follows: we first propose in section 2 a new paradigm for distributed programming that we call active containers. Problems related to remote objects are dealt with in section 3. Then JACOb, our Java implementation of the concept, is presented in section 4. Conclusions and future work are the topic of section 5.
¹Java and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. The authors are independent of Sun Microsystems, Inc.
2. THE ACTIVE CONTAINER CONCEPT
An active container is a container of objects which provides a communication channel for any of its objects. It supports the following main methods:
• void put(Object key, Object object): insert an object into the container; the inserted object becomes a stored object;
• void remove(Object key): remove an object from the container; the object is no longer a stored object;
• Object get(Object key): get a copy of an object from the container; the stored object remains in the container;
• void call(Object key, String method, Object[] args, MethodResult result): call a method of a stored object.

The method call() generates the activity. It invokes the given method on the stored object that is mapped to the given key in the active container. To support dynamism, the method is specified as an instance of a Method meta-class and the implementation uses reflection to resolve it. This leads to the problem of strong type checking, which is discussed in section 4.4.1. The fact that reflection is used makes it possible to work at the level of the meta-model, reasoning about all applications that can be written using the framework instead of dealing with a specific one. This leads to a compact model [14] of the framework in the π-calculus [9]. A separation between object management and method invocations is provided de facto by the active container model: the active container is a repository of objects and is used for object management (deployment, upgrade, deletion); the call() method provides the communication channel with stored objects. This separation is known to be a good design and is widely used, for example in the EJB framework, which defines EJB containers. Nevertheless, these containers can only contain specific objects (beans), whereas in our model stored objects can really be any object. JavaSpaces [13] also uses the container abstraction but does not provide a communication channel to use stored objects remotely. Instead, objects must be retrieved locally to be used.
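A minimal Java sketch of this interface follows. This is our own illustration: the exact signatures in JACOb, and in particular the shape of MethodResult, may differ; MethodResult is given a small hypothetical blocking implementation here so that later examples can reuse it.

    // A holder for the (possibly asynchronously produced) return value of an invocation.
    final class MethodResult {
        private Object value;
        private boolean done;
        public synchronized void complete(Object v) { value = v; done = true; notifyAll(); }
        public synchronized Object value() throws InterruptedException {
            while (!done) wait();
            return value;
        }
    }

    // An active container: a repository of stored objects plus a communication channel to them.
    interface ActiveContainer {
        void put(Object key, Object object);   // the inserted object becomes a stored object
        void remove(Object key);               // the object is no longer a stored object
        Object get(Object key);                // returns a copy; the stored object stays inside
        void call(Object key, String method, Object[] args, MethodResult result);
    }

In JACOb itself the container interface is called ActiveMap and additionally inherits java.util.Map (see section 4); the sketch keeps only the four operations discussed here.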
2.1. Usage examples There are many applications that can be developed in a straightforward manner using the framework of active containers. This section presents some of them which we believe illustrate the ease of use of the concept and its capacity to model some interesting applications in the distributed world.
2.1.1. Active containers as a memory model
This concept can be seen as a memory model of a distributed virtual machine where the objects seen by the end-user are always stored objects: put() provides a mechanism to create objects; remove() provides a mechanism to delete objects; call() provides a mechanism for method invocation. This model may be used to transparently distribute a local application, as explained in section 3.2.
2.1.2. Active containers and mobility
The active container concept is powerful enough to express the migration of mobile agents [5]. A stopActivity() method can be invoked on an object, in case strong migration is not possible. The object can then be retrieved using get(), moved to another location using put() and eventually restarted by calling a dedicated method, for instance restartActivity(), as shown in Program 1. Note that in such a case the object is moved twice: first to the host which invokes the get() method, and then to the host where the put() is actually performed. If a direct move between two hosts is needed, a previously inserted Migrator object which contains a migrateTo() method may be used.

2.1.3. Active containers and multi-protocol remote objects
All communication with stored objects has to go through the active container they live in. This implies that the protocol used to communicate with a stored object is the protocol used to communicate with the remote active container. Using the remote active container model, we can provide a mechanism for direct communication with stored objects using any protocol such as RMI, IIOP, TCP, UDP, etc. For example, clients outside of a LAN must use a secure, reliable internet protocol such as SSL/TCP/IP to access an object, whereas other clients on the LAN would use a faster protocol such as BIP/Myrinet. For this purpose, an object which provides a new remote access to a stored object may be inserted into an active container. Consider an instance of a ProxyFactory class which delivers remote proxies to a business object it has a local reference on. Clients use its method getProxy(), which takes one parameter: the protocol the returned proxy must use to transform method invocations into remote method invocations. The method may instantiate a server if necessary. Multi-protocol access can then be achieved as follows using the active container concept. Suppose the ProxyFactory, which has a local reference to the business object it must return proxies on, has already been inserted (put()) into a remote active container (1). The method getProxy() is invoked remotely through the call() method of the active container (2) by a client, e.g. proxy = call(key, "getProxy", "UDP"). The proxy factory may instantiate a new server for the given protocol ("UDP" here) if necessary (3), and the local reference to the business object is given to this server by the factory (3'). The server-related proxy is then returned (4) to the client. The client then has a reference on the returned proxy (5), and direct communications can take place (6) using the specified protocol. Such a mechanism is proposed for security purposes in the JITS (Just In Time Security) framework [3].
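Expressed with the interface sketched in section 2, the client side of steps (1) to (6) could look roughly as follows; BusinessService and the key "proxy-factory" are hypothetical names used only in this example.

    // The business interface the proxy is expected to implement (hypothetical).
    interface BusinessService { void doWork(String input); }

    final class MultiProtocolClient {
        // 'container' is a handle on the remote active container, obtained elsewhere (see section 4).
        static BusinessService lookupProxy(ActiveContainer container) throws InterruptedException {
            MethodResult result = new MethodResult();
            // Steps (2)-(4): ask the stored ProxyFactory for a proxy speaking the UDP protocol.
            container.call("proxy-factory", "getProxy", new Object[] { "UDP" }, result);
            // Step (5): the client now holds a direct, protocol-specific proxy.
            return (BusinessService) result.value();
        }
    }

From this point on, invocations on the returned proxy (step (6)) no longer go through call() but directly to the protocol-specific server.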
3. DEALING WITH REMOTE OBJECTS
Distributed computing raises some well-known issues [16] which we believe the active container concept solves. These problems are latency, memory access, partial failure and concurrency; they are dealt with separately below.
Program 1 Mobile agents expressed using active containers

    fromActiveContainer.call(key, "stopActivity", null, null);
    MyObject object = (MyObject) fromActiveContainer.get(key);
    fromActiveContainer.remove(key);
    toActiveContainer.put(key, object);
    toActiveContainer.call(key, "restartActivity", null, null);
3.1. Latency
Latency is defined by the delay between a method invocation and its effective execution. The latency introduced by the network on a remote method invocation is orders of magnitude larger than that of a local method invocation. If this issue is not addressed, a distributed application will perform poorly. This problem can be solved by an object oriented program analysis (either static, dynamic or both) which maps instances to active containers (on distinct hosts) in such a way that the overall communication cost becomes minimal. Some work in this direction is currently done in our research team using the notion of separable objects [2].

3.2. Memory access
The memory access problem is the fact that a remote access is not exactly the same as a local access. For example, when a method invocation occurs, parameters must be marshaled in order to be transferred to the remote host. This implies a call-by-copy scheme, which is different from the regular call by reference. Using the active container concept, and defining a stored object reference as the pair (activeContainer, key), it is possible to make developers express which scheme they want for their method invocations, as sketched below:
• every reference used in the program must be a stored object reference² and communication between stored objects must be done through the call() method exclusively; this ensures a call-by-reference mechanism;
• when a call by copy is needed, the put() method of active containers must be used, in the same way the clone() method is used in the Java language for the same purpose in the local programming case.
²This can be verified by a compiler.
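A small sketch of the two schemes, reusing the hypothetical interface from section 2 (the name StoredObjectRef is ours; JACOb's actual class is discussed in section 4.2):

    // Call by reference: a stored object reference is just the pair (container, key);
    // every interaction goes through call() and the stored object itself never moves.
    final class StoredObjectRef {
        private final ActiveContainer container;
        private final Object key;
        StoredObjectRef(ActiveContainer container, Object key) {
            this.container = container;
            this.key = key;
        }
        void invoke(String method, Object[] args, MethodResult result) {
            container.call(key, method, args, result);
        }
    }

    // Call by copy: get() hands out a copy of the stored object and put() stores it elsewhere,
    // playing the role clone() plays in the purely local case:
    //     Object copy = sourceContainer.get(key);
    //     targetContainer.put(key, copy);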
3.3. Partial failure
Partial failure is one of the main characteristics of distributed computing. In the local case, a failure is either total or detectable (by the operating system, for example). In the remote case,
a component (network link, processor) may fail while others continue to work properly. Moreover, there is no central component that can detect the failure and inform the others, and no global state which describes what exactly occurred. For example, if a network link is down, a previous remote method invocation may have completed successfully, while this is probably not the case if the processor is broken. Nevertheless, the concept does not specify how remote failure must be handled (since, in fact, the concept does not specify that active containers are remote objects). Hence, it is up to the implementation to provide a remote failure handling mechanism. We will present our proposition in section 4.1.

3.4. Concurrency
Remote objects must usually deal with concurrent method invocations, and this prevents non-thread-safe objects from becoming remote without care. As in the partial failure case, the concept specifies neither the semantics of method invocation on the active container nor on its stored objects. The implementation is free to provide some mechanism that enables non-thread-safe objects to be inserted into a remote active container and thus to be accessed remotely. The process used in our implementation will be discussed in section 4.2.

4. THE JACOb FRAMEWORK FOR JAVA DISTRIBUTED APPLICATIONS
JACOb (Java Active Container of Objects) is a Java implementation of the above concept. Active containers are defined by the ActiveMap interface, which inherits java.util.Map and defines the call() method. Several implementations of this API are available. A local implementation allows the active container concept to be used locally: stored objects are accessed from the virtual machine it is running on. Remote implementations support distributed applications by providing remote access to the objects that are inside the container. Active containers make it straightforward to provide this feature: it is enough to make the container usable remotely, and this directly provides remote access to stored objects. From the server point of view, the design of the framework, which uses the design-by-interfaces scheme intensively, makes the implementation of a new transport layer such as CORBA or Myrinet [10] straightforward. Currently, three protocols are implemented: UDP, TCP and RMI. A JToe [1] based implementation is planned. From the client point of view, the use of the JNDI [8] API prevents the user from being dependent on a particular transport layer. Hence, for the end-user, using a JACOb remote ActiveMap is as easy as using a conventional Map where objects are stored by value³.
³The serialization mechanism is used. One may claim that dynamism is broken since not just any object can be inserted into a remote active container; only serializable objects can.
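A hypothetical end-user view of this is sketched below; the JNDI name, the environment configuration and the plain Map view are assumptions of this sketch, not the documented JACOb API.

    import java.util.Map;
    import javax.naming.Context;
    import javax.naming.InitialContext;
    import javax.naming.NamingException;

    final class ClientExample {
        static void run() throws NamingException {
            Context ctx = new InitialContext();   // configured (jndi.properties) for the chosen transport
            Map<Object, Object> remote =
                    (Map<Object, Object>) ctx.lookup("jacob/containers/demo");  // hypothetical name
            // Used exactly like a conventional Map; values are stored by copy (serialization).
            remote.put("counter", Integer.valueOf(0));
            Object copy = remote.get("counter");
        }
    }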
4.1. Remote failure handling in JACOb
As mentioned in section 3.3, partial failure is one of the main characteristics of distributed applications, and dealing with this issue must be done at the interface level [16]. Java RMI uses a per-method basis for this purpose: each method handled by a remote object must declare RemoteException in its throws clause. This limits code reuse, since declared exceptions are taken into account in Java by the compiler: an object with interface A cannot become remote easily. Composition must be used, since the object has not been developed to be remote.
The interface B of the remote object declares each method of A with RemoteException added to the throws clause. Hence, a client which previously used A cannot use B without being rewritten. We believe an event-driven approach is more convenient, since remote failures are usually handled the same way: contacting the administrator, hiding the problem by contacting a mirror server, etc. In JACOb, each remote object⁴ has a related ExceptionHandler which handles exceptions thrown by the transport layer. This handler may resolve the problem transparently by re-invoking the failed method on a mirror, for example (the caller will never notice that an exception occurred), contact an administrator, or let JACOb throw a RuntimeException to the caller.
⁴ActiveMap is one among many interfaces, such as Map or Collection, for which JACOb also provides a remote implementation.
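The paper does not give the ExceptionHandler API; a handler in this event-driven spirit might look as follows (all names in this sketch are ours).

    // Invoked by the transport layer when a remote invocation fails; the handler either repairs
    // the situation transparently (e.g. retries on a mirror) or escalates.
    interface RemoteExceptionHandler {
        void onFailure(Object key, String method, Exception cause, Retry retry);
    }

    interface Retry {
        Object onMirror() throws Exception;   // re-invokes the failed method on a mirror server
    }

    final class MirrorThenAdminHandler implements RemoteExceptionHandler {
        public void onFailure(Object key, String method, Exception cause, Retry retry) {
            try {
                retry.onMirror();                          // the caller never notices the failure
            } catch (Exception stillFailing) {
                notifyAdministrator(key, method, stillFailing);
                throw new RuntimeException(stillFailing);  // last resort, as described above
            }
        }
        private void notifyAdministrator(Object key, String method, Exception e) {
            System.err.println("remote failure on " + key + "." + method + ": " + e);
        }
    }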
4.2. Concurrency in JACOb
As described in section 3.4, remote objects must deal with concurrent access. Since JACOb allows any object to be inserted into a remote active container, any object may become remote, even non-thread-safe objects. As mentioned in the introduction, JACOb is the server part of the Mandala project. The client part of it, called RAMI, solves this problem. RAMI stands for Reflective Asynchronous Method Invocation and defines asynchronous references, on which method invocation is always asynchronous. RAMI provides different asynchronous semantics to be used with asynchronous references, such as the concurrent semantics, where method invocations are executed concurrently, and the single-threaded semantics, where method invocations are never executed concurrently. JACOb defines the StoredObjectReference class, which is in fact a RAMI asynchronous reference implementation: any method of a stored object may be invoked asynchronously (and remotely, if the active map containing the object is remote). Hence, when a single-threaded semantics is attached to a non-thread-safe object by its instantiator, the object can be accessed asynchronously in a safe way and thus can be inserted into a remote active container.
4.3. JACOb advantages
4.3.1. Dynamism
Objects can dynamically become remotely accessible just by being inserted into an active container. This feature implicitly provides legacy code reuse, since classes do not have to implement or extend anything to be remotely accessible, and separation of concerns, since objects do not have to be designed with the remote concern in mind; developers can then focus on the "business" parts of their objects.
4.3.2. Protocol independence
Using a remote object in JACOb does not depend on the underlying protocol, since communication is handled at the level of the active container. Hence, some protocols may be used for efficiency reasons (such as UDP or Myrinet) while others may be used for their guarantees (TCP, RMI or HTTPS).
4.3.3. Asynchronism
The ActiveMap.call() method provides server-side asynchronous remote method invocation⁵. Applications can then take advantage of their intrinsic parallelism.
⁵Mandala defines server-side and client-side asynchronous remote method invocation. The first is when the client is blocked until the invoked method is guaranteed to start its execution; the latter is when the client is not blocked at all.
4.4. JACOb drawbacks
4.4.1. Strong type checking
If the method call() of the active container model is used directly to invoke methods of the objects it contains, the compiler cannot check types, since the method is a meta-object: the link between the method and the object it is invoked on is computed at runtime. Nevertheless, this is not the natural way of dealing with remote method invocation in JACOb. The client part of Mandala, the RAMI package, which provides reflective asynchronous method invocation, must be used instead of the ActiveMap.call() method, which is only used by implementors. Hence, end-users use a natural syntax to invoke their methods remotely and asynchronously, solving the strong typing problem. Describing the RAMI framework is far beyond the scope of this article.
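The problem can be illustrated with the sketch of call() from section 2: the compiler cannot reject a misspelled method name or a wrong argument list, because the binding only happens at run time (the method and argument names below are hypothetical).

    final class TypeCheckingExample {
        // Both invocations compile; only at run time can the second one fail
        // (unknown method name, wrong argument type).
        static void example(ActiveContainer container, Object key, MethodResult r) {
            container.call(key, "doWork", new Object[] { "some input" }, r);
            container.call(key, "doWrok", new Object[] { Integer.valueOf(42) }, r);  // typo, yet it compiles
        }
    }

With the RAMI asynchronous references mentioned above, the end-user instead writes an ordinary, statically typed method invocation, which restores compile-time checking.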
4.4.2. Inefficiency
Since JACOb uses the reflection mechanism, method invocation may be quite a bit slower than invoking a method locally as usual. But since active containers are usually remote, the cost of the reflection is negligible compared to the cost of communication. Furthermore, RMI itself uses reflection since JDK 1.2, and thus this overhead is unfortunately common to the different frameworks. Moreover, the performance of a remote method invocation depends on the underlying transport layer. The UDP implementation is several times more efficient than the RMI one, but it does not offer the same guarantees. Note anyway that invoking a method in JACOb is asynchronous (on the server side), which reduces the impact of the reflection overhead on the invoker.

5. CONCLUSION AND FUTURE WORK
The concept of active containers provides an abstraction that can be used as a programming paradigm for distributed object applications. Remote active containers provide a clear separation between stored object management and communication channels. Moreover, stored objects can be any object, which enables dynamism. The concept solves issues traditionally found in distributed computing. Our implementation of this concept in Java, called JACOb, is the server part of the Mandala project, which provides reflective asynchronous remote method invocation. Any object can dynamically become remote and be accessed asynchronously. Several protocols may be used in JACOb, and other implementations are easy to achieve. The model is underspecified (remote active containers, partial failure, method invocation semantics (copy, reference, asynchronism)), allowing implementations to provide "goodies". Perhaps it should be specified more precisely to be really usable for proofs of program behavior. Further study of the model may answer this question. Future work includes the integration of security features into JACOb, the development of services, a customized distributed garbage collector and group communications.
We also plan to produce measurements for a set of applications in order to give potential users a good idea of what they can expect from our framework.

REFERENCES
[1] Serge Chaumette, Benoît Métrot, Pascal Grange, and Pierre Vignéras. JToe: a Java API for Object Exchange. To appear in Proc. of PARCO'03, September 2003.
[2] Serge Chaumette and Pascal Grange. Optimizing the execution of a distributed object oriented application by combining static and dynamic information. International Parallel and Distributed Processing Symposium, IPDPS 03, April 2003. Nice, France. Poster.
[3] Serge Chaumette and Pierre Vignéras. Extensible and Customizable Just-In-Time Security (JITS) management of Client-Server Communication in Java. To appear in Proc. of PARCO'03, September 2003.
[4] Microsoft Corporation. DCOM technical overview. Web page, November 1996. http://msdn.microsoft.com/library/en-us/dndcom/html/msdn_dcomtec.asp.
[5] Carlo Ghezzi and Giovanni Vigna. Mobile code paradigms and technologies: A case study. In Proceedings of the First International Workshop on Mobile Agents, Berlin, Germany, 1997.
[6] N. Jennings and M. Wooldridge. Agent-Oriented Software Engineering. Handbook of Agent Technology. AAAI/MIT Press, J. Bradshaw, edition, 2000.
[7] Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Videira Lopes, Jean-Marc Loingtier, and John Irwin. Aspect-oriented programming. In European Conference on Object-Oriented Programming (ECOOP), volume 1241 of LNCS. Springer-Verlag, 1997.
[8] Sun Microsystems. Java Naming and Directory Interface (JNDI). Web page, 2003. http://java.sun.com/products/jndi/.
[9] Robin Milner, Joachim Parrow, and David Walker. A calculus of mobile processes (parts I and II). Technical report, Laboratory for Foundations of Computer Science, Computer Science Department, Edinburgh University, June 1989.
[10] Myricom. Myrinet link and routing specification, 1995. http://www.myri.com/myricom/document.html.
[11] Jon Siegel. CORBA, Fundamentals and Programming. Wiley, 1996.
[12] Sun Microsystems. Enterprise JavaBeans Specification, June 2000. Version 2.0, Public Draft.
[13] Sun Microsystems. JavaSpaces - web server. http://java.sun.com/products/javaspaces/, 2000.
[14] Pierre Vignéras. Agents mobiles: une implémentation en Java et sa modélisation, June 1998. Mémoire de DEA.
[15] Vignéras, Pierre. The Mandala web site. Web page, November 2002. http://mandala.sf.net/.
[16] Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. A note on distributed computing. In Mobile Object Systems: Towards the Programmable Internet, pages 49-64. Springer-Verlag: Heidelberg, Germany, 1997.
Performance Evaluation of Parallel Genetic Algorithms for Optimization Problems of Different Complexity

P. Köchel and M. Riedel

Chemnitz University of Technology, Faculty of Computer Science, 09107 Chemnitz, Germany
E-mail: {pko,rima}@hrz.tu-chemnitz.de

In recent years, Genetic Algorithms (GAs) have proved to be very efficient in dealing with a large variety of optimization problems. Yet their execution time requirements are very high, especially for simulation optimization, which makes parallelization desirable. In this paper three different parallel implementations of a Genetic Algorithm are presented: a global master-slave, a coarse-grained and a hierarchical parallel GA (PGA). Their performance was evaluated by applying them to several optimization tasks and performing runtime tests on a large Beowulf Linux cluster.

1. INTRODUCTION
Finding optimal solutions for economic and scientific problems is a very important research topic. Whether one has to optimize production processes, to reduce transportation and storage costs or to exploit available resources as well as possible, the complexity of these applications is commonly so high that even modern computing technology is hardly or not at all able to optimize them. Evolutionary Algorithms (EAs) are a very popular and successful approach for the optimization of complex applications. Applying mechanisms of biological evolution to the optimization process, they have shown very good reliability [6]. Yet the demands on execution time are very high, especially for applications whose possible solutions cannot be evaluated by calculation but have to be estimated by simulation (simulation optimization). Because of the implicit parallelism of EAs, parallelization can be expected to lead to performance improvements. The goal of this work was to examine various parallelization strategies for EAs with regard to their applicability to different optimization problems. Our paper is organized as follows: Section 2 provides a short overview of PGAs. In section 3, the implementation of three different PGAs is described, while their performance is discussed in section 4. Finally, the achieved results are summarized in section 5.

2. PARALLEL GENETIC ALGORITHMS
Genetic Algorithms are the most widely examined subgroup of EAs. They operate on a set (population) of possible solutions (individuals) of a problem. An individual is an element of the space of solutions spanned by the problem-relevant parameters. The fitness of an individual
describes its quality to solve the corresponding problem. The GA propagates the members of a population by genetic operators simulating biological procedures (selection, mutation, recombination) to improve the fitness of the individuals and to find the individual that optimally solves the problem [6]. As biological evolution does not happen sequentially but in parallel, GAs can be parallelized easily. The research community today distinguishes between four groups of PGAs (see [1] and [3]):
1. global master-slave PGA: A master node manages the population while several slave nodes create new individuals and calculate their fitnesses. This is especially useful if the fitness calculation is time-intensive.
2. fine-grained PGA: The population is divided into a number of subpopulations that can consist of one or more individuals and are mapped to a two- or three-dimensional grid. New individuals are created by interaction of adjacent subpopulations, which causes a large amount of communication.
3. coarse-grained PGA: The population is divided into several subpopulations that are developed by the same GA independently of each other and exchange individuals among themselves. These algorithms are also known as island or distributed PGAs. There are many different strategies for the exchange of individuals.
4. hierarchical PGA: Several parallelization approaches are combined in one PGA.

3. IMPLEMENTATION
For our evaluation we implemented three different parallel versions of a rather simple but well performing GA. A realization of a fine-grained PGA would not have been wise, because the target computer, a large Linux cluster, is not suitable for programs with very intensive communication¹. The PGAs were implemented using C and LAM-MPI (see [7]).
¹The cluster nodes are connected by a 100 MBit/s Fast Ethernet network. Using MPI via TCP, simple tests showed a bandwidth of more than 10 MByte/s and a latency of about 80 microseconds.
[Diagram: a master node managing the population (selection, insertion, determination of the optimum) connected to several slave nodes; individuals are transferred between the master and the slaves.]
Figure 1. Organization of nodes for PGA1
Our PGA1, a global master-slave PGA, runs on a master node managing the population and on several slaves doing the evolution and fitness calculation. The slaves request individuals from the master, create child individuals by genetic operators, calculate their fitnesses and send the children back to the master. The master waits for requests from its slaves, selects individuals from the population and inserts new individuals into it. When the stopping criterion is fulfilled, the master determines the best individual (see figure 1). The PGA2 (see figure 2), a coarse-grained PGA, is intended to run on several equal nodes, each managing its own population and executing the same basic GA. At regular intervals, the nodes perform an exchange of individuals based on the network model: each node selects a number of individuals (in our case: 2) and sends them to all other nodes. The received individuals are inserted into the population, replacing individuals with worse fitnesses.
[Diagram: several equal master nodes, each executing the complete Genetic Algorithm on its own population, regularly exchanging individuals and transmitting local optima for the determination of the optimal solution.]
Figure 2. Organization of nodes for PGA2
Finally, we have combined these two approaches into a hierarchical PGA, the combined PGA (see figure 3): a number of equal masters each manage their own population and exchange individuals as in PGA2. Every master has several slaves at its disposal doing the evolution and fitness calculation.
4. PERFORMANCE EVALUATION
To evaluate the performance of the different parallel implementations we applied them to various test problems and performed an extensive series of runtime tests on the Chemnitz Linux Cluster (CLiC), a large Beowulf cluster with 528 nodes [2]. The first group of test applications contained a number of standard test functions for optimization methods [5]. They all possess a global minimum and certain qualities making optimization difficult. We have chosen three of them as an example to present and discuss the results: Rechenberg's sphere function (f1), the Griewangk function (f2) and the Schaffer function (f3). A full analysis of the test results of all used test functions can be found in [8].

[Diagram: a hierarchy of master nodes, each managing its own population and exchanging individuals with the other masters; every master additionally controls a set of slave nodes.]
Figure 3. Organization of nodes for combined PGA
[Surface plots of the three test functions.]
Figure 4. Rechenberg's sphere function. Figure 5. Griewangk function. Figure 6. Schaffer function.

f_1(\vec{x}) = \sum_{i=1}^{n} x_i^2, \qquad -5.12 \le x_i \le 5.12 \qquad (1)

f_2(\vec{x}) = \frac{1}{200} \sum_{i=1}^{n} x_i^2 - \prod_{i=1}^{n} \cos\!\left(\frac{x_i}{\sqrt{i}}\right) + 1, \qquad -600 \le x_i \le 600 \qquad (2)

f_3(\vec{x}) = 0.5 + \frac{\sin^2\!\sqrt{\sum_{i=1}^{n} x_i^2} - 0.5}{\left(1 + 0.001 \sum_{i=1}^{n} x_i^2\right)^2}, \qquad -100 \le x_i \le 100 \qquad (3)
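Written out directly in code, the three test functions read as follows. This is a sketch intended only as an illustration of equations (1) to (3) above, not the exact code used in [8]; the constants should be checked against [5].

    final class TestFunctions {
        // (1) Rechenberg's sphere function, -5.12 <= x[i] <= 5.12
        static double f1(double[] x) {
            double s = 0.0;
            for (double xi : x) s += xi * xi;
            return s;
        }

        // (2) Griewangk function, -600 <= x[i] <= 600
        static double f2(double[] x) {
            double sum = 0.0, prod = 1.0;
            for (int i = 0; i < x.length; i++) {
                sum += x[i] * x[i];
                prod *= Math.cos(x[i] / Math.sqrt(i + 1));
            }
            return sum / 200.0 - prod + 1.0;
        }

        // (3) Schaffer function, -100 <= x[i] <= 100
        static double f3(double[] x) {
            double sq = 0.0;
            for (double xi : x) sq += xi * xi;
            double s = Math.sin(Math.sqrt(sq));
            double num = s * s - 0.5;
            double den = (1.0 + 0.001 * sq) * (1.0 + 0.001 * sq);
            return 0.5 + num / den;
        }
    }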
The second test application was a complex Travelling Salesman Problem (TSP): the shortest journey around the world visiting 80 given locations should be found [5]. The fitness is calculated on the basis of a given distance matrix based on spherical trigonometry in the first case. In the second tested case, this is combined with a time-consuming heuristic trying to improve the newly created individuals.

Table 1 Runtime test results: test functions

algorithm  number of nodes  function     average number of individuals  average time
GA         1                Rechenberg   5,212.4                        0.020 s
GA         1                Griewangk    78,933.1                       0.516 s
GA         1                Schaffer     456,909.5                      1.645 s
PGA1       20               Rechenberg   6,080.1                        0.24 s
PGA1       20               Griewangk    95,433.9                       3.71 s
PGA1       20               Schaffer     255,283.2                      9.96 s
PGA2       20               Rechenberg   30,611.7                       0.05 s
PGA2       20               Griewangk    124,265.5                      0.24 s
PGA2       20               Schaffer     1,525,652.5                    2.45 s
A Fleet-Sizing-and-Allocation Problem (FSAP), a widespread realistic optimization problem [4], completes the test applications: the optimal size of a fleet of resources and their distribution among a number of consuming places has to be determined [8]. A car hire company with 5 locations and the aim of maximizing the profit served as the test example. The fitness (profit, measured in monetary units) of a possible solution of this problem cannot be calculated, but has to be determined by a linked simulator. A large quantity of runtime tests has been carried out to evaluate the performance of the parallel implementations optimizing these three groups of test applications. In this paper we can only present an expressive selection of empirical results; interested readers will find the complete results in [8]. All presented values (time and fitness resp. number of individuals) are averages of ten runtime tests. The numbers of nodes for the final tests of the parallel algorithms were chosen depending on their performance: we used the number of nodes that needed the least time possible under the condition that the average fitness was not significantly worse than the average fitness achieved by the basic GA.

Table 1 compares the results of the basic GA with those of the parallel implementations (PGA1 and PGA2) applied to the standard test functions. Clearly, the PGA1 needs approximately ten times the execution time of the basic GA. The reasons for this phenomenon are the qualities of the test applications and the structure of the PGA1: on the one hand, the fitness calculation of a solution for the test functions needs only very little execution time, and on the other hand there is a large additional communication effort to transport all parent individuals over the network from the master to the slaves and the child individuals back to the master node. The bad proportion between them leads to the increased execution time of the PGA1. Yet there is no significant difference in the number of individuals necessary to find the known optimum. Sometimes the basic GA performed slightly better, but sometimes the PGA1 was also able to find the optimum faster. The PGA2 needed much less execution time per created individual than the basic GA, but the number of necessary individuals was significantly higher. This leads to the conclusion that the idea of various mainly independent populations is not suitable for this group of test applications. As the parallel algorithms failed to perform better than the basic GA, the combined PGA has not been applied to the test functions.

Table 2 contains the most important runtime results of all algorithms applied to the TSP. Since its optimum is not exactly known², 50 million individuals were created in each of the ten runs.
²The best known solution has a fitness of 108,247.54664 km; see [8] or [5] for details.

Table 2 Runtime test results: TSP (50 million individuals)

algorithm     usage of heuristic  number of nodes   average fitness  average time
GA            no                  1                 112,018.88007    689.93 s
GA            yes                 1                 108,864.22635    29,360.74 s
PGA1          no                  15                112,781.08875    2,061.68 s
PGA1          yes                 40                108,986.45810    2,092.87 s
PGA2          no                  100               111,631.79487    7.46 s
PGA2          yes                 100               108,247.54664    295.78 s
combined PGA  yes                 440 (40 masters)  108,518.80157    206.82 s
Table 3 Runtime test results: FSAP (10 000 individuals)

algorithm     number of nodes    average fitness  average time
GA            1                  66.915521        13,123.00 s
PGA1          50                 68.839931        443.85 s
PGA1          100                68.664043        304.08 s
PGA2          20                 67.376490        794.45 s
PGA2          30                 67.600934        573.96 s
combined PGA  420 (20 masters)   66.248436        231.52 s
combined PGA  410 (10 masters)   66.066275        227.42 s
As can be clearly seen, the success of the PGA1 depends on the usage of the heuristic: if the fitness calculation is done without it, the PGA1 needs more time than the GA to create 50 million individuals. This is again caused by the bad proportion of the execution time needed per individual to the additional communication effort. But if the fitness calculation is combined with the heuristic to improve the individuals, a high speedup could be achieved. The execution time could be reduced from more than eight hours to about 35 minutes, which means a speedup of 14. The average fitnesses of the best individuals did not differ relevantly from the values of the GA. The PGA2 succeeds independently of the use of the heuristic. It achieved much shorter execution times in combination with a clearly better average fitness. Not using the heuristic, the execution time decreased from 689.93 to 7.46 seconds, which corresponds to a speedup of 92.5. When the heuristic was used, the speedup amounts to nearly 100. So while the basic GA needed more than eight hours, the PGA2 found a better average result in less than 5 minutes. The combined PGA was only applied to the TSP with heuristic, because the combination of PGA1 and PGA2 is probably only successful if both PGAs are successful in optimizing a certain problem. But as can be seen, the use of 440 cluster nodes (40 masters with 10 slaves each) brought only a little improvement in execution time compared to the PGA2 with 100 nodes, while the average fitness slightly declined. Table 3 finally shows the test results of the FSAP profit maximization with 10,000 individuals created in each run. The determination of the fitness of an individual by simulation takes about 1.3 s (measured on a CLiC node with an 800 MHz CPU). Obviously the PGA1 achieved the best results applied to the FSAP. Here, the time gained by the parallel evolution was so significant that the communication overhead was of hardly any
importance. Using the global master-slave PGA on 100 nodes resulted in a speedup of 44.7 compared to the basic GA. Additionally, the average fitness was clearly better than the one calculated by the GA. The PGA2 performed better than the basic GA, too, but with an increasing number of nodes the average fitness got worse. That is due to the relatively small number of created individuals that are distributed among the equal nodes. So we could run the PGA2 on at most 30 nodes only, achieving a speedup of 23, which is still very good. Unfortunately, the combined PGA did not live up to the expectations. Although the execution times of the runtime tests with 420 nodes (20 masters with 20 slaves each) and 410 nodes (10 masters with 40 slaves each) were a bit better than the times of the other PGAs, the average fitnesses were not equally good.

5. CONCLUSIONS
In our paper we presented three different parallel implementations of a Genetic Algorithm and evaluated their performance on a Beowulf cluster by applying them to a number of optimization problems of different complexity. The extensive series of runtime tests showed that the applicability and success of a parallel realization depend to a high degree on the specific optimization task. We can conclude from our experimental results that the use of a parallel GA instead of a serial GA is not sensible for applications with a rather simple fitness calculation. But if the execution time necessary for the fitness calculation increases, parallel GAs perform in most cases much better and with an enormous speedup. A global master-slave PGA is most suitable if this time is especially long, e.g. due to a necessary simulation, and therefore the number of created individuals is not that high. If the proportion of time for fitness calculation to the number of created individuals resembles the proportion of the examined TSP, a coarse-grained PGA should be the best choice. Based on these results, our future research will deal with further promising parallelization models for EAs and a more detailed analysis of hierarchical PGAs.

REFERENCES
[1] Aguirre, H. E. / Tanaka, K. / Oshita, S.: "Increasing the Robustness of Distributed Genetic Algorithms by Parallel Cooperative-Competitive Genetic Operators." In: Spector, L. et al. (eds.): Proceedings of the Genetic and Evolutionary Computation Conference. San Francisco, California: Morgan Kaufmann Publishers, 2001, pp. 195-202
[2] Becher, M. / Petersen, K. / Riedel, W.: WWW-Seiten zum Chemnitzer Linux-Cluster. TU Chemnitz: http://www.tu-chemnitz.de/urz/anwendungen/CLIC/, 2001
[3] De Falco, I. / Del Balio, R. / Tarantino, E. / Vaccaro, R.: "Mapping Parallel Genetic Algorithms on WK-Recursive Topologies." In: Albrecht, R. F. / Reeves, C. R. / Steele, N. C. (eds.): Artificial Neural Nets and Genetic Algorithms: Proceedings of the International Conference in Innsbruck, Austria, 1993. Wien, New York: Springer, 1993, pp. 338-343
[4] Köchel, P. / Kunze, S. / Nieländer, U.: "Optimal Control of a Distributed Service System with Moving Resources: Application to the Fleet Sizing and Allocation Problem." International Journal of Production Economics, 2003, Volume 81-82, pp. 443-459
[5] Nieländer, U.: Zur optimalen Konfigurierung und Steuerung diskreter Systeme, insbesondere Fertigungssysteme, mittels Evolutionärer Algorithmen. Technische Universität Chemnitz-Zwickau: Diplomarbeit, 1996
[6] Nissen, V.: Evolutionäre Algorithmen: Darstellung, Beispiele, betriebswirtschaftliche Anwendungsmöglichkeiten. Wiesbaden: Deutscher Universitäts-Verlag, 1994
[7] Rauber, T. / Rünger, G.: Parallele und verteilte Programmierung. Berlin, Heidelberg: Springer, 2000
[8] Riedel, M.: Parallele Genetische Algorithmen mit Anwendungen. TU Chemnitz: Diplomarbeit, 2002. http://archiv.tu-chemnitz.de/pub/2002/0134/
Extensible and Customizable Just-In-Time Security (JITS) Management of Client-Server Communication in Java*
Illustration on a simple electronic wallet

S. Chaumette and P. Vignéras

LaBRI, Laboratoire Bordelais de Recherche en Informatique, Université Bordeaux 1, 351 Cours de la Libération, 33405 TALENCE, FRANCE
{Serge.Chaumette, Pierre.Vigneras}@labri.u-bordeaux.fr

Security is becoming a more and more important issue when considering communication between clients and services. This is due to the critical aspects of the services and information now being dealt with (electronic wallets, private medical information, etc.) and to the fact that multi-applicative devices (Java cards [12], PDAs [4], etc.) are becoming of common use, which might give rise to some interest among hackers. Over-the-air communication technologies such as Bluetooth [8, 15], IrDA [24] or 802.11 [2] make the exchange of information even less secure, since the communication signal is straightforward to capture. In this paper we present a framework that we have designed to provide what we call Just In Time Security (JITS) management of data exchange. This mechanism makes it possible to dynamically install a communication protocol between a client and a server, without the client or the server even noticing. It makes it possible to secure legacy services and to decide on the fly the security protocol to run.

1. INTRODUCTION
The work presented in this paper is carried out at the Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université Bordeaux 1 and CNRS. It takes place within the Distributed Objects and Systems (SOD) research team. The main characteristic of our activities is the focus on the Java [3] programming language, or more precisely on the features it provides. We use these features to set up and prove properties of operational distributed environments for use in industry. We are currently working on an Extensible and Customizable Just-In-Time Security (JITS) management system. The aim of the JITS is to enable the installation of a communication protocol in a dynamic manner between two systems that need to communicate, i.e. a client and a server. It might be the case that the protocol is unknown when the service is developed, and it may be chosen by the service, the client or even an external entity. This JITS is based and
322 implemented on top of JACOb[21, 22]. JACOb is a system close in some sense to the Enterprise Java Beans[5, 19] and Jini[17, 18]. It has been developed in our team. It makes it possible to run communicating distributed components and to make legacy services available through the network, i.e. remotely. Throughout this paper, we will illustrate our mechanism on an electronic wallet, that we made as simple as possible for the sake of explanation. The methods of this service are shown in the server side of figure 1.
Figure 1. A basic electronic wallet service and its client (the server and client sides communicate through a "safe dialog").
2. RATIONALE FOR A JITS
The framework we offer supports features that we believe are enabling technologies for the widespread exchange of sensitive information over the various networks, either wired or over the air. In this section we first describe the main problems one has to face regarding the security of communication. Based on this analysis, we then build the rationale for the JITS framework that we have set up.
2.1. Analysis of the problems
- Problems related to standards.
There are some characteristics of communication security standards that must be taken into account in order to propose an open security framework:
• the wide number of available algorithms and techniques prevents definitely choosing a specific one, and therefore requires adaptability;
• it is sometimes difficult to evaluate the available algorithms properly; algorithms considered robust today may become obsolete tomorrow and have to be replaced by more robust ones;
• new algorithms are invented regularly and there must be room to integrate them into pre-existing software environments.
- Problems related to technology.
It is more and more often the case that protocols are wired in the hardware; new cards will require new drivers and protocols to be installed in a dynamic manner.
- Problems related to users.
For psychological rather than technical reasons, users will not necessarily trust security layers provided by someone else. They will prefer to install their own usual communication system to communicate with a given service.
2.2. Necessary features for a JITS
From the above analysis, we can derive a set of constraints that have to be met by a JITS. Here are the most important:
• support for future protocols that are not even identified yet;
• support for the selection of the protocol by the service, the client or even an external entity;
• support for dynamic management of the set of communication protocols available for a given service.
3. THE JITS FRAMEWORK
3.1. Related work and the base architecture
We are currently building our framework on top of an environment that we have designed and implemented, called JACOb [22, 21], for Java Active Container of Objects. The current release of JACOb makes it possible to make any object or service remote, without any need for compilation, dedicated keywords or declarations in the object being dealt with. One of the direct consequences of this is that it supports legacy code. We believe that the environment that we propose does not really compare to other existing projects, although the system it relies upon, JACOb [22, 21], can be compared in certain aspects to Jini [17, 18] or Enterprise Java Beans [5, 19], all of these systems offering a framework and a software run-time environment that make it possible to run communicating distributed components. The main fundamental difference when comparing JACOb to other systems is that there is no need to modify the code of an object to make it accessible remotely. Furthermore, within usual systems, once a service has been developed there is no way to change the protocol that it uses to communicate with its clients: this feature is one of the main characteristics offered by our JITS framework. Objects or services do not need to be developed specifically to embed network access features, as is the case when using Java RMI [16], CORBA [14, 23], DCOM [11] or similar software environments. The service is integrated inside an active container (figure 2) that offers a minimal interface (a call method that uses reflection [1]) to invoke the methods of the objects it contains. For objects to be accessed remotely, it is sufficient that the active container is accessible remotely, which is possible because JACOb supports this feature - indeed it supports separation of concerns and aspect [9, 10] programming. Of course, a client side proxy can be developed (figure 3) to support the same interface on both server and client side. This implies that this proxy necessarily uses the communication protocol imposed by the active container, since it is its peer for the communication (the active container then forwards calls to the effective service). We prefer the solution where both a client side and a server side proxy are used, as described in the next section.
Figure 2. A service inside an active container: a client issues call(key, "credit", amount) on the container, which results in the invocation of credit(amount) on the contained service.
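The following sketch illustrates the kind of minimal interface meant here: a single call method that locates a stored object by key and invokes one of its methods through core reflection [1], as depicted in Figure 2. It is only an illustration of the principle; the real JACOb API is not reproduced, and the class and method names are assumptions.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Illustrative active container: objects are stored under a key and invoked
// generically via reflection, so the contained service needs no special code.
public class SimpleActiveContainer {
    private final Map<Object, Object> objects = new HashMap<>();

    public void put(Object key, Object service) {
        objects.put(key, service);
    }

    // call(key, "credit", amount) ends up invoking credit(amount) on the
    // object registered under 'key' (cf. Figure 2). Overload resolution is
    // simplified: the first public method with a matching name and arity wins.
    public Object call(Object key, String methodName, Object... args) throws Exception {
        Object target = objects.get(key);
        for (Method m : target.getClass().getMethods()) {
            if (m.getName().equals(methodName) && m.getParameterCount() == args.length) {
                return m.invoke(target, args);
            }
        }
        throw new NoSuchMethodException(methodName);
    }
}
```

A client holding a reference to the container (directly or through a proxy) would then write something like container.call(key, "credit", 100L) instead of calling the wallet directly.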
[Figure: a client side proxy forwards call(key, "credit", amount) over the network to the active container, which invokes credit(amount) on the service.]
Figure 3. A client side proxy
3.2. Building the JITS on top of JACOb
The technique that we use to implement our JITS consists in setting up what we call a protocol manager. This manager is in charge of managing the communication environment for a given service (figure 4). The protocol manager creates an intermediate object, which we call a server side proxy, that is in charge of the communication with the client and of implementing the selected security protocol. We now describe the steps required to effectively use the JITS framework (see figure 5).
Figure 4. A service, its protocol manager, and a server side proxy (the labels in the original figure show a call(key, "s", p) being mapped to an invocation of s(int p) on the service).
Step 1. An active container (1) must be running. A service can also be running (2), possibly inside the active container.
Step 2. The client application uploads a protocol manager (3) to handle the given service inside the active container.
Step 3. The client requests the creation of an instance of a given protocol proxy (4), possibly uploading a protocol handler to the protocol manager.
Step 4. The client gets as a result a client side proxy (5) to communicate with the service, or more precisely with the server side proxy of the service. The service and the client have no knowledge of the protocol that is being used to communicate between the two proxies. Many possibilities are available, depending on the goal of the application or of the service or on some external policy: direct communication between the client side proxy and the server side proxy; or communication through the call method of the active container, in order to achieve some additional operation possibly implemented by the active container (transaction, authentication, etc.). This last solution furthermore makes it possible to implement generic proxies and hence generic protocol managers. The client and the server can now communicate using the protocol that has been dynamically installed; a sketch of the interfaces implied by these steps is given below.
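The interfaces below summarise steps 1-4 in code form. They are purely hypothetical - none of these names are taken from the JACOb or JITS API - and only serve to show where a dynamically uploaded protocol plugs in.

```java
// Hypothetical shapes of the JITS building blocks; all names are assumptions.
interface ProtocolHandler {
    // One concrete security protocol (encryption, authentication, ...).
    byte[] wrap(byte[] plain) throws Exception;
    byte[] unwrap(byte[] secured) throws Exception;
}

interface ProtocolManager {
    // Steps 3 and 4: create a server side proxy speaking the requested
    // protocol and hand back the matching client side proxy.
    ClientSideProxy createProxy(Object serviceKey, ProtocolHandler handler) throws Exception;
}

interface ClientSideProxy {
    // Mirrors the minimal container interface on the client; the transport
    // between the two proxies uses the dynamically installed protocol.
    Object call(String methodName, Object... args) throws Exception;
}
```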
[Figure: the client side proxy (5) on the client and the server side proxy (4) on the server, set up by the protocol manager.]
Figure 5. Client side proxy, server side proxy and protocol manager
4. UNDERLYING FORMAL MODEL
The concept of active container relies upon a strong foundation. We have a π-calculus [13] formal model [20] of their implementation and operation. We believe that this model will help in the process of proving, and hence convincing potential users of, the reliability of the features provided by our system. This model is based on the π-calculus because it provides support for modeling the migration of code and of communication channels. It is close to the effective implementation, which is an advantage for us. The formal model of our system is outside the scope of this paper and will not be detailed here.
5. CONCLUSION AND FUTURE WORK
We believe that the system we propose solves the problem of making legacy services more secure. It also provides an appropriate environment to support security for new services. The current implementation is based on Java, but nothing prevents the services from being developed in any language. Of course, we also need to secure the basic building blocks of the system itself, i.e. JACOb. We will use the security features provided by the Java language [7] and by JAAS [6], the Java Authentication and Authorization Service. We are in the process of porting an implementation of our JITS to iPAQ PDAs with Bluetooth communication, so as to demonstrate its use for over-the-air communication between PDAs.
REFERENCES
[1] Java core reflection: API and specification, 1997.
[2] The working group for wireless local area networks, 2001.
[3] Ken Arnold, James Gosling, and David Holmes. The Java Programming Language. Addison-Wesley, third edition, 2000.
[4] 3Com Corporation. PalmPilot.
[5] R. Englander. Java Beans, Guide du programmeur. O'Reilly, 1997.
[6] C. Lai, L. Gong, L. Koved, A. Nadalin, and R. Schemers. User Authentication and Authorization in the Java Platform. Technical report, Sun Microsystems, 1999.
[7] Li Gong. Inside Java 2 Platform Security: Architecture, API Design, and Implementation. Addison-Wesley, June 1999. ISBN 0-20131-000-7.
[8] J. Haartsen. The Bluetooth radio system, 2000.
[9] Timothy Highley, Michael Lack, and Perry Myers. Aspect oriented programming: A critical analysis of a new programming paradigm. Technical Report CS-99-29, 1999.
[10] Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Lopes, Jean-Marc Loingtier, and John Irwin. Aspect-oriented programming. In Mehmet Akşit and Satoshi Matsuoka, editors, ECOOP '97 - Object-Oriented Programming, 11th European Conference, Jyväskylä, Finland, volume 1241, pages 220-242. Springer-Verlag, New York, NY, 1997.
[11] Microsoft Corporation. DCOM Technical Overview. Technical report, Microsoft Corporation, 1996.
[12] Sun Microsystems. Java Card technology, 2000.
[13] Robin Milner. Communicating and Mobile Systems: the π-calculus. Cambridge University Press, Cambridge, UK, 1999.
[14] J. Siegel. CORBA Fundamentals and Programming. Wiley, 1996.
[15] Bluetooth Special Interest Group. Specification of the Bluetooth system.
[16] Sun Microsystems. Java Remote Method Invocation Specification, 1998.
[17] Sun Microsystems. Jini Architectural Overview. Technical report, Sun Microsystems, January 1999.
[18] Sun Microsystems. A Collection of Jini Technology Helper Utilities and Services Specifications. Technical report, Sun Microsystems, October 2000.
[19] Sun Microsystems. Enterprise JavaBeans Specification, June 2000. Version 2.0, Public Draft.
[20] P. Vignéras. Agents mobiles : une implémentation en Java et sa modélisation. DEA report, LaBRI, Université Bordeaux I, June 1998.
[21] P. Vignéras. Jacob : un support d'exécution pour la distribution d'objets en Java. In JCS'2000, RenPar'2000, June 2000.
[22] Pierre Vignéras. Jacob: a software framework to support the development of e-services, and its comparison to Enterprise JavaBeans. In Perspectives for Commercial and Scientific Environments, pages 11-12, University of Technology Munich, 19-20 April 2001. International Workshop on Performance-Oriented Application Development for Distributed Architectures (PADDA).
[23] S. Vinoski. CORBA: Integrating Diverse Applications Within Distributed Heterogeneous Environments. IEEE Communications Magazine, 35(2), February 1997.
[24] S. Williams. IrDA: Past, present and future, 2000.
Applications & Simulation
An Object-Oriented Parallel Multidisciplinary Simulation System - The SimServer
U. Tremel(a), F. Deister(a), K.A. Sorensen(a), H. Rieger(a), and N.P. Weatherill(b)
(a) Flight Physics Department, EADS Military Aircraft, 81663 Munich, Germany
(b) School of Engineering, University of Wales Swansea, Singleton Park, Swansea SA2 8PP, United Kingdom

In this paper a new object-oriented (OO) approach is presented for a parallel multidisciplinary simulation environment, the SimServer system. It has been designed for aerodynamic problems solved by the application of nonlinear computational fluid dynamics (CFD) methods based on the Euler and Navier-Stokes equations on unstructured hybrid grids. The OO design and implementation of the system, the top level components and the integration of existing parallel CFD solvers (EADS-M Airplane+, DLR τ-code) are illustrated. A novel modelling approach for the execution of a mono- or multidisciplinary simulation via events is demonstrated. Important parallelisation aspects based on the message passing paradigm are highlighted and the advantages and limitations of the presented concepts are discussed. Additionally, the incorporated coupling algorithms in space and time are described, supporting both strongly coupled aeroelastic calculations and transient aerodynamic simulations strongly coupled to flight mechanics with up to six degrees of freedom. The paper concludes with multidisciplinary examples demonstrating the capabilities of the approach. Presented are aeroelastic calculations and coupled aerodynamic simulations with free motions.
1. INTRODUCTION
Computational fluid dynamics (CFD) has become an accepted tool for the prediction of incompressible and compressible fluid flow in research and industry. Especially the CFD methods based on the Euler and Navier-Stokes equations offer highly accurate results and are therefore generally applied for aerodynamic simulations. The price to be paid for the accuracy is the huge computational expense caused by the solution of the nonlinear system of equations on the large computational volume grids required. Commonly used grid types are of the block-structured, unstructured and Cartesian types, where unstructured meshes are preferred for geometrically complex configurations, such as fighter aircraft, due to the highly automatic mesh generation tools available. Additionally they offer flexible domain decomposition and mesh adaptation possibilities, both for steady-state and transient calculations, and have therefore been chosen for the new simulation system presented in this paper. This SimServer system is a new object-oriented (OO) approach for a parallel multidisciplinary simulation environment. Already existing, validated, and highly optimised parallel CFD solvers such as the EADS-M Airplane+
and the DLR τ-code [1] are tightly integrated in library form to preserve the high computational performance offered by the stand-alone versions of these solvers. Via specific wrapper and adaptor layers the solver kernels are controlled by top-level objects, themselves embedded in a simulation shell. The OO design and implementation of the system, the top level components and the integration of the parallel CFD solvers are described in section 2. It is essential for an efficient execution of a large aerodynamic simulation that all relevant steps are parallelised. Inside the SimServer the message-passing paradigm in the form of the MPI standard [2] is used for the parallelisation. The implications of this approach and other important parallelisation aspects are discussed in section 3. Supported are steady and transient flow simulations, which can be strongly coupled to other disciplines by means of integrated coupling algorithms in space and time, described in section 4. In this way aeroelastic calculations are supported as well as aerodynamic simulations strongly coupled with flight mechanics calculations, enabling free motions with up to six degrees of freedom. In section 5, multidisciplinary examples are shown, demonstrating the capabilities of the approach. Coupled aeroelastic calculations are presented as well as results of coupled aerodynamic-flight mechanical computations.
2. SYSTEM ARCHITECTURE
An overview of the SimServer system is shown in Figure 1. Inside the simulation shell the objects can be divided generally into discipline specific objects and global shared objects encapsulating interdisciplinary data.
Figure 1. Overview of the SimServer system.
Figure 2. SimServer-system architecture.
The latter category includes the logical model, in which the configuration to be simulated is described in the form of a tree of model nodes defining the hierarchy of all components, parts, subparts, etc. and the kinematic relationships between them. Additional mappings from a model node to the groups of the CAD (Computer Aided Design) geometry, to parts of the CFD mesh and to groups of the structural model are also encapsulated in this object. They are used during the simulation to map definitions made on abstract model nodes to the corresponding parts of the real computational models. For example, the CFD boundary conditions are defined on model nodes and are mapped internally to the surfaces in the CFD mesh during runtime.
The same mechanism is used to extract the spatial coupling surface parts, defined via references to model nodes, from both the CFD and CSM grids during runtime. The next simulation-wide object is the motion component, which is also shared between the disciplines. Here the prescribed and free movements are handled during a transient simulation. Both the logical model and the motion module are replicated across all processes due to their very small memory requirements. In contrast, large memory requirements dictate the use of data parallelism in the CFD and CSM objects, which are in charge of the discipline specific tasks. Internally, the selected parallel solver kernels are initialised, controlled and monitored, and the necessary parameter and data conversions and preparations are performed. Aeroelastic simulations require a spatial and temporal coupling of the aerodynamics and the structural mechanics. The spatial coupling is performed in a separate object, encapsulating the different methods available; the coupling in time is discussed in section 4. To support data extraction during runtime, a special data extraction module is available, allowing typical engineering and analysis data such as aerodynamic loads and arbitrary submeshes to be extracted in parallel during runtime. All components are controlled by one master object, the SimServer itself.
2.1. Simulation events
For flexibility reasons a novel state-machine based approach has been developed inside the multidisciplinary simulation environment. The simulation is modelled by a series of events to be executed one after the other. The sequence of events is managed in an internal event queue and the events are executed in last-in-first-out (LIFO) order. Possible event types are simulation, logging, data extraction, checkpoint, parameter update or stop events, where their specific actions clearly depend on the type of simulation specified (discipline specific and multidisciplinary modes are available). Prior to a run, 'auto-events' can be specified which are submitted automatically during the calculation by the simulation events. This enables, for example, logging and data extraction automatically after each time step, after a certain number of exchange cycles or even after every n solver cycles. If the event queue is empty, a default event is executed, generally mapped to a simulation event causing the calculation to be continued. The solver kernels are incorporated in such a way that they keep their state from event to event and only act as a filter on a given discrete solution vector. In the case of transient simulations the operations necessary to start a new time step are implemented outside the kernels. Currently additional events can be submitted during runtime via simple ASCII files; other communication means, such as CORBA, are planned. It is worth noticing that this new state-machine based approach in a multidisciplinary simulation environment now enables the termination and restart of transient multidisciplinary simulations even within a time step, during the pseudo time step loop of the dual time stepping scheme, without any loss in the order of time accuracy.
2.2. SimServer - System components
By means of object-orientation, highly complex parallel tasks can be controlled more easily compared to a traditional structured programming approach [3]. Together with the code maintainability and extensibility advantages, OO has proven to be a viable technology for such problem areas. As implementation language, C++ has been chosen, offering many OO techniques such as encapsulation, inheritance, polymorphism and generic programming [4], which
-
•:•:•:•:•:•:•:••:•:•:•::••:••:•:••:•:•:•:•:••:•:/:•:•:•:•:•:•:•:•:•:•:••:••••••••••:•:•:•••••••••:• ..................
~ ,.._.face , ~ S o" I v : : i e,' ...................................................
....
......
-
ii!ii ()
~he~,,po~,,t ()
ors
s t o p k ~ i: :::: : i : : : : i i l : ]
etc.
()
RunSolver -
~m~lers
.....
- Init
-
weights
-
etc.
(MODAL)
,...............
m
Figure 3. Discipline specific controllers.
Figure 4. Flight mechanics integration.
Figure 3 illustrates in detail the design of the discipline specific controllers. They export a small set of methods for initialising and running the user specified solver kernel, performing the necessary updates to start a new time step, and executing all steps necessary to create a consistent checkpoint. Each solver kernel is encapsulated by a kernel-dependent wrapper class derived from a generic CFD or CSM solver class. The generic solver class defines the interface to be used by the controller, and the specific solver class contains the driver functionality required to implement the interface.
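A minimal sketch of this wrapper layering is given below (again in Java for illustration only; the actual classes are C++, and all names are assumptions). The controller only sees the generic interface; the kernel-specific class contains the driver code that maps it onto the legacy solver library.

```java
// Generic solver interface as seen by a discipline specific controller.
interface SolverKernel {
    void init(String parameterFile);     // set up the kernel once
    void runCycles(int n);               // advance the discrete solution
    void startNewTimeStep();             // transient simulations only
    void writeCheckpoint(String path);   // dump a consistent state
}

// Kernel-dependent wrapper: translates the generic calls into whatever the
// legacy (C/Fortran) solver library expects. Bodies are omitted here.
class LegacyFlowSolverWrapper implements SolverKernel {
    public void init(String parameterFile)   { /* convert parameters, call kernel setup */ }
    public void runCycles(int n)             { /* invoke the solver kernel n times */ }
    public void startNewTimeStep()           { /* shift solution levels, move the grid */ }
    public void writeCheckpoint(String path) { /* collect and write the kernel state */ }
}
```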
Transient movements during a simulation are realised by means of the motion controller object shown in Figure 4. Two kinds of motions can be handled: prescribed and free. The prescribed movements, such as oscillatory motions or flight trajectories, are administrated by a separate coordinator in an internal motion database. In contrast, free motions with up to six degrees of freedom are calculated during runtime by the flight mechanics controller. Each air vehicle object is linked to a certain model node and obtains the actual acting forces and moments from the corresponding surfaces linked to the model node. These input data for the six degrees of freedom solvers are extracted in parallel on-the-fly by means of the data extraction component offered by the CFD simulation manager. In the current approach, each movement is automatically applied to the complete CFD grid or to the corresponding parts of it, and via parallel mesh deformation the movement is diffused into the volume mesh. The removal of badly distorted elements is not yet available. A local remeshing procedure is planned for the near future to support the large local movements required for simulations with independently moving bodies and for solution adaptation.
3. PARALLELISATION ASPECTS
The requirements of an aerodynamic simulation based on the Euler or Navier-Stokes equations in terms of computational power and memory are large and dominate even coupled aeroelastic simulations. Therefore, to achieve short turn-around times, it is essential that all relevant steps of the CFD simulation are parallelised to utilise the computational power of parallel computers. Parallel PC-cluster systems (running Linux) are used more and more in the aerospace industry today, as they have been demonstrated to be a real alternative to expensive high performance vector computers in terms of computational power and price-performance ratio for the computational problems considered here. Such monetary savings however entail restrictions in terms of the maximum local memory available. For systems built from standard PC modules, the local memory limitations are more stringent compared to current supercomputers with huge amounts of shared memory per node. Therefore, the algorithms implementing the preparation of a parallel computation have to respect such limits, especially the algorithms to partition the input data in the form of the large CFD volume mesh and the preprocessing routines for the flow solver. Inside the SimServer the message-passing paradigm is used for the parallelisation, applying the MPI interface. This supports the usage of both large shared memory systems and cost-effective distributed memory parallel hardware based on off-the-shelf components. Another reason to choose the MPI interface is the usage of this interface in the flow solver kernels of Airplane+ and the DLR τ-code, as well as in other parallel libraries used by the SimServer. Hence, via MPI, parallel programming compatibility can be achieved. Concerning aeroelastic calculations, first all the available processes are divided into three groups: a CFD, a spatial coupling and a CSM group. Currently the CSM group contains one or more CSM processes (CSM solver dependent), the spatial coupling group contains one process only, and all other processes are assigned to the CFD group. Between these groups the exchange of parameters and quantities is based on MPI intercommunicators; within each group MPI intracommunicators are used.
It is assumed that both the CFD and CSM spatial coupling surfaces fit into the memory of the spatial coupling process. If very large meshes are to be used, a parallelisation of the spatial coupling component is envisaged, similar to the approach performed in MpCCI[5].
4. COUPLING ALGORITHMS
Both steady and transient flow simulations are supported by the SimServer, and they can be coupled strongly to other disciplines by means of integrated coupling algorithms in space and time. In this way aeroelastic calculations are supported, as well as aerodynamic simulations strongly coupled with flight mechanics calculations, enabling free motions with up to six degrees of freedom (6-DOF). For both types of coupled computations, a dual time-stepping approach is used inside the solvers. The integration in time is based on a second order accurate implicit backward Euler formulation, both in the CFD solvers, in the structural modal solver and in the 6-DOF solver. During the pseudo time steps not only one but several exchange cycles are performed until a converged steady state is reached. This scheme guarantees a strong coupling of the disciplines in time and is illustrated in Figure 5 for a transient simulation.
[Figure: physical time steps, each containing a pseudo time loop with repeated exchange cycles, iterated until convergence or until a maximum number of steps is reached (difference in relative motions or difference in aerodynamic loads below a threshold).]
Figure 5. Temporal coupling strategy.
[Figure: roll angle versus time for JP 12.15 CE-II Case 1 (Mach = 0.85, α = 9°, φ₀ = 40°), comparing strong and weak coupling with Δt = 0.003 s and Δt = 0.01 s.]
noticing that the strong coupling approach can easily be parameterised to act as weak coupling by limiting the number of exchange cycles to only one. A comparison of strong and weak coupling is shown in Figure 6 and clearly demonstrates the necessity of a strong coupling. Especially in the case of large time steps, the weak coupling becomes instable while the strongly coupled simulation still converges to the expected results. More details about the calculation performed are given in section 5. 5. EXAMPLES The first example presented is an aerodynamic simulation strongly coupled with a flight mechanics calculation. For this purpose Case 1 of the Common Exercise II (CE-II) from the WEAG THALES Joint Programme 12.15 (JP 12.15)[6] was selected. In this test case a delta wing is released at a roll angle q~of 40 degree and is freely rotating around the body roll axis, the angle of attack a is defined to be nine degree at zero roll angle. In Figure 7 the results of that coupled computation are shown. Both the current roll angle and the roll moment are plotted
337
[Figure: roll angle and roll moment versus time [s] for JP 12.15 CE-II Case 1 (Mach = 0.85, α = 9°, φ₀ = 40°), DLR τ-code Euler calculation on a grid of about 1 million points.]
Figure 7. Freely rolling delta wing.
Because no vortex breakdown occurs, the oscillations of the model are damped out and the model settles at zero roll angle. For this simulation a constant time step size of ten milliseconds was used, allowing one cycle to be resolved by more than thirty physical time steps. In total, 180 physical time steps were performed for the complete simulation. In the pseudo time steps, at least every 20 multigrid cycles an exchange cycle was performed to update the current aerodynamic forces and moments for the flight mechanics calculation. After that, the current position for the aerodynamics computation was updated. This was iterated until a converged state in terms of CFD solution and roll angle was obtained. The simulation was calculated on 32 XEON 2.4 GHz processors of a Linux cluster with a QUADRICS interconnect. About 96% of the wall clock time was spent in the CFD solver kernel (DLR τ-code), 1% in the parallel preprocessing, 1.5% of the time was occupied by I/O operations, and the rest was dedicated to data extraction, logging and other operations. Another example included is an aeroelastic manoeuvre simulation presented in [7], performed with an initial development version of the SimServer. The X-31 aircraft is flying along a prescribed flight path and the aeroelastic behaviour is calculated and monitored. The goal of this simulation was to demonstrate the functionality of the SimServer. Although a real balance of forces was not simulated, due to the fixed trajectory and the rigid control surfaces, a meaningful transient aeroelastic behaviour was obtained.
Figure 8. X-31 aeroelastic manoeuvre simulation.
6. CONCLUSIONS
The recently developed SimServer system has been presented. The OO design and implementation of the system was described together with the top level components and the integration of existing parallel CFD solvers (EADS-M Airplane+, DLR τ-code). A novel modelling approach for the execution of a mono- or multidisciplinary simulation based on events was illustrated, and special aspects concerning the parallelisation based on the message passing paradigm were highlighted. The integrated coupling algorithms were described, which support strongly coupled aeroelastic calculations and transient aerodynamic simulations strongly coupled with flight mechanics calculations with up to six degrees of freedom. Finally, the presented multidisciplinary examples demonstrated the capabilities of the system.
ACKNOWLEDGEMENTS
The authors would like to thank EADS Military Aircraft, Ottobrunn, for supporting this work. Special acknowledgement has to be given to all colleagues in the numerical simulation department for their great support and to the members of the DLR Institute of Aerodynamics and Flow Technology for their assistance in adapting the τ-code to the presented system.
REFERENCES
[1] Martin Galle. Ein Verfahren zur numerischen Simulation kompressibler, reibungsbehafteter Strömungen auf hybriden Netzen. Research report No. 99-04, DLR, 1999.
[2] M. Snir et al. MPI - The Complete Reference. The MIT Press, 2nd edition, 1998. Vol. 1, The MPI Core.
[3] Helmut Balzert. Lehrbuch der Software-Technik. Spektrum Akademischer Verlag, 1998. Software-Entwicklung.
[4] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley, 3rd edition, 1997.
[5] R. Ahrem et al. Specification of MpCCI Version 1.3. Technical report, Fraunhofer SCAI, Sankt Augustin, 2002.
[6] M. T. Arthur. Computation of transonic, vortical flow over a delta wing fuselage configuration in free-to-roll motion (unclassified). Proposal for Common Exercise II QINETIQ/FST/SP023879/2.0, QinetiQ, 2002. WEAG THALES Joint Programme 12.15.
[7] L. Fornasier, H. Rieger, U. Tremel, and E. van der Weide. Time-Dependent Aeroelastic Simulation of Rapid Manoeuvring Aircraft. In AIAA Paper 2002-0949. AIAA, January 2002.
Computer Simulation of Action Potential Propagation on Cardiac Tissues: An Efficient and Scalable Parallel Approach
J.M. Alonso(a)*, J.M. Ferrero (Jr.)(b), V. Hernández(a), G. Moltó(a), M. Monserrat(b), and J. Saiz(b)
(a) Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022, Valencia, Spain.
(b) Departamento de Ingeniería Electrónica, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022, Valencia, Spain.
* This work has been supported in part by the Spanish Research Grant TIC 2001-2686 (CAMAEC project) and the Universidad Politécnica de Valencia.

The simulation of action potential propagation on cardiac tissues has become a very time consuming task with large memory requirements. Realistic ionic models, combined with the ever-increasing need to simulate larger tissues for longer times, demand the use of very large-scale computational resources. This paper presents a complete parallel computing implementation for the simulation of the electrical activity of a two-dimensional monodomain cardiac tissue using a cost-effective Beowulf cluster. High performance computing techniques have reduced the simulation time by a factor of nearly the number of processors, reaching 94% efficiency when using 32 processes. Besides, the application of stable numerical ODE integration methods has allowed the use of larger timesteps, reducing the whole simulation time. Moreover, the use of a GRID infrastructure has allowed the simultaneous execution of a huge number of different parametric simulations, thus increasing the global research productivity.
1. INTRODUCTION
Experimental observations of cardiac cell electrophysiological properties have been extensively carried out in the past six decades. The invention of the voltage-clamp technique by Hodgkin, Huxley, and others in the late 1940s, and the more recent development of the patch-clamp method by Sakmann and Neher [1], have made it possible for electrophysiologists to record and measure ionic currents across the cell membrane. These explorations have been instrumental in understanding the genesis of the action potential, an electric signal that, in normal conditions, is generated in the heart's natural pacemaker (the sinoauricular node). The action potential is propagated along a specialized conduction system in the heart until it reaches all the cardiac muscle cells. The arrival of this signal at a cell provokes several phenomena that result in its contraction. In this way, action potentials and the electric conduction system guarantee the synchronized contraction of the cardiac muscle and, therefore, the effective pumping of blood. The electrophysiological data obtained with the help of these experimental techniques led to the formulation of mathematical models of the electrical behaviour of excitable cells.
Specifically, the electrical activity of cardiac cells has been quantitatively described, since the early 1960s, by models that have become more and more detailed as new ion channels and channel properties have been discovered and studied in depth. In 1977, Beeler and Reuter (BR) formulated a set of equations that reproduced the ventricular action potential. This model, which was modified by Drouhard and Roberge in 1982, was used until the Luo and Rudy models were formulated in the 1990s. The Luo-Rudy model was published in two phases. The first version reformulated some currents of the BR model, while the second version (the Luo-Rudy Phase II model) [2] was much more comprehensive, based on very recent patch-clamp data, and described a wide variety of ionic currents.
[Figure: membrane potential (mV) versus simulation time (ms) over 0-250 ms.]
Figure 1. Action potential during 250 ms for a cardiac cell after being electrically stimulated.
Action potential models are becoming increasingly important in cardiac electrophysiology because of their ability to simulate in precise detail the electrical events that take place across the cell membrane. Because the simultaneous recording of all the individual ionic currents as well as the electrical membrane potential is impossible to achieve by experimental means alone, computer simulations provide a desirable alternative for exploring the basic electrophysiological mechanisms in normal and abnormal cells. Moreover, if the mathematical formulation of the membrane ion kinetics (the cellular model) is combined with a representation of the electrical characteristics of the tissue, the resulting mathematical model (a system of differential equations) can be used to simulate the electrical activity (action potential propagation) of cardiac preparations, or even of the whole heart. The simulation of most arrhythmogenic phenomena requires the use of models that represent the behaviour of relatively big tissues, with a high number of cells connected in a bidimensional or tridimensional model, either monodomain or bidomain. The more realistic the model is, the more useful the conclusions will be. One phenomenon responsible for ventricular arrhythmia is the so-called reentry. Reentries depend on the tissue geometry and, therefore, a model that could reproduce a complete heart would be a very important tool for discovering the mechanisms of the different kinds of arrhythmia, being able to suggest, with its help, the most suitable therapy for each concrete arrhythmia. In this way, the results obtained with the modelization are useful to propose new hypotheses that, once experimentally validated, may be used to introduce modifications in clinical therapy. Therefore, although for certain studies it may be sufficient to use tissues with a reduced number of cells, having available models that represent larger tissues enables more ambitious studies,
from which more useful medical conclusions can be obtained. However, as the size of the simulated cardiac tissue increases, the very large numerical burden resulting from calculating currents and voltages on many cells, and then simulating the electrical interactions among the coupled cells, requires very large-scale computational resources. For instance, taking into account that a cardiac tissue consists of irregular, densely packed cells of 30-100 µm length and 10-20 µm width, a 1 cm x 1 cm tissue is composed of approximately 100,000 coupled cells. A simulation of action potential propagation during 2 seconds requires the execution of 250,000 timesteps of 8 µs, which implies a total simulation time that can last for several days. Moreover, ischemic behaviour may require the simulation of the electrical state of a cardiac tissue during several minutes, which may last for months on a traditional sequential platform. Thus, the complexity of these cardiac models demands the introduction of high performance computing techniques to enable both a decrease in simulation time and the use of larger tissues.
2. MATHEMATICAL MODEL PRESENTATION
The partial differential equation that governs the bidimensional monodomain propagation of the action potential is the following:

\[
\frac{a}{2}\left(\frac{1}{R_{ix}}\frac{\partial^2 V_m}{\partial x^2} + \frac{1}{R_{iy}}\frac{\partial^2 V_m}{\partial y^2}\right) = C_m\frac{\partial V_m}{\partial t} + I_{ion} \quad (1)
\]

where I_ion is the sum of the ionic currents traversing the membrane of the cell, a is the radius of the cell, C_m is the specific membrane capacitance, V_m is the membrane potential, and R_ix and R_iy are the longitudinal and transversal resistivities. A regular grid, with each node being one cardiac cell, is used for the spatial discretization of Eq. (1). Crank-Nicolson's implicit method is applied for the numerical solution of this equation, giving rise to a large scale system of linear equations that must be solved, for each simulation step, in order to obtain the membrane potential of all the cells of the tissue. The physical problem is transformed into the following algebraic equation:
\[
A_L \cdot V_m^{t+1} = A_R \cdot V_m^{t} + I_{ion} + I_{st} \quad (2)
\]
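For orientation, the structure of these matrices can be sketched as follows (the exact signs and scalings depend on the conventions adopted above): with D denoting the constant matrix obtained from discretising the diffusion term of Eq. (1) on the regular grid, and Δt the timestep, the Crank-Nicolson scheme splits D evenly between the two time levels,

\[
A_L \approx \frac{C_m}{\Delta t} I - \frac{1}{2} D, \qquad A_R \approx \frac{C_m}{\Delta t} I + \frac{1}{2} D,
\]

which also makes explicit why A_L and A_R never change during a monodomain simulation: D only depends on the fixed resistivities and the grid.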
The vector I_ion of Eq. (2) is calculated as the sum of the ionic currents traversing the cell membrane, which depends on the membrane potential and internal state of each cell, given by a specific cellular model. The Luo-Rudy Phase II model has been employed to calculate this term. The vector I_st stands for a stimulus that may be injected into the cardiac tissue to provoke an action potential. The term Vm^t represents the membrane potential of each cell at time instant t. When using a monodomain modelization, the matrices A_L and A_R, which stand for the conductances among the cardiac cells, do not change along the simulation process and must be created only once. The main phases of a simulation timestep are:
1. Matrix-vector product plus the addition of a vector (A_R · Vm^t + I_ion).
2. Injection of a stimulus (optional), simulated by negative components in the I_st vector, and addition to the previously calculated vector to generate the right hand side of the system.
3. Computation of the new membrane potential of the cells (Vm^{t+1}) by solving the linear equation system.
4. Update of the cell state with the new membrane potential, according to the cellular model being employed, and calculation of the new I_ion term.
A simulation consists of many timesteps, which allow the membrane potential of the cells to be iteratively reconstructed along the simulation time. In the monodomain modelization of the electrical activity simulation problem, all of the above phases can be efficiently parallelized.
3. PARALLELIZATION STRATEGY
The data parallelism paradigm has been employed to parallelize the problem, where different processors perform the same functionality over different data. A domain decomposition of the whole cardiac tissue is carried out among all the processors. This can be considered a piece-of-tissue parallelization strategy, where a set of cells is assigned to each processor. All the data required for the simulation, such as the coefficient matrix and the ionic current and membrane potential vectors, are generated in parallel. They have been partitioned among the processors following a rowwise block-striped distribution, which overcomes the memory constraints that may arise when simulating a large tissue on a single computer. With all the main data structures distributed, we enable the generation of cardiac tissues that are p times larger, where p is the total number of processors employed. This form of parallelization is the most suitable for this problem, as it reduces inter-process message passing to the bare minimum: apart from the unavoidable communications involved in the linear system solution, only the membrane potential of the neighbouring cells must be communicated for each simulation timestep, in order to carry out the A_R · Vm^t matrix-vector product in parallel. Other strategies have tried to parallelize the same problem but they appear to be less efficient [3].
4. SOLVING THE LINEAR EQUATION SYSTEM
The coefficient matrix associated with the linear equation system is symmetric, positive definite, and shows a very regular pattern with a high level of sparsity, as it is a pentadiagonal matrix. The dimension of this matrix equals the number of nodes of the discretization grid, which in our case corresponds to the number of tissue cells. Both parallel direct and iterative solvers have been applied to solve this large scale and sparse system. A parallel direct approach based on multifrontal Cholesky factorization [4, 5] has been employed, along with a Multilevel Nested Dissection algorithm to order the coefficient matrix, reducing the fill-in [6]. In addition, parallel implementations of iterative methods such as the Conjugate Gradient, the Generalized Minimal Residual and the Transpose-Free Quasi-Minimal Residual, combined with different preconditioners, have been used, these being known to be very suitable for large sparse matrices [7]. Even though for small tissues (up to a couple of hundred nodes in each dimension) the direct method performs better than any iterative method tested, for an average tissue size of 500-800 nodes in each dimension and larger, iterative methods perform a faster resolution of the linear equation system. The Conjugate Gradient method with no preconditioning has proved to be the fastest iterative method. Besides, it solves the system up to 30% faster than the direct method. The matrix is well-conditioned, which causes the iterative method to converge to a solution in few iterations, with a good residual tolerance.
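The four phases can be summarised in a small driver loop. The sketch below is written in Java for illustration only and every helper is a placeholder; in the real implementation the vectors follow the rowwise block-striped distribution of section 3, and the matrix-vector product and the Conjugate Gradient solve of section 4 are executed in parallel.

```java
// Schematic driver for the simulation timestep (phases 1-4 above).
// All abstract methods are placeholders for distributed, parallel operations.
public abstract class TimestepDriver {

    protected abstract double[] applyRight(double[] vm);                 // A_R * Vm^t
    protected abstract double[] stimulus(int step);                      // I_st (zero if no stimulus)
    protected abstract double[] solveLeft(double[] rhs, double[] guess); // solve A_L * x = rhs (e.g. CG)
    protected abstract double[] updateCells(double[] vm);                // cellular model -> new I_ion

    public double[] run(double[] vm, int timesteps) {
        double[] ionic = updateCells(vm);
        for (int step = 0; step < timesteps; step++) {
            double[] rhs = add(applyRight(vm), ionic);   // phase 1
            rhs = add(rhs, stimulus(step));              // phase 2
            vm = solveLeft(rhs, vm);                     // phase 3
            ionic = updateCells(vm);                     // phase 4
        }
        return vm;
    }

    private static double[] add(double[] a, double[] b) {
        double[] c = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
        return c;
    }
}
```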
5. IMPROVING THE COMPUTATION OF IONIC CURRENTS
Even though all the steps of the simulation process have been enhanced by means of a high performance computing approach, there is still much room for improvement inside the fourth step, the computation of I_ion. The ionic models are responsible for the correct modelization of the kinetics inside the cell. Each process updates its local part of the vector, according to a given cellular model, with no inter-process communications necessary. The computation of the ionic currents across the membrane of each cell requires solving, for instance, 14 differential equations in the Luo-Rudy Phase II model: 6 for updating the so-called "gate variables" and 8 for updating the ionic concentrations. It has been estimated that the ionic model used is responsible for up to 60% of the total execution time.
5.1. Lookup tables
The activation functions of the gates in the ionic models require the previous computation of the α and β parameters for each gate, expensive mathematical expressions which involve exponential functions. These expressions only depend on the membrane potential, so a lookup table can be designed to precalculate the output value over a range of membrane potentials. It has been estimated that in non-pathological situations the membrane potential of a cell lies in the range [-90, 50] mV. The lookup tables have been designed to cover this range with an equal spacing of 0.1 mV, where linear interpolation between the two nearest points in the table is employed. This has been found to be enough to keep the global solution accurate up to the third decimal place. For the Luo-Rudy Phase II model, we use 12 lookup tables to estimate the α and β parameters of the six gates, with a total memory footprint of 131.25 Kbytes, a penalty that is clearly outweighed by a 33% improvement in speed.
5.2. Efficient ODE integration
Inside a cellular model, a set of ordinary differential equations governs the ionic concentration changes as well as the evolution of the gates, which control the permeability of the cell membrane to the different ions. The fast activation of the gates, especially the m-gate, in charge of controlling the permeability to the Na+ ions (Eq. 3) during the upstroke of the action potential, imposes a serious restriction on the simulation timestep.

\[
\frac{\partial m}{\partial t} = \alpha_m \cdot (1 - m) - \beta_m \cdot m \quad (3)
\]
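Rewriting Eq. (3) in relaxation form makes the timestep restriction explicit:

\[
\frac{\partial m}{\partial t} = \frac{m_\infty - m}{\tau_m}, \qquad m_\infty = \frac{\alpha_m}{\alpha_m + \beta_m}, \qquad \tau_m = \frac{1}{\alpha_m + \beta_m}.
\]

For frozen coefficients, the forward Euler scheme applied to this equation is only stable for Δt ≤ 2τ_m, so the very small time constant of the fast gates dictates the admissible global timestep. The α_m and β_m coefficients appearing here are also exactly the quantities tabulated in section 5.1. The sketch below illustrates that lookup-table idea (0.1 mV spacing over [-90, 50] mV with linear interpolation); it is written in Java for illustration, takes an arbitrary rate function as input, and does not reproduce the actual Luo-Rudy expressions.

```java
import java.util.function.DoubleUnaryOperator;

// Illustrative lookup table for one rate coefficient (e.g. alpha_m(Vm)).
public class RateTable {
    private static final double V_MIN = -90.0; // mV
    private static final double V_MAX = 50.0;  // mV
    private static final double STEP = 0.1;    // mV

    private final double[] values;

    public RateTable(DoubleUnaryOperator rateFunction) {
        int n = (int) Math.round((V_MAX - V_MIN) / STEP) + 1;
        values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = rateFunction.applyAsDouble(V_MIN + i * STEP);
        }
    }

    // Linear interpolation between the two nearest precomputed entries;
    // potentials outside the tabulated range are clamped to its ends.
    public double lookup(double vm) {
        double pos = (vm - V_MIN) / STEP;
        int i = (int) Math.floor(pos);
        if (i < 0) return values[0];
        if (i >= values.length - 1) return values[values.length - 1];
        double frac = pos - i;
        return (1.0 - frac) * values[i] + frac * values[i + 1];
    }
}
```

At this resolution a table holds on the order of 1400 doubles (about 11 KB), which is consistent with the 131.25 Kbytes quoted above for 12 tables.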
Traditional Euler methods for integrating this set of ODEs introduce instabilities when using a timestep larger than 10 µs. Other explicit methods of higher order, like those of the Runge-Kutta family, allow increasing this limit up to 15 µs. These fast-changing ODEs require some sort of implicit or semi-implicit integration method that behaves stably. We have addressed the problem by employing a semi-implicit method called "Explicit Singly Diagonally Implicit Runge-Kutta" (ESDIRK), already used in [8] to solve the Winslow cellular model. This model is based on the Luo-Rudy Phase II model, but introduces a new formulation of the Ca2+ handling mechanisms, with a coupled system of ODEs, which imposes even more serious constraints on the integration timestep. The ODEs in the Luo-Rudy model are not coupled and, besides, those not governing the gates may be integrated with traditional forward Euler methods without any considerable loss of precision.
This gives us a method that computationally only doubles the cost of a 4th order Runge-Kutta step, but allows a timestep up to 8 times larger. The ESDIRK method is stable when using a global timestep of up to 125 µs, the limit where the diffusion term starts to suffer from stability problems. Needless to say, using a larger timestep means a less accurate solution, but, depending on our interests, we may not need a fully detailed 1 µs timestep simulation, which can last for weeks. It must be taken into account that doubling the simulation timestep reduces the global simulation time by half.

[Figure panels: (a) Speedup (ideal vs. achieved), (b) Efficiency of the simulation system, (c) Simulation time for a scaled problem (seconds), plotted against the number of processors.]

Figure 2. Speedup, efficiency and scalability of the simulation system.

6. EXPERIMENTAL RESULTS
The simulations have been run on a cluster of 20 Pentium Xeon 2 GHz biprocessors with 1 GByte of RAM, interconnected both by Gigabit Ethernet and by a 4x5 2D torus SCI network. All the simulations were performed using the SCI network, which, offering a 2.7 µs minimal latency and high bandwidth, is very suitable for improving communications. The designed application scales quite linearly with the number of processors. Figures 2.a and 2.b show the speedup and the efficiency of the system up to 32 processors when simulating action potential propagation during 10 ms on a 1000x1000 cell ventricular cardiac tissue, which results in a linear system with 1 million unknowns. These simulations have been carried out with 2 processes per node. As an example, the system reaches a speedup of 30.17 and an efficiency of 94.2% with 32 processors. These results are coherent with the parallelization strategy employed, which reduces the communications to a minimum and balances the load among all processors, two of the most important aspects of a successful parallelization. Another indicator of the scalability of an application is the use of a scaled problem. An efficiently parallelized application should have similar execution times when doubling the number of processors and the size of the problem simultaneously (Figure 2.c). For example, an action potential simulation during 250 milliseconds for a tissue with 1 million cells using 11 processors lasts approximately 21500 seconds. These simulations have been run with only one process per node, as better results are obtained than when employing two processes per node, mainly due to the collisions provoked when two processes on the same node access the same shared network card.
7. RECORDING SIMULATION DATA
The purpose of all these biomedical simulations is to investigate cardiac electrical behaviour. Thus, it is necessary to record intermediate simulation data as they are calculated. These data
may include the membrane potential, the main ionic currents and any of the cell characteristics, such as ionic concentrations and intermediate currents. Therefore, a compromise must be met between what the user is able to record and an efficient way to perform all the recording. We have opted for an XML description of the recording information. The user can specify, in a single XML document, which characteristics need to be recorded for each cell of the tissue. A Java program parses the XML document and generates the appropriate code to do the recording. This code is then compiled within the whole application, obtaining a simulator tailored to the user's recording needs. The structure of all the output files is independent of the number of processors, which makes the parallelism transparent to the end user. This is very convenient when a previously saved simulation has to be restarted using a different number of processors. The MPI-2 I/O standard, provided by the ROMIO implementation, has been employed to provide efficient access to secondary storage.
8. FIRST GRID EXPERIENCES
One of the main interests of the simulation process is to test the effects of antiarrhythmic drugs, which requires the execution of parametric simulations changing the cell characteristics. As an example, the influence of a new drug may be modeled as a change in the intracellular potassium concentration. Testing the effect of multiple drugs would require running the same simulation application over different cellular configurations. Therefore, a GRID distributed environment is very suitable for this situation, where multiple parametric simulations must be carried out and the focus moves from speed to productivity. We have been employing the InnerGrid [9] platform, a set of tools that allows building and managing, in a centralized manner, a heterogeneous platform of physically distributed computers. The main purpose of this approach is to take advantage of all the computational power, enabling computer collaboration to perform computationally intensive tasks. InnerGrid has been installed on 2 different Linux clusters, creating a single grid entity with a total capacity of more than 50000 MFLOPS. The InnerGrid Desktop offers an entry point to the grid, allowing both the definition and the state visualization of new tasks. The different parametric simulations (µ-tasks according to InnerGrid terminology) are assigned to the nodes in the grid for sequential execution. The results of each µ-task, mainly files with the state of the cardiac tissue along the simulation, are stored centrally, allowing easy access to the information of all the simulations. InnerGrid implements a fault tolerance policy that guarantees the execution of the µ-tasks as long as there are living nodes in the grid.
9. CONCLUSIONS
This paper presents a complete and highly efficient parallel implementation for the simulation of action potential propagation on a two-dimensional monodomain ventricular cardiac tissue. As shown, high performance computing techniques have dramatically reduced the simulation time. Combining both the use of a distributed computing model and the numerical improvements taken, it has been possible to reduce simulations that lasted days to a few hours. With the system depicted, more results can be achieved in less time, reducing the overall research time. Besides, larger tissues can be simulated simply by using more processors. Finally, with the advent of new GRID technologies, a vast number of simulations can be simultaneously
346 performed, enhancing the productivity by several orders of magnitude. REFERENCES
[1] Neher E, Sakmann B. The patch-clamp technique. Sci. Am. 266, 1992: 28-33.
[2] Luo CH, Rudy Y. A dynamic model of the cardiac ventricular action potential. I. Simulations of ionic currents and concentration changes. Circ. Res., vol 74, 1994: 1071-1096.
[3] Porras D, Rogers J, Smith W, Pollard A. Distributed computing for membrane-based modeling of action potential propagation. IEEE Trans. Biomed. Eng., vol 47, 2000: 1051-1057.
[4] Heath MT, Ng E, Peyton BW. Parallel Algorithms for Sparse Linear Systems. SIAM Rev., 33, 1991: 420-460.
[5] Liu JWH. The multifrontal method for sparse matrix solution: Theory and practice. SIAM Rev., 34, 1992: 82-109.
[6] Hendrickson B, Leland R. A multilevel algorithm for partitioning graphs. Tech. Rep. SAND93-1301, Sandia National Laboratories.
[7] Lo G-C, Saad Y. Iterative solution of general sparse linear systems on clusters of workstations. Technical Report UMSI 96/117 & UM-IBM 96/24, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, MN, 1996.
[8] Sundnes J, Lines G-T, Tveito A. Efficient solution of ordinary differential equations modeling electrical activity in cardiac cells. Mathematical Biosciences 172, 2001: 55-72.
[9] InnerGrid Manual. http://www.innergrid.com.
347
MoDySim - A parallel dynamical UMTS simulator
M.J. Fleuren (a), H. Stüben (b), and G.F. Zegwaard (c)
(a) TNO Telecom, St. Paulusstraat 4, 2264 XZ Leidschendam, The Netherlands
(b) Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB), Takustr. 7, 14195 Berlin, Germany
(c) QQQ Delft, Kluyverweg 2a, 2629 HT Delft, The Netherlands
UMTS is a 3rd generation mobile telecommunication system which enables multi-service and multi-bit rate communication going beyond the possibilities of previous systems. The simulator MoDySim models UMTS in great detail. Characteristics of UMTS such as soft hand-over and the interdependency of load and capacity among neighbouring cells are challenges for the parallelisation of such a system. In this paper we explain how the software was parallelised and present performance results of a UMTS simulation for the city of Berlin.
1. INTRODUCTION
UMTS is a 3rd generation mobile communication system. After GSM (2nd generation), which supported mainly voice calls, GPRS (2.5th generation) opened the way for packet switched data communication. UMTS will enable true multi-service and multi-bit rate communication. Services such as video telephony or file downloading will become possible on mobile phones and laptops. UMTS differs in many respects from GSM and GPRS. On the eve of implementation and deployment of the system, many aspects still need to be investigated. MoDySim is a programme that simulates UMTS in great detail in order to study the behaviour of large UMTS systems realistically. MoDySim was developed from scratch in the EU 5th Framework project MOMENTUM. The aim of MOMENTUM was to develop network planning algorithms for UMTS in order to obtain optimised networks. It is important to be able to check the outcome of the planning process with a detailed simulator like MoDySim. MoDySim was implemented in an interdisciplinary effort by the MOMENTUM partners TNO Telecom (formerly KPN Research), QQQ Delft, and ZIB. The three partners combined UMTS network knowledge, professional programming, and experience in parallelisation. MoDySim is a dynamical simulator (in contrast to a snap-shot simulator), i.e. it simulates the behaviour of the system over time, taking into account movements and usage behaviour of the users as well as the dynamics of the network control. The sizes of the systems to be investigated (cities) make it necessary to parallelise the programme in order to shorten its run time to a reasonable length. This paper focuses on the parallelisation aspects of MoDySim. The main problems to be dealt with in the parallelisation are:
348
• The interference in UMTS is a long range effect. Interference couples large parts of the system, while one would like to exploit data locality of small sub-systems in a parallel programme.
• The effects of cell breathing and soft hand-over imply an absence of a natural (unique) home process for the mobile stations.
• Power control implies (in principle) extensive communication between the processes (in a real UMTS system powers of mobile and base stations are updated every 0.66 ms).
2. INTRODUCTION TO UMTS
This section provides some background for understanding the dynamics of UMTS and the resulting problems of parallelisation.
Figure 1. Typical layout of a UMTS network. The dashed lines indicate connections between the base station and the mobile station.
Figure 2. Interference. In the downlink, signals of base stations interfere at the mobile stations (left). In the uplink, signals of mobile stations interfere at the base stations (right).
In UMTS, base stations (BS) receive signals from mobile stations (MS) (see Figure 1) and transport them (via fixed networks) either to servers or to other BSs, which can then transmit the signal to another MS. Unlike GSM and GPRS, UMTS uses the same two frequencies in the uplink (MS → BS) and downlink (BS → MS) across all cells of a network (operators may ultimately use two such paired frequencies). Different data streams are multiplexed. A signal can only be properly detected if the signal to interference ratio (C/I) at the receiver is above a certain target value. This means that the UMTS system is interference limited. As interference is caused by all transmitters (BSs in the downlink, MSs in the uplink, see Figure 2), the available capacity in a cell is directly related to the concurrent transmissions in its surrounding cells. This implies that the capacity of a cell is not a fixed value, but depends on the interference situation both in its surrounding cells and in its own cell. For example, the cells surrounding a heavily loaded cell will typically have less capacity available in that situation. In UMTS the interference couples the cells, and therefore couples the capacity in the cells. An important dynamical effect in UMTS is cell breathing: heavily loaded cells will shrink, while lightly loaded ones will grow (see Figure 3). UMTS can handle services with different bit rates and delay requirements: voice, video conferencing, web browsing, e-mail retrieval, etc.
349 In UMTS, higher bit rates require higher power. This implies that the cell breathing dynamics is influenced not only by moving users but also by the type of service. UMTS has a power control mechanism making sure that the transmission power is such that the required C/I at the receiver is exactly met. Powers are adapted every 0.66 ms. A higher power would pollute the system with extra interference, and unnecessarily use precious capacity. Another technical characteristic of UMTS is soft hand-over. When an MS moves, its connection to the BS may become weaker and may even be lost. The MS is then handed over to a nearby BS (not necessarily the nearest) with a stronger signal reception. In UMTS the MS can be simultaneously connected to several (typically two to three) BSs. This situation is called soft hand-over. It means that in the downlink the MS combines the received signals, and in the uplink the signals received by the BSs are combined higher up in the network. Two other terms will be mentioned in the text: call admission control and congestion control. When a new user wants to set up a call, the MS has to ask the BS for permission to connect. If there is not enough capacity available, the request will be rejected in order to protect the quality of service of the present users. This is called call admission control. Congestion control starts when the interference becomes too high (e.g. if too many calls have been admitted or too many users move to the same BS). It will block all new calls, reduce the power of the transmitters, and, as a last resort, terminate calls in order to reduce interference.
Figure 3. Illustration of cell breathing. Small cells can carry a higher load than large cells.
Figure 4. Example of a domain decomposition. Each domain contains the active base stations (see text) treated by one process.
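The carrier-to-interference condition described above can be illustrated with a small C++ sketch. It is a deliberately simplified picture (it ignores, for instance, own-cell interference and orthogonality effects), and all names are assumptions rather than MoDySim interfaces.

// Illustrative sketch: downlink C/I at one mobile. The received power of the
// serving base station is the carrier C; the signals of all other base stations
// plus thermal noise act as interference I. A link is only usable if C/I stays
// above the service-dependent target value. This simplified picture ignores
// own-cell interference and orthogonality effects.
#include <cstddef>
#include <vector>

double downlink_c_over_i(const std::vector<double>& rx_power_W, // received power from each BS
                         std::size_t serving_bs,                // index of the serving BS
                         double noise_W)                        // thermal noise power
{
    double interference = noise_W;
    for (std::size_t b = 0; b < rx_power_W.size(); ++b)
        if (b != serving_bs) interference += rx_power_W[b];
    return rx_power_W[serving_bs] / interference;
}

bool link_ok(const std::vector<double>& rx_power_W, std::size_t serving_bs,
             double noise_W, double target_c_over_i)
{
    return downlink_c_over_i(rx_power_W, serving_bs, noise_W) >= target_c_over_i;
}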
3. OVERVIEW OF MoDySim
3.1. Software design goals
The goal was to develop a dynamical UMTS system simulator with a high level of detail that is fast enough to handle large scenarios, e.g. a city with on the order of a hundred BSs, realistic traffic load and a mix of services. The system should incorporate realistic user behaviour (movement and usage of services), realistic power control, soft hand-over, call admission control, and congestion control.
350
3.2. Choice of basic algorithm
A fundamental decision at the beginning of the project was to choose between event-driven and time-driven simulation. Event-driven algorithms are popular for simulating mobile communication networks. However, this kind of algorithm does not seem well suited for a (parallel) UMTS simulator. Events have a stochastic nature. There are stochastic events in UMTS, but in addition there is the regular heartbeat of the power control signals. This regular heartbeat is the shortest time-scale in the simulation. Events have to be processed in the right order. As a consequence the whole (event-driven) system could only proceed in the rhythm of the heartbeat. This observation makes time-driven simulation a natural choice. In the time-driven framework the simulation resembles a classical molecular dynamics simulation. Concerning the parallelisation one is therefore naturally led to a domain decomposition approach. The approach we have taken is explained in section 4.
3.3. Main loop
The body of the main loop of MoDySim represents one heartbeat. It looks like this (the abbreviations MS and BS indicate which groups of stations are involved):
(1) MS: Update locations (includes arrivals of new calls and clean-up of finished mobiles)
(2) MS, BS: Calculate quality of signals
(3) MS: Soft hand-over requests
(4) BS: Call admission control
(5) MS: Update activities
(6) MS, BS: Power update (includes interference calculation)
(7) BS: Congestion dropping
(8) MS: Quality dropping
Some more detailed explanations are in order. In step (1) the mobiles move to their next location. The mobility of the users is realistic (e.g. along roads) and with varying speeds. The calculation of signals in (2) is based on propagation grids that describe signal loss. The propagation grids model real environments (e.g. buildings, streets, mountains, rivers). Based on the new values of signals the mobiles decide on sending hand-over requests to base stations in (3). In step (4) the base stations decide upon all requests. In addition to the hand-over requests (3) there are also requests to admit new calls from step (1). Update of call activities (5) results in adjustments of bit rates depending on the service (e.g. voice call, picture download). Bit rates affect required powers. The higher the bit rate the higher the power required. The power update (6) comprises the update of total base station powers, sending of power control signals to mobiles, adjustment of mobile powers (mobiles in soft hand-over have to consider all received control signals) and interference of mobile signals at the base stations and vice versa. Finally in (7) and (8) dropping of calls may occur due to base station congestion or low signal quality at mobiles (the aim is, of course, to minimise both).
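The heartbeat structure of section 3.3 can be rendered schematically in C++ as follows. All type and function names are placeholders introduced for illustration; the real MoDySim classes and interfaces differ.

// Schematic rendering of one simulated heartbeat (0.66 ms) per section 3.3.
// All names are placeholders; the bodies below are empty stubs.
struct Scenario { /* mobiles, base stations, propagation grids, ... */ };

void update_locations(Scenario&)          {}  // (1) MS: move users, new/finished calls
void calculate_signal_quality(Scenario&)  {}  // (2) MS, BS: based on propagation grids
void soft_handover_requests(Scenario&)    {}  // (3) MS
void call_admission_control(Scenario&)    {}  // (4) BS: decide on requests from (1) and (3)
void update_activities(Scenario&)         {}  // (5) MS: adjust bit rates per service
void power_update(Scenario&)              {}  // (6) MS, BS: includes interference calculation
void congestion_dropping(Scenario&)       {}  // (7) BS
void quality_dropping(Scenario&)          {}  // (8) MS

void run(Scenario& s, long heartbeats)
{
    for (long t = 0; t < heartbeats; ++t) {
        update_locations(s);
        calculate_signal_quality(s);
        soft_handover_requests(s);
        call_admission_control(s);
        update_activities(s);
        power_update(s);
        congestion_dropping(s);
        quality_dropping(s);
    }
}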
351 4. PARALLELISATION
4.1. Basic approach
The basic approach to parallelising MoDySim is a domain decomposition as indicated in Figure 4. Domains usually overlap, and one has to make sure that overlapping areas are updated correctly with data from remote processes. In a UMTS simulator the overlap areas contain base stations and mobiles. While it is clear how to assign home processes to base stations, this is not so obvious for the mobiles. In a system with hard hand-over every mobile is connected to (at most) one base station. Therefore, the mobile's home process would be the corresponding base station's home process. In UMTS there is soft hand-over, implying that a mobile can be connected to two (in practice up to three) base stations, possibly sitting on two (or three) processes. Typically 30-40% of the mobiles are in soft hand-over (the value depends on the parameter settings of the operator). In order to reduce bookkeeping and to minimise communication between processes there are no overlap areas in MoDySim. Special concepts were instead introduced for the base stations and mobiles.
4.2. Concept for mobiles
Mobiles that are not in soft hand-over exist only on the home process of the serving base station. When a mobile is in soft hand-over, a copy of the mobile exists on every home process of the base stations it is linked to. The copies of a mobile have to behave identically. In other words, the copies of a mobile perform identical operations, the results of which would otherwise have to be communicated. The consequence is that no mobile data has to be exchanged between processes (except for the action of copying a mobile).
4.3. Concept for base stations
Similar to the introduction of copies of mobiles we introduced copies of base stations. In contrast to the copies of mobiles, the functionality of the base station copies is reduced. Actually, all base stations reside on all processes. We differentiate between active and shadow base stations. Active base stations can perform all tasks while the shadow base stations can only perform some tasks, especially call admission control (they can receive requests and make decisions). The active base stations are the ones one would think of in terms of the domain decomposition; the shadow base stations are there in addition. All base stations participate in the interference calculation, which is a global operation. The construction is such that every mobile (including its copies) can always contact any base station without exchanging data between processes. An active base station can contact all the mobiles linked to it while a shadow base station cannot. The result (or aim) of this construction is that there are basically only two communication patterns: (a) global sums and (b) point-to-point data exchange between neighbouring processes. In both cases only base station data is communicated.
4.4. Changes to the sequential version
In this section the changes to the sequential version are described for each step of the main loop introduced in section 3.3. A general change is that loops over base stations become loops over active, shadow, or all
352 base stations. Loops over mobiles do not have to be changed. Steps (2), (3), (5), and (8) of the main loop can be left unchanged.
Step (1). No changes are needed for updating the mobiles' locations. New calls (mobiles) are brought into existence on the process hosting the geometrically nearest base station.
Step (2). No changes are necessary. However, it should be mentioned that in this part we did not decompose the so-called environment grids, which are input to the calculation of signals. As a consequence the computer memory needed does not decrease when using more processes. The reason for not decomposing these grids is that it is then guaranteed that the interference calculation can be done without approximation. In practice this was no restriction. The Berlin scenario fitted into 2 GByte of memory per process.
Step (4). Call admission control turned out to be quite subtle in the parallel version. This is the place where mobiles are copied or moved between processes. One has to make sure that the objects representing the mobiles are identical (especially, all copies have to know how many copies exist, because this number is needed in the interference calculation). Call admission decisions are made locally. This is possible because only a few parameters are needed at the base station (those parameters have been broadcast directly after they were updated).
Step (6). Updating the powers requires global sums. One global sum is necessary in the interference calculation. All base stations sum up the contributions of the mobiles that are on their process (each contribution is divided by the number of copies of the corresponding mobile in order to obtain the correct contribution from each mobile). Then the global sum is performed. Only the active base stations can calculate new powers. These have to be distributed to the shadow base stations. For this all-to-all communication the global sum routine is used, too.
Step (7). Only the active base stations can determine the congestion status and drop mobiles. On each process a list of dropped mobiles is set up, and these lists are broadcast.
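The global-sum pattern of step (6) can be pictured with the following minimal MPI sketch: every process holds partial interference contributions for all base stations (already divided by the number of copies of each mobile), and a single reduction makes the totals available everywhere. This is an illustration only, not MoDySim's actual communication class.

// Minimal sketch of the global sum used in the power/interference update
// (step 6). Each process contributes the partial interference seen by every
// base station; MPI_Allreduce produces the global totals on all processes.
// Illustrative only; MoDySim wraps such calls in its own communication class.
#include <mpi.h>
#include <vector>

void global_interference_sum(std::vector<double>& interference_per_bs)
{
    MPI_Allreduce(MPI_IN_PLACE,
                  interference_per_bs.data(),
                  static_cast<int>(interference_per_bs.size()),
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}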
4.5. Other issues
Random numbers. We have used the Luxury Random Number Generator (RANLUX) [1]. Every mobile and every active base station generates random numbers individually. By proper initialisation it is guaranteed that the individual sequences of random numbers are uncorrelated.
Load balancing. Load balancing in MoDySim is static. It was assumed that the researcher setting up a simulation scenario has external means to devise a decomposition that leads to an acceptably balanced load. This turned out to be true in the cases run so far.
Implementation. MoDySim was written in C++. A fundamental decision for the parallelisation was whether to use OpenMP or MPI. We have decided to use MPI. The main reasons are platform independence and availability. Another big advantage is that the parallel code can be modularised much better. OpenMP basically requires that all loops be touched. With MPI, parallelisation practically means adding a few calls to data exchange routines and implementing a communication class. Once this class is written the code can be further developed without knowledge of MPI.
Testing. We have tested the parallel version by demanding that the output does not depend on the decomposition. Due to rounding errors in the global sum the output cannot be bitwise identical. In principle it could happen that different rounding leads to a different yes/no decision. In our tests such "butterfly effects" did not occur. Results agreed at the precision at which values are logged (6 decimal places).
353 5. SCALING
The parallel code was tested and run on an IBM p690 computer. The computer has shared memory nodes with 8 and 32 CPUs respectively and 2 GByte of memory per CPU. In the first parallel runs of MoDySim an artificial test scenario was used. The scenario consisted of base stations located at the centers of a hexagonal grid with 7 cells. For this scenario a speed-up of about 5 was obtained on 7 processes. At the time of writing the first realistic scenario was run. The city of Berlin was the first case studied. The Berlin scenario consists of 74 sectorised base stations covering the inner city districts. Table 1 shows the scaling of execution times for this scenario.

Table 1
Scaling of execution times for a UMTS simulation of the city of Berlin.
  processes   execution time   speed-up
      1            71 h           1.0
      7            17 h           4.2
     14            18 h           3.9
6. RELATED WORK
Writing a parallel dynamical UMTS simulator like MoDySim is pioneering work. A related project is SEACORN [2], which also develops a parallel UMTS simulator. However, we have found no paper about their parallelisation work. Traditionally, mobile communication simulations are event-driven, while MoDySim is time-driven. A generic event-driven dynamical simulator is WiNeS [3]. A recent work on parallelising event-driven simulations is [4].
7. DISCUSSION AND CONCLUSION
In the first realistic simulation MoDySim achieves a useful parallel speed-up, although the programme only scales to a moderate number of processes. On 7 processes we observe a speed-up of 4. On 14 processes the programme runs slightly longer than on 7 processes. The main cause of this slowing down is the global sums that are needed in the interference calculation. On 7 processes the global sums take 5.5% of the execution time, while on 14 processes they take 42% of the time. One design goal of MoDySim was to keep the programme as simple as possible. A consequence was that no cut-off was introduced in the interference calculation. With a cut-off there would be no global operations, which should result in higher scalability. In addition the memory per process would scale (at present the memory requirement per process is independent of the number of processes, see section 4.4). Another simplification was to stay with static load balancing. It is interesting to compare our scaling result with those obtained in [4]. The scaling reported there is 4.2-5.8 out of 8, which is similar to our result. One has to keep in mind that in [4] event-driven simulations were studied. A technical implication is that for parallel event-driven simulations shared memory computers are preferred. Because MoDySim was parallelised with MPI it also runs on distributed memory architectures, especially PC clusters.
354 8. ACKNOWLEDGEMENTS
This work was supported by the European Commission under contract number IST-2000-28088. Computations were performed on the IBM p690 system of the Norddeutscher Verbund für Hoch- und Höchstleistungsrechnen (HLRN).
REFERENCES
[1] M. Lüscher, Comput. Phys. Commun. 79 (1994) 100; F. James, Comput. Phys. Commun. 79 (1994) 111.
[2] http://seacorn.ptinovacao.pt
[3] J. Deissner, G. Fettweis, J. Fischer, D. Hunold, R. Lehnert, M. Schweigel, A. Steil, J. Voigt and J. Wagner, A Development Platform for the Design and Optimization of Mobile Radio Networks, in: D. Baum, N. Müller, R. Rösler (eds.), Kurzvorträge der 10. GI/ITG-Fachtagung Messung, Modellierung und Bewertung von Rechen- und Kommunikationssystemen MMB'99, Universität Trier, Mathematik/Informatik, Forschungsbericht Nr. 99-17, pp. 129-133.
[4] M. Liljenstam, Parallel Simulation of Radio Resource Management in Wireless Cellular Networks, Dissertation, Royal Institute of Technology, Stockholm, Sweden, 2000.
355
apeNEXT: a Multi-TFlops Computer for Elementary Particle Physics
apeNEXT Collaboration: F. Bodin, Ph. Boucaud, N. Cabibbo, F. Di Carlo, R. De Pietri, F. Di Renzo, H. Kaldass, A. Lonardo, M. Lukyanov, S. de Luca, J. Micheli, V. Morenas, N. Paschedag, O. Pene, D. Pleiter, F. Rapuano, L. Sartori, F. Schifano, H. Simma, R. Tripiccione, and P. Vicini
IRISA/INRIA, Campus Université de Beaulieu, Rennes, France
LPT, University of Paris Sud, Orsay, France
INFN, Sezione di Roma, Italy
Physics Department, University of Parma and INFN, Gruppo Collegato di Parma, Italy
DESY Zeuthen, Germany
LPC, Université Blaise Pascal and IN2P3, Clermont, France
NIC/DESY Zeuthen, Germany
Physics Department, University of Ferrara, Italy
We present the apeNEXT project which is currently developing a massively parallel computer with a multi-TFlops performance. Like previous APE machines, the new supercomputer is completely custom designed and is specifically optimized for simulating the theory of strong interactions, quantum chromodynamics (QCD). We assess the performance for key application kernels and make a comparison with other machines used for this kind of simulation. Finally, we give an outlook on future developments.
1. INTRODUCTION
For many problems in modern particle physics lattice gauge field theory offers the only known way to compute various quantities from first principles using numerical simulations. Much progress has been made during recent years, e.g., in calculating the light hadron spectrum, the light quark masses, the running coupling constant αs or observables in heavy quark physics. Furthermore, lattice simulations allow the study of non-perturbative phenomena like chiral symmetry breaking, confinement, or studies of phase transitions in the early universe.¹ However, progress in this field is severely limited by the tremendous amount of required computing power. In order to make the necessary resources available, various research groups engage in the development of massively parallel computers which are specifically optimized for these kinds of applications. One of these projects is APE ("Array Processor Experiment") which
¹See the proceedings of the annual Symposium on Lattice Field Theory [1] for an overview.
356 is currently developing its fourth generation of machines, apeNEXT.² This work is carried out within the framework of a European collaboration by INFN (Italy), DESY (Germany) and the University of Paris Sud (France). The required computing resources critically depend on the physical parameters and the formulation of the theory on the lattice. It remains, for instance, challenging to perform simulations at light quark masses.³ In present-day simulations of full QCD, i.e. calculations which take effects from sea quarks into account, the lightest quarks are usually still significantly heavier than the physical mass of the "up" and "down" quarks. Recently, several research groups started to use chirally improved formulations of fermions on the lattice. While preserving important properties of the continuum theory, these new formulations, which fulfill the so-called Ginsparg-Wilson relation, lead to a significant rise in the required computing resources. In this paper we will concentrate on large scale applications, but we want to stress that there are also many small and medium scale applications in this field of research. In simulations of QCD on the lattice, most of the resources are spent in frequent inversions of a huge but sparse matrix. In commonly used formulations this so-called fermion matrix M has the following structure:⁴
M_{x,y} = \delta_{x,y} - \kappa \sum_{\mu=1}^{4} \left[ (1 - \gamma_\mu)\, U_{x,\mu}\, \delta_{x+\hat{\mu},y} + (1 + \gamma_\mu)\, U^{\dagger}_{x-\hat{\mu},\mu}\, \delta_{x-\hat{\mu},y} \right]    (1)
Here x and y label the sites of the 4-dimensional lattice, and κ is a real parameter which is directly related to the quark mass. The lattice gauge fields U_{x,μ}, which carry two colour indices, are represented by 3 × 3 complex matrices. The Dirac matrices γ_μ carry two spin indices. Therefore, M is a (3·4·V)² complex matrix, where V is the lattice volume. Since iterative algorithms are used to calculate the inverse, matrix times vector multiplications are the performance dominating operations. When developing computers optimized for simulating lattice QCD, several features which are basically independent of the specific formulation of the theory can be taken into account. For instance, multiplication by M involves mostly arithmetic with complex numbers and allows for a relatively high data re-use. An appropriate choice of memory hierarchies or register file size therefore has a significant impact on the performance. The fermion matrix is homogeneous in coordinate space and connects, at least in the most often used formulations such as Eq. (1), only nearest neighbours. A SIMD programming model is sufficient to parallelize the problem by distributing the lattice sites on a 1- to 4-dimensional mesh of processors. Communication then has to be performed only between nearest-neighbour nodes. These features allow for a significant simplification of the global architecture and particularly the network.
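To picture the structure of Eq. (1), the following simplified C++ stencil drops the spin and colour degrees of freedom (one complex number per site and per link) but keeps the nearest-neighbour coupling through link variables, the complex arithmetic and the data reuse. It is an illustration under these simplifying assumptions, not the apeNEXT kernel or production lattice-QCD code.

// Simplified picture of applying the fermion matrix of Eq. (1): spin and colour
// are dropped, but the nearest-neighbour structure survives. Periodic L^4 lattice;
// U has 4*V entries (one link per site and direction), y and My have V entries,
// and My must be pre-sized to y.size().
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

struct Lattice {
    int L;                              // linear extent
    std::vector<cplx> U;                // link variables, U[4*site + mu]
    std::size_t site(const int x[4]) const {
        return ((static_cast<std::size_t>(x[3]) * L + x[2]) * L + x[1]) * L + x[0];
    }
};

void apply_M(const Lattice& lat, const std::vector<cplx>& y,
             std::vector<cplx>& My, double kappa)
{
    const int L = lat.L;
    int x[4];
    for (x[3] = 0; x[3] < L; ++x[3])
    for (x[2] = 0; x[2] < L; ++x[2])
    for (x[1] = 0; x[1] < L; ++x[1])
    for (x[0] = 0; x[0] < L; ++x[0]) {
        const std::size_t s = lat.site(x);
        cplx hop(0.0, 0.0);
        for (int mu = 0; mu < 4; ++mu) {
            int xp[4] = { x[0], x[1], x[2], x[3] };
            int xm[4] = { x[0], x[1], x[2], x[3] };
            xp[mu] = (x[mu] + 1) % L;              // forward neighbour
            xm[mu] = (x[mu] - 1 + L) % L;          // backward neighbour
            hop += lat.U[4 * s + mu] * y[lat.site(xp)]                      // forward hop
                 + std::conj(lat.U[4 * lat.site(xm) + mu]) * y[lat.site(xm)]; // backward hop
        }
        My[s] = y[s] - kappa * hop;                // M y = y - kappa * (hopping term)
    }
}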
2. apeNEXT HARDWARE ARCHITECTURE
apeNEXT is designed as a massively parallel computer. All processor functionalities, including the network devices, are integrated into one single custom chip running at a clock frequency of 200 MHz. The processor is a 64-bit architecture. However, most operations act on (64+64)-bit words, which are, e.g., needed to encode an IEEE double precision complex number. The arithmetic unit can at each clock cycle perform the APE normal operation a × b + c, where a,
²See [2] for a review of the APE project.
³Estimates for the CPU requirements as a function of the relevant physical parameters are given in [3].
357 b, and c are complex numbers. The peak floating point performance of each node is therefore 1.6 GFlops. Each node is a fully independent processor with access to its private memory bank of 256-1024 MBytes based on standard DDR-SDRAM. In each clock cycle the ECC-protected memory bus can sustain one (64+64)-bit word, with a latency of about 15 clock cycles. The memory controller receives its input from an address generation unit that allows integer operations to be performed independently of the arithmetic unit. The memory stores both program and data. A consequence of this organization is that conflicts between data and instruction load operations are likely to occur, since apeNEXT is a microcoded architecture using 117-bit very long instruction words (VLIW). Two strategies have been employed to avoid these conflicts. First, the hardware supports compression of the microcode. The compression rate depends on the level of optimization; typical values are in the range of 40-70%. Instruction de-compression is performed on-the-fly by dedicated hardware. Second, an instruction buffer allows, under complete software control, pre-fetching of (compressed) instructions and storing performance critical kernels for repeated execution. Internode communication is done via a high-performance network which has the topology of a three-dimensional torus. The network is able to transfer one byte per clock cycle with an extremely small latency of about 20 clock cycles (i.e. 100 ns). Communication is therefore clearly bandwidth-limited. For the user this has the advantage that almost maximum throughput is already reached when communicating data using rather small bursts, e.g. 12 words or about 200 bytes of data. Transmission of each 128-bit word is protected against single bit errors by a 16-bit cyclic redundancy checksum (CRC). Each node is bi-directionally connected to its six nearest neighbours. However, the hardware supports routing along up to three orthogonal links with only little extra latency. Communication in orthogonal directions can be executed independently. apeNEXT systems will contain up to several thousand processors. In order to achieve high scalability it is necessary that network and arithmetic operations overlap. Furthermore, memory and network latencies, although they are small, nevertheless become relevant when aiming for outstanding sustained performance. The apeNEXT processor allows pre-fetching of local or remote data into a receive queue. Later the data can be moved from the queue to the register file with zero latency. This allows memory latencies as well as network bandwidth limitations to be hidden. The latter is possible because all queued network requests can be executed independently of the rest of the processor. To allow data to be re-used, a very large register file of 256 (64+64)-bit registers is provided. Furthermore, data can be written from the register file into the receive queues for later use. Which data is kept inside the processor is therefore completely under user control, and the user does not have to care about hard-to-control cache effects. All apeNEXT nodes are designed to run asynchronously, which means that apeNEXT follows the single program multiple data (SPMD) programming model. The nodes will implicitly be synchronized by communications. A dedicated logic tree, which allows fast evaluation of global AND or OR conditions, can also be used for implementing barriers. It can therefore be used for explicit synchronization.
A set of 16 apeNEXT nodes will be assembled onto just one processing board. A set of 16 boards can be attached to one backplane which provides all links for communications between nodes of different boards. Larger systems are assembled by connecting together several crates using external cables. Up to 32 boards can be hosted by one rack. Such a system of 512
358
Figure 1. Schematic view of the apeNEXT processor (left) and a sample global configuration with four boards and two front-end PCs (right).
nodes provides a peak performance of 0.8 TFlops. It has a footprint of about 1 m² only and its estimated power consumption is < 10 kW. Due to this very moderate power dissipation, air cooling is still possible. Accessing an apeNEXT system will be possible via a front-end PC with a custom designed host-interface PCI board. A simple, I2C-based network is foreseen for bootstrapping the machine and executing simple operating system requests, like program exit or error handling. I/O operations will be handled via high-speed data links connecting one or more boards of an apeNEXT system with one or more host-interface boards. This design concept allows the overall I/O bandwidth to be adjusted to the needs of the users. The external links are based on the same technology as the internal network and have a gross bandwidth of 200 MBytes/sec. Initial tests with a prototype board indicate that it will be possible to sustain this bandwidth across the 100 MHz, 64-bit PCI bus.
3. apeNEXT SOFTWARE
On apeNEXT two compilers will be available which allow code to be written in the high level languages TAO and C. TAO is a FORTRAN-like programming language based on a dynamical grammar. This grammar allows objects to be defined and functions to be overloaded in a rather simple way. For the programmer this has the advantage of being able to write code in a more natural way. For the compiler design, using objects makes it easier to do burst memory access, which is crucial for performance reasons. TAO was used to write programs for previous generations of APE machines, which can now be ported to apeNEXT with little effort. The C compiler is based on the freely available lcc compiler [4]. It supports most of the ANSI 89 standard with a few language extensions. In addition to the standard data types, variables of type complex and vector (a set of two doubles) are available. Further language extensions are related to the support of parallelization, e.g. the statements any(), all() and none() allow local conditions to be converted into global conditions. In TAO and C, elements of an array stored at a remote memory location can be accessed by specifying a remote offset, which encodes the direction into which local data has to be sent and from where data will be received. Whenever the memory controller is provided with a memory address with certain remote-offset bits, it will not load the addressed data into the local queue but write it to a send buffer instead. There is no call to a communication library, e.g. MPI, required to initiate a network transfer. Since the remote-offset bits are within the range of integers, they can be added to index variables used for
359 addressing arrays. Both compilers produce high level assembly instructions. The high level assembly is also suitable for user programming, since many odd details, like insertion of instruction load operations, jumping to the static instruction cache or executing subroutine calls, are handled by a program which converts high level to low level assembly. The low level assembly, which is already very machine specific, can be fed into the optimization toolkit sofan. A number of optimization procedures are postponed until this step since they are difficult to implement at the compiler level in a one-pass procedure. The following optimization steps are foreseen: merging normal operations, removing dead code, eliminating register move operations, and merging address generating operations. The (optimized) low level assembly is used as input for the microcode generator. During this final compilation step the assembly instructions are translated into (possibly several) microcode instructions. To optimize performance the instructions are scheduled as early as possible, taking all dependencies and temporization constraints into account. Finally, the microcode generator allocates the registers, compresses the microcode instructions and patches the jump addresses. The final piece of software needed to actually run a program on an apeNEXT machine is the operating system. The apeNEXT architecture will require a distributed operating system. For handling fast I/O requests, system routines running on the custom hardware itself will take care of moving data to or from the nodes which are connected to the external host-interface boards. The latter will be controlled by slave daemons running on the front-end PCs. These daemons poll the machine's registers, accessible via the I2C links, to check the machine state, e.g. to detect program halt or exceptions. They will also check for data on the incoming high-speed link to detect I/O requests, which will be handled without halting the machine (unlike in older APE systems). A master process will control the slave daemons running on the one or more front-end PCs connected to a particular apeNEXT system or partition.
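The conversion of local conditions into global ones performed by the any(), all() and none() statements mentioned above can be pictured with a generic message-passing analogue. The sketch below uses plain MPI boolean reductions purely for illustration; it is not the apeNEXT implementation, which relies on the dedicated hardware logic tree.

// Generic analogue of converting local conditions into global conditions
// (cf. the any()/all()/none() language extensions). Illustration only: plain
// MPI reductions stand in for the apeNEXT hardware mechanism.
#include <mpi.h>

bool global_any(bool local, MPI_Comm comm)   // condition holds on at least one node
{
    int in = local ? 1 : 0, out = 0;
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_LOR, comm);
    return out != 0;
}

bool global_all(bool local, MPI_Comm comm)   // condition holds on all nodes
{
    int in = local ? 1 : 0, out = 0;
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_LAND, comm);
    return out != 0;
}

bool global_none(bool local, MPI_Comm comm)  // condition holds on no node
{
    return !global_any(local, comm);
}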
4. BENCHMARKS
Functional simulation of the machine design, which is written in VHDL, allows realistic, cycle-accurate performance measurements. In the following we report results for benchmarks which allow the overall performance of lattice QCD simulations on apeNEXT to be assessed. For all benchmarks the time needed for instruction loading has not been taken into account, since it can reasonably be assumed that performance critical routines are executed many times. These routines will therefore be loaded only once and will be kept in the instruction buffer. Iterative matrix inversion algorithms, like conjugate gradient or BiCGstab, involve a number of linear algebra operations during each iteration. This includes the calculation of a scalar product of two (complex) vectors x and y, z = \sum_{i=0}^{N-1} x_i y_i. Since this operation requires loading 2 words for each iteration i, there is an upper limit for the sustained performance of 50%: the memory read operations require two clock cycles per iteration, while the arithmetic unit could sustain one iteration per clock cycle. This operation is therefore memory bandwidth limited. To get close to the performance limit a few optimization tricks have to be employed. For instance, to hide memory latency, burst memory access should be used for loading the vectors x and y. Reducing loop overhead requires loop unrolling. Finally, to hide the latency of the arithmetic unit (in this case 10 clock cycles) the summation should be split into partial sums. With reasonable choices for burst length and unrolling factor a sustained performance of 41% has been measured. The performance limit is not reached due to loop and pipeline overhead
360 (here: about 40 clock cycles). Higher optimization could be achieved by further loop unrolling at the cost of an increased number of required instructions. The calculation of the squared norm of a complex vector, on the other hand, requires loading only 1 word per iteration and performing 4 floating point operations. The apeNEXT processor will nevertheless have to execute a full normal operation, which is equivalent to 8 floating point operations. Therefore, the sustained performance is again limited to 50%. The actual measured performance for this operation is 37%. Another operation typical for lattice QCD applications is the calculation of linear combinations of two complex vectors, z_i = a x_i + y_i. Keeping the (complex) scalar variable a in the register file, this operation requires loading 2 words and storing 1 word for each iteration i. This operation is therefore again limited by the memory bandwidth, which restricts the sustained performance to a maximum value of 33%. We have measured a sustained performance of 29%. Both the calculation of a scalar product of two vectors and the squared vector norm require a final global summation of the results of the partial local sums. The time needed for this operation depends on the total number of nodes. To calculate the global sum we shift the local result by one step into direction k (k = x, y, z) and add it to the result of that processor. This procedure has to be repeated until the information from all nodes has been propagated to all other nodes. This requires P_x + P_y + P_z - 3 iterations, where P_k is the number of nodes in direction k. For achieving bit-identical results on all nodes the result from one node has to be broadcast to all other nodes, which basically amounts to repeating the summation procedure. Taking network latency and bandwidth into account, each iteration would require about 2 × 35 clock cycles. In actual simulations with up to 4 × 2 × 2 nodes we measured 71 cycles per iteration (neglecting electrical delays). For typical applications and partition sizes the time needed to calculate a global sum nevertheless remains relatively small.⁴ Let us finally consider the multiplication of a quark field by the hopping part of the fermion matrix, D, as defined in Eq. (1). For each lattice site this operation requires loading 168 words, storing 12 words and executing 1320 floating point operations. This operation requires read access to remote elements of the quark field.⁵ The performance therefore potentially depends on the ratio of the number of remote versus the number of local memory read operations. This ratio depends on the local volume. Assuming V = 16³ × 32 to be the smallest global lattice size used in simulations on apeNEXT, the smallest local lattice size is v = 16 × 2³, i.e. all sites are at the boundary and have to be communicated. For this worst case we measured a sustained performance of 55%. This remarkably high sustained performance was only achievable by making efficient use of the queue mechanism and the network. This eventually allows for almost complete overlap of memory load and store operations, communication and arithmetic operations, which means that this most important application scales very well on apeNEXT. To achieve this, all data including index fields were pre-fetched as early as possible. To increase the network throughput, consecutive network requests were initiated in orthogonal directions, because these can be executed independently.
⁴For a lattice of size V = 32³ × 64 the local scalar product of two quark fields (i.e. N = 12 V) requires O(10⁵) clock cycles on 8 × 8 × 8 nodes, while O(10³) clock cycles are needed to calculate the global sum.
⁵In typical simulations the number of updates of the gauge fields is significantly smaller than the number of multiplications with the hopping term. Communication of the gauge fields can therefore be neglected here.
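The latency-hiding trick used in the scalar-product benchmark above, splitting the accumulation into several independent partial sums, can be sketched generically in C++. The real apeNEXT kernel additionally relies on burst prefetching and the hardware normal operation; this is only an illustration of the technique.

// Generic sketch of splitting a complex dot product into independent partial
// sums so that consecutive multiply-adds do not stall on the arithmetic
// pipeline latency. The partial sums are combined at the end.
#include <complex>
#include <cstddef>

std::complex<double> dot(const std::complex<double>* x,
                         const std::complex<double>* y,
                         std::size_t n)
{
    const std::size_t K = 4;                   // number of independent accumulators
    std::complex<double> part[K] = {};
    std::size_t i = 0;
    for (; i + K <= n; i += K)                 // unrolled main loop
        for (std::size_t k = 0; k < K; ++k)
            part[k] += x[i + k] * y[i + k];
    for (; i < n; ++i)                         // remainder
        part[0] += x[i] * y[i];
    std::complex<double> z = 0.0;
    for (std::size_t k = 0; k < K; ++k)        // combine partial sums
        z += part[k];
    return z;
}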
361 5. OTHER ARCHITECTURES
In recent years other attempts have also been made to build machines optimized for simulations of lattice gauge theory, in particular QCD. One of the projects, QCDOC (QCD On a Chip), is also developing a fully custom designed machine [5]. The work is carried out by a collaboration of Columbia University, IBM Research, UKQCD, and the RIKEN-BNL Research Center. The QCDOC chip, similar to apeNEXT, combines all processor functionalities and the network capabilities on one ASIC. It is based on a PowerPC 440 core and its 64-bit floating point unit provides 1 GFlops peak performance. An interesting and performance relevant feature of the QCDOC processor is the 4 MBytes of on-chip memory. For key applications in lattice gauge theory the QCDOC collaboration measured sustained performance numbers which are very similar to those of apeNEXT. A number of groups are exploring the possibilities of cluster computers, typically using commodity PCs with IA32 processors. Significant progress has been achieved in recent years [6]. In particular, for Intel Pentium 4 processors it was possible to achieve remarkable sustained performance for relevant benchmarks by making extensive use of the memory-to-cache prefetch capabilities as well as the SSE registers and instructions. However, for currently available network technologies large messages are required to obtain high throughput and to reduce latencies. For a number of applications, scaling of the performance with an increasing number of nodes therefore becomes a problem. Furthermore, power consumption and space requirements are significantly higher compared to machines like apeNEXT and QCDOC. On the other hand, clusters require less technical know-how, offer more opportunities for developing portable software, and typically allow easy technology upgrades. In the case of small and medium sized installations these advantages are likely to prevail.
6. CONCLUSIONS AND OUTLOOK
We have discussed in detail the hardware design of apeNEXT and the software which will be available for this platform. The design has been finished, and prototype versions of all hardware components except the processor itself exist and have been tested. This includes the board, backplane and host-interface board. Prototype processors are planned to become available in autumn 2003. We expect the final procurement costs to be 0.5 €/MFlops peak performance. A functional model of the machine design has been used for benchmarking the architecture. We presented a detailed discussion of the results and showed that the actual design goals of the project have been reached. Benchmark programs indicate that it will be possible to run key lattice gauge theory kernels at a sustained performance of O(50%) or more. While apeNEXT is designed as a special purpose machine optimized for simulating QCD, we expect the architecture to be suitable for other applications as well. During the next years, groups doing research in lattice gauge theory in Europe aim for large dedicated installations of massively parallel computers. In Edinburgh (UK) a 10 TFlops QCDOC machine will be installed by 2004. In Italy it is planned to build up a 10 TFlops apeNEXT installation. In Germany researchers from various groups joined the Lattice Forum (LATFOR) initiative, which proposed a broad research program for the next years [7]. LATFOR assessed the required computing resources and proposed to aim for a 15 and a 10 TFlops installation of apeNEXT machines hosted by NIC at DESY (Zeuthen) and GSI (Darmstadt), respectively.
To make efficient use of these vast resources and to push towards development of future high
362 performance computing capabilities, an initiative on scientific parallel computing (SciParC) has recently been launched [8]. The aim of this project is to bring together different fields of science which share the need for large computing power on massively parallel machines, in order to develop portable interfaces and software as well as to exploit existing hardware architectures and design future hardware solutions.
ACKNOWLEDGEMENTS
We would like to thank W. Errico (INFN) for his important work in the early phases of the project, and A. Agarwal (DESY/Zeuthen) and T. Giorgino (INFN) for their contributions. REFERENCES
[1] R.G. Edwards, J.W. Negele, D.G. Richards, "Proceedings of the XXth International Symposium on Lattice Field Theory," Nucl. Phys. B (Proc. Suppl.) 119 (2003).
[2] R. Alfieri et al. [APE Collaboration], arXiv:hep-lat/0102011; R. Ammendola et al., arXiv:hep-lat/0211031; F. Bodin et al., arXiv:hep-lat/0306018.
[3] A. Ukawa [CP-PACS and JLQCD Collaborations], Nucl. Phys. Proc. Suppl. 106 (2002) 195; F. Farchioni et al. [qq+q Collaboration], arXiv:hep-lat/0209142.
[4] C.W. Fraser, D.R. Hanson, A Retargetable C Compiler: Design and Implementation, 1995.
[5] P.A. Boyle et al., Nucl. Phys. Proc. Suppl. 106 (2002) 177 [arXiv:hep-lat/0110124]; P.A. Boyle et al., arXiv:hep-lat/0210034; P.A. Boyle et al., arXiv:hep-lat/0306023.
[6] S. Gottlieb, Comput. Phys. Commun. 142 (2001) 43 [arXiv:hep-lat/0112026]; M. Lüscher, Nucl. Phys. Proc. Suppl. 106 (2002) 21 [arXiv:hep-lat/0110007]; Z. Fodor, S.D. Katz and G. Papp, Comput. Phys. Commun. 152 (2003) 121 [arXiv:hep-lat/0202030].
[7] R. Alkofer et al., http://www-zeuthen.desy.de/latfor.
[8] http://www-zeuthen.desy.de/sciparc.
363
The Parallel Model System LM-MUSCAT for Chemistry-Transport Simulations: Coupling Scheme, Parallelization and Applications
R. Wolke*, O. Knoth, O. Hellmuth, W. Schröder, and E. Renner
Institute for Tropospheric Research, Permoserstr. 15, 04318 Leipzig, Germany, Email: [email protected]
The physical and chemical processes in the atmosphere are very complex. They occur simultaneously, coupled and over a wide range of scales. These facts have to be taken into account in the numerical methods for the solution of the model equations. The numerical techniques should allow the use of different resolutions in space and also in time. Air quality models are based on mass balances described by systems of time-dependent, three-dimensional advection-diffusion-reaction equations. A parallel version of the multiscale chemistry-transport code MUSCAT is presented which is based on multiblock grid techniques and implicit-explicit (IMEX) time integration schemes. The meteorological fields are generated simultaneously by the non-hydrostatic meteorological model LM. Both codes run in parallel mode on a predefined number of processors and exchange information through an implemented coupler interface. The capability and performance of the model system are discussed for a "Berlioz" ozone episode.
1. INTRODUCTION
Atmospheric chemistry-transport models are useful tools for understanding pollutant dynamics in the atmosphere. Such air quality models are numerically expensive in terms of computing time. This is due to the fact that their resulting systems of ordinary differential equations (ODE) are non-linear, highly coupled and extremely stiff. In chemical terms, a stiff system occurs when the lifetimes of some species are many orders of magnitude smaller than the lifetimes of other species. Because explicit ODE solvers require numerous short time steps in order to maintain stability, most current techniques solve stiff ODEs implicitly or by implicit-explicit schemes [10]. To date, one limitation of schemes for the numerical solution of such systems has been their inability to solve the equations both quickly and with high accuracy in multiple grid cell models. This requires the use of fast parallel computers. Multiblock grid techniques and IMEX time integration schemes are well suited to benefit from the parallel architecture. A parallel version of the multiscale chemistry-transport code MUSCAT (MUltiScale Chemistry Aerosol Transport) [8, 7] is presented which is based on these techniques. The MUSCAT code is parallelised and has been tested on several computer systems. Presently, MUSCAT has an online coupling to the parallel, non-hydrostatic meteorological code LM [1], which is the operational regional forecast model of the German Weather Service. Both parallel
*The work was supported by the NIC Jülich, the DFG and the Ministry for Environment of Saxony. Furthermore, we thank the ZHR Dresden and the DWD Offenbach for the good cooperation.
364 codes work on their own predefined fraction of the available processors and have their own separate time step size control [9]. The coupling scheme simultaneously provides time-averaged wind fields and time-interpolated values of other meteorological fields (vertical exchange coefficient, temperature, humidity, density). Coupling between meteorology and chemistry-transport takes place at each horizontal advection time step only. In MUSCAT, the horizontal grid is subdivided into so-called "blocks". The code is parallelised by distributing these blocks on the available processors. This may lead to load imbalances, since each block has its own time step size control defined by the implicit time integrator. Therefore, well-suited dynamic load balancing is proposed and investigated. Finally, the parallel performance of the coupled system LM-MUSCAT and the benefit of the developed "load balancing" strategy is discussed for a realistic ozone scenario.
2. THE CHEMISTRY-TRANSPORT CODE MUSCAT
Air quality models are based on mass balances described by systems of time-dependent, three-dimensional advection-diffusion-reaction equations
\frac{\partial y}{\partial t} + \frac{\partial}{\partial x_1}(u_1 y) + \frac{\partial}{\partial x_2}(u_2 y) + \frac{\partial}{\partial x_3}(u_3 y) = \frac{\partial}{\partial x_3}\left( \rho K_z \frac{\partial (y/\rho)}{\partial x_3} \right) + Q + R(y)    (1)
y denotes a vector of species concentrations or aerosol characteristics to be predicted, and ρ is the density of the air. The wind field (u₁, u₂, u₃) and the vertical diffusion coefficient K_z are computed simultaneously by a meteorological model. The conversion term R represents the atmospheric chemical reactions and/or the aerosol-dynamical processes. Q denotes prescribed time-dependent emissions.
Multiblock Grid. In MUSCAT a static grid nesting technique [8, 7] is implemented. The horizontal grid is subdivided into so-called "blocks". Different resolutions can be used for individual subdomains in the multiblock approach, see Fig. 1. This allows a fine resolution for the description of the dispersion in urban regions and around large point sources. This structure originates from dividing an equidistant horizontal grid (usually the meteorological grid) into rectangular blocks of different size. By doubling or halving the refinement level, each block can be coarsened or refined separately. This is done on condition that the refinements of neighbouring blocks differ by one level at most. The maximum size of the refined or coarsened blocks is limited by a given maximum number of columns. The vertical grid is the same as in the meteorological model. The spatial discretization is performed by a finite-volume scheme on a staggered grid. Such schemes are known to be mass conservative because of the direct discretization of the integral form of the conservation laws. For the approximation of the surface integrals, point values of the mixing ratio μ = y/ρ and its first derivative are needed on the cell surfaces. To approximate the mixing ratio at the surface we implemented both a first order upwind and a biased third order upwind procedure with additional limiting [3]. This scheme has to be applied to the non-equidistant stencils which occur at the interface of blocks with different resolutions [7]. Fig. 2 illustrates the choice of the grid values involved in the interpolation formula for the two different upstream directions. In this case the grid cell has five cell wall interfaces. The fluxes are computed with the same time step for all resolutions.
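As an illustration of the finite-volume fluxes described above, the following minimal C++ sketch evaluates a first order upwind flux at a single cell face from the mass flux and the mixing ratios on either side. It is only a sketch under simplifying assumptions: the limited, biased third order scheme and the non-equidistant stencils at block interfaces used in MUSCAT are not reproduced here.

// Minimal first order upwind flux across one cell face: the mixing ratio is
// taken from the upstream side. mu = y/rho is the mixing ratio; U is the mass
// flux through the face. The production code additionally offers a limited,
// biased third order scheme and handles non-equidistant block-interface stencils.
double upwind_flux(double U, double mu_left, double mu_right)
{
    return U >= 0.0 ? U * mu_left : U * mu_right;
}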
365
,i "-U>O |
o -.,-U < 0
Figure 1. Multiblock grid.
Figure 2. Stencil for the advection operator at the interface between different resolutions: ● for U > 0, ○ for U < 0.
Time Integration. For the integration in time of the spatially discretized equation (1) we apply an IMEX scheme [8, 10]. This scheme uses explicit second order Runge-Kutta methods for the integration of the horizontal advection and an implicit method for the rest. The fluxes resulting from the horizontal advection are defined as a linear combination of the fluxes from the current and previous stages of the Runge-Kutta method. These horizontal fluxes are treated as "artificial" sources within the implicit integration. A change of the solution values as in conventional operator splitting is thus avoided. Within the implicit integration, the stiff chemistry and all vertical transport processes (turbulent diffusion, advection, deposition) are integrated in a coupled manner by the second order BDF method. We apply a modification of the code LSODE [2] with a special linear system solver and a restriction of the BDF order to 2. The nonlinear corrector iteration is performed by a Newton method in which the sparse linear systems are solved by linear Gauss-Seidel iterations. The time step control is the same as in the original LSODE code. The error control can lead to several implicit time steps per explicit step. Furthermore, different implicit step sizes may be generated in different blocks. The "large" explicit time step is chosen as a fraction of the maximum step size h_CFL allowed by the CFL condition. This fraction has to be determined for each Runge-Kutta method individually in order to guarantee stability and positivity. Higher order accuracy and stability conditions for this class of IMEX schemes are investigated in Knoth and Wolke [6].
Gas Phase Chemistry and Aerosol Dynamics. The chemical reaction systems are given in ASCII data files. For the task of reading and interpreting these chemical data we have developed a preprocessor. Its output file contains all data structures required for the computation of the chemical term R(y) and the corresponding Jacobian. Changes within the chemical mechanism or the replacement of the whole chemistry can be performed in a simple and comprehensible way. Time-resolved anthropogenic emissions are included in the model via point, area and line sources. A distinction is made between several emitting groups. For the simulation of the distribution of particulate matter an aerosol module was included. The particle size distribution and the aerosol-dynamical processes (condensation, coagulation, sedimentation and deposition) are described using the modal technique. The mass fractions of all particles within one mode are assumed to be identical.
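The coupling of the explicit and implicit parts can be pictured with the following C++ sketch, in which the horizontal flux divergences enter the implicit sub-integration as constant "artificial" source terms. All names, the placeholder function bodies and the simple two-stage structure are illustrative assumptions; they do not reproduce the exact MUSCAT scheme or its step size control.

// Schematic IMEX step: explicit Runge-Kutta stages provide horizontal flux
// divergences, which drive the implicit (BDF-2/LSODE-type) sub-integration as
// constant forcing terms. Placeholder bodies stand in for the real operators.
#include <cstddef>
#include <vector>

using State = std::vector<double>;   // concentrations of all species in all cells of a block

// Placeholder for the explicit part: horizontal flux divergence of a state.
State horizontal_flux_divergence(const State& y)
{
    return State(y.size(), 0.0);     // stand-in; the real code evaluates upwind fluxes
}

// Placeholder for the implicit part: chemistry + vertical transport over [t, t+dt]
// with 'source' treated as constant forcing. The real code uses a BDF-2 integrator
// with its own step size control; a single forward-Euler step stands in here.
State integrate_implicit(const State& y, const State& source, double /*t*/, double dt)
{
    State out(y);
    for (std::size_t i = 0; i < out.size(); ++i) out[i] += dt * source[i];
    return out;
}

State imex_step(const State& y, double t, double dt)
{
    // stage 1: fluxes at the old state drive an implicit half step
    State f1 = horizontal_flux_divergence(y);
    State y1 = integrate_implicit(y, f1, t, 0.5 * dt);

    // stage 2: fluxes recomputed at the stage value, combined with stage-1 fluxes
    State f2 = horizontal_flux_divergence(y1);
    State src(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) src[i] = 0.5 * (f1[i] + f2[i]);

    // full implicit step driven by the combined "artificial" sources
    return integrate_implicit(y, src, t, dt);
}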
366 Parallelization. Our parallelization approach is based on the distribution of blocks among the processors. Inter-processor communication is realized by means of MPI. The exchange of boundary data is organized as follows. Since the implicit integration does not treat horizontal processes, it can be processed in each column separately, using its own time step size control. An exchange of data over block boundaries is necessary only once during each Runge-Kutta substep. Each block needs the concentration values in one or two cell rows of its neighbours, according to the order of the advection scheme. The implementation of the boundary exchange is not straightforward because of the different resolutions of the blocks. The possibilities of one cell being assigned to two neighbouring cells or of two cells receiving the same value must be taken into account. We apply the technique of "extended arrays": the blocks use additional boundary stripes on which incoming data of neighbouring blocks can be stored. Hence, each processor only needs memory for the data of blocks that are assigned to it.
3. ONLINE-COUPLING TO THE PARALLEL METEOROLOGICAL MODEL LM
Figure 3. Communication structure in the model system LM-MUSCAT. The meteorological and the chemistry-transport algorithms have their own separate time step size control. The coupling procedure is adapted to the applied IMEX schemes in the chemistrytransport code. Each stage of the IMEX scheme requires a new calculation of the horizontal fluxes and an implicit integration cycle. The new mass conservative coupling provides constant time-averaged horizontal wind fields for the computation of the horizontal fluxes and timeinterpolated meteorological fields (vertical exchange coefficient and wind speed, temperature, humidity) during the implicit integration part. In this way, average temporal changes of the meteorological values are considered, even though the meteorological solver generally generates these variations with a more accurate time resolution. All meteorological fields are given with respect to the equidistant horizontal meteorological grid. They have to be averaged or interpolated from the base grid into the block-structured chemistry-transport grid with different resolutions. The velocity field is supplied by its normal components on the faces of each grid cell, and their corresponding contravariant mass flux
367
O~raphy and Grid m
i0~ 9~
6~ 5~
51
4~ 3~ ii!iii!i!iliiii~i~
!iiii!~i!!
49,
9
!0 !1i 12 13 14 Lo~ute [grd]
15
i
2~ 0 < 0
Figure 4. Orography and multiblock grid for the Berlioz simulation. components fulfill a discrete version of the continuity equation in each grid cell. This property has to be preserved for refined and coarsened cells. Because only the mass flux components are needed for the advective transport of a scalar, the necessary interpolation is carried out for these values. The interpolation is done recursively starting from the meteorological level. The same approach is applied to the coupling with the meteorological driver LM. Since LM solves a compressible version of the model equations an additional adjustment of the meteorological data is necessary. The velocity components are projected such that a discrete version of the continuity equation is satisfied. The main task of this projection is the solution of an elliptic equation by a preconditioned conjugate gradient method. This is also done in parallel on the LM processors. The projected wind fields and the other meteorological data are gathered by one of the LM processors. This processor communicates directly with each of the MUSCAT processors, see Fig. 3. 4. PARALLEL P E R F O R M A N C E AND LOAD BALANCING In this section, the parallel performance and the run time behaviour of the model system is discussed for a typical meteorological "summer smog" situation ("Berlioz ozone scenario"). The model area covers approximately 640 km x 640 km with a multiscale grid with resolutions of about 4 km-8 km-16 km, see Fig. 4. In the vertical direction the model domain is divided into 19 non-equidistant layers between the surface and a height of approximately 3000 m. The used chemical mechanism RACM considers 73 species and 237 reactions. Time-dependent emission data from both point and area sources are taken into account. All tests are run on a Cray T3E with a different number of processors. In all cases, 20 % of the available processors are used for the meteorological code LM, the others for MUSCAT.
368 The equidistant meteorological grid is decomposed in several subdomains which number corresponds with the number of LM processors. The MUSCAT runs are performed with two different prescribed error tolerances Tol of the BDF integrator. Dynamical Load Balancing. Consider a static partition where the blocks are distributed between the processors only once at the beginning of the program's run time. Here, we use the number of horizontal cells (i.e., of columns) as measure of the work load of the respective block. Therefore, the total number of horizontal cells of each processor is to be balanced. This is achieved by the grid-partitioning tool ParMETIS [4]. It optimizes both the balance of columns and the "edge cut", i.e., it takes care for short inter-processor border lines. In order to improve the load balance, techniques allowing for redistribution of blocks have been implemented. A block's work load is estimated using the numbers of Jacobian ( N j ) and function evaluations (NF) applied during a past time period. They represent measures for the expense of factorizing the system matrix and solving the resulting systems. The work load of processor P is defined by
a . - Z (Nf +
Nf),
(2)
BEP
B E P stands for the blocks currently located on this processor, NcB is the number of columns of block B. For repartitioning, we again use ParMETIS which is called if the ratio /
iilin ap/ max ap P
P
gets below of a certain critical value. According to the work loads of the blocks, ParMETIS searches for a better distribution, besides minimizing the movements of blocks. The communication required for the exchange of block data can be done by means of similar strategies as for the boundary exchange.
0,8
~0,6
0,4
20
0,2 0
20 number of processors
40
60
80
0
20
40
60 80 numberof processors 1
+
static, ToI=I.E-2
+
dyn , ToI=I.E-2
-~
static, Tol=I.E-3
-~
dyn , ToI=I.E-3 I
I
Figure 5. Parallel efficiency (left) and speedup (right) on the Cray T3E. The efficiency is related to the basic 1 processor run with Tol = 10 -2. Note that the sharper tolerance Tol = 10 -3 is often necessary in real applications.
Performance and Scalability. Only the work load of the MUSCAT processors is changed by the step size control and by the dynamic load balancing procedure. The parallel performance
369 of the model runs is presented in Fig. 5. Higher accuracy requirements increase the work load needed for the implicit integration in MUSCAT significantly. By using dynamic load balancing the CPU times can be reduced by a factor of about 0.7-0.9 in comparison to the static grid decomposition. The measured speed-up values show that the code is scalable up to 80 processors. In our experience, the amount for coupling is negligible. The idle times of the LM processors growth if the work load of the MUSCAT processors caused by tighter tolerances is increased. Therefore, the number of LM processors can be reduced especially in the runs with higher accuracy. However, the basic strategy consists in the avoidance of idle times on the MUSCAT processors side. The costs for the projection of the wind fields are comparable with that of the LM computations. But it seems that this work load can be reduced by more suitable termination criteria of the cg-method. REFERENCES
[1 ]
[2] [3] [4] [5] [6] [7] [8]
[9]
G. Doms and U. Sch~ittler, The Nonhydrostatic Limited-Area Model LM (Lokal Model) of DWD: I. Scientific Documentation (Version LM-F90 1.35), Deutscher Wetterdienst, Offenbach, 1999. A.C. Hindmarsh, ODEPACK, A systematized collection of ODE solvers, in Scientific Computing, North-Holland, Amsterdam, 1983, pp. 55-74. W. Hundsdorfer, B. Koren, M. van Loon and J.G. Verwer, A positive finite-difference advection scheme, Journal of Computational Physics, 117 (1995), pp. 35-46. G. Karypis, K. Schloegel and V. Kumar, ParMETIS. Parallel graph partitioning and sparse matrix ordering library. Version 2.0, University of Minnesota, 1998. O. Knoth and R. Wolke, Numerical methods for the solution of large kinetic systems, Appl. Numer. Math., 18 (1995), pp. 211-221. O. Knoth and R. Wolke, Implicit-explicit Runge-Kutta methods for computing atmospheric reactive flow, Appl. Numer. Math., 28 (1998), pp. 327-341. O. Knoth and R. Wolke, An explicit-implicit numerical approach for atmospheric chemistry-transport modeling, Atm. Environ., 32 (1998), pp. 1785-1797. R. Wolke and O. Knoth, Implicit-explicit Runge-Kutta methods applied to atmospheric chemistry-transport modelling, Environmental Modelling and Software, 15 (2000), pp. 711-719. R. Wolke, O. Hellmuth, O. Knoth, W. Schr6der, B. Heinrich, and E. Renner, The
chemistry-transport modeling system LM-MUSCAT: Description and CITYDELTA applications. Proceedings of the 26-th International Technical Meeting on Air Pollution and Its Apllication. Istanbul, May 2003, pp. 369-379. [ 10] J.G. Verwer, W.H. Hundsdorfer and J.G. Blom, Numerical time integration for air pollution models, Technical Report MAS-R9825, CW! Amsterdam, 1998, To appear in Surveys for Mathematics in Industry.
This Page Intentionally Left Blank
Parallel Computing: SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 ElsevierB.V. All rights reserved.
371
R e a l - t i m e V i s u a l i z a t i o n o f S m o k e t h r o u g h Parallelizations T. Vik a, A.C. Elster ~, and T. Hallgren ~ aNorwegian University of Science and Technology, Department of Computer and Information Science, NO-7491 Trondheim, Norway Animation/visualization of natural phenomena such as clouds, water, fire and smoke has long been a topic of interest in computer graphics, and following the increase in common processing power, researchers have started using physically based simulations to generate these animations. Smoke, and other fluids, is extremely hard to animate by hand, and simulation techniques producing realistic results are highly applicable and desirable in areas ranging from computer games to weather forecasts. Our work combines mathematical, graphical and programming techniques including several parallelization techniques in order to produce a real-time code that visualizes rolling and twirling smoke in three dimensions. The resulting 3D parallel implementation exploits dualCPU hardware in order to produce more than 20 image frames per second on today's modern desktop computers. 1. INTRODUCTION Early researchers in the field of visual fluid simulations produced realistic results by applying mathematically generated textures to simple geometric structures [4, 9]. This approach did not count for interaction with solid objects and left the animators with the tedious work of parameter tweaking. The models and algorithms presented in [3, 11, 2], all demonstrate a clear evolution within visual simulations of fluids: In 1997, Foster & Metaxas published their paper, Modeling the Motion of a Hot, Turbulent Gas [2]. They extracted the parts of Navier-Stokes' equations that affect the interesting visual features of smoke, such as convection and vorticity, thus creating a model capable of simulating realistic smoke on relatively coarse grids. Their model was only stable for small time steps because of an explicit integration scheme; large velocities required small timesteps, and the result was animations evolving too slowly. Stam was the first to present a stable algorithm solving the full Navier-Stokes equations [ 11] independently of the time steps and implemented a version that "allows an animator to design fluid-like motions in real time". This was achieved using a semi-Lagrangian approach together with implicit solvers, allowing for much larger time steps and a faster evolving simulation. Fedkiw, Stam and Jensen [3] pointed out that the first-order integration scheme used by Stam caused small-scale details to die out. Fedkiw et al. presented a model aiming for visual simulation of smoke, compared to the general visual fluid simulator of Stam. Their model was based on the work of Foster & Metaxas [2] and Stare [ 11], but featured some modifications for improved visual quality. They reported frame rates from 1-10 fps for their fastest, low-quality simulations.
372 For a real-time environment aiming for a visual imitation of the real world, these frame rates are, however, still not sufficient. We present a configuration of parameters, sub-algorithms and numerical solvers, all analyzed with respect to visual quality, computational efficiency and parallelism, eventually aiming for frame rates of 20 and above on today's modem desktop hardware. The visual results are subjectively judged by the author; the smoke should be experienced as realistic and "feel fight". 2. SIMULATION MODEL Our simulation model is based on the one presented by Fedkiw et al. [3], and assumes that smoke can be modelled as an inviscid incompressible fluid. The algorithm dices the computational doimain into uniform sub-cubes known as voxels and uses two 3D grids - one grid storing the data for the previous frame, the other the data for the current frame. Each grid defines the velocity vector u, and values such as density p, temperature T, and user-defined forces (e.g. wind). The algorithm consists of 5 numerical steps followed by two steps handling the lighting and rendering of the simulation data. The numerical equations for these steps are found in [3, 16]. 1. Update Velocities by scaling the forces, including the bouyancy force (gravity and effects of hot air rising) and user-defined forces, with the timestep At. 2. Seif-Advect velocities using a semi-Lagrangian (SL) method [ 11 ]. 3. Solve Poisson's Equation for the pressure using CG 4. Update (project) temporary velocities by subtracting the pressure gradient scaled by the time step in order to conserve mass. 5. Advect scalar values, i.e. update density p and temperature T using SL similarly as step 2
6. Calculate lighting 7. Render image 3. MODEL OPTIMIZATIONS In order to get the best possible performance, both optimizations of the original model and parallelizations are needed. To achieve this, a 2D prototype was developed, analyzed and profiled in order to pinpoint possible processing hotspots in the algorithm (See Figure 1). This version defined the velocities at voxel faces as in Fedkiw et al. [3], and showed that the two advection steps occupied about 90% of the CPU time. Our 3D implementation adopted the approach of Stam[ 11 ] and Foster et al.[2], with velocities defined at voxel centers, thus reducing the number of linearly interpolated off-grid values. This also helped in our parallel implementation.
3.1. Poisson's solver Solving Poisson's equation requires a linear solver handling sparse Symmetric Positive Definite (SPD) matrices in an efficient manner, e.g. relaxation methods [2, 14] or iterative solvers [6, 5, 3, 11, 15, 12].
373 Add
Advect scal~ value 46 ~
Selfadvect Velocity 44 % 3uild
Project 1% 5%
adient'ector 1%
Advect scal~ value 29 %
Add Selfadvect Velocity 32 %
Project 10% 10%
8%
Figure 1. Profiling results of 2D (left) and 3D (right) serial implementation. Rendering disabled, no SIMD extension used, 5 CG iterations. Fedk et. al use an incomplete Cholesky preconditioner as part of their Conjugate Gradients (CG) solver [ 12]. But, since the time-steps and changes between each frame in real-time smoke simulations are relatively small, we skip the pre-conditioner and use the solution from the previous frame as an initial guess. As a result, our code only uses 5 iterations to produce acceptable visual results. A detailed comparison of various PETSc [10] iterative solvers and our solver is presented in [ 16]. 4. PARALLELIZATION T E C H N I Q U E S Each computational step described in the previous section consists of one or more iterations over all the voxels in the domain. Such operations are well suited for parallelism simply by splitting the computational domain into smaller subdomains and assigning each of them to a node. In addition, the calculation of a frame depends on the results from Steps 1-6 (described in Section 2) of the previous frame. However, the rendering step - Step 7 - of the previous frame may overlap with the computations of the current frame. We implemented the algorithm as one data-parallel and one modified data parallel version, where the latter also supports pipelining techniques.
4.1. Data parallel implementation This version divides the different iterative constructs of the algorithm into smaller domains and distributes them to the different nodes. The thread-administration and-synchronization were performed using Win32 API calls, and obviously adds some overhead. However, when running the simulation on relatively small number of CPUs (2 - 4), this overhead should be negligible. Section 5.1 presents benchmarks and analysis of the data-parallel approach, and shows a significant speedup compared to the serial, but also uncovers that one of the CPUs is left idle while the other is running the OpenGL rendering code.
4.2. Pipelined implementation As mentioned, steps from two different frames can be interleaved, as their dependencies do not require them to be processed in exact sequential order. Our implementation exploits this independency by using a round-robin queue to distribute jobs among the worker threads and administrate the dependencies. The efficiency of this technique depends on the performance of the graphics card, or more
374 accurately, how much time that is spent performing rendering compared to the time spent performing numerical simulation and lighting. Optimal performance is thus achieved when these two time slots are equal. If the rendering step completes too quickly, the other CPU(s) will not be able to process anything of the following frame. In this case, the pipelined approach will probably slow down the system due to the overhead added through the queueing system. 4.3. Vectorization on Intel CPUs The recent Intel families of processors include a set of SIMD instructions, known as SSE (Streaming SIMD Extension) which process values in groups of four. Hence, these processors can theoretically perform up to four times as many calculations per clock cycle. We modified most of the simulator to exploit these SSE vector instructions, and experienced about 10% average speedup. Some parts of the algorithm proved to be better suited for vectorization than others, especially the CG method. The more complex steps, such as the advections, require linear interpolation of off-grid values using eight different on-grid values not placed contiguously in memory. An SSE version of these steps hence requires the values to be gathered into special vector data types in groups of four, and adds overhead to the code. SIMD instructions may increase the memory bus usage significantly compared to standard floating-point operations since the bus should deliver more data to keep up with the parallel computations. 4.4. Parallelization of lighting calculations The lighting calculations cannot be treated as ordinary iterative operations, since the order of traversal (of the voxels) depends on the light's direction and where the rays enter the volume. We parallelized Fedkiw et al's lighting model [3] by grouping the light rays into distinct sets and assigning each set to a thread. Our model supports parallel light rays entering the volume from an arbitrary angle.
5. RESULTS 5.1. Results for parallel implementations Figure 2 shows the speedup gained by our data-parallel implementation, compared to the serial version and pipelined version. The chart shows that there is indeed a speedup with the data parallel implementation, varying from 20% to 60% depending on the simulation parameters. Still, 60% speedup when moving from one to two CPUs is far from linear scaling. A partial explanation is that one thread is performing rendering while the other threads stall for some time, waiting for the first to finish. This is confirmed by the numbers reported by the MS Task Manager, giving a CPU utilization ranging from 60% for coarse grids to 90% for fine grids. However, even with rendering disabled, the speedup is still not higher than about 66%, indicating that there are other, more substantial, explanations. Some of the processing power is of course lost due to overhead added when our code administrates and distributes tasks among the worker threads. Still, this does not explain all of the wasted 34conditions for a shared resource somewhere in the system, and tests point out the memory bus as being the probable cause. The data required by a CPU to fulfill it's work is way larger than it's local cache, and the memory bus is unable to deliver data fast enough for the program to scale linearly. Figure 2 also illustrates the evident speedup gained by the pipelined approach, compared to the data-parallel, but the efficiency is most likely still limited by memory bus bandwidth.
375
i
f Figure 2. The speedup of data parallel and pipelined implementation vs. serial. 5.2. Visuals
Figure 3. Frames [60, 110, 160] of our parallel simulation, 16xl 6x64 grid, 5 CG iterations, 40 fps. Figure 3 presents a series of pictures from our implementation. It runs about 40 frames per second on a Dual Intel Xeon 1.7 GHz, simulating 16 384 voxels. The smoke moves in a realistic fashion and interacts correctly with the obstacle present in the domain, but suffers from the lack of small-scale detail pointed out by Fedkiw et al. [3]. Figure 4 presents a simulation with the exact same parameters as Figure 3, but features eight times as many voxels (131 072). The detail is greatly improved, but the result is only 4 frames per second. Figure 5 illustrates the effect of introducing small, random forces into the domain, and show, in comparison with Figure 4, that the smoke looks less "smooth" and for some situations more realistic.
376
Figure 4. Frames [60, 110, 160] of our parallel simulation, 32x32x128 grid, 5 CG iterations, 4.1 fps.
Figure 5. Frame 60, 110 and 160 of a smoke simulation, grid sized 32x32x128, 5 CG iterations, fps = 4.1. Random forces. 6. CONCLUSION Our results showed that it is possible to run an optimized implementation of the algorithm outlined by Fedkiw et al.[3] that yields 5 to 40 frames per second on today's modem PC hardware. Algorithmic optimizations, parameter tuning and efficient programming techniques made the implementation capable of animating smoke, still with a satisfactory level of realism and speed in real-time.
377 First we modified the algorithm by sacrificing accuracy for speed through the following: - Defining the velocity at voxel centers instead ofvoxelfaces - Fast matrix-free CG (no preconditioner) - Leaving out the vorticity confinement
These three decisions unquestionably reduced the visual quality of the smoke, a fact supported by a comparison of the visual results presented by Fedkiw et al.[3] with the visual results of our implementation (see section 5.2). Nevertheless, the modifications also reduced the computational complexity of the algorithm, thus making it possible to calculate significantly more frames per second. It is interesting to notice that the number of CG iterations made little or no difference on the visual simulations; a surprising result when seen in the light of earlier discussions [3]. This effect should be investigated further. Also, as our solver at the current time only occupies about 5it is not a bottleneck neither when it comes to visual realism nor efficiency. However, if other parts of the algorithm are modified, altemative solvers should be investigated for possible improvements in both realism and efficiency. The speedups gained by parallelizing our simulation model proved it very suitable for parallelism, but also revealed the memory bus as being a limitation for memory intensive tasks. The pipelined approach presented in 4.2 introduced a technique for dynamic job distribution and pipelining, thus reducing the penalty caused by serial sub-tasks such as rendering. The technique of pipelining is well-known, especially within computer hardware engineering, but our results showed its relevance within the field of CFD visualization. Since most of today's SMP systems still share a common graphics card, the technique should also apply to a wide range of visualization software. The results also showed that certain parts of the algorithm gained significant speedups as a result of Intel's SSE instructions, in particular the CG method. Whether it is possible or not to run a visual CFD simulation in real-time, depends on the desired visual accuracy and quality. Higher software and algorithmic efficiency will allow for higher accuracy on a given hardware platform, but the numerical simulations will always have a potential for improvement. REFERENCES
[1] Peter Bartello and Stephen J. Thomas. The Cost-Effectiveness of Semi-Lagrangian Ad-
[21 [3] [4]
[5] [6]
vection. Dr. P. Bartello, RPN, 2121, voie de Service nord, Route Transcanadienne, Dorval (Quebec) H9P 1J3. N. Foster and D. Metaxas. Modeling the Motion of a Hot, Turbulent Gas. In SIGGRAPH 97 Conference Proceedings, Annual Conference Series, pages 181-188, August 1997. R. Fedkiw, J. Stam, and H. W. Jensen. Visual Simulation of Smoke. In SIGGRAPH 2001 Conference Proceedings, Annual Conference Series, pages 15-22, August 2001 G. Y. Gardner. Visual Simulation of Clouds. Computer Graphics (SIGGRAPH 85 Conference Proceedings), 19(3):297-384, July 1985. Louis A. Hageman and David M. Young, Applied Iterative Methods, Academic Press, New York, 1981. David R. Kincaid and Anne C. Elster co-editors. Iterative Methods in Scientific Computation II, IMACS Series in Computational and Applied Mathematics Volume 5, IMACS, 1999.
378 [7] [8] [9] [10] [11] [12]
[13] [ 14]
[15]
[16]
OpenGL, http://www.opengl.org OpenMP, http://www.openmp.org K. Perlin. An Image Synthesizer. Computer Graphics (SIGGRAPH85 Conference Proceedings), 19(3):287-296, July 1985. PETSc, http://www-unix.mcs.anl.gov/petsc/petsc-2/ J. Stam. Stable Fluids. In SIGGRAPH 99 Conference Proceedings, Annual Conference Series, pages 121-128, August 1999. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain Edition 1.25. Jonathan Richard Shewchuk, August 4., 1994, http://www-2.cs.cmu.edu/jrs/jrspapers.html C. M. Stein, N L Max. A Particle-Based Model for Water Simulation. (SIGGRAPH 98 Conference, july 19-24). Fred T. Tracy. A Comparison of a Relaxation Solver and a Preconditioned Conjugate Gradient Solver in Parallel Finite Element Groundwater Computations. Engineer Research and Development Center, Major Shared Resource Center. Henk A. Van Der Vorst. Lecture Notes on iterative methods. Mathematical Institute, University of Utrecht, Budapestlaan, the Netherlands. June 4, 1994. http ://www.hpcmo.hpc.mil/Htdocs/UGC/UGC01/paper/fred_tracy_paper.pdf Torbjom Vik, Real-time visual simulation of Smoke, Master thesis at department of Computer and Information Science, NTNU, http://www.tihlde.org/~ torbjorv/diploml 5.pdf
Parallel Computing: Software Technology,Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
379
Parallel Simulation of Cavitated Flows in High Pressure Systems P.A. Adamidis ~, F. Wrona b, U. Iben b, R. Rabenseifner ~, and C.-D. Munz c ~High-Performance Computing-Center Stuttgart, Allmandring 30, D-70550 Stuttgart, Germany bRobert Bosch GmbH, Dept. FV/FLM, RO. Box 106050, D-70059 Stuttgart Clnstitute for Aero- and Gasdynamics (lAG), Pfaffenwaldring 21, D-70550 Stuttgart, Germany This paper deals with the parallel numerical simulation of cavitating flows. The governing equations are the compressible, time dependent Euler equations for a homogeneous two-phase mixture. The equations of state for the density and internal energy are more complicated than for the ideal gas. These equations are solved by an explicit finite volume approach. After each time step fluid properties, namely pressure and temperature, must be obtained iteratively for each cell. The iteration process takes much time, particularly if in the cell cavitation occurs. For this reason the algorithm has been parallelized by domain decomposition. In case where different sizes of cavitated regions occur on the different processes a huge load imbalance problem arises. In this paper a new dynamic load balancing algorithm is presented, which solves this problem efficiently. 1. I N T R O D U C T I O N Cavitation is the physical phenomenon of phase transition from liquid to vapor and occurs in a vast field of hydrodynamic applications. The reason for fluid evaporation is that the pressure drops beneath a certain threshold, the so called steam pressure. Once generated, the vapor fragments can be transported through the whole fluid domain, as been depicted in Fig. 1. Finally they are often destroyed at rigid walls which leads to damages. Cavitation can also occur in various forms. Sometimes, they appear as large gas bubbles like in cooling systems of power plants. The work in this paper is extended for high pressure injection systems, where the fluid is expanded from a pressure of 1800 bar to nearly 10 bar. In such systems cavitation occurs as small vapor pockets and clouds. Due to the structure of cavities, the assumpFigure 1: Cavity formation behind a backward-facing tion that the flow field is homogenous, step i.e. pressure, temperature and velocity of both phases are the same, is justified.
380 The more complicated challenge is to model the cavitation process, in a fashion that it is valid for all pressure levels, which can occur in such injection systems. Therefore the fluid properties are described by ordinary equations of state [1, 2]. Among other things, this guarantees that the most important fluid property for modeling cavitation, e.g. the pressure, is always positive. As mentioned before, the complete flow field is treated as compressible, even if the fluid does not evaporate. Additionally, enormous changes in the magnitude of all flow properties occur, if the fluid is cavitating. Hence, the governing equations can only be treated explicitly, but this limits the time step size strongly. Moreover, some fluid properties can just be obtained iteratively. Wherever cavitation occurs, this iterative process takes much more time than for the non cavitating case. This leads to very large computation times if one wish to solve larger problems. Due to these facts parallization of the algorithm is unavoidable. But for an efficient parallel run the load imbalance problem, introduced by the cavitating cells, has to be solved. In the next section, the governing equations are presented. In section 3, the numerical algorithm and in section 4, the parallelization and the solution of the load imbalance problem are described. Finally in section 5, results of the parallel tests are presented. 2. GOVERNING EQUATIONS For showing the main difficulty in the parallel running case, viscous terms can be neglected. Further, the regarded fluid is water and all necessary functions can be obtained from [3]. Therefore the governing equations are the two dimensional Euler equations, symbolically written as
(1)
ut + f(u)x + g(u) v : 0, with
u =
v pw
f(u) =
E
pv PV2 + p pvw
g(u) -
v(E + p)
pw pvw Pw 2 + P
.
(2)
+ p)
Here derivatives are denoted by an index. In Eq. (2) p is the density, of the homogeneous mixture, v and w the velocity in x-, respectively in y-direction. Further the property E is introduced, which describes the total energy p(e + 1/2(v 2 + w2)) per unit volume. The density and the internal energy e are functions of the pressure p and the temperature T. The density and the internal energy are expressed as mixture properties _ _1_ # ~ P PG
1 -
#
and
e--peG+(1--p)eL.
(3)
PL
Here the properties with the subscript G are the one of the gaseous phase and with the subscript L of the liquid phase. The gaseous phase is treated as ideal gas and, the functions of the liquid phase are obtained from the IAPWS97 [3]. The mass fraction is defined by mG P-- mc+mL'
(4)
381 and describes the fractional mass portion of the gaseous phase to the total mass in a cell. Finally to close the system the mass fraction must also be expressed as a function of pressure and temperature. With the assumption that the mixture of liquid and gas, also called steam, is always in thermodynamical equilibrium and that the fluid evaporates at constant entropy, the mass fraction can be expressed as
,0, T)
- h(p, T) - h'O) h"O)
- h'(p)
"
(5)
The enthalpies h, h" and h' can also be obtained from the IAPWS97 [3]. If the actual enthalpy h is less than the enthalpy h' at the boiling line, # is set to zero, because the fluid is outside of the two-phase regime and consists of pure liquid. This happens, when the pressure is greater than the steam pressure Psteam, which depends only on the temperature. Similar to the mass fraction a void fraction can be defined, and can be calculated from Eq. (3); it can be used to detect cavitating cells: s=
Vc
VG + VL
and
s=#--.
p
(6)
PG
3. NUMERICAL SOLUTION For solving the conservation of mass, momentum and energy numerically, a transient finite volume approach is used, which calculates the system of conservative variables at each time step explicitly. The underlying mesh is unstructured and consists of triangles and rectangles. Therefore Eq. (1) is discretized as u~+ ~
9
:
At
n
Z
j6A/(i)
C(u
u~)lij,
~, n f ( u i
(7)
where N'(i) are the set of the neighbor cells of the ith cell and 9i its volume. s
u~) de-
notes an operator for different time integration methods. The numerical flux f depends on the conservative variables in the ith cell and its neighbors. This fluxes are calculated by approximate Riemann solvers, namely the HLLC-Solver [4]. Note that Eq. (7) is a strongly simplified presentation of the whole method. For more information refer [5]. After the flux calculation and the update of the conservative variables from uin to u n~+ l , the primitive variables must be computed for each cell. For some properties this is very easy, because they can be obtained analytically (the subscripts i and superscripts n + 1 are dropped for convenience)
p-u1,
u2 v=--, p
w---
u3 p
and
e-
u4 p
1 (v 2 + w 2 ) . 2
(8)
Now the pressure and the temperature for every cell is needed. But as mentioned before they can only be computed iteratively. This can be done by an iteration for the formulas in Eq. (3), because the internal energy and the density are already known
hl(p,T)=
1
#
P
PG
#
1 PL
=0
and
h2(p,T)-c-pec-(1-#)eL=O.
(9)
382 At the beginning h l is iterated with fixed pressure for the new temperature by a bisection method. Afterward, with the new value for the temperature, h2 is iterated for the new pressure also by a bisection. These two steps are repeated until both values converge. Unfortunately these two values converge slower if cavitation arises in the cell.
4. P A R A L L E L A L G O R I T H M The iterative method described above has been I Initial Meshpartioningwith METIS I proven to be the most CPU time consuming part of the I Determine Global Timestep m whole simulation. Especially, the time needed for cells MPI_Allreduce m in which cavitation occurs is much more than the one l I Calculate Numerical Fluxes ] needed by the others, because there are more iterations (using halo data) executed in order to determine pressure and temperTime integration Calculate the new conservative ature. This leads to enormous computation times for variables large realistic problems, which are in the range of several days or weeks. To solve such problems, parallelization of the algorithm is unavoidable, thus taking advantage of more processors and memory space of a parallel computing environment. As mentioned before, the numerical Calculating primitive variables I method is an explicit scheme. The natural approach + m to parallelize such an algorithm is domain decomposin Update of the halo cells m tion. The whole computational domain is decomposed m Communicatingwith MPI into several subdomains equal to the number of available processors. This initial partitioning is done using Figure 2" Flow chart of parallel algothe tool METIS [9]. rithm The whole algorithm consists of several parts, as can be seen in Fig. 2. The calculation of the fluxes is carfled out by each processor on the subdomain which has been assigned to it. For calculations on cells lying on the artificial boundary of each subdomain, data from their neighboring cells are needed. Because these neighbors belong to adjacent subdomains, the subdomains are expanded. The new additional layer, the so called halo layer, contains the aforementioned neighboring cells. The data for the halo cells is being communicated using MPI [8]. However, the appearance of cavitation affects the iterative calculation of pressure and temperature in the time integration part in two ways. On one hand, the cavitating cells are not distributed homogeneously over the whole computational domain, which means that subdomains having more such cells need much more CPU time to finish their calculations, so that subdomains with less cells have to wait for them. This causes a very significant load imbalance. On the other hand, the locations of the cavities move across subdomains. The approach followed in this work keeps the same initial partitioning, and reassigns only the work done in the iterative process of determining pressure and temperature from partitions with more load to partitions with less load, see shaded boxes in Fig. 2. Thereby, only the information needed to determine pressure and temperature of a cell is moved to another process and the result is recollected to the original location of the cell, because the cell itself is not moved to the other process. In particular, first we store the iteration time t~ needed by each cell i. Next, the
--1
383 iteration time tpj, spend for each original partition j is determined by adding up the iteration time of its cells ncellj (10) tpj = E ti , i=1
where ncellj is the number of cells in partition j. The optimal CPU time distribution would now be nprocs _
1
topt - nprocs ~
tpj
(11)
j=l
where nprocs stands for the number of processes. With the assumption that in the next time step the cell times are slightly different we determine the time difference tdiff,j
-- tpj - topt,
(12)
for each process. Depending on whether tdiff,j is greater than, less than or equal zero, the ith process is going to migrate part of the calculations done on its cells, or will be a receiver of such a workload, or will not reassign any work. In this way processes are classified as "senders" and "receivers". Sending processes, are sending workload to the receiving processes until all processes have approximately reached the optimum CPU time topt. In each time step, the above described decision-making is based upon the CPU times measured in the previous time step. With this migration, only a small part of the cell-information must be sent. All other cell-information remains at the original owner of the cell. Halo data is not needed, because the time-consuming iterative calculation of pressure and temperature is done locally on each cell. Each process calculates its own tpj value. This values are communicated by an all-gather operation to the other processes. Thus, each process determines the number, the indices of the cells and the receiving processes to which it will send the necessary information in order to have pressure and temperature calculated. After this calculation the results are sent back to the owners of the cells, and the remaining of the calculations is executed on the initial partitioning of the mesh. The computation time needed for a cavitating cell is about 1 ms and only about the fourth part is needed for a non cavitating cell, on an Intel Xeon 2GHz processor. Each sending process must send 32 Bytes (4 doubles) of data per cell to the receiving processes, and must receive 16 Bytes (2 doubles) of results. This means that 48 Bytes per cell must be communicated. Due to the small number of bytes for each cell, the approach of transferring workload implies only a very small communication overhead. On a 100 Mbit/s Ethemet a communication time of about 4.3 #s is needed. Whereas, treating the introduced load imbalance by repartitioning the whole mesh, or by diffusion schemes, as it is done by state of the art strategies [6, 7] means that cells would have to be migrated to other partitions, and the amount of data which would have to be communicated, is 1536 Bytes (32 integers and 176 doubles) per cell. This means a communication time of about 140 #s on a 100 Mbit/s Ethernet, which is 32 times more than with our approach. Considering the fact that this redestribution would have to be carried out in each time step, due to the steadily chainging number of cavitating cells, it is obvious that the overhead introduced by the entire redistribution of cells is much more than the overhead of our strategy, which temporarily redistributes partial cell information. In this way, we also avoid expanding the halo layer.
384 Table 1 CPU Time in seconds and Speedup 1 [4r Prozesse not loadbalanced
[
2
4
6
8
12
16
CPU Time 9622.3415685.98 3620.49 2748.64 2180.57 1682.86 1356.80 speedup .... i I 1.692 2.658 3.501 4.413 5.718 7.092
load balanced CPU Time void fraction Speedup
-
load balanced CPUTime Speedup CPU time
-
,
5112.81 2816.00 1920.45 1455.54 1022.48 793.907 1.882 3.417 5.010 6.611 9.411 12.120 4990.32 2574.45 1760.46 1341.62 935.26 1.928 3.738 5.466 7.172 10.288
735.52 13.082
5. RESULTS As benchmark, a shock tube problem is defined. The tube has a length of one meter. The mesh is a Cartesian grid with ten cells in ydirection and 1000 in x-direction. The comFigure 3" Schematic illustration of the bench- putational domain is initialized as illustrated mark in Fig. 3. It consists of a cavity, which is embedded in pure liquid. This can be managed by setting the whole flow field with the same temperature and the same pressure, except in the cavitated region the pressure is chosen that it is below the steam pressure. The cavity has a thickness of 0. l m and its beginning is positioned at 0.2m after the tube entry. Further the flow field is set up with a velocity v0 in the right direction and in the cavitated region with Vl vopo/pl. If the velocity is set up near to the speed of sound, the cavity must be transported in the right direction. Therefore it is an excellent benchmark for checking the load balance algorithm, because the position of the cavity moves from subdomain to subdomain. Results and the domain decomposition for a parallel run with eight processors are shown in Fig. 5. The computing platform is a cluster consisting 1oo .... ~A ........................... ! ......... ! ............... ! ............. of dual-CPU PCs, with Intel Xeon 2 GHz prov. ~ ~ _ : ! i i ! cessors. The results are summarized in Tab. 1 and Fig. 4 for several parallel runs. Further, they are compared with the load balancing approach, we used in [ 10] where the decision-making was based on the void fraction. From these results it is obvious that the gain in efficiency, achieved by the algorithm of this work is greater than the preil vious one. Without the scpecific load balancing ~o ....... i ........ : ......... i....... : i o ~ P ' d d ; ~ m C ~ ~ to handle cavitating cells, only a poor efficiency 0 2 4 6 8 10 12 14 16 Number of Processes (58% to 44% for 6 to 16 processors) could be reached. With the new approach the efficiency Figure 4" Efficiency is over 91% up to 6 processors and remains over 81% up to 16 processors. Whereas, using the void fraction approach the efficiency drops below 84% when using 6 processors and below 76% at 16 processors.
P0>/hte.,~,v0
=
385
/ / Figure 5. Void fraction (bright spots) and domain decomposition of the benchmark 6. CONCLUSION In many industrial applications cavitation occurs and cannot be calculated detailed in a proper response time on a single processor computer. In order to reduce this time the use of parallel computing architectures is necessary. Nevertheless, the parallel algorithm has to deal with load imbalance introduced by the cavities. The most time consuming part of the algorithm is the iterative calculation of pressure and temperature. In this work a new load balancing algorithm has been developed, in which not cells, but the work done on the cells during determination of pressure and temperature, is redistributed across processors. The initial partitioning of the mesh is not changed. The decision-making for migrating workload is based upon the CPU time needed by a process to calculate pressure and temperature. The results of this algorithm show that there is a significant gain in efficiency. With the new approach it is possible to treat industrial problems, in the area of simulating the flow in injection systems, in a reasonable response time. REFERENCES
[1] U. Iben, F. Wrona, C.-D. Munz, M. Beck, Cavitation in Hydraulic Tools Based on Thermodynamic Properties of Liquid and Gas, Journal of Fluids Engineering, 2002, Vol. 124, No. 4, pp. 1011-1017. [2] R. Saurel, J.R Cocchi, EB. Butler, Numerical Study of Cavitation in the Wake of a Hypervelocity Underwater Projectile, J. Prop. Pow., 1999, Vol. 15, No 4. [3] W. Wagner et. al., The IAPWS Industrial Formulation 1997 for the Thermodynamic Properties of Water and Steam, J. Eng. Gas Turbines and Power, 2000, Vol. 12. [4] P. Batten, N. Clarke, C. Lambert, D.M. Causon, On the choice of wavespeeds for the HLLC Riemann solver, SIAM J. Sci. Comp., 1997, Vol. 18, No. 6, pp. 1553-1570. [5] E.F. Toro, Riemann Solvers and Numerical Methods for Fluid Dynamics, Springer New York Berlin Heidelberg, 1997. [6] C. Walshaw, M. Cross, M. G. Everett, Parallel Dynamic Graph Partitioning for Adaptive Unstructured Meshes, Journal Parallel Distrib. Comput, pp. 102-108, Vol.47, No. 2, 1997. [7] Kirk Schloegel, George Karypis, Vipin Kumar, Multilevel Diffusion Schemes for Repartitioning of Adaptive Meshes, Journal of Parallel and Distributed Computing, Vol. 47, pp. 109-124, 1997. [8] M. Snir, S. Otto, S. Huss-Ledermann, D. Walker, J. Dongarra MPI The Complete Reference, The MIT Press, 1996. [91 G. Karypis, V. Kumar, METIS A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Uni-
386 versity of Minnesota, Department of Computer Science / Army HPC Research Center, 1998. [ 10] Frank Wrona, Panagiotis A. Adamidis, Uwe Iben, Rolf Rabenseifner, Dynamic Load Balancing for the Parallel Simulation of Cavitating Flows, In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Jack Dongarra, D. Laforenza, and S. Orlando (Eds.), Proceedings of the 10th European PVM/MPI Users' Group Meeting, EuroPVM/MPI 2003, Sep. 29 - Oct. 2, Venice, Italy.
Parallel Computing: SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
387
Improvements in black hole detection using parallelism* F. Almeida a, E. Mediavilla b, A. Oscoz b, and F. de Sande a ~Departamento de Estadistica, I. O. y Computaci6n, c/Astrofisico Francisco Sfinchez, Universidad de La Laguna, E-38271 La Laguna, Spain, ( f a l m e i d a , fsande) @ull. es blnstituto de Astrofisica de Canarias (IAC), c/Via Lfictea s/n, E-38271 La Laguna, Spain, (emg, a o s c o z ) @ l l . i a c . e s Very frequently there is a divorce between computer scientists and researchers in some other scientific disciplines. The first group have the knowledge and the second one have the problems and the needs of high-performance computing techniques. This work presents the result of a collaboration where we have applied parallelism to improve performance of an astrophysics code used to measure different properties of the accretion disk of a black hole. Several parallelizations have been developed varying from message-passing to shared-memory and hybrid solutions. We present computational results using a CC-NUMA SGI Origin 3000. 1. INTRODUCTION The use of tools and languages that exploit High Performance computers and resources is nowadays limited. The main reason for this limitation is the amount of knowledge and skills that are necessary to exploit these resources. Usually the community of technicians and scientists having needs of high performance computing is not interested in learning new sophisticated tools or languages. The techniques and expertise in which they are mostly interested are those in their own field. This work collects the experiences of a collaboration between researchers coming from two different fields: astrophysics and parallel computing. We deal with different approaches to the parallelization of a scientific code that solves an important problem in astrophysics, the detection of supermassive black holes. The IAC co-authors developed a Fortran77 code solving the problem. The execution time for this original code was not acceptable to deal with instances of the problem with scientific interest. This situation motivated them to contact with researchers with expertise in the parallel computing field. We know in advance that these astrophysics scientist programmer deal with intense time-consuming sequential codes. These codes are not difficult to tackle using parallel techniques, but researchers with a purely scientific background are not interested at all in such techniques. To be more precise, we deal with a black hole detection problem. Supermassive black holes (SMBH) are supposed to exist in the nucleus of many if not all the galaxies. The understanding of the energy machines associated to SMBHs is crucial to increase our general knowledge *This work has been partially supported by the EC (FEDER) and the Spanish MCyT (Plan Nacional de I+D+I, TIC2002-04498-C05-05 and TIC2002-04400-C03-03)
388 about black holes and investigate the formation and evolution of galaxies. To accomplish it, light curves of Quasistellar Object (QSO) images must be analytically modeled. The scientific aim is to find the values that minimize the error between this theoretical model and the observational data according to a chi-square criterion. The robustness of this procedure depends on the sampling size m. However, even for relatively small values of m, the determination of the minimum takes a long time of execution since m 5 starting points of the grid must be considered. That strongly limits in practice not only to obtain better sampled evaluations but also to study the dependence of the solution with m in order to check the consistency of the procedure and the result. We will show that parallelization can reduce the total time of execution, allowing to greatly increase the sampling and, consequently, to improve the robustness of the evaluated minimum. Since the minimization process can be applied independently to each of the points in the domain, parallelism appears as a natural choice to broach the problem. The sequential code is best suited for different parallel implementations varying from message-passing to shared-memory codes and hybrid solutions combining both. In the last years OpenMP [ 1] and MPI [2] have been universally accepted as the standard tools to develop parallel applications. OpenMP is a standard for shared memory programming. It uses a fork-join model and is mainly based on compiler directives that are added to the code that indicate the compiler regions of code to be executed in parallel. MPI uses an SPMD model. Processes can read and write only to their respective local memory. Data are copied across local memories using subroutine calls. The MPI standard defines the set of functions and procedures available to the programmer. Each one of these two alternatives have both advantages and disadvantages, and very frequently it is not obvious which one should be selected for a specific code. The pure cases of MPI or OpenMP programming models have been widely studied in plenty of architectures using scientific codes. The case of mixed mode parallel programming has also deserved the attention of some researchers in the last few years ([3], [4], [5]), however the effective application of this paradigm still remains as an open question. One of the aims of this work is to obtain the maximum reduction in the execution time for the original sequential code. With this goal in mind, we have also to consider that the amount of time invested in developments is a heavy limitation for the usual non-parallel expert scientific programmers. To balance both objectives we will devise the necessary developments and experiments for our target architecture. The conclusions and the knowledge acquired with our experiences establish the basis for future developments on similar codes. The paper has been structured as follows. Section 2 describes the astrophysics problem and the code to be parallelized. This sequential code is slightly modified to facilitate the three parallel developments introduced in section 3. The parallel approaches have been tested and a broad computational experience on a ccNUMA SGI Origin 3000 is presented in section 4. We conclude that, for the application considered, no significative differences have been found among the performances of the different parallelizations. However, the development effort of the considered versions could be appreciable. 
We finalize the paper in section 5 with some concluding remarks and future lines of work.
2. THE PROBLEM
Supermassive black holes (objects with masses in the range 107 - 1 0 9 solar masses, SMBH) are supposed to exist in the nucleus of many if not all the galaxies. It is also assumed that
389 some of these objects are surrounded by a disk of material continuously spiraling (accretion disk) towards the deep gravitational potential pit of the SMBH and releasing huge quantities of energy giving rise to the phenomena known as quasars. We are interested in objects of dimensions comparable to the Solar System in galaxies very far away from the Milky Way. Objects of this size can not be directly imaged and alternative observational methods are used to study their structure. One of these methods is the observation of a Quasistellar Object (QSO) images affected by a microlensing event to study the unresolved structure of the accretion disk. If light from a QSO pass through a galaxy located between the QSO and the observer it is possible that a star in the intervening galaxy crosses the QSO light beam. Thus the gravitational field of the star can amplify the light emission coming from the accretion disk (gravitational microlensing). As the star is moving with respect to the QSO light beam, the amplification varies during the crossing. The curve representing the change in luminosity of the QSO with time (QSO light curve) depends on the position of the star and on the structure of the accretion disk. The goal in this work is to model light curves of QSO images affected by a microlensing event to study the unresolved structure of the accretion disk. According to this objective we have fitted the peak of an high magnification microlensing event recently observed in one quadruple imaged quasar, Q 2237+0305. We have modeled the light curve corresponding to an standard accretion disk [6] amplified by a microlens crossing the image of the accretion disk. Leaving aside the physical meaning of the different variables ([7]), the function modeling the dependence of the observed flux, F,, with time, t, can be written as
F.(t) = A .
d1
{1 +
k//-~
[ C ( t - to)/(]}
e (Du~3/4(1-1/V/-~)-1/4) -- 1
(1)
where (max is the ratio between the outer and inner radii of the accretion disc (we will adopt (max = 100). G is a function
G(q) =
f+l
H ( q - y) y ,/q-
yv/1 -
To speed up the computation, G has been approximated by using MATHEMATICA (see appendix in [7]). Therefore, the goal is to estimate the values of the parameters A~, B, C, D~, and to by fitting F~ to the observational data. Specifically, to find those values of the parameters that minimize
X2 - E
i
cri2
(2)
where N is the number of data points (F ~ corresponding to times ti (i = 1, ..., N) and F,(ti) is the theoretical function evaluated at h. cr~ is the observational error associated to each data value. To minimize X2 we used the NAG ([8]) routine e 0 4 c c f . This routine ([9]) only requires evaluation of the function and not of the derivatives. As the determination of the minimum in the 5-parameters space depends on the initial conditions we needed to consider a 5-dimensional grid of starting points. If we consider m sampling intervals in each variable, the number of starting points in the grid is of m 5. For each one of the points of the grid we
390 computed a local minimum. Finally, we select the absolute minimum among them. Even for relatively small values of m, the determination of the minimum takes a long time of execution.
program
seq_black_hole
double precision t2(lO0), s(lO0), common/data/t2, fa, efa, length double precision t, fit(5) common / var / t, f it
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
c c
fa(lO0)
, efa(100)
Data
input Initialize best solution
do k l = l , m do k2 = 1, m do
c
ne(lO0),
k3 : 1, m do k4 = 1, m do k5= 1 , m
Initialize starting point x ( 1 ) . . . . . x(5) ca|| j ic2 (nfit,x, fx) call e04ccf(nfit,x,fx,ftol,niw,wl,w2,w3,w4,w5,w6,jic2,monit, maxcal, ifail) if (fx improves best fx) then update(best
(x,
fx))
endif enddo enddo enddo enddo enddo
Listing 1" Original sequential pseudocode
Listing 1 shows the original sequential version of the code. There are five nested loops corresponding to the traverse of the 5-dimensional grid of starting points. We minimize the j i c 2 function using the e 0 4 c c f NAG function. The starting point x, and the evaluation of j i c 2 (x) ( f x ) are supplied as input/output parameters to the NAG function. Inside the j i c 2 function, the integral of formula 1 is computed using the d01 a j f and d01 a 1 f NAG functions. These functions are general-purpose integrators that calculate an approximation to the integral of a function. The common blocks in lines 3 and 5 are used to pass additional parameters from the j i c 2 function to these integrators. Observe that j i c 2 is passed as a parameter to the minimization e o 4 c c f function and we need also to pass additional parameters. 3. P A R A L L E L I Z A T I O N It is a well-known fact that most scientific programmers use to have a code to solve a problem, they are quite confident on it but they have the inconvenience that it is rather time consuming. Furthermore, they are usually reluctant to share their code, to introduce changes on it and sometimes to open their minds to different solution approaches.
391
Therefore, one of our constraints was to introduce the minimum amount of changes in the original code. Even with the knowledge that some optimizations could be done in the sequential code. To preserve the use of the NAG functions was also another restriction in our development. Users with a scientific background are quite used and confident to these libraries, and despite the portability limitations that they introduce, we also decided to preserve them. An important observation to be done is that the kind of parallelization that we introduced in the code should be easily understood and reproduced by other non parallel programmers. The original sequential code was developed on a Sun Blade 100 Workstation running Solaris 5.8 and there were no portability problems with the code when ported to a SGI Origin 3000.
1 2 3 4 5
6
8 9 10 11 12 14 15 16 17 18
19 20 21 22 23 24 25 26 27
black_hole_mpi_omp double p r e c i s i o n t 2 ( l O 0 ) , s ( l O 0 ) , common/data/t2, fa, efa, length double p r e c i s i o n t , f i t ( 5 ) c o m m o n / v a r / t , fit /$OMP THREADPRIFATE (/var/, / data/) program
c c
call MPI_INIT(ierr) call MPI_COMM_RANK(MPI_COMM_WORLD, call MPI_COMM_SIZE(MPI_COMM_WORLD, Data input Initialize best solution
ne(lO0),
fa(lO0),
myid, ierr) mpi_numprocs,
ierr)
!$OMP PARALLEL DO DEFAULT(SHARED)PRIVATE(tid,k, m a x c a l , f t o l , w4, w5, w6, x)COPYIN(/ data /) LASTPRIVATE (fx ) c
efa(lO0)
ifail ,wl,w2,w3,
do k = m y i d , mA5 -- 1, m p i _ n u m p r o c s Initialize starting point x(1) ..... x(5) c a l l j ic2 (nfit,x, fx) call e04ccf (nfit ,x, fx, ftol ,niw, wl ,w2 ,w3 ,w4 ,w5 ,w6, j ic2, monit, maxcal, ifail) if (fx improves best fx) then update(best (x, fx))
endif enddo c Reduce the OpenMP best solution !$OMP END PARALLEL DO c Reduce the MPI best solution call MPI_FINALIZE(ierr) end
Listing 2: Mixed MPI-OpenMP code
The only modification introduced in the original sequential version was a transformation of the iteration space by reducing the five nested loops in listing 1 to a single one as it appears in listing 2. This modification does not change the underlying semantics of the sequential code nor the order in which the points are evaluated. The transformation can be easily done since there are no data dependencies amongst the different points in the iteration domain. The reason to introduce this transformation is to ease and clarify the parallelizations.
392 The code in listing 2 shows the mixed mode parallelization using MPI and OpenMP. We will use this code to explain all three parallel aproaches that have been implemented. MPI and OpenMP pure versions can be directly obtained from that code by simply eliminating the corresponding pragmas or subroutine calls. The MPI parallelization is the most straightforward: after inserting in the code the initiallization and finalization MPI calls, we simply divide the single loop iteration space among the available set of processors using a cyclic distribution. In the pure OpenMP parallelization, the main difficulty is to identify the shared/private variables in the code. Once they are located, the corresponding OpenMP p a r a l l e l do pragma can be introduced in the main loop. Variables included in common blocks deserved a particular consideration. To avoid read/write concurrent accesses the t h r e a d p r i v a t e pragma was included (line 6, listing 2). The mixed MPI/OpenMP version just merges both former versions. Each MPI process expands a certain number of OpenMP threads that takes in charge the iteration chunk of the MPI process.
4. COMPUTATIONAL RESULTS This section is devoted to investigate an quantify the performance obtained with the different parallel approaches taken. In a Sun Blade 100 Workstation running Solaris 5.0 and using the native Fortran77 Sun compiler (v. 5.0) with full optimizations the code takes 5.89 hours and 12.45 hours for sampling intervals of size m = 4 and m -- 5 respectively. The target architecture for all the parallel executions has been a Silicon Graphics Origin 3000 with 600MHz R14000 processors. We restricted our executions to this architecture because it is the only available to us with the NAG libraries installed. Using the native MIPSpro Fortran77 compiler (v. 7.4) and the NAG Fortran Library - Mark 19 in the SGI O3K with full optimizations the sequential running time is 0.91 hours and 2.74 hours for sampling intervals of size m -- 4 and m -- 5 respectively. Table 1 shows execution time (in seconds) and speedups of the parallel codes with sampling intervals of size m = 4. Assuming that we do not have exclusive mode access to the architecture, the times collected correspond to the minimum time from five different executions. The figures corresponding to the mixed mode code (label MPI-OpenMP) correspond to the minimum times obtained for different combinations of MPI processes/OpenMP threads (see table 2). Data do not show a significative difference among the different parallel approaches taken. The speedups achieved reduce the sequential running time almost linearly with the number of processors involved. For the mixed MPI/OpenMP parallel version, we have considered in the computational experience different combinations between MPI processes and OpenMP threads. Table 2 shows the running time of this parallel version. An execution with P processors involves nTh 0penMP threads and P MPI processes. The minimum running times appear in boldface. As it could be expected, the execution time using P MPI processes and 10penMP thread coincide with the correspondent MPI pure version. Analogously, the same behaviour is observed with 1 MPI process and P threads with respect to the pure OpenMP version. A regular pattern for the optimal combination of MPI processes and OpenMP threads is not observed. For the sake of briefness we do not include computational results for sampling intervals of
393 Table 1 Time and speedup for parallel codes, sampling intervals: m = 4 Time (sees.)
Speedup
P
MPI
OpenMP
MPI-OpenMP
MPI
OpenMP
MPI-OpenMP
2
1674.85
1670.51
1671.59
1.95
1.96
1.95
4
951.12
930.64
847.37
3.44
3.51
3.86
8
496.05
485.69
481.89
6.59
6.73
6.78
16
256.95
257.20
255.27
12.72
12.71
12.80
32
133.87
133.97
133.70
24.41
24.39
24.44
size m - 5. All the comments and observations made for the case m = 4 also apply for this case. The computational results obtained from the parallel versions confirm the robustness of the method.
Table 2 Time for the mixed (MPI-OpenMP) code for different combinations Threads/MPI Processes. Sampling intervals: m = 4 The number of MPI processes is n TPh P
nTh
32
16
8
4
2
1
133.70
257.55
497.12
953.34
1679.17
2
137.80
262.26
481.89
847.37
1671.59
4
151.84
278.10
485.05
938.15
8
156.56
255.27
484.77
16
139.03
256.34
-
32
133.88
-
-
5. C O N C L U S I O N S
AND FUTURE WORK
We conclude a first step of cooperation in the way of applying parallel techniques to improve performance in astrophysics codes. The scientific aim of applying high performance computing to computationally-intensive codes in astrophysics has been successfully achieved. The relevance of our results do not come directly from the particular application chosen, but from stating that parallel computing techniques are the key to broach large size real problems in the mentioned scientific field. For the case of non-expert users and the kind of codes we have been dealing with, we believe that MPI parallel versions are easier and safer. In the case of OpenMP, the proper usage of data scope attributes for the variables involved may be a handicap for users with non-parallel programming expertise. The higher current portability of the MPI version is another factor to be considered. The mixed MPI/OpenMP parallel version is the most expertise-demanding. Nevertheless, as
394 it has been stated by several authors ([5], [3]), and even for the case of a hybrid architecture like the SGI O3K, this version does not offer any clear advantage and it has the disadvantage that the combination of processes/threads has to be tuned. From now on we plan to continue this fruitful collaboration by applying parallel computing techniques to some other astrophysical challenge problems.
REFERENCES
[1]
[2] [3] [4]
[5] [6] [7]
[8] [9]
O. A. R. Board, OpenMP Fortran Application Program Interface, OpenMP Forum, h t t p : //www. openmp, o r g / s p e e s / m p - d o c u m e n t s / f s p e c 2 0 , pdf (nov 2000). M. Forum, The MPI standard, heep://www.mp2-forum, org/. L. Smith, M. Bull, Development of mixed mode MPI/OpenMP applications, Scientific Programming 9 (2-3) (2001) 83-98. G. Sannino, V. Artale, P. Lanucara, An hybrid OpenMP-MPI parallelization of the princeton ocean model, in: Parallel Computing, Advances and Current Issues. Proceedings of the International Conference, ParCo2001, Naples, Italy, Imperial College Press, London, 2001, pp. 222-229. F. Cappello, D. Etiemble, MPI versus MPI+openMP on IBM SP for the NAS benchmarks, in: Proceedings of Supercomputing'2000 (CD-ROM), IEEE and ACM SIGARCH, Dallas, TX, 2000, 1RI. N. I. Shakura, R. A. Sunyaev, Black holes in binary systems, observational appearance, Astronomy and Astrophysics 24 (1973) 337-355. L. J. Goicoechea, D. Alcalde, E. Mediavilla, J. A. Mufioz, Determination of the properties of the central engine in microlensed QSOs, Astronomy and Astrophysics 397 (2003) 517525. Numerical Algorithms Group, NAG Fortran library manual, mark 19, NAG, Oxford, UK (1999). J. A. Nelder, R. Mead, A simplex method for function minimization, The Computer Journal 7 (4) (1965) 308-313.
Parallel Computing: SoftwareTechnology, Algorithms, Architecturesand Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 ElsevierB.V. All rights reserved.
395
High Throughput Computing for Neural Network Simulation J. Culloty a and P. Walsh ~ ~Department of Maths and Computing, Cork Institute of Technology, Bishopstown, Cork, Ireland Email: [email protected], [email protected] Artificial neural networks (ANNs) are densely interconnected networks of processing nodes that provide robust methods for learning real, discrete and vectored valued functions from example training sets and have been widely applied to problems in control, prediction and pattern recognition [1]. Processing nodes are organised into layers of input units Ni, layers of hidden units Nh and a layer of output units No. Interconnections between processing units are represented by a matrix of adjustable weight parameters W, which can be tuned by a gradient decent algorithm to learn functions presented by the training set data. However, neural network learning is computationally expensive and training can take the order of days or weeks for large training sets [2]. Training sets for applications such as character recognition and speech recognition can require the order of 106 training samples and 106 network parameters [3]. The aim of this paper to address these computational requirements by exploiting neural network parallelism in a high throughput metacomputing environment. 1. O V E R V I E W OF THE B A C K P R O P A G A T I O N A L G O R I T H M
Backpropagation is perhaps the best understood and most widely used neural network training algorithm. In feed forward operation the hidden layer network output vector is computed by multiplying the input vector X with the hidden layer weight matrix W:
where f is a non-linear squashing function:
f (a) -
1 1 + e -a
The network output is then computed by multiplying the hidden layer output matrix Yh with the output layer weights Wo: Yo,k = f
Wo,kjYh,j
-
-
(~
396 Weight updates are can be calculated by evaluating an error measure Etp for each training pattern: :
1 No
Yp,o,k)2
-
k=l
or can be calculated by computing a error measure E after all training patterns have been evaluated in a feed forward phase, known as an epoch" P
E = ~-~ Etp p=l
These error measures are used by the algorithm to compute weight updates for the output layer: 6o,
= yo,
(1 - yo, )
and for the hidden layer:
- yo, )
No 6 ,j -
yo,
(1 - yo, )
6o, Wo, j k=l
A neural network application may take a large number of training epochs to converge to an acceptable solution, where E is sufficiently small. 2. PARALLELISM OF THE BACKPROPAGATION A L G O R I T H M Given P processors, parallelism in the backpropagation algorithm can be exploited at a number of levels [2][4][5]. High throughput computing has been previously applied to neural networks, but only to train multiple networks in parallel, not to train a single network in parallel [12]. For training single networks, it has been shown that training set parallelism is particularly effective for application involving large training sets [2]. In training set parallelism the training set is partitioned into P blocks and work is allocated in a master/slave mode, where each processor has a local copy of a training block and calculates weight updates on a copy of the global network weights. Weight updates are made by a global operation at the end of each training epoch. Many implementations of training set parallelism have focussed on high performance platforms, such as a network of transputers [6][7] or a cluster of dedicated workstations [8]. Such solutions can be expensive in terms of hardware costs and can be under utilised when not required for parallel applications. A more economical solution is to exploit existing networked resources to implement a high throughput computing (HTC) metacomputer. HTC focuses on completing useful work over long time spans in a heterogeneous and unreliable environment where the resources available and the amount of computation delivered may vary [9]. This is typical of many organisations where desktop computers are used intermittently and are often idle for long time spans, such as in the evenings and weekend. The application of these idle resources to neural network simulation is challenging as there is a high degree of interdependence between tasks in training set parallelism, due to the global communication of weight updates at the end of each training epoch. Moreover, training set data must be dynamically allocated as heterogeneous computing resources become available and removed from the metacomputer.
397 These challenges can be addressed by focusing on characteristics of the problem that can increase the efficiency of a parallel implementation as demonstrated by Goux et al [9]. In neural network simulation the characteristics that we identified are dynamic grain size and data locality. Dynamic grain size refers to the increase of granularity as more processors become available to the system and a reduction in granularity as processors leave the system. A fine level of granularity also allows the master to allocate work to slaves on heterogeneous machine according to the relative speed of each processor. We use the term data locality to refer to the caching of training set blocks on slave processors. This reduces the amount of interprocess communication by allowing slave tasks to reuse local data in successive training epochs. 3. O V E R V I E W OF S Y S T E M I M P L E M E N T A T I O N This system was implemented in a Linux workstation environment using a Parallel Virtual Machine (PVM) [10] and Condor framework [ 11 ]. PVM is a message-passing interface, which allows the collection of distributed heterogeneous computational resources into a single virtual machine on which a parallel program runs. Together with the ability of PVM to probe the current state of the virtual machine, these functions provide the application programmer with a powerful tool for developing robust dynamic parallel applications. One disadvantage of this system is that when running in a non-dedicated environment a PVM program cannot easily probe the state of machines not currently included in the virtual machine. And while it is possible to write routines within a PVM program instructing it when to add or remove hosts (that may be required for other work during the day) such routines are likely to be complex and error prone. Condor provides a system which when incorporated with PVM provides the parallel application with the ability to handle the dynamics of a non-dedicated environment in a complex manner without introducing such complexities into the PVM code itself. Like PVM, Condor provides this computational power via a network of distributed heterogeneous resources. But while it is not uncommon with PVM architectures to construct a purpose built virtual machine out of dedicated hosts [3], Condor is designed to select unutilised resources from a pool of non-dedicated hosts. Condor achieves this through its ClassAds mechanism. The owner of each resource can specify a set of C like Macros detailing under what conditions a job may be run on their machine. These Macros can be used to run jobs at certain times, when the computer is idle, or if the job belongs to a certain owner. The owner of a job also specifies conditions, which have to be met before the job can be started on a host - such as CPU speed or RAM. Condor acts as a matchmaker ensuring that both host and job requirements are met before allocating a job to a host. When incorporated with PVM, Condors ClassAds mechanism can be used to improve the performance of the PVM application by informing it of newly available resources, or to schedule the addition or removal of resources at certain times. A Condor pool is formed from all hosts running Condor demons, and can be divided into three groups: 1. U n c l a i m e d hosts Hosts, which are either being used by their owner, do not meet the minimum requirements to serve as slaves, or have yet to be claimed by the master.
398
2. Claimed hosts Hosts that were marked idle by Condor and claimed by the master to act as a slave in the virtual machine. 3. The submit machine This is the machine from which the program was submitted. The master program will reside on this machine. Together with the claimed hosts the submit machine forms the virtual machine on which the PVM program runs. Condor is used as a resource manager in the system; it monitors all hosts in the pool notifying PVM if a previously unallocated host becomes available or if a slave is reclaimed by its owner and returned to the unclaimed pool. PVM is used for its message-passing interface between the master and slaves. It spawns a slave on newly allocated hosts and assigns work for each slave. 1 2 3 1 2 3 4 5 6 7
8 9 10 11 12
Void SlaveTrain() do { if (received weight update message) Unpack and update weights if (received new b l o c k assignment message) Unpack and store new d a t a block if (received block process message) U n p a c k block ID Current block = block ID Train network Send weight changes } while (true)
Listing 3: Slave Training Algorithm.
4
5 6 7 8 9 l0
12 13 14 15 16
Void MasterTrain() do { Send current weights to all slaves Assign initial blocks to each slave do { if (received block finished message) Unpack block ID Mark block p r o c e s s e d if (free blocks in pool) S e l e c t free b l o c k for slave i f ( b l o c k not stored on slave) Send block to slave Tell slave to process block } while (all blocks are not processed) Calculate new set of weights } w h i l e (stop training condition not meet )
Listing 4: Master Training Algorithm. The master task stores a copy of the complete training set, but logically divides this into a number of small blocks for allocation to the slaves. The grain size of each block is important as when running in the Condor environment a host may be lost at any moment with a complete loss of any partially completed but not yet returned results. The smaller the grain size the less work that is lost when a host is reclaimed by its owner. On the other hand the smaller the grain size the more network traffic that is generated, as more sets of updates must be returned. The master and slave training algorithms are outlined below. To optimise the assignment of blocks to slaves the master implements a scheduling algorithm with the following aims. 1. To reduce the number of copies of a block held by the slaves.
399 2. To reduce the idle time where some slaves are waiting for others to finish before the start of the next epoch. 3. To reduce network load by sending as few blocks of data as possible. 4. To reduce the recovery time when a task of host fails. In order to implement this algorithm the master stores the following information on each block of data, as shown in Listing 4. The master stores the complete training set and logically divides this into a number of blocks by creating pointers into the data, whereas the slaves stores each block as a separate structure to allow for ease of maintenance. Each block is referenced by a unique ID, which is shared between the master and all the slaves. For each block the master maintains its current status for this epoch: 9 0 unassigned. 9 - 1 processed and returned. 9 > 0 the task identified (tid) of the slave currently processing it. The master also stores a list of slaves who have a copy of each block and the epoch on which they last processed that block. When assigning a block to a slave the master selects an unassigned block using the following criteria. 1. The slave has a copy of the block. (Aim 1 and 3) 2. A block, which is shared by as few slaves as possible. (Aim 1 and 2) This system means that no slave is left idle while there are unassigned blocks to process, but a slave will always try and process the blocks which it currently holds. By showing a preference for blocks, which are held by the least amount of slaves, the system tries to ensure that a block will not be reissued to a new slave. The last epoch when each slave processed a block is used by the master when instructing a slave to delete a block. If a slave ends up storing too many blocks of data the block selected for removal is the one that is least used. 4. RESULTS Results achieved so far with this system have been encouraging. Figure 1 shows the speedup attained with a relatively modest training set size of 219 training patterns. Large scale neural network applications have training sets several orders of magnitude larger than this. These results are for machines in an idle environment, so the virtual machine remains relatively stable. While the results in Figure 2 and show that the system can achieve a near linear speedups under ideal conditions it dose not indicate how well the system can adapt to the loss or gain of hosts in a truly dynamic HTC environment. In order to show the systems ability to adapt we decided to measure the time per epoch for different configurations of the system, and them compare these times with those produced by a run of the system where the configuration changes during the run. A train set size of 215 training patterns was chosen for the tests and only the number of hosts were modified for each test.
400 Figure 3 shows the variations in time per epoch for runs using a constant number of hosts. It can be clearly seen from the figure that while the average time per epoch for each test is relatively the same, times can vary greatly from one epoch to the next. This is due to the manner in which the algorithm assigns blocks to hosts, resulting in some hosts performing more processing than others in certain epochs. To account for this variation and produce clearer results in tests it was decided that two hosts would be added or removed from the system at the same time. Figure 4 shows the effect loosing two hosts has on a typical run of the system. This figure shows the average time per epoch for a run comprised of 10 hosts, 8 hosts and a run where the number of hosts drops from 10 to 8 on the seventh epoch of the run. It can be seen from the figure that after an initial surge in time on the eight epoch - where the system reconfigures itself the average time per epoch returns to a level comparable with that of the 8 host test. This surge can be explained by the fact that the system had to resend all blocks of data that were stored on the lost hosts only. Similarly Figure 5 shows the effect gaining two hosts has on the system. In this case it can be seen that the system adapts to the change almost immediately, dropping from 10 to 8 hosts in a single epoch with no surge in time for the post-change epoch. The lack of a surge can be explained by the fact that the two new hosts immediately joined in the processing of blocks reducing the load on the previously employed hosts. These results show that the system can achieve reasonable speedups as the number of hosts increases and that in an dynamic environment the system can quickly adapt to changes in the -
Condor
Pool
Parallel V i r t u a l M a c h i n e
L [ NO"Blocks[~-]
Block of
R~ ~
/~
i
[--~ ['~ ~SlavehtiAdccessedl
/]/Pointers Id Status SlaveList
Claimed Hosts
' Slave A
]
~,,~
;cks~ ' ~
B
_~
Blocks
of Data
Submit
laveC Ii
M Unclaimed Hosts
'tI
~.
Blocks
of Data
Figure 1. Shows an overview of the system.
1~
401
I
:
IdealI
Speedup , ~
=
Test 1 ...., ~
I I
Test 2
220
I.,.
215
I~. l,} 2~0 ~11 &
0 .
. 1
. 2
.
.
.
.
.
.
3
4 5 6 7 Processors
.
.
8
9
i~
[
=
8Hosts
......4~,o~o:10 Hosts
_-
10 Hostsloose2
~"
3
I
I
t.
4
5
6
7
Epoch
8 Hosts ......~ ..... 10 Hosts ~ 270
500
8
9
10
8 Hosts Gain 2
I
. . . . . . . . . . . . . . . . . . . . . . . . . . .
250
m. 400 ~N~ m 300
I--
2
Figure 3. Variations in Time per Epoch.
600 .
,.=
195 190 I ~ ! 1
10
Figure 2. Speedup.
205
I~ ~oo
}1 ~i]
200 | ~
~~iii]
~go [ii',iiiNNi!~iN~Niiii ,
1
2
3
4
5 6 Epoch
7
8
9
10
Figure 4. System adjusting to loss of hosts.
,
1
,
2
,
3
,
4
,
5 6 Epoch
,
iNIN ,
7
8
,
,
9
,
10
Figure 5. System adjusting to gain of host.
systems configuration. So much so that within one or two epochs of the change the new configuration is producing comparable results with a system whose initial configuration matches that of the new one. Moreover they show that the master-worker paradigm employed here is feasible for large applications and could be extended to other application areas. REFERENCES
[1] Mitchell, Tom M., Machine Learning, McGraw-Hill 1997. [2] N. Sundararajan and R Saratchandran Parallel Architectures for Artificial Neural Networks Paradigms and Implementations IEEE Computer Society Press 1998.
[3] Douglas Aberdeen, Jonathan Baxter, and R. Edwards (2000), 92 c/MFlop/s, Ultra-Large[4]
[5] [6] [7]
[8]
Scale Neural-Network training on a PIII cluster. In Proceedings of Super Computing 2000, Dallas, TX., November 2000. SC2000 CDROM. Thomas, E.A.: A parallel algorithm for simulation of large neural networks. Journal of Neuroscience Methods. 98(2): 123-134, 2000. R.O. Rogers and D.B. Skillicorn. Using the BSP cost model to optimize parallel neural network training. In Workshop of Biologically Inspired Solutions to Parallel Processing Problems (BioSP3), in conjunction with IPS/SPDP'98, March 1998. H. Yoon and J. H. Nang, "Multilayer neural networks on distributed-memory multiprocessors," in INNC pp. 669-672, 1990. J. M. J. Murre, "Transputers and neural networks: An analysis of implementation constraints and performance," IEEE Trans. On Neural Networks, vol. 4 pp. 284-292, March 1993. Leslie S. Smith, "Using Beowulf Clusters to Speed up Neural Simulations", TRENDS in Cognitive Sciences, Vol. 6 No. 6, June 2002.
402 [9]
J.-E Goux, J. Linderoth, and M. Yoder. Metacomputing and the master- worker paradigm. Preprint ANL/MCS-P792-0200, Mathematics and Computer Science Division, Argonne National Laboratory, 2000. [10] V. S. Sunderam, PVM: A Framework for Parallel Distributed Computing, Concurrency: Practice and Experience, 2, 4, pp 315-339, December, 1990. [ 11] Jim Basney and Miron Livny, "Deploying a High Throughput Computing Cluster", High Performance Cluster Computing, Rajkumar Buyya, Editor, Vol. 1, Chapter 5, Prentice Hall PTR, May 1999. [12] Davis W. Opitz and Jude W. Shavlik, "Using Genetic Search to Refine Knowlede-Based Neural Networks", Machine Learning: Proceedings of the Eleventh International Conference, San Francisco, CA, 1994.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
403
P a r a l l e l a l g o r i t h m s a n d d a t a a s s i m i l a t i o n for h y d r a u l i c m o d e l s c. Mazauric a, V.D. Tran b, W. Castaings a, D. Froehlich c, and F.X. Le Dimet ~ ~LMC-IMAG Universit6 Joseph Fourier, 38041 Grenoble cedex 09 France [email protected] bInstitute of Informatics, Slovak Academy of Sciences Dubravska cesta 9, 842 37 Bratislava, Slovakia cParsons Brinckerhoff Quade and Douglas, Inc 909 Aviation Parkway, Suite 1500 Morrisville, North Carolina 27560 USA Flood prediction is a huge challenge for the decades to come. To achieve this goal it will necessary to couple many sources of information of various natures. Mathematical information provided by models and physical information from the observations (remote or in situ). An important stage in the forecast is the assimilation of data, i.e. how to use the physical information into numerical models. The method proposed in this paper is based on optimal control techniques : free parameters of the models (initial and/or boundary conditions, physical parameters) are estimated by the minimization of the discrepancy between the model and the observations. The optimization methods requires to compute gradient with respect to these parameters. This task is achieved by using an adjoint model derived from the direct model. According to the complexity of the flow in a river it will be necessary to parallelize the resolution of the direct model. A natural method of parallelization is based on domain decomposition techniques. The application to pre-operational flood model is described for a real test case. In this presentation, we will show that the tools for the assimilation of data can also be used for parallelization. In a first step, the assimilation is carried out separately in each subdomain with the control of interfaces. A second step is the mutual adjustement of these interfaces. 1. I N T R O D U C T I O N The prediction of floods is an important social issue. It requires the development of mathematical models governing the evolution of the river. These equations are based on the usual laws of fluid dynamics: mass and momentum conservation. Given the complexity of the models and the available computing ressources, parallel algorithms are required to provide a numerical solution that will approximate the natural hazard at a feasible computational cost. Parallelization of a flood model will be described and tested for an operational shallow water model. However, models are not sufficient to predict the evolution of the flood: initial and boundary conditions, model parameters must be prescribed in agreement with the physical properties of the flow, this task is named "Data Assimilation". The methods used is based on optimal control techniques: model inputs are used as control variables and adjusted in such a way that it minimizes the
404 discrepancy between the observation and a solution of the model. From the computational view point data assimilation is costly because it needs several integrations of the model. Therefore the method of data assimilation also needs to be parallelized. Some examples will be presented with a simple shallow water model. 2. MATHEMATICAL MODELLING FOR FLOOD PROPAGATION Among the analysis techniques and sources of information available for the prediction of the evolution of a flood on a portion of the earth, mathematical models are essential tools for recognition and understanding of the governing processes of flood propagation. Despite its many shortcoming, Shallow Water equations (SWE) are the most widely spread approach providing a simplified description of the mathematical laws that govern this phenomenon. In this paper, a finite volume discretization will be used. The approach is based on conservation forms of the depth-averaged fluid and momentum relations which are the results of a vertical integration of mass and energy conservation equations. In the model developed by D. Froehlich (DaveF), the depth-averaged surface-water flow equations are solved numerically using a twodimensional,explicit in time, cell-centered, Godunov-type, finite volume scheme. Godunovtype methods for solving first-order hyperbolic systems of equations are based on solutions of initial value problems,known as Riemann problems, involving discontinuous neighboring states. These methods are found to be accurate and robust when used to solve hyperbolic partial differential equations describing fluid motion. Moreover, they are able to capture locations of shocks and contact surfaces. This feature is particularly suited for problems involving wetting and drying. The model was tested, validated and applied to real scale problems for which it performed very well. 3. PARALLEL COMPUTING FOR FLOOD MODELLING Simulating river floods is an extremely computation-intensive undertaking. Several days of CPU-time might be needed for simulating floods along large rivers reaches. The model set up and calibration for a given site requires a lot of simulations to be carried out with different model inputs. Moreover, for critical situations, e.g. when an advancing flood is simulated in order to predict which areas will be threatened so that necessary prevention measures can be implemented in time, long computation times are unacceptable. Therefore, using High Performance Computing (HPCN) platforms to reduce the computational cost of flood simulation models is a critical issue. The following implementation scheme has been used. 9 Analyzing computational approaches used in the models: the methods of discretization (finite elements, finite differences, finite volumes) and the algorithms (Newton iteration, frontal solution methods) 9 Analyzing the source codes of the models: the program and data structures, data references 9 Choosing appropriate methods for parallelization 9 Coding, testing, and debugging HPCN versions 9 Installing and deploying HPCN versions
405 In DaveF, the temporal integration is carried out by a first order forward Euler scheme. Therefore, for each time step, DaveF computes the water level and depth averaged velocities for each cell from its current values and the values of its neighbors. At first sight, it seems that the scheme could be to be easily parallelized. However, a careful analysis underlines the fine granularity of the computation. In fact, the scheme is explicit in time and because of CFL restrictions the time steps are usually very small. Moreover a small amount of computation is needed for each time step to update the elements values. Although at each time step, the calculations and solution updating of each cell can be done in parallel without communication, in the next time step, calculating the new solution of an element requires the values obtained at the former iteration for its neighboring cells. It means that in distributed-memory systems like clusters, each processor has to send the solutions of all cells on the border with another processor to the processor before starting the next time step. The choice was made to overlap communication by computation: this solution does not solve the problem of fine granularity, but can reduce its effect on performance. It takes avantage of the fact that computation and communication can be done in parallel. Therefore during the communication, processors can perform other computations. The algorithm can be described as follows: 9 For each time step - Compute new solutions for bordering cells -
Send the new solutions to the neighbor processor
- Compute new solutions for internal cells - Receive new solutions of bordering cells from neighbor processor The time chart of the implemented algorithm can be described in figure [l-a]. As shown in the chart, although the communication time is long, the processor utilization is still good because during the communication, processors are busy with other work so the wasted CPU time is minimal.
6
i
5
.
.....NN
a)
~i
.
.
.
.
.
.
.
.
.
.
1
NiNNiiiii~iJ 1
la,=;=, J
V)
0
10
20 . . . . Loire~O~Ix~ ~
30
40
LoireSC~.4~
Figure 1. a) Time chart of HPCN algorithm, b) Speedup of obtained on INRIA iluster.
50
406 In order to carry out numerical experiments, a consistent data set was built for the simulation of flood propagation on the Ouzouer valley (Loire fiver). The designed mesh has 18677 elements, time steps are less than one second for a total simulation time of 400 hours. With the support of the INRIA icluster team, numerical experiments were carried out on the Linux cluster which has 216 HP e-vectra nodes (Pentium III 733 MHz and 256 MB RAM) divided into five segments. Nodes in a segment are connected by 100Mb/s Ethernet and segments are connected by 5 HP procurve 4000 switches. In order to investigate the influence of the granularity, a uniform mesh refinement was applied to obtain a second mesh which is four times larger than the previous one. Figure [l-b] shows the performance of the implemented algorithm in terms of speedup with the two data sets. The speedup reaches its maximum for 32 processors but there is no significant gain between 16 and 32 processors. For more processors, the speedup decreases because communication delays become too large for computations (the number of messages linearly increases with the number of processors while the computation time decrease). One can see that the speedup is increased with the size of input data, especially for larger number of processors. In fact because of the fine granularity of the algorithm, the more processors are used the larger it affects of the granularity performance. Other experiments where carried out on II-SAS (Institute of Informatics of the Slovak Academy of Sciences) and shown better results because the network has 16 nodes which are connected by an Ethernet 100Mb/s switch. On the icluster, since the nodes are assigned by PBS batch scheduling system there is no guarantee that the nodes are on the same segment and the network interference with other applications running on the system increases the communication time. However, despite the fine granularity, the parallel version of the model show relatively good performance on Linux clusters. 4. DATA ASSIMILATION The data assimilation techniques used in this paper is a deterministic approach based on Optimal Control Theory. The ingredients are: 9 A state variable describe the fields (water velocity, water elevation at the grid points). X is this state variable it is of great dimensionality 9 An observation Xobs, distributed in time and space 9 Control, variables, U is the initial condition, V the boundary condition. U and V are the input which have to be provided to the model to get a unique solution between time 0 and time T 9 A model which can be written by a generic form where F is a non linear operator.
dX --d7 = F (x, v )
(1)
x (o) = u
9 A cost function which estimates the discrepancy between the solution of the model associated to (U, V) and the observations. Because the variable X and the observations are
407 not in the same set we will consider C mapping the mathematical variable into a physical variable (e.g. an interpolation in time and space)
1/ T
J (U, V) - -~
[IC.X(U, v ) - Xob, ll~dt
(2)
0
The problem is: Determine U , and V , minimizing J, i.e. we seek for the inputs of the model leading to solution of the model closest to the observation. Therefore the closure of the model is stated as a problem of unconstrained optimization for which there are many known efficient algorithms. U , and V , are characterized by cancelling U , and V , the gradient of the cost function, this gradient is also necessary to carry out optimization algorithms. Computing the gradient using the Adjoint Model: We introduce P the "adjoint variable" which has the same dimensionality as X as the solution of the adjoint model: .p = C (c.x
- Xob~)
P (T) = O. Then it can be seen that the gradient of J is given by:
vj:(VuJ) V~J
-
OF t - -g-ff .P _p(i )
Therefore the gradient is obtained by a backward integration of the adjoint model. An algorithmic solution is adopted. A descent-type method is carried out, these methods are iterative, they requires several iterations and at each iteration an evaluation of the model and of the gradient. For more details on adjoint techniques, see Le Dimet and Talagrand [6]. 5. PARALLEL A L G O R I T H M S F O R DATA ASSIMILATION This section deals with the presentation of an iterative domain decomposition method. The problem is to find an optimal initial condition on each subdomains and to avoid any discontinuities at each interface between two subdomains. 5.1. Overview Let f~ be the main domain divided in n subdomain f~i. Lets consider that the flow behavior in each subdomain is computed on an independent processor. Consequently on fti and for the iteration k, the flux is governed by the following system :
dX~ at
+ F(X~, V~) x~(o)
0
(3)
= Uy
Where Vi~ is the boundary value at the interface between subdomains i and j. In order to apply the data assimilation method, a cost function should be define on f~i. We can
408 remark that we have two values of boundary conditions, one from the upstream subdomain the other from the downstream. As we should obtain a continuous flux on the entire domain f~, those values should be equals. Consequently, the cost function to minimized is : 1
J(U~) = ~llC~.X~ - X~ob~ll~o~ + ~
1
IIX(D~
k-~i-1 X(Up)fl[ ~ -
1 + _~_fi_kllX(Down)ki_ X(Up)k-1 i+ll[ 2 (4) where X(Down)~ is the value of X at the Downstream boundary point of the subdomain i at the iteration k (Up is for upstream). Remark" If C = (Ci)i=l,n is a block diagonal matrix, subdomain are independent. Consequently, a parallel algorithm consist in solving optimal control problems on independent processors at each iteration. The communication between processors are limited to the value at each interfaces. Choose an initial control variable Iteration on k k is the parameter who allows to penalized the cost function (4).
9 Run the direct code from time 0 to T. 9 Run the adjoint code from time T to 0. Optimization code .
- Computation of a descent direction
Di.
- Computation of a descent step pi. - Update of the control variable Ui+l = - If
Ui + Dipi
II v J(Ui+l)[[ _~ e then go out of the optimization code
End of the optimization code
9 Exchange interfaces values with neighbouring domains 9 If control variables values are appropriate (to be define) go out of the iteration of k End of the iteration on k
The minimization of J gives an optimal value for the initial condition on each subdomain. The iterative algorithm on k, ensures the continuity of the flow on the domain ~. 5.2. Application to a simple model 5.2.1. Overview In this section, the previously exposed methodology is applied to a basic Shallow Water model, LMCFLD [8], written in order to investigate data assimilation potential for river hydraulics. The computational approach is based on a finite difference method using a fractional step scheme. Let us consider a test case where a simple fiver ~ is built. This straight fiver is 50 Km long. Dikes on each side of this fiver can not be overflowed. The bottom slope of this fiver is small
409
(2%) At the initial time step the water surface elevation is 5 meters everywhere and the upstream boundary condition is a water level. The river f~ is divided in several subdomains. The river evolution is computed on each domain with an independent hydraulic code and on different computers. The aim of this section is to show how to find an optimal initial condition for each subdomain with the help of observations. The fiver f~ is a test case, consequently, no observation set is available. So, with the initial condition introduced above, the observation set is obtain thanks to a run of the direct code. The objective will be to give an aleatory initial condition for each subdomain and to retrieve the original one thanks to this observations set. For convenience we have assumed that observations were available at each grid nodes and at each time step for velocities and water elevations. Of course, this is far from the actual situation. An exhaustive discussion on the use of the C operator in formula (2) would be necessary but it is out of the scope of this paper. 5.2.2. Results
-Iteration1-
-Iteration7-
-Iteration13-
:/
Y
............ ,:...i,,......... ~ ~ 9
5
4
-'
Figure 2. For the subdomain 4, figures represent the relative error maked by the adjoint code to refind the optimal initial condition at different iteration. For the first iteration the maximal error is 3.2%, for the iteration seven it is 2.6% and for the 13 one it is 1.6%.
Let us consider the case where f~ has been divided in five equivalents parts (the subdivision has been made on the length). The computation of the flow evolution is made on independent processors. On each processor, the direct code is run once and the minimization of J is made. Communication are limited to boundaries conditions exchange between neighbor subdomains (i.e. processors). Figure 2 shows how the initial condition is retrieved very well. Thanks to the observation set, the minimization of the appropriate cost function allows to find an optimal initial condition near than the original one, the error which has been made is inferior to 4% whatever the iteration. At each iteration the penalization of the cost function is harder than the last one. Effectively, iteration after iteration the term ~-k IIX ( D ~ k-li-1 - X (Up)~ I]2 become much more important
410 in the function expression. Figure 2 shows the importance of this term in the minimization of the cost function. At the first iteration, the error is located mainly on the upstream boundary and after several iterations this error becomes lower. Data assimilation method gives a tools which allows to well estimate, in our example, initial condition. Moreover, the introduction of the penalization terms allows to ensure the continuity of the solution.
Speed Up
0.5
1
2
3
4
5
Processors
Figure 3. Speed UP (curve with square point).
Figure 3 shows the speed up of the parallelized code on a network of 7 biprocessors (DS20500 MHz HP/COMPAQ). By comparison with iteratives runs, CPU time gain is visible (see figure 5.2.2). For example, for 5 processors the HPCN run spend half time than the iterative one. 6. CONCLUSION A parallel algorithm of a relatively good efficiency for the given computational approach was proposed and discussed for a state of the art flood model. However, in order to provide a numerical solution that will approximate the natural hazard at a feasible computational cost , mathematical models, parallel algorithms and data assimilation methods are required. The potential of those techniques used in combination was demonstrated with a toy model and the results are encouraging for investigations using operational models. There is strong demand for the prediction of the state of the environnement. This task requires to link together all the available sources of information. The mathematical information is provided by a set of non linear partial differential equations, the physical information will be given by heterogeneous information in nature and quality. A precise definition of environmental systems as seen a mathematical systems will impose to deal with systems of very huge dimension therefore requiring enormous computational ressources for their solution. The set up of a complete operational hydrometeorlogical prediction chain will require coupling of meteorological,
411 hydrological and hydraulics models and data assimilation algorithms. The implementation of efficient algorithms for distributed architectures (clusters,grids ...) is a necessity for the future. REFERENCES
[1]
[2]
[3]
[4] [5] [6]
I-7]
[8] [9]
L. Hluchy, V. D. Tran, J. Astalos, M. Dobrucky, G. T. Nguyen, D. Froehlich. Parallel Flood Modeling Systems. International Conference on Computational Science ICCS'2002 pp. 543-551 L. Hluchy, V. D. Tran, O. Habala, J. Astalos, B. Simo, D. Froehlich. Problem Solving Environment for Flood Forecasting. Recent Advances in Parallel Virtual Machine and Message Passing Interface, 9th European P VM/MPI Users' Group Meeting 2002, pp. 105-113 F. Alcrudo. Impact Project: A state of the art review on mathematical modelling offlood propagation D.C.Froehlich. IMPACT Project Field Tests i and 2: "Blind" Simulation by DaveF,2002 F.-X. Le Dimet. Une dtude gdnOrale d'analyse objective variationnelle des champs mOtdorologiques. Rapport Scientifique LAMP 28, Universit6 de Clermond II, BP 45 63170 Aubi6re France, 1980. F.-X. Le Dimet et O. Talagrand. Variational algorithms for analysis and assimilation of meteorological observations: theoretical aspects. Tellus, 38A:97-110, 1986. Y.K. Sasaki. An objective analysis based on the variational method. J. Met. Soc. Jap., II(36):77-88, 1958. J. Yang Assimilation de donndes variationnelle pour les problOmes de transport de sddimerits en riviOre Rapport de those pr@ard au sein du Laboratoire LMC (GrenobleFrance) ), 1999 N. Rostaing-Schmidt et E. Hassold. Basic function representation of programs for automatic differentiation in the o d y s s 6 e system. In F-X LE DIMET editor, High performance Computing in the Geosciens, pages 207-222. Kluwer Academic Publishers B.V. NATO ASI SERIES, 1994.
This Page Intentionally Left Blank
Multimedia Applications
This Page Intentionally Left Blank
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
415
Parallelization o f V Q C o d e b o o k G e n e r a t i o n u s i n g L a z y P N N A l g o r i t h m A. Wakatani a aFaculty of Science and Engineering, Konan University 8-9-1, Okamoto, Higashinada, Kobe, 658-8501, Japan We propose two parallel algorithms for codebook generation of VQ compression based on PNN algorithm. By merging plural pairs of vectors in an aggressive way, the communication overhead can be amortized at the cost of slight degradation of the quality of the codebook and the effective parallelism of the algorithms can be enhanced to achieve a linear speed-up. We also confirm the effectiveness of our algorithms and evaluate the quality of the codebook generated by our parallel algorithms in an empirical way. 1. INTRODUCTION
VQ (vector quantization) is one of the most important methods for compressing multimedia data, including images and audio, at high compression rate. A key to high compression rate is to build an efficient codebook that represents the source data with the least quantity of bit stream. Methods of generating a codebook for VQ compression include PNN(Pairwise Nearest Neighbor)[1] and LBG[2] algorithms. Both methods require the vast computing resource to determine an efficient codebook. Although LBG algorithm has been studied more frequently than PNN algorithm, we focus on PNN algorithm because LBG algorithm sometimes finds a non-optimal codebook with only a local minimum and then the algorithm must be iterated with different initial value in order to avoid capturing the local minimum. PNN mainly consists of four steps: 1) select initial training vectors from the source data, 2) calculate all the distance (or distortion) value between two vectors, 3) select the minimum distance from the distance values and merge a pair of vectors with the minimum into one vector, and 4) iterate the above steps 2 and 3 until the number of vectors equals the size of codebook (=K). This algorithm can determinately find the optimal codebook with several iterations of the above steps, however the step 2 is very computationally expensive with the cost of O(T 2) where T is the number of initial training vectors. In order to alleviate the computational complexity of the algorithm, Franti proposed the variation of PNN called tau-PNN[3]. A nearest neighbor vector (nn) of each vector (or training vector) is calculated in advance. Let ( i m i n , j rain) to be the pair selected at the third step of PNN. The distance should be recalculated only for a vector whose nn is i m i n or j rain at the second step, so the computational complexity can be reduced dramatically with the cost of O(T). Moreover, as described in Figure 6, Lazy PNN sorts the distance values in an ascendant order and select the minimum from the top[4]. Each distance value (for vector i) has a valid bit, which is invalidated when the nearest neighbor of vector i is merged, and if the valid bit of the top is invalid, the distance of the vector should be recalculated and sorted again. There-
416 fore Lazy PNN can dramatically reduce the cost of selecting the minimum and the number of recalculating distance compared with tau-PNN. As mentioned, PNN algorithm is an important method to determine an optimal codebook for VQ compression and has been improved to reduce the computational complexity in many ways. However, since the algorithm is inherently sequential, the speedup of the algorithm is limited. Thus we apply a parallel computing approach to Lazy PNN algorithm in order to achieve high speed generating of codebook. 2. NAIVE PARALLELIZATION The improvement of PNN algorithm has limitations in terms of high speed computing. In order to overcome the limitation, we should parallelize Lazy PNN algorithm without any loss of scalability to the problem size and suitability for high compression rate.
Ill11111Llililinllll Iliittitllntitilltittllnl Jlllnnlllnllllllnlln llilllll!lllllillllnI Illlllltllllltl
sequent ial
.......ltilltlllittllilnltll
paral le i
Figure 1. Naive parallel algorithm
Lazy PNN algorithm consists of 7 steps as in Figure 6, in which the main part of the algorithm are steps 3 to 7. First, a naive parallel approach is considered (Naive parallel algorithm). All initial training vectors are distributed to all processors but each processor is in charge of calculating the distance for the part of the initial vectors. Suppose 4 processors are provided as shown in Figure 1, each processor calculates distances of 1 of vectors for step 2 and sorts the distance values respectively, namely the sorted lists are created in each processor respectively. Thus steps 2 and 3 are done concurrently without any communication. For step 4, since each processor has its own sorted list, so the top item (with the local minimum) of the list should be broadcasted to others to determine the global minimum. If the valid bit of the global minimum is invalid, the distance must be recalculated and step 3 must be carried out again. Finally, after the global minimum is determined, every processor merges the pair with the minimum distance into a new vector and carries out the finalization step (step 7). Steps 6 and 7 are sequential. As the main part are steps 3 to 7, it is likely the algorithm can be parallelized easily except steps 6 and 7 whose computational complexity is fortunately low. However, the problem is that steps 4 requires the communication exchange to determine the global minimum from all the
The communication is just an overhead cost which spoils the effectiveness of the parallel algorithm; thus the reduction of the communication cost is a key factor to enhance the effective parallelism of the algorithm.

3. BLOCK PARALLELIZATION
Communication mainly consists of two phases: 1) communication start-up (t_s) and 2) message delivery (t_c). The communication start-up phase includes allocating the buffer area and copying the data to be sent between user and kernel spaces, and the message delivery is the time to transfer the data through the communication line. The reciprocal of t_c is called the ideal bandwidth. In general, the communication cost is expressed by t_s + k x t_c, where k is the number of data items to be sent, and t_s is much larger than t_c. According to the literature [8], t_s is 21 msec and t_c for one double precision data item is 0.3 msec on the Cray T3D, while t_s is 35 msec and t_c for one double precision data item is 0.23 msec on the IBM SP-2. Namely, if a message includes only one data item, the communication cost is dominated by just the start-up time and the effective bandwidth is much less than the ideal bandwidth (= 1/t_c). Therefore, by merging as many messages as possible into one message, the communication cost is expected to be amortized dramatically. In the Naive parallel algorithm, the number of vectors decreases one by one every iteration and one broadcast communication is required for each iteration; thus we consider in the following subsection the possibility of merging several broadcast messages into one message to enhance the effective parallelism.

3.1. Safe PNN algorithm
In the Naive parallel algorithm, only the top item is broadcast and the global minimum is determined every iteration. But according to the following Lemma 1 and Theorem 1, several vector pairs can be merged concurrently while preserving the optimality of the original PNN algorithm. Let S_i be the vector of vector_i and n_i be the number of vectors of S_i if vector_i is a merged vector. Note that n_i is 1 if vector_i is not a merged vector. The squared Euclidean distance value of vector_a and vector_b weighted by the number of vectors, the number of vectors of a merged vector, and the merged vector of vector_a and vector_b are given by [4]
d(S_a, S_b) = (n_a n_b / (n_a + n_b)) ||S_a - S_b||^2,
n_{a+b} = n_a + n_b,
S_{a+b} = (n_a S_a + n_b S_b) / (n_a + n_b).
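For illustration, the weighted distance and the merge update above translate directly into C; the vector dimension DIM and the data layout are our own assumptions (e.g. 4 x 4 pixel blocks), not part of the original description.

#define DIM 16  /* e.g. 4 x 4 pixel blocks used as training vectors */

/* d(Sa, Sb) = na*nb/(na+nb) * ||Sa - Sb||^2 */
static double weighted_distance(const double *Sa, int na,
                                const double *Sb, int nb)
{
    double sq = 0.0;
    for (int k = 0; k < DIM; k++) {
        double diff = Sa[k] - Sb[k];
        sq += diff * diff;
    }
    return ((double)na * nb / (na + nb)) * sq;
}

/* S_{a+b} = (na*Sa + nb*Sb)/(na+nb),  n_{a+b} = na + nb
 * (the merged vector overwrites Sa, the merged count overwrites *na) */
static void merge_vectors(double *Sa, int *na, const double *Sb, int nb)
{
    int n = *na + nb;
    for (int k = 0; k < DIM; k++)
        Sa[k] = (*na * Sa[k] + nb * Sb[k]) / n;
    *na = n;
}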
Lemma 1. The merge of vector_a and vector_b and the merge of vector_c and vector_d can be done concurrently if d(S_a, S_b) < d(S_c, S_d) and both of them are less than d(S_a, S_c), d(S_a, S_d), d(S_b, S_c) and d(S_b, S_d), where a ≠ c, a ≠ d, b ≠ c, b ≠ d.
Proof: From the literature [4], if d(S_a, S_b) < d(S_a, S_x) < d(S_b, S_x), then d(S_a, S_x) < d(S_{a+b}, S_x), where S_{a+b} is the merged vector of vector_a and vector_b. Therefore, since d(S_c, S_d) is less than d(S_{a+b}, S_c) and d(S_{a+b}, S_d), vector_c and vector_d can be merged without recalculating the distance, concurrently with the merge of vector_a and vector_b. (Q.E.D.)

Theorem 1. The n/2 vector pairs (S_1, S_2), (S_3, S_4), ..., and (S_{n-1}, S_n) can be merged concurrently if d(S_1, S_2) < d(S_3, S_4) < ... < d(S_{n-1}, S_n) and S_i ≠ S_j (i ≠ j).
Proof: It can easily be deduced from Lemma 1. (Q.E.D.)

So, if the vectors of several of the top pairs in the sorted distance list do not appear in each other's pairs, those pairs can be merged concurrently. As shown in the left of Figure 2, the first three
pairs can be merged, because the second pair would be merged right after the first pair and the third pair right after that, but the fourth pair should be removed and the nearest neighbor of vector_7 should be recalculated after the second pair (S_3, S_4) is merged and vector_4 is removed. By this principle, multiple merging can be done successfully. This is called the Safe PNN algorithm. In the Safe PNN algorithm, each processor 1) defines a block as a set of successive vector pairs which are (locally) safe on its own sorted list, 2) broadcasts the information of the block to the others, 3) merges all the information into a globally sorted list, and 4) determines the (globally) safe pairs to be merged from the list. As shown in the right of Figure 2, both processors 0 and 1 find a block with a size of 3 and broadcast them to the others. The first four pairs of the newly sorted list can be merged, but the fifth pair cannot be merged because one of the vectors of the fifth pair appears in the first four pairs.
Figure 2. Safe PNN: (a) safely merged pairs, (b) algorithm (the locally safe blocks of processors 0 and 1 are broadcast and sorted)
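The determination of a locally safe block on the sorted list can be sketched as follows; the representation of the list as two index arrays sorted by ascending distance is our own assumption, and the scratch array seen[] (of size T, initialized to zero) marks vectors already used by an accepted pair.

/* Return the number of leading pairs of the locally sorted list that
 * form a safe block: a pair (pair_i[k], pair_j[k]) is accepted only if
 * neither of its vectors appears in an earlier accepted pair
 * (cf. Lemma 1 and Theorem 1). */
static int find_safe_block(const int *pair_i, const int *pair_j,
                           int num_pairs, char *seen)
{
    int block = 0;
    for (int k = 0; k < num_pairs; k++) {
        int a = pair_i[k], b = pair_j[k];
        if (seen[a] || seen[b])
            break;                  /* first unsafe pair ends the block */
        seen[a] = seen[b] = 1;
        block++;
    }
    for (int k = 0; k < block; k++) /* reset the marks for the next iteration */
        seen[pair_i[k]] = seen[pair_j[k]] = 0;
    return block;
}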
Obviously, since a block usually contains the information of more than one pair, the communication start-up cost can be amortized. The effectiveness of Safe PNN depends on the size of the block that can be merged safely. This will be examined in the following section.

3.2. Aggressive PNN algorithm
I:"Processor
0
!
I".;'roce'2;or'";222222222~._.. broadcast
Figure 3. Aggressive PNN
Meanwhile, if the size of the block to be merged is not large enough to exploit the parallelism of the Safe PNN algorithm, its effectiveness may be degraded. In order to overcome this difficulty, the algorithm should be improved in an aggressive way. Namely, the policy is that all the broadcast pairs should be merged even if the merge is not safe. For example, in the right of Figure 2, the pair (S_5, S_6) cannot be merged according to Safe PNN. However, even if the pairs (S_5, S_6) and (S_6, S_11) are merged concurrently into (S_5, S_6, S_11), the accuracy of the codebook is almost the same as that obtained by Safe PNN. Similarly, the pairs (S_3, S_4) and (S_10, S_4) should be merged concurrently into (S_3, S_4, S_10), as shown in Figure 3. We call this method the Aggressive PNN algorithm,
which reduces the communication start-up cost as much as possible at the cost of the accuracy of the codebook. The algorithm is that each processor 1) defines a block as a set of successive vector pairs which may or may not be safe on its own sorted list, 2) broadcasts the information of the block to the others, 3) determines the groups to be merged, such as (S_5, S_6, S_11), and 4) merges all the groups concurrently. As mentioned earlier, however, this approach may cause the problem that the optimality of the codebook generated by the Aggressive PNN algorithm gets worse as the block size grows. Namely, the quality of an image compressed by VQ using a codebook generated by the Aggressive PNN algorithm may be worse than that obtained with the Safe PNN algorithm. We will evaluate the loss of accuracy of the generated codebook and the effectiveness of parallelization by using the Aggressive PNN algorithm in the following section.
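Before turning to the experiments, one possible way to form the merge groups of Aggressive PNN from the globally collected pairs is a simple grouping of pairs that share a vector, e.g. (S_5, S_6) and (S_6, S_11) into {S_5, S_6, S_11}. The following C sketch rests on our own data-structure assumptions and deliberately omits the full union of two already existing groups.

/* group_of[] maps a vector index to its group id and must be
 * initialized to -1 for all vectors; returns the number of groups. */
static int build_merge_groups(const int *pair_i, const int *pair_j,
                              int num_pairs, int *group_of)
{
    int groups = 0;
    for (int k = 0; k < num_pairs; k++) {
        int a = pair_i[k], b = pair_j[k];
        int ga = group_of[a], gb = group_of[b];

        if (ga < 0 && gb < 0)            /* neither vector seen yet */
            group_of[a] = group_of[b] = groups++;
        else if (ga < 0)                 /* a joins b's group */
            group_of[a] = gb;
        else if (gb < 0)                 /* b joins a's group */
            group_of[b] = ga;
        /* the case of two different existing groups would require a
         * union-find structure and is omitted in this sketch */
    }
    return groups;
}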
4. EXPERIMENT AND DISCUSSION

4.1. Block size for Safe PNN algorithm
We evaluate the size of the block which can be merged safely in the Safe PNN algorithm for a gray map medical image of a CT scan with a size of 512 x 512 and a gray level of 8 bits. Suppose a vector consists of 4 x 4 pixels and 1024 training vectors (=T) are randomly chosen to determine a codebook with a size of 256 (=K). Four experiments were carried out with changing initial training vectors. Note that 768 communications (=1024-256) are required without blocking the pairs (Naive parallel algorithm). The number of required broadcast communications is around 370 to 399, and the average size of a block to be merged concurrently is found to be around 1.92 to 2.08; over 80% of the blocks are one-element blocks which contain only one element. The early pairs of the sorted list can be merged concurrently, but most of the rest are one-element blocks. The results are almost the same for different T, K, and images. So, although the Safe PNN algorithm can reduce the communication start-up cost to some degree, its effectiveness is limited and very slight, and thus a more aggressive approach is indispensable for alleviating the communication cost drastically in a parallel computing environment.

4.2. Degradation of image quality for Aggressive PNN
In order to evaluate the quality of an image compressed by using the Aggressive PNN algorithm, two images are considered: one is the CT scan image used in the previous subsection.
Figure 4. Degradation of image quality by using Aggressive PNN algorithm: (a) CT image (ct1.pgm, T=1024, K=256), (b) Lenna (lenna.pgm, T=1024, K=256); PSNR [dB] plotted against block size for experiments 1 to 4
The other is a gray map natural image of a woman called "Lenna" with a size of 512 x 512 and a gray level of 8 bits. For both images, 1024 training vectors are randomly chosen to determine a codebook of 256 vectors. Four experiments were carried out with changing initial training vectors. Figure 4 shows the relation between the block size and the PSNR (Peak Signal-to-Noise Ratio) for each experiment and each image. The block size means the number of vector pairs that are aggressively merged concurrently. Note that a block size of 1 corresponds to the Safe PNN algorithm. As shown in Figure 4, by increasing the block size, the PSNR is slightly decreased. For example, for experiment 1 of the CT image, the Safe PNN algorithm generates a codebook which compresses the image at a PSNR of 31.5 dB, while the Aggressive PNN algorithm with a block size of 10 generates a codebook at a PSNR of 31.0 dB. Moreover, for experiment 1 of Lenna, the Safe PNN algorithm generates a codebook which compresses the image at a PSNR of 30.0 dB, while the Aggressive PNN algorithm with a block size of 10 generates a codebook at a PSNR of 29.8 dB. Sometimes the Aggressive PNN algorithm produces a better codebook than the Safe PNN algorithm, for example, for experiment 2 of Lenna. It is concluded that the Aggressive PNN algorithm is applicable to codebook generation for VQ compression because the degradation of image quality is slight.

4.3. Effective parallelism
Figure 5. Effective parallelism: (a) 128 vectors (T=1024, K=128, lenna.pgm), (b) 2048 vectors (T=4096, K=2048, lenna.pgm); speed-up vs. number of processors for block sizes 8, 32 and 128, with linear speed-up shown for reference
In this subsection, the effective parallelism of the Aggressive PNN algorithm is evaluated. For the Lenna image, the elapsed time is measured on a PC cluster which consists of 8 CPUs (Celeron 1 GHz) and a LAN (100 Mbps), with MPICH 1.2.4 and Linux 2.4. As shown in Figure 5, the parallelism is around 2 at most in the case that T is small, say 1024; however, the parallelism goes up to 7.6 for 8 processors when T is large, say 4096, and the block size is 128. In the latter case, the number of broadcast messages is 285, 78 and 17 for block sizes of 8, 32 and 128, respectively; that is, the reduction of communication messages enhances the effective parallelism of the Aggressive PNN algorithm. When the block size is large enough, the start-up cost is reduced sufficiently to provide good parallelism. Note that when T is 4096 and the block size is 128, the parallelism is not linear with 5 to 7 CPUs because the number of broadcast communications is almost the same as with 4 CPUs.
5. RELATED WORKS
As mentioned earlier, codebook generation algorithms for VQ compression are mainly the LBG algorithm and the PNN algorithm. LBG is based on an iterative method, like the K-means method, which is applicable to other problems than codebook generation. Ratha et al. evaluated a parallel algorithm for the K-means method and GA and indicated that GA is superior to the K-means method in terms of convergence ability and parallelism [5]. Tsai et al. developed a dedicated system called a reconfigurable array to evaluate a parallel clustering algorithm based on the K-means method [6]. Olson evaluated parallel algorithms for hierarchical clustering using a PRAM [7]. PNN is one of the hierarchical clustering methods with minimum variance, but the literature [7] does not mention the enhancement of parallelism by aggregating messages, which is proposed in this paper.

6. CONCLUSIONS
We have described two parallel algorithms for codebook generation based on the Lazy PNN method. The Safe PNN algorithm can produce the same codebook as Lazy PNN does, but its effectiveness is limited due to the high communication cost. On the other hand, the Aggressive PNN algorithm merges plural pairs aggressively into one vector at a time and may produce a slightly worse codebook. However, in spite of this demerit, the Aggressive PNN method successfully exploits the parallel approach and provides sufficient parallelism. Our preliminary experimental results show that the proposed algorithm provides a parallelism of about 7.6 on a PC cluster with 8 CPUs; namely, our algorithm achieves a nearly linear speed-up.
1. Initialization
   T: the number of training vectors
   K: the size of codebook (K << T)
   for(i) v[i]=0;     // invalidate all dist.
   m=T;               // init. m as T

2. Calculation of distance
   for(i=0; i<m; i++) {
     if(v[i]==0){
       nn[i]=...;     // find nearest neighbor
       d[i]=...;      // calc. dist. with nn[i]
       v[i]=1;
     }
   }

3. Sorting the distance values in an ascendant order
   d[]->d[]->d[]->d[]->d[]...

4. Fetch the pair from the top of the list
   imin=...;          // the pair (imin, jmin)
   jmin=...;          // (jmin=nn[imin])

5. Recalculating distance if invalid
   if(v[imin]==0) {
     nn[imin]=...; d[imin]=...; v[imin]=1;
     goto step 3;
   }

6. Merging the vectors imin and jmin
   for(i=0; i<m; i++) {
     if(nn[i]==imin || nn[i]==jmin) v[i]=0;
     if(nn[i]==m-1) nn[i]=jmin;
   }
   create new vector from imin and jmin and set the vector at imin;
   copy (m-1)-th vector to jmin;
   m=m-1;

7. Finalization
   if(m!=K) goto step 3;

Figure 6. Lazy PNN algorithm
REFERENCES
[1] Equitz W.: "A new vector quantization clustering algorithm", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 37, No. 10, pp. 1568-1575 (1989)
[2] Gersho A. and Gray R.: "Vector Quantization and Signal Compression", Kluwer Academic Publishers, Boston (1992)
[3] Franti P. and Kaukoranta T.: "Fast implementation of the optimal PNN method", Proc. of Int'l Conf. on Image Processing, Chicago (1998)
[4] Kaukoranta T., Franti P. and Nevalainen O.: "Fast and space efficient PNN algorithm with delayed distance calculations", Proc. of 8th Int'l Conf. on Computer Graphics and Visualization (1998)
[5] Ratha N., Jain A. and Chung M.: "Clustering using a coarse-grained parallel Genetic Algorithm", Proc. of Computer Architectures for Machine Perception (CAMP'95), pp. 331-338 (1995)
[6] Tsai H., Horng S., Tsai S., Lee S., Kao T. and Chen C.: "Parallel Clustering Algorithms on Reconfigurable Array of Processors with Wider Bus Networks", Proc. of the 1997 Int'l Conf. on Parallel and Distributed Systems (ICPADS'97), pp. 630-637 (1997)
[7] Olson C.: "Parallel algorithms for hierarchical clustering", Parallel Computing, Vol. 21, pp. 1313-1325 (1995)
[8] Pacheco P.: "Parallel Programming with MPI", Morgan Kaufmann Publishers (1997)
A Scalable Parallel Video Server Based on Autonomous Network-attached Storage

G. Tan, S. Wu, H. Jin, and F. Xian

Internet and Cluster Computing Center, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, P.R. China

This paper presents a cluster-based parallel video server using autonomous network-attached storage technology. In this architecture, the storage nodes not only take care of data storage and retrieval, but also keep track of their own resource utilization and take part in the admission control procedure. They accomplish the application-related processing of media data and then send them to the clients directly, thus obviating the need for stream synchronization at specific processing nodes. This paper studies several major technical aspects of the system design, paying particular attention to the storage policies. The experimental results show high performance and near-linear system scalability with our design.

1. INTRODUCTION
With the advance of technologies in multimedia computing and high-speed networks, video services over the network are becoming more and more popular. The data-intensive nature of media streams imposes stringent demands on video server systems and usually requires a high performance server to provide quality services. State-of-the-art parallel video servers can be roughly classified into two categories: SMP-based single-machine systems and cluster-based multiple-computer systems. Compared with the SMP architecture, cluster-based systems have the advantages of easier scalability and a potential capability for fault tolerance, and therefore have attracted a lot of attention from researchers as well as ISP's in recent years. In this paper we present the design and implementation of a cluster-based parallel video server. Our system is based on an "autonomous" network-attached storage sub-system. The "autonomous" means that the network-attached storage nodes do not only serve passively as data containers under the command of the control node, but also take part in the resource scheduling, accomplish the application-related data processing and then transmit the data to the clients directly. This mechanism offloads all the data-intensive processing from processing nodes to a large number of storage nodes, and thus maximizes the parallelism of concurrent stream operations. Based on this architecture, we study appropriate data splitting and placement strategies. We also propose a dynamic stream scheduling and admission control algorithm. A set of experiments, in both real and simulation environments, shows high performance scalability with our design and reveals the relationship between some key parameters and the system performance. The rest of this paper is organized as follows: section 2 overviews some related work; section 3 presents the system architecture; section 4 describes several major design issues; section 5 gives the experimental results and some discussion; finally we conclude in section 6.
2. RELATED WORK
A cluster-based parallel video server can be constructed with either a Proxy-based or a Direct Access architecture. In the proxy-based architecture, a set of access servers takes responsibility for combining streams from multiple storage nodes and forwarding them to the clients. The access servers work like proxies located between the clients and the stored data, hiding the storage nodes behind themselves. One weakness of this architecture is the heavy inter-node traffic introduced by the flow of requested media data. In the Direct Access architecture, however, the media data do not need an intermediate forwarding process and can be sent directly to the clients under the control of the access servers. It can obtain higher resource utilization by removing unnecessary data movement. Depending on whether the storage nodes are exposed to the clients or not, the architecture can be further classified into Multiple-Access-Point (MAP) and Single-Access-Point (SAP). The MAP architecture allows clients to contact the storage nodes directly, but the lack of a single access point makes it unsuitable for an Internet-oriented service. Unlike MAP, SAP hides the storage nodes behind a control node and gives clients a single contact point. Client transparency is achieved through the strict control of concurrent data streams by the control node. C. Akinlar and S. Mukherjee proposed a multimedia file system based on network-attached autonomous disks [1]. The autonomous disks have specific modules called AD-DFS in their operating system kernels. Clients can request data directly from individual disks using a particular module in their OS kernels, which understands the block-based object interface exported by AD-DFS. This mechanism resembles the principles of the MAP architecture. Other systems, such as USC's Yima-2 [9], DAVID [4] and Microsoft's Tiger [3], are typical examples of the SAP architecture. Yima-2 adopts a symmetric node organization in which each node can be both control node and data node. The media data blocks are evenly distributed across all the nodes using a pseudorandom function. Yima-2 achieves precise playback with three levels of synchronization and thus is much more complex than the mechanism adopted in our system. In DAVID, the clients' data is concurrently pushed out from multiple storage nodes. The load balancing effect of the storage nodes is obtained at the cost of high synchronization overhead on the control node, and as a result the system's scalability is severely restricted. Microsoft's Tiger is a special-purpose file system for video servers delivering data over ATM networks. It relies on an ATM network and thus makes the system not immediately compatible with the public Internet: when receiving data from the server in a network environment other than ATM, front-end processors may be needed to combine the ATM packets into recognizable streams. While the above systems exploit intra-stream parallelism and obtain fine-grained parallelism, our system adopts inter-stream parallelism. This mechanism greatly simplifies the implementation and can also obtain high performance with the proposed load balancing techniques.

3. SYSTEM ARCHITECTURE
The server system is primarily composed of three parts: a front-end node, network-attached storage nodes, and a high-speed internal network, as illustrated in Figure 1. The storage node in this figure can be realized by different hardware platforms, e.g., a regular workstation or active disks attached to a network.
The media files are split into multiple segments and distributed across the storage nodes. When playing back, the segments of a file are retrieved from the disks and transmitted to clients one by one in order of playing time.
Figure 1. System Architecture

The front-end machine, also called the control node, is the only entry point of the system. It takes responsibility for accepting clients' requests and performing admission control in collaboration with the storage nodes. In addition, it redirects the clients' feedback (in the form of RTCP [7] packets) to the appropriate storage nodes. Upon receipt of a new play request, the front-end node first processes the authentication. If passed, the request is broadcast to the storage nodes, which translate it into a set of sub-requests and schedule them according to their own task agendas. The front-end machine collects the scheduling results, makes a comprehensive decision, sends it to the client, and then schedules the storage nodes to arrange their playing tasks. The multiple storage nodes are hidden behind the control node and share a common IP address, called the site IP. The media data sent out from these nodes are labelled with this site IP as their source address, making them seem to come from a single machine. In our implementation, we adopt RTSP [8] and RTP/RTCP [7] as the user interaction protocol and the data transport protocol, respectively.
4. SERVER DESIGN
4.1. Data striping and placement
To obtain load balancing, the media files are split into many segments and distributed across the storage nodes. For the file splitting, we designed a novel strategy: Owl [6]. This method defines a track layer above the actual multimedia data format layer, which provides a uniform accessing interface for the server and enables it to perform SEEK-type operations more easily without understanding the media data format. Moreover, it saves runtime resources for the server, as most of the CPU- and memory-intensive processing has been performed in the preprocessing phase. Each segment has two replicas for the purpose of fault tolerance. The segments of one movie are placed in a round-robin fashion, with the first one at a randomly chosen storage node. In our system, each movie segment together with its processing code is regarded as an object. While similar to the traditional concept of an object, this abstraction has its own features. The attributes, data, and processing code of the object are all stored separately on disk in different forms: a clip index file, a clip file, and a program file, respectively. The clip index file is a text file recording all the clip items of one movie. It is named xxx.index, where xxx is the corresponding movie file name, and the content of the index file is a table-like file illustrated in Table 1.
Table 1
An example of index file for movie LifeIsBeautiful.mpg

Clip No. | Clip Name              | Range         | Buddy Location | Next Location1 | Next Location2
   0     | LifeIsBeautiful.mpg.00 | 0 - 5:00      | Node5          | Node3          | Node6
   4     | LifeIsBeautiful.mpg.04 | 20:00 - 25:00 | Node5          | Node3          | Node6
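For illustration only, one index entry could be represented and read as follows; the field names mirror Table 1, while the structure and the parsing code are our own assumptions and not taken from the system described here (the range is assumed to be written without embedded blanks).

#include <stdio.h>

/* One line of a movie index file (cf. Table 1). */
typedef struct {
    int  clip_no;
    char clip_name[128];
    char range[32];           /* e.g. "20:00-25:00"               */
    char buddy_location[32];  /* node holding the other replica   */
    char next_location1[32];  /* first location of the next clip  */
    char next_location2[32];  /* second location of the next clip */
} ClipEntry;

/* Parse one whitespace-separated index line; returns 1 on success. */
static int parse_clip_line(const char *line, ClipEntry *e)
{
    return sscanf(line, "%d %127s %31s %31s %31s %31s",
                  &e->clip_no, e->clip_name, e->range,
                  e->buddy_location, e->next_location1,
                  e->next_location2) == 6;
}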
In Table 1, the Buddy Location refers to the location of the other replica of the same movie segment, and Next Location1 and Next Location2 point to the positions of the next segments in playing time. These parameters serve as pointers when one object needs to interact with another one. The pointers link up all the objects into a well-organized data structure. With this structure, a movie's playback is in fact a sequence of object operations: the first object retrieves and transmits its data, then it tells the next object to transmit the next segment, and so on, until all the objects have finished their tasks. We call the sequence of objects participating in the actual playback the "playing path", and the remaining objects of this movie compose a "shadow path". The shadow path accompanies the playing path and periodically pulls information like the normal playing time, RTP sequence number, RTP timestamp, etc., as checkpoints. When one object crashes, its buddy object resumes its work using the latest checkpoint data.

4.2. Stream scheduling and admission control
In order to guarantee the service quality, we design a stream scheduling and admission control algorithm. In our system, since a movie is partitioned into multiple segments, the task of playing a movie is divided into a set of sub-tasks, each representing the playing of one segment. A movie-playing request is translated into a set of sub-requests by the storage nodes according to the movie index files. All the sub-requests must be scheduled by the storage nodes before their corresponding sub-tasks are accepted, and the scheduling results jointly determine the final result of the request. To schedule sub-requests, each storage node maintains a scheduling table recording all the playing tasks currently being executed or to be executed in the future. Denote the bandwidth of an individual task by b(T), the aggregate bandwidth at time t by B(t), and the beginning time and ending time of one task by bgn(T) and end(T), respectively. We define the peak value of the aggregate data bandwidth in the time range of a task as:
A(T_new) = Max{ B(t) = Σ_{T ∈ Φ, bgn(T) ≤ t ≤ end(T)} b(T) + b(T_new) },  bgn(T_new) ≤ t ≤ end(T_new),     (1)

where Φ is the set of tasks that have already been arranged in the scheduling table. Assuming the maximum data bandwidth of one node is A_max (typically determined by the local disk I/O and outbound network bandwidth), the system load in the period of a task is defined as L(T_new) = A(T_new)/A_max. If L(T_new) exceeds 1, the sub-request is rejected. Based on this rule, the admission control process is handled as follows: if one sub-request is rejected by both storage nodes owning the corresponding segments, the whole request is rejected. If only one storage node can admit the sub-request, the corresponding sub-task is assigned to this node. If both storage nodes can admit the sub-request, the one with the smaller L(T_new) is selected as the performer of this sub-task.
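A minimal sketch of this threshold-based check is given below; the task representation, the names and the one-second discretization of t are our own simplifications, and a real implementation would only need to evaluate B(t) at task start and end events.

typedef struct { double bgn, end, bw; } Task;

/* Local admission control on one storage node, cf. equation (1):
 * the sub-request T_new (interval [bgn_new, end_new], bandwidth b_new)
 * is admitted only if the peak aggregate bandwidth, including b_new,
 * stays within the node limit A_max, i.e. L(T_new) <= 1. */
static int admit(const Task *tasks, int ntasks,
                 double bgn_new, double end_new, double b_new,
                 double A_max)
{
    for (double t = bgn_new; t <= end_new; t += 1.0) {
        double B = b_new;                      /* B(t) including T_new */
        for (int i = 0; i < ntasks; i++)
            if (tasks[i].bgn <= t && t <= tasks[i].end)
                B += tasks[i].bw;
        if (B / A_max > 1.0)                   /* L(T_new) > 1: reject */
            return 0;
    }
    return 1;                                  /* admit the sub-request */
}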
All the scheduling results are returned to the front-end machine for a comprehensive decision. Note that A_max may be different for various nodes; this allows the scheduling algorithm to adapt to a heterogeneous environment. The above discussion is based on the assumption that the streams are constant bit rate and that the local admission control adopts a threshold-based deterministic policy. In fact, since each storage node is an autonomous entity, it can use whatever algorithm it likes to process the admission control, as long as it gives a deterministic or probability-based result to the front-end node. In the latter case, the final admission probability should be P = ∏_{i=1}^{n} P_i, where n is the number of segments of a given movie and P_i is the probability of admitting object i given by its storage node.

5. PERFORMANCE EVALUATION
Currently the system is implemented on both RedHat Linux 7.2 and Windows 2000, using some libraries from Apple Computer Inc.'s QuickTime Streaming Server. We conducted a set of experiments to verify our design. In addition to system scalability testing, we study two parameters related to the storage sub-system: clip length and replicate degree (the number of replicas of one movie). The experiment environment includes a cluster system consisting of a control node and four storage nodes, and a client end consisting of several PCs. We issue play requests in the form of a Poisson stream according to input parameters, and the movie selection pattern conforms to a Zipf distribution [2]. We choose a variable-bit-rate (mean bit rate = 0.46 Mbps) MPEG-1-coded movie clip and make a certain number of replicas to set up the movie library. Throughout the tests we set the maximum threshold data bandwidth to 15 Mbps per node.
5.1. System scalability
For system scales of 1 to 4 storage nodes, the testing program measured the maximum number of concurrent streams, the maximum throughput and the maximum accepted arrival rate of requests. The test results are shown in Table 2.

Table 2
System performance with different system scales

# Storage nodes | Max concurrent streams | Max throughput (Mbps) | Max arrival rate (req/s)
       1        |           39           |         14.69         |          0.195
       2        |           72           |         27.11         |          0.350
       3        |          109           |         40.47         |          0.550
       4        |          135           |         51.76         |          0.770

Rejecting rate (number of rejected requests / number of issued requests) set to below 2.5%. The number of movies set to 30, each with 10 clips.
Table 2 indicates that, for scales of 1-4 storage nodes, the system performance exhibits good scalability. In the testing process, we observed that the CPU utilization of the control node remains below 5% in all cases. This suggests that even a single control node has considerable potential capacity to accommodate more streams. To reveal the system scalability at larger scales, we extracted the stream scheduling and admission control algorithm from the system and developed a simulator, named VoDSimu, using processes to simulate the node behavior. Under the same assumptions and conditions as in the above experiment, the system scalability is illustrated in Figure 2.
Figure 2. System scalability in simulation

The number of maximum concurrent streams generated by the simulator is 39, 75, 109 and 143 for 1, 2, 3 and 4 storage nodes, respectively. This shows good agreement with our experimental results in the real environment, which verifies our simulation. Further simulation results show that the system performance can be increased with linear scalability.
5.2. Effect of clip length on system performance
The determination of the clip length may have an important impact on the system behavior. Intuitively, small clips lead to fine-grained load balancing and hence result in better performance, but the frequent shifting of control from one storage node to another also introduces extra overhead. To reveal the relationship between this parameter and the system performance, we conducted experiments on the 4-storage-node system with the number of clips for each file being 5, 10, 20, 40, 60 and 80, respectively. Multiple rounds of testing were conducted, and one set of data is illustrated in Figure 3. From Figure 3 we can see that the length of a clip has no significant influence on the system performance. This result indicates that under general conditions the number of clips need not be a large value. In fact, additional experiments in extreme cases show that a non-striping policy obtains performance comparable to a striped policy under the proposed assumptions. It should be pointed out that the examined clip length is based on a scale of minutes. On this scale, the size of one movie clip usually exceeds 10 MBytes, which makes the influence of disk seek time on system performance insignificant.

5.3. Effect of replicate degree on system performance
There are two purposes for clip replication: fault tolerance and load balancing. We note that the replicate degrees for movies are constrained by the limited storage space. So we should answer the following question: how many replicas for each movie are needed to achieve good load balancing? We examined 4 different replicate patterns in simulation with different system scales and clip lengths: (1) Single-replica (SR): each movie has only one replica; (2) Double-replica (DR): each movie has two replicas; (3) Variable-replica 1 (VR1): each one of the top 10 (most frequently accessed) movies has a replicate degree of 3, and the remaining movies have a replicate degree of 2; (4) Variable-replica 2 (VR2): each one of the top 10 movies has a replicate degree of 3, the following 11-30 have a replicate degree of 2, and the remaining movies have a replicate degree of 1.
Figure 3. Number of clips vs. system performance          Figure 4. Replica degree vs. system performance
The simulation results of performance vs. accessing skewness degree are presented in Figure 4 (in all cases we set the number of storage nodes to 8, the number of clips to 10, and the number of movies to 200). From Figure 4 we have the following observations and conclusions: (1) From the SR scheme to the DR scheme, there is an obvious improvement in system performance. This is attributed to the load balancing effect of the scheduling algorithm. (2) From the DR scheme to the VR1 scheme, there is only a slight performance improvement, which suggests that the addition of replicas on the basis of the DR scheme brings only quite limited benefit. (3) The VR2 scheme has performance very near to the VR1 scheme, and greater than the DR scheme. This result has the implication that the data storage cost can be reduced by (400 - (30 + 40 + 170))/400 = 40% without performance degradation when the reliability of the service for cold movies is not a critical issue. (4) In the SR scheme, where the stream scheduling algorithm is effectively ignored, the system performance declines with increasing skewness parameter. This tendency decreases when the replicate degrees for the movies, especially the popular ones, are raised above 1. This means the proposed scheduling algorithm has the expected effect of smoothing the variability of movie accesses and is fairly robust when the access pattern changes.

6. CONCLUSION AND FUTURE WORK
In this paper we described the design and implementation of a parallel media server based on autonomous network-attached storage. In this architecture, the media data is processed locally at the storage nodes and transmitted to the clients directly. Both experiments on the real system and simulation showed the high performance achieved with our design. In the future we will seek further performance optimization for the system. Moreover, the characteristics of real-world server workloads need to be studied to provide more practical guidance for real deployment.

REFERENCES
[1] C. Akinlar and S. Mukherjee, "A Scalable Distributed Multimedia File System Using Network Attached Autonomous Disks," Proceedings of the 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2000, pp. 180-187.
[2] R. L. Axtell, "Zipf Distribution of U.S. Firm Sizes," Science, Vol. 293, Sept. 7, 2001, pp. 1818-1820.
[3] W. J. Bolosky, R. P. Fitzgerald, and J. R. Douceur, "Distributed Schedule Management in the Tiger Video Fileserver," Proceedings of SOSP-16, St.-Malo, France, Oct. 1997, pp. 212-223.
[4] A. Calvagna, A. Puliafito and L. Vita, "Design and implementation of a low-cost/high-performance Video on Demand server," Microprocessors and Microsystems, 24 (2000), pp. 299-305.
[5] R. L. Haskin and F. B. Schmuck, "The Tiger Shark File System," Proc. of COMPCON'96, 1996, pp. 226-231.
[6] H. Jin and X. Liao, "Owl: A New Multimedia Data Splitting Scheme Based on Cluster Video Server," Proceedings of the EUROMICRO Conference, 2002.
[7] H. Schulzrinne, S. Casner, R. Frederick and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications," RFC 1889, http://www.ietf.org/rfc/rfc1889.html, Jan. 1996.
[8] H. Schulzrinne, A. Rao, and R. Lanphier, "Real Time Streaming Protocol (RTSP)," RFC 2326, http://www.ietf.org/rfcs/rfc2326.html, Apr. 1998.
[9] C. Shahabi, R. Zimmermann, K. Fu, and S.-Y. D. Yao, "Yima: a second-generation continuous media server," Computer, Vol. 35, Issue 6, June 2002, pp. 56-62.
Efficient Parallel Search in Video Databases with Dynamic Feature Extraction

S. Geisler

Department of Computer Science, Technical University of Clausthal, Julius-Albert-Straße 4, 38678 Clausthal-Zellerfeld, Germany
E-mail: [email protected]
Video databases with dynamic feature extraction offer the possibility of powerful and flexible queries. However, the retrieval is very time consuming. Efficient algorithms are required to manage the search for objects in one single video, and parallel methods are necessary to cope with the large number of videos in a database. In this paper both are considered, with a focus on parallelism. We compare three different levels of parallelism using experimental results and discuss the suitability of SMP and cluster architectures.
1. INTRODUCTION
In the last few years the number of available digital video files has increased enormously. Nearly every television station has implemented a digital video archive or is at least interested in one. Two of the many advantages are fast access to the video data and the possibility of video on demand. Efficient storage and retrieval is still a field of ongoing research. The most popular systems are VIDEOQ [1], the VIRAGE VIDEO ENGINE [2] and CUEVIDEO [3]. The goal is a content-based search in the video data, so that the expensive creation of textual descriptions can be avoided. All these systems have in common that the feature vectors, like colour moments, textures, shape or motion of recognized objects, are calculated when the video is inserted into the database. As the provider cannot know what the interests of later users will be, it is possible that important information will not be extracted and stored. Besides, the automatic recognition of objects still works unreliably. In this paper we present an approach in which the feature vectors are calculated after the user has sent his query to the video database. This follows a suggestion of KAO for image databases [4]. Objects or persons can be found because the template matching algorithm is performed on each image in the database. Obviously such systems will need a lot of computational power, which today cannot be realised without parallelism and efficient algorithms for searching in digital videos. Both are presented in the following sections after a short description of the video retrieval process. The efficiency of different levels of parallelism and different parallel architectures are compared with each other.
2. THE QUERYING PROCESS
The regarded system follows the visual paradigm like VIDEOQ [1] or several image databases such as PHOTOBOOK [5], QBIC [6], VISUALSEEK [7], VIRAGE [8] or CAIRO [4]. Instead of key words and logical operators, the query consists of a sketch or an example image or video. Additionally, the user can mark a region of interest, which is in most cases a person or an object. The task is not to find exact matches but a set of videos or video shots which contain a similar frame or a frame with a corresponding region. The comparison between the query image and a video frame is the same as in image databases. In the case of object search, template matching is used. This means the query template is compared to every region of the video frame with the same size. Motion is not in the focus of this paper, but it is another important aspect of video databases.

3. HIGHLY EFFICIENT NON-PARALLEL SEARCH
Before the main aspect of this paper, the speed-up by parallelism, is regarded, the acceleration through the reduction of the number of frames to analyse and the work in the compressed video domain are introduced.

3.1. Reducing the number of frames to analyse
Videos normally contain 25 (PAL) or 30 (NTSC) frames per second; thus a 90-minute movie contains 135,000 or 162,000 frames. This huge amount cannot be analysed in acceptable time, but obviously this is not necessary, as neighbouring frames in most cases do not differ very much. The first idea, to search only one key frame per shot, was not pursued, as the interesting detail may not be part of this frame: it could be hidden by another object or appear before or after the key frame in this shot. Instead, one frame in a fixed interval is used. The length of the interval was chosen as nearly half a second, which is short enough to find every object the viewer of the video will be able to remember. There is another interesting implementation detail. In most MPEG-1 [9] and MPEG-2 [10] videos every 12th or 15th frame is an I-frame, which means all information is stored in the frame structure without references to other frames. The video analysis can be implemented in a very efficient manner if the search is only performed on the I-frames.

3.2. Working on compressed MPEG-1/2 videos
The frames in MPEG-1/2 videos are partitioned into blocks of 8 x 8 pixels, to which the Discrete Cosine Transformation (DCT) is applied. In the compressed file only the coefficients of the transformation are stored. The inverse DCT is the most time consuming step during the decoding phase of a video. For this reason, the search is not performed on the pixel data but on the coefficients of the DCT, after the same transformation has been performed on the query image. FALKEMEIER et al. [11] as well as SHEN et al. [12] use similar approaches to work in the compressed domain for video parsing and indexing. Another speed improvement is achieved by comparing only the first coefficient of each block, the DC coefficient, which represents the average value of the luminance or chroma of the block. This is comparable with a search on the frame with reduced resolution. As objects in most cases will not be placed at the edges of the blocks, a search including subpixels of the DC image leads to better results, but on the other hand lowers the gain in speed.
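As an illustration of the search on DC coefficients, a sum-of-absolute-differences match of the query's DC image against every position of a frame's DC image might look like the following C sketch; the data layout (one 8-bit DC value per 8 x 8 block, luminance only) and the function names are our own assumptions, and subpixel positions, scaling and rotation of the template are omitted.

#include <stdlib.h>
#include <limits.h>

/* Match a query DC image (qw x qh values) against a frame DC image
 * (fw x fh values) and return the best SAD; the best position is
 * stored in (*bx, *by). */
static long match_dc(const unsigned char *frame, int fw, int fh,
                     const unsigned char *query, int qw, int qh,
                     int *bx, int *by)
{
    long best = LONG_MAX;
    for (int y = 0; y + qh <= fh; y++) {
        for (int x = 0; x + qw <= fw; x++) {
            long sad = 0;
            for (int j = 0; j < qh && sad < best; j++)
                for (int i = 0; i < qw; i++)
                    sad += labs((long)frame[(y + j) * fw + (x + i)]
                                - (long)query[j * qw + i]);
            if (sad < best) { best = sad; *bx = x; *by = y; }
        }
    }
    return best;
}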
4. PARALLELIZING THE SEARCH
With the former methods, the search in a video of 15 minutes length can be performed in 90 seconds on a 2.2 GHz Xeon system. In this example the query template is even tested in three sizes, three rotation steps and four subpixel positions of the DC image. This time is acceptable for users if the task is to search only in one single video, but in video databases, where hundreds of videos have to be analysed, parallelism is necessary to achieve acceptable response times. In [13], BRETSCHNEIDER et al. have shown that for large image databases cluster architectures are best suited, as the number of access conflicts while using the hard disk and other resources is below the value in SMP computers. For smaller databases with up to four processing elements (PE), both architectures show similar performance. These results may be transferred to video databases with little differences. One problem of the cluster architecture is that in general a good schedule can only be achieved when the videos are transferred between the nodes. This is necessary because the query can consist of several steps (e.g. search in videos of a special author or from a special week) and the dynamic search for objects takes place only on a small part of the database, which in general is not symmetrically distributed among the cluster nodes. Using the introduced algorithm to analyse a video, the computing time with respect to the file size is shorter than for images, due to the disregarded components like P- and B-frames, AC coefficients and the whole audio information. The transfer of hundreds of megabytes of video files would slow down the process significantly. In SMP architectures the PE's all have the same access times to the videos. A second aspect is that the probability of a good distribution of the videos increases with a smaller number of nodes. These considerations lead to the goal of finding a good combination between SMP and cluster architecture. Three different levels of parallelism will be introduced in the following.

Coarse-grained parallelization on video file level: The easiest way is to start one task for each PE (Figure 1a). Each task has to search one video file. The advantage is minimal communication effort, but on the other hand, idle times occur due to the different lengths of the videos and if the number of files to analyse is not a multiple of P, where P is the number of PE's in the SMP computer.

Parallelization on frame level: The second approach is to create P threads to analyse one video file (Figure 1b). After the video is loaded and demultiplexed, all threads start at the beginning to go through the video file. A thread searches a frame only if no other thread has analysed or is just analysing it. Hence communication is necessary before the search can be performed. Idle times are reduced to the time for the analysis of one frame per video and the demultiplexing at the beginning.

Fine-grained parallelization by partitioning each frame: Each frame is divided into P parts. For each section a thread is generated, which searches the part of the frame for the query image (Figure 1c). The load of the PE's is nearly symmetric, but there is more communication overhead than in the other cases. Another drawback is that only the analysis is executed in parallel, not the decoding of the frame.
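The parallelization on frame level can be sketched with a shared frame counter protected by a mutex, as below; decode_iframe() and search_frame() are placeholders for the MPEG decoding and the template search, and the whole fragment is only an illustration of the claiming scheme, not the implementation used for the measurements.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_frame = 0;         /* index of the next unclaimed I-frame */

/* Worker executed by each of the P threads analysing one video file. */
static void *worker(void *arg)
{
    int num_iframes = *(int *)arg;

    for (;;) {
        pthread_mutex_lock(&lock);
        int mine = next_frame++;   /* claim the next free frame */
        pthread_mutex_unlock(&lock);
        if (mine >= num_iframes)
            break;
        /* decode_iframe(mine); search_frame(mine); */
    }
    return NULL;
}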
434
P4 ~iiiii~iiiiiiiiiiiiiiiiiiiiiiiiiiiiii~iiiiiiiiiiiiiii i~iiiiiiiiiiiiii~i~i~iiiiiiiiiiiiii~i~iii~iiil P3 [iiiii~i
i
~
i~!i
~
!ii !!!!!ii~i ~i i i i !i~i i i i i i i~i i i i i i i~]
P2
P1 [ili
~!~iii!iiiii
P4
|~i
P2
[
P3
P1
~
~i
[ L,
~
P2
Pl [L, Ollm'lil
!i~i ~I
:iiii~
P4
i
[
~i
~I ... i~i] iii:"'" i
ii~
~|ilil
D [li!ili!!
P3
i~i~~ii~iiii!!iiiiiiii~iiiiiiiiiiiiiii!iiiiii~i]
iill
~!~i~]
| L,
~
o[
iiiiiil ...
il
[ii
i~
~iiiiil...
i]
......
i~iiii!]......
[il
...... ~ii
ilI
~
"'" [~
"'"
Figure 1. Parallelization a) on video file level, b) on frame level, c) by partitioning each frame (L, D: load and demultiplex video file; C: communication, searching next free frame; Dec: decode frame; RV: gather results of all threads of the video; RA: gather results of all tasks)

5. DISCUSSING PERFORMANCE AND SPEED UP
In this section the performance and speed-up of the different parallel approaches are compared, and a combination to improve the performance is discussed. Two different architectures are compared: a dual Xeon 2.2 GHz and an Alpha workstation with 4 PE's at 600 MHz. The Xeon system is tested with and without hyperthreading. The hyperthreading technology allows using some functional units in parallel by presenting an additional logical PE.

5.1. Parallelization to speed up the search in one video file
The results of the performance measurements for searching in one video file are shown in Figure 2a. The first interesting fact is that the parallelism on frame level works more efficiently than the finer-grained algorithm. For this reason an even finer-grained algorithm, like parallelization of the DCT, is not taken into account. This statement applies to all tested architectures. Secondly, the maximum speed-up is 3.4. Not surprisingly, it was achieved with the four-processor Alpha workstation. A speed-up of 2.2 can be reached with two real PE's due to hyperthreading. It may be increased even further if special characteristics of the architecture are exploited.

5.2. Parallelization to speed up the search in many video files
The next test scenario is to analyse more than one video file in parallel. The parallel search in different files is compared to the parallelism on frame level. The test set contains more than 100 video files. Figure 2b depicts that the coarse-grained algorithm works slightly more efficiently on dual-processor architectures. The advantage increases with a rising number of PE's. The parallel search in videos on the four-PE Alpha workstation reaches a speed-up of 3.9, whereas with the parallel search in different frames of one video only a speed-up of 3.1 is achieved on the same machine.
Figure 2. Parallelization to accelerate the search in a) 1 and b) 100 video files. Used architectures are a dual Xeon with (XH) and without (X) hyperthreading and an Alpha workstation with 4 PE's (A).

In this test it was assumed that the data set can be divided into equally sized subsets. A more realistic scenario is discussed in the next section, where a fixed number of videos of different lengths has to be analyzed.

5.3. A cluster architecture with SMP-nodes
The results from the last sections give us a hint how to build a cluster of SMP computers for larger databases. We assume that the video files are stored distributed over the nodes without any order, as we do not know what will be searched. Furthermore, we assume that a subset of all stored videos, e.g. all videos from a special date or author, has to be searched for the query image. The query is processed as follows: First of all, every node searches the locally stored videos. If it runs idle, it requests a video from another node which has not yet been analyzed. The search is continued in this video, which will be deleted on this node after the search has been completed. Table 1 shows the speed-up for different configurations of the cluster. From these results we can see that there is no best solution for every use case; it depends on the size of the subset of videos. For the small set of 50 videos the parallelization on frame level is better suited than parallelization on video level. The reason is not the algorithm itself, as seen before, but that with smaller execution times a better schedule can be built and the load of the PE's is more evenly distributed. Furthermore, the best suited number of PE's per node is two for the small subset of videos. The algorithm performs not very well with four, as shown in section 5.2. Using 32 nodes with one PE each leads to a disadvantageous distribution of the video files. For larger datasets the negative influence of the uneven schedule is lower than the positive influence of the better scalable parallelization on video level.
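The query loop of one node, as described above, can be summarized by the following C/MPI sketch; the use of a coordinator at rank 0, the message tags and the helper functions are our own assumptions and merely illustrate the idea of requesting further videos when a node runs idle.

#include <mpi.h>

enum { TAG_REQUEST = 1, TAG_REPLY = 2 };

/* First search all locally stored videos, then keep asking the
 * coordinator for the name of a video that has not been analysed yet;
 * an empty reply ends the query.  search_video(), fetch_video() and
 * delete_local_copy() stand for the actual implementation. */
static void node_query_loop(char **local_videos, int nlocal, MPI_Comm comm)
{
    char name[256];
    char dummy = 0;

    for (int v = 0; v < nlocal; v++) {
        /* search_video(local_videos[v]); -- local phase */
    }

    for (;;) {
        MPI_Send(&dummy, 0, MPI_CHAR, 0, TAG_REQUEST, comm);  /* node is idle */
        MPI_Recv(name, sizeof name, MPI_CHAR, 0, TAG_REPLY, comm,
                 MPI_STATUS_IGNORE);
        if (name[0] == '\0')
            break;                 /* nothing left to analyse */
        /* fetch_video(name); search_video(name); delete_local_copy(name); */
    }
}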
Table 1
Speed up for the analysis of 50, 100 and 500 videos in different configurations with 32 PE's (each Xeon with hyperthreading is treated as two PE's).

                      |       Xeon        |     Xeon (H)      |       Alpha
                      | Parallelization   | Parallelization   | Parallelization
Videos  Nodes/PE      | Video  |  Frame   | Video  |  Frame   | Video  |  Frame
50      8/4           |   -    |    -     |  13.8  |  17.6    |  22.3  |  24.2
        16/2          |  22.8  |  28.1    |  18.9  |  25.8    |  22.7  |  27.1
        32/1          |  22.8  |  22.8    |  22.8  |  22.8    |  22.8  |  22.8
100     8/4           |   -    |    -     |  18.4  |  17.9    |  29.5  |  24.4
        16/2          |  30.4  |  29.4    |  25.2  |  27.1    |  30.0  |  28.2
        32/1          |  30.0  |  30.0    |  30.1  |  30.1    |  29.9  |  29.9
500     8/4           |   -    |    -     |  19.3  |  18.0    |  31.1  |  24.7
        16/2          |  31.8  |  30.0    |  26.4  |  27.6    |  31.6  |  29.0
        32/1          |  31.8  |  31.8    |  31.8  |  31.8    |  31.8  |  31.8

The speed up values are computed by extrapolation of the experimental results and an assumed Myrinet connection between the cluster nodes.
One or two PE's per node should be used. The speed-up reaches a value of 30 for 100 videos or even 31.8 for 500 videos. The time for communication between the nodes is relatively small, as it takes not even one second to transfer a 100-megabyte video file with a high-performance cluster network like Myrinet.

6. DISCUSSING QUALITY OF THE RETRIEVAL RESULTS
The acceleration of the search process is based on the neglect of certain information. Therefore it is necessary to have a look at the quality of the retrieval results, which can be measured by the precision of the result set according to the query object. It is defined as the ratio of the number of relevant shots to the number of shots in the result set. Obviously this measure is subjective, as it depends on the user to decide which result video is relevant and which one is not. Another important aspect is the size and the quality of the query image. In the test database, videos with a running time of more than 15 hours are stored; thus nearly 100,000 I-frames have to be searched. The size of the result set is set to 20. Neighbouring I-frames in the result set are summarized as one single hit, so that in most cases there is only one hit per shot. Searching whole frames, a precision of 0.9 is reached. Objects with about 10,000 pixels lead to an average precision of 0.6. In all test cases the shot from which the search template was taken was in the result set at a high rank, even when some minor changes, like rotation or scaling, were performed. One example query image and result set is presented in Figure 3.

7. CONCLUSIONS
In this paper an efficient method for searching in video files was presented. The search in the compressed video data allows a fast search, even for objects, with a high precision of the result set. A strategy to parallelize this algorithm for the high number of videos in a database was implemented. The best strategy and computer architecture depend on the concrete use case, but in many cases a cluster consisting of SMP nodes with 2 PE's and a coarse-grained algorithm will be the best choice.
Figure 3. A query image showing a scene from the flood 2002 in East Germany and some selected frames from the corresponding result set. The first false detection is at rank 8 for this query.

Future work includes the development of parallel search algorithms in the compressed domain of the MPEG-4 video format. The most interesting part will be the efficient search for objects, as the frames are not divided into 8 x 8 pixel blocks but into free shapes, in the ideal case the contours of the objects. Furthermore, motion of the camera and of objects will be regarded. To improve the quality of the result set, other techniques than the difference of the pixels will be developed to compare the template with the underlying image area. Approaches like Gabor wavelets will be integrated and benefit from the parallel infrastructure.
REFERENCES
[1] S.-F. Chang, W. Chen, H. Meng, H. Sundaram, D. Zhong, VideoQ: An Automated Content Based Video Search System Using Visual Cues. Proc. of ACM Multimedia, pp. 313-324, 1997.
[2] A. Hampapur, A. Gupta, B. Horowitz, C.F. Shu, C. Fuller, J. Bach, M. Gorkani, R. Jain, Virage video engine. Proc. SPIE: Storage and Retrieval for Image and Video Databases, pp. 188-197, 1997.
[3] D. Ponceleon, S. Srinivasan, A. Amir, D. Petkovic, Key to Effective Video Retrieval: Effective Cataloging and Browsing. Proc. of ACM Multimedia, pp. 99-107, 1998.
[4] O. Kao, S. Stapel, Case study: Cairo - a distributed image retrieval system for cluster architectures. In T.K. Shih, editor, Distributed Multimedia Databases: Techniques and Applications, pp. 291-303. Idea Group Publishing, 2001.
[5] A. Pentland, R. Picard, S. Sclaroff, Photobook: Content-based Manipulation of Image Databases. Proc. SPIE: Storage and Retrieval for Image and Video Databases II, vol. 2185, 1994.
[6] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, P. Yanker, Query by Image and Video Content: The QBIC System. IEEE Computer, 28 (9), pp. 23-32, 1995.
[7] J. R. Smith, S.-F. Chang, VisualSEEk: a fully automated content-based image query system. Proc. of ACM Multimedia, pp. 87-98, 1996.
[8] J. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Humphrey, R. Jain, The Virage image search engine: An open framework for image management. Proc. of the SPIE: Storage and Retrieval for Image and Video Databases, vol. 2670, pp. 76-87, 1996.
[9] ISO/IEC 11172-1 (2,4), Information technology - Coding of moving pictures and associated audio for digital storage media, Part 1 (2,4): Systems (Video, Compliance Testing), 1993.
[10] ISO/IEC 13818-1 (2,4), Information technology - Generic coding of moving pictures and associated audio information, Part 1 (2,4): Systems (Video, Compliance Testing), 1998-2000.
[11] G. Falkemeier, G.R. Joubert, O. Kao, A System for Analysis and Presentation of MPEG Compressed Newsfeeds. In Business and Work in the Information Society: New Technologies and Applications, IOS Press, pp. 454-460, 1999.
[12] K. Shen, E. Delp, A Fast Algorithm for Video Parsing Using MPEG Compressed Sequences, Proc. of the IEEE International Conference on Image Processing, pp. 252-255, 1995.
[13] T. Bretschneider, S. Geisler, O. Kao, Simulation-based Assessment of Parallel Architectures for Image Databases, Proc. of the International Conference on Parallel Computing (ParCo 2001), Imperial College Press, pp. 401-408, 2001.
Architectures
Introspection in a Massively Parallel PIM-Based Architecture
H.P. Zima^a
^a Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California, U.S.A.,
and Institute for Software Science, University of Vienna, Austria E-mail: [email protected] Processing-In-Memory (PIM) architectures avoid the von Neumann bottleneck by integrating high-density DRAM and CMOS logic on the same chip. Parallel systems based on this new technology are expected to provide higher scalability, adaptability, robustness, fault tolerance and lower power consumption than current MPPs or commodity clusters. In this paper we discuss the capability of PIM-based architectures to create a large number of lightweight threads that can be efficiently managed via special hardware support. We focus on the use of lightweight threads for the implementation of asynchronous agents that perform introspective monitoring of the application program for such purposes as program validation, performance analysis, or automatic performance tuning. 1. INTRODUCTION The integration of COTS (commercial off-the-shelf) processing and memory components has become the mainstream approach for the construction of large high-end computing systems. However, the experience with these systems increasingly indicates that this approach is severely limited in its efficiency for many important types of application as well as its scalability in view of the cost, power, and space requirements. An emerging alternative addressing this important problem is based on Processing-InMemory (PIM) technology. PIM is enabled by semiconductor fabrication processes which make possible the co-location of both DRAM (or SRAM) memory cells and CMOS logic devices on the same silicon die. It connects logic directly to the wide row buffer of the memory stack, providing the opportunity to expose substantially greater memory bandwidth while imposing lower latency and requiring less power consumption than conventional systems. A host of research projects has been undertaken to explore this new design and application space. PIM is being pursued as a means of accelerating array processing in the Berkeley IRAM project [7], and for providing a smart memory for conventional systems in the FlexRAM and DIVA projects [6, 4]. It has also been considered for the management of systems resources in a hybrid technology multithreaded architecture (HTMT) for ultra-scale computing [2]. In 1999, IBM announced the Blue Gene (BG/C) project [ 1], which is based on PIM nodes with multithreading technology. Multithreading is also used in the PIM-Lite project at the University of Notre Dame [3] and the Gilgamesh project conducted at the California Institute of Technology [8]. Finally, PIM plays an important role in the Cascade architecture currently under design in
442 a DARPA-funded High Productivity Computing Systems (HPCS) project. In this paper, we first describe the structure of the Cascade architecture, followed by a discussion of PIM-based systems (Section 2). We continue with an introduction to introspection (Section 3) and the use of asynchronous agents, implemented via PIM-based threads, to monitor the execution of application programs (Section 4). A case study illustrates how this approach can be applied to performance analysis and tuning (Section 5). Section 6 concludes the paper. 2. A PIM-BASED MASSIVELY PARALLEL SYSTEM 2.1. An overview of the cascade architecture The Cascade architecture is being designed by a Cray-led consortium including the California Institute of Technology, Jet Propulsion Laboratory, Notre Dame University, and Stanford University. It targets a PetaFlops-scale production system by 2010. Fig. 1 shows a block diagram of the current system design. Cascade is a hierarchical architecture with two levels of processors. At the top level, the system can be represented as a homogeneous arrangement of locales connected by a global interconnection network. Locales contain a heavyweight processor (HWP) with a software-controlled data cache, and a smart local memory implemented as a homogeneous array of PIM building blocks. 1 The heavyweight processors are designed to provide high performance for algorithms with temporal locality while the processing capability of the PIM-based memory can be used to exploit spatial locality. A large-scale Cascade system will contain thousands of heavyweight processors and possibly millions of lightweight processors. Cascade is a virtual shared-memory architecture with a unique hybrid UMA/NUMA data access mechanism. Data can be allocated in a segment whose addresses are scrambled pseudorandomly across the memory, providing uniform-memory access with no regard for locality. If locality is essential, as in many scientific applications, data can be organized in such a way that either coarse grain (at the locale level) or fine-grain locality (at the level of PIMs) can be exploited. 2.2. Homogeneous PIM arrays Homogeneous PIM arrays can be subsystems of more complex architectures, as discussed above for Cascade, or they can be designed as independent systems [ 1, 8, 3]. We will outline some of their salient properties independent of the context.
As shown at the bottom of fig. 1, a homogeneous PIM array is a collection of interconnected PIM nodes. Each individual PIM node consists of a lightweight processor (LWP) tightly coupled with a memory macro (M). A number of PIM nodes can be placed on a PIM chip containing multiple banks of DRAM storage, processing logic, external I/O interfaces, and inter-chip communication channels. PIMs communicate via parcels, a special class of active messages [9] supporting message driven computation. A parcel, which can be considered as a mobile thread with very small state, targets a virtual address and may be used to perform conventional operations such as remote load or store as well as to create a new thread by invoking a remote method. As a consequence, the conventional paradigm of data being transferred to the site of work is complemented in a PIM-based system by the capability of work moving to data.¹
¹ A variant of this design combines a smaller set of PIMs with "dumb" DRAM memory controlled by the PIMs.
Figure 1. The Cascade architecture.
Thread creation and management as well as synchronization are lightweight hardware-supported operations. Most recent PIM-based architectures are multithreaded with direct hardware support for single-cycle context switching. PIM architectures emphasize sustained memory bandwidth, achieving one to three orders of magnitude higher memory bandwidth to arithmetic function logic compared to conventional multiprocessor system architectures. A second important value metric is effective use of system network bandwidth.
2.3. Uses of PIM-based lightweight threads
PIM-based systems offer the capability to create a large number of lightweight threads. For example, a PetaFlops-scale Cascade system will potentially be able to create millions of lightweight threads which can be managed with very small effort compared to heavyweight thread mechanisms. Such threads can be exploited in a variety of ways, including the following:
• direct support of fine-grain parallelism in an application,
• implementation of a service layer for various functions, such as a data transport layer for data gathering, scattering, or distribution, and
• construction of agent systems that operate asynchronously to the main computation, dealing with issues such as program validation, fault tolerance, and performance tuning.
It is the last of these topics that we study in more detail in the rest of the paper.
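To make the parcel idea concrete, the following sketch emulates in plain Python threads how work moves to the node owning the data: a parcel carries a method and its arguments to the node that holds the target memory, where a lightweight worker executes it. All names (PimNode, send_parcel, store, local_sum) are illustrative assumptions, not part of any PIM system software; real parcels are hardware-supported and far cheaper than OS threads.

```python
import threading
import queue

class PimNode:
    """Toy model of a PIM node: a memory partition plus a worker thread
    that executes incoming parcels (method invocations) against local data."""
    def __init__(self, node_id, words):
        self.node_id = node_id
        self.memory = [0] * words
        self.inbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            method, args, reply = self.inbox.get()
            if method is None:          # shutdown parcel (not used in this demo)
                break
            reply.put(method(self.memory, *args))

    def send_parcel(self, method, *args):
        """A parcel: a small mobile piece of work targeting this node's memory."""
        reply = queue.Queue()
        self.inbox.put((method, args, reply))
        return reply                    # caller may wait on the reply later

# Example parcels: a remote store and a remote reduction.
def store(memory, addr, value):
    memory[addr] = value

def local_sum(memory):
    return sum(memory)

if __name__ == "__main__":
    nodes = [PimNode(i, words=1024) for i in range(4)]
    for n in nodes:                     # work moves to the data it touches
        n.send_parcel(store, 0, n.node_id + 1)
    total = sum(n.send_parcel(local_sum).get() for n in nodes)
    print("sum over all nodes:", total)  # 1 + 2 + 3 + 4 = 10
```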
3. INTROSPECTION
The degree of parallelism in a future parallel architecture as well as its computational power and complexity may exceed by orders of magnitude current high performance computing systems. Furthermore, most decisions regarding critical system or application issues such as global resource allocation, fault recovery, security, performance tuning of applications and even aspects of compilation strategies will increasingly be moved to the runtime as a consequence of the fact that the information statically available may be inadequate as a basis for rational solutions of the many interrelated problems to be solved. Such a system requires a much higher degree of autonomy than present-day architectures. Its actions, across all levels of the system hierarchy, must be continuously monitored, and enhanced capabilities must be provided to deal with faults and contingencies in a transparent manner, thereby reducing the necessity of human intervention in the control of the system operation. Here we discuss introspection, one of the enabling technologies contributing to the above goal. It can be defined as a system's ability to
• explore its own properties,
• reason about its internal state, and
• make decisions about appropriate state changes where necessary.
Introspection can be implemented in a variety of ways. An attractive approach decouples introspection from the execution of the application program, communicating with the application only via a thin interface. More specifically, the monitoring of an application can be performed by a collection of asynchronously operating agents which synchronize their actions with the application program only if special situations such as the detection of a faulty component make this necessary. The following discussion will focus on this approach.
4. ASYNCHRONOUS INTROSPECTION VIA AGENT SYSTEMS
4.1. Systems of autonomous agents We introduce an agent as an autonomous computational entity that is able to perceive its environment and to act upon it [ 11 ]. Agents have an internal state; their actions may depend on the environment in which they operate, the actions of other agents, and their internal state which may reflect an agent's experience and history. Autonomy means that agents can control their behavior in accordance with their goals, and act without intervention of humans or other systems. Systems of agents are comprised of multiple agents that interact. Interaction may take the form of cooperation towards a common goal, or competition if different agents pursue different goals. Key characteristics of a multiagent system include (1) individual agents have incomplete information and restricted capabilities, (2) system control is distributed, (3) system data are distributed, and (4) computation is asynchronous. As an example, consider a system of cooperating agents that monitor a set of resources managed by the operating system. This may include tasks such as contributing to deadlock prevention by providing early warnings of potential cyclic dependences, or detecting threads that may encounter exceptional conditions inside a critical section. Such agents may, in simple cases,
perform a local failure correction that solves the problem, or, if they lack sufficient capability to do so, relay their information to higher levels of the system.
4.2. Examples of introspection via agent systems
Apart from the obvious potential of agents for monitoring the hardware components of a complex system, there are many software scenarios in which an asynchronous monitoring of application behavior will represent a natural approach. We will discuss a few of these situations below.
• Invariants: Invariants are logical expressions expressing constraints which are guaranteed to hold either in a certain program region, or, if they are universal, during the whole lifetime of a program execution.
Invariants may be evaluated either periodically or event-driven, triggered by the occurrence of specific events such as the change of a variable on which the invariant depends. They can represent many different kinds of constraints, relating to correctness, performance, resource allocation, security, or workload balancing.
• Validation of User Directives: In some languages, the user may specify directives that affect the semantics of the language and thereby indirectly the way in which the compiler treats the program construct. For example, an HPF [5] program may assert a parallel loop as independent, expressing the fact that the loop does not contain loop-carried dependences. This can be exploited by the compiler to efficiently rearrange the communication and overlap communication with computation. If the programmer has erred in specifying the directive then such transformations are incorrect. The compiler will in general, for example in the presence of subscripted subscripts, not be able to verify the directive. However, verification can be performed at runtime by analyzing all relevant array access patterns and determining the resulting dependences. A system of independent agents can perform such a test without disturbing the execution of the application program or reducing its efficiency.
In a related context where the user does not provide an assertion, the compiler may hypothesize that a loop is independent and create code for that case, with the provision of reversing this decision if falsified by a runtime analysis.
• Information gathering by systems of agents: Agent systems can be used to gather information about certain aspects of the behavior of an application program or the operating system, without interfering with the program. Everything that a background process can do falls into this category; more specific examples include the gathering and processing of performance data (profiling, filtering, and visualization), the detection of performance bottlenecks, the analysis of accesses to a distributed data structure, and the detection of significant statistical irregularities in the access to resources or nonlocal data.
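As a rough illustration of the directive-validation example above, the sketch below checks an "independent" assertion over a recorded trace of the array indices read and written by each loop iteration; in a real system this check would run in an asynchronous agent while the speculatively parallelized loop executes. The function name and trace format are assumptions made for this example only.

```python
from collections import defaultdict

def validate_independent(writes_per_iter, reads_per_iter):
    """Check an 'independent' assertion for a parallel loop.

    writes_per_iter[i] / reads_per_iter[i] hold the array indices written/read
    by iteration i.  The loop is independent iff no index is written by one
    iteration and read or written by a different iteration."""
    writers = defaultdict(set)                 # index -> iterations writing it
    for i, ws in enumerate(writes_per_iter):
        for idx in ws:
            writers[idx].add(i)

    for idx, its in writers.items():
        if len(its) > 1:                       # write-write across iterations
            return False
    for i, rs in enumerate(reads_per_iter):
        for idx in rs:
            for j in writers.get(idx, ()):
                if j != i:                     # read-write across iterations
                    return False
    return True

# Example: each iteration i writes a[i] and reads only a[i] -> independent.
writes = [{i} for i in range(4)]
reads  = [{i} for i in range(4)]
print(validate_independent(writes, reads))            # True

# Counter-example: iteration i writes a[i+1] while iteration i+1 reads a[i+1].
writes_bad = [{i + 1} for i in range(3)]
reads_bad  = [{i} for i in range(3)]
print(validate_independent(writes_bad, reads_bad))     # False
```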
5. A CASE STUDY: A SOCIETY OF AGENTS FOR PERFORMANCE ANALYSIS AND
FEEDBACK-ORIENTED TUNING In Figure 2 we provide an overview of key components and their interaction in an agent-based system supporting automatic performance analysis and tuning. A central component in this scheme is the Program Data Base which is a centralized information repository containing system knowledge as well as knowledge about each individual program subject to compilation and performance tuning. In addition to conventional information characterizing the program, such as the syntax tree, call graph, unit flow graphs, data flow and dependence information, the Program Data Base stores performance information characterizing the dynamic behavior of the program. Performance analysis in this scheme is performed by three agents which operate asynchronously with respect to the application program and to each other. Each of these agents can consist of many threads. For example, the execution of a data parallel application may be subject to a set of invariants specifying conditions to be met by the execution in every PIM node. Such conditions could relate to synchronization delays, the length of work queues, or the time a thread spends in a queue before being processed. The invariant-checking agent may dedicate one lightweight thread to every PIM-node used by the application. If an invariant is not satisfied, an exception is raised in the Performance Exception Handler which has to determine the severity of the case and initiate a corresponding action. Three types of actions with increasing complexity are illustrated by the figure: (1) a modification of program execution without recompilation or re-instrumentation: for example, a replacement of one method by another, more efficient one; (2) re-instrumentation, for example, to refine the observation of a certain section of the program; and (3) recompilation combined with re-instrumentation, for example if a loop nest has to be recompiled for more efficient execution on the architecture.
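A minimal software sketch of the invariant-checking part of this scheme is given below: one checking thread per (simulated) PIM node watches the invariant "work-queue length below a threshold" and reports violations to a central performance exception handler, which here merely logs them. Names such as invariant_agent and exception_handler, and the use of Python threads in place of hardware-managed lightweight threads, are assumptions for illustration only.

```python
import threading
import time
import queue

def invariant_agent(node_id, work_queue, exceptions, stop, max_len=8, period=0.01):
    """One checking thread per PIM node: periodically evaluate the invariant
    'queue length <= max_len' and report violations to the exception handler."""
    while not stop.is_set():
        depth = work_queue.qsize()
        if depth > max_len:
            exceptions.put((node_id, depth))
        stop.wait(period)

def exception_handler(exceptions, stop):
    """Central performance exception handler; a real one could trigger
    re-instrumentation or recompilation of the offending loop nest."""
    while not stop.is_set() or not exceptions.empty():
        try:
            node_id, depth = exceptions.get(timeout=0.05)
            print(f"performance exception: node {node_id} queue depth {depth}")
        except queue.Empty:
            pass

if __name__ == "__main__":
    stop = threading.Event()
    exceptions = queue.Queue()
    queues = [queue.Queue() for _ in range(4)]
    agents = [threading.Thread(target=invariant_agent, args=(i, q, exceptions, stop))
              for i, q in enumerate(queues)]
    handler = threading.Thread(target=exception_handler, args=(exceptions, stop))
    for t in agents + [handler]:
        t.start()
    for _ in range(20):                 # simulated application: node 2 overloads
        queues[2].put("work item")
    time.sleep(0.1)
    stop.set()
    for t in agents + [handler]:
        t.join()
```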
6. CONCLUSION This paper discussed PIM-based architectures, focusing on their capability to create a large number of efficiently manageable lightweight threads. Such threads can be used to perform introspection, using a framework of agents operating asynchronously with respect to the application program to be monitored. This approach was illustrated in a case study dealing with performance analysis and tuning. We expect that in the long term PIM technology will lead to a significant change of the conventional programming paradigm due to the opportunities and challenges associated with handling millions of threads in a system.
ACKNOWLEDGMENTS The research described in this paper was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration.
Figure 2. Case study in performance analysis and tuning (data collection and simplification agents for data reduction and filtering, an invariant-checking agent, and the Performance Exception Handler interacting with the compiler, the instrumenter, the Program Data Base, and the application program).
REFERENCES
[1] G.S. Almasi, C. Cascaval, J.G. Castanos, M. Denneau, W. Donath, M. Eleftheriou, M. Giampapa, H. Ho, D. Lieber, J.E. Moreira, D. Newns, M. Snir, and H.S. Warren, Jr. Demonstrating the Scalability of a Molecular Dynamics Application on a Petaflop Computer. Research Report RC 21965, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, February 2001.
[2] J.B. Brockman, P.M. Kogge, V.W. Freeh, S.K. Kuntz, and T.L. Sterling. Microservers: A New Memory Semantics for Massively Parallel Computing. Proceedings ACM International Conference on Supercomputing (ICS'99), June 1999.
[3] J.B. Brockman, E. Kang, S. Kuntz, and P. Kogge. The Architecture and Implementation of a Microserver-on-a-Chip. Notre Dame CSE Department Technical Report TR02-05.
[4] M. Hall, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, A. Srivastava, W. Athas, V. Freeh, J. Shin, and J. Park. Mapping Irregular Applications to DIVA, a PIM-Based Data Intensive Architecture. Proceedings SC'99, November 1999.
[5] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 2.0, January 1997.
[6] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas. FlexRAM: Toward an Advanced Intelligent Memory System. Proc. International Conf. on Computer Design (ICCD), Austin, Texas, October 1999.
[7] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A Case for Intelligent RAM: IRAM. IEEE Micro, pp. 33-44, April 1997.
[8] T. Sterling and H. Zima. Gilgamesh: A Multithreaded Processor-In-Memory Architecture for Petaflops Computing. Proc. SC2002 - High Performance Networking and Computing, Baltimore, November 2002.
[9] T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. Proc. 19th Int. Symposium on Computer Architecture, Gold Coast, Australia, ACM Press, 1992.
[10] M. Wooldridge. Intelligent Agents. In: G. Weiss (Editor): Multiagent Systems. A Modern Approach to Distributed Artificial Intelligence, pp. 27-78. The MIT Press, 1999.
[11] G. Weiss (Editor). Multiagent Systems. A Modern Approach to Distributed Artificial Intelligence. The MIT Press, 1999.
Time-Transparent Inter-Processor Connection Reconfiguration in Parallel Systems Based on Multiple Crossbar Switches
E. Laskowski^a and M. Tudruj^{a,b}
^a Institute of Computer Science, Polish Academy of Sciences, 01-237 Warsaw, Ordona 21, Poland
{laskowsk, tudruj}@ipipan.waw.pl
^b Polish-Japanese Institute of Information Technology, 02-008 Warsaw, Koszykowa 86, Poland
Look-ahead dynamic inter-processor connection reconfiguration is a new architectural model, which has been proposed to eliminate connection reconfiguration time overheads. In systems based on multiple crossbar switches, inter-processor connections are set in advance in parallel with program execution. An application program is partitioned into sections, which are executed using redundant communication resources, i.e. crossbar switches. Parallel program scheduling in such an environment incorporates a graph partitioning problem. A graph structuring algorithm for look-ahead reconfigurable multi-processor systems is presented. It is based on list scheduling and a new iterative clustering heuristics for graph partitioning. Experimental results are presented which compare the performance of several program execution control strategies for this kind of environment.
1. INTRODUCTION
The aim of this paper is to present a parallel program execution environment based on dynamically reconfigurable connections between processors [5]. This environment can be considered as a new kind of MIMD system with distributed memory and point-to-point communication based on message passing in a reconfigurable inter-processor network. The newly proposed paradigm of inter-processor communication solves the problem of connection reconfiguration time overheads, which exists in current reconfigurable systems. These overheads cannot be completely eliminated by an increase of the speed of the communication/reconfiguration hardware. However, they can be neutralized by a special method applied at the level of system architecture and at the level of program execution control, which is the look-ahead dynamic link connection reconfigurability [5, 3]. It is based on increasing the number of hardware resources used for dynamic link connection setting in the system (multiple crossbar switches) and using them interchangeably in parallel for program execution and run-time look-ahead reconfiguration. A new program execution control strategy is included, in which application programs are partitioned into sections executed by clusters of processors whose mutual connections are prepared in advance. Inter-processor connections are set in advance in crossbar switches in parallel with program execution and remain fixed during section execution. At section boundaries, the relevant processor communication links are switched to the crossbar switch(es) where the connections have been prepared.
Figure 1. Look-ahead reconfigurable system with multiple connections.
In the computer environment presented in this paper, an application program has to be partitioned into sections; thus program structuring consists of task scheduling and program partitioning. An algorithm is presented which determines, at compile time, the program schedule and its partition into sections. It is based on list scheduling and an iterative task clustering heuristics. The optimization criterion is the total execution time of the program partitioned into sections, including the cost of connection reconfiguration. This time is determined by symbolic execution of the program graph. The paper consists of three parts. The first part describes the look-ahead dynamically reconfigurable system and the respective parallel programming model. In the second part the program graph scheduling algorithm is discussed. The last part contains a discussion of the experimental results and a summary.
2. THE LOOK-AHEAD RECONFIGURABLE MULTI-PROCESSOR SYSTEMS
The general scheme of a look-ahead dynamically reconfigurable system based on redundant link connection switches is shown in Fig. 1. It is a multiprocessor system with distributed memory and with communication based on message passing. Worker processor subsystems (Pi) can consist of a single processor or can include a data processor and a communication processor sharing a common memory. The Pi's have a set of communication links connected to the crossbar switches S1 ... SX by the Processor Link Set Switch. This switch is controlled by the Control Subsystem (CS). The switches S1 ... SX are interchangeably used as active and configured communication resources. The CS collects messages on the section execution states in worker processors (link use termination) sent via the Control Communication Path. The simplest implementation of such a path is a bus, but a more efficient solution can assume direct links (data lines) connecting the worker processors with the CS. Depending on the availability of links in the switches S1 ... SX, the CS prepares connections for the execution of the next program sections, in parallel with the current execution of sections. Synchronization of the states of all processors in clusters for the next sections is performed using the hardware Synchronization Path.
Figure 2. Program graph representations: a) an example of a DAG; b) modeling of the scheduled macro-dataflow graph by the APG; c) the Communication Activation Graph.
When all connections for a section are ready and the synchronization has been reached, the CS switches all links of the processors which will execute the section to the look-ahead configured connections in the proper switch. Then, execution of the section in the involved worker processors is enabled via the Control Communication Path. Thus, this method can provide inter-processor connections with almost no delay in program execution time, allowing a time-transparent dynamic link connection reconfiguration.
3. PROGRAM STRUCTURING ALGORITHMS IN THE LOOK-AHEAD CONFIGURABLE ENVIRONMENT
An initial representation of a program is a weighted Directed Acyclic Graph (DAG, see Fig. 2a)), where nodes represent computation tasks and directed edges represent communication that corresponds to data dependencies among nodes. The weight of a node represents a task execution time, the weight of an edge is a communication cost. The graph is static and deterministic. The program is executed according to the macro-dataflow [4] model. A two-phase approach is used to tackle the problem of scheduling and graph partitioning [2]. In the first phase, a list scheduling algorithm is applied to obtain a program schedule with a reduced number of communications and minimized program execution time. In the second phase, the scheduled program graph is partitioned into sections for the look-ahead execution in the assumed environment.
3.1. The program schedule representation
In the look-ahead reconfigurable environment the schedule determines the task and communication mapping onto processors and processor links, the task and communication execution order and the program partitioning into sections (the latter does not exist in systems with on-request reconfiguration [4]). In the presented algorithm, a program with a specified schedule is expressed in terms of the Assigned Program Graph (APG) [2, 3]. The APG assumes the synchronous communication model (CSP-like). There are two kinds of nodes in an APG: the code nodes (shown as rectangles in Fig. 2b)) and communication nodes (circles in Fig. 2b)). Activation edges are shown as vertical lines in Fig. 2b), communication edges as horizontal lines (solid for inter-processor and dashed for intra-processor communications).
B := initial set of sections, each section is composed of a single communication assigned to a crossbar
curr_x := 1   {current number of switches used}
finished := false
While not finished
    Repeat until each vertex of CAG is visited and there is no execution time improvement
        v := vertex of CAG which maximizes the selection function and which is not in the tabu list
        S := set of sections that contain communications of all predecessors of v
        M := Find_sections_for_clustering(v, S)
        If M <> empty Then
            Include into B a new section built of v and the communications from the sections in M
            B := B - M
        Else
            s := section that consists of communication v
            Assign a crossbar switch (from 1..curr_x) to section s
            If reconfiguration introduces time overheads Then
                curr_x := curr_x + 1
                Break Repeat
            EndIf
        EndIf
    EndRepeat
    finished := true
EndWhile
Figure 3. The general scheme of the graph partitioning algorithm.
Using the APG, we model asynchronous, non-blocking communications as in the look-ahead reconfigurable environment. Processor link behavior is modeled as a subgraph (marked as L11 in Fig. 2b)), parallel to the computation path. To enable an easier partitioning analysis we use another program representation, called the Communication Activation Graph (CAG). This graph is composed of nodes, which correspond to inter-processor communication edges of the APG, and of edges, which correspond to activation paths between communication edges of the APG, Fig. 2c). Program sections are defined by identification of subgraphs in the APG or in the CAG. A partition of an exemplary program communication activation graph into sections is shown in Fig. 2c) (communication activation edges which do not belong to any section are denoted by dashed lines).
3.2. Phase I - the scheduling algorithm
The scheduling algorithm is based on the ETF (Earliest Task First) heuristics proposed by Hwang et al. [1]. It differs from the original version of ETF in the communication model assumed: instead of a fixed inter-processor network topology, we use a system with look-ahead dynamically created connections. We take into account a limited number of links and link contention. The modification of ETF consists in new formulae used for evaluating the earliest task starting time (the Ready procedure [1]). In this procedure, link reconfiguration time overheads are minimized by reducing the number of link reconfigurations.
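For illustration, the sketch below implements a plain ETF-style list scheduler on a fully connected machine, selecting at each step the ready task and processor with the earliest possible start time. It deliberately omits the authors' extensions for limited link counts, link contention and reconfiguration overheads, and the data-structure layout is an assumption of the example.

```python
def etf_schedule(tasks, succ, num_procs):
    """Greedy ETF list scheduling on a fully connected machine.
    tasks: {task: weight}; succ: {task: [(child, comm_cost), ...]};
    a communication cost is paid only when producer and consumer run on
    different processors.  Returns {task: (proc, start, finish)}."""
    preds = {t: [] for t in tasks}
    for t, children in succ.items():
        for c, w in children:
            preds[c].append((t, w))

    proc_free = [0.0] * num_procs
    placed = {}                                  # task -> (proc, start, finish)
    ready = {t for t in tasks if not preds[t]}

    while ready:
        best = None                              # (start, task, proc)
        for t in ready:
            for p in range(num_procs):
                data_ready = max(
                    (placed[u][2] + (0 if placed[u][0] == p else w)
                     for u, w in preds[t]), default=0.0)
                start = max(proc_free[p], data_ready)
                if best is None or start < best[0]:
                    best = (start, t, p)
        start, t, p = best
        finish = start + tasks[t]
        placed[t] = (p, start, finish)
        proc_free[p] = finish
        ready.remove(t)
        for c, _ in succ.get(t, []):
            if c not in placed and all(u in placed for u, _ in preds[c]):
                ready.add(c)
    return placed

# Small example DAG: 1 -> {2, 3} -> 4 (node weights and edge comm costs in time units).
tasks = {1: 2, 2: 3, 3: 3, 4: 2}
succ = {1: [(2, 1), (3, 1)], 2: [(4, 1)], 3: [(4, 1)]}
print(etf_schedule(tasks, succ, num_procs=2))
```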
3.3. Phase II - the partitioning algorithm
The algorithm applied in the second phase finds the program graph partitioning into sections and assigns a crossbar switch to each section (see Fig. 3). The heuristics also finds the minimal number of switches which allows program execution without reconfiguration time overheads. At the beginning, the initial partition consists of sections which are assigned to the same crossbar switch, each built of a single communication. In each step, a vertex of the CAG is selected and the algorithm tries to include this vertex into a union of existing sections determined by the edges of the current vertex. The union which gives the shortest program execution time is selected. When section clustering for the current vertex does not give any execution time improvement, the section of the current vertex is left untouched and a crossbar switch is assigned to it. When the algorithm cannot find any crossbar switch for a section that allows it to create connections with no reconfiguration time delay, the current number of switches used (curr_x in Fig. 3) is increased by 1 and the algorithm is restarted.
Figure 4. Modeling the reconfiguration control in an APG.
The vertices can be visited many times. The algorithm stops when all vertices have been visited and there has not been any program execution time improvement in a number of steps. Vertices for visiting are taken in the order of decreasing value of the selection function. The value of this function is computed using the following APG and CAG graph parameters (see [2] for details): a) the critical path CP of the APG partitioned into sections, b) the delay D of the vertices of the CAG, in relation to a fully connected network, c) the value of the critical point of reconfiguration Q function for the investigated vertex, d) the dependency on processor link use between communications. The program execution time is estimated by simulated execution of the partitioned graph in a modeled look-ahead reconfigurable system. An APG graph with a valid partition is extended by subgraphs which model the look-ahead reconfiguration control, Fig. 4. The functioning of the Communication Control Path, the Synchronization Path and the Control Subsystem CS is modeled as subgraphs executed on virtual additional processors. Weights in the graph nodes correspond to the latencies of the respective control actions, such as crossbar switch reconfiguration, bus latency, and similar.
4. EXPERIMENTAL RESULTS
The presented environment was used for the evaluation of the execution efficiency of programs for different program execution control strategies and system parameters. Two families of exemplary program graphs were used during the experiments: test7 (randomly generated, with irregular communication patterns) and test8 (with communication patterns of the FFT algorithm), executed in the look-ahead and in the on-request system, with the following system parameters: number of processors 4, 8, 12; number of processor links 2, 4; synchronization via bus or hardware barrier; reconfiguration time of a single connection tR and section activation time tV (see Fig. 4) in the range 1-100. The program execution speedup as a function of the parameters of the reconfiguration control (tR and tV), on the example of the test8 graph, is shown in Figs. 5 and 6.
Figure 5. Speedup obtained for the test8 program executed in the look-ahead environment (12 processors, 2 proc. links) for different values of the system parameters (tR, tV): a) 2 crossbars; b) 6 crossbars.
Figure 6. Results obtained for the test8 program executed in the look-ahead environment (12 processors, 4 proc. links, 6 crossbars): a) speedup; b) reduction of reconfiguration control overhead.
Figure 7. Reduction of reconfiguration control overhead against the on-request system for the test7 program executed in the look-ahead environment (12 processors, 4 proc. links): a) 2 crossbars; b) 6 crossbars.
For low values of the control overhead tV, the number of crossbar switches has a big influence on the total program execution time. For two crossbars (Fig. 5a)) the speedup of 8.5 is reached only in a narrow area of the lowest values of tV and tR. With an increase of the number of switches, the speedup increases to over 9 in a large area of tV, tR. When tV is bigger, the look-ahead strategy suffers from section activation time overheads and the speedup is reduced. Figs. 5 and 6 also show the speedup for different numbers of processor links. The larger the number of links in a processor, the wider the range of reconfiguration and activation time parameters for which the look-ahead method is successfully applicable, compared with the on-request reconfiguration.
Fig. 6b) and Fig. 7 show the reduction of the reconfiguration control time overhead when look-ahead control is used instead of on-request control. Multiple crossbar switches used with the look-ahead control strongly reduce reconfiguration time overheads. When the reduction is close to 100%, the system behaves for any given program as a system with a fully-connected inter-processor network.
5. CONCLUSIONS
The look-ahead reconfigurable multi-processor system has been presented in the paper. The presented environment is supported by the scheduling and iterative graph structuring algorithms, which allow automatic preparation of parallel programs for execution in this environment. For a program partitioned into sections for which inter-processor connections can be configured transparently in time (connection reconfiguration completely overlaps with program execution including communication), the multi-processor system behaves as a fully connected processor structure. Future work will focus on partitioning algorithm improvements, which could lead to further optimizations in section clustering and in the mapping of resources to communications.
REFERENCES
[1] J-J. Hwang, Y-C. Chow, F. Anger, C-Y. Lee. Scheduling Precedence Graphs in Systems with Interprocessor Communication Times. SIAM J. Comput., Vol. 18, No. 2, 1989.
[2] E. Laskowski. New Program Structuring Heuristics for Multi-Processor Systems with Redundant Communication Resources. Proc. of the PARELEC 2002 Int. Conf., Sept. 2002, Warsaw.
[3] E. Laskowski, M. Tudruj. A Testbed for Parallel Program Execution with Dynamic Look-Ahead Inter-Processor Connections. Proc. of the 3rd Intl. Conference on Parallel Processing and Applied Mathematics PPAM'99, Sept. 1999, Kazimierz Dolny, Poland.
[4] H. El-Rewini, T.G. Lewis, H.H. Ali. Task Scheduling in Parallel and Distributed Systems. Prentice Hall, 1994.
[5] M. Tudruj. Look-Ahead Dynamic Reconfiguration of Link Connections in Multi-Processor Architectures. Parallel Computing '95, Gent, Sept. 1995.
SIMD design to solve partial differential equations
R.W. Schulze^a
^a [email protected], Dresden University of Technology
The relation between autonomous and communication phases determines the throughput of parallel structured information processing systems. This relation depends on the algorithm that is to be implemented and on the technical parameters of the system. As shown by the example of the numerical solution of Laplace's differential equation, the sensitivity of this relation to parameter changes increases significantly as the data transmission time is reduced. An optimal number of processors in a SIMD architecture can be given for the implementation of a Laplace differential equation.
1. INTRODUCTION
The throughput of a parallel structured information processing system is characterized by the assignment of the segmented algorithm to the processors as well as by their communication structure. Massively parallel systems contain a large number of processors and are therefore suitable for the direct implementation of algorithms. This paper is based on the model of the Turing machine as an abstract model of a technical information processing system, which consists of a processor and a storage unit. In a processor, an instruction stream is applied to a data stream. Several data streams can be processed by the same instruction stream when the algorithm to be implemented allows it; the throughput time for a fixed data quantity then decreases. The possibility of processing several data streams concurrently does not, however, change the logical dependences existing between the data streams. They remain and require communication between the processors. Communication, in turn, increases the throughput time for the fixed data quantity. Thus there is an evident interaction between throughput and communication. This interaction is examined on the example of the discrete solution of Laplace's differential equation.
2. TURING MACHINE
Alan M. Turing published in [4] a formalism for the calculation of algorithms which was later named the Turing machine¹. The machine consists of a reading-writing head and a control unit (Turing table). Under the reading-writing head runs a one-sided, endlessly long band. The control unit initiates reading, writing, and moving of the band. The Turing machine is defined by a band alphabet, a position set, an occurrence set and a state set.
¹ http://www.turing-maschine.de/
Band alphabet: \( B = \{b_1, b_2, \ldots, b, \ldots, b_m, b_0\} \) is a set whose indexed elements \( b \) are inscriptions of a band segment, with \( b_0 \) as the empty element.
Position set: \( P = \{p_0, p_1, \ldots, p_{l-1}, p_l, p_{l+1}, \ldots, p, \ldots, p_{z-1}\} \) is a linearly organized set whose indexed elements \( p \) are band positions. Let \( e_{[p]} \in \{P \times B\} \) be the inscription in the band segment at position \( p \). The sequence of band positions is defined by a control information
\[ L/R/H: \; p_l \to p_{l-1/l+1/l} \in P \]
(band movement left/right/stop).
Occurrence set: \( K = \{k_0, k_1, \ldots, k, k', \ldots, \kappa\} \) is an ordered set whose indexed elements \( k \) are occurrence situations.
State set: \( Z = \{z_0, z_1, \ldots, z, \ldots, z_n\} \) is a set whose indexed elements \( z \) are states.
Figure 1. Turing-machine
Let \( {}^{k}e_{[p]} \) be the entry at band position \( p \) and \( {}^{k}Z \) the state in the occurrence situation \( k \). The mapping
\[ \Delta: \; ({}^{k}e_{[p]}, {}^{k}Z) \to ({}^{k'}e_{[p]}, {}^{k'}Z, L/R/H) \]
is split into
\[ \Delta_1: \; ({}^{k}e_{[p]}, {}^{k}Z) \to ({}^{k'}e_{[p]}), \qquad \Delta_2: \; ({}^{k}e_{[p]}, {}^{k}Z) \to ({}^{k'}Z), \qquad \Delta_3: \; ({}^{k}e_{[p]}, {}^{k}Z) \to (L/R/H). \]
Calculation unit, program and system control unit are the fundamental components of a processor. In combination with the main storage they configure the computer kernel (Fig. 2).
459
!
(control signals)
0 ',
control unit CU
from register bank tbr
processing unit PU
instructions addresses
for status information for segment information
from operating sequence control
I
from register bank forcurrent running time opcrands and calculation unit
[
....... !
Figure 2. Computer kernel of processor and main storage 3. S I M D - A R C H I T E C T U R E Innovations of the information processing systems exist in the multiplication of the processors as well as in the room/time splitting of the storage. According to Flynn an elementary classification takes the multiplex of the data- and the instruction streams as a basis. For parallel structured systems are of importance 9 SIMD-architectures
(Single Instruction Multiple Data)
- vector-computers - field-computers 9 MIMD-architectures
(Multiple Instruction Multiple Data)
In the SIMD-architecture one and the same instruction is implemented over several data sets. They contain both the primarily allocated operand quantity x and results of instructions implemented before[ 1, 2].
Field computers Through a processor field data streams different from each other are driven, over which one and the same instruction need to be implemented [3]. A scalar unit assigns each processor one and the same instruction; each processor has an access to the main storage over the connecting network for obtaining the corresponding data sets. But in principle a local storage can also be allocated to each processor for gathering the data sets provided for it. In this case the communication between the processors takes place on the basis of messages [2].
Difference method to solve the Laplace's differential equation In a field processor system with n processors is to solve
460 02/Z
Laplace's differential equation
02U
0 = ~-~x2-+- ~Oy 2
(1)
with u = u(x, y) and given edge value function R(x, y) in a continuous integration area G. A system of processors is a discretely structured system. As a result the continuous integration area G needs to be converted into a discrete integration area G , . The integration area G , corresponding to 151 should be an orthogonal structured grid net with M x N equidistant grid points P, indexed with p c { 1 , . . . , M} regarding x and q E { 1 , . . . , N} regarding y. Orthogonal point distances are Ax = h and Ay = k. With respect to Laplace's differential equation is defined G-- 1,q -- 2Up,q -+- Up+ 1,q gp q_l - 2Up q + G,q+l __ 0 + ' ' 2h 2k
(2)
that means
Up,q =
1
2 [h +
q]
[k (Up+l,q 4-Up_l,q) + h (Up,q+l --~ Up,q_1)]
(3)
as approximate solution for u(x, y) for x = p- Ax and y = q. Ay. Equation 3 describes a local operator, which is led over G , . For the implementation of 1 the discrete integration area G , with M x N grid points is divided into identically large segments from respectively ~. ~ grid points (fig. 3). Applied should be 0 = M m o d ~ and 0 - N m o d r For the calculation of the function values U embedded in G , exactly one processor of the system is allocated to each segment. In an autonomous phase of a processor the function values are calculated iteratively; in a following communication phase adjacent segments with a permeate depth A exchange information. Altogether p iterations should be necessary, so that each function value in G , falls below a predetermined error limit. The operating method of the field processor system provides for the fact that in one and the same iteration level all processors complete their autonomous phases (Tauto) to each other in the time shadow, followed by the communication phases (rkomm). For the generation of a function value a processor needs the time tb. Let be c preparation time of a communication phase for a data transmission from one segment to an adjacent segment and q transfer time for a date. From one processor to another a data will be transferred respectively. It applies for M = x~ and N = y~b at a ratio of
_ . x__= ~ y ~
Htb + xyc# + 2v/-H 9[x + y] Aq# (4) T(x, y) - Tauto(X, Y) + tko,~m(x, y) = # ~ xy
with H = M N with
and
Htb Tauto = Tauto(X, Y ) = # ~ = #~r xy Tkomm = Tkomm(X, Y) = xy [c + a (~, r A) q] # under a(~, r A) ~ 2[~ + r
as generation time for altogether H function values in G , (fig. 4).
(4a) (4b)
461
1,1
q-1
q
q+l
1,N
!
[
I I
I i
, I
i I
I I
I I
I i
6 i
i+++~+ii+{+i!+{+ii++{+{~+i+++++~++++++ i{+ ++++++++++i +i++++++++++!i + ++++ i+i++++:++i+l++ii++{++++++: +++++++++++++: ++{++++, :+:++++++{+++++ ++ ++i++++~ii++++ii:.+i~i++++i ++++++{+~i ++{! +i~+i++~+++++++~ii++ii++ii++i{+++++i ++++++++i {+++++++++++++ +++++ii+{i+{++ii++i++++++++++++i ++++++++++{+++++++: +{+{+++i+++++++ +++++
p-1
+{++++++++++ii++++ii++++i ++++i +++i ++i++++i{++{{++{i+i{++i++++i++i++++++#++++++i +++N+i~i+ i++ i+:++++ i +++++{+++++i+++++++i+++++++{+{++i ++++++ i++i+++++++i {+N +i++++i++++{+++++i++++++++~i ++i+i+i+N+++++i +i+++++++{++++ + li~+ii+++i+++{+i++ :+' ++++++++++{++ + +++++++++ +++i+i++g +++ +++++++i ++++++ ++++++i ++++{+++++ ++++++++++ ++++~: +++!+ +++++i++++++{+++++++++++: ++++++++++i++L++++~ +N+++++ +~++++++++++++++++++++
...................
_
.
.
.=, =.--.""" . . . . . . .
p+l+ I
N~+++
.
.
..,. 9-- 9.. ....
.
.
.
.....
.
.
~ + +
N-----.-,~
i+i++++++i++i+i++i++i+++i++
+~ii~+i+i+i+++i++ii+i+ii+i+i+i+
movement
direction
. ++++++++~++i+++++i ++++i++i+.+i+i+++ +i++.{+++ . . . . ++i++++i#Ni+i++i+i++++ ++++i+ '1+!i]+i+++++{+++++++++++++++++{+++++i+ii+ +~+N++++ ~ ~ ~ .. ++N++++{N+++{++i+++i+++++++i+{+++{+{+++++++
+I+++++++++++++++++! ++++i+++++++i +#+ {i+++ +++:+++++++++++++++
~V~+++++I+++++++++++++ +++++i+++++++
: :+.............
++++++++++++++++++ ++++++++++++::+++++i+++++++++ ocal o erator I<:++++:+<++++ +++++++++++++++i++++++++++++++++++ ++++i++i++++++| +++++i +++++::+++++++++++++++{+~+i ............... +i{}++++++++ ++++++++++++++++
9 permeate
+pth
I _~
l+i+?++++++++i +++i++i++iii++
I+i++++++{+;+{+++++++++++++++++ ++++++++++++:++++++++++++++++++++++{ ++,, +++++{+++:.++++++i+++++++++++++++++: ', +.++{++q , } +,+ +++++++]++{:{;{++++{+++++++++++{+{{+{+{++++i+++i+ , , , +++ +++++++++++++++++l , , ' +++++++++ ++++++++++ +++++++++++++++++1 ++++++++++++++ ++++++i+++{+++++ ++++++{+++++ +++++++++++}+++++++++ +++++++++++++++++++++++++++++:]
, ~
, -
+++++++++++++++++++++++++++++++++:++++++++++:++:+++[++++discrete ::::::: :::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
9
M,1
I
+++I I
a "
integration area G* +
:.......................................................................................................................
9
!
!
I +
I ,
I /
L
M,N
Figure 3. Discrete integration area G*, consisting of segments, for the approximate solution of Laplace's differential equation.
In T(x, y) neither variable is preferred over the other, so x ≈ y is justified, therefore x + y → 2x and x·y → x²:
\[ T(x,y) \to T(x) = \mu\,\frac{H\,t_b}{x^2} + x^2 c\mu + 4\sqrt{H}\,x\,\Delta q\mu \qquad (5) \]
Let x be an element of the set of real numbers. From \( \frac{d}{dx}T(x) = 0 \) the minimum of T(x) results at the point x = η = x_min, with
\[ f(\eta) = \eta^4 + A\eta^3 - B = 0, \qquad A = \frac{2\sqrt{H}\,\Delta q}{c}, \quad B = \frac{H\,t_b}{c}, \]
since \( \frac{d}{dx}T(x) = \frac{2\mu c}{x^3}\,f(x) \). The root is obtained as
\[ \eta = \lim_{j\to\infty} \eta_j \quad \text{for} \quad \eta_j = \sqrt[3]{\frac{B}{\eta_{j-1} + A}} \qquad (6) \]
as solution of f(η) = 0 with η₀ as start value. Since T(x) assumes a minimum at the point x = η (= x_min), T(x) ≥ T(x_min) applies. The start value is η₀ = 2, because the system should consist of at least 2 processors.
Example: H = 256 × 256; t_b = 7.5 µs, c = 50 µs, q = 1.1 µs resp. 0.55 µs; µ = 10; Δ = 2
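As a quick numerical cross-check of equation (5) under these example parameters, the following few lines evaluate T(x) and search for the best integer x; the values reproduce the orders of magnitude quoted in the following discussion (about 400 ms for a 4 × 4 field at q = 1.1 µs, and roughly 200 ms at the optimum x = 8 for q = 0.55 µs). The script is an illustration only.

```python
import math

# Parameters of the example: H grid points, t_b per function value, c set-up
# time, q transfer time per datum, mu iterations, permeate depth Delta
# (all times in seconds).
H, t_b, c, mu, Delta = 256 * 256, 7.5e-6, 50e-6, 10, 2

def T(x, q):
    """Generation time of equation (5) for an x-by-x processor field."""
    return mu * H * t_b / x**2 + x**2 * c * mu + 4 * math.sqrt(H) * x * Delta * q * mu

print(f"4 x 4 field at q = 1.10 us: T = {T(4, 1.1e-6)*1e3:.0f} ms")
for q in (1.1e-6, 0.55e-6):
    xmin = min(range(2, 33), key=lambda x: T(x, q))
    print(f"q = {q*1e6:.2f} us: x_min = {xmin}, T = {T(xmin, q)*1e3:.0f} ms, "
          f"tau_auto = {mu * H * t_b / xmin**2 * 1e3:.0f} ms")
```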
Figure 4. Discrete integration area G*, consisting of segments, for the approximate solution of Laplace's differential equation 0 = Δu(x, y) through a difference equation. Each segment in G* consists of ξψ grid points (Fig. 3) and is allocated to one of altogether n processing units of the field processor system. The approximate values u_{p,q} emerge in the catchment area of a local operator, which is led over a segment in an autonomous phase in the indicated movement direction. Adjacent segments exchange information with a given permeate depth Δ in a communication phase. It applies 0 ≡ M mod ξ and 0 ≡ N mod ψ.
Figure 5 shows the development of the generation time T(x) for all function values in G* on the basis of this parameter set over x = 3, ..., 17 processors. In a system of 4 × 4 processors, that is x = 4 in the horizontal as well as in the vertical direction, T ≈ 400 ms (point A) with an assumed transfer time of q = 1.1 µs. Halving it to q = 0.55 µs reduces T(x) to approximately 360 ms without influence on the autonomous phases τ_auto(T(x)) (points A', B'). The duration of the communication phases (point B'') is also relatively small. Up to this point, the reduction of the data transmission time has remained without considerable influence on the generation time T(x) at a fixed number of processing units. If the number of processors is adjusted to the data transmission time, T(x) is reduced. From equation (6), x_min = 8 results as the adjusted number of processors with q = 0.55 µs. In a system of 8 × 8 processors, T(x = x_min) = 200 ms applies (line D, D') with the given parameter set, comprising the
autonomous time τ_auto(T(x)) = 75 ms of a processor (lines D', D'') and the communication time τ_komm(τ_auto) = 125 ms (lines D'', D''') between all processors, in consideration of µ = 10. As shown, the generation time T(x) of the function values in G* can be minimized by adjusting the system parameters.
Figure 5. Development of the generation time T(x), split into the autonomous time τ_auto(T(x)) of a segment and the communication time τ_komm(τ_auto) between all segments, for the parameter set H = 256 × 256, t_b = 7.5 µs, µ = 10, c = 50 µs, and Δ = 2. The transfer time varies between q = 1.1 µs and q = 0.55 µs.
REFERENCES
[1] K.N. Dharani Kumar, K.M.M. Prabhu, P.C.W. Sommen: A 6.7 kbps vector sum excited linear prediction on TMS320C54X digital signal processor. Microprocessors and Microsystems 26 (2001), pp. 27-38.
[2] Müller-Schloer, E. Schmitter: RISC-Workstation-Architekturen. Springer-Verlag, Berlin Heidelberg New York, 1991.
[3] R.W. Schulze: Neuronale Topologiesynthese für Massiv Parallele Systeme. Verlag der Wissenschaften Peter Lang, 2003.
[4] Alan M. Turing: On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society 2 (1937), p. 42.
Caches
Trade-offs for Skewed-Associative Caches
H. Vandierendonck^a* and K. De Bosschere^a†
^a Dept. of Electronics and Information Systems, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium.
* Hans Vandierendonck is sponsored by the Flemish Institute for the Promotion of Scientific-Technological Research in the Industry (IWT). He can be reached at [email protected].
† Koen De Bosschere can be reached at [email protected].
The skewed-associative cache achieves low miss rates with limited associativity by using inter-bank dispersion, the ability to disperse blocks over many sets in one bank if they map to the same set in another bank. This paper formally defines the degree of inter-bank dispersion and argues that high inter-bank dispersion conflicts with common micro-architectural designs in which the skewed-associative cache is embedded. Various trade-offs for the skewed-associative cache are analyzed and a skewed-associative cache organization with reduced complexity and a near-optimal cache miss rate is proposed.
1. INTRODUCTION
Cache memories hide the rapidly increasing memory latency by storing the most recently accessed data on-chip. Most implementations of caches are organized as a direct mapped or set-associative structure. In such an organization, every block of data may be stored in only a few locations of the cache. These locations form one set of the cache, hence the name "set-associative". The benefit of this approach is a fast cache access time, but the down-side is that many cache misses occur when many blocks are accessed that all map to the same set. These misses are called conflict misses and result from the set-associative cache organization. These misses occur frequently in scientific workloads and can significantly deteriorate performance [1, 2]. It was shown that the skewed-associative cache can remove these misses and improve the performance predictability [3].
The skewed-associative cache is an organization that combines a fast cache access time with few conflict misses [3, 4, 5]. A 2-way skewed-associative cache has a miss rate comparable to a 4-way set-associative cache [4]. An n-way skewed-associative cache has n direct mapped banks. Each bank is indexed using a different hash function. When many blocks map to the same set in one bank, then it is very unlikely that all of them map to the same set in all banks. The ability to spread blocks that map to the same set in one bank over multiple sets in the other banks is called inter-bank dispersion [4]. Ideally, inter-bank dispersion should be maximum, i.e., it should be possible to spread blocks over all sets in another bank. This paper shows that programs typically do not require such maximum inter-bank dispersion, but that a moderate amount of inter-bank dispersion suffices to
remove nearly all conflict misses. Furthermore, we give some reasons why a moderate amount of inter-bank dispersion is desirable, from a micro-architectural point of view.
The remainder of this paper is organized as follows. In section 2, we present mathematical models of hash functions and formally define the degree of inter-bank dispersion. Then we describe in section 3 how various parameters of the skewed-associative cache affect the complexity of a skewed-associative cache organization. Section 4 presents a technique to construct conflict-avoiding hash functions with a pre-specified degree of inter-bank dispersion, number of hashed address bits and inputs per XOR. Using this technique, we evaluate trade-offs for the skewed-associative cache and propose a near-optimal configuration in section 5. Section 6 concludes this paper.

2. MATHEMATICAL TREATMENT

This section describes a mathematical model of a hash function and uses it to define the degree of inter-bank dispersion. These definitions are treated in detail in [6].
2.1. Hash functions
We represent an n-bit block address a by a bit vector [a_{n-1} a_{n-2} ... a_0], with a_{n-1} the most significant bit and a_0 the least significant bit. A hash function mapping n to m bits is represented as a binary matrix H with n rows and m columns. The bit on row r and column c is 1 when address bit a_r is an input to the XOR computing the c-th set index bit. Consequently, the computation of the set index s can be expressed as the vector-matrix multiplication over GF(2), denoted by s = aH. GF(2) is the domain {0, 1} where addition is computed as XOR and multiplication is computed as logical and. Every function in the design space of XOR-based set index functions can be represented by the null space of its matrix [7]. The null space N(H) of a matrix H is the set of all vectors that are mapped to the zero vector:

N(H) = { x ∈ {0, 1}^n | xH = 0 }.

The null space is a vector space with dimensionality dim N(H) = n - m. Conflict misses occur when xH = yH or, equivalently, (x ⊕ y) ∈ N(H), by noting that the XOR is its own inverse.
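To make the matrix representation concrete, the following small Python sketch (our own illustration, not code from the paper) evaluates an XOR-based set index function given as column bit-masks and checks the null-space conflict condition; the example function and addresses are purely hypothetical.

# Minimal sketch (not from the paper): an XOR-based set index function given
# as m column bit-masks over the n address bits; column c lists the address
# bits feeding the XOR that produces set-index bit c.

def set_index(address, columns):
    """Compute s = a H over GF(2): bit c of the index is the parity (XOR)
    of the address bits selected by columns[c]."""
    s = 0
    for c, mask in enumerate(columns):
        parity = bin(address & mask).count("1") & 1
        s |= parity << c
    return s

# Hypothetical 2-bit index over 4 address bits b3..b0:
#   index bit 0 = b0 xor b2,  index bit 1 = b1 xor b3
H = [0b0101, 0b1010]

a, b = 0b0000, 0b0101
print(set_index(a, H), set_index(b, H))  # both addresses map to set 0
# a and b conflict exactly when their XOR lies in the null space N(H):
print(set_index(a ^ b, H) == 0)          # True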
2.2. Set refinement and the lattice of hash functions
Set refinement is a relation between two set-associative caches. The relation holds when all addresses that map to the same set in one cache also map to the same set in the other cache [8]. This is expressed for XOR-based hash functions using null spaces as follows:

Definition 1. For matrices H and G, H refines G (H ⊑ G) iff N(H) ⊆ N(G).

E.g., if H maps two addresses x and y to the same set, then (x ⊕ y) ∈ N(H). If set refinement holds, then (x ⊕ y) ∈ N(G) as well and, by considering all pairs of addresses, it follows that N(H) ⊆ N(G).
The set of hash functions is a lattice where the functions are ordered by the set refinement relation. The lattice has a smallest element, namely the function that maps every address in main memory to itself, and it has a largest element, namely the function that maps all addresses to the same set (i.e., the one used in a fully-associative cache).
a) Illustration of a lattice. b) Illustration of H_sup and inter-bank dispersion.
Figure 1. Illustration of the lattice of hash functions and an example.

Every function is a refinement of the largest element and is itself refined by the smallest element. The set refinement relation does not order every pair of functions. This happens in particular for the functions used in a skewed-associative cache. The situation can be understood from a fictitious lattice (Figure 1a). Each node in the lattice corresponds to a function and the arrows show where the set refinement relation holds. For every function, there is a path from the smallest element to the largest element, passing through that function. The functions labeled G and H are not directly comparable to each other, as there is no path that passes through both G and H. We can however express their similarity by quantifying how much the paths from smallest to largest element for G and H overlap. In the graph, the paths diverge at the hash function I and converge again at S. These functions are the infimum (greatest lower bound) and the supremum (least upper bound), respectively. Two hash functions are equal when their supremum equals their infimum (i.e., the paths from smallest to largest element do not diverge at all). The functions are as different as possible when the infimum equals the smallest element and/or the supremum equals the largest element.
2.3. Inter-bank dispersion
The supremum hash function and its relation to inter-bank dispersion is illustrated for two set index functions, taken from a family of functions defined in [3]. The index functions H1 and H2 are defined by

[b3, b2, b1, b0] H1 = [b1 ⊕ b3, b0 ⊕ b2]
[b3, b2, b1, b0] H2 = [b0 ⊕ b3, b1 ⊕ b2]
Every address in main memory is mapped to a set in bank 1 by H1 and to a set in bank 2 by H2. These mappings are illustrated in a 2-dimensional plot (Figure 1b). Each axis is labeled with the possible set indices for that bank. Every address is displayed in the grid in a position that corresponds to its set indices in each bank. The part x in the address bears no relevance to the value of the index functions. Both the addresses x0000 and x0101 map to set 00 in bank 1. The addresses are dispersed in bank 2: x0000 maps to set 00 and x0101 maps to set 11. Inter-bank dispersion is limited to 2 of the 4 sets: there are no blocks that map to set 00 in bank 1 and either set 01 or 10 in bank 2. This is a consequence of the similarity of the functions H1 and H2 and it is described
mathematically by the supremum sup(H1, H2). The supremum is an imaginary hash function that places the addresses in imaginary sets. In the example, the supremum maps addresses to one of two sets and is defined by:

[b3, b2, b1, b0] sup(H1, H2) = [b0 ⊕ b1 ⊕ b2 ⊕ b3]
i.e., all addresses with an even number of ones are mapped to one set and those with an odd number of ones are mapped to another set. The supremum is, by definition, refined by both H1 and H2. This is shown graphically in Figure 1b). Set 0 of the supremum corresponds to the upper left-hand square. When refined by H1, it falls apart into sets 00 and 11 in bank 1. When it is refined by H2, it splits into sets 01 and 10 in bank 2. Inter-bank dispersion is always limited to one set of the supremum function. We define the degree of inter-bank dispersion as the 2-logarithm of the number of sets in a bank that have their addresses mapped to the same set of the supremum. It is limited to the range 0 to m.

Definition 2. The degree of inter-bank dispersion (IBD) equals dim N(sup(H1, H2)) - dim N(H1).
It is assumed here that H1 and H2 have the same dimensions. For a-way skewed-associative caches (a > 2), inter-bank dispersion is defined only for every pair of banks [6].
Using the above definition, one can prove that inter-bank dispersion is maximal (IBD = m) if and only if dim N(sup(H1, H2)) = n, i.e., the supremum has only one set. The definition also implies an upper bound on inter-bank dispersion: IBD ≤ n - m, since dim N(sup(H1, H2)) ≤ dim N(H1) + dim N(H2) and dim N(H1) = dim N(H2) = n - m. Inter-bank dispersion is thus limited by the number of hashed address bits n. If n is too small, then the functions H1 and H2 will have to be similar to some extent. Maximum inter-bank dispersion can be reached only if n ≥ 2m.
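These definitions can be checked mechanically. The Python sketch below is our own illustration, not code from the paper; it assumes, as the lattice construction implies, that N(sup(H1, H2)) is the vector-space sum N(H1) + N(H2), and it reproduces the worked example of section 2.3.

# Sketch (not from the paper): compute the degree of inter-bank dispersion
# IBD = dim N(sup(H1,H2)) - dim N(H1), assuming N(sup) = N(H1) + N(H2).
# Hash functions are given as lists of n-bit column masks.

def rank_gf2(vectors):
    """Rank of a set of bit-vectors over GF(2) (Gaussian elimination)."""
    basis = {}
    for v in vectors:
        while v:
            p = v.bit_length() - 1      # position of the highest set bit
            if p not in basis:
                basis[p] = v
                break
            v ^= basis[p]
    return len(basis)

def ibd(h1_cols, h2_cols, n):
    dim_n1 = n - rank_gf2(h1_cols)                 # dim N(H1)
    dim_n2 = n - rank_gf2(h2_cols)                 # dim N(H2)
    dim_isect = n - rank_gf2(h1_cols + h2_cols)    # dim of N(H1) intersect N(H2)
    dim_sup = dim_n1 + dim_n2 - dim_isect          # dim of N(H1) + N(H2)
    return dim_sup - dim_n1

# The example functions of section 2.3 (n = 4, m = 2):
H1 = [0b1010, 0b0101]   # columns b1^b3 and b0^b2
H2 = [0b1001, 0b0110]   # columns b0^b3 and b1^b2
print(ibd(H1, H2, 4))   # -> 1: blocks disperse over 2^1 = 2 sets per bank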
3. MOTIVATION FOR LIMITED INTER-BANK DISPERSION

Ideally, a skewed-associative cache should always have as much inter-bank dispersion as possible, because this minimizes conflict misses. There are, however, situations where putting a limitation on inter-bank dispersion is called for.
Most processors access the level-1 cache and the TLB in parallel. Therefore, the hash functions operate on the virtual address. In order to avoid aliases in the cache, it is necessary that only untranslated address bits (i.e., bits in the page offset) are hashed. Therefore, the page size places a limitation on the number of available address bits n, which in turn limits the inter-bank dispersion (IBD ≤ n - m).
In some processors the computation of the address partially overlaps the cache access. The cache access starts as soon as the set index bits are computed and the remaining address bits are computed after the cache access starts. By hashing, the number of required address bits rises from m (bit selection) to n. Consequently, the cache delay increases in this type of design.

4. CONSTRUCTING HASH FUNCTIONS

Conflict-avoiding hash functions with pre-specified parameters are constructed from profiling information. The method borrows heavily from the method to construct hash functions for direct mapped caches [7, 9].
4.1. Direct mapped caches
The estimated number of conflict misses, called the score henceforth, for a hash function is decomposed into a cost for each vector in the null space:

score(H) = Σ_{v ∈ N(H)} cost(v).

We compute cost(v) as an estimate of the number of conflict misses caused by v, if that vector were a member of the null space of a hash function. The estimates cost(v) for each vector v are computed during a profiling run of the program, which is explained in detail in [7, 9].

4.2. Skewed-associative caches
We estimate the number of conflict misses in a skewed-associative cache as

score_skew(H1, H2) = 1/2 (score(H1) + score(H2) + score(inf(H1, H2))).
This formula balances the number of conflict misses in each bank (i.e., it tries to minimize the need for inter-bank dispersion) and increments this with the conflicts for the infimum function. Blocks that map to the same set of the infimum function will be mapped to the same set in every bank. Hence, these misses cannot be avoided using inter-bank dispersion and add to the total cost of the hash functions.
We construct a pair of hash functions that minimizes the estimated number of conflict misses for a given benchmark suite by randomly generating 10 million pairs of hash functions as follows. For m columns, n rows and inter-bank dispersion equal to IBD, the two hash functions have m - IBD columns in common. The remaining IBD columns of the hash functions also have to be linearly independent, so a total of m - IBD + 2·IBD linearly independent columns is needed. For each column, an n-bit random value is generated. If this value is linearly dependent on the previously generated columns, then it is skipped and a new n-bit random value is generated until it succeeds. The constraint on the number of inputs per XOR is enforced by requiring that all randomly generated columns have at most the specified number of 1-bits. A sketch of this generation procedure is given below.
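The following Python fragment is our own rough sketch of this procedure, not the authors' code; the parameter values in the example call are simply the ones proposed later in the paper (n = 11, m = 7, IBD = 3, at most 2 inputs per XOR).

# Sketch only: generate one random pair of XOR-based index functions with
# n hashed address bits, m index bits, a chosen IBD and at most max_ones
# inputs per XOR.  Columns are n-bit masks; the pair shares m - IBD columns.
import random

def rank_gf2(vectors):
    """Rank of a set of bit-vectors over GF(2)."""
    basis = {}
    for v in vectors:
        while v:
            p = v.bit_length() - 1
            if p not in basis:
                basis[p] = v
                break
            v ^= basis[p]
    return len(basis)

def random_column(n, max_ones):
    """Random n-bit column with between 1 and max_ones set bits."""
    k = random.randint(1, max_ones)
    return sum(1 << b for b in random.sample(range(n), k))

def generate_pair(n, m, ibd, max_ones):
    cols = []
    while len(cols) < (m - ibd) + 2 * ibd:          # m + ibd columns in total
        c = random_column(n, max_ones)
        if rank_gf2(cols + [c]) == len(cols) + 1:   # keep only if independent
            cols.append(c)
    shared = cols[:m - ibd]
    h1 = shared + cols[m - ibd:m]                   # m columns each
    h2 = shared + cols[m:]
    return h1, h2

h1, h2 = generate_pair(n=11, m=7, ibd=3, max_ones=2)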
5. EVALUATION

The evaluation uses a simulation-based methodology. The SPEC95 benchmarks are simulated using the SimpleScalar tool set. The benchmarks are compiled for the Alpha instruction set using the vendor supplied compiler with optimization flags -fast -O4 -arch ev6. In order to match a realistic situation, the profiling is performed by running the benchmarks with the training inputs. The evaluation is performed using the reference inputs. The simulation effort is restricted to representative slices of 500 million instructions per benchmark. The evaluation is performed for an 8-KB data cache with 32-byte blocks. The cache is 2-way skewed-associative, hence there are 128 blocks in each bank and the hash functions compute 7 set index bits.

5.1. Number of address bits
We analyze the influence of the number of hashed address bits (n) on the miss rate. Inter-bank dispersion is maximum, i.e., IBD = n - m. We construct hash functions that minimize the conflict misses for all applications simultaneously.

a) The miss rate for varying n and IBD. b) The miss rate as a function of the degree of inter-bank dispersion for n = 14 hashed address bits.
Figure 2. The impact of n and IBD on the miss rate.
Figure 3. The miss rates for all benchmarks.
The functions index into an 8-KB 2-way skewed-associative cache with 32-byte blocks (m = 7) and have a different number of hashed address bits n. For each value of n, inter-bank dispersion is maximum, i.e., IBD = n - m. Furthermore, we simulate true LRU replacement in each cache.
Neither maximum inter-bank dispersion nor a high number of hashed address bits is required (Figure 2a). The average miss rate is minimal at 7.9%, while at n = 10, a near-optimal miss rate of 8.1% is already obtained. For this configuration, the best degree of inter-bank dispersion that can be obtained is IBD = 3.

5.2. Inter-bank dispersion
In this experiment, we hash n = 14 address bits. Pairs of hash functions are generated that have a degree of inter-bank dispersion that varies from 0 to 7. Figure 2b) also shows the miss rate in a conventional 2-way set-associative cache (2WSA). The miss rate is almost not affected by the inter-bank dispersion, and decreases only slowly as the IBD is increased. Furthermore, when the IBD is zero, the skewed-associative cache is really a hashed 2-way set-associative cache (in contrast to the conventional 2-way set-associative cache that uses bit selection mapping). The miss rate of this cache is only slightly higher than the miss rate of the skewed-associative caches. E.g., for IBD = 7, the average miss rate is 7.6%, while for IBD = 0, the miss rate is 8.2%. This indicates that the 2-way skewed-associative cache gets
most of its miss rate reduction by using hash functions. The skewed-associative character itself provides a much smaller reduction.
Figure 3 shows the miss rates of the 2-way set-associative cache, the hashed 2-way set-associative cache with IBD = 0 and the skewed-associative cache with IBD = 7 for each benchmark. We observe different trends for the floating-point and the integer benchmarks. For the floating-point benchmarks, adding hashing to a 2-way set-associative cache has a larger effect on the miss rate than adding skewed-associativity. E.g., for swim, the miss rate decreases from 52% to 8.1% by adding hashing to the set-associative cache. By adding skewing with maximum inter-bank dispersion, the miss rate drops only to 7.9%. There are some exceptions to this rule. For applu, the best hash function is, in fact, bit selection mapping. Therefore, the hash function that we generated introduces more cache misses than the non-hashed cache. By adding skewed-associativity, this negative effect can be countered. A similar explanation holds for apsi, hydro2d and mgrid.
The trend for the integer benchmarks is different. Here, skewed-associativity has a larger influence on the miss rate than hashing. The integer benchmarks have less regular memory access patterns than the floating-point benchmarks, therefore the hash functions (which are static and remain the same during the execution of the program) cannot easily map these patterns in a conflict-free manner in the data cache. Skewed-associativity has the advantage that it can dynamically move data around in the cache using a process called data reorganization [3]. This process is useful only when the hash functions cannot already remove most conflict misses.
The conclusion of this experiment is that a better performance/complexity trade-off is made by having little inter-bank dispersion if the intended system will mostly run high-performance applications that typically match the behavior of the floating-point benchmarks. If, however, the system will be used for general-purpose computation and will run programs that are more similar to the integer benchmarks, then it is worthwhile to implement a skewed-associative cache with maximum inter-bank dispersion.
5.3. Complexity of functions
Finally, we consider the effect of the complexity of the hash functions, which is measured by the number of inputs to each XOR. The number of inputs was unlimited above, but now we limit it to 2. For the floating-point benchmarks, these functions have a slightly, but not significantly, higher miss rate. For the purpose of comparison, the functions proposed in [5] (4 inputs) and those in [3] (2 inputs) are evaluated. Our 4-input functions perform slightly better than the published functions (10.2% vs. 10.5%), while the 2-input functions perform clearly better: 10.4% vs. 12.2%. This validates our approach for generating hash functions and shows that the generated functions are comparable to or better than the best known functions.

6. CONCLUSION

This paper studies skewed-associative caches and their prominent feature: inter-bank dispersion. We formally define the degree of inter-bank dispersion in a skewed-associative cache and describe the conditions required for maximum inter-bank dispersion.
We demonstrate that applications do not require a high degree of inter-bank dispersion, but a small amount of it is able to remove most misses. We examine trade-offs of the skewed-associative cache with respect to the degree of inter-bank dispersion, the number of hashed
address bits and the complexity of the hash functions. A skewed-associative cache with a near-optimal complexity/performance trade-off is proposed: The optimized 8-KB two-way skewed-associative cache maps 11 address bits onto 7 set index bits and has a degree of inter-bank dispersion equal to 3.
There are two components to the miss rate reduction of the two-way skewed-associative cache, namely hashing and inter-bank dispersion. Hashing by itself removes almost all conflict misses in the SPEC95 floating-point benchmarks, while skewing provides little benefit. For the integer benchmarks, hashing and skewing contribute about evenly.

REFERENCES
[1] A. González, M. Valero, N. Topham, and J. Parcerisa, "Eliminating cache conflict misses through XOR-based placement functions," in ICS'97: Proceedings of the 1997 International Conference on Supercomputing, pp. 76-83, July 1997.
[2] H. Vandierendonck and K. De Bosschere, "Evaluation of the performance of polynomial set index functions," in WDDD: Workshop on Duplicating, Deconstructing and Debunking, held in conjunction with the 29th International Symposium on Computer Architecture (ISCA-29), pp. 31-41, May 2002.
[3] F. Bodin and A. Seznec, "Skewed associativity enhances performance predictability," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 265-274, June 1995.
[4] A. Seznec, "A case for two-way skewed associative caches," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 169-178, May 1993.
[5] A. Seznec and F. Bodin, "Skewed-associative caches," in PARLE'93: Parallel Architectures and Programming Languages Europe, (Munich, Germany), pp. 305-316, June 1993.
[6] H. Vandierendonck and K. De Bosschere, "On null spaces and their application to model randomisation and interleaving in cache memories," Tech. Rep. DG02-02, Ghent University, Dept. of Electronics and Information Systems, July 2002.
[7] H. Vandierendonck and K. De Bosschere, "Efficient profile-based evaluation of randomising set index functions for cache memories," in 2nd International Symposium on Performance Analysis of Systems and Software, pp. 120-127, Nov. 2001.
[8] M. D. Hill and A. J. Smith, "Evaluating associativity in CPU caches," IEEE Transactions on Computers, vol. 38, pp. 1612-1630, Dec. 1989.
[9] H. Vandierendonck and K. De Bosschere, "Highly accurate and efficient evaluation of randomising set index functions," Journal of Systems Architecture, vol. 48, pp. 429-452, May 2003.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications
G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors)
© 2004 Elsevier B.V. All rights reserved.
Cache Memory Behavior of Advanced PDE Solvers

D. Wallin*, H. Johansson, and S. Holmgren

Uppsala University, Department of Information Technology, P.O. Box 337, SE-751 05 Uppsala, Sweden

*This work is supported in part by Sun Microsystems, Inc. and the Parallel and Scientific Computing Institute, Sweden.

Three different partial differential equation (PDE) solver kernels are analyzed with respect to cache memory performance on a simulated shared memory computer. The kernels implement state-of-the-art solution algorithms for complex application problems and the simulations are performed for data sets of realistic size. The performance of the studied applications benefits from much longer cache lines than normally found in commercially available computer systems. The reason for this is that numerical algorithms are carefully coded and have regular memory access patterns. These programs take advantage of spatial locality and the amount of false sharing is limited. A simple sequential hardware prefetch strategy, providing cache behavior similar to a large cache line, could potentially yield large performance gains for these applications. Unfortunately, such prefetchers often lead to additional address snoops in multiprocessor caches. However, applying a bundle technique, which lumps several read address transactions together, this large increase in address snoops can be avoided. For all studied algorithms, both the address snoops and cache misses are largely reduced in the bundled prefetch protocol.

1. INTRODUCTION

Designing the memory system of a parallel computer is a complex, multi-variable optimization problem. One parameter that must be determined is the cache line size. In a uniprocessor a large cache line size normally efficiently reduces the number of cache misses. However, in multiprocessors a longer cache line may lead to increased traffic as well as more cache misses if a large amount of false sharing occurs [1]. Another complication when designing the cache subsystems is that the preferred cache line size is very application dependent. Most available multiprocessors are optimized for running commercial software, e.g. databases and server-applications. In these applications, the data access pattern is very unstructured and a large amount of false sharing occurs [2, 3]. The computer vendors usually build multiprocessors with cache line sizes ranging between 32 and 128 B. This cache line size range gives rather good performance trade-offs between cache misses and traffic for commercial applications, but might not be ideal for scientific applications.
Computing the solution of partial differential equations (PDEs) is a central issue in many fields of science and technology. Despite increases in computational power and advances in
solution algorithms, PDEs arising from accurate models of complex, realistic problems still result in very time and memory consuming computations which must be performed using parallel computers. In this paper, we evaluate the behavior of three PDE solvers. The kernels are based on modern, efficient algorithms, and the settings and working sets are chosen to represent realistic application problems. These problems are much more demanding than commonly used scientific benchmarks, e.g. found in SPLASH-2 [1], which usually implement simplified solution algorithms. The codes are written in Fortran 90 and parallelized using OpenMP directives.
The study is based on full-system simulations of a shared memory multiprocessor, where the baseline computer model is set up to correspond to a commercially available system, the SunFire 6800 server. However, using a simulator, the model can easily be modified to resemble alternative design choices for possible future designs.
Based on the simulations, we conclude that the spatial locality is much better in these PDE kernels than in commercial benchmarks. Therefore, the optimal cache line size for these algorithms is larger than in most available multiprocessors. Spatial locality could be better explored also in a small cache line size multiprocessor using prefetching. A very simple sequential prefetch protocol, prefetching several consecutive addresses on each cache miss, gives a cache miss rate similar to a large cache line protocol. However, the coherence and data traffic on the interconnect increase heavily compared to a non-prefetching protocol. We show that by using the bundling technique, previously published in [3], the coherence traffic can be kept under control.
2. THE PDE SOLVERS
The kernels studied below represent three types of PDE solvers, used for compressible flow computations in computational fluid dynamics (CFD), radar cross-section computations in computational electromagnetics (CEM) and chemical reaction rate computations in quantum dynamics (QD). The properties of the kernels differ a lot with respect to the amount of computations performed per memory access, memory access patterns, amount of communication and communication patterns.
The CFD kernel implements the advanced algorithm described by Jameson and Caughey for solving the compressible Euler equations on a 3D structured grid using a Gauss-Seidel-Newton technique combined with multi-grid acceleration [4]. The implementation is described in detail by Nordén [5]. The data used in the computations are highly structured. Each smoothing operation in the multigrid scheme consists of a sweep through an array representing the grid. The values of the solution vector at six neighbor cells are used to update the values in each cell. After smoothing, the solution is restricted to a coarser grid, where the smoother is applied again recursively. The total work is dominated by the computations performed within each cell for determining the updates in the smoothing operations. These local computations are quite involved, but the parallelization of the smoothing step is trivial. Each of the threads computes the updates for a slice of each grid in the multi-grid scheme. The amount of communication is small and the threads only communicate pair-wise.
The CEM kernel is part of an industrial solver for determining the radar cross section of an object [6]. The solver utilizes an unstructured grid in three dimensions in combination with a finite element discretization. The resulting large system of equations is solved with a version of the conjugate gradient method. In each conjugate gradient iteration, the dominating operation is a matrix-vector multiplication with the very sparse and unstructured coefficient matrix. Here,
the parallelization is performed such that each thread computes a block of entries in the result vector. The effect is that the data vector is accessed in a seemingly random way. However, the memory access pattern does not change between the iterations.
The QD kernel is derived from an application where the dynamics of chemical reactions is studied using a quantum mechanical model with three degrees of freedom [7]. The solver utilizes a pseudo-spectral discretization in the two first dimensions and a finite difference scheme in the third direction. In time, an explicit ODE-solver is used. For computing the derivatives in the pseudo-spectral discretization, a standard convolution technique involving 2D fast Fourier transforms (FFTs) is applied. The parallelization is performed such that the FFTs in the first dimension are performed in parallel and locally [8]. For the FFTs in the second dimension, a parallel transpose operation is applied to the solution data, and the local FFTs are applied in parallel again. The communication within the kernel is concentrated to the transpose operation, which involves heavy all-to-all communication between the threads. However, the communication pattern is constant between the iterations in the time loop.

3. SIMULATION ENVIRONMENT AND WORKING SETS

All experiments are carried out using execution-driven simulation in the Simics full-system simulator. The modeled system implements a snoop-based invalidation MOSI cache coherence protocol. We set up the baseline cache hierarchy to resemble a SunFire 6800 server with 16 processors. The server uses UltraSPARC III processors, each equipped with two levels of caches. The processors have two separate first level caches, a 4-way associative 32 KB instruction cache and a 4-way associative 64 KB data cache. The second level cache is a shared 2-way associative cache of 8 MB. The only hardware parameter that is varied in the experiments is the cache line size, which is normally 64 B in the SunFire 6800. The second level caches of this computer system are subblocked. To isolate the effects of a varying cache line size and to avoid corner cases in the prefetch experiments, the figures presented are for simulated non-subblocked caches. A comparative study was performed also with subblocked second level caches with a small increase in cache misses as a consequence.
This article only presents results for the measured second level cache misses and the amount of traffic produced rather than to simulate the contention that may arise on the interconnect. This will not allow us to estimate the wall clock time for the execution of the benchmarks. Estimating execution time is difficult based on simulation and it is highly implementation dependent on the bandwidth assumptions for the memory hierarchy and the bandwidth of coherence broadcast and data switches.
We only perform a small number of iterative steps for each PDE kernel to limit the simulation time. Also a limited number of iterations gives appropriate results since the access pattern is most often regular within each iteration for the PDE kernels. The results were verified with longer simulations. Before starting the measurements, each solver completes a full iteration to warm up the caches. The CFD solver runs for two iterations, the CEM solver runs until convergence (about 20 iterations) and the QD solver runs for three iterations. The CFD problem size has a grid size of 32×32×32 elements using four multigrid levels.
The CEM problem represents a modeled generic aircraft with a problem coefficient matrix of about 175,000×175,000 elements; a little more than 300,000 of these are non-zero. The QD problem size is 256×256×20, i.e. a 256×256 2D FFT in the x-y plane followed by a 20 element FDM in the z-direction, a realistic size for problems of this type.
4. IMPACT OF VARYING CACHE LINE SIZE ON THE PDE SOLVERS
The cache miss characteristics, the number of snoop lookups and the data traffic have been measured for different cache line sizes for the three PDE solvers in Figure 1. The cache misses are categorized into five different cache miss types according to Eggers and Jeremiassen [9].
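As a side note, the basic line-size trade-off can be made concrete with a toy model. The Python sketch below is our own illustration, not the Simics-based setup used for the measurements in this paper; the cache and array sizes are arbitrary. It counts misses of a single direct-mapped cache for a unit-stride sweep and shows how good spatial locality lets longer cache lines remove misses.

# Toy illustration only: misses of one direct-mapped cache for an address
# stream under different cache line sizes.  With a regular, unit-stride
# access pattern (good spatial locality), misses drop as the line grows.

def misses(addresses, cache_bytes=8192, line_bytes=64):
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines              # one tag per line (direct mapped)
    miss_count = 0
    for a in addresses:
        block = a // line_bytes          # cache line (block) number
        idx = block % n_lines            # direct-mapped index
        if tags[idx] != block:
            tags[idx] = block
            miss_count += 1
    return miss_count

# Unit-stride sweep over a 64 KB array of 8-byte elements:
stream = [8 * i for i in range(8192)]
for line in (32, 64, 128, 256, 512):
    print(line, misses(stream, line_bytes=line))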
a)-c) Cache miss ratio for the CFD, CEM and QD kernels. d)-f) Snoop lookups and data traffic for the CFD, CEM and QD kernels.
Figure 1. Influence of cache line size on second level cache misses, snoop lookups and data traffic in the three PDE kernels. The miss ratio in percent is indicated in the cache miss figures. The snoop lookups and the data traffic are normalized relative to the 64 B configuration.
The data traffic is a measurement of the number of bytes transferred on the interconnect, while the term snoop lookups represents the number of snoops the caches have to perform.
The CFD kernel performs most of the computations within each cell in the grid, leading to a low overall miss ratio. The data causing the true sharing misses and the upgrades exhibit good spatial locality and the amount of these misses is halved with each doubling of the line size. However, the true and false sharing misses decrease slower since they cannot be reduced below a certain level. The decrease in snoop lookups is approximately proportional to the decrease in miss ratio. Due to the large increase in data traffic, the ideal cache line size is shorter in this kernel than in the other kernels. The decrease in cache miss ratio is influenced by a remarkable property in this kernel: the false sharing misses are reduced when the line size is increased. The behavior of false sharing is normally the opposite; false sharing misses increase with larger line size. When a processor is about to finish a smoothing operation, it requests data that previously have been modified by another processor and thus is invalidated. The invalidated data are placed consecutively in the remote cache. Larger pieces of this data are brought into the local cache when a longer cache line is used. If a shorter cache line is used, less invalidated data will be brought to the local cache in each access and therefore a larger number of accesses is required to bring all the requested data to the cache. Therefore, all these accesses will be categorized as false sharing misses.
The CEM kernel has a large problem size, causing a high miss ratio for small cache line sizes.
Capacity misses are most common and can be avoided using a large cache line size. The true sharing misses and the upgrades are also reduced with a larger cache line size, but at a slower rate since the data vector is being randomly accessed. False sharing is not a problem in this application, not even for very large cache line sizes. The data traffic exhibits nice properties for large cache lines because of the rapid decrease in the miss ratio. The snoop lookups are almost halved with each increase in cache line size. Therefore, the optimal cache line size for the CEM kernel is very long.
The QD kernel is heavily dominated by two miss types, capacity misses and upgrades, which both decrease with enlarged cache line size. The large number of upgrades is caused by the all-to-all communication pattern during the transpose operation where every element is modified after being read. This should result in an equal number of true sharing misses and upgrades, but the kernel problem size is very large and replacements take place before the true sharing misses occur. Therefore, a roughly equal amount of capacity misses and upgrades is recorded. The data traffic increases rather slowly for cache lines below 512 B. The snoop lookups show the opposite behavior and decrease rapidly until the 256 B cache line size, where they level out.
5. SEQUENTIAL PREFETCHING AND BUNDLED SEQUENTIAL PREFETCHING

The former results showed that a very long cache line would be preferred in a computer optimized for solving PDEs. A preferable line size would probably be between 256 and 512 B for these applications. Even larger cache lines lead to less efficiently used caches with a large increase in data traffic as a consequence. Unfortunately, no computer is built with such a large line size. Instead, a similar behavior to having a large cache line can be obtained by using various prefetch techniques. These techniques try to reduce miss penalty by prefetching data to the cache before the data are being used.
A very simple hardware method to achieve cache characteristics similar to having a large cache line, while keeping a short cache line size, is to implement sequential prefetching. In such systems, a number of prefetches is issued for data having consecutive addresses each time a cache miss occurs. The amount of prefetching is governed by the prefetch degree, which decides how many additional prefetches to perform on each cache miss. Dahlgren has studied the behavior of sequential prefetching on the SPLASH benchmarks [10]. Sequential prefetching normally efficiently reduces the number of cache misses since more data is brought into the cache on each cache miss. Sequential prefetching requires only small changes to the cache controller and can easily be implemented without a large increase in coherence complexity. The main disadvantage with sequential prefetching schemes is that the data traffic and the snoop lookups usually increase heavily.
The snoop lookups can be largely reduced in a sequential prefetch protocol using the bundling technique [3]. Bundling lumps the original read request together with the prefetch requests to the consecutive addresses. The original read request is extended with a bit mask, which shows the address offset to the prefetch addresses. Bundling efficiently limits the number of address transactions on the interconnect. However, not only the address traffic can be reduced but also the number of snoop lookups required. This is done by requiring bundled read prefetch requests to only supply data if the requested prefetch cache line is also in the owner state in the remote cache. Data that are in other states will not be supplied to the requesting cache. Write and upgrade prefetches are not bundled and will not reduce the amount of address traffic
generated or the snoop bandwidth consumed. Bundling gives a large performance advantage since the available snoop bandwidth is usually the main contention bottleneck in snoop-based multiprocessors. The available data bandwidth is a smaller problem since the data packets do not have to be ordered and can be returned on a separate network. For example, the SunFire 6800 server has a data interconnect capable of transferring 14.4 GB/s while its snooping address network can support 9.6 GB/s worth of address snoops [11].
We have studied sequential hardware prefetching on the PDE solvers in Figure 2.
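The following Python fragment is a purely conceptual sketch of our own, heavily simplified from the description above: a sequential prefetcher turns one miss into several address transactions, whereas the bundled variant sends a single read request extended with a bit mask of prefetch offsets. The line size, degree and mask encoding are assumptions made for the illustration.

# Conceptual sketch only, not a protocol implementation.
LINE = 64   # assumed cache line size in bytes

def sequential_requests(miss_addr, degree):
    """Plain sequential prefetching: one address transaction per line,
    the missing line plus `degree` consecutive lines."""
    base = miss_addr // LINE
    return [(base + i) * LINE for i in range(degree + 1)]

def bundled_request(miss_addr, degree):
    """Bundled prefetching: a single read transaction for the missing line,
    extended with a bit mask marking the consecutive prefetch lines."""
    base = miss_addr // LINE
    mask = (1 << degree) - 1     # bit i set -> also prefetch line base + i + 1
    return (base * LINE, mask)

print(sequential_requests(0x12345, 3))   # four separate address snoops
print(bundled_request(0x12345, 3))       # one snooped address plus a mask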
a)-c) Cache miss ratio for the CFD, CEM and QD kernels. d)-f) Snoop lookups and data traffic for the CFD, CEM and QD kernels.
Figure 2. Influence of sequential and bundled sequential prefetching on second level cache misses, snoop lookups and data traffic. The miss ratio in percent is indicated in the cache miss figures. The snoop lookups and the data traffic are normalized relative to the 64 B configuration.

We present results of non-prefetching configurations having different cache line sizes: 64, 128, 256 and 512 B. Several sequential prefetching configurations, 64s1, 64s3 and 64s7, are also studied, which prefetch 1, 3 and 7 consecutive cache lines based on address on each cache miss. Finally, the corresponding bundled configurations prefetching 1, 3 and 7 consecutive addresses, 64b1, 64b3 and 64b7, are evaluated.
Sequential prefetching works very well for the studied PDE solvers. If we compare the sequentially prefetching configurations, 64s1, 64s3 and 64s7, with the non-prefetching configuration 64, we see that for all kernels the cache misses are largely reduced. Compared with the non-prefetching configuration with a comparable cache line size, e.g. the 512 B configuration compared to the 64s7 configuration, the cache misses are lower or equal for the prefetching configuration. The main reason for the discrepancy is that with the sequential prefetching protocol, the consecutive addresses will always be fetched. If a protocol with a large cache line is used, a cache line aligned area around the requested address will be fetched, that is, both data before and after the desired address. Especially the CEM kernel takes advantage of this, where the cache misses are reduced about 30 percent in the sequential protocol compared to the large cache line non-prefetching protocol.
Sequential prefetching generally leads to more data traffic than the baseline 64 B non-prefetching configuration and less data traffic than the corresponding larger non-prefetching configuration. The snoop lookups increase rather heavily for the CFD and QD kernels compared with both a small and a large cache line size non-prefetching configuration.
Bundling efficiently reduces the snoop lookup overhead. For all kernels, bundling yields an equal amount of cache misses as the corresponding sequential prefetch protocol. However, both the snoop lookups and the data traffic are reduced in the bundled protocols for all kernels. The reason for the decrease in data traffic is that the bundled protocol is more restrictive at providing data than the sequential protocol. Data are only sent if the owner of the original cache line is also the owner of the prefetch cache lines. Bundling makes it possible to use a large amount of prefetching at a much smaller cost, especially in address snoops, than normal sequential prefetching. There are still more address snoops generated in the bundled prefetch protocol than in a non-prefetching large cache line size protocol. This is almost entirely caused by upgrades generating prefetch messages, which cannot be eliminated since only reads are bundled.
The overall performance of the bundled sequential protocols compared with the baseline 64 B non-prefetching configuration is excellent, especially for the CEM and QD kernels. For example, the 64b7 configuration has about 88 percent fewer cache misses, 70 percent fewer address snoops and 30 percent less data traffic than the 64 B non-prefetching protocol for the CEM kernel. The performance of the bundled protocols in the CFD kernel is somewhat worse since the data traffic increases rather heavily compared with the non-prefetching protocol. However, since the total ratio of cache misses is small in this kernel, contention on the interconnect will most likely not be a performance bottleneck. Also, the snoop bandwidth is often more limited than the data bandwidth in snoop-based systems. Compared with previous studies [3], the PDE kernels gain more from bundling than the SPLASH-2 benchmarks and commercial Java servers. This is an effect of more carefully coded applications with better spatial locality.
Sequential prefetching can simply be implemented in a multiprocessor without a large increase in hardware complexity. Some extra hardware is needed in the cache controller to fetch more data on each cache miss. The cost of implementing a bundled protocol is rather small even though some extra corner cases can be introduced in the cache coherence protocol [3].
6. CONCLUSION

From full-system simulation of large sized state-of-the-art PDE solvers we have learned that these applications do not experience large problems with false sharing as is the case in many benchmark programs and commercial applications. Also, the spatial locality is good in the PDE solvers and therefore it is beneficial to use large cache line sizes in these applications.
Sequential prefetching is one of the simplest forms of hardware prefetching and can easily be implemented in hardware. For the studied PDE solvers, it is beneficial to use sequential prefetching. To further improve the performance of the PDE solvers, sequential prefetching can be used together with the bundling technique to largely reduce the amount of address snoops required by the caches in shared-memory multiprocessors. Using this technique, both the number of cache misses and the address snoops become lower than in a non-prefetching configuration.
REFERENCES
[1] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh and A. Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, Proceedings of the ISCA, 1995.
[2] M. Karlsson, K. Moore, E. Hagersten and D. A. Wood, Memory System Behavior of Java-Based Middleware, Proc. of HPCA, 2003.
[3] D. Wallin and E. Hagersten, Miss Penalty Reduction Using Bundled Capacity Prefetching in Multiprocessors, Proc. of the IPDPS, 2003.
[4] A. Jameson and D. A. Caughey, How Many Steps are Required to Solve the Euler Equations of Steady, Compressible Flow: in Search of a Fast Solution Algorithm, Proc. of the CFD, 2001.
[5] M. Nordén, M. Silva, S. Holmgren, M. Thuné and R. Wait, Implementation Issues for High Performance CFD, Proc. of International IT Conf., Colombo, Sri Lanka, 2002.
[6] F. Edelvik, Hybrid Solvers for the Maxwell Equations in Time-Domain. PhD thesis, Department of Information Technology, Uppsala University, 2002.
[7] A. Petersson, H. Karlsson and S. Holmgren, Predissociation of the Ar-I2 van der Waals Molecule, a 3D Study Performed Using Parallel Computers. Submitted to Journal of Phys. Chemistry, 2002.
[8] D. Wallin, Performance of a High-Accuracy PDE Solver on a Self-Optimizing NUMA Architecture. Master's thesis, Department of Information Technology, Uppsala University, 2001.
[9] S. J. Eggers and T. E. Jeremiassen, Eliminating False Sharing, Proc. of the ICPP, 1991.
[10] F. Dahlgren, M. Dubois and P. Stenström, Sequential Hardware Prefetching in Shared-Memory Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 6(7):733-746, 1995.
[11] A. Charlesworth, The Sun Fireplane System Interconnect, Proc. of Supercomputing, 2001.
Performance
Parallel Computing: Software Technology, Algorithms, Architectures and Applications
G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors)
© 2004 Elsevier B.V. All rights reserved.
A Comparative Study of MPI Implementations on a Cluster of SMP Workstations

G. Rünger and S. Trautmann

Fakultät für Informatik, Technische Universität Chemnitz, 09107 Chemnitz, Germany
{ruenger, svtr}@informatik.tu-chemnitz.de

Small to medium-sized clusters of workstations with up to four processors per node are very popular due to their performance-price ratio. In this paper we study how different implementations of the MPI standard using different interconnection networks influence the runtime of parallel applications on a cluster of SMP workstations. Two different benchmark suites are used: the SKaMPI benchmark suite for measurements of MPI communication operations and the NAS Parallel Benchmark suite for application specific measurements.

1. INTRODUCTION

The use of clusters with up to four processors per node (usually realized as symmetric multiprocessors - SMP) for parallel computing is increasing. High performance networks like Myrinet, Gigabit-Ethernet, Infiniband or SCI are popular to build a low-priced high-performance parallel computer from off-the-shelf components. The Message Passing Interface (MPI) [1] is the most common option to program clusters of workstations combining the power of their nodes. Recent implementations of the MPI standard cope with clusters of SMPs in different ways. In this paper we consider different MPI implementations and investigate their performance behavior on a cluster of Dual-Xeon nodes.
Several comparisons of MPI implementations have been done recently. The attention focused mainly on how an alternative interconnection network may influence the performance of MPI based applications. To avoid the overhead of the TCP/IP protocol stack some implementations of the MPI standard use different approaches like the Virtual Interface Architecture (VIA) to communicate between cluster nodes. A comparison of MPI implementations using TCP/IP and VIA on a Gigabit-Ethernet is presented in [2]. In [3] implementations of the MPI standard are compared on a shared memory machine.
In this paper we consider four MPI implementations, two using the Fast Ethernet interconnection network (LAM 6.5.9 [4], MPICH 1.2.5 [5]) and two using the Scalable Coherent Interface (SCI) network (MP-MPICH CVS snapshot April 2003 [6], ScaMPI 1.14.6 [7]). For an evaluation of the MPI implementations two benchmark suites are used: the Special Karlsruher MPI (SKaMPI) benchmark suite [8] for MPI point-to-point and collective communication operations and the NAS Parallel Benchmark (NPB) suite [9], [10] for application specific results.
The measurements presented in this paper are performed on a cluster of 16 Dual Intel Xeon nodes with 1 GB RAM each. The cluster is equipped with three interconnection networks, one network using SCI and two networks using Fast Ethernet interfaces. The SCI interconnection network
is organized as a 2D torus (4 × 4 nodes). Both Fast Ethernet interconnection networks use a dedicated switch. The so called "communication network" is reserved for MPI communications exclusively and the "service network" is connected to a front-end machine supplying a shared file-system and allowing access to the cluster nodes.
The remainder of this paper is organized as follows. In Section 2 we present some details of the studied MPI implementations regarding their communication protocols. Section 3 shows the results of our measurements for the SKaMPI benchmark suite. Results of the NPB suite are presented in Section 4. Section 5 concludes the paper.

2. MPI IMPLEMENTATIONS FOR CLUSTERS OF SMP'S
In general, all MPI implementations supporting distributed memory machines can be used on clusters consisting of uni-processor and/or SMP nodes. If SMP nodes are involved, more than one MPI process is executed per SMP node. To avoid the overhead induced by communication protocols (e.g. within the TCP/IP stack) or network interfaces some MPI implementations support a shared memory interface to run MPI programs on large SMP machines (e.g. the ch_shmem device for MPICH).
The MPI standard defines a set of communication operations consisting of point-to-point and collective communication operations. We first present implementation details of point-to-point communication operations for the MPI implementations studied. In general, the communication strategy depends on the message size to transfer.
LAM: a small message with up to 1024 byte is sent to the destination node in one packet consisting of a header and the message data. Larger messages are subdivided into packets. The first packet is sent to the destination node and the source node waits for an acknowledgement before the remaining part of the message is sent.
MPICH, implementing the Abstract Device Interface (ADI) [11], provides four protocols for data exchange:
short: the message data is sent within a control message,
eager: the data is sent without request and buffered in the receiver's memory,
rendezvous: data is sent to the destination only when requested,
get: the sender directly copies the data to the receiver's memory (hardware support is needed).
The features of the interconnection network and the message size to transfer affect the use of a protocol.
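As a purely illustrative sketch of how such a size-dependent protocol choice might look (the thresholds below are hypothetical and not taken from any of the four implementations), consider:

# Illustration only: an ADI-style layer choosing a point-to-point protocol
# from the message size.  Threshold values are invented for the example.
SHORT_LIMIT = 128        # assumed: payload still fits into the control message
EAGER_LIMIT = 64 * 1024  # assumed: receiver-side buffering still acceptable

def choose_protocol(nbytes, remote_memory_access=False):
    if nbytes <= SHORT_LIMIT:
        return "short"        # data travels inside the control message
    if nbytes <= EAGER_LIMIT:
        return "eager"        # send immediately, buffer in the receiver's memory
    if remote_memory_access:
        return "get"          # copy directly into the receiver's memory
    return "rendezvous"       # send the data only when the receiver requests it

print(choose_protocol(64), choose_protocol(8192), choose_protocol(1 << 20))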
487 network architecture). In contrast, MPICH uses a minimum spanning tree (MST) algorithm for small message sizes and a scatter followed by an allgather operation for large message sizes. Two kinds of collective communications operations can be distinguished: operations executing one collective communication operation (e.g. M P I _ B c a s t and MP I _ R e d u c e ) and operations composed of two consecutive collective communication operations. The communication operation MPl_Allreduce is composed of an MPl_Reduce and an MPI_Bcast operation. So the performance of this kind of communication operation depends on the implementation of the underlying collective communication operations.
3. SKaMPI BENCHMARK SUITE In this section we show how the studied implementations of the MPI standard perform for different communication operations. Point-to-point MPI communication operations" The task of a point-to-point communication operation is to send an amount of data from one MPI process to another. On a cluster of SMP machines we distinguish three different scenarios. For processes on the same SMP node data may be exchanged using a memory copy operation. Alternatively, an MPI_Send/Recv operation may be used to exchange data on one node. If the MPI processes involved do not share the same node an MPI_Send/Recv operation has to be executed. A communication consisting of a pair of M P I _ S e n d and M P I _ R e c v operation is just one possibility to realize a point-to-point communication operation. Alternatively, buffered, synchronous, blocking, and non-blocking send-receive-operations may be used. First we compare a memory copy operation with an M P I _ S e n d - M P I _ R e c v operation on one SMP node. As shown in Figure l a) the use of a memory copy operation may reduce the communication time by a factor of 10 or more, even in comparison with the SCI-based MPI implementations. If a programming model with a shared address space would be used the memory copy operation could be avoided but the programmer has to take care of memory consistency. Figure lb) shows the results for a point-to-point communication operation (MPI_Isend - MPI_Recv in this case) where the MPI processes involved do not share the same node. In contrast to a point-to-point communication operation on one SMP node the communication times are up to ten times slower. In general, the runtime for point-to-point communication operations depend heavily on the latency and bandwidth of the interconnection network used. Both MPI implementations using Fast Ethernet deliver comparable results with a small advantage in favor of LAM if the message size is smaller than 1024 byte. In contrast, the implementations using the SCI interconnection network show a non-uniform behavior, especially for message sizes between 128 bytes and about 1024 byte. In general, the ScaMPI implementation performs better than MP-MPICH due to the use of alternative low-level drivers for the SCI network adaptor. For a message size of up to 128 byte or above 1024 byte MP-MPICH and ScaMPI produce comparable results. For medium message sizes MP-MPICH uses a different protocol for data exchange which is the reason for increasing communication times (details are presented in [6]). The results for other MPI realizations of point-to-point communication operations are similar to the results for MP I _ I s e n d - M P I _ R e c v presented above.
a) Memory copy operation compared to MPI_Send-MPI_Recv on one SMP node using a network interface. b) Communication times for MPI_Isend-MPI_Recv between different nodes of a cluster.
Figure 1. Measurements of point-to-point communication.
!
.....................~~:~ ~........... ~.~....................................
.
.............................................. ;:<~ ..................
/ ,,\~......~
.................................. : ........ I ........... .../\i .......... <"-'~ "
~i 800 ................. .........!...... i
...
I
'
............... '".......................
~ ......................................... ~:S.! .............................. ~ii;;ii;;iiiiiii;
E o
KkXKK
10
XXXi~
XL.ii~<<:.i,i/?;KK;K;IUKIKXI;K;
..... 101
10 2
message
a)
10 a size
10 4
10 5
in b y t e
Varying message
size
0~I".......... +............ '4.......................... :6............................ ::........................ i:;i0 .............. i:2....... il.........I~1:6 n u m b e r of nodes
b) Varyingnumberof nodes
Figure 2. Results for the MPI_Allreduce communication operation and different MPI implementations, using 16 nodes and varying message size (a) and a message size of 256 bytes and a varying number of nodes (b). Collective communication operations: As an example of a collective communication operation we present results for the MPI_Allreduce communication operation, which is widely used in the NAS Parallel Benchmark suite discussed in Section 4. In general the communication times show a similar behavior with respect to the message size as the communication times for point-to-point operations, see Fig. 2a). As shown in Fig. 2b), the communication times are also influenced by the number of nodes involved in a collective communication operation. Due to the use of binary trees to implement collective communication operations, the communication times for the LAM implementation increase continuously with the number of nodes involved. In contrast, MPICH and MP-MPICH show a less uniform behavior. If the number of nodes is divisible by four, the communication times are smaller than in other cases because of the use of minimum-spanning-tree-based algorithms. Since the point-to-point communication operations implemented in ScaMPI produce the fastest runtimes, and since collective communication operations are built on top of point-to-point
communication operations, ScaMPI outperforms all competing MPI implementations in most of the cases. In contrast to the point-to-point communication operations, MP-MPICH performs better than ScaMPI for very large message sizes due to a more efficient implementation of the collective communication operations MPI_Reduce and MPI_Bcast. In general, the use of SMP nodes has only a very small influence on the runtime of collective communication operations. For the LAM MPI implementation, scatter and gather operations benefit most. 4. NAS PARALLEL BENCHMARK SUITE In this section we present measurements for benchmark programs from the NAS Parallel Benchmark (NPB) suite. The NPB suite consists of kernel benchmark programs and application benchmark programs. All NPB programs are executed for several problem sizes. In this paper, we present the results for a medium problem size with respect to runtimes ("B" in terms of the NPB suite), because results for smaller and larger problem sizes show a similar behavior. All measurements are performed for up to 25 or 32 MPI processes, depending on whether the algorithm requires a square number of processes. Due to a startup problem in MP-MPICH, the measurements for this implementation are currently limited to 16 processes. NAS kernel benchmarks
After presenting algorithmic details we discuss the results for the kernel benchmark programs from the NPB suite. Algorithmic details of the kernel benchmark programs: The EP ("embarrassingly parallel") kernel benchmark program evaluates an integral by pseudo-random trials. This kernel requires a total of four MPI_Allreduce communication operations. The second kernel benchmark program measured is IS ("integer sort"), which performs a sorting operation. The IS benchmark program uses MPI_Irecv, MPI_Send, MPI_Allreduce, MPI_Alltoall, and MPI_Reduce communication operations. Many small messages are sent in this kernel benchmark program. The CG ("conjugate gradient") kernel benchmark program solves a set of sparse linear equations; the MPI communication operations MPI_Irecv, MPI_Send, and MPI_Reduce are used. The final kernel benchmark program in our evaluation is MG ("multi-grid"), which requires highly structured long-distance communication and tests both short- and long-distance data communication abilities. MPI_Bcast, MPI_Allreduce, MPI_Irecv, MPI_Send and MPI_Reduce are used in this kernel benchmark. Execution times of the kernel benchmark programs: Since the EP kernel benchmark program uses only four communication operations but has a large computation requirement, the MPI implementation used does not affect its runtime. Doubling the number of nodes halves the runtime of the benchmark program for all MPI implementations. The results for the IS benchmark program (Fig. 3a)) depend heavily on the interconnection network used. MPI implementations using the SCI interconnection network lead to smaller runtimes than MPI implementations using Fast Ethernet. The runtime of the IS benchmark program using MPICH is much larger than for LAM because many small messages are exchanged. Both MPI implementations using the SCI interconnection network deliver
comparable results with a small advantage in favor of ScaMPI. The results for the CG benchmark program are shown in Figure 3b). This benchmark program uses many communication operations. For both MPI implementations using Fast Ethernet, the number of nodes has only a small influence on the runtime of the algorithm. In contrast, both MPI implementations using the SCI interconnection network deliver faster runtimes. The CG benchmark program benefits most from the use of an interconnection network providing high bandwidth and low latency. In contrast to IS and CG, the results for the MG kernel benchmark program (Fig. 3c)) are much more uniform and the influence of a high-performance interconnection network is less pronounced. MPICH performs better than LAM, and ScaMPI performs better than MP-MPICH for this kernel benchmark program.
Figure 3. Results of the kernel benchmark programs from the NAS Parallel Benchmark Suite for different MPI implementations and varying numbers of nodes: a) IS (integer sort), b) CG (conjugate gradient), c) MG (multi-grid); each panel plots results against the number of MPI processes.
NAS application benchmarks: To complete our survey of benchmark programs we present application programs from the NPB suite. Algorithmic details of the application benchmark programs: The LU application benchmark program solves a regular-sparse, block lower and upper triangular system of equations. This benchmark program uses MPI_Allreduce, MPI_Bcast, MPI_Irecv, MPI_Recv and MPI_Send communication operations. Sets of linear equations which form a (2,2)-band matrix with the property that Aij = 0 if |i - j| ≥ 3 ("scalar pentadiagonal") are solved by the SP application benchmark program. MPI_Allreduce, MPI_Bcast, MPI_Irecv, MPI_Isend and MPI_Reduce communication operations are used. Finally, the BT application benchmark solves sets of linear equations which form a (1,1)-band matrix with the property that Aij = 0 if |i - j| ≥ 2 ("block tridiagonal") and uses MPI_Allreduce, MPI_Bcast, MPI_Isend, MPI_Irecv and MPI_Reduce communication operations. Runtime of the application benchmark programs: The runtimes for the LU benchmark program from the NPB suite are presented in Figure 4a). The influence of the interconnection network on the runtime is small. In contrast, the use of more MPI processes has a strong influence on the runtime of the benchmark program. In contrast to the LU benchmark program, the influence of the interconnection network on the runtime of the SP benchmark program (Fig. 4b)) is much stronger. ScaMPI delivers the fastest overall runtime and LAM performs better than MPICH. The results for the BT benchmark program, see Fig. 4c), are similar to the results for the LU benchmark program. The improvements caused by the SCI interconnection network are small.
Figure 4. Results of the application benchmark programs a) LU, b) SP, and c) BT from the NAS Parallel Benchmark Suite for different MPI implementations and varying numbers of nodes. The results from the NPB programs can be summarized as follows: In general, ScaMPI using the SCI interconnection network delivers the best results for all benchmark programs. MP-MPICH, also using the SCI interconnection network, can keep up in most of the cases. The MPI implementations using a Fast Ethernet communication network do not behave as uniformly: if point-to-point communication operations with small message sizes dominate the communication, LAM delivers better runtimes; in other cases MPICH performs better. 5. CONCLUSION In this paper we have studied how the choice of an MPI implementation can influence the runtime of parallel programs on an SMP cluster of dual Xeon nodes. The runtime of MPI applications is influenced by two factors: the performance of the interconnection network used and the implementation of the point-to-point and collective communication operations. Due to the minimum spanning tree approach for collective communication operations, MPICH and MP-MPICH perform best if the number of nodes involved is a power of two. If an MPI_Allreduce communication operation with a message size of about 5000 bytes or larger is executed, MP-MPICH performs better than ScaMPI due to a more efficient implementation. Both MPI implementations using Fast Ethernet deliver comparable results, with a small advantage in favor of LAM if many small messages are exchanged. ScaMPI delivers the best results for almost all benchmark programs. MP-MPICH, the freely available SCI MPI implementation, shows good results for all benchmark programs but is currently limited to 16 MPI processes. We experienced runtime improvements of up to 80% if the SCI interconnect is used instead of Fast Ethernet. On average, for application-specific measurements a runtime improvement of 30% seems more realistic. REFERENCES
[1] Message Passing Interface Forum, MPI: A Message Passing Interface, in: Proceedings of Supercomputing '93, IEEE Computer Society Press, 1993, pp. 878-883.
[2] H. Ong, P. A. Farrell, Performance Comparison of LAM/MPI, MPICH, and MVICH on a Linux Cluster connected by a Gigabit Ethernet Network, in: Proc. of Linux 2000, 4th Annual Linux Showcase and Conference, 2000, pp. 353-362.
[3] B. V. Voorst, S. Seidel, Comparison of MPI Implementations on a Shared Memory Machine, in: Proc. of the IPDPS 2000 Workshops, 2000.
[4] G. Burns, R. Daoud, J. Vaigl, LAM: An open cluster environment for MPI, in: Proceedings of Supercomputing Symposium 94, 1994, pp. 379-386.
[5] http://www-unix.mcs.anl.gov/mpi/.
[6] J. Worringen, T. Bemmerl, MPICH for SCI-connected clusters, in: Proc. SCI Europe, Toulouse, France, 1999, pp. 3-11.
[7] Scali AS, ScaMPI Design and Implementation.
[8] R. Reussner, P. Sanders, L. Prechelt, M. Müller, SKaMPI: A Detailed, Accurate MPI Benchmark, in: PVM/MPI, 1998, pp. 52-59.
[9] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, D. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, S. K. Weeratunga, The NAS Parallel Benchmarks, The International Journal of Supercomputer Applications 5 (3) (1991) 63-73.
[10] D. Bailey, T. Harris, W. Saphir, R. van der Wijngaart, A. Woo, M. Yarrow, The NAS Parallel Benchmarks 2.0, Tech. rep. NAS-95-020, NASA Ames Research Center, Moffett Field, CA.
[11] W. Gropp, E. Lusk, N. Doss, A. Skjellum, High-performance, portable implementation of the MPI Message Passing Interface Standard, Parallel Computing 22 (6) (1996) 789-828.
[12] J. Worringen, K. Scholtyssik, MP-MPICH Documentation.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) © 2004 Elsevier B.V. All rights reserved.
MARMOT: An MPI Analysis and Checking Tool
B. Krammer, K. Bidmon, M.S. Müller, and M.M. Resch
High Performance Computing Center Stuttgart, Allmandring 30, D-70550 Stuttgart, Germany
{krammer, bidmon, mueller, resch}@hlrs.de
The Message Passing Interface (MPI) is widely used to write parallel programs using message passing. MARMOT is a tool to aid in the development and debugging of MPI programs. This paper presents the situations where incorrect usage of MPI by the application programmer is automatically detected. Examples are the introduction of irreproducibility, deadlocks and incorrect management of resources like communicators, groups, datatypes and operators.
1. INTRODUCTION
Due to the complexity of parallel programming, there is a clear need for debugging of MPI programs. According to our experience, there are several reasons for this. First, the MPI standard leaves many decisions to the implementation, e.g. whether or not a standard communication is blocking. Second, parallel applications get more and more complex and, especially with the introduction of optimizations like the use of non-blocking communication, also more error prone. Debugging MPI programs has been addressed in different ways:
• Classical debuggers have been extended to address MPI programs. This is done by attaching the debugger to all processes of the MPI program. There are many parallel debuggers, among them the very well-known commercial debugger Totalview [13]. The freely available debugger gdb currently has no support for MPI; however, it may be used as a backend debugger in conjunction with a front-end that supports MPI, e.g. mpigdb [16, 15]. Another example of such an approach is the commercial debugger DDT by Streamline Computing [12], or the non-freely available p2d2 [4, 5, 14].
• The second approach is to provide a debug version of the MPI library (e.g. mpich). This version is not only used to catch internal errors in the MPI library, but it also detects some incorrect usage of MPI by the user, e.g. a type mismatch of sending and receiving messages [6].
• Other tools like MPI-CHECK are restricted to Fortran code and perform argument type checking or find problems like deadlocks [8]. Similar to MARMOT, there are tools using the profiling interface, e.g. Umpire [3]. But in contrast to our tool, Umpire is limited to shared memory platforms.
The disadvantages of these tools are that they may not be freely available, that they require source code modification or language parsers, that they are limited to special platforms or language bindings, or that they do not catch incorrect usage of MPI but only help to analyze the situation after the incorrect usage has produced an error, e.g. a segmentation violation. None of these approaches addresses portability or reproducibility, two of the major problems when using MPI on a wide range of platforms. The idea of MARMOT is to verify the standard conformance of an MPI program automatically during runtime and to help debug the program in case of problems. 2. DESIGN GOALS AND ARCHITECTURE OF MARMOT MARMOT is a library that has to be linked to the application in addition to the native MPI library. No modification of the application's source code is required. MARMOT supports the full MPI-1.2 standard; its main design goals are:
• Portability: by verifying that the program adheres to the MPI standard [1, 2], it enables the program to run on any platform in a smooth and seamless way.
• Scalability: the use of automatic techniques that do not need user intervention makes it possible to debug programs running on hundreds or thousands of processors.
• Reproducibility: the tool issues warnings in situations that may cause possible race conditions. It also detects deadlocks automatically and notifies the user where and why these have occurred.
MARMOT uses the MPI profiling interface to intercept the MPI calls for analysis before they are passed from the application to the native MPI library, see Figure 1. As this profiling interface is part of the MPI standard, MARMOT can be used with any MPI implementation that adheres to this standard.
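As an illustration of the profiling-interface mechanism, the following minimal sketch shows how a checking library can wrap an MPI call: it redefines MPI_Send, performs some local checks, and then forwards the call to the name-shifted PMPI_Send entry point that every standard-conforming MPI library provides. The specific checks shown here are illustrative assumptions, not MARMOT's actual checks.

#include <mpi.h>
#include <stdio.h>

/* Interposed MPI_Send: the application links against this definition,
 * the real work is done by the library's PMPI_Send. */
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    if (count < 0)
        fprintf(stderr, "warning: MPI_Send called with a negative count\n");
    if (comm == MPI_COMM_NULL)
        fprintf(stderr, "warning: MPI_Send called on MPI_COMM_NULL\n");

    /* Forward the checked call to the native MPI implementation. */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}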
Figure 1. Design of MARMOT (components: application, additional interface, MARMOT core tool / debug server, and native MPI, divided into a client side and a server side).
MARMOT adds an additional MPI process for all global tasks that cannot be handled within the context of a single MPI process, like deadlock detection. Another global task is the control of the execution flow, i.e. the execution will be serialized if this option is chosen by the
user. Local tasks, like for example the verification of resources, are performed on the client side. Information between the MPI processes and this additional debug process is transferred using MPI. For the application, this additional process is transparent. This is achieved by mapping MPI_COMM_WORLD to a MARMOT communicator that contains only the application processes. Since all other communicators are derived from MPI_COMM_WORLD, they will also automatically exclude the debug server process. Another possible approach is to use a thread instead of an MPI process and to use shared memory communication instead of MPI [3]. The advantage of the approach taken here is that the MPI library does not need to be thread safe. Without the limitation to shared memory systems, the tool can also be used on a wider range of platforms. Since the improvement of portability was one of the design goals, we did not want to limit the portability of MARMOT. This makes it possible to use the tool on any development platform used by the programmer.
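The remapping can be pictured with a small sketch: an application-only communicator is derived from MPI_COMM_WORLD by excluding the rank reserved for the debug server. The placement of the debug server at the last rank is an assumption made for illustration; this is not MARMOT's actual implementation.

#include <mpi.h>

/* Build a communicator that contains only the application processes,
 * excluding the last rank, which is assumed to host the debug server. */
MPI_Comm build_app_comm(void)
{
    int world_size;
    MPI_Comm app_comm;
    MPI_Group world_group, app_group;

    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    int debug_rank = world_size - 1;                 /* assumed placement */
    MPI_Group_excl(world_group, 1, &debug_rank, &app_group);
    MPI_Comm_create(MPI_COMM_WORLD, app_group, &app_comm);

    MPI_Group_free(&world_group);
    MPI_Group_free(&app_group);
    return app_comm;            /* MPI_COMM_NULL on the debug server rank */
}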
3. PROBLEMS IN MPI USAGE 3.1. Reproducibility The possibility of introducing race conditions in parallel programs with the use of MPI is one of the major problems of code development. Although it is possible to write correct programs containing race conditions, there are many occasions where they must be avoided. In an environment where strong sequential equivalence is required, race conditions are explicitly forbidden in order to achieve bitwise identical results. But even if the applied numerical method allows race conditions between MPI messages, they present a problem if a bug exists in the program that cannot be easily reproduced because the program potentially behaves differently in each run. In any case, the programmer should be aware of where possible race conditions are introduced to decide whether they are intended and to judge their potential danger. To reproduce and identify bugs, methods like record and replay can be used [7]. The first disadvantage is that problems can only be identified after they have occurred; the second problem is that, in order to allow a potential replay, the recording has to be done for all runs. It should be noted that this method does not identify race conditions, but makes it possible to reproduce the exact conditions of a run despite the existence of race conditions. A thorough dependency analysis is necessary to identify races and can also be used to limit the recording to events where a potential race exists. This improves efficiency and also provides more information to the programmer. With MPI it is possible to choose a different approach. Here, the possible locations of races can be identified by locating the calls that are sources of race conditions. One example is the use of a receive call with MPI_ANY_SOURCE as source argument. A second case is the use of MPI_CANCEL, because the result of the later operation depends on the exact stage of the message that is the target of the cancellation. If strong sequential equivalence is required, the use of MPI_REDUCE is another potential source of irreproducibility. By inspecting all calls and arguments, MARMOT can detect dangerous calls and issue warnings. 3.2. Resource management The MPI standard introduces the concept of communicators, groups, datatypes and operators. As already mentioned in the section on MARMOT's architecture, MARMOT maps MPI communicators to its own communicators and uses the MPI profiling interface to intercept the MPI calls with their parameters for further analysis. In more detail, the MPI communicators, groups, datatypes and operators are mapped to corresponding MARMOT communicators,
groups, datatypes and operators (and vice versa) and can thus be scrutinized. For example, all predefined communicators such as MPI_COMM_WORLD are automatically inserted into the set of MARMOT communicators. All user-defined communicators are inserted when created and removed when freed. That way, MARMOT can check if a used communicator is valid or not. The same approach is taken with groups, datatypes etc. Apart from these validity checks, MARMOT can perform further checks as shown in the following examples:
• Communicators: MARMOT checks whether the communicator is valid, i.e. whether it is MPI_COMM_NULL or whether it is a communicator that has been created and registered with the appropriate MPI calls. Depending on the particular MPI call, further checks may be performed. For example, when using the call MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart), we check whether comm_old is valid and whether the size of the new cartesian communicator comm_cart is larger than the size of the old communicator comm_old. Another example is the use of MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm), where we check whether the old communicator comm is valid and whether the new communicator newcomm is a subset of the old communicator.
• Groups: MARMOT checks whether the group is valid, i.e. whether it is MPI_GROUP_NULL or whether it has been created and registered. For example, when using MPI_Group_excl(MPI_Group group, int count, int *ranks, MPI_Group *newgroup) we check whether the old group group and the new group newgroup are valid, whether it is possible to employ the specified array ranks consisting of the integer ranks in group that are not to appear in newgroup, or whether the new group is identical to the old group.
• Datatypes: MARMOT checks whether the datatype is valid, i.e. whether it is MPI_DATATYPE_NULL or whether it has been created and registered properly. Another example of datatype checks showed up when an application was using MPI_SCATTERV and MPI_GATHERV with a user-defined datatype. MARMOT reported that this user-defined datatype contained holes. Such holes are not forbidden by the MPI standard, but removing them helped to improve the efficiency of communication.
• Operators: MARMOT checks whether the operation is valid, e.g. when using MPI_OP_FREE whether the operation to be freed exists or has already been freed, or when using MPI_REDUCE whether the operation is valid.
• Miscellaneous: Apart from communicators, groups, datatypes and operators, other parameters can be taken by MPI calls. For arguments such as tags or ranks, we check whether they lie within a valid range. These ranges can be different on different platforms. However, to write portable code developers should stay within the ranges guaranteed by the MPI standard. For example, when using MPI_SEND the validity of the count, datatype, rank, tag and communicator parameters is automatically checked. MARMOT also verifies if requests are handled properly within non-blocking send and receive or wait and test routines, e.g. if unregistered requests are used or if active requests are recycled
in calls like MPI_ISEND or MPI_IRECV. Wrong handling of requests may easily lead to deadlocks or to the abortion of the application. 3.3. Deadlocks A deadlock occurs when a process is blocked by the non-occurrence of something else. In general, deadlocks may have several causes. A process may execute a receive or probe routine without another process calling the corresponding send routine, or vice versa, or there may exist a send-receive cycle. The latter may depend on the MPI implementation, e.g. whether MPI_SEND is executed in buffered or synchronous mode. Improper use of collective routines or of requests can also lead to deadlocks [8]. MARMOT detects deadlocks in MPI programs by using the following mechanism. The additional debug process surveys the time each process waits in an MPI call. If this time exceeds a certain user-defined limit, the process is reported as pending. If all processes are pending, the debug process issues a deadlock warning. In the case of a deadlock, the last few MPI calls can be traced back for each process. It is also possible to detect a deadlock with a dependency graph analysis [7, 3]. However, this approach cannot detect a deadlock in the presence of a spin-loop and therefore has to be combined with a timeout approach [3]. The need for a central server to build and analyze the dependency graph also limits the scalability. Other approaches modify the application code to measure the timeout directly on the MPI processes without the need for a central server [8]. This approach has the advantage of scaling better. One disadvantage is the necessary source code modification, which makes the tool more difficult to implement, since the source code has to be parsed for all supported languages. The second problem is that it cannot be detected whether the timeout value is reached merely because a complex calculation is being done on a different process; that means a reasonable timeout value will depend on the application. The approach taken in MARMOT has the advantage that a reasonable timeout value only depends on the underlying MPI implementation and network and not on the application. The disadvantage is the limited scalability due to the existence of a central server. Also, deadlocks currently cannot be detected if the program is waiting using spin-loops.
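A minimal illustration of the implementation-dependent send-receive cycle mentioned above (written for this discussion, not taken from the paper): both processes issue a standard send before posting their receive. If the MPI library buffers the messages the exchange completes, but if MPI_Send behaves synchronously both processes block and the pattern deadlocks.

#include <mpi.h>

/* Exchange between two ranks; assumes exactly two processes. */
void exchange(int rank, double *out, double *in, int n)
{
    int partner = 1 - rank;
    MPI_Status status;

    /* Both ranks send first: deadlocks if MPI_Send does not return
     * before a matching receive is posted (synchronous behaviour). */
    MPI_Send(out, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(in, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);
}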
4. OVERHEAD ANALYSIS Attaching a tool like MARMOT to an application makes some overhead inevitable. To measure this impact on real applications and to test the scalability we chose the NAS Parallel Benchmarks Version 2.4 (NPB) [9, 10]. We selected the cg and is benchmarks to have examples both of C and of Fortran code. We also employed a ping-pong benchmark to see the impact of increasing message sizes on latency and bandwidth. These benchmarks were run in three different modes: only the native MPI was used, the native MPI was used with MARMOT and, third, the native MPI was used with MARMOT and the execution was serialized by MARMOT. The ping-pong benchmark displays the effect of increasing message size on the latency and the bandwidth, as can be seen in Figure 2. This test was performed on an IA64 cluster with Myrinet interconnect using mpich-1.2.5.10. The latency for short messages increases from 12 µs to 90 µs when MARMOT is used. The bandwidth for large messages is unaffected by the use of MARMOT. With serialization the latency for short messages rises to 5.8 ms and the bandwidth for long messages decreases from 223 MB/s to 137 MB/s.
Figure 2. Comparison of the latency and bandwidth for a ping-pong benchmark between the native MPI approach, the approach with MARMOT and the serialized approach with MARMOT.
Figure 3 shows the impact on the total number of Mops when using two of the NPB programs on an IA32 cluster with Myrinet interconnect and mpich-1.2.4.8a. The cg benchmark is a conjugate gradient method written in Fortran. We selected class B as it had a reasonable execution time of about 154 s on two processors of the target platform. Using class B, the size of the sparse matrix is 75 000, the number of nonzeroes per row is 13, the number of iterations is 75, and the number of processes has to be a power of two (plus one additional process for MARMOT). The is benchmark is an integer sort method written in C with size 33 554 432, doing 10 iterations on a power-of-two number of processes. We selected class B again, with an execution time of about 9 s on two processors.
Figure 3. Comparison of the total number of Mops for the NAS benchmarks cg.B and is.B between the native MPI approach, the approach with MARMOT and the serialized approach with MARMOT.
It is hardly a surprise that the performance is best when only using the native MPI and lowest when serializing the execution. In the latter case, the overhead of MARMOT adds to the overhead of serialization. Using MARMOT in a non-serializing way does affect performance, but not so much that MARMOT's usability is seriously affected. Scalability is still preserved when using MARMOT in a non-serialized way. 5. CONCLUSIONS AND FUTURE WORK
This paper has presented MARMOT, a tool to check during runtime whether an MPI application conforms to the MPI standard, i.e. whether resources like communicators, datatypes etc. and other parameters are handled correctly. Another problem with parallel programs is the occurrence of race conditions or deadlocks; examples of this can be found in Section 3. The benchmark results in Section 4 show that the use of MARMOT introduces a certain overhead, especially when the execution is serialized. As the development of MARMOT is still ongoing, the main emphasis has so far been on implementing its functionality and not on optimizing its performance. However, the benchmark results show that for applications with a reasonable communication-to-computation ratio the overhead of using MARMOT is below 20-50% on up to 16 processors. ACKNOWLEDGMENTS
The development of MARMOT is supported by the European Union through the IST-2001-32243 project "CrossGrid" [11]. REFERENCES
[1] Message Passing Interface Forum. MPI: A Message Passing Interface Standard, June 1995. http://www.mpi-forum.org/.
[2] Message Passing Interface Forum. MPI-2: Extensions to the Message Passing Interface, July 1997. http://www.mpi-forum.org/.
[3] J.S. Vetter and B.R. de Supinski. Dynamic Software Testing of MPI Applications with Umpire. In SC2000: High Performance Networking and Computing Conf., ACM/IEEE, 2000.
[4] Robert Hood. Debugging Computational Grid Programs with the Portable Parallel/Distributed Debugger (p2d2). In The NASA HPCC Annual Report for 1999. NASA, 1999. http://hpcc.arc.nasa.gov:80/reports/report99/99index.htm.
[5] Sue Reynolds. System software makes it easy. Insights Magazine, NASA, 2000. http://hpcc.arc.nasa.gov:80/insights/vol12/.
[6] William D. Gropp. Runtime Checking of Datatype Signatures in MPI. In Jack Dongarra, Peter Kacsuk and Norbert Podhorszki, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 160-167. Springer, 2000.
[7] Dieter Kranzlmüller. Event Graph Analysis for Debugging Massively Parallel Programs. PhD thesis, Joh. Kepler University Linz, Austria, 2000.
[8] Glenn Luecke, Yan Zou, James Coyle, Jim Hoekstra and Marina Kraeva. Deadlock Detection in MPI Programs. Accepted for publication in Concurrency and Computation: Practice and Experience.
500 [9] NAS Parallel Benchmarks. http://www.nas.nasa.gov/Software/NPB/ [10] R.F.van der Wijngaart. NAS Parallel Benchmarks Version 2.4. NAS Technical Report NAS-02-007. NASA Ames Research Center, Moffett Field, CA, 2002. [ 11 ] MARMOT. http://www.hlrs.de/organization/tsc/projects/marmot [12] DDT. The Distributed Debugging Tool. http://www, streamline-computing, com/softwaredivision_ 1.shtml [ 13] Totalview. http://www.etnus.com/Products/TotalView [14] http://www.nas.nasa.gov/Groups/Tools/Projects/P2D2 [15] mpigdb, http://www-unix.mcs.anl.gov/mpi/mpich/docs/userguide/node26.htm#Node29 [ 16] The GNU Project Debugger. http://www.gnu.org/manual/gdb
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) © 2004 Elsevier B.V. All rights reserved.
BenchIT - Performance Measurement and Comparison for Scientific Applications
G. Juckeland, S. Börner, M. Kluge, S. Kölling, W.E. Nagel, S. Pflüger, H. Röding, S. Seidl, T. William, and R. Wloch
Center for High Performance Computing, Dresden University of Technology, 01062 Dresden, Germany
INTRODUCTION
"Contrary to common belief, performance evaluation is an art." [1] With an increasing variety of operation fields from office applications to data-massive, high-performance computing with very different user demands, the programmer's know-how of program optimization, the choice of the compiler version, and the usage of the compiler options have an important influence on the runtime. Current and future microprocessors offer a variety of different levels of parallel processing in combination with an increasing number of intelligently organized functional units and a deeply staged memory hierarchy. Traditional benchmarks (e.g. [2, 3]) highlight only a few aspects of the performance behavior. Often computer architects, system designers, software developers and decision-makers want to have more detailed information about the performance of the whole system than only one or a few values of a performance metric. This paper introduces BenchIT, a tool created by the Center for High Performance Computing Dresden to accompany the performance evaluator. This "art" of performance evaluation actually contains two steps: performance measurement as well as data validation and comparison. BenchIT's modular design, therefore, consists of three layers (as shown in Figure 1): the measuring kernels, a main program for the measurements, and a web-based graphing engine to plot and compare the gathered data. The unique step in this project is the concept of splitting the evaluation into exactly the two steps mentioned above and thus being flexible enough to be used for any kind of performance measurement. The Center for High Performance Computing Dresden presents the established infrastructure for this project, which is designed to allow the HPC community easy access to a variety of performance measurements, easily extendable by own measurements and even, but especially, own measuring kernels.
Figure 1: Components of the BenchIT project.
1. MEASURING ENVIRONMENT
The BenchIT measuring environment is especially designed for the "hazardous" conditions on all kinds of measuring platforms. In reducing all varying factors on different machines, only two utilities are certain: a shell and a compiler. The BenchIT measuring environment deliberately reduces itself to use only those two to allow the highest compatibility. The environment on a certain operating system is set up by a number of cascading shell scripts compiling the measuring kernel, linking it to a main program and executing the measuring run. Some common definitions are placed in one small file named COMMONDEFS. This script provides the base name of the directory, the nodename, and the hostname of the machine as environmental variables used by the main program. The next file used by each kernel is the file ARCHDEFS, providing a basic set of system variables depending on the operating system on the machine. They look like the following:

if [ "${uname_minus_s}" = "Linux" ] ; then
  HAVE_CC=1
  HAVE_F77=1
  HAVE_F90=0
  HAVE_MPI=1
  ...
  CC="cc"
  CC_C_FLAGS="$CC_C_FLAGS"
  CC_C_FLAGS_STD="-O2 -Wall -Werror -Waggregate-return -Wcast-align"
  CC_C_FLAGS_HIGH="-O3"
  ...
  CC_LIB_PTHREAD="-lpthread"
  ...
These "default" values enable BenchIT to run on a "normal" installation of the operating systems included. Nevertheless, each user might want to set machine-specific variables. This is possible by defining a set of LOCALDEFS. The LOCALDEFS file is named after the nodename of the machine it is running on and holds exactly the same variables as already defined in the ARCHDEFS file, therefore allowing an easy customization. Additionally, the LOCALDEFS directory accommodates the two input-files for each node. They are named <nodename>_input_architecture and <nodename>_input_display and allow large sections of the output-file (see 2.1) to be filled in, since they are just copied into the output-files. The last part of the environment is made up of the variables used in the shell script of the kernel itself, which usually sets some kernel-specific values or overwrites already existing variables (from the ARCHDEFS or LOCALDEFS). 2. MODULE INTERFACES In between the three BenchIT program layers stand two interface files. They ensure that the modules have a common basis to work together. The result-file (also called output-file) is, after it has been created on the local machine, transferred to the BenchIT webserver. The file interface.h is used as a common basis in the compilation and linking of one measurement run. The following will provide a more detailed view of the two necessary and important interfaces.
2.1. The output-file A possible way to explain the results of a measuring kernel is to collect all the relevant data in a structured output file. This idea was realized in the BenchIT output-files saved in the subdirectory "output". They are coded in ASCII format for easy viewing and editing. The different parts of the structure are bounded by the keywords beginofxxxxx and endofxxxxx and are introduced in the following. Measurement information: This part of the output-file includes a kernel-string as a short description of the measuring kernel, for example "Fortran dot product", a timestamp, a comment, the programming language, the used compiler and its compiler flags, and minima and maxima for the x- and y-values. Additionally, the string code-sequence, for example "do i=1,n# sum=sum+x(i)*y(i)# enddo", shows the characteristic feature of this measuring program. Architecture: Important architectural statements are the node-name and the host-name. Output-files will not be accepted on the project homepage ([6]) without this information. A collection of architectural information was designed as a guideline for this part of the output-file, first to explain the measurement results and further to identify the machine the measurement ran on. The following characteristics are included (selection): mainboard manufacturer, mainboard type, mainboard chipset, processor name and clock rate, processor serial number, processor version, instruction set architecture and its level, several instruction set architecture extensions, processor clock rate, instruction length, processor word length, and the number of integer, floating point, and load-store units. The cache hierarchy is described by the sizes, organization and location. To characterize the memory system, information about the used memory chip type, memory bus type and clock rate is necessary. Display: This section holds all information needed to set up the plotting engine to display the results contained in the output file. This includes axis texts and labels for all measured functions, axis setup (linear or logarithmic), and the boundaries for the plotting range. Additionally, information from the sections Measurement Information and Architecture can be placed in the graph. Identifier-strings: This section is used to relate easily readable strings prepared for the web menu to all identifier-strings in the output-file, for example "ISA Extension" to the identifier-string processorisaextension2. Data: The measured physical values are stored in the data section in a 2-dimensional ordering: the first value per row is the x value, followed by y values depending on the number of measuring functions inside the kernel. Each new x value generates a new row. All values (integers or floating point numbers) are represented as ASCII coded decimal strings. The design of the output-files is not static. It is possible that additional parts will be inserted during the further development of the BenchIT project.
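To make the structure more tangible, a hypothetical fragment of such an output-file is sketched below; the keyword spellings, identifier names and values are illustrative assumptions rather than the exact BenchIT file format.

beginofmeasurementinfo
  kernelstring="Fortran dot product"
  date="2003-08-15"
  language="Fortran 77"
  compiler="f77"
  compilerflags="-O2"
  codesequence="do i=1,n# sum=sum+x(i)*y(i)# enddo"
endofmeasurementinfo
beginofarchitecture
  nodename="node01"
  hostname="node01.example.org"
  processorname="AMD Athlon XP"
  processorclockrate="1333 MHz"
endofarchitecture
beginofdisplay
  xaxistext="vector length"
  yaxistext="FLOPS"
  xaxislog=1
  yaxislog=1
endofdisplay
beginofdata
  1000 118000000
  2000 121000000
  4000 97000000
endofdata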
2.2. The file interface.h The two data acquisition layers of the BenchIT project are linked through the C header file interface.h. It defines an info structure, where a kernel provides information about itself. Furthermore, it specifies the functions called by the main program and service functions to be used by the kernels. Info structure: Some elements are used to fill out the output file, such as kernelstring, kernellibraries (e.g. PThread, MPI, BLAS), codesequence, axis texts and properties, and legend texts. The main program itself needs a few more details about the kernel, e.g. maxproblemsize, numfunctions, outlier_direction_upwards for error correction by the main program, and kernel_execs_XXX, which allow an adaption to the kind of parallelism the kernel wants to execute. Interface functions
The main program uses the functions bi_getinfo, bi_init, bi_entry, and bi_cleanup - first to inform itself about the kernel to run and to initialize the kernel, then to run the measurements for various problem sizes, and finally to clean up files and memory used by the kernel. Furthermore, the main program provides two tool functions - bi_gettime and bi_strdup. 3. MODULE COMPONENTS Having introduced the BenchIT module layer interfaces, the paper now turns the focus to the BenchIT modules themselves. BenchIT consists of three module layers: the kernels, the main program, and the website. Each layer offers different services, which will be presented together with the modules themselves in the following. 3.1. The kernels
Within this project a "kernel" is referred to as an algorithm or measuring program. Typical examples are a matrix multiplication or the Jacobi algorithm. Programming a kernel demands a certain discipline from the kernel author. Since BenchIT is to run on a variety of computation platforms, the kernel code has to be compatible with all of them. This can be best accomplished by: using only basic program structures, avoiding system calls and system-specific operations (if system calls become necessary, they will have to conform to the POSIX standard [4]), and utilizing the functions provided by the main program. The professed goal of the BenchIT team is to have every kernel distributed with BenchIT be executable on every platform. Nevertheless, it is possible and not valued less to write a problem-specific kernel. A typical use for this strategy might be the optimization of a certain algorithm on a specific target architecture. As of today the following kernels are included in the BenchIT package: MPI performance measurement (roundtrip message and binary-tree broadcast, programmed in C), performance measurement for the Jacobi algorithm (sequential in C and Java; parallel in Java using Java threads and in C using PThreads), matrix multiplication (sequential in C, Fortran 77, and Java; parallel in Fortran 77 using MPI), performance measurement for calculating the dot product for large vectors (sequential in Fortran 77; parallel in C using PThreads),
performance measurement for the mathematical operations sine, cosine, and square root (sequential in C, Java and Fortran 77), memory bandwidth (sequential in C), and I/O performance such as write rate and read rate for small and large files (parallel in C using PThreads). Every BenchIT user is also able, and asked, to act as an author of a kernel. A "custom" kernel can then be sent to the BenchIT team and will be taken into the kernel set if considered useful and complying with the kernel rules. 3.2. The main program
The first service module within the BenchIT layers is the main program for the measurement. It controls the generation of measurement data by the kernels, offers them service routines (see 2.2), and writes the result-file (see 2.1). The main program has to operate (just as the kernels) under a wide variety of system environments. However, the environment of the operating system is just one part of this variety. Another issue is the runtime environment. Since BenchIT supports, among others, MPI as a parallel environment, the main program has to adapt itself to that as well (in the case of MPI this is done by compiling the main program with the "-DUSE_MPI" option). One might argue that it would also be feasible to have different main programs for each runtime environment, yet the BenchIT designers considered it an unnecessary code redundancy, especially since so far using just one main file has been practicable. One measurement run follows the scheme shown in Figure 2. During the measurement the main program calls the kernel with a certain problem size. This is just an internal value and does not have to correspond to the actual measurement (the internal problem size might be the same as the external one, e.g. for a matrix multiplication, but it could also be scaled by a certain factor); the translation is done by the kernel. The main program also contains an error correction for the kernels, since performance differences during a measurement run for one problem size, due to other system processes running on the CPU, are inevitable. BenchIT thus uses the following approach: measure one problem size n times (n is set by the compiler option "-DERROR_CORRECTION=n - 1"). Each kernel informs the main program in the init routine whether the outliers of each function have to be expected upwards or downwards. BenchIT then uses the best value of the n runs. After measuring, the main program analyzes the gathered data. In this step minima and maxima are gathered and useful display boundaries are calculated. Furthermore, some environment variables (see 1) are gathered and the two computer-specific input-files are opened. With all this done, the main program then writes the output file (see 2.1) as well as a gnuplot file used by the local QUICKVIEW.
Figure 2: Schematic view of one measurement run (initialize program & kernel; measure one problem size; analyze data; write result- & quickview file).
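For illustration, a minimal kernel skeleton as seen by the main program might look like the following sketch. The bi_* function names come from the description above, but the exact signatures and the info-structure fields are assumptions made for this sketch and may differ from the real interface.h.

#include <stdlib.h>

/* Assumed shape of the info structure described in Section 2.2. */
typedef struct {
    char *kernelstring;              /* short description of the kernel  */
    int   maxproblemsize;            /* largest problem size measured    */
    int   numfunctions;              /* number of measured functions     */
    int   outlier_direction_upwards; /* 1: outliers are too-large values */
} bi_info;

extern double bi_gettime(void);      /* timing service of the main program */

static double *vec_a, *vec_b;

void bi_getinfo(bi_info *info)
{
    info->kernelstring = "C dot product (sketch)";
    info->maxproblemsize = 1000000;
    info->numfunctions = 1;
    info->outlier_direction_upwards = 1;
}

void *bi_init(int maxproblemsize)
{
    vec_a = malloc(maxproblemsize * sizeof(double));
    vec_b = malloc(maxproblemsize * sizeof(double));
    for (int i = 0; i < maxproblemsize; i++) {
        vec_a[i] = 1.0;
        vec_b[i] = 2.0;
    }
    return NULL;
}

/* Called by the main program once per problem size; fills one y value. */
int bi_entry(void *mem, int problemsize, double *results)
{
    double sum = 0.0, start = bi_gettime();
    for (int i = 0; i < problemsize; i++)
        sum += vec_a[i] * vec_b[i];
    results[0] = 2.0 * problemsize / (bi_gettime() - start);  /* FLOPS */
    return (int)sum;   /* keeps the loop from being optimized away */
}

void bi_cleanup(void *mem)
{
    free(vec_a);
    free(vec_b);
}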
3.3. The webserver The BenchIT web interface ([6]) complements the BenchIT project by giving the possibility to plot the results of the measuring kernels and compare them directly. It is the unique step in the project and allows access to all measurement data with just an internet browser.
3.3.1. Specification The webserver manages the output-files (see 2.1) uploaded by the registered users. They are held as ASCII files as well as entries in a PostgreSQL database. The PHP webpages use the database to assemble a plot and then write instructions for gnuplot ([5]), which produces an eps-file that can be downloaded directly (as done in Figure 3). Additionally, a JPEG image is created and displayed on the website.
Figure 3: The graph for a matrix multiplication (plotted against the matrix size).
It is specified that all kinds of measuring data can be displayed in one graph. The only limitation is that the data has to have one or the other unit (e.g. FLOPS, seconds, or a number of hits or misses), since gnuplot can display at most two different y-axes. Another important question to be answered is how the plots will be assembled and how the user can customize the plots. The BenchIT team has so far implemented two strategies:
Selection by architectural characteristics: The first possibility is to compare different values of one architectural feature. It is possible to show the sensitivity of the results of the measuring kernels to the physical size of one architectural feature. This way it is possible to look for specific performance data for a searched architectural feature and compare it to other architectures.
Selection by the measuring kernel: The second possibility compares different characteristics of architecture, which are all calculated by just one measuring kernel. It can be considered the "expressway" in the adaption of the plot result, since it is possible to customize a plot result with just three steps.
3.3.2. The construction of the BenchIT web interface The BenchIT web interface consists of two parts: an open and a restricted section. The measurement data is only accessible after registering on the website. This is also a security question, since it is, therefore, trackable who uploaded which output-file. At the moment only registered users can download the measurement program, because BenchIT is still in a status of development. New accounts will first be locked automatically and then unlocked by the web interface administrators. All output-files uploaded to the webserver are backed up on a daily basis, hence ensuring the availability of the data. Additional security measures are implemented, so that data classified as non-disclosure can be uploaded and only be viewed by one user or a group of users.
507 Guido@bluerabbit $ ./SUBDIREXEC.SH
~/benchit/src/kernel/matmul_c
No definitions for your operating system found in ARCHDEFS. You will have
to set them manually in your LOCALDEFS. BenchIT will not run without at
least one set of definitions...
No definitions for your operating system found in ARCHDEFS. You will have
to set them manually in your LOCALDEFS. BenchIT will not run without at
least one set of definitions...
Warning: the variable 'ENVIRONMENT' is not set, using NOTHING as default
BenchIT: Getting info about kernel... [ OK ]
BenchIT: Getting starting time... [ OK ]
BenchIT: Selected kernel: Matrix Multiply
BenchIT: Initializing kernel... [ OK ]
BenchIT: Allocating memory for results... [ OK ]
BenchIT: Measuring...
BenchIT: Total time limit reached. Stopping measurement.
BenchIT: Analyzing results... [ OK ]
BenchIT: Writing resultfile... [ OK ]
BenchIT: Wrote output to "matmul_c0_AmK7_1G33_2003_08_15_15_55.bit"
BenchIT: Writing quickview file... [ OK ]
BenchIT: Finishing... [ OK ]
rm: cannot unlink 'matmul_c': No such file or directory
Guido@bluerabbit ~/benchit/src/kernel/matmul_c $
Figure 4. Output of one measurement run. 4. FIRST RESULTS OF THE PROJECT The project has been running for one year now and most of the immediate goals have been achieved. The measurement (as shown in Figure 4) is so flexible that an adaption to a new platform is a matter of filling out one configuration file. The kernels run on all platforms with the necessary compilers and libraries. The webserver is well capable of administering and plotting the files. It has been especially designed to work without JavaScript to allow the greatest browser compatibility. After first attempts without a database to support the server in managing the result-files for plotting, it has been decided that a database for the arrangement of the plots is necessary to achieve acceptable response times on the website. 5. SUMMARY AND OUTLOOK The BenchIT kernels generate a large amount of measurement results, depending on the number of functional arguments. Using the web interface, the user is given the chance to show selected results of different measuring programs in only one coordinate system. Often there are different reasons that can cause characteristic minima, maxima or a special shape in a graph. It is necessary to collect additional information about the tested system to explain such effects on the basis of well-known system properties and physical values of the realization. The BenchIT project wants to provide such an evaluation platform by offering a variety of measurement kernels as well as an easily accessible plotting engine, thus enabling an easy way to measure performance on a specific system and to compare the result, which is a full graph instead of just a number, to other results contributed by other users. The further development of the BenchIT project will take place on all module layers. A GUI for the configuration of the measurements is under development - it will provide an easier way
to handle the measurements by partially substituting the shell scripts that run the measurements up to this point. The power of the PCL will be utilized to access more measurement data. Furthermore, an additional way to plot the data on the website by using Java applets and Java graphing tools is planned. The BenchIT project will not merely be just "another" tool in the art of performance analysis - it will prove to be a very powerful one. REFERENCES
[1] Raj Jain: The Art of Computer Systems Performance Analysis. John Wiley, Chichester, 1991.
[2] Standard Performance Evaluation Corporation (SPEC): http://www.spec.org/
[3] LINPACK: http://www.netlib.org/linpack/
[4] IEEE POSIX: http://standards.ieee.org/regauth/posix/
[5] Gnuplot: http://www.gnuplot.info
[6] The BenchIT Webserver: http://www.benchit.org
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) © 2004 Elsevier B.V. All rights reserved.
Performance Issues in the Implementation of the M-VIA Communication Software Ch. Fearing, D. Hickey, P.A. Wilsey, and K. Tomko, Experimental Computing Lab, University of Cincinnati, Cincinnati, Ohio 45221-0030, USA
The Modular-Virtual Interface Architecture (M-VIA) is a communication software suite developed by the National Energy Research Scientific Center. M-VIA is based on the Intel hardware-based VI standard. This software currently outperforms TCP in both bandwidth and latency. We are conducting a performance analysis of the M-VIA software to determine if there are opportunities to gain additional performance improvements for parallel applications. In particular, we are looking to further improve M-VIA by optimizing both the amount of time it waits for a given transaction and how efficiently that time is spent waiting on the local machine. In this work, we identify performance-critical regions of the M-VIA software and explore two modifications to the existing code that aim to improve local computer performance without affecting overall network performance in a cluster environment.
1. INTRODUCTION
Many Beowulf cluster implementations are created using expensive communication hardware such as Myrinet, Giganet, or VI. Unfortunately, such additions to Beowulf clusters can substantially increase the cost of building these clusters (sometimes doubling the cost). With the ever increasing reductions in costs for commodity processors, continued reliance on high cost networking hardware for speeding up fine-grained distributed applications is rapidly becoming infeasible. Thus, we must find alternative techniques that provide high performance communication (low latency and high bandwidth) with low-cost networking hardware, in particular low-cost fast and gigabit Ethernet hardware. M-VIA [1, 3] is a software implementation of the VIA [2] hardware standard developed by Intel, intended to run on standard Ethernet NICs. M-VIA is implemented with a modified device driver, a loadable kernel module, and a user-level library. Performance increases over standard TCP are accomplished using fast traps and other software techniques. M-VIA allows for both blocking and non-blocking operation through the use of wait and poll communication routines. In the remainder of this paper, we describe our experimental hardware, summarize our profile data, describe our proposed optimizations, and report the results we have obtained that demonstrate an improvement in overall M-VIA performance.
Figure 1. Profile data for Vnettest
Figure 2. Profile data for FDTD
2. INITIAL ANALYSIS
Our system architecture consists of a server (dual PentiumPro-180, 256 MB RAM, Tulip NIC) and 5 cluster nodes (4 dual PentiumPro-180s with 256 MB RAM, and 1 Pentium-233 with 64 MB). These are connected using a Cisco Catalyst 1900 10 Mbit/s switch. Each node uses a standard Tulip NIC. Our system is set up so that each node boots from the server using an NFS root file-system. Every system runs version 2.4.20 of the Linux kernel. Cluster communication is implemented using M-VIA, as well as MPI [6, 7] as a layer over M-VIA using MVICH [4]. We are using the Linux kernel profiling tools Oprofile (http://oprofile.sf.net) and Prospect (http://prospect.sf.net). Using these tools, we obtained performance results with two applications, namely: (i) vnettest, an M-VIA-aware network benchmark, and (ii) FDTD, a Finite Difference Time Domain simulation described in [5]. We briefly review the results of these studies in the next two subsections.
2.1. Vnettest
Vnettest is provided as an example M-VIA application used to measure latency and bandwidth between two cluster nodes. The program is extremely network intensive and is used to demonstrate performance for applications that rely heavily on the network. User profile results for this test case are shown in Figure 1.
2.2. FDTD
The FDTD application uses the MVICH implementation to provide MPI support over M-VIA. The application is used to simulate cosite interference in wireless communication among military ground vehicles. FDTD partitions a 3-D domain into subdomains that are processed in parallel to simulate electromagnetic properties of the transmitted communication signals and relies on MPI to exchange subdomain boundary values with other processes requiring the data for computation. The analysis of this application provides results more indicative of real-world usage. Performance results for this test case are shown in Figure 2.
2.3. Profile results summary
The results for both profiles indicated an excessive amount of time spent in the
functions VipCQWait() and VipCQDone() for FDTD and VipRecvWait() and VipRecvDone() for Vnettest. VipCQWait() and VipRecvWait() perform essentially the same operations on completion queues and receive queues respectively. Keeping this in
mind, the improvements made to the receive queue calls can also be applied to the completion queue portion of the M-VIA implementation. Upon further analysis, we came to the realization that VipCQWait() contains a static variable SpinCount that is used in a for loop to continuously call VipCQDone() until a VIP_SUCCESS is returned. This busy-looping is far from optimal, as a VIP_SUCCESS is not expected until after an amount of time equal to or greater than the round-trip latency between nodes. After a time specified by the round-trip latency, the communicating node should expect the data to be available and should then make a call to VipCQDone(), the non-blocking receive queue check.
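To make the structure of this busy-wait concrete, the following is a minimal C sketch of the polling pattern described above. It is not the actual M-VIA source: poll_completion_queue() stands in for the real VipCQDone()/VipRecvDone() calls, and the names, signatures and spin-count value are assumptions made only for illustration.

```c
#define VIP_SUCCESS 0
#define SPIN_COUNT  1000   /* illustrative value; M-VIA's SpinCount is internal */

/* Stand-in for the non-blocking completion check (VipCQDone()/VipRecvDone()). */
int poll_completion_queue(void *queue);

/* Busy-wait as described in the text: spin on the non-blocking check until it
 * succeeds, consuming CPU for at least one round-trip latency. */
int wait_busy(void *queue)
{
    for (;;) {
        for (int i = 0; i < SPIN_COUNT; i++) {
            if (poll_completion_queue(queue) == VIP_SUCCESS)
                return VIP_SUCCESS;
        }
        /* The unmodified code keeps spinning here; the proposals in Section 3
         * replace this spot with a timed sleep or a signal-based wait. */
    }
}
```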
3. TWO PROPOSALS FOR PERFORMANCE IMPROVEMENT
We propose two optimizations for this configuration to allow the cluster node to perform useful computation during the round-trip wait time without affecting communication performance for the existing system. The first proposal optimizes the wait time by postponing a VipCQDone() call until after a lower bound of the round-trip latency has been reached and using timers to relinquish control of the CPU to other processes that have actual work to do. The second proposal optimizes the wait time by postponing a VipCQDone() call until the user application receives a signal from the device driver, based on the interrupt handler identifying an M-VIA packet bound to that application. These two proposals will be described in more detail in the following sections and the results obtained after implementation will be presented.
3.1. Implement timer-based sleep model
With the advancement of high precision clocks available at the kernel level, it is possible to measure time in tens of microseconds. Using a fine-grained clock to measure the round-trip latency during the initial connection phase, a lower bound on the network delay can be established such that it would be highly unlikely to receive a VIP_SUCCESS status during this interval. To determine the sleep time interval, a small benchmark was inserted into the initial connection phase for each M-VIA host. This benchmark measured the time for a send-and-acknowledge cycle to take place for packet sizes of 1 KB and 2 KB. The time taken for packets of these sizes was used to determine an equation that could be used to establish the expected latency given the size of a packet. After establishing an equation for the expected latency, a nanosleep call was inserted into the VipRecvWait function with the computed value as the sleep time. To our disappointment, the precision of the nanosleep call was inadequate for the latency values we had established, and several attempts at finding a more suitable sleep call were unsuccessful.
3.2. Implement signal-based sleep model
Our next attempt to improve M-VIA was to take a more asynchronous approach. Our plan was to put a process to sleep for a time and, when a packet arrives, have the kernel send the signal SIGCONT to the sleeping process. This is accomplished with several changes. First, the signal sending needs to be set up inside the kernel module. This is accomplished by creating a pid variable in the structure VIPK_VT within vipk.h. This allows the kernel structure associated with a VIA connection to contain the process id that created the connection. After this, the newly created pid must be set. Setting the pid is done within the function VipkCreateVi() inside vipkops.c. This stores the current running process' pid inside
the VI connection being created. Our assumption is that the process setting up the communication by calling VipkCreateVi() will be at the top of the scheduler when it calls the kernel module. Since this kernel module code is uninterruptible, the pid of the process at the top of the scheduler is still valid when it is stored. The next kernel change is to actually send the signal once a packet has arrived and is validated as a VIA packet. The signal is sent from within VipkViRecvComplete inside vipk.h. This is the last function called once the packet is received. The function kill_proc_info is used because it accepts a signal and a pid as arguments. This is much simpler to deal with than the task_struct necessary for using send_sig. After the kernel signal-sending is set up, the user-level process needs to go to sleep and be ready to receive the signal. This is accomplished by using nanosleep. For longer periods, nanosleep sends a process to sleep for a specified amount of time. If a signal is sent to that process during a sleep, the process wakes up. Our nanosleep is placed in VipRecvWait and VipSendWait, within the loop that was previously simply a busy-wait. This code is compiled into libvipl.a. Libvipl.a is user-level and compiled into any VIA application, including MPICH programs. Once this is set up, a simple one-way communication structure has been created. This has been tested to work nearly as fast as the previous busy-loop style. It has been shown to improve user time by nearly thirty percent, depending on the amount of node-to-node communication. The saved user time can then be turned into real time when other processes are contending for CPU time. It should be noted that there is a convergence of the performance graphs where the overhead of dealing with a sleeping process and sending a signal can take the same or even more time than busy-waiting for a small packet. Our benefits are intended for packet sizes greater than 2000 bytes.
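A condensed C sketch of the user-level side of this scheme is shown below. It is illustrative only, under stated assumptions: poll_completion_queue() again stands in for the real VipRecvDone()/VipCQDone() calls, the sleep slice is a placeholder, and installing a no-op handler for the wake-up signal is our addition (it guarantees that the signal interrupts nanosleep rather than being ignored).

```c
#include <errno.h>
#include <signal.h>
#include <time.h>

#define VIP_SUCCESS 0

/* Stand-in for the non-blocking completion check (VipRecvDone()/VipCQDone()). */
int poll_completion_queue(void *queue);

static void wake_handler(int sig) { (void)sig; /* only purpose: interrupt nanosleep */ }

/* Sleep-based wait: instead of spinning, sleep in short slices; an arriving
 * packet wakes the process early because the kernel's receive-complete path
 * signals it (nanosleep() then returns -1 with errno == EINTR). */
int wait_with_signal(void *queue, long slice_ns)
{
    struct timespec req = { 0, slice_ns };   /* slice_ns assumed < 1e9 */

    signal(SIGCONT, wake_handler);           /* assumption: SIGCONT is the wake-up signal */
    while (poll_completion_queue(queue) != VIP_SUCCESS) {
        if (nanosleep(&req, NULL) == -1 && errno != EINTR)
            return -1;                       /* unexpected error */
    }
    return VIP_SUCCESS;
}
```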
4. RESULTS
The following charts show how, by using signals and the nanosleep call, user wait time is nearly eliminated. The charts also show how the gained user time is then given to the kernel compile and vnettest. Our implementation has created a relationship between newly gained user time from an M-VIA application and faster total completion time for any other application running on the cluster node. This shows that the time saved by being more efficient can be utilized for an overall optimal implementation. The improvement that signals create is only realized when the time to transmit a packet is larger than the time spent sleeping. In our experiments, this size was around 2000 bytes. The version of FDTD we used had smaller packet sizes and showed no improvement. In order to show improvement we needed a constant source of CPU activity running while our improved M-VIA was running. This is why we focused on vnettest and a kernel compile. Figure 3 shows the initial timing results for running Vnettest and a 2.4.20 kernel compile concurrently. The results provide a baseline indicating the time taken with the original M-VIA implementation to complete a computationally intensive task. Figure 4 shows the timing results after modifying the M-VIA code to implement the signal-based solution discussed above. The results indicate that a significant amount of time is returned to the CPU during network communication.
Figure 3. Timing data for Vnettest and kernel compile on original M-VIA
Figure 4. Timing data for Vnettest and kernel compile on new M-VIA
Figure 5 provides a baseline for the same test using a one gigahertz Pentium III computer as the destination node for Vnettest and compiling the kernel on the original server. This trade-off was made to eliminate the additional traffic that would be generated for a kernel compile over the NFS mount.
Figure 5. Timing data for Vnettest and kernel compile
Figure 6 presents the final timing results for our tests. Using a newer cluster, the time-savings relationship is confirmed. Even though the M-VIA communication is nearly 10 times faster, the one minute of user time gained is seen in a faster completion time for the kernel compile (by a minute).
5. CONCLUSIONS
Once user time has been saved while using VIA communication, there are many uses for this time. File copies utilize larger packet sizes and would be a good example of a network process taking place while the user continues working. The FDTD simulation used in this paper also has more complex variants that use SPICE-like simulation to support modeling of a wide variety of receiver circuits instead of restricting the simulation to a specific receiver model hardcoded in the application. Thus the SPICE model could run as a separate thread on a processor containing a receiver. This creates more tasks to do locally and is thus a prime candidate for utilizing the extra CPU cycles that result from not busy-waiting. For M-VIA to be more than just a strict cluster node communication protocol and serve as a more complete local area network protocol, better efficiency is very useful.
Figure 6. Timing data for Vnettest and kernel compile
The signal implementation introduces a new lower barrier related to the time it takes to handle a signal and to scheduler-introduced delays. Transmission times for small packets can actually be less than this new delay. An improvement to our implementation would be a simple lower boundary where the introduced delay meets the packet transmission size. This would require benchmarking. On our cluster, the lower bound appears to be 2 kilobytes, as found from the latency and bandwidth graphs generated by vnettest. Other than this simple fix, M-VIA is now a very locally simple and small application. Further speed improvements would need to take place at the communication layer with attention to VI compliance.
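As an illustration of that proposed fix, a wait routine could dispatch on the expected transfer size, busy-waiting below the measured crossover and sleeping above it. This is our own sketch, not the authors' code; the crossover constant and helper names are assumptions, and wait_busy()/wait_with_signal() refer to the sketches given earlier.

```c
#include <stddef.h>

#define CROSSOVER_BYTES 2000   /* ~2 KB crossover observed on the authors' cluster */

int wait_busy(void *queue);                       /* busy-poll sketch above    */
int wait_with_signal(void *queue, long slice_ns); /* sleep/signal sketch above */

/* Pick the cheaper waiting strategy for the expected transfer size. */
int wait_for_completion(void *queue, size_t packet_bytes, long slice_ns)
{
    if (packet_bytes < CROSSOVER_BYTES)
        return wait_busy(queue);             /* small packet: sleep overhead dominates */
    return wait_with_signal(queue, slice_ns);
}
```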
REFERENCES
[1] P. Bozeman and B. Saphir. A modular high performance implementation of the virtual interface architecture. In Extreme Linux Conference (1999).
[2] D. Cameron and G. Regnier. The Virtual Interface Architecture. Intel Press, 2002.
[3] National Energy Research Scientific Computing Center (NERSC). M-VIA: A high performance modular VIA for Linux. http://www.nersc.gov/research/FTG/via/.
[4] National Energy Research Scientific Computing Center (NERSC). MVICH: MPI for virtual interface architecture.
[5] P. Czarnul, S. Venkatasubramanian, C. D. Sarris, S. Hung, D. Chun, K. Tomko, E. S. Davidson, L. P. B. Katehi, and B. Perlman. Locality enhancement and parallelization of a finite difference time domain simulation. In HPCMO Users Group Conference (2001).
[6] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1994.
[7] P. S. Pacheco. Parallel Programming with MPI. Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1997.
Performance and performance counters on the Itanium 2 - A benchmarking case study
U. Andersson a, P. Ekman a, and P. Oster a
aCenter for Parallel Computers, Royal Institute of Technology, SE-100 44 Stockholm, Sweden
We study the performance of the Itanium 2 processor on a number of benchmarks from computational electromagnetics. In detail, we show how the hardware performance counters of the Itanium 2 can be used to analyze the behavior of a kernel code, Yee_bench, for the FDTD method. We also present performance results for a parallel FDTD code on a cluster of HP rx2600 Intel Itanium 2 based workstations. Finally, we give results from a benchmark suite of an industrial time-domain code based on the FDTD method. All these results show that the Itanium 2/rx2600 is very well suited to the FDTD method due to its high main memory bandwidth.
1. INTRODUCTION
PDC¹, the major Swedish center for parallel computers, has recently purchased a ninety-node cluster. Each node is an HP rx2600 server (zx1 chipset) with two 900 MHz Itanium 2 processors and six GB of 266 MHz DDR SDRAM memory. We conducted extensive benchmarking prior to this purchase and will here present a performance analysis of one of the code packages used during this evaluation, the GEMS time-domain codes. Even though poor serial performance makes it easier to get good speed-up, it is of course extremely important to optimize your serial code version in order to get good overall performance of your parallel code version. Hence, we will spend the majority of this paper studying the performance of the serial code version using the hardware performance counters of the Itanium 2. We will show how these can be used to pinpoint the major performance bottleneck, which is main memory access latency. In all our computations, we use 64-bit precision.
2. THE GEMS CODES
The General ElectroMagnetic Solvers (GEMS) project was a code development project at the Swedish center of excellence PSCI² (Parallel and Scientific Computing Institute). The aim of the GEMS project was to develop codes that were state-of-the-art on the international level and form a platform for future development by Swedish industry and academia. The GEMS codes are continuously developed within new PSCI projects. The GEMS3 development phase has just started.
¹http://www.pdc.kth.se/
²http://www.psci.kth.se/
The GEMS codes consist of two parts, the frequency-domain codes and a hybrid time-domain code GemsTD (aka frida). We will here solely deal with the time-domain code and concentrate on the structured grid part of it. GemsTD combines the finite-difference method on structured Cartesian grids with the finite-element method on unstructured tetrahedral grids into a stable second-order hybrid method [5]. It is written in standard conforming Fortran 90. The core of the finite-difference time-domain (FDTD) method in 3D consists of two triple-nested do loops for the leap-frog update of the electromagnetic field variables. Added to this core, GemsTD contains implementations of advanced absorbing boundary conditions (split and unsplit perfectly matched layers), incident plane-wave generation, near-to-far-field transforms (a time-domain transform, a frequency-domain transform and a continuous-wave transform) and much more [8].
3. Yee_bench
Yee_bench implements the core of the FDTD method, i.e., the triple-nested do-loops for the leap-frog update. It assumes a cubic computational domain and loops over the problem size N = Nx = Ny = Nz. The memory usage for 64-bit precision is 24N^3 + 24(N+1)^3 bytes, where the two terms refer to the magnetic and the electric field components respectively. We use no padding when we allocate the fields. Hence we expect poor performance whenever N or N+1 is a power of two. Yee_bench is a floating-point benchmark developed at PDC. It is carefully described in [2].
3.1. Performance
Results for Yee_bench using 64-bit precision on an HP rx2600 server with Itanium 2 processors running at 900 MHz with a 1.5 MB L3 cache are given in Figure 1. We have used the Intel compiler efc Version 7.1 Build 20030814 (version 7.1-31) and the options -O3 -ftz. We know that this code is memory bound (see [2]) on most architectures, and that the performance (flop/s) is limited by

    P_h = (1/4) B_M,                                                           (1)

where B_M is the (main) memory bandwidth (byte/s). Eq. (1) is only valid for large problem sizes. This is in accordance with the fact that performance for large problem sizes is more relevant than performance for small problem sizes in real applications. To reach the upper limit set by Eq. (1), we must have an extremely fast cache that is large enough to hold an entire slice of the 3D domain. This means that the cache should be larger than 8 · 10 · Nx · Ny = 80N^2 bytes. We measure B_M using the stream2 DAXPY benchmark. For the rx2600 server, the result converges to B_M = 5.0 billion byte/s for large problems, thus giving P_h = 1.25 Gflop/s. The theoretical maximum main memory bandwidth of the rx2600 server is 6.4 GB/s. Note the effect of the L2 cache around N = 30, and the effect of the L3 cache after N = 150. As predicted, we get poor performance for N = 255 and N = 256. Table 1 displays results for Yee_bench for several different computers. It is included in order to put the performance numbers of the Itanium 2 into context. It is beyond the scope of this paper to discuss these performance numbers. See [2] for more details on the IBM performance numbers.
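For reference, the kernel that Yee_bench times has the following overall structure. This is a C sketch of the update of a single magnetic-field component (the benchmark itself is written in Fortran 90 and updates all six field components); the coefficient names and array shapes are assumptions made only for illustration.

```c
/* Leap-frog update of Hx from Ey and Ez on a cubic N x N x N Yee grid (one of
 * the triple-nested loops described in the text).  Cy and Cz are precomputed
 * update coefficients, e.g. dt/(mu*dy) and dt/(mu*dz). */
void update_hx(int n, double Hx[n][n][n],
               const double Ey[n + 1][n + 1][n + 1],
               const double Ez[n + 1][n + 1][n + 1],
               double Cy, double Cz)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                Hx[i][j][k] += Cz * (Ey[i][j][k + 1] - Ey[i][j][k])
                             - Cy * (Ez[i][j + 1][k] - Ez[i][j][k]);
}
```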
Table 1
Single CPU results for Yee_bench. Average performance for large problem sizes is given.
Processor            IBM pwr4           IBM pwr3          Itanium 2       AMD Opteron
f (GHz)              1.1                0.375             0.9             1.8
Compiler             xlf90_r 8.1.0.1    xlf90 7.1.1.3     efc 7.1-31      pgf90 5.0-1
BM (10^9 byte/s)     1.8                1.2               5.0             2.9
Perf. (Gflop/s)      0.36               0.28              0.98            0.52
Figure 1. Yee_bench results on an HP rx2600 server.
3.2. Using the performance counters
For a test case with problem size N = 170 we have a total memory usage of 227 MB. Examining the assembly code we see that one iteration needs thirteen bundles to complete. The prolog (and the epilog) consists of four iterations (see Chapter 2 in [4] for an explanation of the terms prolog and epilog). Eighteen floating-point operations are performed in these bundles. Maximum performance on our system for Yee_bench is therefore 900 · 18/7 · 170/174 = 2.26 Gflop/s. The actual performance seen on our system for this problem size is around 991 Mflop/s, i.e., 80% of the value predicted by Eq. (1) and 44% of 2.26 Gflop/s. We will try to analyze the behavior of the code using the hardware performance counters in the Itanium 2 processor. The Performance Monitoring Unit (PMU) in the Itanium 2 processor enables the user to collect a wealth of information concerning the behavior of a code [6]. This spans from cache misses and cycle-accurate instruction counts to branch traces. We access the PMU under Linux through a program called pfmon that builds on the libpfm API [10]. For detailed cycle accounting and many other performance measurements on the Itanium 2 [7] we have to rely on derived events, metrics that are composed by a relation between several, more specific, hardware events. A simple example is the IPC, instructions executed per clock cycle, which is a derived event that consists of the relation IA64_INST_RETIRED_THIS / CPU_CYCLES. The composition of the derived events used here is described in the i2prof.pl script [9]. The Itanium 2 PMU can only count at most four events at a time. Often, especially when measuring derived events, the code that is being measured will therefore have to be executed several times when more than four hardware events are needed. It is then important that the code behaves in roughly the same manner each time
it is executed to collect measurements, or hardware events from different executions will fail to correlate accurately. It is also good if the code runs for some time so that transient event perturbations are smoothed over. The test case ran for 128 iterations to smooth out the dynamic variations in the event counts. The code was executed in batch sixteen times to collect the necessary performance event counts. Performance as measured by the code itself (only measuring the kernel loops, ignoring startup costs) varied from 987.32 Mflop/s to 993.85 Mflop/s, for a maximal 0.7% performance variation between executions. The average performance was 991.61 Mflop/s. The following execution efficiency metrics were recorded:

  Useful instructions/cycle     2.054      NOPs/instruction                0.217
  FLOP/cycle                    1.107      Undispersed instructions/cycle  0.006
  NOPs/cycle                    0.568      Loads/store                     3.314
  Instructions/cycle            2.622      Total stalls                    0.557
  Main memory bandwidth used    5.282

The hardware recorded 1.107 flop/cycle for 996.3 Mflop/s on our 900 MHz processor. Main memory bandwidth consumed was 5.282 bytes/cycle or 4753.8 MB/s. The processor issues two instructions/cycle on average, which is only 1/3 of peak. Examining the source code we see that in each iteration we need 12 fmas, 3 stores and 10 loads, i.e., we want an IPC of at least 4. The CPU is stalling 56% of the execution time, which is clearly a major performance limiter. A breakdown of stall reasons as ratios to the total number of CPU stall cycles gives us:

  Data cache stalls             0.259      RSE stalls                          0.000
  Branch mispredict stalls      0.000      Integer register dependency stalls  0.000
  Instruction cache stalls      0.000      Support register dependency stalls  0.000
  FPU stalls                    0.740      Sum                                 1.000

Only data cache and FPU stalls contribute measurably to the total stall count. Data cache stalls include data load dependencies in the integer part of the CPU, TLB misses, failed speculative loads and resource starvation in the cache hierarchy. Breaking down the data cache stall reasons, the recorded categories are: virtual memory stalls (TLB + HPW), store related stalls, L2 capacity stalls, failed speculative load penalty stalls, L2 bank conflicts ratio, integer load latency stalls, and L2 recirculation stalls. Virtual memory overhead has a rather small impact on performance with the configured page size (16 kB). L2 capacity stalls are caused when the L1D (or the FPU) issues a request to the L2 cache and there are not enough resources available in the L2 to allocate the request. This happens when the queuing structure (called OZQ) that makes the L2 non-blocking and out-of-order is either full or blocked. This, in turn, should only happen when there are conflicts in the L2 cache that cause the request latency to increase. A common cause for latency-increasing L2 conflicts are bank conflicts in the L2 data arrays. The arrays are sixteen bytes wide, and simultaneous requests to the same bank either cause the OZQ to block or the first request to be cancelled and then reissued, which incurs a four-cycle penalty. Around 11% of all L2 cache references cause a bank conflict that results in a request being cancelled in this code. L2 recirculation stalls are listed separately since they overlap with L2 capacity stalls. L2 requests that cannot be determined as a clear hit or miss (such as when a request hits a cache line that is already in the process of being fetched from memory) are recirculated to the tag lookup stage
until the request can be clearly determined as a hit or miss. FPU stalls include register-register dependencies as well as load dependencies in the floating-point unit. There is unfortunately no simple way of separating these two stall reasons from each other. Since the code is supposedly memory bound, we take a look at the cache hierarchy performance:

  L1I hit rate            1.000      L2 hit rate     0.963
  L1I prefetch hit rate   0.935      L2D hit rate    0.963
  L2I hit rate            0.667      L3 hit rate     0.071
  L1D hit rate            0.890      L3D hit rate    0.072

The L1D cache is not accessed much since floating-point loads and stores bypass it and this code is very floating-point intensive. The L2 cache has a fair hit rate. At this problem size the innermost loop of the kernel covers 170 · 10 · 8 bytes of data (13 kB), which fits easily in the 256 kB L2 cache. This means that in each iteration we hope that two of the ten loads will always be an L2 cache hit. The other eight loads will give a cache miss if they refer to a value that is first in a cache line, i.e., they will have a cache hit rate of 15/16 since each 128-byte L2 cache line contains sixteen 64-bit floating-point values. The three stores should always give an L2 cache hit. We get an expected cache hit rate of (2 + 8 · 15/16 + 3)/13 = 25/26 ≈ 0.962, which agrees well with the measured L2D hit rate value of 0.963. The L3 cache has a very low hit rate. The middle loop covers 170^2 · 10 · 8 bytes of data (2.2 MB), which is slightly larger than the 1.5 MB of L3 cache. This means that we may get an L3D cache hit for two of our eight loads. The remaining six loads are compulsory cache misses since it is the first time during this update that these values are used. Hence we will get an L3D hit rate below 25%. Jarp [7] describes a formula for relating cache event counts to stall cycles:

    Stall cycles = L2 data hits · N1 + L3 data hits · N2 + L3 data misses · N3 + L2DTLB misses · N4,    (2)

where the factors Ni are determined by the latency in cycles of the respective memory hierarchy level. Using the factors proposed in [7], N1 = 2, N2 = 10, N3 = 150, N4 = 30, and the measured counts

  L2 data hits     17.0 · 10^9      L3 data misses   0.6 · 10^9
  L3 data hits     0.05 · 10^9      L2DTLB misses    0.004 · 10^9
  Stall cycles     11.3 · 10^9

we get 17 · 2 + 0.05 · 10 + 0.6 · 150 + 0.004 · 30 = 34 + 0.5 + 90 + 0.12 = 124.62 · 10^9 cycles. This is eleven times more than the total number of measured stall cycles; clearly the compiler does a good job of hiding memory access latencies. The formula still gives us an indication of the proportion each cache level contributes to the memory access stall cycles. The L3 cache is being almost totally thrashed, with L2 contributing around 27% and main memory 72% of the memory access latency cycles. It does not seem unreasonable to assume that L2 and main memory accesses influence data dependency stall cycles in roughly the same proportion. As we have seen, we have no major stall cycle contributor other than data cache accesses and stalls caused by the FPU, and they match the distribution of the latency cycles (26/74 versus 27/72) closely.
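For the record, the arithmetic behind Eq. (2) and the percentages quoted above can be reproduced in a few lines; the snippet below is our own illustration (not part of the benchmark), using the latency factors from [7] and the event counts listed above.

```c
#include <stdio.h>

int main(void)
{
    /* Latency factors from [7] (cycles per event). */
    const double N1 = 2, N2 = 10, N3 = 150, N4 = 30;
    /* Measured counts, in units of 1e9 events. */
    const double l2_hits = 17.0, l3_hits = 0.05, l3_misses = 0.6, l2dtlb = 0.004;

    double est = l2_hits * N1 + l3_hits * N2 + l3_misses * N3 + l2dtlb * N4;
    printf("estimated latency cycles: %.2f e9\n", est);   /* ~124.62 */
    printf("L2 / L3 / memory shares : %.0f%% / %.1f%% / %.0f%%\n",
           100 * l2_hits * N1 / est,                      /* ~27%  */
           100 * l3_hits * N2 / est,                      /* ~0.4% */
           100 * l3_misses * N3 / est);                   /* ~72%  */
    return 0;
}
```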
Examining the code for the computational kernel ([2], Appendix B) we see that there are no RAW dependencies between iterations, which, in combination with the analysis above, leads us to conclude that the FPU stalls are for the most part caused by main memory access latencies.
4. pscyee_MPI
pscyee_MPI is the parallel version of Yee_bench. It has been parallelized with MPI. Since leap-frog is an explicit timestepping scheme and we have a Cartesian grid, it is relatively straightforward to parallelize it using domain decomposition (see Chapter 6 in [1]). We let the MPI routine MPI_DIMS_CREATE decide the domain decomposition. However, since this routine returns values in decreasing order, we flip these values to increasing order in order to adapt to the store-by-column rule in Fortran. In all cases we start only one process per node, i.e., use only one CPU on each node. pscyee_MPI contains about ten different implementations of the communication, using different MPI routines. The results presented here use MPI_IRECV and MPI_SSEND. Figure 2a) displays the scale-up performance of pscyee_MPI on cubical computational domains. With scale-up, we mean that the problem size is increased linearly with the number of processes (p). The dips in performance occur when p is a prime number. In this case we are forced to split the computational domain into slices instead of bricks, which means that more data must be exchanged between the processes. We have used version 7.1-31 of the Intel Fortran compiler efc and the compiler options -O3 -ftz -fno-alias. All nodes of the PDC cluster are connected through a Myrinet-2000 128-port switch. M3F-PCIXD-2 cards are used. Measured network latency is 6.3 microseconds and the bidirectional network adapter peak bandwidth is 489 MB/s (248 MB/s unidirectional). The results labeled OSC are from a cluster of HP zx6000 Intel Itanium 2 based workstations at the Ohio Supercomputer Center (OSC). Each node has two 900 MHz Itanium 2 processors and at least 4 GB of memory. The nodes are connected using Myrinet, a switched 2.0 GB/s network. We have used version 7.0-82 of the Intel Fortran compiler efc and the compiler options -O3 -ftz -fno-alias. Scale-up investigations are relevant since the major reason for using a parallel computer for an FDTD computation would be to get access to more memory. A typical simulation would run for a day or two and use all the available memory. However, speed-up is not irrelevant and therefore we present speed-up results in Figure 2b). With speed-up, we mean that the size of the computational domain is kept fixed independently of the number of processes. Again, the dips in performance occur mainly when p is a prime number. As explained above, it is impossible to make an efficient domain decomposition, under the constraint that all parts shall be of (almost) the same size, using exactly p processes. Obviously, the best way to use 47 nodes is to discard two of them and split the computational domain onto the remaining 45 nodes.
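The decomposition step described above can be sketched in a few lines of C with MPI (the benchmark itself is Fortran); the reversal of the dims array, so that the smallest factor ends up in the first, contiguously stored dimension, is the detail mentioned in the text, and the helper name is ours.

```c
#include <mpi.h>

/* Let MPI factor the process count into a 3-D grid, then flip the result from
 * decreasing to increasing order to match Fortran's column-major storage. */
static void make_decomposition(int nprocs, int dims[3])
{
    dims[0] = dims[1] = dims[2] = 0;    /* let MPI choose all factors          */
    MPI_Dims_create(nprocs, 3, dims);   /* returns factors in non-increasing order */

    int tmp = dims[0];                  /* reverse to increasing order         */
    dims[0] = dims[2];
    dims[2] = tmp;
}
```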
5. GemsTD
The benchmarking suite of GemsTD is described in [3]. Timings from seven different cases, all with N = 80, are combined into a weighted sum, t_w, in order to include the timings of subroutines from a number of different modules for absorbing boundary conditions, incident plane-wave generation and near-to-far-field transforms. The time to do the leap-frog update is excluded from this weighted sum. Hence this benchmark measures how the compiler and
architectures perform on more intricate Fortran 90 code. Results for 64-bit precision are given in Table 2.

Figure 2. a) pscyee_MPI scale-up performance for cubical computational domains; memory usage per process is approximately 3.0 GB. b) pscyee_MPI speed-up performance for a cubical computational domain (N = 404); memory usage is approximately 3.0 GB. (Both panels compare PDC (efc 7.1-31) and OSC (efc 7.0-82) results against ideal scale-up/speed-up, plotted versus the number of processes.)
Table 2
Timings for the GemsTD benchmark suite.
CPU              IBM pwr2       IBM pwr3       Itanium 2      AMD Opteron
f (GHz)          0.16           0.375          0.9            1.8
Peak (Gflop/s)   0.64           1.5            3.6            3.6
Compiler         xlf 6.1.0.3    xlf 8.1.0.3    efc 7.1-16     pgf90 5.0-1
t_w (s)          523            456            129            181
t_w · Peak       335 · 10^9     684 · 10^9     464 · 10^9     688 · 10^9
We see that the Itanium 2 gives a considerable improvement over the older IBM CPUs. In Table 2 we also present a very crude efficiency number by computing t_w · Peak. The Intel compiler efc does a fairly good job here considering the fact that it is a rather new compiler. We would also like to mention that the pwr3 would probably have had a lower t_w · Peak value if version 7 of the IBM compiler xlf had been used. However, this version was no longer available at PDC when the final version of this benchmark suite was constructed. The Itanium 2 has the same peak performance as the AMD Opteron. The Itanium 2 is about 30% faster than the AMD Opteron on this benchmark, i.e., it is more efficient. On the other hand, it is also more expensive.
REFERENCES
[1] U. Andersson. Time-Domain Methods for the Maxwell Equations. PhD thesis, KTH, February 2001. Available at http://media.lib.kth.se:8080/dissfull.asp.
[2] U. Andersson. Yee_bench - A PDC benchmark code. TRITA-PDC 2002:1, KTH, November 2002. Available at http://www.nada.kth.se/~ulfa/.
[3] U. Andersson. The GemsTD benchmark suite. TRITA-PDC 2003:2, KTH, October 2003. Available at http://www.nada.kth.se/~ulfa/.
[4] M. Cornea, J. Harrison, and P. T. P. Tang. Scientific Computing on Itanium-based Systems. Intel Press, 2002.
[5] F. Edelvik and G. Ledfelt. A comparison of time-domain hybrid solvers for complex scattering problems. International Journal of Numerical Modeling, 15(5), September/October 2002.
[6] Intel Corporation. Intel Itanium 2 Processor Reference Manual For Software Development and Optimization, 002 edition, April 2003.
[7] S. Jarp. A methodology for using Itanium 2 performance counters for bottleneck analysis. Technical report, HP Labs, August 2002.
[8] A. Taflove. Computational Electrodynamics: The Finite-Difference Time-Domain Method. Artech House, Boston, MA, second edition, 2000.
[9] i2prof.pl website, http://www.pdc.kth.se/~pek/ia64/. Valid Aug 14, 2003.
[10] Perfmon website, http://www.hpl.hp.com/research/linux/perfmon/. Valid July 1, 2003.
On the parallel prediction of the RNA secondary structure*
F. Almeida a, R. Andonov b†, L.M. Moreno a, V. Poirriez c, M. Pérez a, and C. Rodriguez a
aD.E.I.O.C., La Laguna, Spain, (falmeida, lmmoreno, casiano)@ull.es
bINRIA/IRISA, Rennes, France, [email protected]
cLAMIH/ROI, Valenciennes, France, [email protected]
We present the parallelization of the Vienna Package algorithm that predicts the secondary structure of RNA. Before the tool can be effectively used it must be tuned on the target architecture. This tuning consists of finding the tile sizes for optimal executions. We have developed an analytical model to find these optimal tile sizes. The validity of the analytical model developed for the parallel algorithms presented in [1] is studied through an extensive set of experiments performed on an Origin 3000. We present a statistical model to deal with the cases where the hypotheses of the analytical model are not satisfied. The algorithms are satisfactorily applied to real sequences.
INTRODUCTION
The Vienna RNA Package [2, 3] consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures. RNA secondary structure prediction through energy minimization is the most used function in the package. It provides three kinds of dynamic programming algorithms for structure prediction: the minimum free energy algorithm of [4], which yields a single optimal structure, the partition function algorithm of [5], which calculates base pair probabilities in the thermodynamic ensemble, and the suboptimal folding algorithm of [6], which generates all suboptimal structures within a given energy range of the optimal energy. For secondary structure comparison, the package contains several measures of distance (dissimilarities) using either string alignment or tree editing [7]. Finally, the package provides an algorithm to design sequences with a predefined structure (inverse folding). The code can be included in user programs or executed as stand-alone programs. The package is free software and can be downloaded (http://www.tbi.univie.ac.at/~ivo/RNA/) as C source code that should be easy to compile on almost any flavor of Unix and Linux. Starting from the Vienna RNA package (RNAfold routine) we have obtained parallel versions of the minimum free energy algorithm of [4]. We have implemented the horizontal and
*Authors are listed in alphabetical order. This work has been partially supported by the EC (FEDER) and the Spanish MCyT (Plan Nacional de I+D+I, TIC2002-04498-C05-05 and TIC2002-04400-C03-03).
†On leave from LAMIH/ROI, Université de Valenciennes
vertical strategies to traverse the iteration space that we presented in [1]. The parallel versions make use of the standard message passing interface MPI. They have been tested on a shared memory machine, an Origin 3000. The I/O interface for the parallel code is the same as the interface of the RNA Package. The parallel approaches considered are summarized in section 1. In section 2 we develop a broad computational experience. The analytical prediction tool presented in section 1 is validated through randomly generated molecules. A statistical model is also proposed for the cases where the analytical model proved to be non-effective. The experience ends by applying the tools to real sequences. We conclude the paper in section 3 giving some remarks and future lines of work.
1. PARALLEL ALGORITHMS
In [1] we approached the RNA base pairing problem following the dynamic programming algorithm presented by [8]. The approach introduced in [8] is also an energy minimization algorithm that can be considered a simplified version of the more general algorithm implemented by the Vienna package. However, in terms of recurrences both algorithms can be included in a general family that appears when applying dynamic programming to the shortest paths, optimal binary search tree and optimal chained matrix multiplication parenthesization problems [9, 10, 11]. We addressed in [1] the problem of optimal tiling for this set of recurrences. The iteration space is n × n triangular and the dependences are nonuniform. We give solutions for a ring of p processors. Two traverse methods of the iteration domain have been considered, the horizontal and the vertical traverse. We determine analytic formulations of the execution time of our programs, and formulate the discrete non-linear optimization problems of finding the optimal size of the tile. The tiling approach is applied to the algorithm domains by considering rectangular tiles where each tile contains x rows of the iteration domain points and there are y points in each row. We call the integers x and y the tile height and width respectively. In our parallel scheme we chose a block distribution of tile-columns to processors (i.e., the tile width is fixed to y = n/p²) and we proved that this provides an optimal solution in the dominant cases. For the sake of convenience we also set m = px and p̂ = p − 1. Let us denote by x*_h the analytical solution for the optimal tile size that minimizes the horizontal traverse (respectively x*_v for the vertical traverse). Deriving and taking into account the boundary conditions we arrive at the following analytical expressions:
the optimal tile heights x*_h and x*_v are given by closed-form piecewise expressions in n, p and the machine parameters α, β and τ (three cases each, selected by boundary conditions); they are labeled Eq. (1) for the horizontal traverse and Eq. (2) for the vertical traverse, and their derivation can be found in [1]. Here α is the time to execute a single instance of the loop body, β is the synchronization parameter and τ is the network bandwidth.
Since the domain size of the vertical (respectively horizontal) traverse is p × x (respectively n × x), we obtain that for both traverses the optimal tile volume v* is the same. This result simply means that when

    p p̂ β > 2τp + α                                                            (3)

the tile size (y*, x*) = (n/p², v* · p²/n) is one of the infinitely many solutions giving the optimal tile
volume v*. In the case of the RNA constants α, β, τ, we obtain that inequality (3) is true for the dominant cases.
2. COMPUTATIONAL EXPERIENCE
The parallel approaches outlined in section 1 have been implemented in the fold routine of the Vienna package using the standard message passing library MPI [12]. The width y and height x of the tiling procedure have been left as input parameters of the parallel module. Our methodology was to conduct empirical studies and compare the results from these against the predicted behavior obtained by the analytical modeling of the application. The codes have been run on an SGI Origin 3800 with 600 MHz R14000 processors. To avoid perturbations as a consequence of a non-exclusive execution mode, each instance was executed five times. All the tables show the minimum time of these five executions. The times are expressed in seconds. Random RNA chains have been generated for sizes 256, 512, 1024, 2048, 4096. A wide set of parallel executions has been performed varying x ∈ [1, 20] and y ∈ [1, n/p] such that (n/p) mod y = 0. Table 1 summarizes the best running times obtained for the two parallel versions, gathering the best values obtained from executions that consider every feasible value of (y, x). An inappropriate selection of the parameters (y, x) could produce running times much larger than the times appearing in the tables. Note that the best parameter y* experimentally attained is the same for both the horizontal and vertical algorithms. Regarding the performance of the algorithms, it must be pointed out that no significant differences have been observed; the speedups achieved are similar in both cases. The experimental optimal value of y behaves according to the equation established in [1], y = n/p², for p ∈ {8, 16}. It also holds for p = 4 processors when the problem is small (n = 256, 512). The analytical formulas provided in section 1 should be able to predict optimal x* sizes for these cases.
2.1. The asymptotic case
The parameters β and τ have been measured using a linear regression fit from a ping-pong test for different sizes. The evaluation of the parameter α deserves special consideration. Although, for the sake of simplicity, in section 1 the parameter α has been considered a constant, in fact it varies with (i, j), and a proper evaluation of this parameter constitutes a laborious task by itself. We have considered α as a constant for a given problem size. We approximated it by the average of the running time of the sequential algorithm over the whole set of points, i.e., α(n) = T(n)/(n²/2), where T(n) denotes the time spent computing the triangular dependence matrix, i.e., the execution time of the sequential algorithm.
Table 1
Experimental Optimal Tiling with Horizontal and Vertical Traverses.
                           Horizontal                   Vertical
  p     n      y      x     Time     speedup      x     Time     speedup
  2    256    32      6     0.16      1.61        8     0.16      1.64
  2    512    32     14     0.75      1.63       12     0.72      1.69
  2   1024    32     22     3.94      1.52       20     3.85      1.55
  2   2048   128     14    20.18      1.51       13    19.80      1.54
  2   4096   256      7   114.98      1.91        3   114.52      1.92
  4    256    16      5     0.10      2.68        6     0.09      2.75
  4    512    32      8     0.44      2.81        2     0.42      2.89
  4   1024    32     14     2.18      2.74       11     2.10      2.84
  4   2048    64      9    12.02      2.54       12    11.52      2.65
  4   4096   128      5    66.11      3.32        9    65.54      3.35
  8    256     4      4     0.09      2.84       11     0.07      3.51
  8    512     8      5     0.34      3.57        6     0.32      3.80
  8   1024    16      3     1.46      4.09        4     1.45      4.12
  8   2048    32     13     7.31      4.17        4     7.74      3.94
  8   4096    64      9    41.70      5.26        8    42.69      5.14
 16   1024     4      1     1.57      3.79        6     1.49      4.01
 16   2048     8      9     7.99      3.82       11     7.08      4.31
 16   4096    16      4    41.23      5.32        6    42.00      5.22
The running time T(n) of our current sequential version can be approached through the linear fit of the sequential executions. We have obtained that

    T(n) = 1.479e-09 · n^3 + 4.199e-06 · n^2 - 7.472e-06 · n - 1.516e-02    if n < 1024
    T(n) = 2.201e-09 · n^3 + 7.552e-06 · n^2 - 2.038e-02 · n + 1.992e+01    otherwise        (4)
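As a small illustration of how the model uses these measurements, the per-point cost α(n) described in section 2.1 can be evaluated directly from the fit in Eq. (4). The helper below is our own sketch; the division by n²/2 reflects the roughly n²/2 points of the triangular domain, as reconstructed above.

```c
/* Sequential-time fit from Eq. (4) and the derived per-point cost alpha(n). */
static double t_fit(double n)
{
    if (n < 1024.0)
        return 1.479e-09*n*n*n + 4.199e-06*n*n - 7.472e-06*n - 1.516e-02;
    return 2.201e-09*n*n*n + 7.552e-06*n*n - 2.038e-02*n + 1.992e+01;
}

static double alpha(double n)
{
    return t_fit(n) / (n * n / 2.0);   /* average cost per iteration-space point */
}
```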
We checked that inequality (3) holds for every instance on p = 4, 8, 16 processors, i.e., for the values attained experimentally in Table 1. Nevertheless, we report predicted versus measured results for every test instance. Table 2 shows the optimal parameters obtained by the model and the relative error made (H stands for the Horizontal traverse and V for the Vertical one). The errors behave according to the theoretical predictions for the range of validity of the equations (p = 4, 8, 16), with the exception of the large error percentage obtained for the n = 1024 instance. As an additional validation experiment we decided to observe the behavior of the algorithm for values of n not included in the intensive computational study that was summarized in Table 2. Table 3 presents the results for these two cases: n = 3200 and n = 8192. The values of y and x were obtained from equations (1) and (2). The resulting speedups behave according to the progression of the optimal speedups observed in Table 1.
2.2. The small cases
This section is devoted to the resolution of the tiling problem for those situations where the hypotheses of the model are not satisfied. Table 1 holds useful information that can be used to
develop statistical models providing predictive capabilities. In particular, linear regression can
Table 2
Optimal Tiling through the Analytical and Statistical Prediction. H stands for the Horizontal traverse and V for the Vertical one.
                Analytical Prediction                      Statistical Prediction
  p     n      y    x   %Error-H   %Error-V        y    x-H   %Error-H    x-V   %Error-V
  2    256    64    1     1.04       0.48         16    14      4.46       14     5.39
  2    512   128    1    40.38      39.06         32    14      0          14     1.95
  2   1024   256    1    35.37      35.35         64    13      2.34       14     7.96
  2   2048   512    1    33.49      48.59        128    12      0.43       10     0.25
  2   4096  1024    1    60.66      59.64        256    10      0.98        6     0.06
  4    256    16    2     2.19       3.41         16     9      5.6         6     0
  4    512    32    2     3.25       0            32     9      2.1         7    16.64
  4   1024    64    2    16.29      14.75         32     9      2.31        7     2.57
  4   2048   128    2    11.01      13.65         64     8      0.1         9     3.44
  4   4096   256    1     6.99       8.8         128     7      0.62       11     0.01
  8    256     4    5    13.39       1.08          4     5     13.39        7     0.01
  8    512     8    5     0          3.02          8     5      0           7    12.77
  8   1024    16    4    24.31       0            16     6      4.14        7    12.93
  8   2048    32    4     6.52       0            32     8      5.55        7     4.52
  8   4096    64    3     3.35       9.24         64    11      6.82        6     3.26
 16   1024     4    9    90.37      18.39          4     4     22.04        8    14.71
 16   2048     8    8     3.08       2.58          8     4      5.03        8     2.58
 16   4096    16    6    11.05       0            16     5      4.63        7     7.39
Table 3
Optimal Tiling through Analytical and Statistical Prediction.

  Analytical Prediction
                       Horizontal            Vertical
  p      n      y      x    Speedup      x    Speedup
  4   3200    200      2     2.95        2     2.90
  4   8192    512      1     2.47        2     2.55
  8   3200     50      3     4.41        3     4.71
  8   8192    128      1     5.05        1     4.71

  Statistical Prediction
                       Horizontal            Vertical
  p      n      y      x    Speedup      x    Speedup
  2   3200    200     11     1.79        8     1.80
  2   8192    512      5     1.63        1     1.62
  4   3200    100      7     2.81       10     3.08
  4   8192    256      4     2.79       16     2.68
  8   3200     50      9     4.37        6     4.64
  8   8192    128     18     4.12        6     5.37
be developed over y* and x*, so that we can obtain for any processor count equations of the form y* = a · n + b (a is the slope and b the intercept); analogous equations can be obtained for x*. The assumption of a linear behavior of x* and y* on the problem input
Table 4
Statistical Prediction of (y*, x*). Horizontal and Vertical Traverses.
              y                          x (Horizontal)                x (Vertical)
  p    intercept      slope        intercept       slope         intercept       slope
  2    -2.66667       0.06216      14.458333      -0.001171      14.666667      -0.002184
  4     9.33333       0.02839       9.2500000     -0.0006615      6.083333       0.001208
  8    -5.507e-15     1.563e-02     4.166667       0.001659       6.833333      -0.000147
 16    -2.072e-15     3.906e-03     3.5000000      0.0004883      8.5000000     -0.0003488
size n, once the machine is fixed (i.e., p, α, β and τ are fixed), agrees with equations (1) and (2). Table 4 summarizes these equations. Table 2 collects the parameters (y*, x*) given by this approach and the error made; we can see that the error is small in most of the cases. Table 3 shows the speedups obtained with the (y*, x*) calculated from the linear regression equations for the problems of size n = 3200, 8192. Again, they show speedups similar to those obtained in the former cases considering the optimal experimental parameters.
2.3. Real cases
We deal now with the application of the former methodologies to non-synthetic RNA molecules. We have downloaded several molecules from http://rrna.uia.ac.be/ssu/, the European Ribosomal RNA database. They correspond to different molecule sizes. The molecules are Bacillus Anthracis (the anthrax bacillus, n = 1506), Bacillus Cereus (a bacterium causing food poisoning, n = 2782), Simkania Negevensis (a bacterial contaminant related to respiratory infection, n = 2950) and Acidianus Brierleyi (tested for its ability to oxidize pyrite and to grow autotrophically on pyrite, n = 3046). Table 5 depicts the sequential times obtained for them. We computed the optimal tiling through both the Analytical and Statistical Predictions. Table 5 summarizes the results obtained from the parallel executions. The algorithm used was the horizontal version, where the processors synchronize at the end of the macro-rectangle computations, so that y* does not necessarily have to be a divisor of n/p. The accuracy of the statistical model is slightly better for the considered cases. Since the evaluation of the slope and intercept for the statistical model implies performing a set of additional experiments, the analytical model can be the option for the asymptotic case.
3. CONCLUSIONS AND FUTURE REMARKS
The parallelization of a completely functional tool to predict the secondary structure of RNA has been presented. Before the tool can be effectively used it must be tuned on the target architecture. The validity of the analytical model developed for the parallel algorithms presented in [1] has been studied through an extensive set of experiments. We have presented a statistical model to deal with those cases where the hypotheses of the analytical model are not satisfied. The algorithms have been applied to real sequences. The speedups, although satisfactory, do not scale for some of those instances, i.e., p > 8. This is probably due to the use of a common static tile size. We plan to introduce a dynamic variable tile size on every macro-rectangle.
Table 5
Optimal Tiling through Analytical and Statistical Prediction for non-synthetic RNA molecules.

Bacillus Anthracis, n = 1506
                 Analytical                       Statistical
  p     Time      y     x   Speedup       Time      y     x   Speedup
  2    12.75    376     1    1.11         9.22    105    14    1.53
  4     6.36     94     2    2.22         6.50     50     5    2.17
  8     4.15     24     4    3.40         4.12     24     5    3.42
 16     6.52      6     9    2.16         5.48      6     4    2.57

Bacillus Cereus, n = 2782
                 Analytical                       Statistical
  p     Time      y     x   Speedup       Time      y     x   Speedup
  2    58.20    695     1    1.21        39.63    199    12    1.78
  4    28.72    174     2    2.46        25.81     83     7    2.73
  8    18.06     43     4    3.91        17.81     43     5    3.96
 16    19.33     11     7    3.65        17.89     11     5    3.95

Simkania Negevensis, n = 2950
                 Analytical                       Statistical
  p     Time      y     x   Speedup       Time      y     x   Speedup
  2    68.87    737     1    1.22        46.10    212    11    1.82
  4    32.66    184     2    2.57        33.14     88     7    2.53
  8    19.62     46     3    4.28        21.74     46     6    3.86
 16    19.55     12     7    4.29        21.03     12     5    3.99

Acidianus Brierleyi, n = 3046
                 Analytical                       Statistical
  p     Time      y     x   Speedup       Time      y     x   Speedup
  2    73.58    761     1    1.26        48.51    219    11    1.90
  4    34.42    190     2    2.68        32.88     90     7    2.81
  8    21.32     46     3    4.33        23.84     48     6    3.88
 16    21.78     12     7    4.24        20.23     12     5    4.57
REFERENCES
[1] F. Almeida, R. Andonov, D. Gonzalez, L. Moreno, V. Poirriez, C. Rodriguez. Optimal tiling for the RNA base pairing problem. In: 14th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2002, Winnipeg, Manitoba, Canada.
[2] I. L. Hofacker, W. Fontana, P. F. Stadler, L. S. Bonhoeffer, M. Tacker, P. Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie 125 (1994) 167-188.
[3] M. Fekete, I. L. Hofacker, P. F. Stadler. Prediction of RNA base pairing probabilities using massively parallel computers. J. of Comp. Biol. 7 (2000) 171-182.
[4] M. Zuker, P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research 9 (1981) 133-148.
[5] J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29 (1990) 1105-1119.
[6] S. Wuchty, W. Fontana, I. Hofacker, P. Schuster. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49 (1999) 145-165.
[7] B. Shapiro, K. Zhang. Comparing multiple RNA secondary structures using tree comparisons. Comput. Appl. Biosci. 6 (1990) 309-318.
[8] J. Setubal, J. Meidanis. Introduction to Computational Molecular Biology. PWS Publishing Company, 1997, Ch. 8.
[9] Z. Galil, K. Park. Parallel algorithms for dynamic programming recurrences with more than O(1) dependency. Journal of Parallel and Distributed Computing 21 (1994) 213-222.
[10] A. Gibbons, W. Rytter. Efficient Parallel Algorithms. Cambridge University Press, 1988, Ch. 3.6.
[11] C. Rodriguez, D. González, F. Almeida, J. Roda, F. Garcia. Parallel algorithms for polyadic problems. In: Proceedings of the 5th Euromicro Workshop on Parallel and Distributed Processing, 1997, pp. 394-400.
[12] The MPI standard, http://www.mpi-forum.org/.
Clusters
MDICE - a MATLAB Toolbox for Efficient Cluster Computing
R. Pfarrhofer a, P. Bachhiesl a, M. Kelz a, H. Stögner a, and A. Uhl b
aCarinthia Tech Institute, School of Telematics & Network Engineering, Primoschgasse 8, A-9020 Klagenfurt, Austria
bSalzburg University, Department of Scientific Computing, Jakob-Haringerstr. 2, A-5020 Salzburg, Austria
A MATLAB-based toolbox for efficient computing on homogeneous and heterogeneous Windows PC networks is introduced. The approach does not require a MATLAB client installed on the participating machines and allows other users to employ the involved machines as desktops. Experiments involving a Monte Carlo simulation demonstrate the efficiency and real-world usability of the approach.
1. INTRODUCTION
MATLAB has established itself as the numerical computing environment of choice on uniprocessors for a large number of engineers and scientists. For many scientific applications, the desired levels of performance are only obtainable on parallel or distributed computing platforms. With the emergence of cluster computing and the potential availability of HPC systems in many universities and companies, the demand for a solution to employ MATLAB on such systems is obvious. A comprehensive and up-to-date overview of high performance MATLAB systems is given by the "Parallel MATLAB Survey" at http://supertech.ics.mit.edu/~cly/survey.html. Several systems may be downloaded from ftp://ftp.mathworks.com/pub/contrib/v5/tools/ and also from http://www.mathtools.net/MATLAB/Parallel/. There are basically three distinct ways to use MATLAB on HPC architectures:
1. Developing a high performance interpreter
(a) Message passing: communication routines usually based on MPI or PVM are provided. These systems normally require users to add parallel instructions to MATLAB code [1, 6, 7].
(b) "Embarrassingly parallel": routines to split up work among multiple MATLAB sessions are provided in order to support coarse-grained parallelization. Note that the PARMATLAB and TCPIP toolboxes our own development is based upon fall under this category.
2. Calling high performance numerical libraries: parallelizing libraries like e.g. SCALAPACK are called by the MATLAB code [9]. Note that parallelism is restricted within the
536 library and higher level parallelism present at algorithm level cannot be exploited with this approach. 3. Compiling MATLAB to another language (e.g. C, HPF) which executes on HPC systems: the idea is to compile MATLAB scripts to native parallel code [2, 3, 8]. This approach often suffers from complex type/shape analysis issues. Note that using a high performance interpreter usually requires multiple MATLAB clients whereas the use of numerical libraries only requires one MATLAB client. The compiling approach often does not require even a single MATLAB client. On the other hand, the use of numerical libraries and compiling to native parallel code is often restricted to dedicated parallel architectures like multicomputers or multiprocessors, whereas high performance interpreters can be easily used in any kind of HPC environment. This situation also motivates the development of our custom high performance MATLAB environment: since our target HPC systems are (heterogenous) PC clusters running a Windows system based on the NT architecture, we are restricted to the high performance interpreter approach. However, running a MATLAB client on each PC is expensive in terms of licensing fees and computational resources. Consequently, our aim in this work is to develop a high performance interpreter which requires one MATLAB client for distributed execution only. In section 2, we present the fundamentals of our development MDICE. Section 3 describes the basics of an application from the area of numerical mathematics (Monte Carlo simulation) and discusses the respective experimental results applying MDICE. Section 4 concludes the paper. 2. " M D I C E " - A TOOLBOX F O R EFFICIENT MATLAB CLUSTER COMPUTING MATLAB-based Distributed Computing Environment (MDICE) is based on the PARMATLAB and TCPIP toolboxes. The PARMATLAB toolbox supports coarse grained parallelization and distributes processes among MATLAB clients over the intranet/internet. Note that each of these clients must be running a MATLAB daemon to be accessed. The communication within PARMATLAB is performed via the TCPIP toolbox. Both toolboxes may be accessed at the Mathworks ftp-server (referenced in the last section) in the directories p a r m a t l a b and t cp i p , respectively. However, in order to meet the goal to get along with a single MATLAB client the PARMATLAB toolbox needs to be significantly revised. The main idea is to change the client in a way that it can be compiled to a standalone application. At the server, jobs are created and the solve routine is compiled to a program library (*.dll). The program library and the datasets for the job are sent to the client. The client is running as background service on a computer with low priority. For this reason the involved client machines may be used as desktop machines by other users during the computation (however, this causes the need for a dynamic load balancing approach of course). This client calls over a predefined standard routine the program library with the variables sent by the server and sends the output of the routine back to the server. After the receipt of all solutions the server defragments them to an overall solution. The client-server approach is visualized in Fig. 1. In order to implement this concept, compilation limitations and restrictions need to be considered as follows:
537
Sercer+ MATLAB
~:~:i~':i~~wi''!~"i.....
NT background s e r v i c e ! worke~, exe i : .................... z :
....... TASK + ~ t t , ~ g e ; !
~par.kages !
{
T A S K
i [c olmut] C~mpii6"~" i
"
package~
i k?~ ?::}-~_?}::=~2- :~:.i'!}:.i :: result
9
.:
. .
.......... . .
:
.....
Figure 1. Client-server concept of MDICE. 9
Built-in MATLAB functions cannot be compiled. However, most of these functions are available since they are contained in the MATLAB Math C/C++ Library. This library can be transferred to an arbitrary number of clients without the payment of additional licence fees which enables the execution of corresponding code without running MATLAB on the clients. About 70 functions (e.g. d i a r y , h e l p , whos, etc.) are not supported which need to be replaced by custom code if required by the application. 9 The arguments of l o a d , s a v e , and e x i s t compile-time.
(usually file names) need to be known at
9 The arguments of e v a l , i n p u t , and f e v a l (usually data variables) need to be known at compile-time. For example, f eval is used in the PARMATLAB toolbox to evaluate the function residing at the client using the arguments sent by the server. It is avoided by sending the *.dll library to the client which has a fixed interface ( f u n _ t a s k ) for each configuration of input and output variables. The following example illustrates the replacement of the functions e v a l and d i a r y . The code is taken from the file w o r k e r , m where the result is being sent back to the server after the computation has been done. Original
PARMATLAB
Code:
%%%% S E N D A R G U M E N T S disp (' S e n d i n g o u t p u t a r g u m e n t s ' ) s e n d v a r (ip_f id, hostname) for i=l :func. a r g o u t eval(['sendvar(ip fid,' . . . 'argo' int2str(i) ') ']) end
MDICE _ _
Code:
%%%% SEND A R G U M E N T S displog('Sending output arguments') ; s e n d v a r (ip_fid, hostname) ; for i=l :f u n c . a r g o u t , s e n d v a r ( i p _ f i d , v a r _ a r g o u t (i) .data) ; end;
In order to facilitate compilation, the functions disp and eval need to be replaced since d i a r y (which uses d i s p output) can not be compiled and e v a 1 requires arguments known at compile-time. Instead of d i s p we introduce the custom function d i s p l o g which may additionally write data to a log-file besides displaying it. This is especially important if the
538 client is operated in background mode. In the original code, the outgoing arguments of the function are stored in several variables (argo l, argo2, etc.) and the s e n d v a r command is constructed dynamically using e v a 1 . Since this can not be compiled in this form, the results are stored in a variable of type "struct array". The communication functionalities of the PARMATLAB toolbox have been extended as well. For example, in case of fault-prone file transmission (e.g. no space left on the clients' hard disk) the server is immediately notified about the failure. The underlying TCPIP toolbox requires all data subject to transmission to be converted into strings. For large amounts of data this is fairly inefficient in terms of memory demand and computational effort. In this case, we store the data as MAT-file, compress it (since these files are organized rather inefficiently), and finally convert it into strings. After the computation is finished and the result has been sent, a new job may be processed by the client. Note that the *.dll library and constant variables do not have to be resent since the client informs the server about its status. MDICE does not support any means of automatic parallelization or automatic data distribution. The user has to specifiy how the computations and the associated data have to be distributed among the clients. The same is true of course for the underlying PARMATLAB toolbox. 3. APPLICATIONS AND E X P E R I M E N T S The computational tasks of the applications subject to distributed processing are split into a certain number of equally sized jobs N to be distributed by the server among the M clients (usually N > M). Whenever a client has sent back its result to the server after the initial distribution of M jobs to M clients, the server assigns another job to this idle client until all N jobs are computed. This approach is denoted "asynchronous single task pool method" [5] and facilitates dynamic load balancing in case of N >> M. The computing infrastructure consists of the server machine (1.99 GHz Intel Pentium 4, 504 MB RAM, Windows XP Prof.) and two types of client machines (996 MHz Intel Pentium 3, 128 MB RAM, and 730 MHz Intel Pentium 3, 128 MB RAM, both types under Windows XP Prof.). The Network is 100 MBit/s Ethernet. In order to demonstrate the flexibility of our approach, we present results in "homogenous" and "heterogenous" environments. In the case of the homogenous environment, we use client machines of the faster type only, the results of the heterogenous environment correspond to six 996 MHz and four 730 MHz clients, respectively. Note that the sequential reference execution times used to compute speedup have been achieved on a 996 MHz client machine with a compiled (not interpreted) version of the application to allow a fair comparison since the client code is compiled as well in the distributed application. We use MATLAB 6.5.0 with the MATLAB compiler 3.0 and the LCC C compiler 2.4. 3.1. Numerical mathematics: Monte Carlo simulation This problem is known as part of the ArgeSim comparison of parallel simulation techniques 1. A damped second order mass-spring system is described by the equation
m!i(t) + kx(t) + d2(t) - 0 1http://argesim.tuwien.ac.at/comparisons/cp1/cpl.html
(1)
539 with x'(0) = 0, x(0) = 0.1, k = 9000, and m = 450. The damping factor d should be chosen as a random quantity uniformly distributed in the interval [800, 1200]. The task is to perform a couple of simulation runs and to calculate and store the average responses over the time interval [0, 2] for the motion x(t) with step size 0.0005. This can be trivially distributed by generating different random quantities d on different clients. In order to solve the differential equation for a single value of d (which is done on each client), we use the Runge-Kutta-NystrSm algorithm [4, p. 960]. Using this approach, a second order ordinary differential equation (ODE) does not need to be decomposed into a system of ODEs of order one. For a given initial value problem of second order
[l = f (t,y,~t);
y(t=
to) = Yo;
(2)
~t(t=t0)=?)0
we compute in each iteration step of the fourth order Runge-Kutta-NystrSm algorithm the following experessions:
h (k~ + k2 + k3), y~§
-
1 ~ + 5(kl + 2k~ + 2k~ + k~),
ti+l
=
ti + h h
h
h
h
h
where ]~1 = -~f(ti, Yi, Yi), ]~2 -- ~f(ti § ~ , Y i + -~]i-~- ~kl,Yi-~- kl), ]% = ~f(ti § h Yi + h 9 + -g h k l, y~ + k2) , and k4 = ~hf ( t i + h, Yi + h~]i + hk3, {1i + 2k3). For each d, we iterate this -iYi procedure 4001 times.
3.2. Experimental results Monte Carlo algorithms are well known to allow for straightforward parallelism. In our case, the partial solution for x(t) within each job is computed on the clients, consequently the server only needs to average N (number of jobs) partial results after the clients have finished their work. We first discuss the homogeneous environment. Fig. 2.a shows the speedup of this application when varying the problemsize (# of iterations which denotes the number of random quantities d used in the entire computation) and keeping the number of jobs distributed among the clients fixed (30). Sequential execution time is 452, 902, and 1803 seconds for 1680, 3360, and 6720 iterations, respectively. As it is expected, speedup increases with increasing problemsize due to the improved computation/communication ratio. However, also the plateaus resulting from load distribution problems are more pronounced for larger problem size (e.g., 30 jobs may not be efficiently distributed among 14 clients whereas this is obviously possible for 10 clients). Fig. 2.b shows a visualization of this phenonemon during the distributed execution, where black areas represent computation time-intervals, gray areas communication events, and white areas idle times. The plateaus in speedup can be avoided in two ways. First, by setting N = M as depicted in Fig. 2.a. However, this comes at the cost of entirely losing any load balancing possibilities which would be required in case of interfering other applications on the clients (see below). Second, by increasing the number of jobs. Clearly, a high number of jobs leads to an excellent
540 O 1680 iterations (30 jobs) ] * 3360 iterations (30jobs) 3360 iterations (jobs==#clients) 6720 iterations (30 jobs)
V
/ 10 /
/
/
/
/
/
/
/ /
/
/
/ / n
Doing nothing Communicating Processing
/
i 8 6
4
2
2
4
6
8
10
12
Number of worker-nodes
14
16
a) Speedup,varyingproblem-size
18
20
0
20
40
60
80
100
Time in seconds
120
140
160
180
b) Visualization of execution with 14 clients and 30 jobs (6720 iterations)
Figure 2. Results of Monte Carlo Simulation in homogeneous environment. load distribution, but on the other hand the communication effort is increased thereby reducing the overall efficiency. The tradeoff between communication and load distribution is inherent in the single pool of task approach which means that the optimal configuration needs to be found for each target environment. These facts are shown below in the context of the heterogeneous environment.
320L
,!
m
!I II i
28~ II
260j-
Number of jobs
0
m
.. == n n == 50
100
150
Time in seconds
200
Doing nothing Communicating Process ng
250
a) Fixed problem-size (6720 iterations), varying job b) Vizualizationof execution with 10jobs on 10 clients number Figure 3. Results of Monte Carlo simulation in the heterogeneous environment. Lower speedup and less pronounced plateaus are exhibited in case of smaller problem size (Fig. 2.a). Here, the expensive initial communication phase (where the server has to send the compiled code and input data to each of the clients) covers a significant percentage of the overall execution time. This leads to a significantly staggered start of the computation phases at different clients which dominates the other load inbalance problems. Next we discuss the heterogeneous environment. Fig. 3.a shows the time demand for computing the solution using a fixed problem size (again 6720 iterations) while changing the number
541 of jobs used to distribute the amount of work among the 10 clients (as defined in the last subsection). For our test configuration, the optimal number of jobs is identified to be around 60. A further increase leads to an increase of execution time as well as a lower number causes worse results. Fig. 3.b shows an execution visualization where N = M - 10. Of course, the execution is not balanced due to the slower clients involved and these machines (numbers 1,2,4,9) are immediately identified in the figure. Whereas for N - 60 we have relatively little communication effort but obviously sufficiently balanced load, the higher amount of communication required to achieve better balanced load for N >_ 84 degrades the overall performance of these configurations. Finally, we investigate the impact of interfering applications running as a desktop application on MDICE client machines. For that purpose, we execute our Monte Carlo simulation code in sequential (1680 iterations, 452 seconds sequential execution time) on two out of four clients running the Monte Carlo code under MDICE (6720 iterations).
Table 1 Effects of interfering application running on MDICE client machines. Execution time (seconds) 1[ 4jobs 112 jobs I 30jobs MDICE MC (6720 iter., 4 dedicated clients) 462 463 495 MDICE MC (6720 iter., seq. MC on 2 clients) 915 761 706 Sequential MC (1680 iterations, 452 sec.) 458 458 458
Table 1 shows at first that increasing the number of jobs in a homogeneous system (4 identical client machines) in fact decreases execution performance. However, similar to the heterogeneous case, increasing the job number in case of load inbalance (here caused by the sequential code running on two machines) clearly improves the results (from 915 seconds using 4 jobs to 706 seconds using 30 jobs). The third line of the table shows that no matter how many jobs are employed within MDICE, the impact of MDICE on running desktop applications remains of minor importance (below 2% of the overall execution time). 4. C O N C L U S I O N The custom MATLAB-based toolbox MDICE may take advantage of the large number of Windows NT based machines available in companies and universities. The most important property of MDICE is that no MATLAB client is required on the participating machines. Based on the results obtained from a Monte Carlo simulation, it proves to be very flexible and efficient in terms of low licencing fees and execution behaviour. REFERENCES
[1]
[21
J.F. Baldomero. PVMTB" Parallel Virtual Machine Toolbox. In S. Dormido, editor, Proceedings of lI Congreso de Usarios MATLAB'99, pages 523-532. UNED, Madrid, Spain, 1999. L. DeRose and D. Padua. A MATLAB to Fortran 90 translator and its effectiveness. In
542
Proceedings of l Oth ACMInternational Conference on Supercomputing. ACM SIGARCH and IEEE Computer Society, 1996. [3] E Drakenberg, E Jakobson, and B. Kagstr6m. A CONLAB compiler for a distributed memory multicomputer. In Proceedings of the 6th SIAM Conference on Parallel Processingfor Scientific Computing, volume 2, pages 814-821, 1993. [4] E. Kreyszig. Advanced Engineering Mathematics, 8th edition. Wiley Publishers, 1999. [5] A.R. Krommer and C.W. fdberhuber. Dynamic load balancing- an overview. Technical Report ACPC/TR92-18, Austrian Center for Parallel Computation, 1992. [6] V.S. Menon and A.E. Trefethen. MultiMATLAB: integrating MATLAB with high performance parallel computing. In Proceedings of 1l th A CM International Confernce on Supercomputing. ACM SIGARCH and IEEE Computer Society, 1997. [7] S. Pawletta, T. Pawletta, and W. Drewelow. Comparison of parallel simulation techniques -MATLAB/PSI. Simulation News Europe, 13:38-39, 1995. [8] M. Quinn, A. Malishevsky, N. Seelam, and Y. Zhao. Preliminary results from a parallel MATLAB compiler. In Proceedings of the International Parallel Processing Symposium (IPPS), pages 81-87. IEEE Computer Society Press, 1998. [9] S. Ramaswamy, E.W. Hodges, and P. Banerjee. Compiling MATLAB programs to SCALAPACK: Exploiting task and data parallelism. In Proceedings of the International Parallel Processing Symposium (IPPS), pages 814-821. IEEE Computer Society Press, 1996.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
543
Parallelization of Krylov Subspace Methods in Multiprocessor PC Clusters D. Picinin Jr. ab, A.L. Martinott@, R.V. Dorneles b, R.L. Rizzi c, C. H61big~d, T.A. Diverio ~, and P.O.A. Navaux ~ aPPGC, Instituto de Inform~tica, UFRGS, CP 15064, 91501-970, Porto Alegre, RS, Brasil bDepartamento de Informfitica, Centro de Ciancias Exams e Tecnologia, UCS, Rua Francisco Getfilio Vargas, 1130, 95001-970, Caxias do Sul, RS, Brasil cCentro de Ciancias Exatas e Tecnol6gicas, UNIOESTE, Campus de Cascavel, Rua Universitfiria, 2069, 85819-110, Cascavel, PR, Brasil dUniversidade de Passo Fundo, UPF, Campus I - Km 171 - BR 285 Bairro Silo Jos6, Passo Fundo, RS, Brasil This paper presents the parallelization of two Krylov subspace methods in multiprocessor clusters. The objective of this parallelization was the efficient solution of sparse and huge linear equations systems, resulting from the discretization of a hydrodynamics and mass transport model. One objective of this work was the comparison of the use of different tools, as well as the use of different approaches to the exploration of intra-node parallelism, presenting an analysis of the results obtained. In this work we also compare the computational performance obtained with the use of two processors in dual processor nodes, and the use of a single processor in each node, using more nodes to get the same number of processors. From the obtained results an analysis of the cost/benefit ratio of the use of dual machines versus monoprocessed machines in PC clusters is done. 1. INTRODUCTION The Group of Computer Mathematics and High Performance Processing (GMCPAD) of Universidade Federal do Rio Grande do Sul (UFRGS) has been working in the development of the HIDRA model, a multiphysics 3D parallel computational model with dynamic load balancing for simulation of hydrodynamics and mass transport in water bodies [3]. An application of the model is in the simulation of flow and sewerage transport in Guaiba Lake, which bathes the metropolitan region of Porto Alegre, being important to fluvial transportation, irrigation and water supplying, but receiving a constant emission of polluents. Some of the numerical methods being adopted to solve the linear equations systems generated by the discrete models are the Conjugate Gradient (CG) and Generalized Minimal Residual (GMRES) [4]. CG is used to solve the equations systems generated by the hydrodynamics model, as the coefficient matrices are sparse, huge and symmetric positive defined (SPD). For the solution of the equations systems generated by the substance transport model the GMRES
544 is used, as the coefficient matrices are sparse, huge and non-symmetric. The objective of this work was the development of parallel versions of these methods which can be executed in multiprocessor PC clusters, as this kind of architecture has become particularly attractive, due to its good cost-benefit relation and to the advances in network technology. In the case of clusters composed by multiprocessor nodes, parallelism can be explored in two levels. In a first level, exploring intra-node parallelism (shared memory) and in a second level exploring inter-node parallelism (distributed memory). The objective of this paper is to present the strategies adopted in the parallelization of the CG and the GMRES in clusters of multiprocessor nodes, as well as an analysis of the results obtained. Other approaches to parallelism have been presented in [3]. 2. EXPLORATION OF PARALLELISM IN MULTIPROCESSOR CLUSTERS According to Buyya [2], parallelism in multiprocessor PC clusters can be explored through three different forms: 9 Exploring intra-node and inter-node parallelism using multiple processes, which communicate through message passing. This approach has the disadvantage of not using the shared memory in intra-node parallelism; 9 Exploring intra-node and inter-node parallelism using mechanisms such as DSM (Distributed Shared Memory). This kind of mechanism introduces an additional cost for the management of addresses and cache coherence; 9 Exploring intra-node and inter-node parallelism distinctly, where inter-node parallelism is explored through the use of message passing libraries and the exploration of intra-node parallelism is done through the use of multiple threads. This approach presents the characteristic of using the advantages of both memory architectures (shared and distributed). In the development of this work two alternatives were analyzed. The first one refers to the use of only multiple processes, and the second one refers to the simultaneous use of multiple processes and multiple threads. In what concerns to message passing, implementations using two different libraries were done aiming a performance analysis. The first library used was MPICH, an implementation of MPI (Message Passing Interface) standard, from which blocking and non-blocking primitives were used. The use of non-blocking primitives allowed the overlapping of communication and processing, i.e., during communication the processes continue to execute some kind of processing. The second library used was DECK (Distributed Execution and Communication Kernel) [ 1], developed by the Distributed and Parallel Processing Group (GPPD) of the Informatics Institute of the Universidade Federal do Rio Grande do Sul, which offers support only to blocking primitives. DECK presents as an extra feature, when compared to MPICH, the possibility of several threads sending and receiving messages simultaneously, inside the same process (thread-safe). Concerning to multithreading, the parallelization of the algorithms was done through the use of Pthreads library (IEEE POSIX threads), which follows the standard ISO/IEC 99-45-1.
545 3. PARALLEL SOLUTION To obtain a parallel solution in the kind of model used in this work, data must be distributed between the processors according to minimize the communication and balance the computational load. Figure 1 shows an example of the partitioning of Guaiba Lake.
"" ......P ....
~-
t
~'
I
i
I , ~ ....
Figure 1. Partitioning (18 processes) of Guaiba Lake using the RCB algorithm In this work the parallel solution is obtained by data decomposition, and the computational domain is partitioned using RCB algorithm (Recursive Coordinate Bisection). After the partitioning, as the data decomposition approach is used, only one equation system is generated for the entire domain, each processor generating the part of the system corresponding to its subdomain. The system is solved through the use of a parallel numerical method as, for example, parallel versions of the CG or the GMRES. As the system generated is sparse and huge, the storage of the coefficient matrices is done using CSR (Compressed Sparse Row) format. This format is one of the most effective in the storage of matrices with such characteristics [4]. The CG and GMRES methods used in this work are composed by linear algebra operations with matrices and vectors. Thus, the parallelization of these methods can be obtained through the parallelization of such operations. Some of these operations, the scalar product and matrix-vector product, require communication between different processors due to the data dependencies between the different subdomains. In the case of the scalar product, communication between all the processors involved is necessary. Each processor calculates the scalar product over its local data and a reduction operation is done over the partial results. In the sparse matrix-vector product, each processor needs parts of the vectors of the neighbor processors. It happens because during the generation of the matrix, a data dependency is created due to the cells of the artificial frontiers, i.e., the frontiers originated in the partitioning of the domain. The solution adopted to effectuate the sparse matrix-vector product was to store the matrix elements that are multiplied by non-local elements in separate vectors, associated to the vectors received in the communication and to a vector containing the positions where the results of the operations must be stored. Thus, the sparse matrix-vector product can be divided in two parts: Ax = Axint + Axext, where Axint corresponds to the part of the sparse matrix-vector product that uses only local elements, and A x ~ t corresponds to the part of the sparse matrix-vector product that considers vector elements that are not available locally [5]. In the exploration of intra-node parallelism with multithreading, the strategy adopted was the creation of secondary threads, which are similar to the main thread and have the same opera-
546 tions. The only difference between them is that the secondary threads do not have primitives for message passing and the main thread is responsible by the communication between the processes. To ensure the consistency of data, the secondary threads are synchronized with the main thread through the use of barriers. These barriers are implemented through the use of mutexes and condition variables. 4. TESTS AND RESULTS The algorithms developed were implemented using C programming language, MPICH 1.2.4 (Argonne National Laboratory and Mississipi State University) and DECK 2.1.2 (Instituto de Inform~itica UFRGS) library, over the Linux Debian O.S. The tests were effectuated in the clusters of Clusters Technology Laboratory (LabTec), a research laboratory kept by Informatics Institute (UFRGS) and Dell Computers. The cluster is constituted by 21 nodes, 20 of which are Dual Pentium III 1.1 GHz dedicated to processing and a Dual Pentium IV 1.8 GHz used as a server. All the nodes have 512Kbytes of cache memory and 1Gbyte of RAM memory. The interconnection of the processing nodes is done through a Fast Ethernet network by a switch, in which is also connected the server node through a Gigabit Ethernet network. The tests were effectuated over a domain from which resulted, after the discretization, 180.000 equations. The GMRES method was executed using a base of size equal to 20. The number of iterations presented by the CG and GMRES methods were, respectively, 8 and 5, with an error limited to 10-8. In section 4.1 the results obtained in the parallelization with the use of different libraries and types of primitives are presented and compared. In section 4.2 the strategies adopted in this work to the exploration of intra-node parallelism are presented and compared. Finally, in section 4.3, a performance analysis of the use of multiprocessor nodes and monoprocessed nodes is presented, as well as a comparison between them. 4.1. Different message passing libraries Figures 2 and 3 present execution times of the CG and GMRES, respectively. These execution times were obtained varying the number of nodes from 1 to 18. Although the nodes in LabTec cluster are dual, in these tests only one processor in each node was used. The execution time graphs show that DECK library presented a better performance than MPICH. When using MPICH library, two approaches were adopted: using blocking primitives and using non-blocking primitives. It was done because one objective of this experiment was an analysis of the use of non-blocking primitives on overlapping of communication and processing.
1
2
3
4
5
6
7
8 9 10 11 12 13 14 15 16 Processes
Figure 2. Execution time of CG
Figure 3. Execution time of GMRES
547 The tests showed that, when using MPICH, the use of non-blocking primitives had a better performance. Figures 4 and 5 show the speedup of the execution times presented in figures 2 and 3. 1412 10-
64-
1210
~o4-
2 O-
Figure 4. Speedup of GC
Figure 5. Speedup of GMRES
In the graphs of figures 4 and 5 it can be observed that the CG had a better gain of performance than GMRES. One possible reason is the higher communication overhead existing in GMRES. This overhead is due to Gram-Schmidt process, that uses several scalar products with global communication. Figures 6 and 7 present the graphs of efficiency of the execution times presented in figures 2 and 3.
Figure 6. Efficiency of CG
Figure 7. Efficiency of GMRES
Concerning to GMRES, a better efficiency occurred when running from 2 to 7 processors. This behavior was not expected because of the increase in communication overhead. As GMRES stores the search vectors of the "subspace base", it uses a great amount of memory. One possible reason for this behavior is the fact that the increase in the number of processors reduces the size of the search vectors and consequently reduces the memory use. 4.2. Multiple processes vs. multiple threads Figures 8 and 9 present a comparison between the use of multiple processes and the simultaneous use of multiple processes and multiple threads. The results presented show that the exploration of intra-node parallelism using multiple threads had a better performance. The gain of performance when using multiple threads, when compared to the use of multiple processes, increased as the number of processors became gradually higher. Tables 1 and 2 show the percentage of gain when using simultaneously multiple
548
r
0,5
!i:il;!i!!-ii~..... t
.DECK ........................
IIDECKTHREADS
[ii!iiiiliiiiiiiii::i:i!-:
0,45 0,4
0,35 15'
0,3
|
0,25 0,2
10'
0,15 0,1
0,05 O
0 2
4
6
8 10 12 14 16 18 20 22 24 26 28 30 32 34 36
2
4
6
8 10 12 14 16 18 20 22 24 26 28 30 32 34 36
Processors
Processors
Figure 8. Process vs. Threads in CG
Figure 9. Process vs. Threads inGMRES
threads and multiple processes and the percentage of gain obtained using only multiple processes. Table 1 Process vs. Threads in GC Dual Machines DECK MPI IMPI
2,~ % 1,9% 3,1%
18 26,3 % 17,3% 18,2%
Table 2 Process vs. Threads in GMRES Dual Machines DECK MPI IMPI
1,16 % 1,3 % 1,7 %
18 24,0 % 20,6% 20,9%
4.3. Performance analysis of multiprocessor machines In this section the execution of processes running in two processors in a multiprocessor node is compared to the execution of processes running in two processors in different nodes. Figures 10 and 11 present the results obtained with CG and with GMRES, respectively.
Figure 10. Multiprocessor machines in CG
Figure 11. Multiprocessor machines in GMRES
The results obtained were influenced by some factors as, for example, the memory contention existing in multiprocessor machines and the cost of intra-node and inter-node communication through message passing. It can be observed, in the graphs, that with the increase in the number of nodes the difference of performance in the use of two processors in dual machines reduces gradually when compared to the use of monoprocessed machines. It can be explained by the reduction in the use of memory and, consequently, the reduction of memory contention when the number of machines grows.
549 In the GMRES method the memory contention with few machines is higher but, as the number of machines grows, this contention decreases faster. Tables 3 and 4 show the percentage of memory contention obtained in the execution of the methods for 2 and 18 processes. Table 3 Memory contention in GC
Table 4 Memory contention in GMRES
Procs in Dual Machines 2 36
DECK MPI IMPI
38,12% 37,7 % 37,3 %
21,9% 20,3 % 19,1 %
Procs in Dual Machines 2 36
DECK MPI IMPI
41,4% 39,7 % 39,8%
10,9% 6,6 % 6,6 %
5. CONCLUSIONS In this paper three different analysis concerning to algorithms development in PC clusters were presented. The first analysis was the comparison between different tools for the exploration of inter-node parallelism. The second one was the comparison of the use of only multiple processes and the use of multiple processes and multiple threads. The third analysis refers to the architecture, comparing the use of clusters of dual processor nodes and the use of monoprocessor nodes. In what concerns to inter-node parallelism, DECK presented a better performance than MPICH in the tests effectuated. It was also observed that the use of non-blocking primitives in MPICH, allowing the overlapping of communication and processing and reducing the overhead caused by the communication, improved the performance of the methods. Regarding to inter-node parallelism, the results obtained showed that the simultaneous use of multiple threads and multiple processes in multiprocessor clusters presents a better performance than the use of only multiple threads. Although the gain of performance obtained with the use of multiple threads can be small when compared to the gain obtained with the use of only multiple processes when few nodes are used, this gain can be significant with the increase of the number of nodes. Regarding to the architecture, an analysis of the cost-benefit relation shows that the acquisition cost of two monoprocessed nodes is higher than the cost of one dual processor node. Furthermore, the use of multiprocessor nodes in clusters allows the increase of the number of processors keeping the same number of interconnection devices necessary to the use of multiprocessor nodes. Other advantages in the use of multiprocessor nodes are the cost of management, maintenance and electrical power. An important issue that must be considered in the decision of using dual processor nodes or monoprocessed nodes is memory contention. Some tests done with two processes in dual processor nodes showed that memory contention grows with the use of memory by the application. This result suggests that, in applications that use less memory and have a good scalability, the use of multiprocessor nodes is more viable, as it presents a similar result to that obtained using monoprocessed nodes, by a lower cost. Other advantage in the use of multiprocessor nodes is the possibility of explorating intra-node parallelism through the use of multithreading.
550 REFERENCES
[1]
[2]
[3] [4]
[s]
Barreto, M. DECK: an environment for parallel programming on clusters of multiprocessors Porto Alegre RS - Brasil: Universidade Federal do Rio Grande do Sul, 2000. (Master Thesis). Buyya, R. (Ed.). High performance cluster computing: architectures and systems. Upper Saddle River: Prentice Hall PTR, 1999. 849p. Rizzi, R. L., Domeles, R. V., Diverio, T. A., Navaux, P. A. Parallel Solution in PC Clusters by the Schwarz Domain Decomposition for Three-dimensional Hydrodynamics Models. Vecpar. 2002. Porto, Portugal, 2002. Saad, Y. Iterative Methods for Sparse Linear Systems. Boston, PWS Publishing Company, 447 p. 1996. Saad, Y. Iterative Methods for Large Sparse Matrix Problems: Parallel Implementations. 84 p. (course in Winter Workshop on Iterative Methods for Large Sparse Matrix Problems) 2002. Brisbane.
Parallel Computing: SoftwareTechnology,Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 ElsevierB.V. All rights reserved.
551
First Impressions of Different Parallel Cluster File Systems T.P. Boenisch a, P.W. Haas a, M. Hess% and B. Krischok ~ ~High-Performance Computing-Center Stuttgart (HLRS), Allmandring 30, D-70550 Stuttgart, Germany During the last years clusters have become more and more important. In order to provide a single system view a cluster file system supporting equal access to the file system for each cluster node is essential. Nevertheless, the achievable performance should be comparable to that of a local file system. In this effort we installed and tested many of the important cluster file systems. 1. INTRODUCTION Clusters built from off the shelf hardware have become more and more powerful during the last years. Meanwhile, for some applications large cluster systems are as powerful as large supercomputers as they have entered the Top 10 [ 1] of the most powerful computers. Currently, the fastest cluster is the third fastest system on earth according to the Top 500. Simultanously, a large set of cluster tools for the administration and operation of such clusters has been developed. Additionally, parallel cluster file systems are developed, to enable a single system view also from a file system point of view. If one has to make a decision, which file system should be chosen, a lot of opportunities have to be considered. There are freely available file systems like PVFS [2] and in future Lustre [3] as there are proprietary solutions from different hardware vendors like GPFS [5] from IBM, CXFS [6] from SGI, QFS from SUN and GFS [7] from NEC. Furthermore, there are file systems from software only vendors available like StorNext File System (SNFS) [8] from ADIC or GFS [9] from Sistina. All these file systems should allow for access from many cluster nodes with a performance clearly above the level of NFS [10]. 2. TESTED PRODUCTS AND E N V I R O N M E N T S
The different file systems can be ranked into two classes roughly: the cluster file systems of the first class need to share a physical disk, today mainly based on fibre channel technology. In addition, participating nodes have a second network connection, where the control flow between the node and a metadata server is being handled. Some manufacturers even recommend a dedicated network for this task. CXFS, SNFS and NEC's GFS as well as Sistinas GFS are located in this class which can be described as the "shared storge" class. The other class uses storage nodes, where the disk drives are located within or close to the system. All data, the control data between metadata servers and the clients as well as the data to be stored, are handled by the cluster interconnect(s). PVFS, Lustre and GPFS belong to that class of
552 message-based architectures. A more detailed description and a discussion of the specific advantages and disadvantages of architectures of the two different classes can be found in [ 11 ] At the High Performance Computing Center Stuttgart (HLRS), there are currently three Cluster File Systems in four different configurations in use. These are NEC GFS, PVFS and Lustre. So, representatives of each file system class are available. Additionally, we have been able to perform measurements with CXFS on a SGI test cluster. 2.1. PVFS Developed by the Parallel Architecture Research Lab at Clemson University, PVFS [2] is a virtual parallel file system for Linux clusters. Data is distributed over several I/O servers and multiple cluster nodes have access to the data simultaneously. A metadata server, called mgr, manages file system structure, data access and the distribution of data onto the I/O servers. The I/O servers themselves use regular files on an already existing local file system to store data. PVFS can be used as any other file system and once it is mounted on a cluster node, it is completely transparent to applications. PVFS is not failsafe. If an I/O server is lost, access to data stored on a PVFS partition requiring that node is only possible after that I/O server has been restarted. In case of a disk failure, all data on PVFS partitions involving the respective nodes disk are lost. Benchmarks of PVFS were conducted on an an IA-64 cluster. This cluster consists of eight dual Itanium 2 (McKinley) nodes, with 900 MHz clock frequency. Each node has a local disk, which are used for the PVFS file system. The nodes are connected via Myrinet 2000 and Gigabit Ethernet. For the measurements we configured 4 I/O nodes using the Myrinet network connection.
2.2. NEC GFS The NEC GFS is an NEC extension for NFS. Any node in the NEC GFS environment needs to be able to address the physical shared disk directly. This access is granted by using fibre channel SAN's exporting the specific storage device to all participating nodes. As NEC GFS is an extension to NFS and uses the NFS protocol, the GFS clients have to import the file system from the GFS server. NEC GFS has two different transport protocols. When accessing small files, the whole transfer is performed using the NFS protocol only. When the files are large, only the metadata are transferred using the NFS mechanism. The data themselves are transferred directly using the disk access via fibre channel. Hereby, the client gets the information where to write or read from the GFS server. The HLRS currently runs two configurations. One GFS environment consists of two NEC SX-5 vector systems, 16 processors each. One system is configured as server, one as client. Each system has four 1-Gbit/s FC-paths to the storage device, a NEC DS 1200. The native file system on disk is the NEC Super-UX FS. The file system block size is 4 MB. Additionally, there is a test configuration with the two vector systems as client and a NEC Azusa system running Linux as server. The Azusa system is equipped with 16 Itanium processors. In this GFS environment, each host has two 1-Gbit/s paths to the RAID, a NEC iStorage $2100. The native file system is XFS with a block size of 16kB. 2.3. Lustre On a small test cluster in our center, we have Lustre installed. The cluster consists of eight double processor 1GHz Pentium III PC's with 1 GB of main menory each. The systems are connected by 100 MBit Ethemet and Myrinet. For the current installiation we used the freely
553 available Lustre software [4], Version Lustre Lite 0.6 (1.0 beta 1). Lustre Lite 1.0 with an improved performance for bandwidth as well as for metadata is expected to be released in November 2003. The Lustre configuration in our center is made up of one metadata server (MDS) and between one and four object storage targets (OST). For both of them, the local IDE-disk with 36 GB was used to store the metadata and storage objects respectively. The clients which request the data were located at further systems in the cluster. The Lustre installation was configured to use Myrinet for the network connections and data transfer. 2.4. CXFS In addition, we have been able to access a CXFS environment from SGI for measurements. In CXFS, there is a strict separation of metadata management and data operations. The respective entities are the metadata controller, or CXFS server, and the CXFS client. Metadata operations have to be carried out by the CXFS server, exclusively. The client has direct access to the XFS file system data on the shared disk; the access is restricted to read/write and long seek operations, though. The measurement environment consisted of a SGI Origin 3000 with 20 processors as server and a SGI Origin 3000 with 8 processors as client. The storage device used was a SAN Data Director manufactured by Data Direct Networks (DDN). Each host was connected with four 1-Gbit/s FC-AL ports. The RAID layout was 4x (8+ 1+ 1 RAID-5) striped. 3. M E A S U R E M E N T M E T H O D
For the measurements we were using our own disk benchmark disk-I/O which allows to configure the file size for writes and the I/O block size dynamically. But measuring the bandwidth to and from the file system on disk is only half the story. We found metadata performance being as important as the plain bandwidth. Metadata in this frame are data necessary to control the storage of the data, i.e. the additional information also stored within the filesystem. This includes the stored creation and change dates, access permissions and all other information in the inodes and the inodes itself. Especially on cluster file systems where every metadata operation finally must be communicated between systems, metadata performance is not negligible. For example, when compiling a lot of temporary files are created and used. Consequently, poor metadata performance delays the whole compilation, by factors probably. Therefore, we are also measuring metadata performance. This is done by measuring the creation of zero byte files (create performance), performing stat operations on these files (list performance) and deleting them afterwards (unlink performance). 3.1. disk-I/O For file system measurements HLRS develops its own disk benchmark called disk-I/O. The program is written in C and uses basically the C I/O subroutines for unformatted write and read. The file size during writes as well as the I/O block size can be configured at run time. Files can be created for writing and also be rewritten. For metadata performance measurements the file status can be requested and the files can be deleted. The benchmark can be controlled either manually for read and write tests, but it should be mainly used with its configuration files. There, you can specify the test to be performed one after the other. In addition, several data streams can be tested in parallel. For this, the benchmark uses either p-threads or MPI. When testing parallel streams, each measurement is synchronised so that different measurements do
554 Table 1 File System Throughput [MB/s] 1 process, 1 client, 1MB IO chunks CXFS NEC GFS NEC GFS PVFS (Az) (SX) Buffered Write Client 177 23 30 45 Read Client 90 27 25 65 Buffered Write Server/Local 213 64 92 20 Read Server/Local 89
Lustre (10ST) 10 11 30 36
Lustre (20ST) 10 9 30 36
no influence each other. The results for CXFS were measured with a preliminary version of the benchmark software but the results were checked to be consistent with the current benchmark version.
3.2. Measurement problems Compared to CPU performance measurements, measuring file system performance has to cope with much more dependencies. In addition to the host system hardware, these are the storage interface, the FC host bus adapter (HBA), the FC fabric including the switches and the storage device, typically a RAID system. Within the RAID, the controller, the internal structure, the caching behavior and the RAID level are influencing the performance. From the software side this is the layout of the logical volume, i.e. the number of physical volumes and their arrangement and the file system configuration, mainly the block size, which have to be well configured by the system administrator. When measuring cluster file systems in addition the network performance and layout influences the results. Unfortunately, all these parameters might be different or cannot be adjusted easily. Therefore, it might be difficult to compare the measurement results. Fortunately, most of these parameters mainly affect the data throughput but not the metadata performance. Thus, the metadata performance should be comparable more easily. Nevertheless, the large number of influencing parameters leeds the I/O performance optimization to become highly complex. 4. RESULTS To get a first impression of the file system's performance we measured the throughput of a client to the cluster file system by writing and reading large files with 1MB I/O chunk size with one I/O process. The file size choosen was different on the different file systems because it has to be large enough to minimize cache effects from the file system cache. Before measuring the read performance, another large file was written to flush the file system cache. In row two and three of table 1 the obtained results are shown. Row four and five show the write and read performance on an equivalent local disk for Lustre and PVFS while it shows the respective performance of the server nodes for CXFS and NEC GFS. While CXFS shows quite a good performance and nearly no degradation when using the file system from a client, NEC GFS cannot compete on the described hardware. The reason for the poor performance with the SX5 as server as well as using the Azusa is the lack of metadata performance. The reason is, that for each new file system block a metadata communication has to be performed. NEC is already aware of this problem and promised better results for a GFS environment with SX-6 clients and a TX-7 (Azama) as server. PVFS shows also quite a nice performance if the disk performance is taken into account. Lustre shows currently a limited throughput, both with
555 160
'
'
140 120
NE c' GFS (s X) server ' NEC GFS (SX) client NEC GFS (Azusa) server . . N E C GFS (Azusa)client PVFS '
'
, • [] ---
100 rn
80 60 40 20
I
0
I
I
I
2000 4000 6000 8000 10000 12000 14000 16000 18000 blocksize [kb]
Figure 1. Throughput for the different file systems depending on the I/O chunk size one OST or two OST's with 4k block striping configured. But these results for Lustre are 33% and 16% of the disk performance which is more as it was originally planned for the beta version. It will be very interesting to see, how the first semi production version, planned for November, will behave. Figure 1 shows the throughput for the read operation depending on the I/O chunk size for NEC GFS and PVFS. The main difference between the SX5 and the Azusa as GFS server is the block size of the underlying file system. On the SX5 it is 4 MB while it is only 16 kB on the Azusa. Unfortunately, this cannot be increased currently due to Linux restrictions. The consequence is a larger number of disk blocks to be addressed for the same file to be written. But fortunately, on the Azusa the possible interrupt rate is quite higher compared with the SX5. With the SX5 as server, the client gets a quite good bandwith to disk which is nevertheless lower than that on the server. But compared with the Azusa as server the client gets much better performance. Nevertheless, the throughput of the client with the Azusa as server is still increasing and achieves 62 MB/s for 32 MB I/O chunks. But, even this is less than half of the servers performance. As already mentioned, the reason for this is the small disk block size on the Azusa. PVFS shows also an increasing performance when increasing the I/O chunk size. In figure 2 the metadata performance for the file create operation for one client and an arbitrary number of processes on that client is shown. Except CXFS all file systems show a saturation with at maximum five processes on the client. Surprisingly, Lustre shows already a promising performance with around 100 create operations per second, which is nearly as much as NEC's GFS with the Azusa as server. The SX5 as GFS server results in only half of the create performance on the client compared to the Azusa as server. The reason is presumably the lower interrupt rate of the SX5. But the SX5 is a vector supercomputer, not a file server. In figure 3 the metadata performance for getting the file status is presented. This gives also the performance which can be expected for directory listings. For all file systems, the performance is increasing with the number of processes on a client until the saturation is reached. Then, the metadata performance decreases slightly. It is interesting to note that the
556 10000
'
'cxfs
',
NEC GFS (SX) • NEC GFS (Azusa) PVFS [] L u s t r e ~ ___I--
1000
t1:i t_ 0
100
0
0
I
I
I
I
I
I
I
I
I
2
4
6
8
10
12
14
16
18
20
processes
Figure 2. Metadata performance create for one client depending on the number of processes 100000
.
.
.
.
.
.
.
10000
CXFS
',
PVFS Lustre
[] -
o~ Q. 0
1000
o~ .m i
100
0
0
I
I
I
I
I
I
I
I
I
I
2
4
6
8
10
12
14
16
18
20
processes
Figure 3. Metadata performance for one client depending on the number of processes metadata performance of CXFS for the create as well as for the list operation is at least one order of magnitude higher than the metadata performance of all other measured cluster file systems as CXFS takes advantage from the proprietary communication mechanism. Obviously, NEC GFS is suffering again from the inadequate NFS metadata performance because the possible metadata rate on the respective server is quite higher. PVFS shows now, in contrast to the create performance, a reasonable rate; probably the metadata information is cached locally on the clients which are in the same way I/O-nodes in our setup. Comparing the rates for Lustre guides us to an interesting behaviour: Lustre has more or less the same performance for the list
557 operation as for the create operations. This leeds to the conclusion that the metadata server itself has a limitation which will be probably changed in the next version but at latest when switching to metadata server clusters as it is planned for the future. 5. C O N C L U S I O N S The measurements have shown that CXFS is already a mature and performant cluster file system. The other mentioned file systems have to be significantly improved, especially from the metadata performance point of view. NEC has already an improved version for SX-6 and Azama servers, the PVFS group will present a completely new designed PVFS2 soon and Lustre is of cause an ongoing development as we have measured the first beta version of Lustre Lite. The new versions should and will be measured as they become available. Nevertheless, one has to consider the environment where the cluster file system is used as some of them currently depend on different hardware platforms and usage models. 6. A C K N O W L E D G E M E N T S
We want to thank SGI for the opportunity to perform measurements on their Origin 3000 test system. We gratefully acknowledge that Data Direct Networks provided a Data Director RAID System for one of these measurements. We also want to thank NEC for the GFS setup and the support during the measurements. REFERENCES
[1]
Hans Meuer, Erich Strohmaier, Jack Dongarra, Horst D. Simon: Top 500 Supercomputer
Sites www.top500.org [2]
[3] [4] [5]
[6] [7] [8] [9] [10]
W . B . Ligon III and R. B. Ross: Implementation and Performance of a Parallel File System for High Performance Distributed Applications Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, August, 1996 Peter J. Braam et. al.: The Lustre Storage Architecture Version 24/10/2002, http://www. 1ustre, org/do c s. html Cluster File Systems Inc.: The Lustre Home Page http://www.lustre.org Frank Schmuck and Roger Haskin: GPFS: A Shared-Disk File System for Large Computing Clusters First USENIX Conference on File and Storage Technologies (FAST'02), Monterey, CA, January 28-30, 2002. Available at http ://www.almaden.ibm.com/StorageSystems/file_systems/GPF S/. SGI White Paper: CXFS: A High-Performance, Multi-OS SAN Filesystem from SGI Available at http://www.sgi.com/products/storage/cxfs.html Hiroyuki Endoh: Performance Improvement of Network File System Using Third-Party Transfer in: NEC Research&Development, Vol. 39, No.4 October 1998 Brad Kline, ADIC: Bistributed File Systems for Storage Area networks Available at http ://www.adic.com/us/collateral/distfilesystems.pdf Sistina: Sistina Global File System Available at http://www.sistina.com/resources.html R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh and B. Lyon: Design and Implementation of the SUN Network File System Proceedings of the Summer USENIX Conference, pp. 119-130, 1985.
558 [11 ] Steven R. Soltis, Thomas M. Ruwart and Matthew T. O'Keefe: The Global File System Proceedings of the Fifth NASA Goddard Conference on Mass Storage Systems, College Park, MD, 1996. IEEE Computer Society Press.
Parallel Computing: SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004Elsevier B.V. All rights reserved.
Fast Parallel I/O on ParaStation Clusters

N. Eicker a,b, F. Isaila c, Th. Lippert a, T. Moschny c, and W.F. Tichy c

a Department of Physics, University of Wuppertal, Gaußstraße 20, 42097 Wuppertal, Germany

b ParTec AG, Possartstr. 20, 81679 München, Germany

c Institute for Program Structures and Data Organization (IPD), University of Karlsruhe, Postfach 6980, 76128 Karlsruhe, Germany

It is shown that the ParaStation3 communication system can considerably improve the performance of parallel disk I/O on cluster computers with Myrinet or Gigabit Ethernet connectivity. Benchmarks of the open source parallel file system PVFS on the 128-node ALICE cluster at Wuppertal University outperform existing results by more than a factor of 2. The large-scale computation of I/O-intensive eigenmode problems from computational field theory demonstrates the long-time stability of PVFS/ParaStation.

1. INTRODUCTION

Commodity-off-the-shelf (COTS) clusters have evolved towards cost-effective general-purpose HPC devices and show up increasingly on the TOP500 list [15]. Communication-intensive number-crunching tasks on such clusters have been enabled by gigabit network technology, such as Myrinet. The communication software for these technologies, employing special techniques to avoid copying of data (zero-copy) [1, 6, 5], has however mainly focused on applications using the message-passing paradigm, so I/O-intensive computations did not benefit from cluster computers to the same extent. In principle, this situation can be overcome by using a parallel file system (PFS), which allows I/O-intensive, parallel applications to utilize the aggregate capacity and bandwidth of the distributed disks in a cluster. In a typical implementation, a set of compute nodes reads and writes data to another set of I/O nodes that hold the physical resources of the PFS. The two sets of nodes may be identical, may partly overlap or may even be distinct. This concept allows the I/O rate to scale with the number of I/O nodes, provided that the network is scalable, fast enough and shows low latency. The first requirement is fulfilled by crossbar or multi-stage crossbar networks with full bi-sectional bandwidth, the second leads to gigabit communication systems. Systems like QsNet, Myrinet [2] or standard Gigabit Ethernet can deliver gigabit bandwidth and are, to a large extent, scalable. The parallel file system chosen is the Parallel Virtual File System (PVFS), developed at Clemson University and Argonne National Laboratory [4], because it is freely available (under the GNU General Public License) and works in a stable manner. The communication back-end of the standard distribution of PVFS is based on the TCP/IP protocol. Therefore, PVFS can readily operate on top of any network supporting TCP/IP.
Our testbed is the 128-node ALICE cluster at Wuppertal University, Germany [8], which is connected through a Myrinet Clos network. Fast communication using MPI and TCP/IP over Myrinet (and Ethernet) on this system is enabled by the ParaStation cluster middleware [5], which additionally provides administration software for Linux. TCP/IP is routed by a special kernel module via the ParaStation communication system, which renders Myrinet an additional IP network with full bi-sectional bandwidth. On ALICE, we have set up four PVFS partitions consisting of 32 I/O nodes each. ParaStation, developed at Karlsruhe University, implements the concept of virtual nodes, operating closely with queuing systems like PBS [7]. The communication system provides safe multi-user operation and outstanding stability, with single-point-of-administration management by means of the ParaStation daemons. Besides a comparison of read and write performances, we exemplify the superior features of PVFS using ParaStation by computing giant, data-intensive eigenproblems [17] in lattice field theory. In our simulations, we repeatedly have to compute O(1000) low-lying eigenvectors of the fermionic matrix, which describes the dynamics of quarks. The size of each vector is O(10^6). Typically, about 10 GB of I/O is carried out in data-intensive production steps of about 10 minutes compute time on 16 to 64 processors. Our parallel I/O can cut down reading or writing from 10 minutes to 20 seconds.

2. THE PARASTATION TCP/IP KERNEL MODULE

We implemented a TCP/IP kernel module on top of the ParaStation communication system. In this manner, stable communication with gigabit bandwidth is provided for TCP/IP-based applications like the parallel virtual file system (PVFS). The module provides an alternative network driver interface to the Linux kernel. This way, virtually any Ethernet protocol that is supported by the kernel can be transported over Myrinet. Internally, the module uses the kernel variant of the ParaStation low-level communication interface to send raw Ethernet datagrams to any other host in the cluster. This interface supports a minimal set of basic communication operations. Upon startup, a so-called kernel context is obtained from the ParaStation module. This context reserves a certain number of communication buffers in the memory of the Myrinet network adapter card and instructs the driver to listen for messages addressed to the TCP/IP module. Secondly, a call-back mechanism is established, such that whenever a message arrives, an interrupt is raised and the ParaStation interrupt handler calls a method of the TCP/IP module, which in turn hands the message over to the Linux network stack. If a message is to be sent, the network stack calls a method of the module that was registered upon initialization, handing the message over to the ParaStation module. In order to address other nodes in the cluster, the TCP/IP driver module (in fact an Ethernet driver) maps the ParaStation node identification number (ParaStation ID) onto Ethernet hardware addresses. As ParaStation IDs can be derived from IP addresses, static ARP tables are not required.
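Neither the module's source nor its exact kernel interface is reproduced in this paper, so the following C sketch is only illustrative. It shows the general shape of such a virtual Ethernet driver on a present-day Linux kernel (the original module targeted the 2.4 kernel series); the stubbed psp_send_raw(), the receive hook ps_eth_rx() and the encoding of the ParaStation ID in the last byte of the MAC address are placeholder assumptions, not the real ParaStation interface.

    #include <linux/module.h>
    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>
    #include <linux/if_ether.h>

    static struct net_device *ps_dev;

    /* Stand-in for the ParaStation kernel-level send primitive; assumed to
     * copy the frame so that the caller may free the skb afterwards. */
    static int psp_send_raw(int node_id, const void *frame, unsigned int len)
    {
        return 0;   /* a real module would hand the frame to ParaStation here */
    }

    /* Transmit path: derive the ParaStation node ID from the destination
     * MAC address and pass the raw Ethernet frame to ParaStation. */
    static netdev_tx_t ps_eth_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        int node = eth_hdr(skb)->h_dest[5];   /* assumed ID encoding */

        psp_send_raw(node, skb->data, skb->len);
        dev_kfree_skb(skb);
        return NETDEV_TX_OK;
    }

    /* Receive path, to be called from the (assumed) ParaStation call-back:
     * wrap the raw frame in an skb and hand it to the network stack. */
    void ps_eth_rx(const void *frame, unsigned int len)
    {
        struct sk_buff *skb = netdev_alloc_skb(ps_dev, len);

        if (!skb)
            return;
        skb_put_data(skb, frame, len);
        skb->protocol = eth_type_trans(skb, ps_dev);
        netif_rx(skb);
    }

    static const struct net_device_ops ps_eth_ops = {
        .ndo_start_xmit = ps_eth_xmit,
    };

    static int __init ps_eth_init(void)
    {
        ps_dev = alloc_etherdev(0);
        if (!ps_dev)
            return -ENOMEM;
        ps_dev->netdev_ops = &ps_eth_ops;
        return register_netdev(ps_dev);
    }

    static void __exit ps_eth_exit(void)
    {
        unregister_netdev(ps_dev);
        free_netdev(ps_dev);
    }

    module_init(ps_eth_init);
    module_exit(ps_eth_exit);
    MODULE_LICENSE("GPL");

Registering the module in this way makes the Myrinet network appear to Linux as one more Ethernet interface, which is exactly what allows an unmodified TCP/IP-based service such as PVFS to run over it.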
3. PVFS ON ALICE

The Alpha-Linux-Cluster-Engine ALICE consists of 128 Compaq DS10 workstations [8]. The machine is equipped with Alpha 21264 EV67 processors with 2 MB cache, each operating at a frequency of 616 MHz. The total amount of memory is 32 GB, and the total disk space is 1.3 TB. The nodes are interconnected by a 2 x 1.28 Gbit/s Myrinet network configured as a multi-stage crossbar with full bi-sectional bandwidth. The M2M-PCI64A-2 Myrinet cards utilize a 64 bit/33 MHz PCI bus. We have extended the Myrinet crossbar network to include an external file and archive server as well as to provide gigabit links to external machines for the purpose of fast on-line visualization, see Figure 1.
Figure 1. Adding external devices to the Myrinet crossbar net.
On ALICE, we run four different PVFS partitions with 32 nodes each. Each PVFS partition is represented by a mount point on each node and on the file server. Mounting PVFS on the file server enables us to copy UNIX files at ParaStation TCP/IP speed from the external RAID to PVFS. For each partition, the last node plays the role of the management node.

4. LOCAL I/O BENCHMARKS

Prior to benchmarking the parallel file system, we have to investigate the characteristics of ALICE's basic I/O components, such as the local file system performance as well as the TCP/IP and MPI data rates. We performed tests on local file systems, formatted with ReiserFS, by means of bonnie++ [10]. Read and write performances are determined for file sizes ranging from 10 MB up to 2 GB. In order to reveal buffer cache effects, we have adjusted the "dirty buffer" parameter¹ to two different values, 40% and 70%, see Figure 2. The write performance to buffer cache of 60 MB/s goes down to about 23 MB/s at a file size of 100 and 200 MB for 40% and 70% dirty buffers, respectively. The subsequent read operation shows a bandwidth of more than 280 MB/s for small files, while at a file size of about 200 MB the speed drops down to 24 MB/s.

On ALICE, we can route TCP/IP packets via ParaStation/Myrinet. The ttcp benchmark issues TCP/IP packets over a point-to-point connection to determine the unidirectional TCP/IP speed, cf. Figure 3a. The performance is seen to saturate at 93 MB/s. It is instructive to compare these results with the outcome of the Pallas send-receive MPI benchmark [3], see Figure 3b.

¹ The Linux kernel starts flushing buffer pages as soon as the given percentage of the memory available for the buffer cache is "dirty", i.e., if it has been changed in memory but not yet written to disk.
Figure 2. Performance of the local file system on ALICE.
Figure 3. Performance test by ttcp via ParaStation (a) and performance of the send-receive Pallas MPI benchmark on ALICE (b).

For the send-receive case (i.e. the bi-directional situation), the performance levels off at a total bandwidth of about 175 MB/s (adding up the data rates of both directions). As the Pallas benchmark does not provide uni-directional measurements, we have prepared corresponding send and receive programs. The uni-directional MPI performance saturates at about 140 MB/s. Hence, the data rate via TCP/IP is only about 34% smaller than via MPI.

5. BENCHMARKING PVFS/PARASTATION

The tests of the concurrent read/write performance of the parallel file system are carried out in the following manner: a new PVFS file, belonging to P compute processes, is opened on N I/O nodes. Concurrently, the same amount of data S is written from each of the P processes (the virtual partitioning) to disjoint parts of the file. PVFS stripes the data onto the N I/O nodes (the physical partitioning) with a standard stripe size of 64 kB. After the data has been written, we close the file and reopen it again immediately afterwards. Then the same data is copied back to the compute nodes. The bandwidth for write and read operations is computed from the maximum of the wall clock execution times measured on all the P compute nodes. This benchmarking procedure closely follows Carns et al. [11], who report performance results on 60 nodes of the Chiba-City cluster at Argonne National Laboratory. This
cluster is equipped with the same Myrinet version as ALICE, thereby enabling us to compare our results using TCP/IP over ParaStation on an Alpha system with TCP/IP over the Myrinet/GM software on a Pentium III cluster. Varying P in the range P = 1...64, and repeating each measurement for N = 4, N = 16 and N = 32 I/O nodes, the amount of data S written and read per compute node is chosen proportionally to the number N of I/O nodes, S/N = const. In this manner the buffer cache will be saturated for one and the same number of compute nodes, see Figure 4.
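The structure of this concurrent-access test can be sketched as follows. This is not the code used for the measurements: the file name, the per-process data size S and the use of MPI-IO on top of PVFS are assumptions made purely for illustration.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: each of the P processes writes S bytes to a disjoint region of
     * one shared PVFS file and reads the same region back afterwards. */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const MPI_Offset S = 64 * 1024 * 1024;   /* data per process (assumed) */
        char *buf;
        MPI_File fh;
        double t0, t_local, t_max;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        buf = malloc(S);                          /* contents are irrelevant  */

        /* Concurrent write to disjoint parts of the file. */
        MPI_File_open(MPI_COMM_WORLD, "/pvfs/bench.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        t0 = MPI_Wtime();
        MPI_File_write_at(fh, (MPI_Offset)rank * S, buf, (int)S, MPI_BYTE,
                          MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        t_local = MPI_Wtime() - t0;

        /* Bandwidth is derived from the slowest process, as in the text. */
        MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("aggregate write bandwidth: %.1f MB/s\n",
                   (double)nprocs * S / t_max / 1e6);

        /* Reopen and read the same region back (read phase analogous). */
        MPI_File_open(MPI_COMM_WORLD, "/pvfs/bench.dat",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_read_at(fh, (MPI_Offset)rank * S, buf, (int)S, MPI_BYTE,
                         MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }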
Figure 4. Concurrent write (a) and read (b) performances. The smallest and the largest results of 5 measurements were discarded, and the remaining ones have been averaged.
As we see from Figure 4a, the performance quickly reaches a plateau for each I/O partition. There is no visible impact from the number of compute nodes as long as the buffer cache of the I/O nodes is not saturated. The write performance reaches between 29 and 35 MB/s for each I/O node, not exhausting the bonnie++ figures of Section 4 or the TCP/IP speed, see Figure 3. However, we achieve about 30% faster write performance than that reported by Carns et al. [11], as well as much better stability, due to ParaStation. After buffer cache saturation, the performance drops down to a value which is about 18% smaller than the hard disk performance benchmark results displayed in Figure 2. The read operation is carried out directly after the write, with only a synchronizing barrier in between. Therefore, the read process can draw the data directly from the buffer cache. As explained above, dirty buffers are written to disk if their size exceeds the limit of 100 MB. However, they remain in memory and can be read back at a rate limited only by the memory bandwidth. Thus, Figure 4b shows no loss of performance throughout the test range. We find that each I/O node can send at a speed varying between 56 MB/s and 75 MB/s, since several sockets are served simultaneously. This performance is just 20% slower than the measured point-to-point performance of TCP/IP, but still about 45% slower than the actual capabilities of ParaStation as seen in MPI applications, see Section 4. We should remark, however, that such a read test, as proposed in Ref. [11], is rather artificial. It is only meaningful given a hard disk with a read performance faster than 75 MB/s. In order to test the throughput of the disk, i.e. without utilizing the buffer cache, we modified the benchmark by creating a huge amount of data (multiple times the size of the buffer cache),
writing the data to several files. After writing and closing the last file, it is very unlikely that any data from the first files is still present in the buffer cache. A subsequent read operation will therefore be forced to read directly from hard disk. Figure 5 shows the result of this benchmark.
Figure 5. Performance of concurrent read from disk.
The throughput for read operations from PVFS drops dramatically for such a realistic setup. Still, a total read performance of 300 MB/s can be achieved if all 32 I/O nodes are utilized.

6. USING PVFS/PARASTATION FOR LATTICE FIELD THEORY EIGENPROBLEMS

We apply PVFS/ParaStation to the computation of huge eigensystems. We have to compute O(1000) low eigenvectors of the so-called fermionic matrix, which describes the dynamics of quarks, on thousands of field configurations. The size of each vector is O(10^6). The theory of strong interactions, quantum chromodynamics (QCD), deals with the determination of hadronic properties. In the hadronic energy domain, this theory is solved on a space-time lattice [16]. Particularly important observables are given by the mass spectrum of bound quark states, as for instance the masses of hadrons like the π meson and the ρ meson. Some important hadronic states are classified as singlet representations of the flavor-SU(3) group. Their correlation functions, $C_{\eta'}(t_1 - t_2)$, the quantities which allow one to extract the physical properties of the hadrons by exploring their decay in time, $\Delta t = t_1 - t_2$, contain contributions from correlators between closed virtual quark-gluon loops. These disconnected diagrams represent quantum mechanical vacuum fluctuations, which follow from Heisenberg's uncertainty principle. The reliable determination of disconnected diagrams leads to the numerical problem of getting information about functionals of the complete inverse fermionic matrix $M^{-1}$. Recently, we have shown how to estimate the mass of the η′ meson just using a set of low-lying eigenmodes of M [17]. Strictly speaking, our approach deals with the matrix Q, the hermitian form of M, the eigenvectors of which form an orthogonal base [12], $Q = \gamma_5 M$.
The Wilson-Dirac matrix is given by

$$ M_{x,y} = 1_{cs}\,\delta_{x,y} - \kappa \sum_{\mu=1}^{4} \left[ (1_{cs} - \gamma_\mu) \otimes U_\mu(x)\,\delta_{x,y-\hat\mu} + (1_{cs} + \gamma_\mu) \otimes U_\mu^\dagger(x-\hat\mu)\,\delta_{x,y+\hat\mu} \right]. \qquad (1) $$
The symbols $\gamma_\mu$ stand for the 4 x 4 Dirac spin matrices. The 3 x 3 matrices $U_\mu \in$ color-SU(3) represent the gluonic vector field, thus $1_{cs}$ is a 12 x 12 unit matrix in color and spin space. M is a sparse matrix in 4-dimensional Euclidean space-time, with matrix entries given by stochastically distributed coefficient functions of type $(1_{cs} \pm \gamma_\mu) \otimes U_\mu(x)$ at site x. Once the low-lying modes are computed, it is possible to approximate the full inverse matrix $Q^{-1}$. The correlator of the flavor non-singlet π meson is defined as
$$ C_\pi(t_1 - t_2) = \Big\langle \sum_{\vec{x},\vec{y}} \mathrm{Tr}\big[\, Q^{-1}(\vec{x},t_1;\vec{y},t_2)\, Q^{-1}(\vec{y},t_2;\vec{x},t_1) \,\big] \Big\rangle_U , \qquad (2) $$
while the flavor singlet η′ meson correlator is composed of two terms, the first corresponding to the propagation of a quark-antiquark pair from $\vec{x}$ to $\vec{y}$ without annihilation in between, and the second one being characterized by intermediate pair annihilation:
$$ C_{\eta'}(t_1 - t_2) = \Big\langle \sum_{\vec{x},\vec{y}} \mathrm{Tr}\big[\, Q^{-1}(\vec{x},t_1;\vec{y},t_2)\, Q^{-1}(\vec{y},t_2;\vec{x},t_1) \,\big] \Big\rangle_U - 2 \Big\langle \sum_{\vec{x},\vec{y}} \mathrm{Tr}\big[\, Q^{-1}(\vec{x},t_1;\vec{x},t_1) \,\big]\, \mathrm{Tr}\big[\, Q^{-1}(\vec{y},t_2;\vec{y},t_2) \,\big] \Big\rangle_U . \qquad (3) $$
The brackets $\langle \cdots \rangle_U$ indicate the average over a canonical ensemble of gauge field configurations. For large time separations t, the respective correlation functions are dominated by the ground state, because the higher excitations die out, and they therefore become proportional to $\exp(-m_0 t)$, where $m_0$ is the mass of the particle associated with the correlation function. As already mentioned, the π correlator (2), equivalent to the first term of the η′ correlator, is obtained by solving the linear system $M(\vec{x},t_1;\vec{y},t_2)\, c(\vec{y},t_2) = \delta(\vec{y},t_2;\vec{x},t_1)$ on 12 source vectors. The second term in eq. (3) depends on the diagonal elements of $Q^{-1}$. The inverse can be expanded in terms of the eigenmodes weighted by the inverse eigenvalues:
$$ Q^{-1}(\vec{x},t_1;\vec{y},t_2) = \sum_i \frac{1}{\lambda_i}\, \frac{|\psi_i(\vec{x},t_1)\rangle \langle \psi_i(\vec{y},t_2)|}{\langle \psi_i | \psi_i \rangle} \, , \qquad (4) $$
where $\lambda_i$ and $\psi_i$ are the eigenvalues and the eigenvectors of Q, respectively. We found that we can approximate the sum on the right-hand side by restricting it to the O(300) lowest-lying eigenvalues.
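As a minimal illustration of how eq. (4) is used, the following C fragment accumulates the local diagonal of Q^{-1} from a truncated set of stored eigenmodes. It is not the authors' implementation: the storage layout, the names and the assumption of normalized eigenvectors (so that the denominator of eq. (4) equals 1) are ours.

    #include <complex.h>

    /* Truncated eigenmode approximation of the local diagonal of Q^{-1}:
     * psi[i][x] holds component x of the i-th (normalized) eigenvector on
     * the local sub-lattice, lambda[i] the corresponding real eigenvalue. */
    void approx_diag_Qinv(int n_modes, long n_local,
                          double complex **psi, const double *lambda,
                          double complex *diag /* out: n_local entries */)
    {
        for (long x = 0; x < n_local; x++)
            diag[x] = 0.0;

        for (int i = 0; i < n_modes; i++)       /* sum over the O(300) modes */
            for (long x = 0; x < n_local; x++)
                diag[x] += psi[i][x] * conj(psi[i][x]) / lambda[i];
    }

The disconnected term of eq. (3) is then obtained by summing such diagonal contributions over the color-spin components of each lattice site.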
We compute the eigensystem by means of the implicitly restarted Arnoldi method [14]. A crucial ingredient of our approach is a Chebyshev acceleration technique. The spectrum is transformed such that the Arnoldi eigenvalue determination becomes uniformly accurate for the entire part of the spectrum we aim for. We work on a space-time lattice of size 16³ x 32. The Dirac matrix acts on a 12 x 16³ x 32 = 1,572,864-dimensional vector space. Hence, the inversion of the entire Dirac matrix is not feasible, requiring about 40 TB of memory, whereas the determination of 300 low-lying eigenvectors requires only about 7.5 GB of memory. Our computations are based on canonical ensembles of 200 field configurations with n_f = 2 flavors of dynamical sea quarks [13]. It takes us about 3.5 Tflop-hours to solve for 300 low-lying modes per ensemble for each quark mass. Altogether we aim at O(30) valence mass/coupling combinations. First physical results of our computations can be found in Refs. [17, 9, 18, 19].

Table 1
Read and write performances for three compute partitions.

  # of compute nodes                           16      32      64
  # of eigenmodes                             100     200     300
  read performance per I/O node [MBytes/s]   11.9    13.3    11.1
  write performance per I/O node [MBytes/s]  11.9    13.4    13.2

The typical compute partitions used for the TEA on ALICE range from 16 to 64 nodes, depending on the number of eigenmodes required for the approximation as well as on the memory available. Each node reads its specific portion of a given eigenmode, corresponding to the sub-lattice assigned to this node. In our case, a simple regular space decomposition of the 16³ x 32 lattice in the z and t directions is applied. Physically, the 300 eigenmodes for each field configuration are stored as a single large file in round-robin manner. In our application we use MPI-IO calls (MPI_File_read_at()) instead of standard I/O to read from the PVFS; a sketch of such a call is given below. In Table 1, performance averages over four measurements are presented. The results fluctuate only marginally. As to be expected from Figure 5, the throughput for read operations is as high as 13 MB/s, which is the actual hard disk performance. Typically, the computation takes about 10 minutes. Another 10 minutes are needed for loading all 300 modes from the local disk. Loading the data from an external archive (which is the usual procedure, since local disks do not provide enough capacity for an entire ensemble of field configurations and eigenmodes) lasts about 30 minutes. PVFS via ParaStation/Myrinet cuts down the I/O times to less than 20 seconds for both reading and writing. This is a substantial acceleration compared to local or remote disk I/O.
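A minimal sketch of such an MPI_File_read_at() access follows; the offset computation, the argument names and the assumption that the needed portion of an eigenmode lies contiguously in the file are illustrative only, not the actual TEA code.

    #include <mpi.h>

    /* Read the locally needed part of eigenmode 'mode' from the single large
     * eigenmode file. 'local_offset_bytes' and 'local_doubles' describe the
     * sub-lattice portion owned by this rank; the layout is hypothetical. */
    int read_eigenmode_part(const char *path, int mode,
                            MPI_Offset mode_size_bytes,
                            MPI_Offset local_offset_bytes,
                            int local_doubles, double *buf)
    {
        MPI_File fh;
        MPI_Offset off = (MPI_Offset)mode * mode_size_bytes + local_offset_bytes;

        MPI_File_open(MPI_COMM_WORLD, (char *)path, MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        /* Every rank reads its own disjoint region at an explicit offset. */
        MPI_File_read_at(fh, off, buf, local_doubles, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);
        return MPI_File_close(&fh);
    }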
7. SUMMARY

We have demonstrated that the ParaStation3 communication system speeds up the performance of parallel I/O on cluster computers such as ALICE. I/O benchmarks with PVFS using ParaStation over Myrinet achieve a throughput for write operations of up to 1 GB/s from a 32-processor compute partition, given a 32-processor PVFS I/O partition. These results outperform known benchmark results for PVFS on 1.28 Gbit Myrinet by more than a factor of 2, a fact that is mainly due to the superior communication features of ParaStation. Read performance from buffer cache reaches up to 2.2 GB/s, while reading from hard disk saturates at the cumulative hard disk performance. The I/O performance achieved with PVFS using ParaStation enables us to carry out extremely data-intensive eigenmode computations on ALICE in the application field of lattice quantum chromodynamics. For the future, we plan to utilize the I/O system for storing and processing mass data in high energy physics data analysis on clusters.

ACKNOWLEDGMENTS

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) as project "Alpha-Linux-Cluster" (Ti 264/6-1 & Li 701/3-1). Physics is supported under Li 701/4-1 (RESH
Forschergruppe FOR 240/4-1), and by the EU Research and Training Network HPRN-CT-2000-00145 "Hadron Properties from Lattice QCD".

REFERENCES
[1] Myricom Inc. Myrinet generic communication software GM. URL: http://www.myri.com/myrinet/performance/index.html, 2002.
[2] Myricom Inc. Myrinet. URL: http://www.myri.com, 2002.
[3] Pallas GmbH. Pallas MPI Benchmark. URL: http://www.pallas.com/e/products/pmb, 2002.
[4] Parallel Virtual File System. URL: http://parlweb.parl.clemson.edu/pvfs/index.html, 2002.
[5] ParTec AG. ParaStation Cluster Middleware. URL: http://www.par-tec.com/index.html, 2002.
[6] PC Cluster Consortium. SCore Cluster System Software. URL: http://pdswww.rwcp.or.jp/score/dist/score/html/en/index.html, 2002.
[7] Veridian Systems. OpenPBS. URL: http://www.openpbs.org, 2002.
[8] H. Arndt, G. Arnold, N. Eicker, D. Fliegner, A. Frommer, R. Hentschke, F. Isaila, H. Kabrede, M. Krech, Th. Lippert, H. Neff, B. Orth, K. Schilling, W. Schroers, and W. Tichy. Cluster-Computing und Computational Science mit der Wuppertaler Alpha-Linux-Cluster-Engine ALiCE. Praxis der Informationsverarbeitung und Kommunikation, 25. Jahrgang, 1/2002, Saur Publishing, München, Germany, 2002.
[9] N. Attig, T. Lippert, H. Neff, K. Schilling, and J. Negele. Giant eigenproblems from lattice gauge theory on CRAY T3E systems. Comput. Phys. Commun., 142:196-200, 2001.
[10] Tim Bray. bonnie++. URL: http://www.textuality.com/bonnie, 2002.
[11] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, GA, pages 317-327, 2000.
[12] I. Hip, T. Lippert, H. Neff, K. Schilling, and W. Schroers. Instanton dominance of topological charge fluctuations in QCD? Phys. Rev., D65:014506, 2002.
[13] Th. Lippert. Sea Quark Effects on Spectrum and Matrix Elements -- Simulations of Quantum Chromodynamics with Dynamical Wilson Fermions. Habilitationsschrift, Bergische Universität Wuppertal, Germany, 2001.
[14] Kristi Maschhoff and Danny Sorensen. PARPACK. URL: http://www.caam.rice.edu/~kristyn/parpack_home.html, 2002.
[15] H. Meuer et al. Top500 list. URL: http://www.top500.org, Nov. 2002.
[16] I. Montvay and G. Münster. Quantum Fields on a Lattice. Cambridge, UK: Univ. Press, 491 p. (Cambridge Monographs on Mathematical Physics), 1994.
[17] H. Neff, N. Eicker, T. Lippert, J. W. Negele, and K. Schilling. On the low fermionic eigenmode dominance in QCD on the lattice. Phys. Rev., D64:114509, 2001.
[18] Hartmut Neff, T. Lippert, J. W. Negele, and K. Schilling. The eta' signal from partially quenched Wilson fermions. 2002.
[19] K. Schilling, H. Neff, N. Eicker, Th. Lippert, and J. W. Negele. A new approach to eta' on the lattice. Nucl. Phys. Proc. Suppl., 106:227-229, 2002.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications
G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors)
© 2004 Elsevier B.V. All rights reserved.
PRFX: a runtime library for high performance programming on clusters of SMP nodes

B. Cirou a, M.C. Counilh a, and J. Roman a*

a LaBRI UMR 5800, Université Bordeaux 1 & ENSEIRB, F-33405 Talence, FRANCE

We present PRFX, a library for implementing parallel algorithms with both static and dynamic properties and executing them on clusters of SMP nodes. The parallelism is expressed through RPCs (tasks) working on special shared data. An inspector builds the task DAG, which is used to schedule and guide the parallel execution. The implementation of PRFX uses POSIX threads with one-sided communications. We illustrate PRFX by experimental results for LU matrix factorization and a Laplace equation solver on an IBM SP3.

1. INTRODUCTION

In recent years, we have seen the development of Symmetric Multiprocessor computers (SMP) with multi-threaded operating systems and also reentrant libraries. Several SMP machines are currently available, such as IBM SP nodes, Sun HPC servers, DEC AlphaServers, and the SGI Origin 3800. Several approaches are possible when programming parallel applications on clusters of SMPs [12, 11]. The message passing standard MPI [2] allows a portable low-level MPMD programming model. Data distribution, communication schemes, communication/computation overlap and load balance must be specified in order to achieve high performance. However, an optimized MPI program heavily relies on the hardware, and thus performance portability is achieved through conditional compilation. In this case, SkaMPI [13] proposes benchmarks of MPI implementations to help the programmer. Moreover, the MPI model with one process per processor does not match the SMP node architecture. Some OpenMP directives can then be added to an existing MPI program [7], but there is no guarantee of any performance improvement over the MPI code. From these facts, it is interesting to have a look at high-level parallel languages. For example, HPF [1, 9] allows the user to specify data distribution and data-parallel computations on global shared arrays. Data access must be regular in order to allow the HPF compiler to generate efficient communications. For irregular cases, a library (Halo [6]) based on an inspector/executor scheme should be used, but performance can only be achieved by using low-level constructs of the library. OpenMP [3] provides an easier way to develop parallel programs through task parallelism using a shared memory. But synchronisation in irregular programs may be difficult to express (use of locks). Currently, these languages achieve good scalability only for special classes of regular algorithms.

*ScAlApplix project, INRIA Futurs
To our knowledge, these languages do not propose primitives for exploiting the SMP cluster hierarchy, except the HPF-like Vienna FORTRAN language [5]. Between these two extremes, there exist extended languages and runtime libraries using the task DAG (Directed Acyclic Graph) model for describing parallel algorithms. In this model, one has a partial order over communications and computations, whereas an MPI program is one execution instance of a task DAG. This model is more general than those proposed by HPF and OpenMP, since a task DAG can describe any parallel program. Two main languages use task DAGs and support irregular parallel programming and dynamic execution: Jade [14] and Athapascan [8]. They propose on-the-fly parallelism discovery from a source code that is nearly sequential but is annotated to describe concurrency. As the communication is implicit, the user can concentrate on writing the parallel algorithm. Jade and Athapascan respectively extend the C and C++ languages with task creation and the specification of the data access type for each task. At runtime, the data dependencies between tasks are computed only by using their arguments, and the tasks are assigned to the CPUs with a load balancing strategy. These two languages do not manage well strided or linked shared objects. The overhead due to data copying, on-the-fly DAG building and resolution of data dependencies can lead to poor performance. RAPID [10] is a C library restricted to static parallel programs. The parallel program is divided in two parts: (1) a special sequential function that declares all the tasks of the DAG and their data accesses; (2) a set of functions which are associated with each task type. The inspector executes part one of the code and statically computes data dependencies over the arguments of each task. The user does not have a shared view of the memory, so pointers to shared data must be obtained through a RAPID call. This approach offers good performance for static irregular computations (for example sparse Cholesky factorization). RAPID does not exploit SMP shared memory but uses one-sided communications on preallocated memory buffers. This paper describes a runtime library (herein referred to as PRFX) for implementing and running parallel algorithms with both static and dynamic properties. Our programming paradigm is similar in spirit to the one of Athapascan or Jade. The parallel execution is based on and guided by the task DAG, which is computed during a partial pre-execution of the code (inspection phase). More precisely, we adopt an intermediate position between a statically known task DAG (cf. RAPID) and a DAG built on the fly (cf. Jade and Athapascan); thus, we allow static properties of the code to be analyzed (inspection phase) and we leave the possibility to solve dynamic features of the application during the parallel execution. After the inspection phase, a task schedule is computed. The parallel execution then follows this task schedule, but it can be computed again in order to manage unpredictable machine or program behaviors. Our work differs from the previous approaches in that we propose task migration and an execution model which matches the SMP architecture and the task DAG structure. That is the reason why we choose threads for an efficient use of the shared memory (iso-memory as presented in [4]) and one-sided communications for implementing the edges of the DAG. Moreover, PRFX manages data as sets of address space intervals and not as contiguous objects. Our shared memory support also allows data migration while keeping the validity of pointers.
The paper is organized as follows. Section 2 presents our PRFX environment and in section 3, we give experimental results achieved from an implementation on an IBM SP3.
2. DESCRIPTION OF PRFX
The proposed programming model is close to both the RPC and the shared memory programming models. The RPCs are asynchronous for the initiator (caller) but are synchronized with other RPCs. Data consistency is computed automatically by following the sequential semantics of the code and by using information given by the programmer. To do so, PRFX adopts the classic inspector/executor scheme. In particular, the shared data (called here shareable data) used in the arguments of the RPCs are inspected in order to generate all necessary communications, but with the possibility to solve some dependencies on the fly. The parallel program is written in the C language with calls to the PRFX library for specifying RPCs and describing their data accesses. Once written, the parallel program is compiled and linked with the PRFX library; then it is launched like an MPI program, but with one process per SMP node, as shown on the left of Figure 1. We call shareable data those data which are involved in the parallel computation. The only data types that can be stored in a shareable data are scalar types or pointers to shareable data. Pointer validity is kept thanks to our memory support, which also allows data migration. A shareable data is either a static global data (restricted to Read-Only access mode), a data obtained through a specific memory allocation call, or a set of shareable data. In fact, a shareable data is described by a set of address space intervals. Thus, a shareable data could potentially describe any shape of data. However, we only provide support through stencils for describing geometric shapes. Those stencils are horizontal and vertical lines, rectangles, regular parallelograms, and right-angled triangles. A valid shareable data access is the association of a pointer with a stencil which must stay inside a shareable data. For example, a triangular stencil inside a square matrix can be used to access the upper triangular part of the matrix. A stencil is said to be fuzzy if one of its parameters (width, height, stride length, scale factor, base pointer, ...) depends on values computed during the parallel execution. Such stencils will generate fuzzy data dependencies that can only be computed at runtime. A task refers to the execution of the function targeted by an RPC, and an RPC is performed through a call to the PRFX function PRFX_call(num, func_ptr, arg1, ..., argN). The first argument allows the tasks to be mapped onto a virtual grid of processors. All tasks that have the same number will be executed by the same thread. An RPC argument is either a scalar value or a pointer to a shareable data. Note that setting an RPC argument to a pointer to a linked list will only transmit the first cell and not all the cells (no serialization). In order to help the inspection, the programmer must explicitly declare the access mode (Read-Only, Read-Write, ...) of the arguments of each RPC. Since our inspector only operates on the arguments and does not perform a whole-code analysis, data access inside a task is restricted in the following way: (1) data access is limited to its arguments, its allocations and local variables; (2) after issuing an RPC, the data access mode of the transmitted shareable data becomes more restrictive. Two cases are possible: an argument passed in RO mode can only be read after the call; an argument passed in RW mode cannot be accessed after the call, but it still can be used without any restriction for issuing subsequent RPCs.
Any shareable data distribution can be done by distributing its initialization. To do so, an RPC must be issued on each piece of this shareable data.
Figure 1. PRFX runtime description
Inspector/scheduler phase. The inspection phase builds the task DAG for a specific parallel execution by using a set of structuring data. Then a scheduling algorithm is applied to this task DAG in order to get the task vector of each executor thread. This step is done by a single CPU while all the others are waiting for the end of the inspection. The RPC tree must be built and fixed from a partial execution (inspection) of the code on an input set of structuring data. This means that no new task should appear during the parallel execution (so the RPC tree must be exact or overestimated). This assumption avoids the problem of parallel program termination. Thus, the inspector executes almost all the program sequentially (the leaves of the RPC tree are bypassed), intercepts RPCs and computes data dependencies for the arguments of each task (cf. right part of Figure 1). These computations require performing many intersections on the sets of address intervals describing the shareable data. For each argument of a task, we have to compute the set of tasks it depends on. In particular, to compute the data communications we use the following rule: for a given argument, the data will come from all the previous tasks that own pieces of this shareable data in Read-Write mode. Note that, in our current implementation of the inspector, the programmer controls the level of inspection by specifying which leaves of the RPC tree must be executed. Therefore, the programmer must know which stencils are fuzzy, and he has to declare them as such. The task DAG fixes a partial execution order over the tasks, and a better efficiency can be reached by applying a static scheduling. This scheduling could be computed by using a library of scheduling heuristics. Information from the user, such as data distribution, task costs and a machine model, must be the parameters of this step. Thus, the SMP hierarchy could be taken into account here. This initial static schedule could be changed thanks to task migration during the parallel phase, even if a load balance has been statically computed. Indeed, due to the lack of a machine model or to unpredictable task costs, the parallel execution can become unbalanced. So load monitoring should be used to decide when to recompute a schedule, as shown on the right part of Figure 1.
Parallel execution. The parallel execution follows the MPMD model. One process is launched per SMP node and one thread is created per processor. Each thread knows the task DAG built by the inspector and has its own task scheduling vector computed by the static scheduler. A thread is also created for managing task communications. The parallel execution starts with the execution of
     1  void RPC_LUMatrixBlk2D(uint32_t nb_blk, uint32_t blk_dim, matrix A)
     2  {
     3      uint32_t i, j, k;
     4      blk2d Akk, Lkk, Aik, Ukk, Aki, Akj, Aij;
     5      ulong blk_w_sz = blk_dim * sizeof(elem_t);
     6      ulong blk_sz   = blk_dim * blk_w_sz;
     7
     8      PRFX_stencilDeclBlk2D(&Akk, blk_dim, blk_w_sz, blk_sz);
     9      PRFX_stencilDeclLDiag(&Lkk, blk_dim, blk_w_sz, sizeof(elem_t), sizeof(elem_t));
    10      PRFX_stencilDeclUDiag(&Ukk, blk_dim, blk_w_sz, blk_w_sz, sizeof(elem_t));
    11      PRFX_stencilDeclBlk2D(&Aki, blk_dim, blk_w_sz, blk_sz);
    12      PRFX_stencilDeclBlk2D(&Aik, blk_dim, blk_w_sz, blk_sz);
    13      PRFX_stencilDeclBlk2D(&Akj, blk_dim, blk_w_sz, blk_sz);
    14      PRFX_stencilDeclBlk2D(&Aij, blk_dim, blk_w_sz, blk_sz);
    15
    16      for (k = 0; k < nb_blk; k++)
    17      {
    18          // Ukk and Lkk point to the upper and lower
    19          // triangular part of Akk
    20          Ukk = Lkk = Akk = A[k][k];
    21          PRFX_call(k*nb_blk+k, RPC_LUBlk2D, blk_dim, &Akk);
    22
    23          for (i = k + 1; i < nb_blk; i++)
    24          {
    25              Aki = A[i][k];
    26              Aik = A[k][i];
    27              PRFX_call(k*nb_blk+i, RPC_trsm, blk_dim, 'L', &Aki, &Lkk);
    28              PRFX_call(i*nb_blk+k, RPC_trsm, blk_dim, 'U', &Aik, &Ukk);
    29          }
    30          for (j = k + 1; j < nb_blk; j++)
    31          {
    32              Akj = A[j][k];
    33              for (i = k + 1; i < nb_blk; i++)
    34              {
    35                  Aij = A[j][i];
    36                  Aik = A[k][i];
    37                  PRFX_call(i*nb_blk+j, RPC_gemm, blk_dim, &Aij, &Aik, &Akj);
    38              }
    39          }
    40      }
    41  }

Figure 2. PRFX source code of the block 2D LU factorization.

Figure 3. Parallel times for Block 2D LU.

Figure 4. Parallel times for 1D Laplace.
574 a structuring data set reduced to " the number of blocks of the matrix (nb_blk x nb_blk), the dimension of the blocks (blk_dim x blk_dim), and the start pointer on matrix A; (2) a 2D block cyclic distribution can be chose for load balancing. The matrix is stored by column of square blocks and we use some IBM BLAS routines (trsmO and gemmO). Figure 2 gives the code of the function RPC_LUMatrixBIk2D(...) that implements the three classic loops of the LU algorithm. The parts missing are the main(...) which calls RPC_LUMatrixBIk2D(...), RPC access declarations, allocation (PRFXisoMalloc(...)) and distributed initialization of the matrix by using RPC calls. The two first arguments of the function RPC_LUMatrixBIk2D(...) are scalar integer values and the last one is the start pointer of the matrix allocated in iso-memory. From lines 4 to 14, several pointers to matrix blocks are declared and associated with stencils. Two kinds of stencils are u s e d " triangular (PRFX_stencilDeclLDiag(...), PRFX_stencilDeclUDiag(...)) and rectangular (PRFX_stencilDeclBlk2D(...)). Note that the parameters of these stencils are directly computed from the structuring data. So there are no fuzzy stencils in that case. The code from line 16 to the end is identical to the sequential one. The first argument of PRFX_call(... ), which is used for task mapping, is set in order to have a block cyclic distribution. In the inspection step, the function RPC_LUMatrixBIk2D(...) is executed whereas all other functions called from RPC_LUMatrixBIk2D(...) are not executed as they do not make any call to PRFX. This algorithm generates enough parallelism to supply all the processors and is irregular enough to show the interest of using PRFX. But this algorithm leads to a large DAG with 6)(nb_blk 3) tasks. So, this example is not really favorable to our inspector support. The histogram on the figure 3 shows 9 total LU factorization times for 3 different matrix sizes" 4000 x 4000 (nb_blk = 40, blk_dim = 100), 6000 x 6000 and 8000 x 8000 achieved for 3 different SMP node configurations " 1 x 1 grid, 2 x 2 grid, and 3 x 3 grid SMP nodes of 16 processors. The first layer corresponds to the inspection time, the second one to the launch time of all the RPC and the third one to the effective parallel time needed for computing the LU factorization. As expected, the LU inspection time is constant for a given problem size and grows with the number of blocks in the matrix (see Figure 3). As there is a cubic number of tasks, the inspection time is dominant. We have reported the time taken by RPC_LUMatrixBIk2D(...) (RPC launch in the histogram) as our implementation sequentially launches the RPC. This time is high since each task requires a small message to get its parameter list and these communications are sequentialized. On 16 processors, this time is low as this parameter list is copied in shared memory. PRFX achieves a good scalability on the parallel LU factorization when passing from one to four SMP nodes. With more SMP nodes, we still go faster but not significantly regarding to the number of CPU used. In fact, for the chosen size of block (100 x 100) the overhead due to the management of a very large number of tasks grows too much with the number of SMP nodes. If we only consider the parallel phase, we achieve per processor 9950/420/220 MFlops for 40 x 40 problem size, 970/640/330 MFlops for 60 x 60 problem size, and 950/750/420 MFlops for 80 x 80 problem size. The POWER 3 375MHz have a theoretical peak of 1500 MFlops. 
For all problem sizes, we achieve the same number of MFlops on a single node since there is
no communication. Then this number of MFlops decreases, first when there are not enough tasks (40 x 40) and then when more SMP nodes are involved in the communication. We also conducted some tests with the IBM implementation (PESSL) of the LU algorithm on 1 to 9 NH2 nodes. Here PESSL uses shared memory for intra-node communications. The parallel phase of PRFX is about 1.2 to 1.6 times faster on small matrices (8000 x 8000 with blocks of 100 x 100) and almost equal for medium matrices (16384 x 16384 with blocks of 128 x 128). The times and efficiencies for the parallel phase are good, but the sequential inspection is costly due to the large number of tasks of the DAG.
1D Laplace Equation Solver. We have also implemented a block 1D version of the Laplace algorithm using the PRFX library. The mesh is stored by blocks of lines and is split into as many blocks as processors involved in the parallel computation. This algorithm has the following static properties: (1) the task DAG can be fully built from a structuring data set reduced to: the number of CPUs, the mesh width, the mesh height (415800), the number of iterations, the start pointer of the initial mesh and the pointer of the updated mesh; (2) a 1D block distribution gives both good data locality and good load balance. The task DAG for the Laplace algorithm is regular, with N x k tasks for a mesh of N blocks (N equals the number of CPUs) and k iterations (here k = 50). This algorithm shows the ability of PRFX to automatically generate neighborhood communications, since tasks use shareable data that overlap on their boundaries. The histogram in Figure 4 shows 12 total execution times for the Laplace algorithm for different mesh sizes: 415800 x 200, 415800 x 400, 415800 x 800. Here the RPC launch is performed in parallel, so this duration is included in the Laplace computation time. The first tiny layer corresponds to the sequential inspection time and the second one to the time taken by the Laplace computation. As the task DAG holds N x k tasks, the inspection cost grows with the number of CPUs. Here the inspection time varies from 0.1 s, when the mesh is split into 15 blocks, to 1 s when it is split into 120 blocks. So, this second example is much more favorable than the LU example. The block height is computed as: blk_height = 415800 / number of CPUs. In all cases, PRFX achieves a good speedup when passing from 1 to 2 SMP nodes; the parallel time is not significantly reduced with more nodes for small problem sizes. In fact, the more the number of CPUs increases, the smaller the block height becomes. As PRFX adds an overhead due to task management, this overhead cannot be masked when tasks are fine-grained (415800 x 200). When we double the problem size and use two SMP nodes instead of one, the time should theoretically be constant. For example, if we compare the first with the sixth histogram bar, PRFX only adds an overhead of 15% for managing tasks and communications. This is a good result, as the communication times for 16 CPUs are constant (intra-node). In the same way, we can see that the overhead from 2 to 4 SMP nodes is 25%, and 35% from 4 to 8 SMP nodes.
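To make the per-block mesh update concrete, the following self-contained C fragment shows one generic 5-point Jacobi sweep over a block of mesh rows together with its two halo rows. It is only an illustration of the overlapping block access described above, not PRFX code, and the data layout is an assumption.

    /* Generic 5-point Jacobi sweep over one block of the mesh.
     * 'in' and 'out' point to (rows + 2) x w values: the first and last rows
     * are halo rows owned by the neighbouring blocks (the overlapping parts
     * of the shareable data). Only the 'rows' interior rows are updated. */
    void laplace_block_sweep(int rows, int w, const double *in, double *out)
    {
        for (int i = 1; i <= rows; i++)
            for (int j = 1; j < w - 1; j++)
                out[i * w + j] = 0.25 * (in[(i - 1) * w + j] + in[(i + 1) * w + j]
                                       + in[i * w + j - 1]  + in[i * w + j + 1]);
    }

In the PRFX version, the two halo rows are exactly the pieces of shareable data that overlap between neighbouring tasks, which is what lets the inspector derive the neighborhood communications automatically.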
4. CONCLUSION AND FUTURE WORK

We have presented a runtime library for clusters of SMP nodes, dedicated to a specific class of parallel algorithms having static and dynamic properties. Results are promising thanks to the use of threads and one-sided communications; since the PRFX library is still under development, performance is likely to improve. We also plan to make some optimizations in order to keep good scalability. These optimizations will consist of using a scheduling algorithm dedicated to SMP clusters and of reducing the task management overhead. They will be validated on benchmarks for a sparse Cholesky factorization. From a dynamic point of view, we will implement a task migration manager in order to achieve hierarchical dynamic load balancing, as well as a distributed solver of fuzzy data dependencies. We have implemented and tested a centralized naive version of the fuzzy solver on the LU algorithm with pivoting. Managing these two kinds of dynamic behaviors will allow PRFX to execute a wider class of programs.

REFERENCES
[1] HPF: High Performance Fortran language specification. http://www.crpc.rice.edu/HPFF.
[2] MPI: Message Passing Interface Forum. http://www.mpi-forum.org.
[3] OpenMP Language Application Program Interface Specification. http://www.openmp.org.
[4] G. Antoniu, L. Bougé, and R. Namyst. An efficient and transparent thread migration scheme in the PM2 runtime system. In Runtime Systems for Parallel Programming, volume 1586 of Lect. Notes in Comp. Science, pages 496-510. Springer-Verlag, 1999.
[5] S. Benkner and T. Brandes. Compiling data parallel programs for clusters of SMPs. 9th International Workshop on Compilers for Parallel Computers, CPC 2001, June 2001.
[6] T. Brandes. HPF library and compiler support for halos in data parallel irregular computations. Parallel Processing Letters, 10(2/3):189-200, 2000.
[7] F. Cappello and D. Etiemble. MPI versus MPI+OpenMP on IBM SP for the NAS benchmarks. Supercomputing, November 2000.
[8] Gerson G. H. Cavalheiro, François Galilée, and Jean-Louis Roch. Athapascan-1: Parallel programming with asynchronous tasks. Yale Multithreaded Programming Workshop, USA, June 1998.
[9] Chris H. Q. Ding. High Performance Fortran for practical scientific algorithms: An up-to-date evaluation. Journal of Future Generation Computer Systems, 15(3):343-352, May 1999.
[10] C. Fu and T. Yang. Run-time techniques for exploiting irregular task parallelism on distributed memory architectures. Journal of Parallel and Distributed Computing, 42:143-156, May 1997.
[11] W. W. Gropp and E. L. Lusk. A taxonomy of programming models for symmetric multiprocessors and SMP clusters. Proceedings of Programming Models for Massively Parallel Computers, pages 2-7, 1995.
[12] Attila Gursoy and Ilker Cengiz. Mechanisms for programming SMP clusters. Parallel and Distributed Processing Techniques and Applications (PDPTA '99), June 1999.
[13] R. Reussner and G. Hunzelmann. Achieving performance portability with SKaMPI for high-performance MPI programs. In Computational Science - ICCS 2001, volume 2074 of Lecture Notes in Computer Science, pages 841-850, May 2001.
[14] M. C. Rinard. Applications experience in Jade. Concurrency: Practice & Experience, 10(6):417-448, May 1998.
Grids
Parallel Computing: Software Technology, Algorithms, Architectures and Applications
G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors)
© 2004 Elsevier B.V. All rights reserved.
Experiences about Job Migration on a Dynamic Grid Environment

R.S. Montero a*, E. Huedo b, and I.M. Llorente a,b

a Departamento de Arquitectura de Computadores y Automática, Universidad Complutense, 28040 Madrid, Spain

b Centro de Astrobiología, CSIC-INTA, Associated to NASA Astrobiology Institute, 28850 Torrejón de Ardoz, Spain

Several research centers share their computing resources in Grids, which offer a dramatic increase in the number of available processing and storage resources that can be delivered to applications. However, efficient job submission and management continue to be far from accessible due to the dynamic and complex nature of the Grid. A Grid environment presents unpredictable changing conditions, such as dynamic resource load, high fault rate, or continuous addition and removal of resources. We have developed an experimental framework that incorporates the runtime mechanisms needed for adaptive execution of applications on a changing Grid environment. In this paper we describe how our submission framework is used to support different migration policies on a research Grid testbed.

1. INTRODUCTION

The management of jobs within the same department is addressed by many research and commercial systems [2]: Condor, LSF, SGE, PBS, LoadLeveler... However, they are unsuitable in computational Grids, where resources are scattered across several administrative domains, each with its own security policies and distributed resource management systems. The Globus [4] middleware provides the services and libraries needed to enable secure multiple-domain operation with different resource management systems and access policies. However, the user is responsible for manually performing all the submission stages in order to achieve any functionality: system selection, system preparation, submission, monitoring, migration and termination [10]. The development of applications for the Grid continues to require a high level of expertise due to its complex nature. Moreover, Grid resources are also difficult to harness efficiently due to their heterogeneous and dynamic characteristics, namely: dynamic resource load and cost, dynamic resource availability, and high fault rate. Migration is the key issue for adaptive execution of jobs on dynamic Grid environments. Much higher efficiency can be achieved if an application is able to migrate among the Grid resources, adapting itself according to its dynamic requirements, the availability of the resources and the current performance provided by them.

*This research was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-97046 while the first and last authors were in residence at ICASE, NASA Langley Research Center, Hampton, VA 23681-2199. This research was also supported by the Spanish research grant TIC 2002-00334.
In this paper we present a new Globus experimental framework that allows an easier and more efficient execution of jobs on a dynamic Grid environment in a "submit and forget" fashion. Adaptation is achieved by implementing automatic application migration when one of the following circumstances is detected:

- Grid initiated migration: a new "better" resource is discovered; the remote resource or its network connection fails; or the submitted job is canceled by the resource administrator.

- Application initiated migration: the application detects performance degradation or performance contract violation; or self-migration when the requirements of the application change.

The rest of the paper is organized as follows. The submission framework is described in Section 2. Then the Grid and application initiated migration capabilities of the framework are demonstrated in Section 3, in the execution of a Computational Fluid Dynamics (CFD) code. Finally, Section 4 highlights related work, and Section 5 includes the main conclusions.

2. EXPERIMENTAL FRAMEWORK
The GridWay experimental submission framework provides the runtime mechanisms needed for dynamically adapting the application to a changing Grid environment. Once the job is initially allocated, it is rescheduled when performance slowdown or remote failure are detected, and periodically at each discovering interval. Application performance is evaluated periodically at each monitoring interval by executing a Performance Degradation program and by evaluating its accumulated suspension time. A Resource Selector program acts as a personal resource broker to build a list of candidate resources. Since both programs have access to files dynamically generated by the running job, the application has the ability to take decisions about its own scheduling.
Figure 1. Architecture of the Experimental Framework.
The Submission Agent (figure 1) performs all submission stages and watches over the efficient execution of the job. It consists of the following components (see [5] for a detailed description):
• The client application uses a Client API (Application Program Interface) to communicate with the Request Manager in order to submit the job along with its job template, which contains all the necessary parameters for its execution. The client may also request control operations from the Request Manager, such as job stop/resume, kill or reschedule.
• The Dispatch Manager periodically wakes up at each scheduling interval and tries to submit pending and rescheduled jobs to Grid resources. It invokes the execution of the Resource Selector corresponding to each job, which returns its own prioritized list of candidate hosts. The Dispatch Manager submits pending jobs by invoking a Submission Manager, and also decides whether the migration of rescheduled jobs is worthwhile or not.
• The Submission Manager is responsible for the execution of the job during its lifetime, i.e. until it is done or stopped. It also periodically probes, at each polling interval, the connection to the jobmanager and Gatekeeper to detect remote failures. The Submission Manager performs the following tasks:
- Prologing: preparing the RSL (Resource Specification Language) and submitting the Prolog executable. The Prolog sets up the remote system, transfers the executable and input files, and, in the case of a restart execution, also transfers the restart files.
- Submitting: preparing the RSL, submitting the Wrapper executable, monitoring its correct execution, updating the submission states via Globus callbacks, and waiting for migration, stop or kill events from the Dispatch Manager. The Wrapper wraps the actual job in order to capture its exit code.
- Epiloging: preparing the RSL and submitting the Epilog executable. The Epilog transfers back the output files on termination, or the restart files on migration, and cleans up the remote system.
• The Performance Monitor periodically wakes up at each monitoring interval. It requests rescheduling actions to detect "better" resources when performance slowdown is detected and at each discovering interval.
The flexibility of the framework is guaranteed by a well-defined API for each Submission Agent component. The framework has been designed to be modular, to allow extensibility and improvement of its capabilities. The following modules can be set on a per-job basis: Resource Selector, Performance Degradation Evaluator, Prolog, Wrapper, and Epilog.
2.1. Resource selector
Due to the heterogeneous and dynamic nature of the Grid, the end user must establish the requirements which must be met by the target resources and an expression to assign a rank to each candidate host. Both may combine static machine attributes (operating system, architecture, ...) and dynamic status information (disk space, processor load, ...). Different strategies for application-level scheduling can be implemented, see for example [1, 8, 6]. The Resource Selector used in the experiments consists of a shell script that queries MDS for potential execution hosts, following these criteria:
• Host requirements are specified in a host requirement file, which can be dynamically generated by the running job. The host requirement setting is an LDAP filter, which is used by the Resource Selector to query Globus MDS and so obtain a preliminary list of potential hosts. In the experiments below, we impose two constraints: a SPARC architecture and a minimum main memory of 512MB, enough to accommodate the CFD simulation.
• A rank expression based on workload parameters is assigned to each potential host. Since our target application is a computing-intensive simulation, the rank expression benefits those hosts with less workload and hence better performance. The following expression was considered:
rank = \begin{cases} FLOPS & \text{if } CPU_{15} \geq 1 \\ FLOPS \cdot CPU_{15} & \text{if } CPU_{15} < 1 \end{cases} \qquad (1)

where FLOPS is the peak performance achievable by the host CPU, and CPU15 is the average load in the last 15 minutes.
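Equation (1) translates directly into a few lines of code. The sketch below only shows the arithmetic as printed above; how FLOPS and CPU15 are actually obtained from MDS is outside its scope, and no particular MDS attribute names are assumed.

```python
def rank(flops: float, cpu15: float) -> float:
    """Rank of a candidate host, a direct transcription of equation (1):
    the peak performance FLOPS, scaled by CPU15 whenever CPU15 < 1."""
    return flops if cpu15 >= 1.0 else flops * cpu15
```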
2.2. Prolog and epilog
File transfers are performed through a reverse-server model. The file server (GASS or GridFTP) is started on the local system, and the transfer is initiated on the remote system. The executables (one per architecture) and input files are assumed to be stored in an experiment directory. In the experiments, the Prolog and Epilog modules were implemented with a shell script that uses Globus transfer tools (i.e. globus-url-copy) to move files to/from the remote host. The files which, being generated on the remote host by the running job, have to be accessible to the local host during job execution are referred to as dynamic files (the host requirement, rank expression and performance profile files). Dynamic file transfer is not possible through a reverse-server model in closed systems, such as Beowulf clusters. This problem has been overcome by using a file proxy on the front-end node of the remote system.
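As an illustration of the reverse-server model, a Prolog of this kind essentially pulls the experiment files onto the remote host from the file server started on the submission side. The sketch below is hypothetical: the URL layout, file names and the use of globus-url-copy in its plain source/destination form are assumptions made for illustration, not the actual GridWay Prolog.

```python
import subprocess

def prolog(server_url: str, remote_dir: str, files: list, restart: bool = False) -> None:
    """Runs on the remote resource and pulls the executable, input and
    (optionally) restart files from the file server started on the submission
    host (reverse-server model).  remote_dir is an absolute path."""
    if restart:
        files = files + ["restart.chk"]        # hypothetical checkpoint file name
    for name in files:
        src = f"{server_url}/{name}"           # e.g. gsiftp://local-host/experiment/<name>
        dst = f"file://{remote_dir}/{name}"
        # basic source/destination invocation of the Globus transfer tool
        subprocess.run(["globus-url-copy", src, dst], check=True)
```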
2.3. Performance and job monitoring
The GridWay framework provides two mechanisms to detect performance slowdown (both checks are sketched below):
• A Performance Degradation Evaluator (PDE) is periodically executed at each monitoring interval by the Performance Monitor to evaluate a rescheduling condition. In our experiments, the solver of the CFD code is an iterative multigrid method. The time consumed in each iteration is appended by the running job to a dynamic performance profile file. The PDE verifies at each monitoring interval whether the time consumed in each iteration is higher than a given threshold. This performance contract and contract monitor are similar to those used in [3].
• A running job could be temporarily suspended by the resource administrator or by the local queue scheduler on the remote resource. The Submission Manager keeps count of the overall suspension time of its job and requests a rescheduling action if it exceeds a given threshold.
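A minimal sketch of both checks, assuming the performance profile file simply contains one iteration time per line; the file name handling, thresholds and error handling are illustrative choices:

```python
def needs_rescheduling(profile_path: str, iteration_threshold: float,
                       suspension_time: float, max_suspension: float) -> bool:
    """Return True when the performance contract is violated (some iteration
    took longer than the threshold) or the accumulated suspension time on the
    remote resource exceeds its own threshold."""
    try:
        with open(profile_path) as f:          # dynamic performance profile file
            iteration_times = [float(line) for line in f if line.strip()]
    except FileNotFoundError:                  # job has not reported yet
        iteration_times = []
    contract_violated = any(t > iteration_threshold for t in iteration_times)
    return contract_violated or suspension_time > max_suspension
```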
3. EXPERIENCES We next demonstrate the migration capabilities of the experimental framework in the execution of a CFD code. The target application is an iterative robust multigrid algorithm that solves
the 3D incompressible Navier-Stokes equations [9]. Application-level checkpoint files are generated at each multigrid iteration. In all the experiments, the monitoring, polling and scheduling intervals were set to 10 seconds. The following experiments were conducted on the TRGP research testbed, whose main characteristics are described in Table 1.
Table 1
TRGP (Tidewater Research Grid Partnership) resource characteristics

Host          Model                     Nodes  OS         Memory  Peak Performance  VO
coral         Intel Pentium II, III, 4    104  Linux 2.4  56GB    89 Gflops         ICASE
whale         Sun UltraSparc II             2  Solaris 7  4GB     1.8 Gflops        ICASE
urchin        Sun UltraSparc I              2  Solaris 7  1GB     672 Mflops        ICASE
carp, tetra   Sun UltraSparc IIi            1  Solaris 7  256MB   900 Mflops        ICASE
sciclone      Sun UltraSparc II, IIi      160  Solaris 7  54GB    115 Gflops        W&M
3.1. Periodic rescheduling to detect new resources
In this case, the discovering interval has been deliberately set to a small value (60 seconds) in order to quickly reevaluate the performance of the resources. The execution profile of the application is presented in Figure 2 (left-hand chart). Initially, only ICASE hosts are available for job submission, since sciclone has been shut down for maintenance. The Resource Selector chooses urchin to execute the job, and the files are transferred (Prolog and submission in time steps 0s-34s). The job starts executing at time step 34s. A discovering period expires at time step 120s and the Resource Selector finds that sciclone presents a higher rank than the original host (time steps 120s-142s). The migration process is then initiated (cancellation, Epilog, Prolog and submission in time steps 142s-236s). Finally, the job completes its execution on sciclone. Figure 2 (left-hand chart) shows that the overall execution time is 42% lower when the job is migrated. This speedup could be even greater for larger execution times. Note that the migration time, 95 seconds, is about 20% of the overall execution time.
Figure 2. Execution profile of the application when a new "better" resource is detected (left-hand chart) and when the remote resource fails (right-hand chart).
3.2. The remote resource or its network connection fails
The Resource Selector finds whale to be the best resource, and the files are transferred (Prolog and submission in time steps 0s-45s). Initially, the job runs normally. At time step 125s, whale is disconnected. After 50 seconds, a connection failure is detected and the job is migrated to sciclone (Prolog and submission in time steps 175s-250s). The application is executed again from the beginning because the local host does not have access to the checkpoint files generated on whale. Finally, the job completes its execution on sciclone. The execution profile of the application is presented in Figure 2 (right-hand chart).
3.3. Performance degradation detected by the maximum suspension time
Sciclone is initially selected to run the job, and the files are transferred (Prolog and submission in time steps 0s-51s). The job is explicitly held just after prologing by executing the qhold PBS command on the remote system. The job is rescheduled as soon as the maximum suspension time is exceeded (40 seconds). The Resource Selector selects whale as the next resource, and the migration process is then initiated (cancellation, Prolog and submission in time steps 102s-165s). Finally, the job completes its execution on whale. The execution profile of the application is presented in Figure 3 (left-hand chart).
Figure 3. Execution profile of the application when the maximum suspension time is exceeded (left-hand chart). Execution profile of the application when a workload is executed (right-hand chart).
3.4. Performance degradation detected by a Performance Profile dynamic file
The Resource Selector finds whale to be the best resource, and the job is submitted (Prolog and submission in time steps 0s-34s). However, whale is overloaded with a compute-intensive workload at time step 34s. As a result, a performance degradation is detected when the iteration time exceeds the iteration time threshold (40 seconds) at time step 209s. The job is then migrated to sciclone (cancellation, Epilog, Prolog and submission in time steps 209s-304s), where it continues executing from the last checkpoint context. The execution profile for this situation is presented in Figure 3 (right-hand chart). In this case the overall execution time is 35% lower when the job is migrated. The cost of migration, 95 seconds, is about 21% of the execution time.
4. RELATED WORK
The need for a nomadic migration approach for job execution on a Grid environment has been previously discussed in [7]. Also, the prototype of a migration framework called the "Worm" was implemented within the Cactus programming environment [3]. In the context of the GRADS project, a migration framework that takes into account both the system load and application characteristics is described in [11]. The aim of the GridWay project is similar to that of the GRADS project: to simplify distributed heterogeneous computing. However, its scope is different. Our framework provides a submission agent that incorporates the runtime mechanisms needed for transparently executing jobs on a Grid; it is not bound to a specific class of applications; it does not require new services; and it does not necessarily require source code changes. In fact, our framework could be used as a building block for much more complex service-oriented Grid scenarios like GRADS.
5. CONCLUSIONS
We have demonstrated the migration capabilities of the GridWay experimental framework for executing jobs on Globus-based Grid environments. The experimental results are promising because they show how application adaptation achieves enhanced performance. The response time of the target application is reduced when it is submitted through the experimental framework. Simultaneous submission of several applications could be performed in order to harness the highly distributed computing resources provided by a Grid. Our framework is able to efficiently manage applications suited to execution under dynamic conditions, mainly those that can migrate their state efficiently.
ACKNOWLEDGMENTS
This work has been performed using computational facilities at The College of William and Mary which were enabled by grants from Sun Microsystems, the National Science Foundation, and Virginia's Commonwealth Technology Research Fund. We would like to thank Thomas Eidson and Tom Crockett for their help with TRGP. REFERENCES
[1] H. Dail, H. Casanova, and F. Berman. A Modular Scheduling Approach for Grid Application Development Environments. Technical Report CS2002-0708, UCSD CSE, 2002.
[2] T. El-Ghazawi, K. Gaj, N. Alexandinis, and B. Schott. Conceptual Comparative Study of Job Management Systems. Technical report, George Mason University, February 2001.
[3] G. Allen et al. The Cactus Worm: Experiments with Dynamic Resource Discovery and Allocation in a Grid Environment. Intl. J. of High-Performance Computing Applications, 15(4), 2001.
[4] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. Intl. J. of Supercomputer Applications, 11(2):115-128, 1997.
[5] E. Huedo, R. S. Montero, and I. M. Llorente. An Experimental Framework for Executing Applications in Dynamic Grid Environments. Technical Report 2002-43, ICASE-NASA Langley.
[6] E. Huedo, R. S. Montero, and I. M. Llorente. Experiences on Grid Resource Selection Considering Resource Proximity. In Proc. of 1st European Across Grids Conf., February 2003.
[7] G. Lanfermann et al. Nomadic Migration: A New Tool for Dynamic Grid Computing. In Proc. of the 10th Symp. on High Performance Distributed Computing (HPDC10), August 2001.
[8] R. S. Montero, E. Huedo, and I. M. Llorente. Grid Resource Selection for Opportunistic Job Migration. In Proc. Intl. Conf. on Parallel and Distributed Computing (EuroPar), August 2003.
[9] R. S. Montero, I. M. Llorente, and M. D. Salas. Robust Multigrid Algorithms for the Navier-Stokes Equations. Journal of Computational Physics, 173:412-432, 2001.
[10] J. M. Schopf. Ten Actions when Superscheduling. Technical Report WD8.5, The Global Grid Forum, Scheduling Working Group, 2001.
[11] S. Vadhiyar and J. Dongarra. A Performance Oriented Migration Framework for the Grid. In Proceedings of the 3rd Int'l Symposium on Cluster Computing and the Grid (CCGrid), 2003.
Security in a Peer-to-Peer Distributed Virtual Environment
J. Köhnlein a
aTechnische Informatik 4-11/1, TU Hamburg-Harburg, Schwarzenbergstraße 95, D-21073 Hamburg
1. INTRODUCTION
Distributed Virtual Environments (DVEs) are among the most demanding applications, requiring a network with a high quality of service on the one hand and high graphical and computational power of the participating hosts on the other. A DVE is a computer-generated three-dimensional scene that is generated, displayed and modified by multiple interconnected hosts simultaneously. Typical examples of DVEs are multi-player online games, military simulations and virtual reality systems for engineering, medicine or science. This contribution describes a new security architecture for DVEs that covers three main aspects: access control, authentication and confidentiality. This article is structured as follows: the rest of this section treats general issues of DVEs. The security aspects and the design goals of the security architecture are described in section 2. In section 3, a concrete scenario is presented, which is analysed in section 4. In section 5 the security architecture is developed. Finally, a conclusion and an outlook on future work are given in section 6.
1.1. Consistency and performance
Compared to other distributed systems, the state of a DVE does not require strict consistency, e.g. the position of an avatar must only be exact to the degree of anticipation of the user. Therefore, numerous techniques have been developed to trade off consistency against performance: IP multicasting can be used for scalable many-to-many communication, while the employed UDP protocol only offers best-effort delivery. The impact of lost messages can be lowered using periodic state messages. With dead reckoning, state transitions are extrapolated locally, reducing the number of necessary update messages at the cost of precision, while the transition from the predicted to the correct state can be smoothly interpolated to avoid the user noticing the difference. With a good dead reckoning mechanism, latencies can be hidden to a certain extent.
1.2. Network
Nowadays the main bottleneck for the performance of a DVE is the quality of service (QoS) of the underlying network: each active entity of the scene, such as an avatar steered by a user or a computer-controlled robot, needs to send an update message as soon as it changes its state. In large-scale environments with a huge number of active entities these update messages soon fill up the available network bandwidth. As the users want to interact with the environment in the
most realistic way, state changes have to be processed by all participating hosts instantaneously, but the transmission introduces a lot of network latency. In addition, the update messages have to be transmitted to all interested hosts in a scalable way. Experiments with short-range connections over the Internet show [1] that the delay produced by the Internet backbone under normal conditions is low compared to the latency coming from the connection between clients and their Internet Service Provider (ISP). Table 1 lists round trip time (RTT) measurements between users and their ISPs for different types of connections.

Table 1
Bandwidth and RTT between client and ISP for different network connections

Connection Type            Bandwidth [Kbit/s]   RTT client-ISP [ms]   Latency [ms] (RTT/2)
                           down-/upstream
Modem (a)                  56.6                 > 110                 > 55
DSL with interleaving (c)  768/128              50-70                 25-35
ISDN (a)                   64                   20-40                 10-20
Cable Modem (a)            1500/128             5-8                   2.5-4
ADSL Fastpath (b)          768/128              5                     2.5
WLAN (a)                   11 000               5                     2.5
Local Ethernet (a)         100 000              < 1                   < 0.5
RTT values are from (a) [1], (b) provider information and (c) own experiments.

The values are not authoritative, as latency varies a lot under different network conditions, but they give an idea of the order of the delays that have to be expected from the individual connection. It can be concluded that connections fall into five categories of latency: direct connections below 0.5ms, fast connections below 5ms, ISDN connections below 20ms, ADSL with interleaving below 35ms and Modem above 55ms. It makes sense to put constraints on the QoS of the network, excluding users with unsuitable connections, as users with high latencies will produce the most consistency problems, and users with low bandwidth will bound the scalability of the system. Users with modem connections are therefore excluded in the rest of this paper.
2. SECURITY ASPECTS
The most obvious security aspect of a DVE is access control: consider an online game in which players have to pay for participation. This game must offer a way to let players who have paid participate and to keep others out. In many other applications, such as military simulations or medical applications, the system must only accept authorised personnel. The second issue covered in this contribution is authentication: the DVE must offer its users the means to verify that a message has been sent by a legitimate host. This is even more important in the context of IP multicasting, because the subscription, and thereby the ability to send a message to a multicasting group, is completely receiver based. Forged or illegitimate messages must be detected and dropped to avoid inconsistencies and processing overhead. In a DVE, immediate authentication of messages is more important than the ability to prove a message's authenticity at a later stage (non-repudiation). In the third place, it will be shown how the confidentiality of update messages can be incorporated into the system.
2.1. Goals
Adding security features to a DVE will degrade its performance, due to additional computations, increased network traffic or higher storage requirements. The constraints for the security architecture presented in this paper are therefore:
• Avoid central services wherever possible. Replicate unavoidable central services to preserve scalability.
• Keep computational overhead low.
• Keep additional message processing latency low.
• Keep additional network traffic low.
3. SCENARIO
The DVE in this contribution is a networked computer game. The game is built according to the peer-to-peer network paradigm, in the sense that there is no central authoritative server managing the state of the game. Each host keeps track of the scene's state by itself, renders it and processes the input of exactly one player. This mapping allows us to use the terms host and player interchangeably in the following. In order to profit from the peer-to-peer architecture, the game's topic should not be too competitive. For example, first-person shooters usually have a server architecture because the incentive for cheating on the game's state is strong. State change messages are communicated over a multicasting socket. Active players can invite new players. An invitation should be revocable if a player "misbehaves". An invitation must also stay valid if the inviter leaves the environment, but, as every credential, it should expire after a certain period of time. Uninvited users can watch the scene passively in a TV-like manner, without the ability to change the game's state, to decide whether the game is interesting and worth applying for an invitation. An interested watcher only has to join the multicast group and run the game in a passive rendering mode. Provided the network routers are powerful enough, this construction scales to any number of passive watchers. The networking API is assumed to allow sending datagrams over IP-multicasting sockets. As we are concerned with a game scenario, all hosts are assumed to run a standard operating system that does not offer any additional network security services on a lower network level. As a consequence, the security architecture must be implemented entirely within the game on the application layer.
4. SECURITY ANALYSIS
A user can take one or more of the following roles: the initiator is responsible for bootstrapping the environment, setting up the multicasting group, defining the security policy, and announcing the game to the public. There is exactly one initiator for each environment. A spectator is passively watching the scene. The number of spectators is not bounded. A spectator does not have an invitation but can apply for one. An active player controls entities in the scene. An active player has been invited by another active player and can invite other spectators by sending them invitations. Active players are the peers of the peer-to-peer architecture. The set of active hosts is referred to as the active group.
For brevity, this contribution only covers two main threats. The first consists of a spectator who tries to modify the game's state without an invitation. This can be anticipated by access control. In the second threat, an active player tries to impersonate another active player in order to take over control of one or more entities owned by that player, which can be met by message authentication. Players need some mechanism to distinguish legitimate update messages from illegitimate ones. Because the header of an IP datagram can be easily forged, additional security features must be implemented to verify the origin of a message. In our scenario two kinds of authentication are investigated: group authentication and individual authentication. The purpose of group authentication is to decide whether a message has been sent by a member of the active group, while individual authentication proves that a specific active player is the originator of a message.
4.1. Trust model
In a traditional client-server architecture, the server is the central policy enforcer: the authoritative game state on the server has to be protected by the server from illegal modifications. In a peer-to-peer architecture, every peer has the same rights and obligations. Regarding the security of the environment, this means that every active host must itself enforce the security policy that protects its local game state. There is nothing like a central entity enforcing the security of the environment as a whole, because all peers are equal. Consequently, there are at least two possible ways to define the trusted computing base (TCB). In the first case, the TCB is the active group. Every active host is trusted to enforce the security policy. This setting requires that the network traffic in the active group is protected by a group credential shared only by the active hosts. In the second, more cautious case, a peer's TCB only consists of the peer itself. In this scenario each host uses individual credentials to authenticate its outgoing messages, while incoming messages have to be checked using the sender's credentials. While the first setting is easier to implement and less resource consuming, because it only deals with a single set of security parameters for the group as a whole, it is susceptible to collusion attacks: a malicious active player could tell the secret group credentials to an outsider, allowing her to intrude into the environment. In a game scenario this attack is very probable, especially when players have to pay for participation. The second setting, on the other hand, is more resource consuming, but collusion attacks are not very appealing, because a colluding player would give away her own credentials only, e.g. the ability to control her own avatar. The TCB can additionally contain a set of special trusted peers. An example is a set of group key controllers that distribute the load of the group key management (cf. 5.2). This setting is useful to make those parts of the system more scalable that suffer from the typical problems of a fully distributed architecture, such as too much additional network traffic. Note that this part of the system then becomes rather a distributed-server than a peer-to-peer architecture.
5. SECURITY ARCHITECTURE
In this section, the main components of the proposed security architecture are presented. After describing the access control mechanism in section 5.1, group authentication is covered in section 5.2, followed by a short description of synchronisation in section 5.3. Section 5.4 describes the algorithms for individual authentication. Group and individual authentication
can also be combined, but if a majority of the messages requires individual authentication, the additional overhead of group authentication can be set aside. That is why individual authentication is described in more detail.
5.1. Access control
The access control mechanism proposed here is based on digital signatures. Invitations are modelled as X.509 certificates [2]. Each certificate must contain at least the time when it becomes valid, the time when it expires, a serial number and the subject's public key. Furthermore, the certificate must be digitally signed by the certificate issuer, the inviter, using her private key. Two certificates issued by the same issuer must always have different serial numbers. To set up the access control, the initiator generates a private/public key pair and a self-signed root certificate. This certificate is used by all future players as a common trusted anchor for the validation of other certificates. A certificate is valid if there is a chain of valid certificates, each certifying the authenticity of the next one, from the root certificate to the certificate itself. When joining the game, each active player generates a public/private key pair. She sends a certificate request containing the public key and her name to the inviter, who generates a certificate using her own private key. The new certificate and the issuer's valid certificate chain are sent to the other players using the multicast socket. Note that certificates do not have to be kept secret, and in this way new certificates are deployed in a very scalable way to all active players at once. If a certificate must be revoked, the issuer multicasts a certificate revocation, containing the serial number of the certificate and a reason flag indicating why the certificate has been revoked, signed by the issuer of the certificate. Certificate revocations are collected in Certificate Revocation Lists (CRLs) that are multicast on change. For persistent storage of X.509 certificates and CRLs the Lightweight Directory Access Protocol (LDAP, [3]) can be used. LDAP directories can be replicated easily, thereby preserving scalability. Because computationally intensive asymmetric cryptography is used to generate and verify certificates, some countermeasures against denial-of-service attacks have to be considered. Each active host should limit the number of certificate requests it processes per minute. Receivers should cache already validated certificate paths and only verify the expiration dates and the new certificates.
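The chain-validation rule described above can be written down directly. The following sketch only captures the control flow; verify_signature and the certificate attribute names are hypothetical placeholders rather than the API of a real X.509 library.

```python
import time

def chain_is_valid(chain, root_cert, revoked_serials) -> bool:
    """chain[0] is the root certificate, chain[-1] the certificate under test.
    Certificates are assumed to expose public_key, serial, not_before and
    not_after; verify_signature(cert, key) is a hypothetical helper that checks
    cert's signature against the issuer's public key."""
    now = time.time()
    if chain[0] != root_cert:                      # common trusted anchor
        return False
    for issuer, cert in zip(chain, chain[1:]):
        if cert.serial in revoked_serials:         # CRL lookup
            return False
        if not (cert.not_before <= now <= cert.not_after):
            return False
        if not verify_signature(cert, issuer.public_key):
            return False
    return True
```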
5.2. Group authentication
To authenticate messages from active group members and to protect the messages' integrity, active players append a message authentication code (MAC) to every message. The MAC is calculated using a shared secret group key. In this architecture, hash-based MACs (HMACs, [4]) are used. Because the group key is only known to the active players, an active receiver can easily check whether a message comes from an active host, by hashing the received message with the same key and comparing the two HMACs. The hash calculation also makes sure that the message cannot be altered without detection. The calculation of one HMAC needs two hash function evaluations, which are computationally lightweight. There are situations in which the group key must be changed. For example, when a malicious player is expelled, a new group key must be chosen to make the knowledge of the old key worthless to her (perfect forward security). In addition, the amount of data processed with the same key should always be limited to make known-ciphertext attacks difficult. For the purpose of
re-keying and initial key dissemination, a group key management scheme is needed. Various group key management schemes have been proposed (see [5] for a comparison). Most of them rely on one or more group key manager(s), while server-free, completely distributed solutions tend to produce a large amount of additional network traffic. The favoured key management scheme in this contribution is called MARKS [5]. A binary hash chain tree hybrid, whose leaves are the secret group keys for different time intervals, is constructed from a small number of seed keys. By revealing a subset of inner tree nodes to a player, she is able to compute the secret keys for a certain period of time. Because the original paper contains a detailed description of how to use this scheme in a game environment, the interested reader is referred to [5]. MARKS is chosen for the architecture in this contribution because it is free of side effects, computationally cheap and easy to replicate. Nevertheless, the architecture loses a bit of its peer-to-peer character through the introduction of key managers (cf. 4.1). The initial group key is acquired out of band, by contacting a group key manager via a unicast connection. The user sends her certificate to a group key manager, and receives the initial key encrypted with her public key to make sure that only she can decrypt it. Group key managers should limit the number of requests processed per second to avoid denial-of-service attacks. Because only the active players know the group key, it can be used to keep group messages confidential, by encrypting them with a symmetric cipher using a traffic encryption key derived from the group key.
5.3. Synchronisation
To process events from different players in the right order, the game requires the clocks of all players to be synchronised with a small offset of absolute value ε_clock. The network time protocol (NTP, [6]) can be used to keep ε_clock bounded to the order of a few milliseconds. It also delivers statistical information about network transmission delays, and defines a field that can be used for the authentication of NTP messages, e.g. by using the group key from the previous section in a MAC. For brevity, the details are omitted in this contribution, as well as attacks involving tampered clocks. In the following, it is assumed that the clocks of all players are synchronised within a small range of length ε_clock.
5.4. Individual authentication
Although group authentication does not distinguish individual group members, this might be necessary for some messages. For example, a player might want to be the only one controlling her avatar. The standard solution for individual authentication is to digitally sign the messages using the sender's private key. In a large-scale DVE this is not a viable solution, because it requires the receivers to perform computationally intensive signature verification on all state change messages of all entities within the player's area of interest. Symmetric cryptographic operations are usually several hundred times faster than their asymmetric counterparts. Hence, using MACs instead of digital signatures is appealing. Individual authentication based on a MAC usually works as follows: the sender uses a secret key for the computation of a MAC. The key is revealed later to allow the verification (one-time signature). After its disclosure, the secret key cannot be used for further MACs anymore, because it is no longer secret.
The main difference between a digital signature and such a MAC-based authentication scheme is that the former is much slower, while the latter lacks non-repudiation, which is not needed in a DVE as argued above.
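Both the group scheme of section 5.2 and the MAC-based individual authentication discussed here rest on the same keyed-hash primitive. A minimal HMAC-SHA1 compute/verify example using only the Python standard library (key management and message framing are deliberately omitted):

```python
import hashlib
import hmac

TAG_LEN = 20                                   # HMAC-SHA1 digest size in bytes

def authenticate(message: bytes, key: bytes) -> bytes:
    """Append an HMAC-SHA1 tag to an outgoing update message."""
    return message + hmac.new(key, message, hashlib.sha1).digest()

def verify(packet: bytes, key: bytes):
    """Return the message if the tag is correct, None otherwise."""
    message, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(key, message, hashlib.sha1).digest()
    return message if hmac.compare_digest(tag, expected) else None
```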
Figure 1. The TESLA protocol with d = 2.

In this architecture, the Timed Efficient Stream Loss-tolerant Authentication (TESLA, [7]) protocol is employed. In TESLA, each sender maintains a chain of keys (K_i)_{i=0,...,n} generated using a one-way function¹ and uses MACs for message authentication. The time interval I during which the key chain is used is split up into n − d subintervals (I_i)_{i=1,...,n−d} of equal length T_int, where d is called the authentication delay and describes the number of time intervals between the usage and the disclosure of a key in the key chain. A message M sent in an interval I_i reveals the corresponding key K_i. It also contains a MAC that is computed from M using a key derived from a later key K_{i+d} of the key chain. A receiver can authenticate a message M as soon as she receives the disclosed key used in its MAC computation. First, she has to verify that the disclosed key belongs to the key chain, and after that she can verify the MAC. The key chain is secure because the one-way function used to construct it is irreversible, and the MACs are safe because the secret keys K_i are disclosed only after they have been used for signing. On the sender's side the protocol works as follows: first, the sender generates a random seed key K_seed and computes the key chain (K_i)_{i=0,...,n} recursively as

K_n = K_{seed} \quad \text{and} \quad K_i = f(K_{i+1}), \quad \forall i = 0, \dots, n-1

with a one-way function f. The sender then multicasts a packet consisting of the sender's ID, K_0, I, d and n, signed with her private key. Note that only a single asymmetric cryptographic operation is needed during the whole time interval I. To authenticate a message M, the sender determines the current time interval, retrieves the corresponding key K_{i+d} from the key chain and derives the actual key for the MAC by applying a different pseudo-random function g:

K'_i = g(K_{i+d}), \quad i \in \{1, \dots, n-d\} \qquad (1)

The sender adds a timestamp to the message, computes the MAC and adds it to the message, as well as the key K_i, which is thereby disclosed to the receivers. A receiver receives the initialisation message and verifies the signature. On success the values are stored. For each following message the receiver first checks the security condition: "A packet arrived safely if the receiver is assured that the sender cannot yet be in the time interval in which the corresponding key is disclosed." The security condition makes sure that the key cannot have been disclosed yet to anybody else, who could have used it to forge the sender's MAC on the message. If the security condition is not satisfied, the message is dropped; otherwise the message is buffered. Then the disclosed key is extracted, and the receiver verifies whether the key belongs to the key chain, by applying the one-way function f the number of times it should take to reach the last stored disclosed key K_j. If the result is K_j, K_i fits in the key chain. Then the MAC keys of all messages in the buffer sent at intervals (I_k), k = j − d + 1, ..., i − d, can be derived using equation 1 and the MACs can be computed and verified.

¹A one-way function is a function that cannot be inverted with a reasonable amount of computation. In [8] HMAC-SHA1 is recommended as the one-way function.

The TESLA protocol has the following very useful properties:
Resistance to arbitrary packet loss: Missing disclosed keys can be recovered from later disclosed keys by applying the one-way function a couple of times.
Low storage requirements for keys on the receiver's side: The sender needs to store the whole key chain, while the receiver only stores the last disclosed key for each sender.
Low computational effort: Only one digital signature is necessary within a large time interval I. Only two one-way function evaluations per key and one MAC computation per message are needed on each side. The key chain can even be generated offline by the sender.
Low additional network traffic: Only one small additional multicast message per key chain, and one MAC plus one key of overhead per message (40 bytes using HMAC-SHA1), assuming we need timestamps for the application anyway. The disadvantages of TESLA are related to the authentication delay d:
Receiver buffering: Unauthenticated messages must be buffered until the corresponding key is disclosed. This could lead to storage problems on the receiver's side if there are many senders and long disclosure delays.
Additional message authentication delay: Incoming messages cannot be authenticated until the corresponding key disclosure is received. The security condition demands weak synchronisation between sender and receiver: the receivers need to know an upper bound on the sender's current time. Assuming the session starts at T_0, and a packet has been sent at sender's time t_S in the interval I_j and has been received at the receiver's time t_R, the security condition reads
\left\lfloor \frac{t_R + \epsilon_{clock} - T_0}{T_{int}} \right\rfloor - j < d.

Using the sharp boundary j > (t_S - T_0)/T_{int} - 1, an optimum value for d can be derived as

d \geq \left\lceil \frac{\delta_{SR} + \epsilon_{clock}}{T_{int}} \right\rceil + 1, \quad \text{with} \quad \delta_{SR} = t_R - t_S, \qquad (2)

where δ_SR is the network transmission delay between the sender and the receiver. In particular, equation 2 shows that a key K_j for a message sent at the end of an interval I_j cannot be revealed before the interval I_{j+2}, because of the transmission delay and clock offsets.
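A small sketch of the sender-side key chain and the receiver's chain-membership check may make the mechanics concrete. HMAC-SHA1 is used as the one-way function, as recommended in [8]; the fixed labels distinguishing f from g and the 20-byte seed are arbitrary choices made for this illustration.

```python
import hashlib
import hmac
import os

def f(key: bytes) -> bytes:
    """One-way function for the key chain (HMAC-SHA1 with a fixed label)."""
    return hmac.new(key, b"chain", hashlib.sha1).digest()

def g(key: bytes) -> bytes:
    """Second one-way function deriving the MAC key K'_i from a chain key."""
    return hmac.new(key, b"mac", hashlib.sha1).digest()

def build_key_chain(n: int) -> list:
    """Sender side: K_n is a random seed; K_i = f(K_{i+1}) for i = n-1 .. 0."""
    chain = [b""] * (n + 1)
    chain[n] = os.urandom(20)
    for i in range(n - 1, -1, -1):
        chain[i] = f(chain[i + 1])
    return chain

def belongs_to_chain(disclosed: bytes, last_known: bytes, steps: int) -> bool:
    """Receiver side: applying f 'steps' times to a newly disclosed key K_i
    must reproduce the last stored disclosed key K_j (i.e. steps = i - j)."""
    key = disclosed
    for _ in range(steps):
        key = f(key)
    return hmac.compare_digest(key, last_known)
```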
Figure 2. Two concurrent TESLA instances for d_0 = 2 and d_1 = 4.
The overall processing delay of an update message in a DVE must be kept as low as possible. The processing delay of a message due to authentication with TESLA, including the transmission delay, is δ_auth = d·T_int. Equation 2 yields that δ_auth reaches its minimum if δ_SR + ε_clock < T_int, i.e. d = 2. So T_int should be set to a value slightly above δ_SR + ε_clock. Consequently, d can be chosen smaller for a low-latency network connection than for a connection with high latency. If we choose d too small, players with slow connections will never authenticate any message; if it is too large, all players have the same high delays. In [8], Perrig et al. have extended the TESLA protocol to overcome these shortcomings. The idea is to use m instances τ_k, k = 1, ..., m, of TESLA with different disclosure delays d_k synchronously. Each message contains one MAC for every TESLA instance. The MAC keys K_i^k used in the time interval I_i for the different TESLA instances can be derived from the same key chain (K_i), by

K_i^k = g_k(K_{i+d_k}), \quad k = 1, \dots, m \quad \text{and} \quad i = 1, \dots, n - \max_{l=1,\dots,m}(d_l),

where k denotes the TESLA instance, K_i the i-th key in the key chain and g_k a one-way function specific to the TESLA instance τ_k³. By deriving all MAC keys from the same key chain, no extra keys have to be disclosed. Each receiver can now pick the TESLA instance according to her transmission delay to the sender, so players with fast connections are no longer slowed down by players with slow connections, while the latter still have a chance to participate, though with the handicap of a greater lag.
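Following the footnote's suggestion g_k(K) = HMAC(K, k), the per-instance MAC keys can be derived from the single key chain, and a receiver can pick an instance from its measured latency using equation (2). The sketch below reuses build_key_chain from the previous example; encoding k as a single byte and the fallback to the slowest instance are assumptions made for illustration.

```python
import hashlib
import hmac
import math

def instance_mac_key(chain: list, i: int, d_k: int, k: int) -> bytes:
    """MAC key of TESLA instance k in interval I_i: g_k(K_{i+d_k}) = HMAC(K_{i+d_k}, k)."""
    return hmac.new(chain[i + d_k], bytes([k]), hashlib.sha1).digest()

def pick_instance(delays: dict, latency: float, t_int: float, eps_clock: float) -> int:
    """Choose the instance with the smallest disclosure delay d_k that still
    satisfies equation (2) for the measured sender-receiver latency.
    delays maps instance index k -> d_k."""
    needed = math.ceil((latency + eps_clock) / t_int) + 1
    usable = [k for k, d in delays.items() if d >= needed]
    if usable:
        return min(usable, key=lambda k: delays[k])
    return max(delays, key=lambda k: delays[k])    # fall back to the largest delay
```

With the parameter choice discussed in the next section, delays = {1: 2, 2: 3, 3: 5, 4: 6} would reproduce the (d_k) = (2, 3, 5, 6) configuration.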
5.5. Parameter tuning
If more than one state update message of an entity in a DVE is sent within the same time interval I_j, the last update will override all the previous ones. Because all messages sent within I_j can be authenticated at the same time, it is not sensible to send more than one update per time interval. Consequently, the key interval length T_int should be smaller than the periodicity of entity state update messages T_update, which is usually bounded by the available bandwidth and the number of entities sending updates. To minimise the latency further, updates are sent at the end of each time interval rather than at the beginning. To limit the authentication overhead of the update messages, a small number m of TESLA instances must be chosen. One approach to determine the key interval length T_int and the authentication delays (d_k)_{k=1,...,m} without prior knowledge of the transmission delays between the peers is based on the disjoint latency ranges presented in section 1.2. Supposing an additional 20ms of transmission time in the Internet backbone, choosing T_int = 10ms and (d_k) = (2, 3, 5, 6) gives a good coverage of the expected overall network latencies for the different connections, excluding modem users. A more adaptive approach would employ the knowledge of the real transmission delays and try to optimise T_int and d_1, ..., d_m in a similar manner. Receivers must limit the sizes of their message buffers. By fixing the update periodicity T_update, the number of entities and the authentication delays (d_k), the maximum buffer sizes can be calculated a priori and buffer overflows should not occur. Due to TESLA's resistance to packet loss, receivers can also choose to drop some packets from each sender after extracting the disclosed key, if the buffer has run full or if the sender undershot the update periodicity. The dead reckoning scheme must be able to hide the maximum latency d_m·T_int from the users with smooth transitions between the predicted and the updated states. In the given example, a latency of at most 60ms must be hidden, which is not a hard restriction.

³The authors of [8] suggest to use g_k(K) = HMAC(K, k).

6. CONCLUSION AND FUTURE WORK
This paper has pointed out that DVEs place high demands on the quality of service of the network. Experimental results have been presented showing the impact of the users' specific type of network connection on the latency. Two main issues of security for DVEs have been investigated: access control and authentication. For the specific scenario of a multicasting-based peer-to-peer online game with invitations, a security analysis has led to two possible trust models, which require either group or individual authentication of update messages. It has been demonstrated how X.509 certificates can be efficiently used for access control, and how group authentication can be implemented using a group key management scheme such as MARKS. For individual authentication, a variant of the TESLA protocol has been presented that copes with different network latencies among the active hosts. It has been shown how the security parameters can be chosen and how they relate to the update periodicity. The resulting protocol offers individual authentication with a minimum computational effort, produces a very low key storage and message length overhead, and is resilient to arbitrary packet loss, at the cost of a doubled latency. The MSEC working group⁴ of the Internet Engineering Task Force is currently working on a number of protocols to secure multicast traffic, mostly focused on group authentication. EMSS [7] and BiBa [9] are just two examples of other individual authentication schemes based on symmetric cryptography. In [10] a security architecture for the NPSNET system is presented, concentrating mostly on confidentiality. There is also a lot of work on cheat protection, e.g. in [11]. Numerous issues of a complete security architecture for DVEs, for example the announcement of a game, the resolution of conflicts or cheat protection, have not been covered in this contribution and remain for future work. An implementation of the described architecture is in progress, as well as the analysis of other scenarios such as client-proxy-server architectures.
REFERENCES
[1] S. Cheshire, Latency Survey Results, private homepages (2001). URL http://www.stuartcheshire.org/rants/LatencyResults.html
4http://www.ietf.org/html.charters/msec-charter.html
[2] R. Housley, W. Ford, T. Polk, D. Solo, Internet X.509 Public Key Infrastructure Certificate and CRL Profile, RFC 2459 (Jan 1999). URL http://www.ietf.org/rfc/rfc2459.txt
[3] W. Yeong, T. Howes, S. Kille, LDAP: The Lightweight Directory Access Protocol, RFC 1777 (1995). URL http://www.ietf.org/rfc/rfc1777.txt
[4] H. Krawczyk, M. Bellare, R. Canetti, HMAC: Keyed-Hashing for Message Authentication, RFC 2104 (1997). URL http://www.ietf.org/rfc/rfc2104.txt
[5] B. Briscoe, MARKS: Zero Side Effect Multicast Key Management Using Arbitrarily Revealed Key Sequences, in: Networked Group Communication, First International COST264 Workshop, Vol. 1736 of Lecture Notes in Computer Science, Springer, Pisa, Italy, 1999, pp. 301-320. URL http://citeseer.nj.nec.com/276130.html
[6] D. L. Mills, Network Time Protocol: Specification, Implementation and Analysis, RFC 1305 (1992). URL http://www.ietf.org/rfc/rfc1305.txt
[7] A. Perrig, R. Canetti, J. D. Tygar, D. X. Song, Efficient Authentication and Signing of Multicast Streams over Lossy Channels, in: IEEE Symposium on Security and Privacy, 2000, pp. 56-73. URL http://citeseer.nj.nec.com/perrig00efficient.html
[8] A. Perrig, R. Canetti, D. Song, J. D. Tygar, Efficient and secure source authentication for multicast, in: Proceedings of the Symposium on Network and Distributed Systems Security (NDSS 2001), Internet Society, 2001, pp. 35-46. URL http://www.isoc.org/isoc/conferences/ndss/01/2001/papers/perrig.pdf
[9] A. Perrig, The BiBa one-time signature and broadcast authentication protocol, in: Proceedings of the Eighth ACM Conference on Computer and Communication Security, 2001, pp. 28-37. URL http://portal.acm.org/citation.cfm?id=501988
[10] E. J. Salles, The Impact of Quality of Service When Using Security-Enabling Filters to Provide for the Security of Run-Time Virtual Environments, Master's thesis, Naval Postgraduate School, Monterey, USA (Sep 2002).
[11] N. Baughman, B. N. Levine, Cheat-Proof Playout for Centralized and Distributed Online Games, in: Proceedings IEEE INFOCOM 2001, Anchorage, USA, 2001, pp. 104-113. URL http://www.ieee-infocom.org/2001/paper/808.ps
A Grid Environment for Diesel Engine Chamber Optimization
G. Aloisio a, E. Blasi a, M. Cafaro a, I. Epicoco a, S. Fiore a, and S. Mocavero a
aISUFI/Center for Advanced Computational Technologies, University of Lecce, I-73100, Italy
giovanni.aloisio, euro.blasi, massimo.cafaro, italo.epicoco, sandro.fiore, [email protected]
The goal of this paper is to show that computer modelling techniques can be used to solve the real-life problem of improving Diesel engine performance. The purpose of our work is to achieve the lowest emission levels and improved fuel efficiency with respect to the European emission norms. In particular, we are interested in reducing NO + HC and soot emissions and in maximizing the PMI, a pressure proportional to engine power. These parameters also depend on the combustion chamber geometry, so we propose its optimization using the new micro genetic algorithm (micro-GA) technique. The idea consists in the automated random generation of a large number of meshes, each representing a different chamber geometry subject to some common geometric constraints, followed by the use of the micro-GA to optimize, at each iteration, the results obtained in the previous steps. The innovative feature of our work is the multiobjective nature of the optimization process. This is the main reason for choosing micro-GAs rather than simple genetic algorithms. Emission levels and fuel efficiency can be evaluated using a modified version of the KIVA3 code that outputs three values, each related to one of the three specific fitness functions to be maximized. The optimization process involves the execution of many KIVA3 simulations to calculate the fitness values of the chamber geometries considered during all of the optimization steps. We propose the use of Grid computing technologies to increase the performance of the KIVA-micro-GA, showing how a distributed environment allows the computational time needed by the optimization process to be reduced, taking advantage of the intrinsic parallelism of the micro-GA. In fact, its structure allows KIVA3 simulations to be executed simultaneously over the random meshes and over the geometries that populate the micro-population at each iteration. The services offered by the system are micro-GA parameter definition, submission of the optimization process and monitoring of the process status. A trusted user can access the implemented services using a grid portal, called DESGrid (Grid for Diesel Engine Simulation). The analysis of the results, achieved by executing the KIVA-micro-GA on three ES40 Compaq nodes, each equipped with four processors, shows a good reduction in both emissions and fuel consumption. In the paper we show the numerical values and the related geometry representations obtained after the first steps of the global optimization process.
1. INTRODUCTION
In the last years, much research effort has focused on the optimization of Diesel engine emissions. It is well known that emission values also depend on the combustion chamber geometry. Thus, many optimization tools have been developed to explore the wide space of solutions; we consider each of these solutions to be represented by the sequence of the coordinates of the points defining the chamber geometry profile. The KIVA-GA of Wickman, Senecal and Reitz [1] is a less costly means of studying the chamber geometry optimization problem, since an experimental study would require many different pistons to be manufactured. The KIVA-GA work considered the optimization of a single objective function f(X), where X is the vector of parameters to be varied. The aim of this work is a multiobjective optimization; in fact, we consider three different functions, each related to a specific fitness to be maximized: the inverse of the NO and hydrocarbon mix emissions, the inverse of the soot emission, and the indicated mean pressure (PMI). In a recent study, Coello and Pulido [2] suggested a new evolutionary multiobjective optimization algorithm based on the micro genetic algorithm (micro-GA) theory. We use this model to realize a KIVA-micro-GA for the optimization of the three fitness functions, to obtain good solutions during and not just at the end of the optimization process, and to achieve convergence to the optimum in fewer iterations compared to a simple genetic algorithm. The total time required to compute the geometry of a Diesel engine chamber using the KIVA-micro-GA is prohibitive on a sequential machine. Grid technologies give us the possibility to use a collection of distributed resources on which it is possible to simultaneously execute all the needed simulations. Hence, we have developed a grid portal that allows users without computer science expertise to define the micro-GA parameters, to start the distributed simulations and to monitor the status of the execution using a simple Web GUI.
2. OPTIMIZATION MODEL: MICRO-GENETIC ALGORITHM
A micro-genetic algorithm [2] is a genetic algorithm (GA) characterized by a small population. Applying the classical genetic operators to the latter, we can reach convergence to an optimal current solution after a few cycles and then generate a new population by transferring the best individual achieved to it. The remaining individuals are randomly generated. Senecal et al. [1] considered an optimization model based on the Krishnakumar micro-GA [4], characterized by a simple structure based on the Carroll genetic algorithm (GA) code [5]. In this micro-GA, the mutation operator is not applied. Our micro-GA is based on an innovative model proposed by Coello and Pulido [2], shown in Fig. 1. The model is developed on two levels. Externally we have a modified version of a simple GA, characterized by the presence of two memory portions from which the algorithm selects the initial micro-population individuals. These portions are referred to as replaceable and non-replaceable. The first one contains solutions that may be replaced during the optimization process by solutions with better fitness values; the second portion never changes and therefore represents a source of diversity for the algorithm. Internally, a micro-GA cycle is executed until the so-called nominal convergence is reached.
Nominal convergence may be defined in terms of a given (generally low) number of cycles or in terms of the similarity among the solutions of the micro-population. In our work, we have chosen the second criterion, since in our case it is difficult to estimate a fixed number of iterations that guarantees the achievement of a good percentage of similarity.
Figure 1. micro-GA model

In the process of chamber geometry optimization, the advantages of a micro-GA compared with a GA are the following:
• Reduction of computational time. In a micro-GA, as in a simple GA, the number of simulations requested for each algorithm iteration is equal to the number of individuals that populate the algorithm population and have been matched successfully in that iteration. Hence, the reduced size of the micro-population reduces the maximum number of combinations requested by each cycle; in this way the number of simulations to be executed decreases and consequently the total computational time is reduced.
• Convergence rate to an intermediate optimal element. The space explored at each micro-GA cycle is a subspace of the entire search space; hence nominal convergence to an intermediate optimal solution is reached quickly. This solution will likely be used as the parent of an optimal new one, so that during the iterative process the absolute optimal solution (if there is one) is obtained.
• Maintenance of a percentage of diversity. In a simple GA, the probability for an individual to be selected is proportional to its fitness. A low selection probability makes the individual's removal a more probable event soon after the early iterations. However, an individual with low fitness, if matched with another one, could generate a good individual on the way to the optimum solution. It is therefore convenient to keep some solutions with a low probability of selection, to guarantee the preservation of their genetic characteristics. The presence in the model of a non-replaceable memory portion is a constant source of diversity used for this purpose.
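The two-level structure of figure 1 can be summarized in pseudocode. The sketch below is purely structural: the micro-population size and all helper functions (roulette_select, evolve, nominal_convergence, nondominated, update_external_memory, replace_worst) are hypothetical placeholders, and the fitness evaluation hidden inside them corresponds to a full KIVA3 run per geometry.

```python
MICRO_POP_SIZE = 4          # assumed small micro-population size, for illustration only

def micro_ga(random_geometries: list, baseline, external_cycles: int) -> list:
    """Structural sketch of the two-level model: an outer GA with replaceable
    and non-replaceable memory portions, and an inner micro-GA cycle that runs
    until nominal convergence (all helpers are placeholders)."""
    half = len(random_geometries) // 2
    replaceable = random_geometries[:half] + [baseline]
    non_replaceable = random_geometries[half:]          # never modified: diversity source
    external_memory = []                                # best non-dominated solutions so far
    for _ in range(external_cycles):
        # initial micro-population drawn from both memory portions
        population = [roulette_select(replaceable + non_replaceable)
                      for _ in range(MICRO_POP_SIZE)]
        while not nominal_convergence(population):      # similarity-based criterion
            population = evolve(population)             # selection, crossover, mutation
        nominal = nondominated(population)              # three objectives: f1, f2, f3
        external_memory = update_external_memory(external_memory, nominal)
        replaceable = replace_worst(replaceable, nominal)
    return external_memory
```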
Referring to Fig. 1, we distinguish the following steps:

1. Random geometries generation. The generation of the mesh representing the chamber is automated: we have realized a 'mesh maker' which generates a random geometry in a few minutes. Senecal et al. [1] changed the value of three geometric parameters (central crown height, bowl depth, and bowl diameter of the piston) to explore the search space of solutions; the number of possible solutions, using only three parameters, is restricted. Our geometry is generated point by point using an algorithm that is able to select any admissible point of the plane, subject to predefined geometric constraints based on a feasibility study of the chamber. One or more smoothing operations guarantee the smoothness of the shape. The mesh maker outputs two text files: an iprep file containing the geometric information needed for the simulation by the KIVA3 code, and a second file containing only the mesh point coordinates, used to mate two geometries. Random geometries and a baseline geometry, representing the starting point of the optimization process, populate the two memory portions.

2. Fitness estimate. For each generated mesh, a full simulation of the KIVA3 code is executed in order to generate a text file, referred to as fit.dat, containing the three fitness values of formula (1):

    f1 = 1 / (NO + HC),    f2 = 1 / soot,    f3 = PMI                    (1)

where NO + HC and soot are measured in g/kg of fuel and PMI is a pressure, measured in MPa, proportional to the engine power.
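As an illustration of formula (1), the fitness values can be derived from the raw simulation outputs as in the following minimal sketch (ours; the parsing of fit.dat and the handling of failed simulations are not shown, and the field names are assumptions):

    // Sketch: the three fitness values of formula (1) computed from the raw
    // outputs of a single KIVA3 simulation. Field names are illustrative.
    class FitnessValues {
        final double f1, f2, f3;
        FitnessValues(double noPlusHc, double soot, double pmi) {
            f1 = 1.0 / noPlusHc;   // inverse of NO + HC emissions [g/kg of fuel]
            f2 = 1.0 / soot;       // inverse of soot emission [g/kg of fuel]
            f3 = pmi;              // indicated medium pressure PMI [MPa]
        }
    }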
3. micro-GA cycle. During each cycle [2], the following conventional genetic operators are applied to the meshes of the micro-GA initial population:

• Selection. The micro-GA initial population is obtained by drawing geometries from both the replaceable and the non-replaceable portions with a parametric probability. This probability, like the other micro-GA parameters, can be modified by the user before each process execution. The selection probability of a specific geometry within either portion is not chosen by the user; it is proportional to the geometry fitness. To this end, the micro-GA uses the roulette method, which associates with each mesh a portion of the roulette wheel proportional to the fitness of the mesh (see the sketch after this list).

• Crossover. The choice of one crossover type over another is closely connected with the nature of the problem. In our study we have chosen the two-point crossover because it allows the good building blocks already achieved to be kept without bounding the search space. The need to apply a smoothing operation (which generally modifies the mesh profile) after changing the coordinates of mesh points suggests reducing the number of crossover points; however, a single crossover point could bound the explored search space. The crossover is applied with a probability generally chosen equal to 0.5. If the selected individuals do not mate, two exact copies of the parents are transferred to the following generation.

• Mutation. We have used uniform mutation: after crossover, the ordinate of each point of the mesh is mutated with a low probability. This mutation technique allows uncommon chamber geometry shapes to be obtained, and its low application probability speeds up the convergence process.

• Elitism. Coello and Pulido's approach [2] uses three types of elitism. The first form consists in copying the best solution produced by each cycle into the initial population of the next one. The second form consists in replacing the replaceable portion with the nominal solutions (the best solutions achieved when nominal convergence is reached); with these solutions, we gradually approach the true Pareto front of the problem. The selection mechanism for the solutions building the Pareto front is based on the so-called rank method: these solutions are characterized by a rank of 1 (non-dominated solutions). The third type of elitism is applied at certain intervals, when a certain number of points (each representing a nominal solution) from the Pareto front is used to fill the replaceable memory.
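The roulette method mentioned in the Selection step can be sketched as follows (a minimal illustration assuming a single scalar fitness per mesh; this is not the portal's actual code):

    import java.util.Random;

    // Minimal sketch of roulette-wheel (fitness-proportionate) selection used to
    // draw a geometry from a memory portion.
    class Roulette {
        static int select(double[] fitness, Random rnd) {
            double total = 0.0;
            for (double f : fitness) total += f;
            double spin = rnd.nextDouble() * total;
            double cumulative = 0.0;
            for (int i = 0; i < fitness.length; i++) {
                cumulative += fitness[i];
                if (spin <= cumulative) return i;   // slice proportional to fitness
            }
            return fitness.length - 1;              // guard against rounding error
        }
    }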
3. GRID TECHNOLOGIES FOR SOLVING THE OPTIMIZATION PROBLEM

The present work is an example of an industrial problem solved by means of computer simulations. Each simulation of the KIVA3 code, used to calculate the fitness values, requires about 40 minutes on a single processor of a Compaq ES40 node; to get significant results, almost 3000 simulations have to be executed. Thus, the total time required to compute the geometry of a Diesel engine chamber is prohibitive on a sequential machine. Grid technologies allow using a collection of distributed resources on which all of the needed simulations can be executed simultaneously. By means of these technologies, the computational time is considerably reduced thanks to the intrinsic parallelism of the micro-GA: after the generation of the random geometries, the KIVA-micro-GA algorithm executes a simulation for each of them and waits for all of the simulations to end before starting the first micro-GA cycle; likewise, the generation of a new micro-population takes place only when the simulations of the new geometries that will be part of it have been executed. The KIVA3 code we use is a sequential version of the simulator, but the needed simulations are executed simultaneously on the grid machines; thus, we exploit the High Throughput Computing paradigm. Today, the presence of many high-level tools and grid-specific libraries makes grid application development easier and reduces development costs, since the number of available services that can be merged with new applications increases. In our work we have used the GRB [6] and GRB-GSIFTP [7] libraries, which provide high-level services on top of the Globus Toolkit. These libraries have been used for the submission of simulations to the grid machines and for the transmission of the input/output files. Since the files are actually small (about one hundred bytes), data transfer costs do not limit the performance of the process. The nature of the present scientific problem calls for the implementation of a simple Web interface that allows users without computer science expertise to start a distributed simulation and to monitor the status of the execution. To this end, we have developed a grid portal, called DESGrid (Grid for Diesel Engine Simulation), to access the implemented services.

3.1. DESGrid architecture

The DESGrid architecture, shown in Fig. 2, consists of three essential modules: a web interface to access the grid transparently; a second module for the management and execution of the micro-GA; and finally a grid resource manager to optimize the use of the available computational resources. A trusted user accesses the system by authenticating herself: the Grid Security Infrastructure (GSI) [8] protocol, as provided by the Globus Toolkit, provides single sign-on authentication, communication protection, and support for delegation of credentials. The user authenticates herself once (single sign-on) and thus creates a user proxy representing her own credentials; a program can use these credentials to act on behalf of the user for a limited period of time. GSI uses X.509 certificates as the basis for user authentication. The services offered by DESGrid are: 1) micro-GA parameter definition; 2) job submission; 3) monitoring of job status.
[Figure 2 here: users (USER A ... USER Z) log into the web interface and submit chamber geometry optimization requests; the micro-GA manager (micro-GA MNG) drives the evolution and forwards simulation requests to the DES manager, which schedules KIVA3 simulations of the candidate combustion chamber geometries on the available grid machines.]
Figure 2. DESGrid architecture

The user can modify the micro-GA parameters, before the optimization process starts, according to theoretical considerations or previous results. The application stores the new values, which are read at run time. Job submission does not require a complex procedure: it is enough for the user to click on the submission button to start the process, without specifying any input files, since we have considered a single baseline geometry as the starting point for all the optimization processes. When a trusted user asks the system for the process status, she is provided with a web page containing the job start time and the fitness values of the best current solution; the page also provides a link to the related iprep file. The micro-GA manager (micro-GA MNG) is a daemon whose task is the iterative control of the job requests made by users: each request corresponds to the generation of a micro-GA with its own evolution. The micro-GA MNG is responsible for coordinating and executing the micro-GA described in the previous section. User requests are queued in an appropriate spool directory and are processed asynchronously by the threaded manager. The estimation of the fitness values requires a high computational load, so each request for a simulation is submitted to a scheduler, referred to as the DES Manager (DES MNG), responsible for scheduling all the simulation requests on the available grid resources. When a simulation finishes, the DES manager receives back a fit.dat file as the simulation result. If the simulation fails because the chamber geometry is not suitable, an output file with null fitness values is generated and sent to the DES manager. The DES manager chooses the node where each simulation has to be executed according to the scheduling policy. A scheduling policy [3] generally means choosing the 'best' resource according to performance criteria; in our case these are meant to minimize the execution time: the scheduler knows the total number of simulations to be executed and the execution time spent by each resource, so it can schedule the jobs in order to reach an optimal load balancing. The current scheduling policy does not consider the load placed on the grid resources by the execution of other applications; in the future, it will be possible to define a more realistic scheduling policy that manages the resources considering all the processes scheduled on each of them. At the end of each simulation, the scheduler estimates the execution time on a single machine with reference to the performance of the last 5 simulations and schedules the residual work on the basis of this estimate. If a simulation fails due to the failure of a remote host, the simulation is run again and the machine is temporarily removed from the list of available resources. Grid machines report their presence to the scheduler by a signal at constant intervals; when a machine does not signal its presence, it is temporarily removed from the list until a new presence signal is received. When a resource is added or temporarily removed by the grid administrator, this is reported to the scheduler, which can re-schedule the jobs on the basis of the available resources.
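The per-machine time estimate described above could be kept, for instance, as a sliding window over the last five completed simulations; the following sketch only illustrates the stated policy (averaging over the window is our assumption) and is not the DES MNG source code:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Sketch of the per-resource performance estimate used for load balancing:
    // the expected execution time of the next simulation on a machine is taken
    // from its last 5 completed simulations.
    class ResourceEstimator {
        private static final int WINDOW = 5;
        private final Deque<Double> lastTimes = new ArrayDeque<>();

        void recordCompletion(double seconds) {
            lastTimes.addLast(seconds);
            if (lastTimes.size() > WINDOW) lastTimes.removeFirst();
        }

        double estimatedTime(double defaultSeconds) {
            if (lastTimes.isEmpty()) return defaultSeconds;  // no history yet
            double sum = 0.0;
            for (double t : lastTimes) sum += t;
            return sum / lastTimes.size();
        }
    }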
4. EXPERIMENTAL RESULTS

Table 1
Baseline fitness values compared with the best random geometries values

    Geometry             1/(NO+HC) [g/kg of fuel]   1/soot [g/kg of fuel]   PMI [MPa]
    Baseline geometry    0.321162E-01               0.999295E-03            0.427497E-01
    Geometry (a)         0.432698E-01               0.999671E-03            0.476306E-01
    Geometry (b)         0.212138E-01               0.999921E-03            0.863489E-01
    Geometry (c)         0.203309E-01               0.999766E-03            0.879526E-01

[Figure 3 here: the baseline geometry and the three best random geometries (a), (b) and (c).]

Figure 3. Baseline geometry compared with the best three new random geometries
The first experimental results have been obtained by executing the micro-GA on three Compaq ES40 nodes, using four processors on each node. We considered 100 micro-GA iterations; the average number of cycles for each micro-GA iteration is 7 (based on the experimental results for this specific problem); on average, 2 minutes are needed to generate a random geometry; the number of solutions that populate the memory portions has been chosen to be 50; the estimated time for managing a single cycle is 1 minute; the estimated time needed to perform crossover and mutation of two geometries is about 50 seconds. Thus, the time needed for the entire optimization process using 12 processors is about 20 days. In this paper we show an example of the improvement that can be obtained after the first steps of the optimization process. The first step consists in the generation of 50 random geometries. We achieved good results for the three fitness values compared with those of the baseline geometry. Because of the random nature of the generation process, the best values were not all obtained by the same geometry. Table 1 shows in its first row the baseline fitness values, compared with the fitness values of the three random geometries that improve on them, shown in the following rows. Fig. 3 shows the corresponding geometries. The factor of reduction of NO+HC is about 34.73%, the factor of soot reduction is about 0.06%, whereas PMI is increased by 105.74%. Table 2 shows the fitness values of the best geometries achieved after the first four micro-GA iterations. The last column indicates, for each iteration, the number of micro-GA cycles needed to achieve nominal convergence. Examining the results, it is clear that, as already seen during the evolution of the random geometries, the most consistent improvement concerns the increase of PMI. NO+HC emissions have not been improved, probably because of the random choice of non-dominated solutions during the evolution process. After the first tests we realized that the improvement of emissions and PMI is related to concave geometries, so we are working on a new 'mesh maker' which generates only this kind of geometry.
Table 2
The fitness values related to the best geometries achieved after each of the first four micro-GA iterations

    Iteration   1/(NO+HC)      1/soot         PMI            Cycles number
    0           0.321162E-01   0.999295E-03   0.427497E-01   -
    1           0.274563E-01   0.100026E-02   0.699444E-01   3
    2           0.239250E-01   0.999590E-03   0.794345E-01   10
    3           0.229558E-01   0.100029E-02   0.908426E-01   7
    4           0.213448E-01   0.100007E-02   0.899849E-01   9
5. CONCLUSIONS AND FUTURE WORK

The described work is the first step towards a multi-user system for the optimization of a specific geometry based on the knowledge and experience of different work groups. In this work we have considered a single baseline geometry for all the optimization processes; we believe that, in the future, many users will submit their optimization processes starting from a particular baseline geometry resulting from their own expertise. Moreover, future work includes the implementation of more complex scheduling policies that take into consideration all the problems related to the scheduling of simulations on grid resources. Genetic algorithm theory suggests a third direction for improving the present application: it is possible to conceive another genetic algorithm, at an external level, that fixes the micro-GA parameters on the basis of the results of previous chamber geometry optimization processes.
REFERENCES
[1] D.D. Wickman, P.K. Senecal, R.D. Reitz, "Diesel Engine Combustion Chamber Geometry Optimization using Genetic Algorithms and Multi-dimensional Spray and Combustion Modelling", SAE 2001-01-0547, 2001.
[2] C.A. Coello, G.T. Pulido, "Multiobjective Optimization using Micro-Genetic Algorithm", Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pp. 274-282, Morgan Kaufmann Publishers, San Francisco, California, July 2001.
[3] I. Foster, C. Kesselman (eds.), "The Grid: Blueprint for a New Computing Infrastructure", Morgan Kaufmann Publishers, Inc., ISBN 1-55860-475-8, 1999.
[4] K. Krishnakumar, "Micro-genetic Algorithms for stationary and non-stationary Function Optimization", SPIE Proceedings: Intelligent Control and Adaptive Systems, volume 1196, pages 289-296, 1989.
[5] D.L. Carroll, "Genetic Algorithms and Optimizing Chemical Oxygen-Iodine Lasers", Developments in Theoretical and Applied Mechanics, 18, 411, 1996.
[6] G. Aloisio, E. Blasi, M. Cafaro, I. Epicoco, "The GRB Library: Grid Programming with Globus in C", Proc. HPCN Europe 2001, Amsterdam, Netherlands, Lecture Notes in Computer Science, Springer-Verlag, N. 2110, 133-140, 2001.
[7] G. Aloisio, M. Cafaro, I. Epicoco, "Early experiences with the GridFTP protocol using the GRB-GSIFTP library", Future Generation Computer Systems, Volume 18, Number 8 (2002), pp. 1053-1059, Special Issue on Grid Computing: Towards a New Computing Infrastructure, North-Holland.
[8] I. Foster, C. Kesselman, G. Tsudik, S. Tuecke, "A Security Architecture for Computational Grids", Proc. 5th ACM Conference on Computer and Communications Security, pp. 83-92, 1998.
A Broker Architecture for Object-Oriented Master/Slave Computing in a Hierarchical Grid System

M. Di Santo(a), N. Ranaldo(a), and E. Zimeo(a)

(a) Department of Engineering - Research Centre on Software Technology, University of Sannio - Benevento - ITALY

Even though many simulation frameworks for predicting performance in Grid systems have been implemented, real Grid middleware platforms lack effective resource managers able to help them dynamically make decisions about task placement. This paper proposes a broker architecture and its implementation for a hierarchical Grid system. The broker, whose logic is based on an economy-driven model, is able both to transparently split a sequential object-oriented application into tasks, according to the master/slave computing model, and to automatically distribute slave tasks to a set of computational resources selected so as to execute the application satisfying the QoS specified by the user. The goal is achieved by designing the broker with three patterns: Grid Broker, Reflection and Master/Slave.

1. INTRODUCTION

One of the most important features in a grid environment is the ability to exploit well a high and variable number of distributed heterogeneous resources in order to run an application within a deadline and without exceeding a prefixed budget [4]. Such a service is typically provided by a specific middleware component: the broker. Recently, researchers in the field of Grid brokering and scheduling have developed many simulation frameworks [4, 5] for predicting application or system performance. Unfortunately, these simulators cannot be immediately applied to real Grid middleware platforms in order to help them dynamically make decisions about task placement. In a real environment many questions arise. How should a real application be described? How are tasks and their dependencies individuated? What resource parameters are significant for splitting a job into tasks and for scheduling them? How are tasks related to the programming model used? How should the application load be distributed on the basis of resource information? To our knowledge, only a few papers [11, 13, 14] have analyzed all these aspects of a real Grid Broker. However, they do not (1) individuate a reference software architecture, (2) define resource and application descriptors, (3) analyze dependencies among tasks, or (4) show how to reuse sequential code for executing an application in a Grid. This paper attempts to answer these questions by defining a broker architecture for a hierarchical Grid system. The proposed broker, whose logic is based on an economy-driven model, is able both to transparently split a sequential object-oriented application into tasks, according to the master/slave computing model, and to automatically distribute slave tasks to a set of resources selected so as to execute the application satisfying the QoS specified by the user. The goal is achieved by designing the broker with three patterns: Grid Broker, Reflection and Master/Slave. In particular, the broker
is integrated in a Java-based Grid middleware, called Hierarchical Metacomputer Middleware (HiMM) [6, 7], which provides information and communication services and allows users to program applications by adopting a distributed object model. The remainder of the paper is organized as follows. Section 2 introduces hierarchical Grids and briefly describes our middleware. Section 3 discusses different approaches for selecting resources in Grids. Section 4 presents the Grid Broker pattern, a pattern purposely defined for designing Grid middleware platforms. Section 5 tackles the problem of object-oriented task scheduling when the master/slave computing model is adopted. Section 6 concludes the paper and introduces future work.

2. HIERARCHICAL GRID ARCHITECTURES

A hierarchical topology fits grid systems well, as it allows (1) remote resource owners to enforce their own policies on external users [11], (2) applications to exploit the performance of dedicated networks, and (3) the system to be more scalable. Following these observations we have developed HiMM, a customisable middleware based on an architecture composed of four layers as proposed in [9]. HiMM is able to exploit collections of computers (hosts) interconnected by heterogeneous networks to build Hierarchical Metacomputers (HiMs). At the connectivity layer, HiMs are composed of abstract nodes interconnected by a virtual network based on a hierarchical topology. The nodes allocated onto hosts hidden from the Internet or connected by dedicated, fast networks can be grouped in macro-nodes, each one controlled by a Coordinator (C). The coordinator of the highest hierarchical level is the root of a HiM and is interfaced with the Console. Each host willing to donate its computing power runs a special server, the Host Manager (HM), which receives creation commands from the console and creates the required nodes. It is worth noting that the console can create nodes only at the highest hierarchical level of a HiM; however, if a coordinator receives a creation command, it creates nodes inside the macro-node that it controls according to the supplied configuration information. At the resource layer, HiMM provides an information service based on the interaction between the Resource Manager (RM) and the HM. An RM, allocated on one of the hosts taking part in a macro-node, is periodically contacted by the HMs running on the hosts belonging to the same macro-node that want to publish information about: (1) the CPU power and its utilization; (2) the available memory; (3) the communication performance. The information is collected by the RM and made available to subscribers.

3. RESOURCE SELECTION AND TASK MAPPING

As stated in [14], the most common current Grid broker is the user, but many research efforts [1, 17, 18] are underway to change this approach. The simplest and most immediate scheme for selecting resources requires that the result of resource discovery is presented directly, through the console, to the user, who, on the basis of his knowledge of the application, selects a subset of the discovered machines for building a meta-system. A more sophisticated approach is based on the interaction among the information system, the user and the broker. With this scheme, resource selection is performed automatically by the broker. To this end, the broker needs some information about the resources, the application to run and the QoS desired by the user. The result of the selection can be used to perform the mapping either by the user through the console or directly by the broker (fig. 1.a). The mapping regards the spatial allocation of application tasks to the selected machines, which, in our environment, means the creation of the processes on the selected machines that will host object-oriented tasks. In this paper we focus on the third approach, which guarantees a transparent use of heterogeneous, multi-owner, distributed resources. With this approach, the broker acts as a mediator between the user (console) and grid resources (HiM) by using middleware services (Info System and Host Managers). In particular, the broker becomes responsible for resource discovery, resource selection, process/task mapping and task scheduling, and presents the Grid (a HiM) as a single, unified resource. The broker architecture is distributed and hierarchical in order to match the organization of a hierarchical Grid (fig. 1.b). This architecture is not conceptually new [11], but the paper discusses how such an organization can be used to schedule object-oriented tasks dynamically produced by a running application.

4. GRID BROKER ARCHITECTURAL PATTERN
Individuating a software architecture for designing and studying complex distributed systems is a key factor for their engineering and widespread diffusion [13]. Software architectures are typically described through architectural and design patterns, which offer an effective way to reuse design solutions. Following this approach, our broker is designed according to the Grid Broker pattern, a variant of the Broker pattern [3] purposely modified to satisfy the requirements of Grids. Due to limited space, in this paper we give only a brief overview of the pattern, which will be fully described in a future work. The pattern uses the broker as a key component to achieve decoupling of clients (grid users) and resources (computational power, communication bandwidth, storage capacity). Resources register themselves with the broker and make their functions available to clients. Clients use the resources by issuing jobs to the broker. The broker is responsible for discovering resources, selecting resources on the basis of QoS parameters, mapping tasks to the selected resources, scheduling tasks according to a pattern of dependencies and collecting results. Therefore, the broker hides all the details of a Grid system by offering a simple framework for the deployment and execution of Grid applications (in the following called jobs). The structure of the Grid Broker pattern comprises six participants: Job, Client, Resource, Task, Grid Broker and Information Service. Job is a generic application whose execution needs many resources. It is composed of the application code along with documents (descriptors) describing the parallel computing model to use and every other detail useful for a correct execution, such as its complexity, I/O operations, and sizes of inputs and outputs. These details can be used by the broker to select resources in order to meet the QoS constraints (such as deadline and budget) also specified by the user within a descriptor. In particular, when the application is unstructured, the job is a set of independent and self-contained tasks (code and input data). Client is the component which submits a job to the Grid Broker for execution.
Figure 1. (a) Automatic resource selection and mapping; (b) hierarchical architecture of the broker
[Figure 2 here: class and sequence diagrams of the Grid Broker pattern, showing broker operations such as execute(JobID) and discard(JobID), internal steps such as discoverResources, selectScheduleTask, extractTasks, searchResources and collectResults, and resource operations execute(TaskID) and terminate(TaskID).]
Figure 2. Structure and dynamics of the Grid Broker pattern

The Client (1) runs on the Grid user machine, (2) can interact with the broker, (3) can control and monitor the job execution, and (4) can terminate the job execution. Resource is a generic resource which makes its services available to users for executing tasks. It is able to register itself with an information service in order to make its services available for Grid computing. Task is an object allocated on a Resource by the Grid Broker in order to run a task of the job. Grid Broker is the key component of the pattern. It exports an API to clients for submitting jobs and to resources for registering themselves. In order to accomplish its tasks, the Grid Broker interacts with the Information Service (which may or may not be integrated in the broker) for discovering resources. Moreover, the Grid Broker can select resources able to execute an application satisfying the user requirements, by using an algorithm chosen by the user; finally, it allocates tasks to the selected resources and schedules them for execution. Information Service is used by resources to register themselves; it provides registration, indexing and discovery services. The sequence diagram in figure 2 shows a typical scenario for executing a job. The grid broker activity is divided into three phases: resource discovery, resource selection, and task mapping and scheduling. However, the implementation of the Grid Broker pattern in a real Grid requires several problems to be addressed, regarding application and resource description, the algorithm used by the broker to select resources in order to satisfy the desired QoS, the scheduling of tasks and the management of their dependencies.
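As a purely illustrative rendering of these participants, the following Java interfaces sketch the roles described above; the method names echo those of Fig. 2, but they are assumptions and not the actual HiMM broker API.

    // Illustrative interfaces for the Grid Broker pattern participants.
    interface GridBroker {
        String submit(Job job);                 // returns a job identifier
        void discard(String jobId);
        Object collectResults(String jobId);
    }

    interface InformationService {
        void register(Resource r);              // resources publish themselves
        java.util.List<Resource> discover(String requirements);
    }

    interface Resource {
        Object execute(Task task);              // run a task and return its result
        void terminate(Task task);
    }

    interface Task {
        Object execute();                       // self-contained unit of work
    }

    interface Job {
        java.util.List<Task> extractTasks();    // split the job according to its descriptor
    }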
4.1. Application and resource descriptors

In order to execute an application within the desired deadline and without exceeding a budget, the user must specify his requirements, whereas resource owners must specify the features of their resources. To collect this information, we defined XML-based descriptors for describing user requirements, a job, and resource features. These descriptors are called User Requirements Description Format (URDF), Job Description Format (JDF) and Resource Description Format (RDF), respectively. As regards RDF, it is worth noting that, to perform effective task scheduling, low-level (such as CPU speed), medium-level (such as the time required to perform a simple operation), high-level (such as the time required to execute a common task) and dynamically acquired (such as a specific benchmark executed by a mobile agent) resource information can be used.
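Purely as an illustration of the kind of information an RDF document could carry, a corresponding in-memory container might look like the following; field names and units are our assumptions, not the real RDF schema.

    // Illustrative container for the resource information an RDF document could hold.
    class ResourceDescription {
        String hostName;
        double cpuSpeedMHz;           // low-level information
        double simpleOpTimeMicros;    // medium-level: time for a simple operation
        double benchmarkTaskSeconds;  // high-level: time to execute a common task
        double dynamicBenchmark;      // dynamically acquired (e.g. by a mobile agent)
        double costPerHour;           // price used by the economy-driven broker
    }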
4.2. Scheduling algorithms

The selection algorithms adopted by our broker implementation are based on the economy model proposed in [4]. In particular, the time minimization algorithm selects the resources whose aggregate cost is lower than the budget and that are able to complete the job execution as quickly as possible. The cost minimization algorithm selects the cheapest resources able to complete the application execution within the deadline. A third algorithm selects the resources whose cost is lower than the budget and that assure the completion of the job execution within the deadline. To facilitate the adoption of different scheduling algorithms, the Grid broker could be designed according to the Strategy behavioral pattern.
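The following sketch shows, under our own assumptions, how a time-minimization selection could be plugged in behind a Strategy-style interface: the job is modelled as a set of equally sized independent tasks, each selected resource is paid for the whole job duration at its hourly rate, and a simple greedy heuristic adds resources by throughput-per-cost ratio while the estimated cost stays within the budget. This is an illustration, not the broker's actual algorithm.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Illustrative economy-driven, time-minimizing selection under a budget.
    class ResourceOffer {
        final String name;
        final double tasksPerHour;   // estimated throughput for this job's tasks
        final double costPerHour;    // price asked by the resource owner
        ResourceOffer(String name, double tasksPerHour, double costPerHour) {
            this.name = name; this.tasksPerHour = tasksPerHour; this.costPerHour = costPerHour;
        }
    }

    interface SelectionStrategy {    // Strategy pattern: pluggable selection algorithms
        List<ResourceOffer> select(List<ResourceOffer> candidates, int nTasks, double budget);
    }

    class TimeMinimization implements SelectionStrategy {
        public List<ResourceOffer> select(List<ResourceOffer> candidates, int nTasks, double budget) {
            List<ResourceOffer> sorted = new ArrayList<>(candidates);
            // best throughput-per-cost first
            sorted.sort(Comparator.comparingDouble(
                    (ResourceOffer r) -> -(r.tasksPerHour / r.costPerHour)));
            List<ResourceOffer> chosen = new ArrayList<>();
            for (ResourceOffer r : sorted) {
                chosen.add(r);
                if (cost(chosen, nTasks) > budget)
                    chosen.remove(chosen.size() - 1);   // would exceed the budget
            }
            return chosen;
        }
        static double time(List<ResourceOffer> rs, int nTasks) {
            double throughput = rs.stream().mapToDouble(r -> r.tasksPerHour).sum();
            return throughput > 0 ? nTasks / throughput : Double.POSITIVE_INFINITY;
        }
        static double cost(List<ResourceOffer> rs, int nTasks) {
            double t = time(rs, nTasks);
            return rs.stream().mapToDouble(r -> r.costPerHour * t).sum();
        }
    }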
4.3. Management of task dependencies

Managing all kinds of task dependencies is difficult. In this paper we focus on the master/slave model of parallelism applied to object-oriented applications, also known as the Master/Slave pattern [3]. This programming model has been used successfully for a wide class of parallel applications [16, 8] and is suited to programming in heterogeneous Grid environments [2, 15]. In this model, the dependencies and the communication patterns among tasks are simple and statically definable. In particular, a special process, the master, assigns tasks to other processes, the slaves. So, for each slave the following actions have to be performed: 1) transmission of data and command towards the slave; 2) execution of the computational task; 3) transmission of the result from the slave back to the master; 4) processing of the partial results produced by the slave in order to collect the final result. In particular, in order to transparently subdivide application tasks among computational resources, programmers have to specify how: (1) the computation can be divided, using an algorithm or domain-specific information; (2) the final result of the whole task can be computed from the sub-results obtained from the slaves.

5. OBJECT-ORIENTED MASTER/SLAVE SCHEDULING

In our object-oriented environment, the master/slave parallel model requires the creation of the processes and the allocation onto them of globally referable remote (slave) objects, whose methods can be asynchronously invoked by the master object. These objects can be instances either of classes implementing the master/slave pattern or of regular existing classes. In both cases, when a resource is a Coordinator/Broker, the task is further subdivided into a number of smaller tasks. The splitting continues until no more coordinators are met or the task cannot be further subdivided. The computation completes when all the subtasks are completed and the partial results are collected by the master. In the first case (fig. 3), a client object invokes a remote method of a master object (service) that, by exploiting broker information, splits the task into a number of subtasks and assigns them to the high-level resources of a HiM. In the second case (fig. 4), any existing class can be used to instantiate an object that works as a master: a JDF descriptor indicates which method has to be used as the service method and which task has to be split into subtasks. Therefore, the pattern is dynamically implemented and every object can be turned into a master object able to transparently split the service task into subtasks. This approach is made possible by using reflection and a meta-object protocol (MOP) [12]. Among the several MOP schemes [10] presented in the literature, we chose the proxy-based runtime MOP to avoid modifications to virtual machines (VMs). Such an approach is based on the introduction, at compile time, of hooks in a program so as to reify run-time events such as object creations and method invocations. To avoid VM modifications, this scheme uses the Proxy design pattern: surrogate objects (stubs) intercept object creation and method invocation events and pass them to the meta-level.
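In the first case, the master's service method essentially carries out the four per-slave actions listed in Section 4.3; a minimal sketch follows (ours, with futures standing in for the asynchronous remote invocations of the real system, and with purely illustrative names):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Minimal master-side sketch of the four actions: ship data/command to each
    // slave, execute the computational task, get the result back, combine partials.
    class MasterDriver {
        static int runJob(List<int[]> partitions) throws Exception {
            ExecutorService slaves = Executors.newFixedThreadPool(partitions.size());
            List<Future<Integer>> partials = new ArrayList<>();
            for (int[] part : partitions) {                 // 1) ship data + command
                Callable<Integer> task = () -> {            // 2) slave-side computation
                    int sum = 0;
                    for (int v : part) sum += v;
                    return sum;
                };
                partials.add(slaves.submit(task));
            }
            int result = 0;
            for (Future<Integer> f : partials)              // 3) collect each result
                result += f.get();                          // 4) combine partial results
            slaves.shutdown();
            return result;
        }
    }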
[Figure 3 here: class and sequence diagrams for the statically programmed master/slave case. The code fragments embedded in the figure are:]

    class Master extends Task implements IMaster {
        public Object execute() { return service(); }
        public Object service() {
            splitWork(); callSlaves(); combineResults();
            return ...;
        }
    }

    class Slave extends Task implements ISlave {
        private Matrix b;
        private float[][] a;
        public Object execute() { return service(); }
        public Object service() { return b.multiply(a); }
    }

Figure 3. Structure and dynamics of the Grid Broker pattern specialized for the master/slave model
[Figure 4 here: the reflective variant. A base-level object is wrapped by a stub created through the MOP (newInstance(...), newWrapper(...)); method invocations such as multiply are reified and handled at the meta-level, where the task is split into subtasks if the node is a coordinator and executed at the base level otherwise. The task code embedded in the figure is:]

    // TASK executed by a generic Resource
    float[][] a = ...;
    float[][] b = ...;
    Matrix mb = (Matrix) MOP.newWrapper(new Matrix(b));
    float[][] c = mb.multiply(a);

Figure 4. Structure and dynamics of the Grid Broker pattern with reflection
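To give an idea of how a proxy-based run-time MOP can reify invocations without touching the VM, the following self-contained sketch uses java.lang.reflect.Proxy as a stand-in; it is an illustration under our own assumptions, not the compile-time stub generation actually used by HiMM.

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;

    // Illustration: a dynamic proxy whose handler plays the role of the meta-level.
    interface IMatrix {
        float[][] multiply(float[][] a);
    }

    class SimpleMatrix implements IMatrix {
        private final float[][] b;
        SimpleMatrix(float[][] b) { this.b = b; }
        public float[][] multiply(float[][] a) {
            int n = a.length, m = b[0].length, k = b.length;
            float[][] c = new float[n][m];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < m; j++)
                    for (int h = 0; h < k; h++)
                        c[i][j] += a[i][h] * b[h][j];
            return c;
        }
    }

    class MetaLevelHandler implements InvocationHandler {
        private final Object base;
        MetaLevelHandler(Object base) { this.base = base; }
        public Object invoke(Object proxy, Method m, Object[] args) throws Throwable {
            // Meta-level hook: a broker-aware handler could split the arguments,
            // ship subtasks to slave resources and combine partial results here.
            return m.invoke(base, args);   // minimal sketch: forward to the base level
        }
    }

    class MOPSketch {
        static IMatrix newWrapper(IMatrix base) {
            return (IMatrix) Proxy.newProxyInstance(
                    IMatrix.class.getClassLoader(),
                    new Class<?>[]{ IMatrix.class },
                    new MetaLevelHandler(base));
        }
    }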
6. CONCLUSIONS AND FUTURE DIRECTIONS

The main contribution of this paper is the definition of a platform-independent architectural pattern for designing Grid Brokers for middleware platforms. Although such a pattern can be implemented in any middleware, in this paper it is implemented to provide HiMM with brokering services. This implementation uses XML-based documents for describing resources, applications and user requirements, and uses an economy-driven model for selecting resources from a pool in order to guarantee the execution of a job respecting the QoS constraints. Finally, the paper analyses the master/slave computing model and its integration with the Grid
broker pattern. In particular, two implementations of the master/slave pattern are discussed: a static one and a dynamic, reflection-based one. In the future, we intend (1) to develop a Web Service based implementation of the broker for its integration with OGSA/Globus [19] and (2) to investigate reservation policies for guaranteeing QoS requirements when non-dedicated computers are used.

REFERENCES
[1] G. Allen et al., The Cactus Worm: Experiments with Dynamic Resource Discovery and Allocation in a Grid Environment. J. of High Performance Computing Applications, 15(4):345-358, 2001.
[2] F. Berman, High-performance schedulers. In I. Foster and C. Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure, Chap. 12. Morgan Kaufmann Publishers, July 1998.
[3] F. Bushmann et al., Pattern-Oriented Software Architecture: A System of Patterns. Wiley and Sons, 1996.
[4] R. Buyya and M. Murshed, GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing. The J. of Concurrency and Computation: Practice and Experience, Wiley Press, pp. 1-32, May 2002.
[5] H. Casanova, Simgrid: A Toolkit for the Simulation of Application Scheduling. Proc. of the First IEEE/ACM Intl Symposium on Cluster Computing and the Grid, Brisbane, Australia, May 15-18, 2001.
[6] M. Di Santo, F. Frattolillo, W. Russo, and E. Zimeo, A Portable Middleware for Building High Performance Metacomputers. Proc. of Intl Conf. PARCO'01, Naples, Italy, September 4-7, 2001.
[7] M. Di Santo, F. Frattolillo, W. Russo, and E. Zimeo, A Component-based Approach to Build a Portable and Flexible Middleware for Metacomputing. Parallel Computing, 28(12):1789-1810, Elsevier, 2002.
[8] K. Everaars and B. Koren, Using coordination to parallelize sparse-grid methods for 3-d cfd problems. Parallel Computing, 24(7):1081-1106, Elsevier, 1998.
[9] I. Foster, C. Kesselman and S. Tuecke, The Anatomy of the Grid: Enabling Scalable Virtual Organizations. Intl J. of Supercomputer Applications, 15(3), 2001.
[10] J. de O. Guimarães, Reflection for Statically Typed Language. LNCS 1445, pp. 440-461, 1998.
[11] K. Krauter, R. Buyya, and M. Maheswaran, A Taxonomy and Survey of Grid Resource Management Systems for Distributed Computing. Intl J. of Software: Practice and Experience, 32(2), Wiley Press, USA, 2002.
[12] G. Kiczales, J. des Rivières, and D. G. Bobrow, The Art of the Metaobject Protocol. MIT Press, 1991.
[13] O. F. Rana and D. W. Walker, Service Design Patterns for Computational Grids. In Patterns and Skeletons for Parallel and Distributed Computing, ed. by F. A. Rabhi and S. Gorlatch. Springer-Verlag, 2002.
[14] J. M. Schopf, A General Architecture for Scheduling on the Grid. JPDC, Special Issue on Grid Computing, April 2002.
[15] G. Shao, F. Berman, R. Wolski, Master/Slave Computing on the Grid. Heterogeneous Computing Workshop, IEEE Computer Society Press, 2000.
[16] L. M. Silva et al., Using mobile agents for parallel processing. Proc. of the International Symposium on Distributed Objects and Applications, Sept. 1999.
[17] Condor, http://www.cs.wisc.edu/condor/, 2003.
[18] AppLeS, http://grail.ucsd.edu/, 2003.
[19] OGSA, http://www.globus.org/ogsa/, 2003.
A framework for experimenting with structured parallel programming environment design*

M. Aldinucci(a), S. Campa(b), P. Ciullo(b), M. Coppola(a), M. Danelutto(b), P. Pesciullesi(b), R. Ravazzolo(b), M. Torquati(b), M. Vanneschi(b), and C. Zoccolo(b)

(a) Institute of Information Science and Technologies (ISTI) - National Research Council (CNR), Via Moruzzi 1, I-56124 Pisa, Italy

(b) Department of Computer Science, University of Pisa, Via Buonarroti 2, I-56127 Pisa, Italy

ASSIST is a parallel programming environment aimed at providing programmers of complex parallel applications with a suitable and effective programming tool. Being based on algorithmic skeleton and coordination language technologies, the programming environment relieves the programmer of a number of cumbersome, error-prone activities that are required when using traditional parallel programming environments. ASSIST has been specifically designed to be easily customizable in order to experiment with different implementation techniques, solutions, algorithms or back-ends any time new features are required or new technologies become available. In this work we discuss how this goal has been achieved and how the current ASSIST programming environment has already been used to experiment with solutions not implemented in the first version of the tool.

1. INTRODUCTION

Our research group recently developed a structured parallel programming environment based on the skeleton and coordination language technology. The programming environment (ASSIST, A Software development System based on Integrated Skeleton Technology [13, 14]) is intended to solve some of the problems we had in the past while designing other structured parallel programming environments such as P3L [4] and SkIE [5]. Those problems were mainly related to language expressiveness and interoperability. The whole ASSIST environment has been designed and implemented exploiting well-known software engineering techniques, which enable the easy extension of the programming environment features. The result is a high-performance, structured, parallel programming environment that produces code for plain POSIX/TCP workstation networks/clusters. The object code produced by the compiling tools has demonstrated good efficiency and scalability on medium- to coarse-grain parallel applications. Furthermore, some of the features inserted in the structured coordination language of ASSIST (ASSISTcl) allow fair interoperability levels to be achieved (e.g. with respect to the CORBA framework)

*This work has been partially supported by the Italian MIUR Strategic Project "legge 449/97" year 1999 No. 02.00470.ST97 and year 2000 No. 02.00640.ST97, and by the Italian MIUR FIRB Project GRID.it No. RBNE01KNFE
as well as to use different kinds of existing, optimized library code within an ASSISTcl parallel application. The result is a programming environment that can be suitably used to program complex, interdisciplinary applications. The ASSIST features are discussed elsewhere [13, 14, 6, 8, 3, 2]. In this work we want to point out how the ASSIST programming environment can be used to experiment with new implementation techniques, mechanisms and solutions within the framework of structured parallel programming models. Therefore we briefly outline the ASSIST implementation structure and then discuss some experiments, aimed at extending the environment, that we performed to modify ASSIST in such a way that it can be used to program GRID architectures, include existing libraries in the application code, target heterogeneous cluster architectures, etc. The paper is organized as follows: Section 2 outlines the ASSIST environment features, Section 3 describes some experiments that took advantage of the ASSIST features, and Section 4 outlines the experiments currently being performed with the ASSIST environment.

2. ASSIST
ASSIST is a programming environment oriented to the development of parallel and distributed high-performance applications. It provides programmers with a structured coordination language (ASSISTcl) based on customizable parallel skeletons [10, 12, 7] that can be used to model the most common parallelism exploitation patterns, and a development toolkit (astCC) to compile ASSISTcl programs for homogeneous clusters of POSIX workstations. ASSISTcl allows arbitrary graphs of concurrent, possibly parallel activities to be defined in a program. Items in the graph are interconnected by means of data-flow streams. In turn, the parallel activities appearing in the graph can be expressed using the ASSISTcl skeletons, as well as using plain sequential C, C++ or F77 code. ASSISTcl skeletons include the parmod (parallel module), a configurable skeleton that can be used to model most of the classical skeletons (pipelines, task farms, data-parallel skeletons). A complete description of the coordination language can be found in [13, 14]. The ASSIST environment has been designed since the very beginning with the goal of being efficient (i.e. able to produce fast object code) as well as modifiable when needed. Therefore, the whole ASSISTcl compiling tool chain has been given a three-tier design: the front-end (the top tier) parses ASSISTcl syntax and produces an internal representation of the program;
core (the middle tier) that is the compiler core. It translates the intemal representation of a program into the task code. The task code represents a sort of C++ template-based, high level, parallel assembly language. The step transforming intemal representation into task code is completely implemented exploiting design pattern technology [9]. A facade pattern decouples compiler intemals from the compiler engine; back-end (the bottom tier) compiles task code down to the ASSIST abstract machine (CLAM,
the Coordination Language Abstract Machine) object code.
619 The CLAM is basically built on POSIX processes/threads and communication (SysV and TCP/IP sockets) primitives. All those primitives are used via the ACE (Adaptive Communication Environment) library [ 1]. The result of the whole compilation process consists in two distinct items: the object code files (either as compiled code or as shared libraries) and an XML configuration file, holding all the information needed to run the program: which objects need to be loaded/executed on which processing element(s), how streams are mapped to INET addresses ((host, port) pairs), which existing libraries or (external) object codes must be loaded on the processing elements, etc.) The three tiers design allows efficient code to be generated, as each tier may take the most appropriate and efficient choices related to object code production. Furthermore, the heavy usage of well known software engineering techniques, such as the design patterns, insulate all the individual parts of the compiler in such a way that modification in one compiler part neither affect the whole compilation process (but for the new features introduced/modified) nor require changes in other compiler parts. Eventually, ASSISTcl compiled code is run by means of a dedicated loader (the assistrun command), that in turn activate CLAM run-time support. A CLAM master process scans the XML configuration file produced by a s t C C compiler and arranges things in such a way that the CLAM slave processes run on the target architecture processing nodes load and execute the suitable code 2 after the set up of the communication infrastructure (i.e. after the proper TCP/IP sockets have been published on the nodes). A more detailed description of the ASSIST implementation can be found in [3]. As CLAM (and the object code itself) access POSIX features via ACE wrappers, and as ACE is available on different operating systems (Linux and Windows, as an example), CLAM actually behaves as the fourth tier of the compile/run process and guarantees a degree of portability of the whole programming environment.
3. EXPERIMENTING WITH ASSIST The first version of ASSIST has been designed in 2001 in the framework of an Italian National project involving Italian National Research Council and Italian Spacial Agency: the ASIPQE project. This version correctly compiled a significant subset of the original ASSlSTcl language. This subset did not include, as an example, pipeline and task farm skeletons. They rather can be implemented customizing a p a r m o d skeleton. Furthermore, the compiler was able to produce code only for homogeneous 3 Linux/ACE clusters. Exploiting the ASSIST implementation features, we performed different experiments on this first prototype environment, that are described in the following Sections.
3.1. Optimized library inclusion In many cases the application programmer would cope with specified problems by relying on optimized libraries. A notable example of such libraries include mathematical libraries build on top of the MPI programming environment. ASSISTcl provides different facilities to integrate existing sequential or parallel libraries. Mainly, programmers can denote in the program which external sources/object codes have to be compiled/linked to the ASSISTcl modules. The pro2either coming from ASSISTcl source code or belonging to external libraries properly named by the programmer in the ASSlSTcl source code 3in terms of processor type
620 grammer may invoke completely extemal services from within the ASSISTcl modules as well. As an example, CORBA object services can be claimed from within the p a r m o d sequential modules denoting parallel activities (or virtual processors, in our terminology). However, there was no facility in the original version of the environment that allowed programs to benefit from the execution of already existing, optimized, MPI-based library code. Exploiting the compiler and CLAM structure, we performed an experiment that allowed to use such libraries directly within the ASSISTcl code [8]. Basically, the mp i r u n command of the mp i c h version of MPI has been slightly modified in such a way that its services can be invoked from within CLAM processes. The whole library code has then been wrapped in such a way that it looked like a normal p a r m o d code to the programmer. Overall, this allowed MPI libraries to be run in an ASSISTcl program without requiring an explicit programmer intervention. Although the whole library integration process has not been included in the compiler yet, this experiment demonstrated that the compiler/CLAM pair is flexible enough to allow to completely independent, parallel library code to access the services provided by CLAM (e.g. to know the INET address of the input and output stream), as well as to provide services to the rest of the ASSISTcl program (e.g. to provide "compute" calls to other modules appearing in the ASSISTcl program graph).
3.2. GRID As discussed, the output produced by a s t CC is made up by a set of object code/DLLs and an XML file, representing the "structure" of the parallel code. Exploiting this feature, ASSISTcl programs can be run on a GRID configuration as follows performing the following three steps. First, the XML configuration file can be analyzed and resources needed to execute the program can be individuated. Such resources are defined in terms of generic "processing elements" in the original compiler XML file. Second, the resources needed to execute the program can be gathered (and reserved) from GRID using the normal GRID middle-ware tools (e.g. those provided by the Globus toolkit). Last, the XML file can be modified in such a way that the resources gathered are used to run the ASSIST code. In order to to demonstrate the feasibility of the approach we developed a tool that support such kind of manipulation of the original XML configuration file [6]. Actually, the tool only supports programmer decisions: starting from information gathered from the GRID, the tool proposes to the programmer a set of choices. Afterward, the tool produces a new XML configuration file describing the new mapping of program entities onto GRID resources. In Section 4, we should outline how this process is being fully integrated in the ASSIST programming environment in such a way the environment could efficiently target GRID architectures also. 3.3. Heterogeneous clusters The first version of ASSIST produces code for homogeneous cluster of Linux PC/WS only. The missing items needed to produce code for heterogeneous architectures are basically two: the inclusion of some kind of XDR (external data representation) wrapping messages flowing among heterogeneous processing elements, and the generation of proper makefiles to compile final object code 4. Both this problems can be solved exploiting the ASSISTcl compiling tools 4The a s t C C compileractuallyproduces a set of C++ files that include calls to ACE and CLAM services, but these need to be compiledusing a standard C++ compiler. This is because we do not aim at entering the sequential code
621 structure. As the a s t CC compiler uses a builder pattern both to generate the actual task code and to generate the makefiles needed to compile task code to object code, we are currently intervening on these two builders in order to modify them in such a way that: 9 on the one side, communication routines are produced that either process memory communication buffers with XDR routines during marshaling and unmarshaling or do not process them with XDR. The former routines will be used in case processing elements using different data representations (e.g. little/big endian machines) are involved in the communication. The latter routines instead will be used in those cases when homogeneous processing elements are involved in the communications. Proper makefiles are generated consequently 9 on the other side, the XML config file is arranged in such a way that XDR communication libraries are used when "different" architectures are involved and non-XDR routines are used in all the other cases. Again, the algorithms that solve these problems are currently being incorporated in the ASSISTcl compiling tools. We plan to have a working version of the compiler targeting Intel Linux and Mac OsX networks by the end of this year.
3.4. ASSISTcl improvement The frst version of the ASSISTcl coordination language demonstrated some problems and pitfalls mainly due to the strict timings involved in language design and implementation in the ASI-PQE project. Currently, we are enhancing the coordination language features, mainly those related to the language expressivity and to the possibility to use external libraries from within the sequential portions of code included in the ASSlSTcl source code. Different enhancements have been designed and implemented exploiting the three tier structure of the ASSIST environment compiling tools. As an example, a smarter syntax has been designed to allow items of the output data streams of a p a r m o d to be gathered from p a r m o d virtual processors (i.e. internal, concurrent/parallel p a r m o d activities). These enhancements did not imply any changes in the compiler core and back-end layers. Just the front end has been modified. Other enhancements regard the possibility to have variable length arrays in the ASSlSTcl type systems. This is dramatically important in order to efficiently implement the activities (and the communications) involved in Divide&Conquer like computations. Again, the introduction of variable length data structures 5 only interested the front-end and part of the core compiler layer. The existing task code implementation perfectly supports these changes. 4. O N G O I N G ACTIVITIES The experiments described in Section 3 have already been performed and currently the related experience is being moved to the production compiler. However, as the ASSIST environment was exactly meant to be a sort of test-bed to experiment new solutions to efficiently support compiling area. 5in the sense of C++ Vector objects
622 structured parallel programming on a wide range of target architectures, we are currently studying different new enhancements, in the context of several National Research projects. These experiences are described in the following sections. 4.1. Component-based ASSISTcl
In the context of the FIRB project, we are re-designing the ASSIST environment as a fullfledged component based programming model. This means that the upper part of the coordination language will move to a component framework and that existing ASSISTcl skeletons will be provided as parametric components to be used in the construction of parallel applications. This is possible as the modules that are used to construct the genetic graphs appearing in AS$1STcl programs are already conceived as close modules interacting with the external world (other modules) via streams and events. As a matter of fact, this means that ASSISTcl modules already behave as (non-standardized) components. On the other side, the task code is already organized as a class hierarchy providing objects that model common parallel program components. Therefore, while restructuring the coordination language level, we are also trying to expose part of the task code at the component level, in such a way that experienced users can program directly the component task code level to provide new higher level components to applicative programmers (i.e. to the end users of the ASSIST programming environment) 6. 4.2. Full GRID ASSISTcl In the meanwhile, in the context of both strategic project "legge 449/97" and FIRB project we are currently trying to make automatic the targeting of GRID architectures. In order to produce efficient code, many factors have to be taken into account, which are traditionally handled by expert parallel programmers: resource co-allocation, code and data staging, task scheduling and the alike. The structure of existing ASSIST programming environment can be exploited in this case as follows: ~ resource co-allocation can be decided on the basis of the contents of the XML configuration file produced by the ASSISTcl compiling tools. In particular, the compiler already devises the number and the kind of resources needed to execute the code, mostly exploiting user provided parameters. A CLAM version targeting GRIDs may easily process the XML config file in such a way that resources are looked for that match the needs stated in that file. 9 Code and data staging can also be managed by the CLAM setup process. Also on clusters, the first phase in the execution of an ASSISTcl program consists in deploying the proper object/library code to the interested processing nodes. Data items instead are delivered to the processing nodes needing them either as data items on the streams connecting the ASSISTcl program modules (and this happens under direct programmer control), or as data items belonging to the underlying distributed shared virtual memory subsystem (this is automatically managed by the ASSISTcl runtime). ~ Task scheduling is completely under the control of CLAM and follows the directives taken from the XML configuration file. 6Much in the style of what already happens in the design pattern environment COPS [11].
Therefore, by moving the GRID configuration and management phase to the XML config file processing phase and to the CLAM, we expect that the whole ASSIST environment can be made "GRID aware".
5. CONCLUSIONS
[Figure 1: Typical ASSISTcl program performance (data mining code on a Linux PC cluster): execution time versus number of PEs, with measured and ideal curves for different support-set sizes (1.0% and 1.5% support).]
In this work we outlined the features of the ASSIST programming environment and we discussed several experiences aimed at improving the programming environment features. We are currently consolidating the experimental results achieved with the ASSIST programming environment. That means that the current "production" compiling tools do not include some of the results already assessed. Rather, development versions of the ASSISTcl compiling tools include these results and are
currently being debugged and used by our team. In all cases, once the compiler debugging had been completed, programmers using the ASSIST programming environment took only hours to develop complete parallel applications out of existing sequential code, and these applications demonstrated good (close to ideal) scalability on workstation clusters with either Fast or Gbit Ethernet. As a typical example of the performance achieved using ASSIST, Figure 1 plots the execution times (ideal and measured) of an ASSIST data mining application run on a network of Linux PCs (the different curves correspond to different dimensions of the support set). Once the debugged versions of the compilers were used, programmers never needed to enter the classical debug/compile/run cycle. In particular, they did not have to care about the correctness of process/parallel activity setup and scheduling, communications, termination, etc. Instead, programmers could spend their time experimenting with different parallelization strategies for the application at hand. This activity requires rewriting a few lines of the coordination/skeleton code from scratch, rather than entire parts of the program. Overall, this represents a substantial advantage with respect to what happens when using other parallel programming environments, at least in our experience.
REFERENCES
[1] The Adaptive Communication Environment home page. http://www.cs.wustl.edu/~schmidt/ACE-papers.html, 2003.
[2] M. Aldinucci, S. Campa, P. Ciullo, M. Coppola, M. Danelutto, P. Pesciullesi, R. Ravazzolo, M. Torquati, M. Vanneschi, and C. Zoccolo. ASSIST demo: a high level, high performance, portable, structured parallel programming environment at work. In Proceedings of EuroPar'03, LNCS. Springer Verlag, August 2003. To appear.
[3] M. Aldinucci, S. Campa, P. Ciullo, M. Coppola, S. Magini, P. Pesciullesi, L. Potiti, R. Ravazzolo, M. Torquati, M. Vanneschi, and C. Zoccolo. The Implementation of ASSIST, an Environment for Parallel and Distributed Programming. In Proceedings of EuroPar'03, LNCS. Springer Verlag, August 2003. To appear.
[4] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, and M. Vanneschi. P3L: A Structured High level programming language and its structured support. Concurrency Practice and Experience, 7(3):225-255, May 1995.
[5] B. Bacci, M. Danelutto, S. Pelagatti, and M. Vanneschi. SkIE: a heterogeneous environment for HPC applications. Parallel Computing, 25:1827-1852, December 1999.
[6] R. Baraglia, M. Danelutto, D. Laforenza, S. Orlando, P. Palmerini, R. Perego, P. Pesciullesi, and M. Vanneschi. AssistConf: A Grid Configuration Tool for the ASSIST Parallel Programming Environment. In Proceedings of the Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, pages 193-200. Euromicro, IEEE, February 2003. ISBN 0-7695-1875-3.
[7] M. Cole. Bringing skeletons out of the closet. Available at the author's home page, December 2002.
[8] P. D'Ambra, M. Danelutto, D. di Serafino, and M. Lapegna. Integrating MPI-Based Numerical Software into an Advanced Parallel Computing Environment. In Proceedings of the Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, pages 283-291. Euromicro, IEEE, February 2003. ISBN 0-7695-1875-3.
[9] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison Wesley, 1994.
[10] H. Kuchen. A skeleton library. In Proceedings of the Euro-Par 2002 Conference, LNCS. Springer Verlag, August 2002.
[11] S. MacDonald, J. Anvik, S. Bromling, J. Schaeffer, D. Szafron, and K. Tan. From patterns to frameworks to parallel programs. Parallel Computing, 28(12):1663-1684, December 2002.
[12] J. Serot and D. Ginhac. Skeletons for parallel image processing: an overview of the SKIPPER project. Parallel Computing, 28:1685-1708, December 2002.
[13] M. Vanneschi. ASSIST: an environment for parallel and distributed portable applications. Technical Report TR 02/07, Dept. Computer Science, Univ. of Pisa, May 2002.
[14] M. Vanneschi. The programming model of ASSIST, an environment for parallel and distributed portable applications. Parallel Computing, 28:1709-1732, December 2002.
Minisymposium
Grid Computing
Considerations for Resource Brokerage and Scheduling in Grids
R. Yahyapour
Computer Engineering Institute, University Dortmund, Germany
e-mail: [email protected]
While Grid scheduling is currently still underdeveloped, it is a key component for future Grids which support more complex applications. An automatic brokerage and scheduling system is required to allocate and orchestrate all resources necessary for a Grid job. This paper discusses the requirements for Grid scheduling and draws several conclusions for the actual design of such a system.
1. INTRODUCTION
The Grid is expected to offer transparent access to resources of very different nature, as for instance CPU, network, data or software. It is a cumbersome task for the user to find and utilize such resources as they are provided by different owners with individual access policies. Thus, prior to accessing a remote resource, a negotiation process is required in which an agreement is reached on the resource offering and utilization. This process is not manually feasible in a large-scale Grid environment with many potentially available resources. Therefore, the Grid infrastructure must provide services for automatic resource brokerage that take care of the resource selection and negotiation process. Clearly, such an efficient and transparent Grid resource management is a key element of any large-scale Grid, as the mere availability of a large number of resources connected by a network does not constitute a Grid. As a Grid is dynamic by nature, it is essential to discover the available resources, coordinate the resources required by a complex job, and monitor the progress of this job. Therefore, a large-scale Grid requires an elaborate resource management system that hides this process from the user. From a technological point of view, still only core Grid services currently exist. They deliver basic functionality for setting up Grids. Those basic services include interfaces for accessing remote resources, security services, information services, and protocols for resource reservation and data transfers. They are, for instance, provided by the Globus Toolkit, which is the current de-facto standard for many Grid projects. However, higher-level services for coordinating resource access are still missing in this toolbox. Although there are also complete Grid systems (e.g. Condor, Legion, Nimrod/G), those systems only deliver a full set of services for specific configurations. There are many projects using Grid technology. Most of them try to achieve a fast implementation. To this end, they are often specialized for a single application scenario or a single resource type. For instance, there are Grids that focus only on the distribution of computational jobs to workstation clusters or parallel machines. Similarly, some projects, like the EU DataGrid
or the CERN LHC project, consider mainly data resources. However, there is no general solution yet that supports access to the many different resource types mentioned above. There is currently also no general approach to the coordination of several resources for a Grid job. Nevertheless, Grids are evolving and will ultimately require the support of efficient scheduling strategies. In the following, the requirements of future Grids with respect to resource management and scheduling are discussed. Accordingly, conclusions are drawn for a Grid scheduling architecture and the corresponding methods for resource brokerage.
2. REQUIREMENTS BY FUTURE GRIDS
Grid systems are expected to evolve towards the typical vision of a flexible distributed system supporting simple and transparent access to arbitrary resources or services in a global environment. For the purpose of illustrating an actual future Grid job, we consider a simple but reasonable example in which not only computational resources are considered but also network, data, storage and software. The job, respectively the user, has the following requirements:
• a specified architecture with 48 processing nodes,
• 1 GB of available memory, and
• a specified licensed software package,
• required for 1 hour between 8am and 6pm of the following day;
• in addition, a specific visualization device should be available during program execution, which requires
• a minimum bandwidth between the visualization device and the main computer during program execution;
• the program relies on a specified data set from a data repository for input;
• the user wants to spend at most 5 Euro and
• prefers a cheaper job execution over an earlier execution.
Based on this information the user generates a request and submits it to a Grid scheduler that tries to find a suitable allocation of resources. As not all of the requested resources are available at a single site, the job requires resources at different sites provided by different owners. The efficient execution of such a Grid job needs advance scheduling, which includes the reservation of resources to preserve the temporal relations between the resource allocations. Note that our example also requires advance knowledge about the planned job execution, as the use of the visualization device is clearly an interactive component. The user needs to know time and place for this resource allocation in advance. The job consists of several stages. For instance, the scheduling system should consider the data transfers before the computational part. In the first stage the database, the workstation cluster and sufficient bandwidth in the network must be available at the same time. The length of this stage is mainly determined by the bandwidth of the network. However, this step is not necessary if the database is implemented on the same workstation cluster that is used for the computational task.
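To make the shape of such a multi-resource request concrete, the following sketch models it as a plain data structure that a broker could later match against offers. This is purely illustrative: the class and field names are assumptions made here, not part of any existing Grid middleware.

import java.time.Duration;
import java.time.LocalDateTime;

// Illustrative container for the multi-resource Grid job request of the example above.
public class GridJobRequest {
    int processingNodes = 48;             // specified architecture, 48 nodes
    int memoryPerNodeMB = 1024;           // 1 GB of available memory
    String requiredLicense = "SIM-PKG";   // hypothetical licensed software package
    Duration runTime = Duration.ofHours(1);
    LocalDateTime earliestStart;          // 8am of the following day
    LocalDateTime latestEnd;              // 6pm of the following day
    String visualizationDevice;           // must be co-allocated during execution
    double minBandwidthMbps;              // between the device and the main computer
    String inputDataSet;                  // logical name of the input data set
    double budgetEuro = 5.0;              // the user wants to spend at most 5 Euro
    boolean preferCheaperOverEarlier = true;
}

Holding the time window, the bandwidth constraint and the budget in a single request object is what forces the co-allocation and advance-reservation issues discussed in the remainder of this section.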
[Figure 1 shows the resulting schedule as a time chart: data access and storing on the Data resource, data transfers on Network 1, loading data, parallel computation and providing data on the Computer (with the software license and storage allocated alongside), data storage on the Storage resource, and the communication for visualization on Network 2 together with the visualization on the VR-Device.]
Figure 1. Scheduling Result for Grid Job Example.
The considered job is actually a typical example of a Grid application. For instance, most computational jobs work on input and output data. They may require access to other limited resources like software packages with a limited number of licenses. In addition, resources are usually remote due to the very nature of Grids; access to them ultimately leads to network communication. Therefore, it is obvious that the availability of certain service level features, as for instance a certain bandwidth, will play an important role in Grid job execution. This is especially true for the management of huge data sets, where network performance determines the length of transfers. This job example can be considered simple, as Grid jobs might also expand to arbitrarily complex workflows in which more stages with different temporal dependencies come into play. Nevertheless, while we might consider this Grid job simple, the scheduling is obviously not. Figure 1 shows an abstract schedule produced by a Grid scheduler. Even for this job it is clear that the scheduling task is very complex. In the following, we discuss different aspects of the Grid that have to be considered for this kind of resource management and scheduling. Some of them are also discussed in [8].
2.1. Heterogeneous resources
As already mentioned, the Grid is designed to offer transparent access to very different resources, as for instance CPU, network, data or software. Following the current trend, there are service-oriented approaches in which Grids are considered to provide access to arbitrary services. The different nature of the resources imposes constraints that must be considered for the Grid scheduling task.
2.2. Autonomy of local resource management
Some computational Grids are based on compute and network resources of a single owner, like a large company. But many computational Grids will consist of resources with different owners in order to provide a high degree of flexibility and allow efficient sharing of resources. However, in those cases most owners are not willing to assign their resources exclusively to Grid use. For instance, a computer may temporarily be removed from a Grid to work solely on a local problem if there is need for it. In order to react immediately in those situations, owners typically insist on local control over their resources, which is achieved by implementing
a local management system. To this end, most local resource management systems include scheduling features to decide on future resource allocations. For a Grid user it is necessary that the Grid resource brokerage and scheduling services can interact with the scheduling of these local management systems to decide on suitable resource allocations. This coordinated scheduling requires a different approach than the one known from existing distributed systems, as the resources in a Grid often do not belong to the same administrative domain. Instead, the individual demands of the resource owners need to be observed, and taking them into account requires a new technological approach. Therefore, future Grids need a scheduling architecture that supports the interaction of the independent local management systems with Grid scheduling services. Such an architecture manages access to the various Grid resources, which is typically subject to individual access, accounting, priority, and security policies of the resource owners. These combined policies are part of the individual scheduling policy for the local resources. Moreover, the different policies may be subject to privacy concerns and therefore might not be exposable to remote services.
2.3. Orchestration and co-allocation
The task of Grid management and scheduling becomes even more complex as a Grid application will typically require several resources. As in our example, a remote application usually needs access to input and output files. Some of these files have to be transferred before the start and after the end of the job. Moreover, network, storage and software might also be Grid-managed resources that have to be taken into account to execute the Grid application. Therefore, Grid scheduling needs the ability to orchestrate the availability of all required Grid resources to execute such Grid jobs. Thus, a complex negotiation process is necessary that, on the one hand, respects the access policies and local resource management of the individual resource owners and, on the other hand, is able to combine resources so as to fulfill the requirements and objectives of the user and her Grid task.
2.4. Scheduling objectives
Grids have to cope with the individual access and prioritization policies as defined by the resource owners, but they must include the individual demands of the users as well. This is in contrast to single parallel systems, where the resource owner can establish a scheduling objective for all resources and the local users. In a Grid environment, users may have a wider range of different demands when interacting with remote and possibly unknown resource providers who may charge for resource consumption. In other words, a Grid management system must simultaneously deal with individual user objectives and manage individual resource providers. The consideration of these resource policies and user demands leads to complex, multi-criteria scheduling objectives.
2.5. Extensibility
Future Grids are considered to provide a technological infrastructure for very different resources and services. Therefore, different scheduling services will also evolve, specialized for certain applications or resources. Thus, there will not be only one Grid scheduling service but many. Conceptually, it might also be reasonable for some Grid users to create their personal Grid schedulers. The Grid infrastructure must support the interaction of these different services.
[Figure 2 sketches the multi-level setup: several local resources, each with its own job queue and schedule, sit below a higher-level Grid scheduling layer.]
Figure 2. Multi-Level Scheduling for Grids.
3. CONCLUSIONS FOR FUTURE GRID SCHEDULING
As we see, a Grid scheduling architecture differs significantly from scheduling or management architectures available on other systems. Scheduling systems of large supercomputers are central systems that implement a single objective function, and there are only a few different types of resources in a supercomputer, like processor nodes and memory. In contrast, we have seen that the resources in a Grid stretch from various types of hardware resources (processor nodes, memory, and network bandwidth) over data and software (application programs) to other complex services (visualization, sensors, instruments). Many distributed systems deal only with simple jobs, while complex Grid jobs require the concurrent cooperation of many different types of resources. Moreover, the resources in a Grid system usually have different owners with different objective functions. Those owners and the Grid users may not even know each other. This differs from many distributed systems which only have a single owner and a limited local user group. On the other hand, an efficient, secure, and reliable management system is vital for the acceptance of Grids by a broad community. A Grid scheduling architecture must be distributed in order to handle the highly dynamic nature of the Grid and to prevent any dependence on single components. It must be able to handle the integration of a large range of components into the Grid and support individual policies imposed by the various resource owners and Grid users. As the business models for Grids are not yet established, the scheduling architecture must also include means to implement different economic models. In order to simplify Grid use, the architecture must automatically coordinate all resources requested by a complex Grid job. In the following, some general conclusions are drawn for Grid scheduling.
3.1. Multi-level scheduling
As a result, a Grid management system must include a separate scheduling layer that collects the Grid resources specified in a job request, checks that all requirements are taken into account, and interacts with the local scheduling systems of the individual Grid resources, as shown in Figure 2. Hence, the scheduling paradigm of a Grid management system will significantly deviate from that of local or centralized schedulers, which typically have immediate access to all system information.
Figure 3. Brokerage for Grid Scheduling.
Although it seems obvious that a Grid scheduler will consist of more than a single scheduling layer, the details of an appropriate scheduling architecture have not yet been established. Nevertheless, it is clear that some layers are closer to the user (higher-level scheduling instances) while others are closer to the resource (lower-level scheduling instances). A Grid scheduling layer must exploit the capabilities of the local scheduling systems to make efficient use of the corresponding Grid resources. However, those capabilities are not the same for all local scheduling systems due to system heterogeneity. Therefore, a negotiation process is required to describe the features of a lower-level scheduling instance that can be used by a higher-level scheduling instance.
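The split between a higher-level instance and autonomous lower-level instances, and the offer-based negotiation between them, can be pictured with the following interface sketch. The names and the request/offer/commit flow are assumptions introduced here for illustration; they do not correspond to any defined Grid protocol.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A lower-level (local) scheduling instance decides itself whether and how to create offers.
interface LowerLevelScheduler {
    List<Offer> requestOffers(ResourceRequest request);   // owner policies stay local
    boolean commit(Offer offer);                          // turn an offer into a reservation
}

record ResourceRequest(int nodes, long startTime, long endTime) {}
record Offer(LowerLevelScheduler provider, long startTime, double cost) {}

// The higher-level instance queries several local schedulers and coordinates the offers
// according to the user objectives (here: the cheapest offer, as in an economic model).
class HigherLevelScheduler {
    Offer schedule(ResourceRequest request, List<LowerLevelScheduler> locals) {
        List<Offer> offers = new ArrayList<>();
        for (LowerLevelScheduler local : locals) offers.addAll(local.requestOffers(request));
        Offer best = offers.stream().min(Comparator.comparingDouble(Offer::cost)).orElse(null);
        if (best != null && best.provider().commit(best)) return best;
        return null;   // no suitable allocation found
    }
}

Keeping requestOffers and commit on the local side means that owner policies and queue state never leave the local instance, which is exactly the privacy property argued for in the next subsection.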
3.2. Scheduling process
Due to the required autonomy of the Grid resources and the local scheduling policies of the resource owners, control must remain with the local resource management systems. The Grid must also consider the privacy concerns that might prevent the exposure of the scheduling policy to a higher-level instance. Similarly, the scheduling objectives and constraints of the user might not be publicly available. For instance, information on the economic objective or the available budget for a job might not be sent to a resource owner. However, this information must be available to a higher-level scheduling instance that has to orchestrate resources from different sites. Consequently, future Grid scheduling must rely on brokerage and negotiation models. Only with such an approach do information and control remain at the resource owner and the Grid user, respectively. Figure 3 shows the required steps in a scheduling process between higher-level and lower-level scheduling instances. In the general model, resource allocations are requested from lower-level scheduling instances. This keeps the decision on whether and how to create offers local, and the local scheduler can consider the owner objectives in creating such offers. The higher-level scheduling instance will typically query several lower-level scheduling instances and coordinate the available offers according to the user objectives. Overall, the actual scheduling process is split into a lower-level and a higher-level part, connected by a negotiation protocol. Note that this setup allows different implementations on both levels of the scheduling process. Note also that we define neither the strategy by which resources and offers are selected by the Grid scheduler, nor the strategy for
creating offers by the local scheduling instances. This provides the required flexibility and extensibility of the model towards different implementations while maintaining interoperability.
3.3. Economic models
Conventional algorithms have already been adapted and analyzed for Grid computing [2, 3]. However, these approaches do not take the different scheduling and management preferences of users and resource owners into account. Here, a market-economic approach naturally comes to mind: such approaches provide support for individual access and service policies of the resource owners and Grid users. Especially the ability to consider costs is very important for future Grids. Such methods have also been applied to Grid scheduling [6, 1]. Market-oriented strategies will probably play a significant role in the Grid scheduling problem, as their advantages fit nicely the mentioned requirements of multi-level negotiation and multi-criteria scheduling.
3.4. System integration
A scheduling system for future Grids has to cooperate with several other Grid services, as for instance information services, job supervision/monitoring services, and accounting and billing services. Some of these services already exist in the Globus toolkit. There are efforts in the Global Grid Forum (GGF) to define the necessary services and protocols. For instance, scheduling attributes have been identified to define the features of scheduling systems which are necessary during the negotiation process between higher- and lower-level instances [7, 5]. Other GGF groups (e.g. GRAAP, DRMAA) are working on the protocols and interfaces for generating and executing resource allocations. However, significant basic research is still necessary to specify all components and interfaces required for a Grid scheduling architecture. A scheduling framework which supports the implementation of higher-level scheduling instances is still missing. Moreover, there is currently no coherent integration of different resources like data and networks into a single scheduling system.
REFERENCES
[1] C. Ernemann, V. Hamscher, and R. Yahyapour. Economic Scheduling in Grid Computing. In Proceedings of the 8th Annual Workshop on Job Scheduling Strategies for Parallel Processing, Springer-Verlag, LNCS, 2002.
[2] C. Ernemann, V. Hamscher, U. Schwiegelshohn, A. Streit, and R. Yahyapour. On Advantages of Grid Computing for Parallel Job Scheduling. In Proceedings of the 2nd IEEE International Symposium on Cluster Computing and the Grid (CC-GRID 2002), LNCS, 2002.
[3] V. Hamscher, U. Schwiegelshohn, A. Streit, and R. Yahyapour. Evaluation of Job-Scheduling Strategies for Grid Computing. In 7th International Conference on High Performance Computing, Springer-Verlag, LNCS, 2000.
[4] C. Ernemann and R. Yahyapour. Applying Economic Scheduling Methods to Grid Environments. In Grid Resource Management: State of the Art and Future Trends, J. Nabrzyski, J. M. Schopf, and J. Weglarz (eds.), Kluwer Publishing, 2003.
[5] U. Schwiegelshohn and R. Yahyapour. Attributes for Communication Between Grid Scheduling Instances. In Grid Resource Management: State of the Art and Future Trends, J. Nabrzyski, J. M. Schopf, and J. Weglarz (eds.), Kluwer Publishing, 2003.
[6] R. Buyya, D. Abramson, J. Giddy, and H. Stockinger. Economic Models for Resource Management and Scheduling in Grid Computing. Special Issue on Grid Computing Environments, The Journal of Concurrency and Computation: Practice and Experience (CCPE), May 2002.
[7] U. Schwiegelshohn and R. Yahyapour. Attributes for communication between scheduling instances. Scheduling Attributes Working Group, Global Grid Forum, 2001.
[8] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke. A Resource Management Architecture for Metasystems. Springer-Verlag, LNCS, 1998.
Job Description Language and User Interface in a Grid context: The EU DataGrid experience
G. Avellino, S. Beco, F. Pacini, A. Maraschini, and A. Terracina
Datamat S.p.A., Via Laurentina 760, 00143 Rome, Italy
(* DataGrid is a project funded by the European Commission under contract IST-2000-25182. We also wish to thank the rest of the EDG WP1 team (INFN and CESNET people) who helped us in designing, developing and testing the JDL and all the User Interface commands and graphical components.)
The growing emergence of large-scale distributed computing environments, such as computational or data grids, presents new challenges to resource management which cannot be met by conventional systems that employ relatively static resource models and centralised allocators. Workload management is the grid component that has the responsibility of matching and managing the grid resources in such a way that applications are conveniently, efficiently and effectively executed, ensuring at the same time that possible resource usage limits are not exceeded. From the users' point of view, grid resource management should be completely transparent, in the sense that their interaction with the workload management services should be limited to the description, via a high-level, user-oriented specification language, of the characteristics and requirements of their applications, and to the submission of their requests through an appropriate interface. A digest of the solutions adopted for the development of a framework for the description, submission, management and control of user applications in the context of the "workload management" task of the EU DataGrid project is given.
1. INTRODUCTION
The workload management task (Work Package 1) of the European DataGrid project [1] (also known, and referred to in the following text, as EDG) was mandated to define and implement a suitable architecture for distributed scheduling and resource management in a Grid environment. The innovative issues that have been tackled in the unfolding of this activity result from the following factors: the dynamic relocation of data, the very large numbers of schedulable components in the system (computers and files) and of simultaneous users submitting work to the system, and the different access policies applied at different sites and in different countries. The Workload Management System (WMS) is hence the component of the EDG middleware that has the responsibility of selecting Grid resources on the basis of
• the knowledge of the availability and proximity of computational capacity and the required data,
• the users' requests and their needs,
and distributing users' applications to them in such a way that they can be efficiently executed while respecting possible resource usage limits. To do this, the Workload Management activities also have to include the task of providing the users with suitable means to express their needs and pass them to the Grid. The goal of this task is the development of a mechanism to represent customers and resources of the system, i.e.:
• characteristics of the jobs (executable name and size, parameters, number of instances to consider, standard input/output/error files, etc.),
• resources (CPUs, network, storage, etc.) required for the processing and their properties (architecture types, network bandwidth, latency, assessment of required computational power, etc.),
and to provide this information to the workload management layer, i.e. the scheduling sub-layer, in order to identify compatible matches between customers and resources and rank them. This paper focuses on the approaches pursued and the solutions adopted in EDG WP1 for addressing the issue of the description of customers and resources in a grid environment (see Section 2) and for the development of an appropriate, flexible, simple-to-use User Interface (see Section 3) for accessing the grid.
2. THE EDG JOB DESCRIPTION LANGUAGE
The notion of entity (i.e. server and customer) description, which as mentioned is fundamental for workload management systems in any distributed and heterogeneous environment such as the grid, can be accomplished with the use of a description language. We have therefore defined the structure, syntax and semantics of a high-level, user-oriented Job Description Language (JDL) [2] with the following design goals:
• Extensibility: allows defining and including new attributes in the job description,
• Symmetry: both jobs and resources can be described through the same language,
• Simplicity: in both syntax and semantics.
The EDG JDL is a description language (based on the Condor ClassAd language [3]) whose central construct is a record-like structure, the classad, composed of a finite number of distinct attribute names mapped to expressions. Expressions contain literals and attribute references composed with operators in a C/C++-like syntax. These ads conform to a protocol that states that every description should include expressions named Requirements and Rank, which denote the requirements and preferences of the advertising entity. Two entity descriptions match if each ad has an attribute, Requirements, that evaluates to true in the context of the other ad. The main novel aspects of this framework can be summarised by the following three points:
• it uses a semi-structured data model, so no specific schema is required for the resource description, allowing it to work naturally in a heterogeneous environment;
• the language folds the query language into the data model: requirements (i.e. queries) may be expressed as attributes of the job description;
• ClassAds are first-class objects in the model, hence descriptions can be arbitrarily nested, leading to a natural language for expressing aggregates of resources and jobs (e.g. DAGs) or co-allocation requests.
The JDL defined for EDG provides attributes to support:
1. the definition of batch or interactive, simple, MPI-based, checkpointable and partitionable jobs;
2. the definition of aggregates of jobs with dependencies (Direct Acyclic Graphs);
3. the specification of constraints to be satisfied by the selected computing and storage resources;
4. the specification of data access requirements: appropriate conventions have been established to express constraints about the data that the job wants to process, together with their physical/logical location within the grid;
5. the specification of preferences for choosing among suitable resources (ranking expressions).
Despite its extensibility, the JDL encompasses a set of predefined attributes having a special meaning for the underlying components of the Workload Management System. Because of this, some of them are mandatory, while others are optional. The set of predefined attributes [4] can be decomposed into the following groups:
• Job attributes: representing job-specific information and specifying in some way actions
that have to be performed by the WMS to schedule the job (e.g. the transfer of the job input sandbox files);
• Data attributes: representing the job input data and Storage Element related information. They are used for selecting the resources from which the application has the best access to the data;
• Requirements and Rank: allowing the user to specify, respectively, the needs and the preferences, in terms of resources, of their applications. The Requirements and Rank expressions are built using the Resources attributes, which represent the characteristics and status of the resources and are recognizable in the job description as they are prefixed with the string "other.". The Resources attributes are not part of the predefined set of attributes for the JDL, as their naming and meaning depend on the Information Service schema [5] adopted for publishing such information. This independence of the JDL from the resource information schema allows targeting for submission resources that are described by different Information Services without any change in the job description language itself.
Hereafter follows an example of the JDL used to describe a simple job:
Type = "Job"; JobType = "Normal" ; VirtualOrganisation = "biome" ; Executable = "/bin/bash" ; StdOutput = "std.out" ; StdError = "std.err"; Arguments = "./sim010.sh"; Environment = "GATE_BIN=/usr/local/bin"; OutputSandbox = { "std. o u t " , "std. err", " B r a i n _ r a d i o t h 0 0 0 . root" } ; InputData = {"ifn:BrainTotal", "ifn:EyeTotal"}; DataAccessProtocol = { " f i l e " , " g r i d f t p " } ; OutputSE = "grid011.pd.infn.it"; InputSandbox = { " / h o m e / f p a c ini / J O B S / b in / s i m 0 1 0 , sh", "/home/fpacini/JOBS/jobsRAL/required/prerunGate .mac", " / h o m e / f p a c ini / J O B S / j o b s R A L / r e q u i r e d / G a t e M a t e r i al s. db"
};
rank = -other. GlueCEStateEstimatedResponseTime ; requirements = Member("GATE-l.0-3", other. GlueHostAppl icat ionSo f twareRunTimeEnvi && ( o t h e r . G l u e C E S t a t e F r e e C P U s >= 2) ;
The job description above represents a Monte Carlo simulation of nuclear imaging. In brief, it asks to run the sim010.sh simulation script on a resource on which the GATE (Geant4 Application for Tomographic Emission) software is installed and which has at least 2 CPUs available for the computation. The image data to be accessed for the simulation are identified in the grid by the logical names BrainTotal and EyeTotal. For further details on the meaning of the JDL attributes the reader can refer to [4].
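The matchmaking rule recalled above (a job ad and a resource ad match when the Requirements expression of each evaluates to true in the context of the other ad) can be sketched as follows. This is a toy illustration with hand-written predicates standing in for ClassAd expressions; it is not the Condor or EDG implementation.

import java.util.Map;
import java.util.function.Predicate;

// A minimal stand-in for a classad: some attributes plus a Requirements predicate
// that is evaluated against the *other* entity's attributes.
record Ad(Map<String, Object> attributes, Predicate<Map<String, Object>> requirements) {
    static boolean match(Ad a, Ad b) {
        // Symmetry: both sides must accept each other.
        return a.requirements().test(b.attributes()) && b.requirements().test(a.attributes());
    }
}

class MatchDemo {
    public static void main(String[] args) {
        Map<String, Object> jobAttrs = Map.of("VirtualOrganisation", "biome");
        Map<String, Object> resourceAttrs = Map.of("FreeCPUs", 4);
        Ad job = new Ad(jobAttrs,
                res -> ((Integer) res.getOrDefault("FreeCPUs", 0)) >= 2);      // job-side Requirements
        Ad resource = new Ad(resourceAttrs,
                j -> "biome".equals(j.get("VirtualOrganisation")));            // resource-side policy
        System.out.println(Ad.match(job, resource));                            // prints true
    }
}

The same symmetric test also illustrates the "Symmetry" design goal of Section 2: jobs and resources are expressed with the same construct and can constrain each other.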
3. THE WMS USER INTERFACE
After having created the descriptions of their applications, users expect, without caring about the complexity of the grid, to be able to submit them to the Workload Management System and monitor their evolution over the Grid: this can be accomplished by means of an appropriate API and/or a specific (Graphical) User Interface. The EDG WMS User Interface is the logical entry point for users and applications to the Workload Management System. It is the component that provides access to all services of the WMS and includes all the public interfaces exposed by the WMS. The functionalities the User Interface gives access to are:
• Job (including DAG) submission for execution on a remote Computing Element, also including:
  - automatic resource discovery and selection,
  - staging of the application input sandbox,
  - restart of the job from a previously saved checkpoint state,
  - interactive communication with the running job,
• Listing of resources suitable to run a specific job according to the job requirements,
• Cancellation of one or more submitted jobs,
• Retrieval of the output files of one or more completed jobs,
• Retrieval of the checkpoint state of a completed job,
• Retrieval of job bookkeeping and logging information.
Figure 1. JDL Editor: requirements editing.
All this functionality is made available through a Python command line interface and an API providing C++ and Java bindings. The JDL and the command line User Interface provide a very powerful and lightweight means to interact with the Grid; however, JDL-based definitions of jobs can become quite complex, also thanks to the completeness and flexibility of the JDL itself, and users could to some extent be required to know details about both the definition language and the grid architecture. Moreover, the increasing number of provided functionalities and options can steepen the learning path of the command line UI. Last but not least, the growing need for the management of large collections of jobs can require extra user effort in the development of ad-hoc script wrappers and applications. In order to relieve the users from this burden and, more specifically, let scientists concentrate on their specific activity rather than on complex grid-related issues, simplify the access path to the grid for newcomers and have users deal only with the necessary technical information, we
have also developed a set of flexible and easily configurable Java graphical components [6] supporting users in their interaction with the Grid. They are:
• the JDL Editor (see Fig. 1), whose goal is to assist the user in the creation of job descriptions in JDL. This component is fully developed in Java and is fully portable (tested on Linux and Windows boxes); it encapsulates the Java ClassAd language library, ensuring the correctness of the generated JDL so that it can be ingested and properly handled by the EDG WMS. The JDL Editor can parse and generate job descriptions in plain JDL and XML, and allows the plug-in of new resource information schemas through simple XML configuration files.
• the Job Submitter (see Fig. 2), which allows submission of single jobs and collections of jobs to the WMS. This component supports all job types (including DAGs) and provides a graphical console for communicating with interactive jobs. It is able to work in multi-VO (Virtual Organisation) environments and introduces a session handling/recovery mechanism.
• the Job Monitor (see Fig. 3), which allows monitoring and control of single jobs and collections of jobs previously submitted to the WMS. Besides providing monitoring (logging and bookkeeping) information about jobs, this component allows retrieving the output files and performing the cancellation of single jobs and collections of jobs. It is able to work in multi-VO (Virtual Organisation) environments and introduces a session handling/recovery mechanism.
Figure 2. Job Submitter: setting preferences.
Although the three mentioned GUI components can be used as standalone applications, they are also provided as an integrated tool so that the JDL Editor and the Job Monitor can be invoked
from the Job Submitter, thus providing a comprehensive framework covering all main aspects of job management in a Grid environment: from the creation of job descriptions to job submission, monitoring and control, up to output retrieval. The complete framework has also been integrated within the web portal called GENIUS (Grid Enabled web eNvironment for site Independent User job Submission) [7], so that the provided services can be accessed from everywhere with a standard browser through a virtual desktop.
Figure 3. Job Monitor: jobs status panel.
4. CONCLUSIONS
Since the beginning of EDG we have covered a long and steep path, from the debate on which language to take as the starting basis for our JDL to the design and development of a user interface to the WMS that is simple and generic enough to be satisfactory for heterogeneous user communities such as High Energy Physics, Earth Observation and Biology. Although many of the requirements and the most valuable feedback from users arrived quite late in the project, the JDL we have developed has shown to fulfil EDG needs and to be flexible enough to enable interoperability with other projects [8], [9]. Moreover, quite a comprehensive framework for accessing grid services has been offered to EDG users, and dramatic improvements have been made to (G)UI usability since the beginning of the project. In any case, although significant results have been achieved, many areas still appear to require attention for both the JDL (e.g. extensions to allow the representation of complex, generic workflows) and the User Interface (e.g. improving usability, easing installation and configuration, minimizing dependencies on external software). Possible paths for future activities will not disregard APIs, WSDL and portals (stronger integration with GENIUS and its successors) and will certainly focus on adherence to the emerging standards for service-oriented architectures (e.g. WSRF/OGSA [10]).
REFERENCES
[1] Home page of the DataGrid project. http://www.edg.org.
[2] F. Pacini. Job Description Language HowTo. DataGrid-01-TEN-0102, 2001. http://server11.infn.it/workload-grid/documents.html.
[3] R. Raman, M. Livny, and M. Solomon. Matchmaking: Distributed resource management for high throughput computing. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC7), Chicago, IL, July 1998.
[4] F. Pacini. JDL Attributes. DataGrid-01-TEN-0142, 2003. http://server11.infn.it/workload-grid/documents.html.
[5] S. Andreozzi, M. Sgaravatto, and C. Vistoli. Sharing a conceptual model of grid resources and services. Talk at the 2003 Conference on Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, CA, USA, March 2003.
[6] F. Pacini. GUI User Guide. DataGrid-01-TEN-0143, 2003. http://server11.infn.it/workload-grid/documents.html.
[7] The GENIUS portal. https://genius.ct.infn.it.
[8] The CrossGrid project. http://crossgrid.org.
[9] Home page of the AliEn project. http://alien.cern.ch.
[10] The Globus Alliance home page. http://globus.org.
On Pattern Oriented Software Architecture for the Grid
H. Prem and N.R. Srinivasa Raghavan*
Department of Management Studies, Indian Institute of Science, Bangalore 560012, India
(* Author for correspondence: [email protected], Tel: 91-80-2933265, Fax: 91-80-3604534)
Precision, sophistication and economic factors in many areas of scientific research demand a very high magnitude of compute power; thus advanced research in the area of high performance computing is becoming inevitable. The basic principle of sharing and collaborative work by geographically separated computers is known by several names, such as metacomputing, scalable computing, cluster computing and internet computing, and this has today metamorphosed into a new term known as grid computing. This paper gives an overview of grid computing and compares various grid architectures. We show the role that patterns can play in architecting complex systems, and provide a very pragmatic reference to a set of well-engineered patterns that the practicing developer can apply to crafting his or her own specific applications. We are not aware of a pattern-oriented approach having been applied to develop and deploy a grid. There are many grid frameworks that have been built or are in the process of becoming functional. All these grids differ in some functionality or other, though the basic principle on which the grids are built is the same. Despite this, there are no standard requirements listed for building a grid. The grid being a very complex system, it is mandatory to have a standard Software Architecture Specification (SAS). We attempt to develop the same for use by any grid user or developer. Specifically, we analyze the grid using an object-oriented approach and present the architecture using UML. This paper proposes the usage of patterns at all levels (analysis, design and architectural) of grid development.
1. INTRODUCTION
Do you want quick, economical and highly transparent access to computing resources and instruments? Think of "grid computing", the norm today. A computational grid is an emerging technology that allows the combination of widely distributed and loosely coupled resources to support large-scale computations [1, 2]. The combination of resources provides for computationally intensive applications as well as aiding the better utilization of available assets. The term "grid" stems from the analogy to the electrical power grid, where many generators produce electrical power that is distributed and delivered to customers through a complex network of power lines. In short, the grid exploits underutilized resources and has the potential for massive parallel CPU capacity, which is one of its most attractive features. It can provide access to increased quantities of other resources and to special equipment, software,
licenses, and other services, and simplifies collaboration among a wider audience. It offers a resource balancing effect by scheduling grid jobs on machines with low utilization, which helps handle occasional peak loads of activity in parts of a larger organization. We define a computational grid as a hardware and software infrastructure that provides dependable, pervasive, consistent and inexpensive access to high-end computational capabilities. This paper emphasizes techniques to enable future grid analysts to build and deploy a grid much faster; analysis patterns, design patterns and architectural patterns are used in the process. Section 2 brings a comparison of the most dominant grid frameworks, followed by the SRS in Section 3. Sections 4, 5 and 6 emphasize the analysis, design (in UML diagrams) and architectural patterns that best describe the grid. The last section presents our conclusions, followed by our future plan of work and the references.
2. ABRIDGED LITERATURE SURVEY
The papers reviewed are primarily in the area of grid and high performance computing, with emphasis on the architecture, resource economy, security, workload management and scheduling of grid computing. Other areas that were explored are patterns (analysis/design/architectural) and the application of grid frameworks in life sciences and finance. Various grid frameworks and their features are listed in Table 1. Each of these architectures has unique yet different features, though all are based on a single principle of sharing [1].
2.1. Industry/academic grid initiatives in India
• Centre for Development of Advanced Computing (C-DAC): looking forward to developing the I-Grid (information grid). It plans to link the seven prestigious Indian Institutes of Technology (IITs), the Bangalore-based Indian Institute of Science and other academic institutions in the I-Grid.
• Global giants like IBM and Sun are pushing forward the concept of grid computing. In fact, IBM is one of the forerunners in popularizing the concept and is looking at bringing together the concepts of web services and grid computing. It has also announced a set of specifications that would unite the two concepts. They are extending their expertise in the Indian sub-continent.
• Companies in India are also getting ready to roll out grid computing networks, with Mumbai-based Dow Automotives, part of the US-based billion-dollar Dow group. Currently in India, Altair (a US-based $70 million product design technology company) with its HyperWorks software suite works with Infosys, Satyam and Infotech to enable engineers and developers to model and study complex structures and assemblies for optimizing design performance.
In this paper we present, for the first time, a pattern oriented software architecture for the grid.
3. SOFTWARE REQUIREMENT SPECIFICATION FOR THE GRID
The SRS is partitioned into two parts, the non-functional (which is not dealt with in this paper) and the functional requirements [3], which can be viewed as two separate modules, a core module and
Table 1 Grid f r a m e w o r k s Framework Early Grid-I Way (Information Wide Area Year) [21 Grid General Framework [4, 51
and their features Conceived Concept Integrating 10 high band1995 width networks,using different routing and switching technologies. Late 90's
Series of layers of different widths.
Globus [1, 9, 10]
Early 2000
Emphasis: middleware, that hides the heterogeneous users and provide applications a homologous/ seamless environment.
PVM [121
1991 University of Tennessee
Provides a machineindependent communication layer.
Condor [ 15]
First version in 1986
Uniform view of processor resources.
Legion [2, 11]
November 1997 at the University of Virginia.
Builds system components on a distributed Objectoriented model.
AppLeS (Application Level Scheduling) [13, 141
2001
Provides environment for adaptively scheduling and deploying applications in heterogeneous, multiuser grid environments.
Components/Features Point of presence servers, uniform software called I-soft, single central scheduler (CRB) and multiple local scheduler daemons one in each l-pop. Lowest level: fabric - physical devices. Connectivity/resource layers are implemented everywhere. Collective layer: protocols/services. Topmost are user applications. Central element is the Globus tool Kit (GT) that consists of a set of components, each defines an interface for the higher level services to invoke the component mechanisms. Implement basic services like resource location and allocation. PVM daemon (unix process that coordinates inter-machine communications) and the libraries that sends commands to the local daemon and receive status information. Automatic resource location and job allocation, check pointing and the migration of processes. Advantages of an object-orienied approach, such as data abstraction, encapsulation, inheritance and polymorphism. APls to core object types- Classes and Metaclasses, Host objects, Vault objects, Implementation Objects and Caches, Binding Agents, Context objects and Context spaces. Customized scheduling agent that monitors available resource performance and generates dynamically a schedule for the application. Apples agent steps: resource discovery, resource selection, schedule generation, schedule selection, application execution and schedule adaptation. Apples templates were created to embody common characteristics of Apples enabled Applications.
the support m o d u l e built in a layered form. The core m o d u l e will handle the basic functionality o f the grid, like the collation o f resources (hardware, software etc.), the administration of the grid, c o m m u n i c a t i o n protocols, object identification, resource discovery and scheduling, security issues, fault tolerance and grid usability. The support m o d u l e s are built upon the core m o d u l e s , taking help o f functionalities that are built into them. These support m o d u l e s will thus handle, advance reservation o f resources, e c o n o m i c s based p a r a d i g m for m a n a g i n g
646 resource allocation, third party certificate authorities, performance/usage patterns and web services. The functional requirements for both the core and the support modules have been categorized into entity requirements and processing requirements. An entity/component refers to a separate unit in the grid, which is meant for a specific functionality, that will be initiated either by interacting with another entity, external application/client or direct database manipulation. An entity may be independent or related to one or more entities. Some of the entities under the core module are resource module, grid administration, communication protocols, object identification, resource discovery, scheduling, security and fault tolerance.
4. ANALYSIS PATTERNS Implementation of any project (software/other) requires rigorous analysis of the problem domain. A systematic approach is highly emphasized when the problem at hand is very complex, like the grid. This section thus addresses a few patterns that act as a catalyst to analyze the complex problem of designing the grid. Stated simply, analysis patterns focus on the result of the process [7], i.e. the model itself, thus providing a template to easily fit requirements by a novice in the field. Listed below are few questions, which help in developing patterns, and will be of great use to anyone who plans to build a grid in future. What are the different kinds of grids? What are the different applications that will run on the grid? What are the types of resources that would participate in the grid? How much is each donor willing to give to the grid and for what time duration? Which resources in the grid are finally in a position to participate in the
Grid Topology Pattern: General Framework
I
P.....
[Organization
I [
L]
Grid Structure Pattern
Hardware I Soft. . . . -----] Others ]
[ Participant /~
I GridType I I [1"3] ~ [ Grid Structure
[ Donor
Stat!s
[I Component ][
[
Type I Hardwarel
I Soft. . . . I
J
I User Fig l(a)
Fig l(b)
Figure 1. (a): The participant is a supertype of a person or an organization, thus any instance of the participant must either belong to the person or to the organization. Observation: Number of elements that relate to a participant (abstraction) rather then the person or the organization. This being a prelude to the possible type of grids (inter/intra/others). (b): Inter/intra/other grids differ by the level of hierarchies. Incorporation of other grid types is made easier by introducing an 'grid type' (abstraction)
grid? What are the service types available in the grid for immediate use? Chances are high that the grid analysts miss some of the many such questions, and on these lines the analysis patterns are built. Figure 1 depicts a sample analysis pattern, more of which can be obtained from the technical report [6]. 1: Increases flexibility with increase in complexity. The status entity helps to maintain various parameters like the number of participants at any given point of time, any other change in structures, depending on the choice of the party. 2: Helps the grid analyst list all possible
647 components. A rule entity attached to the type entity can also be introduced to extend the above model.
5. DESIGN PATTERNS Any important and recurring design in the system can be best exploited using a Design Pattem [8]. All the grids be it a data or a computational grid have similar components and hence this is an effort to apply design patterns for efficient development of the grid. This section describes design pattems in the grid. For further details refer [6]. The analysis patterns, software requirements specification and the UML analysis of the complex grid help develop the design patterns, hence speed up the development process. A common scenario with interactions and exchange of messages amongst entities, is depicted by the sequence diagrams (UML analysis).
[Figure 2: sequence-diagram sketches, (a) and (b), with messages such as 'select server', 'data request', 'sends results', and 'notifies agent'.]
Figure 2. Two design patterns that best fit the grid scenario are depicted; a detailed description of the others is available in the technical report [6].
6. GRID ARCHITECTURAL PATTERNS
The basis on which the grid is built is collaboration and sharing of resources, hence the object/entity/component sharing pattern. The locality/positioning of some objects also makes a difference. In this section we discuss two architectural patterns that are commonly applied to the grid.
7. A PRACTICAL APPLICATION OF PATTERNS
Life sciences: Biology is in itself a multidisciplinary science. Attempts are in progress to build a grid for the life sciences to support the genomic revolution [18], providing collaborative computing, data management, sharing of data, and networking infrastructure for life-science research and education [16]. This area consists mainly of data-intensive applications, so the need to mine, analyze and model data is inevitable. The most dominant operation is the similarity search between protein/nucleic acid sequences. BLAST, which consists of a suite of programs that perform pairwise local similarity analysis of nucleotide and protein sequences, and FASTA, which is also of significant interest to the life-science research community, have non-trivial characteristics from the scheduling perspective. These are examples that use certain heuristics for faster searches, which are nevertheless very time-consuming because of the size of the database [17].
[Figure 3: (a) Prototype pattern - developer, grid prototype, and components; (b) State pattern - user, user type, and administrator/participant states.]
Figure 3. (a): Defines an interface for cloning itself [8]. Components/participants implement operations for cloning. The developer creates new objects by asking the grid prototype to clone. The client calls the clone function in the concrete class; an object with the operations is returned. The clone function in the concrete class just returns itself; a deep copy can be performed when getting a clone. Thus the addition of participants or components at runtime becomes easier, by simply registering a prototypical instance, but it is difficult to implement a clone operation. (b): User defines the interface of interest to clients. User type is an interface for encapsulating behavior associated with a particular type. Adm/Par implements the behavior associated with the state of the user. Like a switch-case, for certain criteria a participant can switch behavior between a user and a donor. In the (user) context there is some parameter that tells the class which state it belongs to and calls the corresponding function. State: stores the current state and the corresponding responsibility to be handled [8].
[Figure 4: (a) master-worker arrangement with a master (scheduler/broker), users 1..n and donors 1..n; (b) shared objects replicated at key nodes.]
Figure 4. (a): Donors approach the scheduler/broker to take the jobs that best suit them, and users can access the scheduler/broker to learn the location/status of the jobs they submitted. This is thus a master-worker kind of situation, where the various workers/donors share the master. (b): The replication of location-specific objects at certain key nodes can affect the throughput of computationally intensive jobs. Strategic positions are managed such that the illusion of a single object is maintained.
It is thus not really possible to replicate the whole database at all the compute nodes of the grid. The database locality requirements and the computational requirements therefore produce an interesting scheduling challenge. Such searches in the life sciences require high-performance, robust platforms and hardware/software solutions, which mainly exhibit the characteristics of availability and reliability. From the analysis-patterns perspective, the grid topology and the grid structure pattern in combination
clearly state the scope of the participants, and thus the hierarchy or the communication protocols get defined by just adding a rule entity. Strategic positioning of the donor machines such that they remain close to the databases is mandatory, so a structural change in the grid may be required on a periodic basis, depending on whether the whole database is replicated at different nodes or the database is divided such that its various parts are placed at strategic positions. All these parameters differ depending on the problem domain, which gives the analyst pointers from the functional and non-functional (e.g. response time) perspective. There are many other analysis patterns, such as accountability, planning and measurement, that have clear advantages; for further details on the usage of analysis patterns refer to [6]. The prototype design pattern has a clear-cut use in the grid scenario, where participants simply enter and leave the grid. The problem at hand emphasizes machine availability, which is taken care of by the prototype pattern, since it makes it very easy for a new participant to enter the grid by just requesting the clone function to clone itself, which, if required, makes a deep copy of its base class. It can often happen that a participant switches roles between a donor and a user, depending on the hits at a given time, which means that a participant that was performing a search could delegate its job to another participant that has better access to the database. The state pattern is a good fit in such a scenario. The two architectural patterns mentioned earlier can, in combination, solve the given problem. Considering them in isolation: if the same database is replicated at strategic nodes, the location-specific architectural pattern is an obvious application, and if a single database is divided into many parts and the search is performed by a participant only on its allocated part, then the shared-object architectural pattern is the best fit, where each donor machine selects the database on which it can perform the most optimal search.
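To make the combination of patterns concrete, the following is a minimal C++ sketch (our illustration, not code from the paper; all class and function names are hypothetical) of how the prototype and state patterns described above could be combined for grid participants: a new participant enters the grid by cloning a registered prototypical instance, and a participant switches between user and donor roles by swapping its state object.

#include <iostream>
#include <memory>

struct ParticipantState {                       // State: role-specific behaviour
    virtual ~ParticipantState() = default;
    virtual void act() const = 0;
};
struct UserState  : ParticipantState { void act() const override { std::cout << "submit a job\n"; } };
struct DonorState : ParticipantState { void act() const override { std::cout << "serve a job\n"; } };

struct Participant {                            // Prototype: clonable grid member
    std::unique_ptr<ParticipantState> state = std::make_unique<UserState>();
    virtual ~Participant() = default;
    virtual std::unique_ptr<Participant> clone() const {   // a deep copy if required
        auto p = std::make_unique<Participant>();
        p->state = std::make_unique<UserState>();           // fresh default role
        return p;
    }
    void become_donor() { state = std::make_unique<DonorState>(); }
    void act() const { state->act(); }
};

int main() {
    Participant prototype;                      // registered prototypical instance
    auto newcomer = prototype.clone();          // a participant entering the grid
    newcomer->act();                            // acts as a user ...
    newcomer->become_donor();
    newcomer->act();                            // ... then switches role to donor
}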
8. CONCLUSIONS AND FUTURE WORK
The advantages of grid computing are so compelling that the concept itself has given rise to challenges that motivate technological advances in other areas, be it web services, patterns (analysis, design or architectural), grid markup language, the semantic grid, or the programming level. We are currently building an exhaustive list of patterns and will apply the best-fitting patterns to create and deploy a working prototype of a grid. Other related areas of interest are the semantic grid and GridML, which are outside the scope of this paper. At the implementation level, we will at a later stage emphasize parallel archetypes: since the grid environment is primarily based on parallel computing, the need for effective and efficient parallel programming is becoming inevitable. As part of future work we plan to implement a prototype of the grid using the Grid Markup Language.
REFERENCES
[1] I. Foster, et al., Grid Services for Distributed System Integration, IEEE Transactions (June 2002) 37-38.
[2] D. De Roure, et al., The Evolution of the Grid, University of Southampton, UK.
[3] L. Ferreira, Introduction to Grid Computing with Globus, IBM (December 2002).
[4] I. Foster, et al., The Anatomy of the Grid, The University of Chicago, The University of Southern California, Argonne National Laboratory.
[5] I. Foster, et al., The Physiology of the Grid, Argonne National Laboratory, University of Chicago, University of Southern California.
[6] H. Prem and N.R.S. Raghavan, Software Patterns: Grid Perspective. Technical Report MS03-01, Indian Institute of Science, 2003.
[7] M. Fowler, Analysis Patterns: Reusable Object Models (Object-Oriented Software Engineering Series), ISBN 0-201-89542-0.
[8] E. Gamma, et al., Design Patterns, ISBN 0-201-63361-2.
[9] I. Foster and C. Kesselman, Globus: A Metacomputing Infrastructure Toolkit, International Journal of Supercomputer Applications, 11(2):115-128, 1997.
[10] http://www.globus.org/
[11] A. Grimshaw, et al., Legion: The Next Logical Step Toward a Nationwide Virtual Computer. Technical Report CS-94-21, University of Virginia, 1994.
[12] http://www.epm.ornl.gov:80/pvm/
[13] http://www-cse.ucsd.edu/users/berman/apples.html
[14] F. Berman, et al., Adaptive Computing on the Grid Using AppLeS, IEEE Transactions on Parallel and Distributed Systems (TPDS), 14(4), 369-382, 2003.
[15] J. Frey, et al., Condor-G: A Computation Management Agent for Multi-Institutional Grids, University of Wisconsin, Argonne National Laboratory.
[16] http://www.ncbiogrid.org/
[17] I. Sharapov, Computational Applications for Life Sciences on Sun Platforms: Performance Overview, Sun Microsystems Inc., November 2001.
[18] e-biologist's workbench, www.mygrid.org.uk
Minisymposium
Bioinformatics
Green Destiny + mpiBLAST = Bioinfomagic

W. Feng a*

aResearch & Development in Advanced Network Technology (RADIANT), Computer & Computational Sciences Division, Los Alamos National Laboratory, P.O. Box 1663, M.S. D451, Los Alamos, NM 87545, USA

*This work was supported by the Los Alamos Computer Science Institute of the U.S. Department of Energy through LANL contract W-7405-ENG-36. This paper is also a Los Alamos Unclassified Technical Report: LAUR-03-6651.

This paper outlines how our highly efficient, power-aware supercomputer called Green Destiny and our open-source parallelization of BLAST called mpiBLAST combine to create a bit of "bioinfomagic." Green Destiny, featured in The New York Times and winner of a 2003 R&D 100 Award, revolutionized high-performance computing by re-defining performance to focus on issues of efficiency, reliability, and availability. Green Destiny is a 240-processor supercomputer that operates at a peak rate of 240 billion floating-point operations per second (or 240 gigaflops) but fits in six square feet and sips as little as 3.2 kilowatts of power. Consequently, it does not require any special infrastructure to operate, i.e., no cooling, no raised floor, no air filtration, and no humidification control. These attributes resulted in interest from several pharmaceutical and bioinformatics institutions, which likewise did not have special infrastructure to house traditional supercomputing clusters. Subsequent interactions with these pharmaceutical and bioinformatics institutions led to the birth of mpiBLAST, an open-source parallelization of BLAST that achieves super-linear speedup via a technique called database segmentation. Database segmentation allows each computing node to search a smaller portion of the database (one that fits entirely in memory), thus eliminating disk I/O and vastly improving performance. When used in concert with Green Destiny, we demonstrate that a 300-kB BLAST query that takes nearly one full day to complete on a traditional PC or workstation takes only minutes on Green Destiny.

1. THE ORIGIN OF GREEN DESTINY & THE BIRTH OF mpiBLAST
In the summer of 2001, my research team - RADIANT: Research And Development In Advanced Network Technology - was faced with a dilemma. When running distributed network simulations on our 128-processor traditional supercomputing cluster (i.e., Beowulf cluster [1]) during the summer heat wave of 2001, in a dusty warehouse that routinely reached 90°F (32°C), the traditional Beowulf cluster failed on a weekly basis, and oftentimes more frequently than that. (This summer heat wave of 2001 was also the one that caused rolling brownouts in California.) These unscheduled failures easily resulted in thousands of dollars of lost
productivity per week. For instance, in the typical case, system administrators would spend upwards of half a day to diagnose and repair problems with the cluster. And when repairs required new hardware, administrators would boot up the working portion of the cluster for use and then later shut it down for repair when the new hardware came in. From the application's standpoint, network simulations would have to be re-started. Clearly, the team could not continue operating in such an inefficient manner. On a related note, Table 1 shows that the problem with downtime is even more insidious for business services that routinely rely on a web-server or compute-server farm (i.e., embarrassingly parallel variations of a Beowulf cluster), particularly given the fact that 65% of information technology managers report that their web sites were unavailable to customers over a six-month sampling period and that 25% of the web sites experienced three or more outages in the six-month period [2].
Table 1
Estimated Costs of an Hour of Server Downtime for Business Services (Source: Contingency Planning Research, Inc.)

Service                        Cost of One Hour of Downtime
Brokerage Operations           $6,450,000
Credit Card Authorization      $2,600,000
Amazon.com                     $275,000
Ebay                           $225,000
Package Shipping Services      $150,000
Home Shopping Channel          $139,000
To address this problem with downtime, we endeavored to research, develop, and integrate a low-power, always-available Beowulf cluster called Green Destiny [3, 4, 5] to address issues of efficiency, reliability, and availability. Green Destiny is a 240-processor supercomputer that operates at a peak rate of 240 billion floating-point operations per second (or 240 gigaflops) but fits in six square feet and sips as little as 3.2 kilowatts of power (or roughly 10 times less power than a traditional cluster). Because Green Destiny runs so cool, it does not require any special infrastructure to operate, i.e., no cooling, no raised floor, no air filtration, and no humidification control. Furthermore, in its lifetime, the cluster has never failed. Consequently, the above attributes resulted in significant interest from several pharmaceutical and bioinformatics institutions, which likewise did not have special infrastructure to house traditional supercomputing clusters. Subsequent discussions revolved around the fact that these institutions relied heavily on a tool called "BLAST: Basic Local Alignment Search Tool," a ubiquitous sequence database-search tool used in molecular biology. (In general, BLAST takes a query DNA or amino acid sequence and searches for similar sequences in a database of known sequences. At the completion of the search, BLAST reports the statistical significance of the similarities between the query and the sequences in the database.) To improve the throughput of BLAST, most of these institutions used traditional Beowulf clusters in order to run multiple
instantiations of the sequential BLAST code. Those that were more adventuresome looked into parallelizing the sequential BLAST code [6, 7, 8, 9, 10, 11, 12]. Ultimately, however, all the institutions have been hampered by frequent failures in their traditional Beowulf clusters (due to overheating) as well as by poor parallelizations of BLAST. These hindrances appropriately set the stage for Green Destiny and mpiBLAST, respectively. Virtually all parallel implementations rely on a technique called query segmentation, resulting in nearly linear speed-up. A few implementations, including our own open-source implementation dubbed mpiBLAST, employ a technique that we call database segmentation [13, 14, 15, 16]. However, our implementation leverages the ubiquitously accepted NCBI implementation of BLAST and delivers super-linear speed-up while simultaneously being free and open source. As a result, since its release in early 2003, mpiBLAST has been downloaded well over 4000 times. The remainder of this paper presents a brief overview of BLAST, followed by details about the existing techniques to parallelize BLAST as well as the technique used in mpiBLAST, and a performance evaluation of mpiBLAST, our open-source parallelization of BLAST using the ubiquitous MPI (Message Passing Interface) [19].

2. AN OVERVIEW OF BLAST
BLAST searches a query sequence containing nucleotides (DNA) or peptides (amino acids) against a database of known nucleotide or peptide sequences. Since peptide sequences result from ribosomal translations of nucleotides, comparisons can be made between nucleotide and peptide sequences. BLAST provides the capability to compare all possible combinations of query and database sequence types by translating sequences on the fly, as shown in Table 2.

Table 2
BLAST Search Types

Search Name   Query Type   Database Type   Translation
blastn        nucleotide   nucleotide      none
tblastn       peptide      nucleotide      database
blastx        nucleotide   peptide         query
blastp        peptide      peptide         none
tblastx       nucleotide   nucleotide      query and database
In BLAST, each search type executes in nearly the same way. The BLAST search heuristic [17] indexes both the query and target (database) sequence into words of a chosen size. It then searches for matching word pairs, called hits, with a score of at least T, and extends the match along the diagonal. Gapped BLAST [18], hereafter referred to simply as BLAST, modifies the original BLAST algorithm to increase sensitivity and decrease execution time by moving down the sequences until two hits have been found, each with a score of at least T, within A letters of each other. BLAST then performs an ungapped extension on the second hit to generate a high-scoring segment pair (HSP). If the HSP score is greater than a second threshold, a gapped extension is triggered simultaneously both forward and backward. The resulting output consists of a set of local gapped alignments found within each query sequence, the alignment's score, an alignment of the query and database sequences, and a measure of the likelihood that the alignment is a random match between the query and database.
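For readers unfamiliar with the word-indexing step, the following is a deliberately simplified C++ sketch (ours, not BLAST code): it indexes a query into exact words of size W and reports matching word pairs ("hits") in a subject sequence, as in nucleotide searches. Real BLAST also scores neighbourhood words against the threshold T, applies the two-hit rule, and performs the extensions described above.

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const int W = 4;                                  // word size (e.g. 11 in blastn)
    std::string query   = "ACGTACGGACCT";             // toy sequences for illustration
    std::string subject = "TTACGTACGGAA";

    // Index every word of the query by its starting position.
    std::unordered_map<std::string, std::vector<int>> words;
    for (int i = 0; i + W <= (int)query.size(); ++i)
        words[query.substr(i, W)].push_back(i);

    // Scan the subject and report matching word pairs (hits) and their diagonals.
    for (int j = 0; j + W <= (int)subject.size(); ++j) {
        auto it = words.find(subject.substr(j, W));
        if (it == words.end()) continue;
        for (int i : it->second)
            std::cout << "hit: query pos " << i << ", subject pos " << j
                      << ", diagonal " << (j - i) << "\n";
    }
}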
3. RELATED WORK
In this section, we briefly describe three approaches towards parallelizing BLAST: (1) hardware acceleration, which parallelizes the comparison of a single query to a single database entry, (2) query segmentation, and (3) database segmentation.
3.1. Hardware Acceleration
Hardware accelerators parallelize the comparison of a single query sequence to a single database entry. Because symmetric multiprocessors (SMPs) and symmetric multithreaded systems (SMTs) cannot support the level of parallelism required, custom hardware must be used. Singh et al. introduced the first hardware accelerator for BLAST [12]. More recently, TimeLogic (http://www.timelogic.com) commercialized an FPGA-based accelerator called the DeCypher BLAST hardware accelerator [11].
3.2. Query Segmentation
Query segmentation provides the most natural parallelization of BLAST by splitting up a query (or set of queries) such that each compute node in a cluster (or each processor in an SMP) searches a fraction of the query set against the sequence database, as shown in Figure 1. Thus, multiple BLAST searches can execute in parallel on different queries. However, such an approach typically requires that the entire database be replicated on each compute node's local storage system [6, 7]. If the database to be searched is larger than core memory, then query-segmented searches suffer from the same adverse effects of disk I/O as traditional BLAST. When the database fits in core memory, however, query segmentation can achieve nearly linear scaling for all BLAST search types.
3.3. Database Segmentation
Database segmentation is an orthogonal approach to query segmentation. Whereas query segmentation keeps the database intact and matches individual query segments (or sub-queries) against a copy of the entire database on each node, database segmentation keeps the query intact and distributes individual database segments to each node for the query to be searched against, as shown in Figure 2. One of the biggest challenges of this approach is to ensure that the statistical scoring is properly produced, since it depends on the size of the database, a database that database segmentation dutifully chops up. Database segmentation has also been implemented in a closed-source commercial product by TurboWorx, Inc. called TurboBLAST [13]. However, its proprietary implementation only results in linear speed-up. Recently, Mathog [16] also released an implementation of database segmentation called parallelblast, which is composed of a set of scripts that operate in the Sun Grid Engine (SGE)/PVM environment. Aside from requiring the SGE/PVM environment, parallelblast also differs from mpiBLAST in that it is not directly integrated with the NCBI toolkit.
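To make the database-size dependence of the statistics concrete, recall the standard Karlin-Altschul relation used by BLAST (stated here for reference; the exact effective-length corrections applied by any particular implementation are an assumption on our part):

\[ E = K\,m\,n\,e^{-\lambda S}, \]

where m is the effective query length, n the effective database length, S the alignment score, and K and \(\lambda\) are parameters of the scoring system. Because E grows linearly with n, a worker that evaluated significance against only its own fragment (length roughly n/p for p fragments) would understate E by about a factor of p, which is why a database-segmented search must use the full database length when reporting E-values.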
[Figure 1. Query Segmentation: each worker node holds the full database and searches a different query (or subset of queries) against it.]
[Figure 2. Database Segmentation: the query is kept intact and each worker node searches it against a different database fragment.]

4. THE mpiBLAST ALGORITHM
The mpiBLAST algorithm consists of three steps: (1) segmenting and distributing the database, e.g., see Figure 2, (2) running mpiBLAST queries on each node, and (3) merging the results from each node into a single output file [15]. The first step consists of a front-end node formatting the database via a wrapper around the standard NCBI formatdb called mpiformatdb. The mpiformatdb wrapper generates the appropriate command-line arguments to enable NCBI formatdb to format and divide the database into many small fragments of roughly equal size. When completed, the formatted database fragments are placed on shared storage. Next, each database fragment is distributed to a distinct worker node and queried by directly executing the BLAST algorithm as implemented in the NCBI development library. Finally, when each worker node completes the search on its fragment, it reports the results back to the front-end node, which merges the results from each worker node and sorts them according to their score. Once all the results have been received, they are written to a user-specified output file using the BLAST output functions of the NCBI development library. This approach to generating merged results allows mpiBLAST to directly produce results in any format supported by NCBI's BLAST, including XML, HTML, tab-delimited text, and ASN.1.
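The skeleton below is a minimal, schematic MPI sketch in C++ (written for this edition, not taken from mpiBLAST) of the scatter-and-merge structure just described: worker ranks search their assigned fragments and return (fragment, score) pairs, and the front-end rank merges and sorts them. The search itself is a placeholder; the real code calls into the NCBI development library, assigns fragments dynamically, and returns full alignments rather than a single score. Run with at least two MPI ranks.

#include <mpi.h>
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical placeholder for the real per-fragment BLAST search.
static double search_fragment(int frag_id) { return 100.0 - frag_id; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int nfrags = 16;                             // number of database fragments (assumed)

    if (rank == 0) {
        // Front-end node: collect one (frag_id, best_score) pair per fragment,
        // then sort the merged results by score.
        std::vector<std::pair<double, int>> results;
        for (int i = 0; i < nfrags; ++i) {
            double buf[2];
            MPI_Recv(buf, 2, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            results.push_back({buf[1], (int)buf[0]});
        }
        std::sort(results.rbegin(), results.rend());   // descending by score
        for (const auto& r : results)
            std::printf("fragment %d: best score %.1f\n", r.second, r.first);
    } else {
        // Workers: here each rank statically takes every (size-1)-th fragment.
        for (int i = rank - 1; i < nfrags; i += size - 1) {
            double buf[2] = {(double)i, search_fragment(i)};
            MPI_Send(buf, 2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}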
5. PERFORMANCE EVALUATION OF mpiBLAST
In this section, we show that mpiBLAST, with its database-segmentation technique, achieves super-linear performance gains. For instance, when increasing the number of compute nodes from 1 to 4, mpiBLAST achieves a speed-up of nearly 10, as shown in Table 3. Overall, our reference 300-kB query against the 5.1-GB uncompressed nt database takes 1346 minutes (or 22.4 hours) on one compute node and less than 8 minutes on 128 nodes of Green Destiny. (A more detailed performance analysis and evaluation can be found in [15].)
Table 3
BLAST Run Time for a 300-kB Query Against the nt Database

# Nodes   Run Time (sec)   Speed-Up
1         80775            1.00
4         8752             9.23
8         4548             17.76
16        2437             33.15
32        1350             59.83
64        851              94.92
128       474              170.41
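As an aside on how the efficiency figures discussed below are obtained, this short C++ snippet (ours, not part of the paper) recomputes speed-up and parallel efficiency directly from the run times in Table 3:

#include <cstdio>

int main() {
    const int    nodes[] = {1, 4, 8, 16, 32, 64, 128};
    const double secs[]  = {80775, 8752, 4548, 2437, 1350, 851, 474};
    for (int i = 0; i < 7; ++i) {
        double speedup = secs[0] / secs[i];            // relative to the 1-node run
        std::printf("%4d nodes: speed-up %.2f, efficiency %.2f\n",
                    nodes[i], speedup, speedup / nodes[i]);
    }
}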
If one looks carefully at Table 3, the efficiency of mpiBLAST decreases as the number of nodes increases. That is, if we divide the "speed-up" column by the "# nodes" column, the efficiency of mpiBLAST across four nodes is 2.31 (9.23/4) and drops all the way down to 1.33 (170.41/128) when run across 128 nodes. The reason for this drop-off is the trade-off that exists when segmenting the database into many small fragments. There is significant overhead in searching extra fragments; thus, the ideal database segment will typically be the largest fragment that can fit in memory without causing any swapping to disk. Making fragments smaller than the available memory simply adds overhead [15].
6. CONCLUSION
In this paper, we briefly described how our energy-efficient and highly reliable Green Destiny cluster led to the development of mpiBLAST to create a bit of "bioinfomagic", e.g., moving from executing on one node to four nodes resulted in nearly a ten-fold speed-up. In short, mpiBLAST is an MPI-based [19] implementation of database segmentation for parallel BLAST searches. Its primary contributions are that it is open source, that it routinely produces super-linear speed-up, and that it directly interfaces with the NCBI development library to provide BLAST users with an interface and output formats identical to the ubiquitous NCBI-BLAST.
REFERENCES
[1] T. Sterling, D. Becker, D. Savarese, J. Dorband, U. Ranawake, and C. Packer, "Beowulf: A Parallel Workstation for Scientific Computation," Proceedings of the 24th International Conference on Parallel Processing, August 1995.
[2] T. Wilson, "The Cost of Downtime," InternetWeek, July 30, 1999.
[3] W. Feng, M. Warren, and E. Weigle, "Honey, I Shrunk the Beowulf!" Proceedings of the 31st International Conference on Parallel Processing, August 2002.
[4] W. Feng, M. Warren, and E. Weigle, "The Bladed Beowulf: A Cost-Effective Alternative to Traditional Beowulfs," Proceedings of the 4th IEEE Cluster, September 2002.
[5] M. Warren, E. Weigle, and W. Feng, "High-Density Computing: A 240-Processor Beowulf in One Cubic Meter," Proceedings of the 15th SC: High-Performance Networking and Computing Conference, November 2002.
[6] R. Braun, K. Pedretti, T. Casavant, T. Scheetz, C. Birkett, and C. Roberts, "Parallelization of Local BLAST Service on Workstation Clusters," Future Generation Computer Systems, 17(6):745-754, April 2001.
[7] N. Camp, H. Cofer, and R. Gomperts, "High-Throughput BLAST," SGI White Paper, September 1998.
[8] E. Chi, E. Shoop, J. Carlis, E. Retzel, and J. Riedl, "Efficiency of Shared-Memory Multiprocessors for a Genetic Sequence Similarity Search Algorithm," Technical Report TR97005, Computer Science Department, University of Minnesota, 1997.
[9] E. Glemet and J. Codani, "LASSAP, a Large Scale Sequence comparison Package," Computer Applications in the Biosciences, 13(2):137-143, April 1997.
[10] K. Pedretti, T. Casavant, R. Braun, T. Scheetz, C. Birkett, and C. Roberts, "Three Complementary Approaches to Parallelization of Local BLAST Service on Workstation Clusters," Lecture Notes in Computer Science, 1662:271-282, 1999.
[11] A. Shpuntof and C. Hoover, Personal Communication, August 2002.
[12] R. K. Singh, W. D. Dettloff, V. L. Chi, D. L. Hoffman, S. G. Tell, C. T. White, S. F. Altschul, and B. W. Erickson, "BioSCAN: A Dynamically Reconfigurable Systolic Array for Biosequence Analysis," Research on Integrated Systems, 1993.
[13] R. D. Bjornson, A. H. Sherman, S. B. Weston, N. Wilard, and J. Wing, "TurboBLAST: A Parallel Implementation of BLAST Based on the TurboHub Architecture for High Performance Bioinformatics," Proceedings of the 1st IEEE International Workshop on High-Performance Computational Biology, April 2002.
[14] A. Darling and W. Feng, "BLASTing Off with Green Destiny," IEEE Bioinformatics Conference, August 2002.
[15] A. Darling, L. Carey, and W. Feng, "The Design, Implementation, and Evaluation of mpiBLAST," ClusterWorld Conference & Expo in conjunction with the 4th International Conference on Linux Clusters: The HPC Revolution 2003, June 2003.
[16] D. R. Mathog, "Parallel BLAST on Split Databases," Bioinformatics, 19(14), 2003.
[17] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, "Basic Local Alignment Search Tool," Journal of Molecular Biology, 215:403-410, 1990.
[18] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs," Nucleic Acids Research, 25:3389-3402, 1997.
[19] W. Gropp, E. Lusk, and A. Skjellum, Using MPI, MIT Press, 1999.
Parallel Processing on Large Redundant Biological Data Sets: Protein Structures Classification with CEPAR*

D. Pekurovsky, I. Shindyalov, and P. Bourne

San Diego Supercomputer Center, UCSD, MC 0505, 9500 Gilman Drive, La Jolla, California 92093, USA; {dmitry, shindyal, bourne}@sdsc.edu

*The work of D.P. was funded through the National Partnership for Advanced Computational Infrastructure (NPACI), National Science Foundation (NSF). The work of I.N.S. was funded by NSF DBI 9808706. Parallel runs were made on computers at the San Diego Supercomputer Center.

CEPAR is a parallel, efficient software tool for family classification and all-to-all comparison of the proteins in a given database, such as the Protein Data Bank (PDB), based on comparing their 3D structures using the CE (Combinatorial Extension) algorithm. In designing the CEPAR algorithm, the main themes were efficiently exploiting the redundancy of the data in the input database to reduce the amount of work, and eliminating load imbalance to achieve high scalability. These ideas are equally applicable to other biological applications dealing with redundant data sets. We discuss how these themes influenced the algorithm design choices. The resulting parallel performance is presented.

1. PROBLEM FORMULATION
Highly redundant sets of data occur frequently in computational biology problems. Examples include genome annotation [1], the comparison of a single open reading frame against a whole and relatively static annotated sequence database, and pairwise sequence and structure comparisons. This presentation concerns the latter kind of problem, namely structure comparison of the proteins in a given database of structures using the CE algorithm [2,3]. Structure comparison is very important since it allows researchers to better understand protein function and distant homologous relationships between proteins. A number of pairwise structure comparison algorithms have been proposed. They differ in their choices of elementary comparison units as well as in their algorithmic strategies, and each algorithm has a unique set of advantages and shortcomings. The CE algorithm [2,3] is at present one of the best-known and most widely used. As its input, it takes the atom coordinates of the two protein chains to be compared (originally, only Cα atoms were used, but in the latest version 2.3, Cα, Cβ and O atoms are needed). The algorithm finds an optimal alignment path defining fragments of the two structures that can be superimposed in space (as rigid bodies) to yield the minimal least-squares deviation. These fragments can contain gaps. The output is a set of numbers characterizing how the two structures are related, namely the RMSD (root mean square deviation), Lali (the number of aligned residues), Lgap (the number of gap positions), the z-score (statistical relevance score), and a detailed characterization of the alignment path. The latest version 2.3 returns several sets of the above numbers, each for a different optimal alignment path.
For more details on the CE algorithm, see refs. [2,3]. Among the abundance of available structure comparison algorithms, few have resulted in regularly updated, publicly available resources, partly because of the high volume of computational work required. The problem addressed by CEPAR (CE PARallel) is classifying the input database of protein chains (such as the PDB) into families (note: the families here are used as a time-saving measure, are based on structure only, and do not carry the same biological meaning as the families in traditional protein structure classification schemes such as SCOP). Each family is represented by a (randomly chosen) protein, while the rest of the family consists of proteins sufficiently similar to the representative in terms of pairwise CE comparison results. In addition, all representatives are compared with each other, which allows one to reconstruct all-against-all relationships of the proteins as a post-processing step. As a result, not only is the structure space of the input database characterized by the family classification, but many other important calculations also become possible, notably protein structure prediction. The resulting data are available on a public Web server, http://cl.sdsc.edu/ce.html, which allows users, among other things, to compare a protein structure of their choice against either the entire PDB or the representative set. The latter not only saves computational time, but is also desirable in many applications in order to avoid redundancy. The criteria used in deciding whether a protein belongs to a family are as follows:
1. |L1 - L2| < Lthr (L1 + L2)/2, where L1 and L2 are the sequence lengths of the two entities, and Lthr is the length-difference threshold parameter.
2. Lali >= Athr (L1 + L2)/2, where Lali is the number of aligned positions, and Athr is the alignment-length threshold parameter.
3. Lgap < Gthr Lali, where Lgap is the number of residues in gaps and Gthr is the gap threshold parameter.
4. The final RMSD of the alignment satisfies RMSD <= Rthr, where Rthr is the RMSD threshold parameter.
The threshold parameters used in this work are as follows: Zthr = 3.5, Rthr = 2.0 Å, Lthr = 0.1, Athr = 0.67, Gthr = 0.2.
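As a compact restatement (our sketch, not CEPAR code), criteria 1-4 with the thresholds above translate directly into a predicate such as the following; the CEResult field names are hypothetical labels for the CE outputs listed earlier, and the z-score threshold Zthr is not part of these four inequalities as stated:

#include <cstdlib>

struct CEResult { double rmsd; int lali; int lgap; double zscore; };

bool same_family(int L1, int L2, const CEResult& r) {
    const double Lthr = 0.1, Athr = 0.67, Gthr = 0.2, Rthr = 2.0;  // thresholds from the text
    const double Lavg = 0.5 * (L1 + L2);
    return std::abs(L1 - L2) <  Lthr * Lavg    // 1: similar chain lengths
        && r.lali           >= Athr * Lavg     // 2: long enough alignment
        && r.lgap           <  Gthr * r.lali   // 3: few gap positions
        && r.rmsd           <= Rthr;           // 4: small RMSD
}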
2. PARALLEL IMPLEMENTATION
The algorithm employed in CEPAR is the master/workers approach. The software is written in C++ and uses MPI for inter-processor communication. The master processor dispatches assignments to the workers, each containing the atom coordinates of the two protein structures to be compared. The workers return the results of the pairwise CE algorithm and become ready to receive the next assignment. The master stores the results as POM (Property Object Model, see ref. [4]) databases.
The time to complete one task varies from a fraction of a second to minutes and even hours, the average being about 1 second on a modern Pentium 3 processor. If one were to naively do an all-against-all comparison of the PDB (which currently contains about 37,000 chains), the total amount of raw computational time would be about 21,000 CPU-days, which is quite impractical. Of course, it is not necessary to do an all-against-all comparison, since the data are quite redundant: there are many proteins similar in structure which should be grouped into families (5574 families were found in the latest calculation). Therefore, with an efficient algorithm, the total amount of calculation is much less than the above number. Our main concern is to utilize this redundancy to the maximum so as to decrease the number of pairwise comparisons. That means, at any point in the calculation, that we try to establish a relationship between a given entity and one of the existing representatives as soon as possible, if such a relationship exists, since comparisons with the rest of the representatives would be a waste of time. Therefore, in CEPAR all existing representatives are ordered with respect to their likelihood of being similar to a given entity, roughly estimated from the similarity of their amino acid frequency profiles. Two approaches are possible for ordering the pairwise comparisons, keeping the above considerations in mind. The entity-first approach attempts to compare all remaining entities in turn to the first existing representative before proceeding to the second representative, and so on. If any of the entities is found to belong to an existing family, it is deleted from the entity list. If all entities have been compared with all representatives, the first entity in the list becomes a new representative. The family-first approach attempts to compare all existing representatives with a given entity before going on to the next entity. If the family relationship is established with one of the representatives, all other calculations involving the current entity are interrupted. If none of the representatives is found to be close enough to the given entity, this entity becomes a new representative. Both of these approaches are quite efficient, as they save a lot of work by utilizing redundancy in an optimal way. However, as can be seen in Fig. 1, the family-first approach is found to be slightly superior in performance to the entity-first approach, and also much better in terms of memory utilization on the master processor. Therefore the family-first approach is adopted for the recommended production version of CEPAR (although the software allows either of the two approaches to be specified at compile time). The communication time is rather small (but finite) compared to the computation time. However, the program is not embarrassingly parallel, since there are dependencies between different tasks. See Figures 1 and 2 for the parallel performance on two sample databases. Let us try to explain the source of the load imbalance seen at the high end of the processor count. First of all, at the end of the calculation some tasks are still running while most of the processors have finished their work, since the time to complete an individual task varies greatly. To avoid this, it is recommended that the number of processors used for running CEPAR be much less than the number of entities in the input database (in practice, it is enough that P < N/6).
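For illustration, here is a serial C++ sketch of the family-first logic (ours, with placeholder functions; the real CEPAR interleaves this with the MPI master/worker dispatch and interrupts pending comparisons once a match is found): each entity is compared against the existing representatives in order of a cheap profile-based similarity estimate, stopping at the first match, and becomes a new representative if no match is found.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Chain { int id; /* coordinates, amino-acid frequency profile, ... */ };

// Placeholder stubs: the real code runs the pairwise CE comparison (applying
// criteria 1-4) and estimates similarity from amino-acid frequency profiles.
static bool   ce_similar(const Chain&, const Chain&)      { return false; }
static double profile_distance(const Chain&, const Chain&) { return 0.0; }

void classify(const std::vector<Chain>& entities,
              std::vector<Chain>& reps,          // representatives, one per family
              std::vector<int>& family_of) {     // family index for each entity
    for (const Chain& e : entities) {
        // Order existing representatives by estimated likelihood of similarity,
        // so that a match (if any) is found after as few CE comparisons as possible.
        std::vector<std::size_t> order(reps.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
            return profile_distance(e, reps[a]) < profile_distance(e, reps[b]);
        });
        int family = -1;
        for (std::size_t idx : order)
            if (ce_similar(e, reps[idx])) { family = static_cast<int>(idx); break; }
        if (family < 0) {                        // no existing family fits:
            reps.push_back(e);                   // the entity founds a new family
            family = static_cast<int>(reps.size()) - 1;
        }
        family_of.push_back(family);
    }
}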
Alternatively, CEPAR can be run in two steps to mitigate this end-of-run imbalance: the first step is interrupted when a significant number of processors becomes idle, and the run is continued using a much smaller number of processors. Figure 1 shows that running in two steps significantly improves scalability, which means that end-of-run load imbalance explains much of the inefficiency seen in the one-step runs. Another source of load imbalance is congestion of the master processor. To address this, the algorithm for the master processor was thoroughly optimized. During idle time, assignments for the workers are buffered in advance so that the next idle worker can receive them without delay. In addition, in order to avoid congestion of the MPI send channel, it is necessary to use buffered (or asynchronous) MPI sends, since on many systems (including IBM) the MPI_Send function does not buffer messages by default [5,6]. The complete algorithm for the master processor is shown in Fig. 3.
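The following minimal C++/MPI sketch (ours, not CEPAR source; buffer sizes and the message contents are assumptions) shows the buffered-send idiom referred to above: the master attaches an explicit buffer and uses MPI_Bsend, which returns as soon as the message is copied into that buffer, so the master is never blocked waiting for a slow worker to post its receive. Run with at least two MPI ranks.

#include <mpi.h>
#include <cstdlib>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int max_msgs = 64, msg_len = 1024;           // assumed sizes
    int bufsize = max_msgs * (msg_len * (int)sizeof(double) + MPI_BSEND_OVERHEAD);
    void* buf = std::malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);                    // give MPI explicit buffer space

    if (rank == 0) {
        double assignment[msg_len] = {0};               // e.g. packed atom coordinates
        // Returns once the message is buffered; the master can immediately
        // go on to prepare and buffer the next assignment.
        MPI_Bsend(assignment, msg_len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double assignment[msg_len];
        MPI_Recv(assignment, msg_len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Buffer_detach(&buf, &bufsize);
    std::free(buf);
    MPI_Finalize();
    return 0;
}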
3. CONCLUSION
As we have shown, CEPAR is an efficient, scalable algorithm for mutual structure comparison of the proteins in a given database. We have already discussed that such a problem is a good example of a biological calculation involving data sets with a high degree of redundancy. Utilizing this redundancy in an optimal way is the key to achieving good performance and a reasonable work completion time. We found it crucial to pay attention to the ordering of the calculations due to time dependencies, and to minimize the delay in sending assignments to the workers by optimizing the master's algorithm. The limitations of the algorithm should be recognized, in particular the fact that the software should be run on a machine with a number of processors much less than the number of entities in the database; alternatively, the software can be run in two steps. A post-processing step is necessary to convert the family classification into the all-against-all comparison information. The post-processing software can be run either in serial or in parallel, on a shared-memory computer using OpenMP. The considerations described above can be applied to other large-scale problems involving the processing of highly redundant data sets. A production run involving the entire PDB was performed on 64 processors of a Linux cluster at SDSC. This run took 6 weeks and achieved processor utilization close to 100%. This demonstrates that CEPAR is capable of utilizing loosely coupled clusters (although using a tightly coupled supercomputer might allow better scalability, and therefore a higher number of processors used efficiently when the turnaround time is critical). It took another day to post-process the results on 24 processors of an IBM Power4+ Regatta shared-memory node. The results are available to the community at http://cl.sdsc.edu/ce.html. The CEPAR software can also be used for incremental updates of the results to take into account the growth of the PDB itself. In addition, these data are being used as one of the essential ingredients of a new project titled "Encyclopedia of Life". This large-scale effort currently under way at the San Diego Supercomputer Center aims at annotating all known genomes, including structural and functional information, providing well-defined reliability measures of predictions, true API-level integration with key biological resources, and giving users convenient access through a state-of-the-art Web services interface. For more details consult http://eol.sdsc.edu.
REFERENCES
[1] Li, W.W., Quinn, G.B., Alexandrov, N.N., Bourne, P.E. and Shindyalov, I.N. (2003) Proteins of Arabidopsis thaliana (PAT) database: A resource for comparative proteomics. Genome Biology 4(8), R51.
[2] Shindyalov, I.N. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension of the optimum path. Protein Engineering, 11, 739-747.
[3] Shindyalov, I.N. and Bourne, P.E. (2001) CE: A Resource to Compute and Review 3-D Protein Structure Alignments. Nucleic Acids Research, 29(1), 228-229.
[4] Shindyalov, I.N. and Bourne, P.E. (1997) Protein data representation and query using optimized data decomposition. CABIOS, 13, 487-496.
[5] Snir, M. et al. (1996) MPI: The Complete Reference, MIT Press, Cambridge, MA.
[6] Barrios, M. et al. (1999) Scientific Applications in RS/6000 SP Environments, sections 2.8.7, 2.13 and 9.2.1, IBM Redbooks series, http://www.redbooks.ibm.com
FIGURES
Figure 1. Performance of two parallel versions of CEPAR: family-first and entity-first strategies.
[Plot: speedup versus number of processors (up to 448) for the entity-first and family-first strategies, one-step and two-step, with ideal scaling shown for comparison.]
Compute platform for these runs was IBM SP "Blue Horizon" machine (1144 Power3+ processors) at SDSC. The family-first implementation was run using both one-step [magenta, triangles] and two-step [blue, diamonds] jobs. Entity-first results are shown with green squares. Sample data consisting of 663 entities chosen at random from the PDB were used to obtain these measurements. 321 representatives were identified. Ideal scaling is shown in red. The vertical axis plots the speedup relative to the 16-processor timing of 5567 sec. for the family-first approach. (That is 16 units = 1 / [1.55 hours].)
Figure 2. Parallel performance of CEPAR run on the Blue Horizon platform on a dataset of 3422 entities.
[Plot: speedup versus number of processors (up to 1024) for one-step and two-step runs, with ideal scaling shown for comparison.]
In this run 1159 families were identified using the threshold parameters mentioned in the text. Green (squares) data points correspond to one-step runs, while blue (diamonds) points correspond to two-step execution. The red line shows ideal scaling. Performance is plotted as speedup relative to the 64-processor run lasting 18948 sec. (That is, 64 units = 1 / [5.26 hours].)
Figure 3. Algorithm logic for the master processor using the family-first strategy.
[Flowchart: the master's event loop - receive alignment results, test the similarity criteria, place the entity into a representative's family or make it a representative of a new family, prepare and buffer assignments, and dispatch assignments to idle workers while entities remain.]
MDGRAPE-3: A Petaflops Special-Purpose Computer System for Molecular Dynamics Simulations

M. Taiji a*, T. Narumi a, Y. Ohno a, and A. Konagaya b

aHigh-Performance Biocomputing Research Team, Genomic Sciences Center, RIKEN, 61-1 Onocho, Tsurumi, Yokohama, Kanagawa 230-0046, Japan
bBioinformatics Group, Genomic Sciences Center, RIKEN, 1-7-22 Suehiro, Tsurumi, Yokohama, Kanagawa 230-0045, Japan

*To whom all correspondence should be addressed. E-mail: [email protected]

We are developing the MDGRAPE-3 system, a petaflops special-purpose computer system for molecular dynamics simulations. It is a special-purpose engine that calculates the nonbonded interactions between atoms, which are the most time-consuming part of the simulations. A dedicated LSI, the 'MDGRAPE-3 chip', performs these force calculations at a speed of 165 gigaflops or higher. The system will have 6,144 MDGRAPE-3 chips to achieve a nominal peak performance of one petaflops.

1. INTRODUCTION
Nowadays, we are still observing a quite rapid increase in the performance of conventional computers. It would therefore appear to be very difficult to compete with conventional computers, and it is often said that the only future high-performance processor is the microprocessor of the PC. However, we are going to face two obstacles on the road of high-performance computing with conventional technology: the memory bandwidth bottleneck and the heat dissipation problem. These bottlenecks can be overcome by developing specialized architectures for particular applications. The GRAPE (GRAvity PipE) project is one of the most successful attempts to develop such high-performance, competitive special-purpose systems [1]. The GRAPE systems are specialized for simulations of classical particles, such as gravitational N-body problems or molecular dynamics (MD) simulations. In these simulations, most of the computing time is spent on the calculation of long-range forces, such as gravitational, Coulomb, and van der Waals forces. Therefore, the special-purpose engine calculates only these forces, and all the other calculations are done by a conventional host computer connected to the system. This style makes the hardware very simple and cost-effective. The strategy of specialized computer architecture was pioneered for use in MD simulations by Bakker et al. [2] and by Fine et al. [3]. However, neither system was able to achieve effective cost-performance. The most important drawback was the architectural complexity of these machines, which demanded much time and money in the development stages.
[Plot: peak performance of successive GRAPE systems versus year, from megaflops in the late 1980s to the projected petaflops of MDGRAPE-3.]
Figure 1. The growth of the GRAPE systems. For astrophysics, there are two product lines. The even-numbered systems, GRAPE-2, 2A, 4, and 6, are the high-precision machines, and the odd-numbered systems, GRAPE-1, 1A, 3, and 5, are the low-precision machines. GRAPE-2A, MD-GRAPE, MDM, and MDGRAPE-3 are the machines for MD simulations.
Figure 1 shows the advancement of the GRAPE computers. The project started at the University of Tokyo and is now run by two groups, one at the University of Tokyo and one at the RIKEN Institute. The GRAPE-4 [4], built in 1995, was the first machine to break the teraflops barrier in nominal peak performance. Since 2001, the leader in performance has been the MDM (Molecular Dynamics Machine) [5] at RIKEN, which boasts a 78-Tflops performance. At the University of Tokyo, the 64-Tflops GRAPE-6 was completed in 2002 [6]. To date, seven Gordon Bell Prizes (1995, 1996, 1999, 2000 (double), 2001, 2003) have been awarded to simulations using the GRAPE/MDM systems. The GRAPE architecture can achieve such high performance because it solves the memory bandwidth bottleneck and lessens the heat dissipation problem. Based on these successes, in 2002 we launched a project to develop MDGRAPE-3 (also known as Protein Explorer), a petaflops special-purpose computer system for molecular dynamics simulations [7]. The MDGRAPE-3 is a successor of GRAPE-2A [8], MD-GRAPE [9, 10], and MDM/MDGRAPE-2 [5], which are also machines for molecular dynamics simulations. The main targets of the MDGRAPE-3 are high-precision screening for drug design [11] and large-scale simulations of huge proteins/complexes. The system will appear in early 2006. In this paper, we describe the hardware architecture of the MDGRAPE-3 system.
2. WHAT MDGRAPE-3 CALCULATES
First, we will describe the basic architecture of the MDGRAPE-3 and what it calculates. As already mentioned, all the GRAPE systems consist of a general-purpose commercial computer and a special-purpose engine. The host sends the coordinates and the other data of the particles to the special-purpose engine, which then calculates the forces between particles and returns the results to the host computer. In molecular dynamics simulations, most of the calculation time is spent on nonbonded forces, i.e., the Coulomb force and the van der Waals force. Therefore, the special-purpose engine calculates only nonbonded forces, and all other calculations are done by the host computer. This makes the hardware and the software quite simple.
Figure 2. (a) Block diagram of the force calculation pipeline in the MDGRAPE-3 chip. (b) Block diagram of the MDGRAPE-3 chip.
To use the GRAPE systems, a user simply uses a subroutine package to perform force calculations. All other aspects of the system, such as the operating system, compilers, etc., rely on the host computer. The communication time between the host and the special-purpose engine is proportional to the number of particles, N, while the calculation time is proportional to its square, N^2, for the direct summation of the long-range forces, or proportional to N Nc, where Nc is the average number of particles within the cutoff radius of the short-range forces. Since Nc usually exceeds several hundred, the calculation cost is much higher than the communication cost. In the MDGRAPE-3 system, the ratio between the communication speed and the calculation speed of the special-purpose engine will be 0.25 bytes per thousand operations. This ratio can be fairly small compared with those of commercial parallel processors. Next, we describe what the special-purpose engine of the MDGRAPE-3 calculates. The MDGRAPE-3 calculates the two-body force on the i-th particle, F_i, as
\[ \mathbf{F}_i = \sum_j a_j \, g(b_j r_s^2) \, \mathbf{r}_{ij}, \qquad (1) \]
where \(\mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i\) and \(r_s^2 = r_{ij}^2 + \varepsilon_i^2\). The vectors \(\mathbf{r}_i, \mathbf{r}_j\) are the position vectors of the i-th and j-th particles, and \(\varepsilon_i\) is a softening parameter to avoid numerical divergence. For the sake of convenience, we hereafter refer to the particles on which the force is calculated as the 'i-particles', and the particles which exert the forces on the i-particles as the 'j-particles'. The function g is an arbitrary smooth function; by changing the function, the chip can calculate Coulomb, van der Waals, or other forces. In addition to Equation (1), the MDGRAPE-3 can also calculate potentials, the diagonal part of the virial tensor, and the wave-space sums in the Ewald method [12]. There are several clever algorithms for the efficient calculation of long-range forces. In high-precision calculations such as those in molecular dynamics simulations, these algorithms use direct summation for the near-field forces; therefore, the GRAPE can accelerate these algorithms. Several such algorithms have been implemented, i.e., the tree algorithm [13], the Ewald method [12], and the modified fast multipole method [14].

3. MDGRAPE-3 CHIP
In this section, we describe the MDGRAPE-3 chip, the force calculation LSI for the MDGRAPE-3 system. Figure 2(a) shows the block diagram of the force calculation pipeline
of the MDGRAPE-3 chip. It calculates Equation (1) using the specialized pipeline, which consists of three subtractor units, six adder units, eight multiplier units, and one function-evaluation unit. It can perform about 33 equivalent operations per cycle when it calculates the Coulomb force [15]; the count depends on the force to be calculated. Most of the arithmetic operations are done in 32-bit single-precision floating-point format, with the exception of the force accumulation. The force F_i is accumulated in 80-bit fixed-point format and is converted to 64-bit double-precision floating-point format when it is read. The coordinates r_i, r_j are stored in 40-bit fixed-point format because this makes the implementation of periodic boundary conditions easy. The function evaluator, which allows the calculation of an arbitrary smooth function, is the most important part of the pipeline. This block is almost the same as those in MD-GRAPE [9, 10]. It has a memory unit with 1,024 entries, which contains a table of polynomial coefficients and exponents, and a hardwired pipeline for fourth-order polynomial evaluation. It interpolates an arbitrary smooth function g(x) using segmented fourth-order polynomials. Figure 2(b) shows the block diagram of the MDGRAPE-3 chip. It will have 20 force calculation pipelines, a j-particle memory unit, a cell-index controller, a force summation unit, and a master controller. The j-particle memory unit holds the coordinates of j-particles for 32,768 bodies and corresponds to the 'main memory' in general-purpose computers. Thus, the chip is designed using the memory-in-the-chip architecture, and no extra memory is necessary on the system board. The same output of the memory is sent to all the pipelines simultaneously. Each pipeline calculates using the same data from the j-particle unit and individual data stored in the local memory of the pipeline. This parallelization scheme, the 'broadcast memory architecture', is one of the most important advantages of the GRAPE systems. It enables efficient parallelization at low bandwidth. In addition, we can extend the same technique to the temporal axis by adding 'virtual pipelines' [16]. In this technique, multiple forces on different particles are calculated in turn using the same pipeline and the same j-particle data. In the MDGRAPE-3 chip there are two virtual pipelines per physical pipeline, and thus the total number of virtual pipelines is 40. The physical bandwidth of the j-particle unit is 2.5 Gbytes/sec, but the virtual bandwidth reaches 100 Gbytes/sec. This allows quite efficient parallelization in the chip. The MDGRAPE-3 chip has 340 arithmetic units and 20 function-evaluator units which work simultaneously. The chip has a high performance of 165 Gflops at a modest speed of 250 MHz. This advantage will become even more important in the future. The number of transistors will continue to increase over the next ten years, but it will become increasingly difficult to use the additional transistors to enhance performance in conventional processors. In the GRAPE systems, the chip can house more and more arithmetic units with little performance degradation. Therefore, a special-purpose approach is expected to become increasingly more advantageous than a general-purpose approach. The demerit of the broadcast memory parallelization is, of course, its limitation with respect to applications. However, we have found several applications other than particle simulations: for example, the calculation of dense matrices [17], boundary value problems, and the dynamic programming algorithm for hidden Markov models can also be accelerated by broadcast memory parallelization. The size of the j-particle memory, 32,768 bodies, is sufficient for the chip in spite of the remarkable calculation speed, since molecular dynamics simulation is a computation-intensive application and not a memory-intensive one. In addition, we can use many chips in parallel to increase the capacity. In this case, each chip calculates partial forces.
673
Network Switch
H i g h - S p e e d Serial In t e r c o n n e c t i o n [ ...........................
, 104
,
Petaflops System Single CPU (2 Tflops) ....... .
, .... : .......... ~....... i
..""
i
/
..
1.0
.. 0;8
i ~
PU3
lo2
i!
o~ ~" lo0
g
N
0.4
~
10-2 02 .............
Host C l u s t e r ( 5 1 2 C P U )
....
SpecialPurpose Computers (6144 chips)
10-4 lo4
lOs
lo 8
N
lo 7
~
Figure 3 and 4. Figure 3 (left): Block diagram of the MD-GRAPE. Here we assume each PU contains 2 CPUs. Figure 4 (right): Sustained performance of the MDGRAPE-3 with the direct summation algorithm. The calculation time per step (left axis) and the efficiency are plotted. The solid line and dashed one indicate those of the petaflops system and the 2-Tflops system, respectively. communication between the host and the special-purpose engines, the MDGRAPE-3 chip has a force summation unit, which calculates the summation of the partial forces. The j-particle memory is controlled by the cell-index controller, which generates the address for the memory. It supports the cell-index method, which is the standard technique to calculate two-body interactions with a cutoff[ 18]. Gathering all the units explained above, the MDGRAPE-3 chip has almost all elements of the MD-GRAPE/MDGRAPE-2 system board. The chip is designed to operate at 250 MHz in the worst condition (1.08V, 85~ process factor = 1.3). It has 20 pipelines and each pipeline performs 33 equivalent operations per cycle, and thus the peak performance of the chip will reach 165 Gflops. This is 12.5 times faster than the previous MDGRAPE-2 chip. The chip is made by Hitachi Device Development Center HDL4N 0.13 #m technology. It consists of 6M gates and 10M bits of memory, and the chip size is 15.7 mm x 15.7 mm. It will dissipate 20 watts at the core voltage of + 1.2 V. Its power per performance will be 0.12 W/Gflops, which is much better than those of the conventional CPUs. There are several reasons for this high power efficiency. First, the accuracy of the arithmetic units is smaller than that of the conventional CPUs. Secondly, in the MDGRAPE-3 chip, 90% of transistors are used for the arithmetic operations and the rest are used for the control logic. The specialization and the broadcast memory architecture make the efficient usage of silicon possible. And lastly, the MDGRAPE-3 chip works at the modest speed of 250 MHz. At gigahertz speed, the depth of the pipeline becomes very large and the ratio of the pipeline registers tends to increase. Since the pipeline registers dissipate power but perform no calculation, there is a lessening of power-efficiency. Thus, the GRAPE approach is very effective to suppress power consumption.
674 4. SYSTEM A R C H I T E C T U R E Figure 4 shows the block diagram of the MDGRAPE-3 system. The system will consist of a host cluster with special-purpose engines attached. We are going to use SGI Altix 350 (16 CPU, Itanium 1.4GHz/1.5Mbytes cache) as a host computer for the initial small system. The petaflops system will have 512 CPUs connected by Infiniband. Each CPU will have the system board with 12 MDGRAPE-3 chips. Since the MDGRAPE-3 chip has a performance of 165 Gflops, the performance of the nodes will be 2 Tflops. With 512 CPU the system reaches to a petaflops. The board and the PCI-X bus of the host are connected by the special high-speed serial interconnection with the speed of 10 Gbit/sec for both up and down streams. The two boards with 4 Tflops will be mounted in a 2U-height 19" subrack. The total power dissipation of the petaflops system will be about 200 KWatts and it will occupy 100 m 2" 5. SOFTWARE A N D P E R F O R M A N C E E S T I M A T I O N In the GRAPE systems a user does not need to consider the detailed architecture of the special-purpose engine and all the necessary compute services are defined in the subroutine package. Several popular molecular dynamics packages have already been ported for MDGRAPE-2, including Amber-6[ 19] and CHARMM[20]. For them, the parallel code using MPI also works with the MDGRAPE-2. The interactive molecular dynamics simulation system on the MDGRAPE-2 has also been developed using Amber-6. It uses a three-dimensional eyeglass and a haptic device to manipulate molecules interactively. It is quite easy to convert these codes for MDGRAPE-3. Next, we discuss the sustained performance of the MDGRAPE-3 by means of a simple model[7]. Figure 4 shows the sustained performance of the MDGRAPE-3. We assumed the sustained performance of the host CPU is 1.5 Gflops and the communication speed between the host CPU is 0.5 Gbytes/sec. In the petaflops system, the efficiency reaches 0.5 at N = 5 x 105, and the system with one million particles can be simulated by 0.05 sec/step. In the target applications of the MDGRAPE-3, the throughput of systems with several ten thousand particles is important. In the single CPU system with 2 Tflops, the efficiency reaches 0.5 at the system with N = 4 • 104 particles and it can be simulated by 0.08 sec/step. Therefore, we can simulate such systems for 1.1 nsec/CPU/day, and 0.55 #sec/system/day when the simulation timestep is 1 fsec. For high-precision screening of drug candidates, a simulation of more than 5 nsec will be required for each sample. If the system is used for this purpose, we can treat 100 samples per day, which is very practical. For larger systems, the use of the Ewald method or the treecode is necessary. With the Ewald method, a system with ten million particles can be treated with 0.3 sec/step with the petaflops system. The system will be useful for the analysis of such very-large biocomplexes. 6. S U M M A R Y A N D S C H E D U L E
The MDGRAPE-3 system is a special-purpose computer system for molecular dynamics simulations with a petaflops nominal peak speed. It is a successor of MD-GRAPE, the GRAPE system for MD simulations. We are developing an MDGRAPE-3 chip for the MDGRAPE3 that will have a peak performance of 165 Gflops. The chip will have 20 force calculation
675 pipelines and a main memory. The system will consist of 6,144 MDGRAPE-3 chips to achieve a petaflops. The host computer is a cluster with 512 CPUs. The special-purpose engines are distributed to each PU of the host. A sample LSI of MDGRAPE-3 will appear in March 2004, and a system of 20 Tflops performance will be built by March 2005. The system will be complete in the first half on 2006. The total cost is about 20 million US dollars including the labor cost. The MDGRAPE-3 is expected to be a very useful tool for the biosciences, especially for structural genomics, nanotechnology, and the wide range of material sciences. ACKNOWLEDGEMENTS
The authors would like to express their sincere gratitude to their many coworkers in the GRAPE projects, especially to Prof. Junichiro Makino, Prof. Daiichiro Sugimoto, Dr. Toshikazu Ebisuzaki. This work was partially supported by the 'Protein 3000 Project' contracted by the Ministry of Education, Culture, Sports, Science and Technology of Japan. REFERENCES
[ 1] [2] [3] [4] [5] [6] [7] [8] [9]
[10] [11]
[12]
[13]
J. Makino, M. Taiji, Scientific simulations with special-purpose computers, John Wiley & Sons, Chichester, 1998. A.F. Bakker, C. Bruin, in: B. J. Alder (Ed.), Special Purpose Computers, Academic Press, San Diego, 1988, 183. R. Fine, G. Dimmler, C. Levinthal, PROTEINS: Structure, Function and Genetics 11 (1991)242. M. Taiji, J. Makino, T. Ebisuzaki, D. Sugimoto, in: Proceedings of the 8th International Parallel Processing Symposium, IEEE Computer Society Press, Los Alamitos, 1994, 280. T. Narumi, R. Susukita, T. Ebisuzaki, G. McNiven, B. Elmegreen, Molecular Simulation 21 (1999) 401. J. Makino, E. Kokubo, T. Fukushige, H. Daisaka, in: Proceedings of Supercomputing 2002, 2002, in CDROM. M. Taiji, T. Narumi, Y. Ohno, N. Futatsugi, A. Suenaga, N. Takada, A. Konagaya, in: Proc. Supercomputing 2003, 2003, in CDROM. T. Ito, J. Makino, T. Ebisuzaki, S. K. Okumura, D. Sugimoto, Publ. Astron. Soc. Japan 45 (1993)339. M. Taiji, J. Makino, A. Shimizu, R. Takada, T. Ebisuzaki, D. Sugimoto, in: R. Gruber, M. Tomassini (Eds.), Proceedings of the 6th conference on physics computing, European Physical Society, 1994, 609. T. Fukushige, M. Taiji, J. Makino, T. Ebisuzaki, D. Sugimoto, Astrophysical J. 468 (1996) 51. A. Suenaga, M. Hatakeyama, M. Ichikawa, X. Yu, N. Futatsugi, T. Narumi, K. Fukui, T. Terada, M. Taiji, M. Shirouzu, S. Yokoyama, A. Konagaya, Biochemistry 42 (2003) 5195. T. Fukushige, J. Makino, T. Ito, S. K. Okumura, T. Ebisuzaki, D. Sugimoto, in: V. Milutinovix, B. D. Shriver (Eds.), Proceedings of the 26th Hawaii International Conference on System Sciences, IEEE Computer Society Press, Los Alamitos, 1992, 124. J. Makino, Publ. Astron. Soc. Japan 43 (1991) 621.
676 [14] [15] [16] [17]
A. Kawai, J. Makino, Astrophysical J. 550 (2001) L143. A. H. Karp, Scientific Programming 1 (1992) 133. J. Makino, E. Kokubo, M. Taiji, Publ. Astron. Soc. Japan 45 (1993) 349. Y. Ohno, M. Taiji, A. Konagaya, T. Ebisuzaki, in: Proc. 6th World Multiconference on Systemics, Cybernetics and Informatics SCI, 2002, 514. [ 18] M. P. Allen, D. J. Tildesley, Computer Simulation of Liquids, Oxford University Press, Oxford, 1987. [ 19] D. A. Case, D. A. Pearlman, J. W. Caldwell, T. E. Cheatham III, W. S. Ross, C. Simmerling, T. Darden, K. M. Merz, R. V. Stanton, A. Cheng, J. J. Vincent, M. Crowley, V. Tsui, R. Radmer, Y. Duan, J. Pitera, I. Massova, G. L. Seibel, U. C. Singh, E Weiner, E A. Kollman, Amber 6 Manual, UCSF, 1999. [20] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, M. Karplus, J. Comp. Chem. 4 (1983) 187.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
677
Structural Protein Interactions: From Months to Minutes R Dafas, J. Gomoluch, A. Kozlenkov a, and M. Schroeder b
9
aDept, of Computing, City University, London, United Kingdom bBiotec/Fakult~it f'tir Informatik, TU Dresden, Germany Protein interactions are important to understand the function of proteins. PSIMAP is a protein interaction map derived from protein structures in the Protein Databank PDB and the Structural Classification of Proteins SCOR In this paper we review how to reduce the computation of PSIMAP from months to minutes, first by designing a new effective algorithm and second by distributing the computation over a Linux PC farm using a simple scheduling mechanism. From this experience we derive some general conclusions: Besides computational complexity, most problems in bioinformatics require information integration and this integration is semantically complex. We sketch our relevant work to tackle computational and semantic complexity. Regarding the former, we classify problems and infrastructure and review how marketbased load-balancing can be applied. Regarding the latter, we outline a Java-based rule-engine, which allows users to declaratively specify workflows separate from implementation details. 1. I N T R O D U C T I O N While DNA sequencing projects generate large amounts of data and although there are some 1,000,000 annotated and documented proteins, comparatively little is known about their interactions. This is however the vital context, in which to interpret function. To complement experimental approaches to protein interaction, PSIMAP, the protein structure interaction map [ 10], derives domain-domain interactions from known structures in the Protein Databank PDB [ 1]. The domain definitions that PSIMAP uses are provided by SCOP [7]. The original PSIMAP algorithm computes all pairwise atom/residue distances for each domain pair of each multidomain PDB entry. At residue level this takes days; at the atomic level even months. We approached this computational challenge from two directions: First, we developed an effective new algorithm, which substantially prune the search space. Second, we distribute the computation over a farm of 80 Linux-PCs. The combined effort reduced the computation time from months to minutes. The basic idea of our novel algorithm is to prune the search space by applying a bounding shape to the domains. Interacting atoms of two domains can only be found in the intersection of the bounding shapes of the two domains. All atoms outside the intersection can be discarded. The quality of the pruning depends on how well the bounding shape approximates the domain. The proposed algorithm uses convex polygons as bounding shapes to best approximate the structure of the domains. The new algorithm requires 60 times less residue pair comparisons *This project was supported by the European Commission (BioGrid project, IST 2002-2004)
678
Figure 1. Checking interactions of a coiled coil domain pair of C-type lectins from PDB entry lkwx, an X-Ray structure of the Rat Mannose Protein A Complexed With B-Me-Fuc shown on the left with RasMol. The old PSIMAP algorithm requires an expensive pairwise comparison of all atoms/amino acids of the two domains (third figure). Using the proposed algorithm (fourth figure), it is only necessary to pairwise compare only the atoms/residues in the intersection of two convex hulls. The new PSIMAP algorithm Given: Two domains with residue/atom coordinates 1. A convex hull for each of the two domains is computed. 2. Both convex hulls are swelled by the required contact distance threshold. 3. The intersection of the two transformed convex hulls is computed. 4. All residue/atom pairs outside the intersection are discarded and for the remaining residues/atoms the number of residue/atgm pairs within the distance threshold is computed. If this number exceeds the number threshold the two domains are said to interact. Figure 2. The backbone for the new algorithm.
than the old PSIMAP algorithm for the PDB and works extremely well at the atomic level, as most atoms outside the interaction interface are immediately discarded. Overall, this new algorithm computes all domain-domain interactions at atomic level within 20 minutes, when distributed over 80 Linux machines. 2. E F F E C T I V E C A L C U L A T I O N O F P R O T E I N I N T E R A C T I O N S The original PSIMAP algorithm computes all pairwise atom/residue distances for each domain pair of each multidomain PDB entry. It can be formally formulated as below: Let d be a distance threshold and t be a number threshold. Let D 1 - { x l , . . . , x n } and D 2 { Y l , - . . , Y ~ } , where xi,yj E R a, be two sets of (atomic/residue) coordinates of the two domains. The set of interacting residues/atoms is defined by I(D1,D2) -
X/~-~k=l(xik- yjk) 2 ~_ d}. The domains D 1 and D 2 are said to interact if and only if II[ _ t. The values of the interaction parameters, distance and number thresholds, that v
PSIMAP uses are 5A and 5 residues/atoms of different residues, accordingly. The proposed efficient algorithm uses convex polygons (convex hulls) as bounding shapes to best approximate the structure of the domains. The flow of the proposed algorithm for interaction detection in the residue/atom level is given in Fig. 2. A convex hull is represented by a list of triangles, each being a list of three vertices. The complexity of the algorithm in three dimensions, using the divide and conquer technique [ 11 ],
679
/~
.
Figure 3. The case in 2D. After we expand the two convex hulls by the distance threshold, the atoms/residues in the intersection of the first domain are visible from all the faces of the convex hull of the second domain. is O(nlog(n)), where n is the number of atoms/residues. Thus step 1 in Fig. 2 can be efficiently carried out. Step 2 and 3 can both be done in linear time. In step 2 we expand each convex polygon that have been constructed in step 1 by the distance threshold d defined by PSIMAP algorithm. This is done by shifting each polygon of the convex hull perpendicularly away from the convex hull by the distance threshold d. Essentially, for each polygon p, we compute a vector v perpendicular to p with direction that points out of the convex hull. Then we compute v' as the norm of vector v and multiply it by the required distance threshold d. Finally, we add v' to each of the vertices of p, thus shifting the polygon away from the convex hull and swelling it. In step 3, we need to compute which atoms/residues are in both the transformed convex hull. To determine whether an atom/residue u is inside a convex hull or not we use the notion of signed volumes for polyhedron as defined in [9]. Fundamentally, it holds that for any point inside a convex hull each of the signed volumes of the polyhedrons, defined by v and each of the polygons of the surface, is positive ([9]). This can be computed in linear time. Using this property we check for every atom/residue in domain D 1 whether it is inside the transformed convex polygon of domain D2, in which case it belongs to the intersection (Fig. 3). At the final step, step 4 in Fig. 2, the algorithm checks in detail how many atoms/residues pairs in the hulls' intersection are within the distance threshold. 3. DISTRIBUTED COMPUTATION Before we evaluate the pruning capabilities of the above algorithm, we present the distributed computation of PSIMAP. As far as the distributed computation is concerned, two problems arise: Can the computation be distributed and if how can the load be best balanced to achieve an overall best performance. The calculation of PSIMAP with the old and new algorithms is a so-called "embarrassingly parallel" problem: I.e. the computation can be easily partitioned into independent subproblems that do not require any communication and can hence be distributed over a loosely coupled net-
680 work of computers. Each process is then assigned a task of executing the PSIMAP algorithm for a given set of PDB file entries. Hence each process calculates a part of PSIMAP that corresponds to the domain interactions identified by the set of PDB files assigned to that process. After all the processes finish the calculation the results are collected and merged to produce the overall PSIMAE The distributed implementation of PSIMAP uses the new effective algorithm described in the previous section. We have tested the PSIMAP calculation in a distributed environment of Linux workstations. The workstations have a processing power varying between 600-800MHz. Each workstation has assigned a task that corresponds to the calculation of a subset of PSIMAE Since the processing power of the workstations is the same, the assigned tasks should be equal in terms of computation size. We estimate the computation size of one PDB file by a function of the number of protein domains and size of the protein in terms of number of residues. To simplify our analysis we further assume that the size of the protein is proportional to the size s of the PDB entry. Furthermore, given p, the number of domains of a PDB entry, there are P(P2-1) potentially interacting domain pairs. Overall, we estimate the processing time for one PDB entry of size s with p domains, as f ( s , p) = s. p(~-l). Given a number c of computers/processors, we split the whole PDB into c almost equal tasks in terms of the estimated execution times. This is done using a greedy strategy. After we have sorted the single pdb tasks in descending order according to their processing time estimates we assign each single task from the sorted list to that computer/processors that have at that point the least cumulative processing time estimate. Updating the tasks' estimates we repeat that procedure for each element of the list with the sorted single pdb tasks. Next, each computer/processor is assigned such a task, which it executes. Finally, all results are collected and inserted into a database. Using the above load-balancing strategy, we ran the convex hull algorithm for PSIMAP on a farm of 80 Linux-PC.
4. A D D R E S S I N G BALANCING
COMPUTATIONAL
COMPLEXITY:
MARKET-BASED
LOAD-
The PSIMAP application requires the computation of a large number of similar tasks, which belong to the same user, and which are all known a priori. The allocation of computational resources is therefore relatively straightforward and can be carried out in the way described above. This is usually not the case in Computational Grids which are open systems with multiple users, who may belong to different organisations. One important problem in such environments is the efficient allocation of computational resources. Over the past few years, economic approaches to resource allocation have been developed [ 12], and the question arises whether they can be applied to resource allocation on the Grid. They satisfy some basic requirements for a Grid setting as they are naturally decentralised, and as decisions about whether to consume or provide resources are taken locally by the clients or service providers. The use of currency offers incentives for service providers to contribute resources, while clients have to act responsibly due to their limited budget. Indeed, a number of systems [2, 8, 13] have been built using a market mechanism to allocate computational resources. However, the performance of market-based resource allocation protocols has not been sufficiently studied. It depends on many factors which characterise a computational environment: Typically, parameters such as the number of resources, resource heterogeneity, and commu-
681 nication delays are low in a cluster, but high in a Grid. In our work, we use simulations to compare the performance of several resource allocation protocols for various computational environments, loads, and task types. The aim is to determine which protocol is most suitable for a given situation. We study several scenarios in which the clients have different requirements concerning the execution of their tasks. E.g. clients may expect their tasks to be executed as fast as possible, or their tasks may have deadlines for their completion. Furthermore, tasks belonging to different users may have different values, and therefore, need to be weighted accordingly. In the examined scenarios, we use performance metrics which reflect these requirements. In our preliminary experiments [4, 5], we compared the performance of the Continuous Double Auction Protocol (CDA), the Proportional Sharing Protocol (PSP) and the Round Robin Protocol (RR). We investigated how these protocols perform when the objective is to minimise task completion times. Our results show that, in a cluster with homogeneous resources and negligible communication delays, the Continuous Double Auction Protocol (CDA) performs best. However, if the load is low, the differences between the three protocols are small and the Proportional Share Protocol (PSP) performs equally well as CDA. In a scenario with heterogeneous resources Round-Robin performs worse than the two market-based protocols. CDA is best in most cases, but if resource heterogeneity is very high it may be outperformed by PSP. For a high number of resources PSP performs equally well as CDA. If communication delays are high, PSP performs best, whereas the other protocols degrade strongly. These results are limited to the allocation of tasks without deadlines and consider only three protocols. In our current work, we investigate a wider range of scenarios and protocols. In particular, we examine the benefits of preemption and of opportunistic pricing by the servers. We also carry out experiments in a real computational environment in order to verify certain parameters in the simulation model including communication delays and processing delays. 5. A D D R E S S I N G SEMANTIC COMPLEXITY: RULE-BASED INFORMATION INTE-
GRATION Frequent changes and extensions to analysis workflows as well as multiplicity and rapid rate of change of relevant data sources contribute to the complexity of data and computation integration tasks when analysing protein interactions. Although theoretically, teams of programmers could continue updating and maintaining the code for the analysis workflows in traditional procedural languages such as Java or Perl, the difficulty in maintaining such code and the lack of its transparency for non-programmers or even programmers who are not the original authors make this very difficult. We have designed and implemented a rule-based Java scripting language Prova [6] to address the above issues. Prova extends an open source Mandarax [3] inference engine implemented in Java. The language offers clean separation between data integration logic and computation procedures. The use of rules allows one to declaratively specify the integration requirements at high level without any implementation details. The transparent integration of Java provides for easy access and integration of database access, web services, and arbitrary Java computation tasks. This way Prova combines the advantages of rules-based programming and object-oriented programming in Java. To demonstrate the power of Prova, we show below how the main workflow of PSIMAP computation is implemented as rules that call Java methods to perform the actual computation. The Prova program consists of
682 Prova rules: Prova facts: and Prova queries:
L : - L 1 , . . . , Ln. L.
solve(L).
Here L, L 1 , . . . are literals, : - is read as "if" and commas as "and". Literals can be constants, terms, or arbitrary Java objects. s c o p _ d o m 2 d o m ( D B , P D B _ I D , PXA, PXB) : % Rule A a c c e s s _ d a t a (pdb, P D B _ I D , Prot), % Access the PDB data s c o p _ d o m _ a t o m s (DB, Prot, PXA, DomA) , % R e t r i e v e d o m a i n d e f i n i t i o n s s c o p _ d o m _ a t o m s (DB, Prot, PXB, DomB) , DomainA.interacts (DomB) . % C a l l J a v a to t e s t i n t e r a c t i o n s c o p _ d o m _ a t o m s (DB, Prot, PX, Dora) : % Rule B findall ( % Collect the domain definition p s i m a p . S u b C h a i n (PX, C, B, E) , s q l _ s e l e c t (DB, s u b c h a i n , [px, PX] , [chain_id, C] , [begin, B] , [end, E] ) , DomDef) , Dom=Prot.getDomain(DomDef) . % Create a Java object with domain
The listing above shows Rule A that creates a virtual table representing interacting domains
P X A and P X B by accessing the PDB data in a Java object Prot corresponding to PDB_ID, invoking Rule B twice to retrieve domain definitions DomA and DomB, then finally, calling a Java method interacts() to test whether the domains interact. Rule B used by Rule A assembles a domain definition by accessing the database table subchain in the SCOP database. The declarative style of rules provides for a transparent and maintainable design. For example, access to the PDB data via the access_data predicate above automatically scans alternative mirror location records for the PDB data. If one location is available due to the current local network and disk configuration, that location is cached and subsequent accesses to PDB will be directed there. Otherwise, alternative mirrors are non-deterministically retried until the PDB data can be located. Declarative style and intrinsic non-determinism of rules offers better transparency and maintainability of the code than the usual for and if constructs in procedural languages. 6. E X P E R I M E N T A L RESULTS The new effective PSIMAP algorithm gives a dramatic improvement in the number of distance comparisons required in both residue and atom level. The algorithm, using convex hulls to approximate domain structures, often picks exactly the interacting atoms/residues, so that no further checks are needed. Overall, the new algorithm approach leads to a 60-fold improvement at residue-level. Importantly, the reductions in false positives, do not lead to any false negatives. To further reduce computation time beyond the improvements achieved through the pruning taking place in the new algorithms, we distribute the computation of the convex hull algorithm over 80 Linux-PCs as described above. As summarised in Fig. 5, the algorithm computes the interactions for the whole of PDB in 15 minutes at the residue level and in 20 minutes at the atom level. This compares to several days' computation of the old PSIMAP algorithm at residue level and an estimation of months on the atom level. The results in Fig. 4 documenting the pruning and in Fig. 5 documenting the joint reduction by pruning and distribution show that our approach successfully overcomes two problems of the original PSIMAP computation: Due to the pruning the computation is feasible at the more
683 Residue Comparisons
.
le+ll =
le+10
"~
le+09
0 "~
.
.
.
.
.
A/~-
"l
le+08
/
le+07 / 1e+06
Z
100000 1
i
1
i
2
3
4
i 5
6
7
8
9
10
Distance Threshold
Figure 4. Residue comparisons needed for the new algorithm. Number of residue pair comparisons (y-axis) for different distance thresholds (x-axis, in Angstrom). OLD represents the exhaustive search, NEW represents the convex hull method and AA the real number of interacting residue pairs. PSIMAP Residue Atom
OLD o n l PC Days Months
NEW o n l PC 4.5h 20h
NEW on 8 0 P C s ! 5 min 20 min
Figure 5. The table shows how the new improved algorithm reduces the computation times. While the OLD algorithm is estimated to take months on the detailed atomic level and hence was not feasible the NEW algorithm distributed over 80 Linux machines dramatically reduces the computation time to 20 minutes. The reason for the new algorithm taking only marginally longer at the atomic level (20 minutes) compared to the residue level (15 minutes) is due to the larger problem size leading to an overall better load balancing and CPU usage. detailed and accurate atom level and due to the combined effect of pruning and distribution the domain-domain interaction are scalable and will be able to handle the superlinear growth in PDB. 7. CONCLUSION In this paper we demonstrated how to reduce the computation of PSIMAP from months to minutes. First by designing a new effective algorithm using computational geometry and in particular convex polygons and second by distributing the computation over a Linux PC farm using a simple scheduling mechanism. However, from this experience we derived some more general conclusions: Most problems in bioinformatics have a dual nature. The first is computational complexity and the second is semantic complexity since almost all of them require intelligent information integration. Currently, systems are often poorly engineered with no particular attention to reusability, scalerability and maintainability. We have sketched our relevant work to tackle these two natures of the problems. Regarding computational complexity, before parallelising the problem should be specified, with all its
684 constrains, as precise as possible. It is rather usual that problems are formulated too general with no clear specifications. When an algorithmic solution to the problem exists, its complexity should be evaluated and be reduced as much as possible. At that stage extensive refactorisation or ever redesign of the algorithm may be necessary. Regarding semantic complexity, most problems in bioinformatics involve different data sources, which are often distributed. When designing workflows to integrate data, they should be specified at the higher level of abstraction, independent from any implementation details. A rule-based approach, we outlined in this work, is one way to realise these executable specifications. REFERENCES
[1] [21
[3] [4] [5]
[6] [7]
[8]
[9] [10]
[11 ] [12] [ 13]
H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne. The protein data bank. Nucl. Acids Res., 28:235-41, 2000. R. Buyya, J. Giddy, and D. Abramson. An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications. In Proc. of the 2nd Workshop on Active Middleware Services, Pittsburgh, USA, 2000. Kluwer. J. Dietrich. Mandarax. http://www.mandarax.org. J. Gomoluch and M. Schroeder. Market-based Resource Allocation for Grid Computing: A Model and Simulation. In Proc. oflst Intl. Workshop on Middlewarefor Grid Computing (MGC2003), Rio de Janeiro, Brazil, June 2003. J. Gomoluch and M. Schroeder. Performance Evaluation of Market-based Resource Allocation for Grid Computing. To appear in Concurrency and Computation: Practice and Experience, 2004. A. Kozlenkov and M. Schroeder. Prova language, http://comas.soi.city.ac.uk/prova. A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. Scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536-40, 1995. N. Nisan, S. London, O. Regev, and N. Camiel. Globally distributed computation over the Intemet- the POPCORN project. In Proc. of the 18th Intl. Conf. on Distributed Computing Systems, Amsterdam, Netherlands, 1998. IEEE. Joseph O'Rourke. Computational Geometry in C. Cambridge University Press, second edition, 1998. J. Park, M. Lappe, and S.A, Teichmann. Mapping protein family interactions" Intramolecular and intermolecular protein family interaction repertoires in the PDB and yeast. J. Mol. Biol., 307:929-38, 2001. i F. P. Preparata and S. J Hong. Convex hulls of finite sets of points in two and three dimensions. Communications of the ACM, 20:87-92, 1977. T. Sandholm. Distributed rational decision making. In G. Weiss, editor, Multi-agent systems. MIT Press, 2000. C. A. Waldspurger, T. Hogg, B. A. Huberman, J. O. Kephart, and W. S. Stornetta. Spawn: A distributed computational economy. IEEE Trans. Software Engineering, 18(2):103117, 1992.
Parallel Computing: SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
685
S p a t i a l l y R e a l i s t i c C o m p u t a t i o n a l P h y s i o l o g y : Past, P r e s e n t a n d F u t u r e J.R. Stiles a*, W.C. Ford at, J.M. Pattillo a, T.E. Deerinck b, M.H. Ellisman b, T.M. Bartol c, and T.J. Sejnowski c ~Pittsburgh Supercomputing Center, Carnegie Mellon University, 4400 Fifth Ave., Pittsburgh, PA 15213 bDepartment of Neurosciences, University of California at San Diego, La Jolla, CA 92093 cComputational Neurobiology Laboratory, The Salk Institute, La Jolla, CA 92037
1. COMPUTATIONAL PHYSIOLOGY, SPATIAL REALISM, AND PARALLEL COMPUTING Recent technological advances have changed biology from a largely data-limited, qualitative science to one of increasing quantitation that spans wide ranges of space and time. This transition creates two important downstream needs, the first aimed at large-scale data storage and analysis (Bioinformatics), and the second at simulation and prediction (Computational Physiology and Biophysics) Physiological function depends on the spatial and temporal dynamics of specific genes, proteins, signaling molecules, and metabolites within and between cells. Realistic physiological simulations present a grand challenge because of the wide range of underlying space and time scales, as well as the widely disparate organization and properties of different cells. Real cells, as opposed to typical textbook cartoons, are highly organized and structured. The relevant spatial dimensions within and around cytoskeletal components, membranous organelles, and macromolecular complexes are generally on the scale of tens of nm, and thus are on the same approximate scale as macromolecular complexes themselves. Many such complexes function as molecular machines, illustrating the discrete, discontinuous nature of cellular physiology and the need for spatially realistic simulations. Spatially realistic cell models are presently the exception rather than the rule. In the past this was a direct consequence of inadequate computing power, and even on today's massively parallel machines, atomic-to-molecular simulation methods with fs time-steps cannot be applied to problems on the cellular scale. Therefore, the overriding difficulty lies in deciding how currently available computing power can be applied effectively to complex, quantitative physiological simulations. In short, a critical challenge is to develop modeling and simulation methods that allow integration of mechanisms, kinetics, and stochastic behaviors at the molecular level with structural organization and function at the cellular level. In this paper we describe unique *Supported by NIH R01 GM068630, P20 GM65805, P41 RR06009 (JRS), P41 RR004050 (MHE), and HHMI (TJS) *Present Address: Computation and Neural Systems, California Institute of Technology
686 methods that we have developed for spatially realistic physiological simulations (Section 2), illustrate their use with simulations of synaptic transmission (Section 3), and discuss future directions and needs for algorithm design and efficient large-scale parallel computing (Section 4). 2. M C E L L A N D D R E A M M - A MICROPHYSIOLOGICAL MODELING ENVIRON-
MENT 2.1. Building spatially realistic models of physiological systems
We are presently developing a microphysiological modeling environment that consists of MCell, a Monte Carlo cellular simulation engine, (www.mcell.psc.edu or www.mcell.cnl.salk. edu), and DReAMM, a program for the Design, Rendering, and Animation of MCell Models (www. m cel I. psc. edu/D ReAMM ). MCell is a general tool for spatially realistic reaction-diffusion simulations [1, 2, 3]. It is written in C, reads input files composed of a high-level, modular Model Description Language (MDL), and is based on unique Stochastic Space Leaping (SSL) computational algorithms. These SSL methods include the use of surface meshes to represent cellular and subcellular structures, grid-free Brownian Dynamics (BD) and ray tracing/ray marching (RTRM) algorithms for molecular diffusion, and Monte Carlo probabilities that couple chemical reactions to BD and RTRM operations. DReAMM is a model design, visualization, and analysis program designed to read general mesh data as well as MCell-specific output. DReAMM runs within the OpenDX visual programming and rendering environment (www.opendx.org), but is designed to look and feel like a typical GUI-based application program. Even small MCell models can easily lead to large rendering tasks (>>107 polygons), and so the user chooses what to render or animate, and at what level of detail. Unless indicated otherwise, all 3-D images and animations included in this paper were produced with DReAMM, but further details are not included here. Figure 1 illustrates the two basic approaches that can be used to build spatially realistic models. The first step is to generate the necessary meshes. For some projects, representative model geometry is built in silico from average anatomical measurements and computer aided design software (Fig. 1B&C). With this approach, changes to the model's spatial parameters and their potential impact can be investigated systematically. For other projects, mesh generation requires surface reconstruction from serial electron microscopic data (Fig. 1A, D-F). Complex reconstructed models are not as easy to manipulate as those generated in silico, but they can be critically important to studies investigating the biophysical factors underlying biological variability. Meshes can have different properties and functions. For example, they may be reflective or transparent to diffusing molecules, and may form cellular superstructures, scaffolding for molecular positions, or sampling boundaries. Meshes to be populated with molecules (e.g., membrane-bound proteins) are first triangulated and then each element is tiled using barycentric subdivision (Fig. 1, inset). Each tile serves as a potential location for a molecule, and the average tile size usually matches the area occupied by a typical protein. Thus, MCell's algorithms represent single molecule locations explicitly, but give up details of molecular structure for the sake of computational efficiency. Depending on the mesh topology and need for precisely positioned molecules, some mesh
687
Figure 1. Building spatially realistic models of the neuromuscular junction. Electron microscopic data (A) is used to create meshes in silico (B, using FormZ, www.formz.com) or by contouring and reconstructing surfaces (D, E). In either case, mesh elements are tiled (inset, see text) to create positions for molecules (C, F) such as voltage-gated calcium channels (VGCC), calcium binding proteins (CBP), acetylcholine receptors (AChR), and acetylcholinesterase (ACHE). NT, nerve terminal; JF, junctional folds; AZ, active zone; SV, synaptic vesicle. regions may be highly refined while others remain fairly coarse (Fig. 1B&C). If a mesh must form a continuous diffusion boundary its elements must share edges exactly, but otherwise this is not required. In some cases disjoint polygons are used to produce sites for aggregated
688 molecule locations (Fig. IF). Unlike finite element simulations, mesh topology and refinement per se have virtually no impact on MCell's numerical accuracy, but can have a major impact on compute time and memory use (Section 4; [2, 3]). 2.2. SSL algorithms: Grid-free Brownian dynamics Movements of cellular constituents occur by a combination of diffusion, bulk flow, and protein-mediated (active) transport mechanisms. Diffusion is critically important over small subcellular-to-cellular distances, and a common numerical approach in 3-D is to discretize the relevant flux equations using a volume mesh. This approach is generally considered satisfactory as long as the product of voxel size and molecule concentration yields many molecules per voxel. With a realistic subcellular mesh, however, this requirement is likely to be violated. For example, embedding the surface meshes of Fig. 1 into a volume would generate many small voxels with edge lengths of 50 nm or less. With correspondingly small volumes, physiological concentrations then could easily yield less than one molecule per voxel. At the level of single molecules, diffusion is driven by thermal velocity and collisions on the scale of Brownian motion. Numerical simulation can be implemented with many different random walk algorithms, but for realism and computational efficiency the algorithm must allow each moving molecule to sample every point in space (i.e., be grid-free). In addition, it must use a time-step that is very long with respect to true Brownian motion (~ 10-13 sec; [4]). To satisfy these constraints, individual molecular movements can be of variable lengths in random radial directions. A net displacement of length 1 in a random radial direction during time At in reality results from a sequence of many actual (unknown) Brownian steps. The probability distribution for 1 is given by [3]:
.__
P~
1 ~e ~ (47rDAt)~
r2
(47vr2dr)
(1)
where r is radial distance, and Pr is the probability that 1 is between r and (r + dr) from the original location after time At. D is the molecule's diffusion coefficient and reflects an effective mobility in solution. Its value depends on the nature of the solution (e.g., physiological saline) and temperature (e.g., [5]). Equation 1 could be sampled to the limit of machine precision for each random walk movement, but this is computationally expensive and unnecessary. Instead, MCell begins a simulation by subdividing the distribu-tion into many unequal lengths of equal probability, and builds a look-up table of step lengths. A large look-up table of radial directions is also constructed, using methods that assure lack of bias [3]. Thereafter, each movement is generated by subdividing a single random number into higher and lower order bits that are used to sample each table. The random numbers themselves are uniformly distributed and are read from a buffer that is refilled periodically. Performance scales linearly with the number of diffusing molecules, is very fast, and molecular movements are highly realistic within each time-step. The algorithm's realism is illustrated in Fig. 2, which shows a population of diffusing molecules that originated at a point source and have undergone one random walk step. The resulting cloud of molecule locations is radially symmetrical and essentially grid-free. The range of possible individual step lengths is also illustrated by a succession of movements for an individual molecule.
689
o~ ~2
//
Figure 2. Grid-free Brownian Dynamics random walk. Stereo pair (cross-eyed viewing) shows cloud of molecules (upper left) obtained from a point source after a single time step (1 #s). The distribution of locations is given by Eq. 1. A path taken by a single molecule is shown for a sequence of 150 steps (white), and for an additional 150 steps (magenta). The net displacement increases by only ~ 10% in this example (yellow vs. red arrow), and on average would increase by v/2 (Eq. 3). 2.3. SSL Algorithms: Ray tracing/ray marching and Monte Carlo probabilities During each simulation time-step, decisions are made between different possible unimolecular (first order) and bimolecular (second order) reaction transitions that can occur to molecules. Bimolecular events depend on collisions between two molecules in space, while unimolecular transitions are simple time-dependent Poisson processes that occur with probability ~k) per time-step At [3]. A Monte Carlo implementation of unimolecular transitions is given by the Gillespie method and its more recent Stochastic Simulation Algorithm (SSA) variants [6, 7], but these algorithms do not directly couple a molecular simulation of diffusion to bimolecular transitions. MCell is unique in this respect, and by directly tracing molecular movements and interactions, directly simulates the space- and time-dependent dynamics of non-equilibrium, non-well-mixed systems (e.g., transient signals like synaptic currents). As a molecule diffuses, its current random walk trajectory (ray) must be traced through space to determine if it intersects with another object. If so, the result depends on the properties of the diffusing molecule and the object at the point of intersection. For example, if the ray hits a surface tile (Fig. 1) that corresponds to an unoccupied receptor protein, then a probability value (Pb) is compared to a random number to determine whether or not binding occurs. In general:
(2) where the scaling factor (A >_ 0) is a function of the tile's area and rate constant(s) for binding (for further details see [3]). In effect, A and D are dictated by the identities of the molecules in the simulation, and so the magnitude of Pb is determined by the choice of At. While p~ for unimolecular transitions shows a typical asymptotic approach to unity as At increases, Eq. 2
690 shows that Pb has an unusual, non-asymptotic dependence on v / ~ . This arises because the average rate of collisions depends on the apparent average velocity of motion (Vapp), and Vapp depends on the mean radial displacement (/-~; [3]) which in tum scales with v/-~:
l-~
4 =
(?
t ;
l~ ~app= At = 4
(
D ~At
(3)
From Eq. 3, a 2-fold increase in At will increase {~ by only ~/2, while Yapp decreases by v/2. This is a consequence of tracking net displacements rather than the total distance traveled at the level of true Brownian motion, and is illustrated by the arrows in Fig. 2. While MCell's numerical accuracy is governed by Pk, Pb, l,., and other factors, all of these parameters depend simultaneously on At and not on a voxel-based discretization of space. Therefore, a simple reduction of At reduces spatial (via the random walk) and temporal granularity simultaneously, and thus it is very easy to test for convergence of results [2]. Given typical biological systems, highly accurate simulations can be obtained with DeItat on the scale of 10 -6 sec, compared to 10 -15 o r 10 -13 sec for Molecular Dynamics simulations or true Brownian motion, respectively. 3. SIMULATION OF
SYNAPTIC
TRANSMISSION AT THE
NERVE-MUSCLE
SYNAPSE
Synaptic transmission exemplifies cellular interactions in which stochastic behaviors and spatial complexity are very important. Changes in synaptic signal size and time course (plasticity) may underlie high-level cognitive functions such as learning and memory, and can also be an integral component of many neurological disorders [8]. Several of our projects focus on the vertebrate neuromuscular junction (NMJ), the synapse between a nerve and muscle cell. This is a large archetypical synapse for which there is a wealth of normal and pathological data related to structure and function, and thus it is well suited to quantitative predictive simulations [2, 9]. At the NMJ, single packets of neurotransmitter molecules (acetylcholine; ACh; >_10000 per packet) are released spontaneously from the resting nerve, and multiple packets are released when the nerve fires. Each packet originates from a small spherical vesicle (Fig. 1A-C), and the released ACh diffuses across and within the synaptic cleft to activate neurotransmitter receptor proteins (AChRs, Fig. 1F) on the muscle cell. Removal of ACh from the cleft occurs via chemical breakdown mediated by enzyme proteins (acetylcholinesterase, ACHE) localized within the cleft (Fig. 1F). The synaptic signal is a small electrical current carried by sodium ions flowing into the muscle cell through open pores (channels) in the activated AChRs. The released packets of ACh are spatially segregated and so make spatially distinct synaptic currents on sub-?m scales, each called a miniature endplate current (mEPC). Maximum mEPC amplitude corresponds to ~1000 open channels, time to peak is ~0.3 ms, and the decay phase is exponential with an e-fold time of 1-2 ms. Even under normal conditions, mEPCs show significant variability arising from many possible underlying factors. Their relative importance is unknown and can be investigated by simulating mEPC generation in spatially realistic models. To do the simulations, it is necessary to reconstruct a portion of NMJ architecture, and then to simulate mEPCs while systematically varying spatial and/or chemical kinetic inputs. Some of the primary candidate factors under-
691 lying variability include differences in cleft topology (volume) from one ACh release site to another, different spatial arrangements of AChR and AChE molecules, and differences in vesicular ACh content and release kinetics.
Figure 3. Mouse sternomastoid muscle NMJ reconstruction. Stereo pair (cross-eyed viewing) shows nerve terminal in translucent gray, extracellular face of muscle membrane in light blue, and cytoplasmic face of muscle membrane in dark blue. Cut edges of membrane surfaces are outlined in red. Approximate dimensions 10 x 5 x 5 #m.
We have begun investigating this issue by focusing first on cleft topology. To reconstruct part of an NMJ, we selected 60 sections from ~400 serial electron micrographs (mouse sternomastoid muscle; 80 nm section thickness), and traced and interpolated nerve and muscle membrane contours. Using a marching cubes algorithm and subsequent decimation, we reconstructed preand postsynaptic membrane surfaces (Fig.'s 1 and 3). To sample the effect of tortuous cleft topology on mEPCs, the nerve terminal mesh was sampled to obtain different potential ACh release sites at 100 nm average nearest neighbor spacing (~3800 total sites, Fig. 3). In addition, the arrangement of AChE sites within the cleft reflected different assemblies (isoforms) of AChE molecules [ 10], on average yielding 12-mers of the catalytic subunit (Fig. IF). Figure 3 shows representative samples of simulated mEPCs, obtained either with repetitive ACh release from a single site (top panels), or successive release from different sites (bottom panels). These data thus illustrate predicted Type I (arising from a single release site) versus Type II mEPC variability (arising from multiple release sites). With all other model parameters [1,2, 11] held constant, it is quite evident that Type II variability is predomin-ant, and therefore synaptic topology very likely plays an important role in determining mEPC varia-bility. Figure 3 shows how mEPC amplitude varied as a function of ACh release site location. The predicted variability of mEPC amplitude and rise time approaches that seen experimentally (e.g., [11 ]), but the variability of decay time is much less than that seen experimentally. Thus, reproduction of the experimental distribution of decay times will probably require that some novel mechanism be added to the model, such as different AChR populations (based on channel opening and closing rates) in different spatial regions of postsynaptic mem-brane. In addition, preliminary simula-tions are currently being run with AChE inhibition, and these may
692
8 0 0
.....
,
~
,
9 .
,
..
,
..
,
9
9
A
-
600
|
-
~
9
,
9
!
9
|
'-
"
B
100
..._m 400 c c ,,c O IX
200 0
1
2
(..1 1 2 5 0 ............
......... ,
<
Q.
O
0
3 .
4
,.......
.... ,,6
5 ,
1000
.
,
,'t .......
7
C
1
2
3
4
5
6
7
0
1
2
3
4
5
6
7
10C
7so
1C
500
1
250
~
0
100(~
~ ......- ~ ~ ,
",,
s"
7
Time
(ms)
Figure 4. Predicted Type I (A and B, 100 mEPCs) and II (C and D, ~-,1000 mEPCs) variability. AChR and AChE distributions were similar to those shown in Fig. 1F. Similar decay times (see text) are shown by similar slopes in B and D.
Figure 5. Predicted mEPC amplitude (open AChR channels) as a function of ACh release location. Each sphere represents a possible synaptic vesicle location (--~3800 total). suggest that a second AChR open channel state is required in the model to account for amplitude increases seen experimentally under these conditions (ACHE inhibitor drugs are used to treat Myasthenia Gravis, an autoimmune disease that attacks AChRs and disrupts NMJ structure [9]).
693 4. COMPUTATIONAL ISSUES: PRESENT AND FUTURE The computational cost of an MCell simulation is mostly dependent on the number of tests for ray/polygon intersections. In a naive algorithm, every mesh element would have to be checked for a potential intersection every time each diffusing molecule moves. Thus, computer time would scale roughly as O(NM), where N is the number of diffusing molecules and M is the total number of mesh elements. This would be untenable as M increases by many orders of magnitude for large-scale models, and therefore space is partitioned into subvolumes that are used to optimize the search for collisions. As a model's spatial complexity increases, more subvolumes can be used so that execution time scales roughly as O(N). Additional subvolumes increase memory use, however, and large models can become memory-bound. For each mEPC simulation as shown in Fig. 3, optimal run-time conditions require ~2 GBytes of RAM and on the order of 10 minutes on a single 64-bit processor. With thousands of ACh release sites and thousands of replicate simulations per site, the number of simulations can easily grow to the scale of 105 and beyond. Thus, total computer time can easily reach the scale of CPU years. Since each simulation is generally an independent task, such projects are well suited to embarrassingly parallel execution. Under certain conditions this can be implemented effectively on the Grid, assuming adequate memory resources, efficient staging of input files, appropriately handled output, and effective scheduling strategies [12]. With the recent growth of large shared memory architectures, another possible route for efficient parallelization is to replicate a particular model many times within a single execution task. In effect, each replicate would be translated into a different region of spatial subvolumes, and the subvolume regions would map to different processors. This approach could be very effective for parameter sweeps and repeated trials run with different random numbers, and would tend to minimize load balancing issues. Future extensions to MCell's present capabilities include simulation of arbitrary chemical interactions between diffusing molecules in solution as well as in membrane environments, and also incorporation of volume meshes with embedded surfaces. These extensions will enable hybrid Monte Carlo/finite element approaches to simulations on dramatically expanded scales of space and time, and, inevitably, will also dramatically expand the computational costs. Such algorithms will pose major challenges for efficient parallelization, since they will require asynchronous, latency-tolerant control to achieve dynamic load balancing. Day-to-day use of such a computational environment will also have to be relatively transparent to a growing community of computational physiologists who are not expert with large computer technology. Although these are difficult problems, the eventual impact on personalized health and medicine will be tremendous.
REFERENCES
[1] J.R. Stiles, D. Van Helden, T.M. Bartol, Jr., E.E. Salpeter and M.M. Salpeter, Proc. Natl. Acad. Sci. USA 93 (1996) 5747.
[2] J.R. Stiles, T.M. Bartol, M.M. Salpeter, E.E. Salpeter and T.J. Sejnowski, in Synapses, W.M. Cowan, C.F. Stevens, T.C. Südhof (eds.), Johns Hopkins Univ. Press, Baltimore, 2001.
[3] J.R. Stiles and T.M. Bartol, in Computational Neuroscience: Realistic Modeling for Experimentalists, E. De Schutter (ed.), CRC Press, Boca Raton, 2001.
[4] G.M. Barrow, Physical Chemistry for the Life Sciences, McGraw-Hill, Inc., New York, 1981.
[5] S.G. Schultz, Basic Principles of Membrane Transport, Cambridge University Press, Cambridge, 1980.
[6] D.T. Gillespie, J. Phys. Chem. 81 (1977) 2340.
[7] D.T. Gillespie, Physica A 188 (1992) 404.
[8] W.M. Cowan, C.F. Stevens, T.C. Südhof (eds.), Synapses, Johns Hopkins Univ. Press, Baltimore, 2001.
[9] M.M. Salpeter (ed.), The Vertebrate Neuromuscular Junction, Alan R. Liss, Inc., New York, 1987.
[10] C. Legay, Micros. Res. Tech. 49 (2000) 56.
[11] J.R. Stiles, I.V. Kovyazina, E.E. Salpeter and M.M. Salpeter, Biophys. J. 77 (1999) 1177.
[12] H. Casanova, T.M. Bartol, J. Stiles and F. Berman, Intern. J. High Perf. Comp. App. 15 (2001) 243.
Cellular automaton modeling of pattern formation in interacting cell systems*

Andreas Deutsch^a, Uwe Börner^b and M. Bär^b

^a Center for High Performance Computing, Dresden University of Technology, Zellescher Weg 12, D-01062 Dresden, Germany
^b Max Planck Institute for the Physics of Complex Systems, Nöthnitzer Str. 38, D-01187 Dresden, Germany

Cellular automata can be viewed as simple models of spatially extended decentralized systems made up of a number of individual components (e.g. biological cells). The communication between constituent cells is limited to local interaction. Each individual cell is in a specific state which changes over time depending on the states of its local neighbors. In particular, cellular automaton models have been proposed for biological applications including ecological, epidemiological, ethological (game theoretical), evolutionary, immunobiological and morphogenetic aspects. Here, we present an overview of cellular automaton models of spatio-temporal pattern formation in interacting cell systems. Finally, we focus on a specific example - rippling pattern formation in myxobacteria - and introduce a cellular automaton model for this phenomenon which is able to lead to testable biological hypotheses.

1. INTRODUCTION: ROOTS OF CELLULAR AUTOMATA

The notion of a cellular automaton originated in the works of John von Neumann (1903-1957) and Stanislaw Ulam (1909-1984). Cellular automata as discrete, local dynamical systems can be equally well interpreted as a mathematical idealization of natural systems, a discrete caricature of microscopic dynamics, a parallel algorithm or a discretization of partial differential equations. According to these interpretations, distinct roots of cellular automata may be traced back in biological modeling, computer science and numerical mathematics, which are well documented in numerous excellent sources [1, 2, 3, 4].

The basic idea and trigger for the development of cellular automata as biological models was a need for non-continuum concepts. There are central biological problems in which continuous (e.g. differential equation) models do not capture the essential dynamics. A striking example is provided by the self-reproduction of discrete units, the cells. In the forties, John von Neumann tried to solve the following problem: which kind of logical organization makes it possible that an automaton (viewed as an "artificial device") reproduces itself? John von Neumann's lectures at the end of the forties clearly indicate that his work was motivated by the self-reproduction ability of biological organisms. Additionally, there was also an impact of achievements in automaton theory (Turing machines) and Gödel's work on the foundations of mathematics, in particular the incompleteness theorem ("There are arithmetical truths which can, in principle,

*This work was partially supported by the Future & Emerging Technologies unit of the European Commission through Project BISON (IST-2001-38923).
never be proven."). A central role in the proof of the incompleteness theorem is played by self-referential statements. Sentences such as "This sentence is false" refer to themselves and may trigger a closed loop of contradictions. Note that biological self-reproduction is a particularly clever manifestation of self-reference [3]. A genetic instruction such as "Make a copy of myself" would merely reproduce itself (self-reference), implying an endless doubling of the blueprint, but not a construction of the organism. How can one get out of this dilemma between self-reference and self-reproduction?

The first model of self-reproduction proposed by von Neumann in a thought experiment (1948) is not bound to a fixed lattice; instead, the system components are fully floating. The key feature of the model is the two-fold use of the (genetic) information as uninterpreted and interpreted data, respectively, corresponding to a syntactic and a semantic data interpretation. The automaton actually consists of two parts: a flexible construction and an instruction unit, referring to the duality between computer and program or, alternatively, the cell and the genome [3]. Thereby, von Neumann anticipated the decoding of the genetic code following Watson's and Crick's discovery of the DNA double helix structure (1953), since interpreted and uninterpreted data directly correspond to molecular translation and transcription processes in the cell. Arthur Burks, one of von Neumann's students, called von Neumann's first model the kinematic model since it focuses on a kinetic system description. It was Stanislaw Ulam who suggested a "cellular perspective" and contributed the idea of restricting the components to discrete spatial cells (distributed on a regular lattice). In a manuscript of 1952/53, von Neumann proposed a model of self-reproduction with 29 states. The processes related to physical motion in the kinematic model are substituted by information exchange between neighboring cells in this pioneering cellular automaton model. Chris Langton, one of the pioneers of artificial life research, reduced this self-reproducing automaton model drastically [5]. Meanwhile, it has been shown that the cellular automaton idea is a useful modeling concept in many further biological situations.

2. CELLULAR AUTOMATON DEFINITION

Cellular automata are defined as a class of spatially and temporally discrete dynamical systems based on local interactions. A cellular automaton can be defined as a 4-tuple (L, S, N, F), where

• L is an infinite regular lattice of cells/nodes (discrete space),
• S is a finite set of states (discrete states); each cell l ∈ L is assigned a state s ∈ S,
• N is a finite set of neighbors, indicating the position of one cell relative to another cell on the lattice L; Moore and von Neumann neighborhoods are typical neighborhoods on the square lattice,
• F is a map

F : S^|N| → S,   (1)
{s_i}_{i∈N} ↦ s,   (2)

which assigns a new state to a cell depending on the states of all its neighbors indicated by N (local rule).
The evolution of a cellular automaton is defined by applying the function F synchronously to all cells of the lattice L (homogeneity in space and time). The definition can be varied, giving rise to several variants of the basic cellular automaton definition. In particular:

• Probabilistic cellular automaton: F is not deterministic but probabilistic, i.e.

{s_i}_{i∈N} ↦ s_j with probability p_j,   (3)
where p_j ≥ 0 and Σ_j p_j = 1.   (4)

• Non-homogeneous cellular automaton: transition rules and/or neighborhoods are allowed to vary for different cells.

• Asynchronous cellular automaton: the updating is not synchronous.

• Coupled map lattice: the state set S is infinite, e.g. S = [0, 1].
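As a concrete illustration of the (L, S, N, F) definition and its probabilistic variant, here is a minimal sketch of a synchronous two-state automaton on a finite square lattice with periodic boundaries and a von Neumann neighborhood. The majority rule and the noise probability are chosen purely for illustration; they are not taken from the text.

import random

def von_neumann_neighbors(x, y, n):
    # the four nearest neighbors N on a periodic n x n square lattice L
    return [((x + 1) % n, y), ((x - 1) % n, y), (x, (y + 1) % n), (x, (y - 1) % n)]

def step(grid, rule):
    # synchronous application of the local rule F to all cells
    n = len(grid)
    new = [[0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            neigh = [grid[i][j] for i, j in von_neumann_neighbors(x, y, n)]
            new[x][y] = rule(grid[x][y], neigh)
    return new

def majority_rule(state, neigh):
    # deterministic F: adopt the majority state of the cell and its neighbors
    return 1 if sum(neigh) + state >= 3 else 0

def noisy_rule(state, neigh, p=0.05):
    # probabilistic variant: apply the deterministic rule, then flip with prob. p
    s = majority_rule(state, neigh)
    return 1 - s if random.random() < p else s

grid = [[random.randint(0, 1) for _ in range(20)] for _ in range(20)]
for _ in range(10):
    grid = step(grid, majority_rule)    # or step(grid, noisy_rule)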
3. CELLULAR AUTOMATON MODELS OF CELL INTERACTION

Cellular automaton models have been proposed for a large number of biological applications including ecological, epidemiological, ethological (game theoretical), evolutionary, immunobiological and morphogenetic aspects. Here, we give an overview of cellular automaton models of pattern formation in interacting cell systems. While von Neumann did not consider the spatial aspect of cellular automaton patterns per se - he focused on the pattern as a unit of self-reproduction - we are particularly concerned with the spatio-temporal dynamics of pattern formation. Various automaton rules mimicking general pattern-forming principles have been suggested and may lead to models of (intracellular) cytoskeleton and membrane dynamics, tissue formation, tumor growth, life cycles of microorganisms or animal coat markings.

Automaton models of cellular pattern formation can be roughly classified according to the prevalent type of interaction. Cell-medium interactions dominate (nutrient-dependent) growth models, while one can further distinguish direct cell-cell and indirect cell-medium-cell interactions. In the latter, communication is established by means of an extracellular field. Such (mechanical or chemical) fields may be generated by tensions or chemoattractants produced and perceived by the cells themselves.

3.1. Cell-medium or growth models
Growth models typically assume the following scenario: a center of nucleation is growing by consumption of a diffusible or non-diffusible substrate. Growth patterns typically mirror the availability of the substrate since the primary interaction is restricted to the cell-substrate level. Bacterial colonies may serve as a prototype, expressing various growth morphologies, in particular dendritic patterns. Various extensions of a simple diffusion-limited aggregation (DLA) rule can explain dendritic or fractal patterns [6]. In addition, quorum-sensing mechanisms based on communication through volatile signals have recently been suggested to explain the morphology of certain yeast colonies [7].
A cellular automaton model for the development of fungal mycelium branching patterns based on geometrical considerations is suggested in [8]. Recently, various cellular automata have been proposed as models of tumor growth [9, 10]. Note that cellular automata can also be used as tumor recognition tools, in particular for the detection of genetic disorders of tumor cells [11].

3.2. Cell-medium-cell interaction models
Excitable media and chemotaxis
Spiral waves can be observed in a variety of physical, chemical and biological systems. Typically, spirals indicate the excitability of the system. Excitable media are characterized by resting, excitable and excited states. After excitation the system undergoes a recovery (refractory) period during which it is not excitable. Prototypes of excitable media are the Belousov-Zhabotinskii reaction and aggregation of the slime mould Dictyostelium discoideum [12, 13]. A number of cellular automaton models of excitable media have been proposed which differ in state space design, actual implementation of diffusion and in the consideration of random effects [14]. A stochastic cellular automaton was constructed as a model of chemotactic aggregation of myxobacteria [15]. Here, a nondiffusive chemical, the slime, and a diffusive chemoattractant are assumed in order to arrive at realistic aggregation patterns.

Turing systems
Spatially stationary Turing patterns are brought about by a diffusive instability, the Turing instability [16]. The first (two-dimensional) cellular automaton of Turing pattern formation based on a simple activator-inhibitor interaction was suggested by Young [17]. Simulations produce spots and stripes (claimed to mimic animal coat markings) depending on the range and strength of the inhibition. Turing patterns can also be simulated with appropriately defined reactive lattice-gas cellular automata [18]. Activator-inhibitor automaton models might help to explain the development of ocular dominance stripes [19]. Ermentrout et al. introduced a model of molluscan pattern formation based on activator-inhibitor ideas [20]. Further cellular automaton models of shell patterns have been proposed (e.g. [21]). An activator-inhibitor automaton also proved useful as a model of fungal differentiation patterns [8].
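The flavor of a two-range activator-inhibitor rule of the kind used by Young [17] can be conveyed by a short sketch: short-range activation (positive weight w1) competes with longer-range inhibition (negative weight w2), and spots or stripes emerge depending on their balance. The radii and weights below are illustrative values, not the parameters of the original paper.

import math, random

def step(grid, r1=2.3, r2=6.0, w1=1.0, w2=-0.1):
    n = len(grid)
    new = [[0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            field = 0.0
            for dx in range(-int(r2), int(r2) + 1):
                for dy in range(-int(r2), int(r2) + 1):
                    d = math.hypot(dx, dy)
                    if d > r2:
                        continue
                    s = grid[(x + dx) % n][(y + dy) % n]
                    field += w1 * s if d <= r1 else w2 * s   # activation vs. inhibition
            # differentiated (1) where net activation wins, undifferentiated (0) otherwise
            new[x][y] = 1 if field > 0 else 0 if field < 0 else grid[x][y]
    return new

grid = [[1 if random.random() < 0.2 else 0 for _ in range(50)] for _ in range(50)]
for _ in range(5):
    grid = step(grid)    # spot or stripe patterns appear depending on w2 and r2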
3.3. Cell-cell interaction models
Differential adhesion
In practice, it is rather difficult to identify the precise pattern-forming mechanism, since different mechanisms (rules) may imply phenomenologically indistinguishable patterns. It is particularly difficult to decide between effects of direct cell-cell interactions and indirect interactions via the medium. For example, one-dimensional rules based on direct cell-cell interactions have been suggested as an alternative model of animal coat markings. Such patterns have traditionally been explained with the help of reaction-diffusion systems based on indirect cell interaction. A remarkable three-dimensional automaton model based on cell-cell interaction by differential adhesion and chemotactic communication via a diffusive signal molecule is able to model aggregation, sorting out, fruiting body formation and motion of the slug in the slime mould Dictyostelium discoideum [23].
Alignment, swarming
While differential adhesion may be interpreted as a density-dependent interaction, one can further distinguish orientation-dependent cell-cell interactions. An automaton model based on alignment of oriented cells has been introduced in order to describe the formation of fibroblast filament bundles [24]. An alternative model of orientation-induced pattern formation based on the lattice-gas automaton idea has been suggested [25]. Within this model the initiation of swarming can be associated with a phase transition [26]. A possible application is street formation of social bacteria (e.g. Myxobacteria). As an illustration of the automaton idea, a discrete model for myxobacterial rippling pattern formation based on cellular collisions will be introduced in section 4.
3.4. Cytoskeleton organization, differentiation
Besides the spatial pattern aspect, a number of further problems of developmental dynamics have been tackled with the help of cellular automaton models. The organization of DNA can be formalized on the basis of a one-dimensional cellular automaton [27]. Microtubule array formation along the cell membrane is the focus of models suggested by Smith et al. [28]. Understanding microtubule pattern formation is an essential precondition for investigations of interactions between intra- and extracellular morphogenetic dynamics. In [29] a rather complicated cellular automaton model is proposed for differentiation and mitosis based on rules incorporating morphogens and mutations. Another automaton model addresses blood cell differentiation as a result of spatial organization [30]. It is assumed in this model that the spatial structure of the bone marrow plays a key role in the control process of hematopoiesis. The problem of differentiation is also the primary concern in a stochastic cellular automaton model of the intestinal crypt [31].

It is typical of many of the automaton approaches sketched in this short overview that they lack detailed analysis; the argument is often based on the sole beauty of simulations - for a long time people were just satisfied with the simulation pictures. This simulation phase in the history of cellular automata, characterized by an overwhelming output of a variety of cellular automaton rules, was important since it triggered a lot of challenging questions, particularly related to the quantitative analysis of automaton models. In the following section, we present a cellular automaton example in which the basic characteristics of the pattern formation dynamics can be grasped by a mean-field theory [32].
4. AN EXAMPLE: A CELLULAR AUTOMATON MODEL OF MICROBIAL PATTERN FORMATION
4.1. Myxobacterial lifecycle
It is usual to divide organisms into two basic categories: unicellular and multicellular. However, looking at unicellular organisms such as bacteria and amoebae, one can see their clear tendency to form multicellular structures, colonies and fruiting bodies. Here we focus on pattern formation of Myxobacteria. Myxobacteria are rod-shaped cells that glide (the precise mechanism is unknown) along their long axis. They exhibit a complex developmental cycle with individual and social phases. As long as there is sufficient food supply, vegetative cells prey, grow and divide as individuals or in small swarms. Under starvation conditions, bacteria start to act cooperatively, aggregate and finally build a multicellular structure, the fruiting body. Fruiting body
formation is often preceded by a periodic pattern originally classified as oscillatory waves and later named rippling [33]. An experimental illustration of the rippling phenomenon is shown in Fig. 1a. Bacteria organize into equally spaced ridges (dark regions) that are separated by regions with low cell density (light regions). We examined the temporal dynamics of the density profile along a one-dimensional cut indicated by the white line in Fig. 1a. The resulting space-time plot reveals a periodically oscillating standing wave pattern (Fig. 1b). Several experimental studies (e.g. [34]) report periodic rippling patterns with wavelengths between 45 and 100 µm, wave velocities between 2 and 11 µm min^-1 and temporal periods between 8 and 20 min. Typical cell lengths vary between 5 and 10 µm, so that the rippling wavelength corresponds to 10-20 cell lengths. In addition, single cells have been found to move unidirectionally with the ripple waves in a typical back-and-forth manner [34]; they reverse their direction of motion with a mean reversal frequency of about 0.1 reversals min^-1.

Figure 1. (a) Snapshot from a rippling sequence in myxobacteria taken from a time-lapse movie (H. Reichenbach, Braunschweig). Ridges of cells (dark regions) are separated by regions with lower density (white). White bar: 300 µm. (b) Space-time plot of the density profile along the white line in (a). Wavelength is 105 µm, temporal period is 10 min.

Intercellular communication is essential in order to maintain a complex life cycle as exhibited by Myxobacteria. There are at least five extracellular components (known as A-, B-, C-, D- and E-signal) that enable cells to coordinate their behavior. The latest acting signal during development is C-signal, presumably identical to a cell membrane-bound protein called C-factor. C-factor is proposed to play the key role in the formation of ripples. Addition of C-factor (which can be extracted from rippling cells) increases the mean reversal frequency of cells [34]. C-signaling occurs when cells are in end-to-end contact [34]. The C-factor protein is encoded by the csgA gene; csgA mutants (csgA-), i.e. mutants that carry a mutation in the csgA gene, are unable to ripple and aggregate. Sager and Kaiser put forward the following hypothesis: when two oppositely moving cells collide head-on, they reverse their gliding direction due to exchange of C-factor. In order to test this hypothesis we have designed a mathematical model.

4.2. The cellular automaton model
The discrete model for the formation of ripple patterns is based on the dynamics of individual cells. It is defined on a regular cubic lattice assuming discrete space and time coordinates, analogous to cellular automaton models (Fig. 2a). The spatial lattice constants are chosen in such a way that bacterial cells (assumed to be equally sized) cover exactly one node. Allowed cell
positions are (i) directly on the substrate (the x-y plane) or (ii) on top of other cells. This reflects the experimental situation in which cells glide on a suitable surface and are organized in heaps. The total number of cells is constant (absence of replication and death).

Figure 2. (a) Exemplary configuration of the model; the cell orientation is indicated by arrows. (b) The interaction neighborhood is a five-node cross in the y-z plane at the x-position the cell is directed to (here the cell orientation is the +x-direction). (c) Illustration of the migration rules. Cells move one after the other (the labeled (*) cell is the next in the row), occasionally lifting other cells (1), falling down (2) or slipping away (3).

The basic rules of the model are derived from the experimental results and the hypothesis described in the previous section. We extend this hypothesis by assuming cells to be temporarily refractory after reversal. As we will show, this refractory phase is the most important ingredient for rippling.

(i) Movement of cells is confined to the ±x-direction; a variable o ∈ {-1, 1} describing the orientation of movement is associated with every cell. Once per time step all cells move sequentially (in a fixed order) to the neighboring node according to their orientation. Exception scenarios are illustrated in Fig. 2c. All cells are assumed to glide with equal speed (fluctuations of this quantity are due to a small stop probability p) of one cell length per minute. Thus a time step corresponds to roughly one real-time minute.

(ii) After the migration of all cells, synchronous interaction takes place. In the model, a head-on collision is defined as the existence of at least one cell B with opposite orientation in the (orientation-dependent) neighborhood of cell A (Fig. 2b). We assume cells to be either sensitive or refractory. Refractory cells do not respond to C-factor, i.e. only sensitive cells reverse when involved in collisions. After a cell has reversed it is temporarily refractory.
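The migration/interaction cycle just described can be sketched in a few lines. The following toy version is a drastic one-dimensional simplification, assuming a periodic track and ignoring both the vertical stacking of cells and the five-node interaction cross of the full model; the parameter values are illustrative only.

import random

L, N_CELLS, TAU, P_STOP, STEPS = 200, 600, 5, 0.05, 500

# each cell: [position, orientation (+1/-1), remaining refractory time]
cells = [[random.randrange(L), random.choice((-1, 1)), 0] for _ in range(N_CELLS)]

for _ in range(STEPS):
    # (i) sequential migration with a small stop probability
    for c in cells:
        if random.random() > P_STOP:
            c[0] = (c[0] + c[1]) % L
    # (ii) synchronous interaction: sensitive cells reverse on head-on collision
    occupied = {}
    for c in cells:
        occupied.setdefault(c[0], []).append(c[1])
    for c in cells:
        ahead = (c[0] + c[1]) % L
        collision = any(o == -c[1] for o in occupied.get(ahead, []))
        if collision and c[2] == 0:
            c[1] = -c[1]            # reverse gliding direction (C-factor exchange)
            c[2] = TAU              # become refractory
        elif c[2] > 0:
            c[2] -= 1               # recover sensitivity

density = [0] * L
for c in cells:
    density[c[0]] += 1              # ridge/valley structure appears for TAU >= ~4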
4.3. Simulation results
The duration of the refractory phase τ turns out to be the most important quantity for ripple formation. Regular patterns can be observed for refractory times τ ≥ τ_c ≈ 4 time steps. Two counter-propagating travelling waves of about equal amplitude form a standing wave, in agreement with the experimental observations. For refractory times τ < τ_c we still observe pieces of waves without long-range correlations in space or time; for vanishing refractory time, cells exhibit fluctuations near a homogeneous density state (Fig. 3a-f). Wavelength and wave period of the simulated ripples increase with τ [32]. The experimental values of wavelength and temporal period of the rippling pattern (Fig. 1) are reproduced with a refractory time of 4-5 min.
Our results show that cell reversal after head-to-head collisions between cells is an appropri-
ate mechanism for ripple formation and maintenance only if it is supplemented by a refractory phase. So far there is no direct experimental proof for the existence of a refractory phase. However, experiments with bacterial cells exposed to isolated C-factor revealed an increased absolute reversal rate of roughly 0.3 reversals per cell and minute [34]. This can be interpreted as an upper bound for the reversal rate, resulting from a refractory phase of ca. 3-4 min. Little is known from experiments about the height of cell heaps during rippling. In the latest experiments approx. 10^6 cells formed a circular spot of 6.5 mm^2 [35]. Assuming an average cell area of 16 µm^2 (length 8 µm, width 2 µm), this corresponds to 2.5 cell layers, neglecting that the cell density is higher within the edges [35]. The results presented here are obtained for an average number of n̄ = 3 cells per substrate site. However, the results depend only weakly on the number of cells in the aggregate; variations of n̄ between 2 and 10 do not produce significant changes.

One advantage of our cell-based model is the possibility to mark and track single cells. In the simulations, cells are found to move about a distance of half a wavelength before reversing [32], perfectly reflecting the back-and-forth movement of cells in the experiment. In fact, travelling waves reflect each other, leaving the crest shape unchanged. While most of the cells ride with the crest and are reflected when the crest collides with a crest moving in the opposite direction, occasionally a cell does not find an interaction partner and continues travelling in the original direction. Our model simulations suggest further biological studies of single cell motility to verify the refractory hypothesis and to decipher its biochemical basis.

Figure 3. (a) Simulation snapshot in a system of size 100 x 50 containing 15000 cells with a refractory phase τ = 1 time step after ca. 5000 time steps (black corresponds to high cell columns). (b), (c) Same as (a) with τ = 3 and τ = 5, respectively. In (d), (e), (f) we show the corresponding space-time plots along 100 substrate sites in the x-direction over 100 time steps.
5. DISCUSSION

After giving a short overview of cellular automaton models for various cell interactions, we focused on the presentation of a specific example. A discrete model was introduced that was originally designed to discover the cellular interactions which are essential for rippling pattern formation in myxobacteria. On the basis of the "refractory period hypothesis" we have developed modifications of the basic model, which also reproduce recent new experimental results. In particular, the latest experiments have revealed a double-peak in the reversal time distribution
which is reproduced by our model modification. In the corresponding model, the phase velocity depends on the number of counter-moving cells in the neighborhood. We have studied the effect of cooperativity by means of a nonlinear dependency. Appropriate phase velocity functions reproduce not only the macroscopic rippling pattern but also microscopic properties, especially the double-peak of the reversal time distribution.

Other models for the formation of rippling patterns have been proposed. A continuous model [36] with similarities to our model contains additional assumptions - like an internal biochemical clock - which may play a role but, as our results show, is not essential for rippling. Another continuous approach [37] does not assume refractoriness and predicts characteristic density-dependent reversal rates. Both models disregard the discreteness of the interacting cells, which is important at small cell densities. Moreover, myxobacterial rippling provides the first example of pattern formation mediated by migration and direct cell-cell interaction, which may also be involved in myxobacterial fruiting body formation as well as in self-organization processes in other multicellular systems.

The example illustrates the potential of cellular automaton modeling of interacting cell systems. As a cell-based model, a cellular automaton allows manipulation of individual cells, which in particular enables the simulation of the introduction of mutants. Parallelization of the algorithms is straightforward for synchronous cellular automata; simulations are fast and allow large cell numbers to be followed. In simplified cell interaction models, stability analysis can be performed [26, 32, 38]. Cell size and the fastest biological process to be modeled determine the resolution of the cellular automaton model. The potential and versatility of cellular automaton models, along with the availability of more and more "cellular data" (at the genetic and proteomic level), offer a promising approach to analyse self-organization in interacting cell systems.

REFERENCES
[1] F. Bagnoli. Cellular automata. In F. Bagnoli, P. Lio and S. Ruffo, editors, Dynamical modelling in biotechnologies. World Scientific, Singapore, 1998.
[2] J. L. Casti. Alternate realities. John Wiley, New York, 1989.
[3] K. Sigmund. Games of life - explorations in ecology, evolution, and behaviour. Oxford University Press, Oxford, 1993.
[4] S. Wolfram, editor. Theory and applications of cellular automata. World Scientific Publishing Co., Singapore, 1986.
[5] C. G. Langton. Physica D, 10:135-144, 1984.
[6] T. A. Witten and L. M. Sander. Phys. Rev. Lett., 47(19):1400-1403, 1981.
[7] T. Walther, A. Grosse, K. Ostermann, A. Deutsch and T. Bley. J. Theor. Biol., 2004, to appear.
[8] A. Deutsch. A novel cellular automaton approach to pattern formation by filamentous fungi. In L. Rensing, editor, Oscillations and morphogenesis, chapter 28, pp. 463-480. Marcel Dekker, New York, 1993.
[9] D. Drasdo, S. Höhme, S. Dormann and A. Deutsch. Cell-based models of avascular tumor growth. In A. Deutsch, M. Falcke, J. Howard and W. Zimmermann, editors, Function and regulation of cellular systems: experiments and models. Birkhäuser, Basel, 2003.
[10] J. Moreira and A. Deutsch. Adv. Compl. Syst. (ACS), 5(2):1-21, 2002.
[11] J. H. Moore and L. W. Hahn. Multilocus pattern recognition using cellular automata and parallel genetic algorithms. In Proc. of the Genetic and Evolutionary Computation Conference (GECCO-2001), p. 1452, 7-11 July 2001.
[12] M. Gerhardt, H. Schuster and J. J. Tyson. Science, 247:1563-1566, 1990.
[13] J. C. Dallon, H. G. Othmer, C. v. Oss, A. Panfilov, P. Hogeweg, T. Höfer and P. K. Maini. Models of Dictyostelium discoideum aggregation. In W. Alt, A. Deutsch and G. Dunn, editors, Dynamics of cell and tissue motion, pp. 193-202. Birkhäuser, Basel, 1997.
[14] M. Markus and B. Hess. Nature, 347(6288):56-58, 1990.
[15] A. Stevens. Simulations of the gliding behaviour and aggregation of myxobacteria. In W. Alt and G. Hoffmann, editors, Biological Motion, Lecture Notes in Biomathematics, pp. 548-555. Springer, Berlin, Heidelberg, 1990.
[16] A. M. Turing. Phil. Trans. R. Soc. London, 237:37-72, 1952.
[17] D. A. Young. Math. Biosci., 72:51-58, 1984.
[18] S. Dormann, A. Deutsch and A. T. Lawniczak. Fut. Gener. Comp. Syst., 17:901-909, 2001.
[19] N. V. Swindale. Proc. Roy. Soc. London Ser. B, 208:243-264, 1980.
[20] B. Ermentrout, J. Campbell and G. Oster. The Veliger, 28(4):369-388, 1986.
[21] I. Kusch and M. Markus. J. Theor. Biol., 178:333-340, 1996.
[22] N. S. Goel and R. L. Thompson. Computer simulation of self-organization in biological systems. Croom, Melbourne, 1988.
[23] N. J. Savill and P. Hogeweg. J. Theor. Biol., 184:229-235, 1997.
[24] L. Edelstein-Keshet and B. Ermentrout. J. Math. Biol., 29:33-58, 1990.
[25] A. Deutsch. J. Biol. Syst., 3:947-955, 1995.
[26] H. Bussemaker, A. Deutsch and E. Geigant. Phys. Rev. Lett., 78:5018-5021, 1997.
[27] C. Burks and D. Farmer. Towards modelling DNA sequences as automata. In D. Farmer, T. Toffoli and S. Wolfram, editors, Cellular automata: Proceedings of an interdisciplinary workshop, New York, 1983, pp. 157-167. North-Holland Physics Publ., Amsterdam, 1984.
[28] S. A. Smith, R. C. Watt and R. Hameroff. Physica D, 10:168-174, 1984.
[29] H. F. Nijhout, G. A. Wray, C. Krema and C. Teragawa. Syst. Zool., 35:445-457, 1986.
[30] R. Mehr and Z. Agur. BioSystems, 26:231-237, 1992.
[31] C. S. Potten and M. Löffler. J. Theor. Biol., 127:381-391, 1987.
[32] U. Börner, A. Deutsch, H. Reichenbach and M. Bär. Phys. Rev. Lett., 89:078101, 2002.
[33] H. Reichenbach. Ber. Deutsch. Bot. Ges., 78:102, 1965.
[34] B. Sager and D. Kaiser. Genes & Development, 8:2793, 1994.
[35] R. Welch and D. Kaiser. Proc. Natl. Acad. Sci. USA, 98:14907, 2001.
[36] O. Igoshin, A. Mogilner, R. Welch, D. Kaiser and G. Oster. Proc. Natl. Acad. Sci. USA, 98:14913, 2001.
[37] F. Lutscher and A. Stevens. J. Nonlin. Sci., 12(6):619-640, 2002.
[38] A. Deutsch and S. Dormann. Cellular automaton modeling of biological pattern formation. 2004.
Numerical Simulation for eHealth: Grid-enabled Medical Simulation Services

S. Benkner^a, W. Backfrieder^b, G. Berti^c, J. Fingberg^c, G. Kohring^c, J.G. Schmidt^c, S.E. Middleton^d, D. Jones^e, and J. Fenner^e

^a Institute for Software Science, University of Vienna, Vienna, Austria
^b Department of Biomedical Engineering and Physics, University of Vienna, Austria
^c C&C Research Laboratories, NEC Europe Ltd., St. Augustin, Germany
^d IT Innovation Centre, University of Southampton, UK
^e Medical Physics and Engineering, The University of Sheffield, UK

The European GEMSS Project* is concerned with the creation of medical Grid service prototypes and their evaluation in a secure service-oriented infrastructure for distributed on-demand super-computing - the GEMSS test-bed. The medical prototype applications include maxillofacial surgery simulation, neuro-surgery support, radio-surgery planning, inhaled drug-delivery simulation, cardio-vascular simulation and tomographic image reconstruction. GEMSS will enable the wide-spread use of these computationally demanding tools originating from projects such as BloodSim, SimBio, COPHIT and RAPT as Grid services. The numerical High-Performance Computing core includes parallel Finite Element software and Computational Fluid Dynamics simulation as well as parallel optimization methods and parallel Monte Carlo simulation. Targeted end-user groups include bio-mechanics laboratories, hospital surgery/radiology units, diagnostic and therapeutic departments, designers of medical devices in industry as well as small enterprises involved in consultancy on bio-medical simulations. GEMSS addresses security, privacy and legal issues related to the Grid provision of medical simulation and image processing services and the development of suitable business models for sustainable use. The first prototype of the GEMSS middleware is expected to be released in February 2004.

1. INTRODUCTION

Computationally demanding methods such as parallel Finite Element Modelling, parallel Computational Fluid Dynamics and parallel Monte Carlo simulation are at the core of many advanced bio-medical simulation applications. Often, however, such applications have a very limited clinical impact, because there is no convenient or practical way for the typical medical end user to access the necessary software and hardware resources. Grid technology has the potential to provide medical practitioners and researchers with access to advanced simulation and image processing services for improved pre-operative planning and near real-time surgical support by providing on-demand transparent access to remote HPC hardware and services over

*GEMSS, EC project number IST-2001-37153, is a 30-month project which commenced in September 2002. Project web site: http://www.gemss.de
the Internet. The Grid will also allow computational resources to be brought to the medical technology industry, already using bio-numerics for virtual prototyping, but requiring larger-scale compute resources due to the growth in the complexity of design problems made tractable by numerical methodology advances.

The European GEMSS Project [7] is concerned with the creation of medical Grid service prototypes and their evaluation in a secure service-oriented infrastructure for distributed on-demand super-computing. The medical prototype applications include maxillo-facial surgery simulation, neuro-surgery support, radio-surgery planning, inhaled drug-delivery simulation, cardio-vascular simulation and advanced image reconstruction. GEMSS will enable the widespread use of these computationally demanding tools originating from projects such as BloodSim [4], SimBio [8], COPHIT [5] and RAPT [6] as Grid services.

The GEMSS Grid infrastructure and middleware is being built on top of existing Grid and Web technologies, maintaining compliance with standards, thereby ensuring future extensibility and interoperability. Furthermore, GEMSS aims to anticipate privacy, security and other legal concerns by examining and incorporating into its Grid services the latest laws and EU regulations related to providing medical services over the Internet.

The rest of this paper is organized as follows: Section 2 discusses the provision of medical simulation services via the Grid. Section 3 presents an overview of the GEMSS Grid infrastructure currently under development. Section 4 discusses the HPC simulations considered in GEMSS and outlines how these applications are being Grid-enabled. Section 5 discusses related work, followed by conclusions in Section 6.

2. PROVIDING MEDICAL SIMULATION SERVICES VIA THE GRID

GEMSS is concerned with creating an environment in which computationally demanding tools relevant to the health sector can be made easily accessible to a wide spectrum of users.
2.1. Benefits of grid-enabled simulation services
In a Grid scenario a medical end-user (client) would simply use a browser or client software to access a service via a provider's portal or server. In this way doctors, clinicians or medical researchers can be provided with advanced tools at their workplace through easy-to-use interfaces, without requiring their institutions to invest in expensive HPC hardware and related IT or engineering specialist know-how. End users will only pay a negotiated price per use. Furthermore, a reliable, pervasive and interoperable Grid infrastructure can accommodate new services and updates to existing ones immediately as they become available.

2.2. GEMSS high-level grid architecture
The GEMSS architecture is based on a client/server topology employing a service-oriented architecture as shown in Figure 1. It relies on Web Service technology that will allow integration, via OGSA [13], with various hosting platforms such as UNICORE [21] and GLOBUS [12]. At its simplest, the Grid client is an internet-enabled personal computer loaded with client software that permits communication with a service provider through the GEMSS middleware. The client-side applications handle the creation of the service input data and visualization of the service output data.

Service providers expose numerical simulation applications running on HPC hardware as generic application services accessible over the network. Generic application services provide
Figure 1. GEMSS High-Level Architecture

Figure 2. Three step job execution process (business step: set up account, authorise payment, monitor billing information, choose pricing model, choose license model; negotiation step: estimate job capacity, negotiate QoS, exchange contracts; execution step: upload input data, start job, monitor job, download results)
support for quality of service negotiation, data staging, job execution, job monitoring, and error recovery. One or more service registries will be employed, holding a list of service providers and the services they support. The certificate authority, which is usually managed by a third party, provides certificate authentication after appropriate identity checks. In order to provide a custom interface for running Grid jobs, an optional client portal may be utilized. However, even if the GEMSS environment is accessed via a portal, the client will still be billed for jobs run through a portal, making it the responsibility of the client to pass these costs on to its customers.

The GEMSS design supports a three-step process for job execution (see Figure 2). In the initial business step, accounts are opened, payment details are fixed and a pricing model can be chosen. Next, a job's quality of service and price, if not subject to a fixed price model, is negotiated and agreed. Once a contract is in place, the job itself can be submitted and executed.

2.3. Prerequisites for Grid-enabling medical simulation applications
In order to Grid-enable a simulation application under the GEMSS environment, it must be possible to partition the application between client and server. Data must be separated from the application software, which implies that I/O operations access files only by relative path names. In order to comply with the GEMSS QoS model, a machine-specific performance model must be provided for each application, which allows the determination of performance estimates (e.g. run time, memory requirements, etc.) on the basis of meta data characterising a service request. Support for error recovery, if required, must be ensured through an appropriate checkpoint/restart facility within the application. Moreover, the application must be compliant with the GEMSS security model and must support the integral business model as well as the Quality of Service model as described in the following sections.
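As an illustration of what such a machine-specific performance model might look like in its simplest form, the sketch below interpolates a small table of calibration runs to turn request meta data into run-time and memory estimates. The field names and numbers are invented for illustration and are not taken from GEMSS.

from bisect import bisect_left

# measured on one target machine: (problem size, run time [s], memory [MB])
CALIBRATION = [(64, 30.0, 250), (128, 240.0, 900), (256, 1900.0, 3400)]

def estimate(request):
    """Rough QoS estimate for a request described by its meta data."""
    size = request["image_size"] * request.get("iterations", 1)
    sizes = [s for s, _, _ in CALIBRATION]
    i = min(bisect_left(sizes, size), len(sizes) - 1)
    lo, hi = CALIBRATION[max(i - 1, 0)], CALIBRATION[i]
    if hi[0] == lo[0]:
        return {"runtime_s": hi[1], "memory_mb": hi[2]}
    w = (size - lo[0]) / (hi[0] - lo[0])       # linear interpolation weight
    return {"runtime_s": lo[1] + w * (hi[1] - lo[1]),
            "memory_mb": lo[2] + w * (hi[2] - lo[2])}

print(estimate({"image_size": 96}))   # feeds the QoS negotiation with a bid basis

In practice the calibration table would be populated from test cases, in line with the database of problem parameters versus resource needs mentioned in Section 3.2 below.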
3. THE GEMSS INFRASTRUCTURE

The GEMSS middleware exposes medical simulation applications installed on various Grid hosts as services which support a common set of methods for data staging, remote job management, error recovery, and QoS support. GEMSS services are defined via WSDL and securely accessed using SOAP messages. HTTPS will be initially used to securely transmit SOAP messages, and the WS-Security standard will be applied where possible. For large file transfers SFTP will be investigated, as well as SOAP attachments.

3.1. Client and service provider infrastructure
Figure 3 depicts the main architectural components of the GEMSS infrastructure for the client and the service provider. On the service provider side, medical simulation applications are exposed as Web Services and hosted using a secure web server and a service container (usually Apache and Tomcat
Axis). Services are executed under the direction and orchestration of the client, subject to availability of resources and authority to use them. The quality of service management component handles reservation with the resource manager (job scheduler) and provides input to the quality of service negotiation process so that sensible bids can be made in response to client job requests. The error recovery component handles checkpointing and re-starting of services if required. The logger component manages a database for logging auditable information and performs most of the low-level event logging for intrusion detection. The service state repository component manages a conversational state database that contains information about any client-service conversation, allowing it to be resumed at a later time if the user logs off. The provision of applications is based on the concept of generic application services, as described in detail in the next section.

Figure 3. GEMSS Client (left-hand side) and Service Provider Infrastructure (right-hand side)

The GEMSS client application programming interface hides most of the complexity of dealing with the remote services from the application developer by providing appropriate service proxies. Service proxies are in charge of discovering services and negotiating with grid services to run jobs. The session management component manages client sessions and maintains a security context, authenticating the current user and providing access criteria for the certificate and key stores. A service discovery component is available for looking up suitable services in a registry. The workflow enactment component is in charge of running business and negotiation workflows, as well as service orchestration. The client typically runs a business workflow to open negotiations with a set of service providers for a particular application. The quality of service negotiation workflow is then run to request bids from all interested service providers who can run the client's jobs subject to the QoS criteria required by the client; this results in a contract being agreed with a single service provider. The client then uploads the job input data to the service provider and starts the server-side application by calling appropriate methods of the service. The client infrastructure is based on a pluggable client-side component framework which is described in detail in Section 3.3.

3.2. QoS-enabled generic application services
A generic application service is a configurable software component which exposes a native application as a service to be accessed by multiple remote clients over the Internet. A generic application service provides common methods for data staging, remote job management, error
recovery, and QoS support, which are to be supported by all GEMSS services. In order to customize the behavior of these methods for a specific GEMSS application, an XML application descriptor has to be provided. Besides general information about a medical simulation service, the application descriptor specifies the input/output files, the script for initiating job execution, and a set of performance-relevant application parameters required for QoS support.

GEMSS services support a model and process for agreeing, dynamically and on a case-by-case basis, various QoS properties, including service completion time, cost, availability and others. For this purpose, each GEMSS service provides methods for enabling clients to negotiate required QoS properties with a service provider before actually consuming a service. QoS-enabled GEMSS services are capable of providing an estimated service completion time, based on meta data about a specific service request (e.g. image size, required accuracy, etc.) supplied by the client. The QoS support infrastructure for GEMSS is based on QoS contracts (XML), request descriptors (XML), and performance models, and relies on a resource scheduler that supports advance reservation.

For each GEMSS application, the service provider or application developer has to specify a set of application-specific performance parameters in the XML application descriptor. For example, in the case of an image reconstruction service, performance parameters typically include image size and required accuracy (i.e. number of iterations). In order to compute various QoS properties (e.g. service completion time) on the basis of the specified performance parameters, a machine-specific performance model has to be provided. Since, in general, it will not be possible to build a simple analytical performance model for all GEMSS applications, we plan to build a database relating typical problem parameters to resource needs like main memory, disk space and running time, which will initially be populated using data from test cases. During QoS negotiation, the client has to supply a request descriptor containing concrete values for all performance-relevant parameters specified in the application descriptor. Moreover, the client has to pass to the service provider an initial QoS contract, specifying the required QoS properties and other conditions to use the service. On the service side, the request descriptor is fed into the performance model in order to determine whether or not the client's QoS requirements can be fulfilled.

A generic application service is realized as a Java component, which is transformed automatically into a Web Service with corresponding WSDL descriptions and customized for a specific GEMSS application using the XML application descriptor. In order to provide a native GEMSS application as a Web/Grid Service, the application has to be installed on some Grid host, and a job script to start the application as well as an XML application descriptor have to be provided. Finally, the generic application service has to be deployed within an appropriate hosting environment, e.g. Apache Tomcat/Axis. As a result, the native application will be embedded within a generic application service and will be accessible over the Internet. In the future, GEMSS services will be extended in order to be compliant with the Open Grid Services Architecture (OGSA).
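For concreteness, here is a hypothetical example of the kind of XML application descriptor described above, wrapped in a few lines of Python that extract the performance-relevant parameters. The element and attribute names are invented for illustration and are not the actual GEMSS schema.

import xml.etree.ElementTree as ET

DESCRIPTOR = """
<application name="image-reconstruction">
  <input  file="sinogram.raw"/>
  <output file="volume.raw"/>
  <job script="run_recon.sh"/>
  <performance>
    <parameter name="image_size"/>
    <parameter name="iterations"/>
  </performance>
</application>
"""

root = ET.fromstring(DESCRIPTOR)
perf_params = [p.get("name") for p in root.find("performance")]
print(root.get("name"), perf_params)   # the meta data the client must supply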
3.3. Client side component framework
On the client side, the GEMSS Project has chosen a flexible approach built around an SDK consisting of pluggable components. Using the SDK, developers can easily integrate Grid functionality into existing applications. The SDK is conservative, in that it does not expose any low-level Grid concepts which applications should not need to deal with, while at the same time providing applications full control over higher-level abstractions. Among the high-level components identified to date are: Business Processes, Service Discovery, QoS Negotiation, and Workflow Enactment. The GEMSS SDK exposes the interfaces for these components to the applications, while keeping their underlying implementation in terms of Web Services hidden from view.

All the components making up the SDK are pluggable in the sense that they can be replaced by any implementation which supports the well-defined public interfaces and properly replicates a component's documented behaviors. Support exists for simultaneously using multiple components from different providers, as well as the run-time exchange and update of the individual components. It is also planned to introduce an autonomic maintenance system in order to relieve users of the burden of downloading and installing updates to various components which are being developed by different institutions.

Through the SDK, applications can use the system in one of two ways: either execute each component separately, thereby exercising full control over the process of service discovery, QoS negotiation and service invocation, or use a convenience component to automatically find the desired service, perform the QoS negotiation and create a service proxy. Either way, at the end of this process the application has a service proxy through which it can communicate directly with the desired service. Although most applications will use the SDK to interact mainly with GEMSS services, support is provided for interacting with any standard Web Service. (Additional support for working with OGSA services will be incorporated later.) Finally, the SDK also includes methods for application-dependent session handling. Applications can designate those service proxies or other objects which are to constitute a "session" and then have them serialized for later re-instantiation. This provides a lightweight incarnation of a persistent environment useful for working with the types of long-running services commonly found in GEMSS.

3.4. Business models for a medical grid
With an eye to possible exploitation, GEMSS will maintain a flexible business model to allow commercial operation of Grid services. There is a concept of clients having an account with each service provider, allowing payment details to be provided and monthly bills generated for Grid use. Business models considered for GEMSS include airline reservation/business models and telephone business models. A negotiation model is also supported so a client can shop around, negotiating with all the service providers who provide the required service to get the best deal. Within each negotiation the quality of service terms associated with the required job can be discussed, as well as the price involved. The aim for GEMSS is to provide a viable and flexible approach to operating a commercial Grid.
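Bringing Sections 3.3 and 3.4 together, the following sketch shows how a client might drive the discover / negotiate / execute cycle in the simplest possible way. The class and function names are invented for illustration and do not correspond to the actual GEMSS SDK.

class Bid:
    def __init__(self, provider, completion_time_s, price):
        self.provider = provider
        self.completion_time_s = completion_time_s
        self.price = price

def negotiate(providers, request, deadline_s):
    """Collect bids from all providers and accept the cheapest one in time."""
    bids = [p.make_bid(request) for p in providers]
    feasible = [b for b in bids if b.completion_time_s <= deadline_s]
    return min(feasible, key=lambda b: b.price) if feasible else None

class DemoProvider:
    def __init__(self, name, speed, rate):
        self.name, self.speed, self.rate = name, speed, rate
    def make_bid(self, request):
        t = request["work_units"] / self.speed
        return Bid(self, t, t * self.rate)

providers = [DemoProvider("hpc-a", 50.0, 2.0), DemoProvider("hpc-b", 20.0, 1.0)]
best = negotiate(providers, {"work_units": 1000.0}, deadline_s=30.0)
if best is not None:
    print("contract with", best.provider.name, "at price", best.price)
    # a service proxy would now upload the input data, start and monitor the
    # job, and download the results, as in the three-step process of Figure 2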
3.5. Privacy and security for grid-based medical data
Since GEMSS is concerned with the processing of highly confidential and private information, such as images of patient heads and commercially sensitive fluid dynamic models, there are serious privacy issues to consider. Within GEMSS, EU law is being examined to identify
Figure 4. The halo device for distraction
Figure 5. Patient before distraction
Figure 6. After distraction (geometric linear simulation)
privacy, contractual and ethical issues. To operate within EU law an appropriate level of security must be applied, with medical data anonymised where possible and not held for longer than is required to achieve the purpose of the Grid processing. Grid security must be periodically reviewed, and patients must be made aware of the processing that will occur and be able to review and correct the information held about them. It is expected that future Grid processing would be integrated into a hospital's existing data processing procedures.

The GEMSS Grid infrastructure will employ commercial off-the-shelf (COTS) technology, making full use of well-tested technology, regular security patches and best-practice security procedures. Such security technologies will be evaluated with respect to the GEMSS Grid, in addition to the creation of a methodology for assessing the security needs at each GEMSS partner's site. The full GEMSS infrastructure is designed to provide a public key infrastructure, X.509 compliance, RSA encryption, service-level authorization, logging and intrusion detection. The GEMSS Grid is designed to work with existing site firewalls, and does not require any insecure ports to be opened through them.

4. HIGH PERFORMANCE MEDICAL SIMULATION APPLICATIONS

Medical simulation applications considered within GEMSS include maxillo-facial surgery planning, neuro-surgery support, medical image reconstruction, radiosurgery planning and fluid simulation of the airways and cardiovascular system. In the following we discuss these applications and outline their realization within the GEMSS environment as HPC services.

4.1. Maxillo-facial surgery simulation
In patients suffering from severe facial malformations like maxillary hypoplasia (see Fig. 5) and retrognathia, conventional therapeutic surgery often fails to guarantee long-term stability. The use of a rigid external distraction system for midfacial distraction osteogenesis (Fig. 4) is a new method to correct the underdevelopment of the midface, surpassing traditional orthognathic surgical approaches for these patients. The treatment consists of a midfacial osteotomy (bone cutting) followed by a halo-based distraction (pulling) step. Currently surgical planning is only based on CT (computed tomography) images, and the surgeon's experience is the only means of estimating the outcome of the treatment. The goal of this application is the modeling of this distraction process to allow predictions of its outcome and enable the medical user to try out several treatments in silico before selecting the most promising plan. The tools required to perform such a simulation form a complex workflow graph. First, the
CT image data may have to be filtered to reduce noise effects like streak artifacts due to teeth fillings, and must be segmented into different tissue classes. Next, the data set has to be visualized, and the surgeon has to specify cuts and distraction parameters. From these parameters a geometric model of the surgical operation is generated as a basis for the following simulation of the facial distraction process by an FEM (Finite Element Method) application. Finally, the results of this simulation (Fig. 6) can be visualized and used by an expert to evaluate the quality of the treatment plan. Earlier versions of the FEM package and the mesh generation tool are described in [9].

Most tasks of this toolchain are demanding in terms of memory and computation time. The FEM simulation task stands out as the most critical and challenging component in this simulation. A typical mesh (already clipped to the region of interest) involves about 0.5M elements. This number is likely to increase when the soft tissue is modeled in more detail. Thus, Grid computing is needed to obtain results with sufficient accuracy and reliability for clinical use on remote HPC facilities. Currently, a 64-node PC cluster located at NEC's C&C Research Laboratories in St. Augustin is used. The FEM code has been parallelized for distributed memory machines based on MPI, using DRAMA [10] as mesh partitioning library.

Proper modeling of the head turns out to be the most delicate problem, both in terms of correct reconstruction of patient geometry from the image data and in terms of the appropriate modeling of material properties. Currently, soft tissues are modeled as a single type using either hyperelastic or visco-elastic material laws. The next step is to further distinguish skin, fat, and muscles. Different discretizations, such as (tri-)linear elements, quadratic elements, and elements with internal degrees of freedom using the enhanced assumed strain (EAS) approach, have been compared. The choice of the discretization turns out to be very important. While linear elements are quite robust and need less time and memory than the other elements, the results are rather inaccurate. The quadratic elements yield very accurate results, but need more memory and solution time. The EAS elements seem to be the best choice: they need more time than the linear elements, but only the same amount of memory, and lead to highly accurate results. A comparison of algebraic multigrid (AMG) and Krylov solvers confirmed the well-known result that AMG solvers are optimal in the sense that the number of required floating point operations is proportional to the number of unknowns in the discretization.
4.2. Quasi real-time neurosurgery support by non-linear image registration
Assistance for image-guided surgical planning by correction of the brain shift phenomenon is the focus of this application. The occurrence of surgically induced deformations invalidates positional information about functionally relevant areas acquired from functional MRI (fMRI) data. This problem is addressed by non-linear registration of pre-operative fMR images to intra-operative MRI, or to intra-operative 3D ultrasound data. Whenever an intra-operative dataset is acquired, the following image processing chain must be executed (see Figure 7):
(a) Transfer of anatomical images from the scanner and conversion into a machine-independent data format,
(b) Correction of intensity non-uniformities in the scan data,
(c) Linear registration of the (low-resolution) intra-operative scan with a (high-resolution) pre-operative scan,
(d) Intensity adjustment between both scans,
(e) Non-linear registration of both scans to yield a 3D deformation field,
(f) Application of the deformation field to the pre-operative fMRI datasets,
(g) Overlay of the deformed functional information onto the intra-operative data,
(h) Conversion and transfer to the presentation device (e.g., monitor, surgical microscope).
713
Figure 7. Single steps of the neurosurgery image processing chain microscope). The current processing time for the image processing chain is about 4 h (Intel Pentium III, single processor). However, minimal disruption of the progress of surgical intervention requires a maximum processing time of approx. 10 min. The time-consuming steps (b)-(e) can benefit from computation on the Grid and are readily parallelisable on shared or distributed memory high performance computing platforms. However, the Grid application also involves the transmission of large image datasets (hundreds of Mbytes) between client and server and therefore internet bandwidth may be a problem in such time critical applications. The data volume is sufficiently large so that it is impractical to keep moving data sets between Grid server and client as each of the steps (a) - (h) is completed. Even though not all steps require the compute power of a multi-processor environment it is likely that all corresponding modules reside on the same server to minimise the data transfer overhead. The data/application partitioning required by all GEMSS applications is clearly evident in this example and will be readily accommodated within GEMSS. A client GUI will use the GEMSS API calls to negotiate a Grid service, establish quality of service parameters and negotiate the cost of the job. Accessibility to sustained computing resources is very important in this application and timely response must be guaranteed during surgical intervention. 4.3. Near real-time cranial radiosurgery simulation This application (RAPT) is concerned with the provision of physics-based simulation for radiosurgery treatment planning. Accurate focusing of radiation to the treatment site requires a combination of stereotactic localisation and accurate modelling of the radiation dose distribution within the head. The best description of the radiation distribution can be obtained using complex, compute intense Monte-Carlo simulations. The need to treat patients as soon as possible after the stereotactic frame is fitted means that rapid computations of these distributions are needed. The use of an efficient parallel Monte-Carlo code (EGS) running on the Grid will enable high accuracy treatments to be planned and executed within existing time constraints. The accuracy of the treatment plans and the response time of the GRID system will be evaluated. The Grid-enabled radiosurgery application uses RAPT as a front end to the EGS Monte Carlo engine to model ionising radiation transport through the head of the patient. It requires: (a) Definition of patient geometry (b) Specification and distribution of material types contained within the geometry (c) Position of beam isocentre (d) Beam properties - intensity profile, spectrum
714
Axial V i e w for 2.5 Million Photons Per Beam
120 |
/
120
,.,op-
110 100
',
i
i
i
/,:fl
i
!
i\
i
!
g01 801 701 601 5 0 1 - J . . . . . . . . i-. . . . . . . ~. . . . . . . . ~. . . . . . . . :. . . . . . . . r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . :
Sagttaf View, for 2.5 Million Photons Per Beam ',
',
.........
t .........
~ ..........
~
i, ..... i
i
i\
''i\i!
.....
i
~ ..........
~..........
jz-~., ti
.......... .... i
~.........
i ..........
!---~,
i .......
i
...... ..'.? ........ !/..1 ,i =
..... ~;i~,........ '<2i.~_:-,9 ...... +/t--?*~i ......... !i......
7o~---~ .......... i--:~~~----~--:-4~-~i
.......... i ...... -'
............. i ......... i .......... i .................... i ......... J.......... i 50
60
70
80
90
100
110
.
.
.
.
.
.
.
120
Figure 8. Monte Carlo computed dose distribution. (e) Beam distribution - number of beams and their arrangement (f) Quality of simulation parameters - total number of photons, interaction types. This data is currently specified by the contents of numerous text files that are converted and loaded into the EGS solver at startup. The simulation process simulates the dose given to the region of interest by modelling millions of photons, and following their paths, employing information from photon scattering data, to correctly give the photons their positions, angles of deflection, and energies or to absorb them in the tissue as they interact with the atoms of the target. The energy distribution within the geometry equates to the dose distribution of the model. Thus the output from the modelling process is a 3D patient mesh geometry with the accumulated dose and flux variance at each element of the solution mesh. To provide this simulation code as a Grid service, the RAPT application code has been separated into client and server parts. The parallel RAPT kernel has been transformed into a Web Service, and the client has been realized with Java and visualization tools written in MatLab. A typical usage scenario comprises the following steps: (a) a text editor is used to create input files, (b) the Grid client is run to negotiate for reservation of the Grid computing resource, (c) an appropriate reservation is confirmed and the input file(s) are uploaded by the Grid client, (d) RAPT job runs as scheduled on the compute resource, (e) user receives notification of completed job, (f) user downloads the output file(s), (g) MatLab application visualization software is used to view the results. As an example test case (see Figure 8), a homogeneous spherical water phantom of radius 8 cm was modelled. The dose distribution resulting from irradiation by 201 beams (500 million photons) was calculated both in a local setup and a Grid-setup using 3 PCs. In the local set-up the overall run time amounted to approximately 22 hours on a 2.5GHz single processor PC with 1GB of RAM. In a preliminary prototype Grid set-up a run time of 18 hours was obtained using a 4-way Xeon PentiumIII, 500 MHz, 2GB RAM. The job input size amounted to 117 kBytes and the job output size to 4 MBytes.
4.4. Inhaled drug delivery simulation This component of GEMSS is a comprehensive simulation tool for the study of new treatments and drug delivery to the lungs. It has been originally developed in the COPHIT project and involves a respiratory simulation that integrates medical images, mesh generation, Computational Fluid Dynamics (CFD), compartment modelling of the lungs, and the simulation of
715
mass fraction p~ofi~e
,~-.~.,.,, -,. %,
velocity
"%.',.~., ;
; ~.,~ ; ~. ~ ~,, ~ ~ ~. ,,
c,ross-sect}on
Figure 9. Results data of a COPHIT simulation viewed on the client using CFX-Post. inhalation devices. The fluid dynamics elements are very computationally demanding and enable the simulation to derive significant benefit from the Grid. The simulation results are used to inform the user of the requirements for targeted delivery of medication to the lung and systemic circulation. There are several stages in the simulation process, including pre-processing steps that create the model geometry and define its properties. Pre-processing includes description of airways geometry from CT scans, mesh generation and specification of parameters that describe various aspects of the model (e.g. drug formulation, inhalation profile etc). Although these steps require a high level of user interaction, they are not very computationally demanding and are therefore allocated to the local client rather than the Grid server. Determination of the pattern of air flow and drug deposition throughout the model is achieved using the CFX computational fluid dynamics solver. This is the most computationally demanding part of the simulation process and therefore relies on the Grid. The modelling of physiologically relevant dynamic input flows requires a transient analysis which is particularly time consuming because it requires repetitive solving over many timesteps (possibly several hundred). However, the solution step does not need any user interaction and can be run in batch mode once the input files have been encrypted and dispatched to the Grid server. The CFX results data is encrypted and returned to the client where it can be viewed using the CFX post-processor (see Figure 9). The Grid is essential for the CFD computation stage of a COPHIT simulation, and calculation may take several days. The input files sent to the Grid server are typically many MB in size, but benefit from compression. The size of the results files depends upon the solution parameters the user has requested. Since results are produced at each timestep, the amount of data can be potentially very large - typically several hundreds of megabytes. In the face of excessive download time resulting from large files and limited internet bandwidth, the user could either request to receive a selected subset of the available data or some form of results processing could be performed on the Grid server to reduce the quantity of data. In a typical test case the input file amounted to 3.5MB (compressed). Solution was performed at 100 timesteps, requiring 14.5 hours on a Dell P4 Xeon with a 2.4 GHz processor. For each timestep of the results required for analysis, 19MB of data (compressed) must be downloaded.
716
l;i:;ii~ii'i~i!i i Y i !i',i~,i
iii~:i i':,/'~ii
~i~'i':'i!!~iIi i!':i
[
.. 0.8 l
7
/
. . . . .
l l
input
~ ~
len o/p
0 m4
~ o,z |4
E o ,m
~'%
0 ,2
-oIS
= .0.8
streamlines at t ~ e 0. 4s
-1 fl
0.1
ft.2
0.3
ft.4
time, s
0.5
0.6
ft.7
0.8
Figure 10. Test case for cardiovascular simulation: flow through a simple bifurcation model bounded by dynamic boundary condition constraints.
4.5. Compartmental modelling approach for the cardiovascular system Simulation of the cardiovascular system is a valuable tool in the development of prostheses. Its role in surgical planning offers the opportunity to answer 'what if' questions, but there is a need for improved description of boundary conditions - a scientific issue that is being addressed within GEMSS. As for the COPHIT application described previously, a tetrahedral model mesh is generated from a segmented CT or MRI scan dataset. Inlet and outlet boundary conditions and material properties are also specified. In this case the model is of a segment of vessel rather than airways, but the principle is the same. This means that the pre-processing tools developed for COPHIT can be adapted to the cardiovascular application. These steps require a high level of user interaction, but they are not very computationally demanding, and they are therefore allocated to the local client rather than the Grid server. They result in the production of a 'DEF' file, which is the input to the CFD solver (CFX). Compartment models, implemented using FORTRAN code, are coupled to the outlets of the CFD model, and these represent the peripheral circulation. The properties and initial state of the compartments are specified in various text files. Together with the 'DEF' file, these specify the complete coupled 3D model-compartment fluid dynamics problem. CFX is then used to determine the pattern of blood flow - a computationally demanding step which benefits from Grid resources. A typical transient analysis over the course of a cardiac cycle requires of the order of 100 or more timesteps, and may take several days to solve. The CFD and compartment software is able run in parallel and will thus benefit from multiprocessor Grid resources. 4.6. Fully 3D reconstruction in SPECT Single Photon Emission Computed Tomography (SPECT) is a popular diagnostic facility for tumour staging and examination of metabolisms. It provides valuable functional information complementary to MRI and x-ray CT showing anatomical details with high resolution. Accurate design of the system matrix based on empirical calibration data, allows inherent correction of spatially invariant line of response functions, and first order scatter correction. An improved variant of the well-known ML-EM algorithm for emission tomography [20] has been parallelized in such a way that it is portable across distributed-memory architectures, shared-memory architectures, and SMP clusters. On a single SMP node equipped with
717 4 processors, we observed good speed-ups better than a factor of 3. On SMP clusters, a hybrid parallelization strategy based on a combination of MPI and OpenMP outperforms a pure process based parallelization strategy, relying on MPI only. Due to the high computation-tocommunication ratio the code exhibits a good scaling behaviour also on machines where an MPI-only parallelization strategy is used. The ML-EM reconstruction has been transformed into a Web Service which can be deployed within Tomcat/Axis. The image reconstruction application relies on a Java-based client GUI which handles the acquisition of SPECT images from a scanner, the configuration and remote execution of reconstruction tasks, and the display of reconstructed images. All data transfers between clients and servers are currently based on SOAR For typical SPECT resolutions, the size of the image files to be transferred between client and service is a few MBs and thus does not represent a problem, even when transferred within SOAP messages. 5. RELATED W O R K There are a number of Grid projects that deal with bio-medical applications. The EU BioGrid [1] aims at the development of a knowledge grid infrastructure for the biotechnology industry. The main objective of the OpenMolGRID project [ 18] is to develop a Grid-based environment for solving molecular design/engineering tasks relevant to chemistry, pharmacy and bioinformatics. The EU MammoGrid [ 15] project is building a Grid-based federated database for breast cancer screening. The UK e-Science myGrid [ 16] project is developing a Grid environment for data intensive in silico experiments in biology. Most of these projects focus on data management aspects of the Grid similar to the EU DataGrid project. In contrast, the GEMSS project focuses on the computational aspect of the Grid, with the aim of providing HPC hardware resources together with medical simulation services across wide area networks in order to overcome time, space or accessibility and ease of use limitations of conventional systems. Other projects in the bio-medical field which also focus more on the computational aspect of the Grid include the Swiss BiOpera project [2], the Japanese BioGrid project [ 17], and the Singapore BioMed Grid [3]. 6. C O N C L U S I O N Within the GEMSS Project medical Grid service prototypes are being developed and evaluated in a secure service-oriented infrastructure for distributed on demand/supercomputing. Grid technology has the potential for enormously enhancing the computing infrastructure and providing enhanced computational services to new user communities. The simulation services included within GEMSS is about to demonstrate potential impact on specific medical areas, aiming to improve: non-invasive diagnosis and pre-operative planning, operative procedures, therapeutic protocols, design, analysis and testing of biomedical devices, such as heart valves, stents, and inhalers. Modelling of individuals is an ongoing research topic and targets the complete simulation of the human b o d y - there is a great deal still to be understood and developed. Grid computing hopefully can bring those developments to the medical user community. 7. A C K N O W L E D G E M E N T The GEMSS partners greatfully acknowledge funding by the IST programme of the European Commission under project number IST-2001-37153. The support from Dr. Dr. Thomas Hierl
718 from the University Clinic of Leipzig who contributed CT data and numerous helpful comments to the maxillo-facial sub-task is deeply appreciated. REFERENCES
[1] [2] [3] [4] [5] [6]
[7] [8] [9] [10] [11 ] [12] [13]
[14] [15]
[16] [ 17] [18] [19] [20] [21 ]
The BioGrid Project. http://www.bio-grid.net/index.jsp BiOpera - Process Support for BioInformatics. ETH Zfirich, Department of Computer Science. http://www.inf.ethz.ch/personal/bausch/bioopera/main.html BiomedGrid Consortium. http://binfo.ym.edu.tw/grid/index.html The Bloodsim Project. http://www.software.aeat.com/cfxlEuropean_Projects/bloodsim/ bloodsim.htm, 1998. EU Esprit Project 28350, 1998-2001. The Cophit Project. http://www.software.aeat.com/cfx/European_Projects/cophitlindex. html, 2000. EU IST Project IST-1999-14004. A. Gill, M. Surridge, G. Scielzo, R. Felici, M. Modesti, and G. Sardu, RAPT: A Parallel Radiotherapy Treatment Planning Code, in H. Liddel, A. Colbrook, B. Hertzberger, and P. Sloot, Eds. Proceedings High Performance Computing and Networking 1067, pages 183-193, 1996. The GEMSS project: Grid-enabled medical simulation services, http://www.gemss.de, 2002. EU IST project IST-2001-37153, 2002-2005. The Simbio Project. http:#www.simbio.de, 2000. EU IST project IST-1999-10378, 20002003. A. Basermann, G. Berti, J. Fingberg, and U. Hartmann. Head-mechanical simulations with SimBio. NEC Research and Development, 43(4), October 2002. The DRAMA project (Dynamic Load Balancing for Parallel Mesh-based Applications) http://www.ccrl-nece.de/DRAMA, 1999. EU ESPRIT project 24953, 1997-1999 The Datagrid Project. http://eu-datagrid.web.cern.ch/eu-datagrid/. EU Project IST-200028185,2001-2003. I. Foster, C. Kesselman, S. Tuecke. The Anatomy of the Grid: Enabling Scalable Virtual Organizations, International J. Supercomputer Applications, 15(3), 2001. I. Foster, C. Kesselman, J. Nick, S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, Open Grid Service Infrastructure WG, Global Grid Forum, June 22, 2002. The GRIA Project. http://www.gria.org/. EU IST Project IST-2001-33240. MAMMOGRID - European federated MAMMOgram database implemented on a GRID structure, contact person: AMENDOLIA, Salvator Roberto, Email: [email protected]. The myGrid Project. http://mygrid.man.ac.uk/ The Japanese BioGrid Project. http://www.biogrid.jp/ OpenMolGRID- Open Computing GRID for Molecular Science and Engineering. Open Computing GRID for Molecular Science and Engineering (OpenMolGRID) M. Tittgemeyer, G. Wollny, and F. Kruggel. Visualising deformation fields computed by non-linear image registration. Computing and Visualization in Science, 5(1):45-51, 2002. L. A. Shepp and Y. Vardi, Maximum likelihood reconstruction for emission tomography. IEEE Trans. Med. Imaging, vol. 1, no. 2, pp. 113-122, 1982. The UNICORE Forum. http://www.unicore.org/
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
719
Parallel c o m p u t i n g in b i o m e d i c a l r e s e a r c h and the search for p e t a - s c a l e biomedical applications C.A. Stewart a, D. Hart a, R.W. Sheppard b, H. Li b, R. Cruise c, V. Moskvin c, and L. Papiez c ~University Information Technology Services, Indiana University 2711 East 10th Street, Bloomington, IN 47408, [email protected], [email protected] bUniversity Information Technology Services, Indiana University-Purdue University Indianapolis 799 W. Michigan Street, Indianapolis, IN 46202, [email protected], [email protected] CDepartment of Radiation Oncology, Indiana University School of Medicine 535 Barnhill Drive, RT-041, Indianapolis IN 46202, [email protected], [email protected], [email protected] The recent revolution in biomedical research opens new possibilities for applications of parallel computing that did not previously exist. This paper discusses three parallel applications useful in medicine and bioinformatics--PENELOPE, GeneIndex, and fastDNAml--in terms of their scaling characteristics and the way they are used by biomedical researchers. PENELOPE is a code that may be used to enhance radiation therapy for brain cancers. GeneIndex is a tool for interacting with genome sequence data when studying its underlying structure. fastDNAml is a tool for inferring evolutionary relationships from DNA sequence data. None are peta-scale applications, but all are useful applications of parallel computing technology in biomedical research. Biologists may not be familiar with the promise that parallel computing has for enhancing biomedical research. The parallel computing community will have its greatest impact on biomedical research only if significant effort is expended in reaching out to the biomedical research community. As the search for peta-scale applications in biomedical research continues, standard techniques in parallel computing may result in useful--and even potentially lifesaving--applications at more modest scales of parallelism. The ability of parallel computing techniques to enable more interactive analyses of biomedical data is particularly important. 1. INTRODUCTION In spite of the many claims about peta-scale applications in biological and biomedical sciences, there are few biological or biomedical parallel applications available today that scale well to hundreds of processors. There has clearly been a revolution in biological and biomedical research--the most obvious and commonly cited evidence is the dramatic increase in sequence information available in Genbank [1]. Also commonly cited, but not as often substantiated, is the claim that the revolution in biology soon will result in a revolution in the use of high performance and parallel computing in biological and biomedical applications.
720 The purpose of this paper is to discuss three parallel applications useful in medicine and bioinformatics--PENELOPE, Genelndex, and fastDNAmlmin terms of their scaling characteristics and the way they are used by biomedical researchers. These programs are used as examples in discussing the current and future needs and opportunities for parallel computing in biomedical research. The search for peta-scale applications is an important aspect of parallel computing research. The examples discussed here demonstrate that there are also many opportunities to advance biomedical research by the application of well-understood principles of parallel computing. A particular opportunity is to enhance the ability of biologists to conduct exploratory work more easily by converting batch mode serial applications into interactive (or nearly so) parallel applications.
2. P E N E L O P E
The Leksell Gamma Knife| [2] is a tool for delivering very precise doses of gamma radiation to treat cancerous brain tumors. Targeting of the Leksell Gamma Knife| is done by experts who estimate the proper targeting arrangement of the 201 Cobalt-60 gamma sources in the device, and calculate the resulting dosages using treatment planning software called Gamma Plan| Execution of one dosage calculation with Gamma Plan| takes roughly an hour on a desktop computer. Because targeting is a bit of an art, and because there is no closed-form solution, planning a targeting arrangement is done iteratively. Once the results of one calculation from GammaPlan| are available, the targeting plan is adjusted and dosages recalculated. This takes place iteratively until the results seem satisfactory. However, the software does not account for individual variations in the physical structure of the patient's head, so the dosage calculations are not as precise as possible [3]. PENELOPE is a Monte Carlo particle transport simulation package [4, 5] that may be used for simulation of dose computation in radiation oncology treatments with Leksell Gamma Knife| [6]. The PENELOPE code can be extended to incorporate patient body geometry and tissue structure, based on computerized tomography (CT) data [7]. An individual dosage calculation with PENELOPE takes roughly 7 hours on a desktop workstation. As a result it cannot be used as a serial code in a clinical setting, because it takes impractically long for the iterative process of planning radiation therapy. PENELOPE was first parallelized using MPI, running on 8 processors of an SGI Origin2000 [8]. Our goal in enhancing the parallelization of PENELOPE for use in radiation oncology was to develop a code that could be used in clinical applications--completing analysis of a targeting scenario in 5 to 60 minutes. The current parallel implementation is based on MPI communications except for the parallel random number generator. The parallel random number generator (PRNG) is presently the PRNG from the MILC consortium [9] which has been ported to Fortran90. A typical PENELOPE run involves l0 s radiation showers. The simulation of the radiation events is an "embarrassingly" parallelizable computational task. This work is divided up among multiple processors and as each processor finishes its assigned work, the results are reported back to processor 0. Our initial parallelization efforts have utilized Indiana University's IBM SP supercomputer [ 10], using Power3+ processors in 4-processor thin nodes. Figure 1 shows the time required to simulate l0 s radiation showers vs. number of processors used. Figure 2 shows the scalability of the code with number of processors. The scalability of the code is quite good. Analysis of the
721 100000
10000
200
r I,--
i
1000
150
100 t Scaling 50
100 1
,
,
10
100
N u m b e r of Processors
Figure 1. Wall-clock time vs. number of processors for simulation of 108 radiation showers with the parallel version of PENELOPE, running on an IBM SP using Power3+ processors,
0 0
,
,
,
,
50
100
150
200
,
250
,
300
# of Processors
Figure 2. Speedup vs. number of processors for simulation of 10 s radiation showers with the parallel version of PENELOPE, running on an IBM SP using Power3+ processors.
code performance with Vampir [ 11 ] shows that imperfections in the scalability are due at least in part to differences in the time required for various worker processes to complete their assigned calculations. The results of simulations using the parallel version of PENELOPE, on test targets with simple geometry, match very well against the Gamma Plan| treatment planning software; further details of the parallel implementation of PENELOPE are provided in [12]. The most important results of the scalability testing are the practical outcomes in terms of potential therapeutic use. With 256 processors the time to complete the calculations is less than 5 minutes. The time to complete an analysis with 32 processors is just under 35 minutes. I/O is a problem at present. Roughly one hour is required to move output data, via serial ftp, from the IBM SP to the system on which it is visualized. In addition, the use of a shared supercomputer such as IU's IBM SP for analysis of clinical patient data is not compliant with patient medical record confidentiality laws in the United States. Our future work with PENELOPE will be aimed at creating a system that is feasible for clinical use and will meet the required quality assurance and privacy standards for clinical treatment. The main steps in this effort will be porting the code to Linux and implementing parallel I/O of output data. Extensive quality assurance testing will also be required. Our work thus far has been successful as a proof of concept. The results of the performance analysis clearly demonstrate that a stand-alone Linux cluster could be used for radiation therapy planning using the parallel version of PENELOPE in a clinical setting. Such analyses could be done in a fashion that is faster than the systems currently used for radiation therapy planning in clinical settings. (Note: Our contributions to enhancements of the PENELOPE code have been submitted to the RSICC [13].) 3. GENEINDEX
To understand how the expression of a genome is regulated it is necessary to identify the locations of initiators, promoters, and other genetic control sequences embedded within the genome
722 x 105 Frequency of all words of length of 4 in a sequence of length of 81813302
14,~
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
12
E ~8
~6
c~
e-
~
I'r4~ g
i
1
I
I
Ill
I
I
I
I
II
2 0
15
30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 255 Word list: 0=AAAA, I=AAAC, 2=AAAG, 3--AAAT.... 255=TTI-r
Figure 3. Genelndex output: frequencies of all possible words of length 4 in the complete genome of Caenorhabditis elegans (nematode worm). itself. Genelndex is a tool designed to facilitate studies of genome organization. Genelndex calculates and maps the frequencies and positions of all words of a specified length in a DNA sequence. Genelndex currently accommodates word lengths of 1 to 36. Genelndex also provides visualizations of the results using Tcl/T1. Figure 3 is the output of an analysis of a genetic sequence with word length 4. The parallelization of Genelndex is straightforward. The genome is broken up into n sections, where n is the number of processors available. Each processor creates a linked list of the locations of each word found. When the individual worker processes complete the analysis of segments of the genome, the linked lists are joined. The scalability of Genelndex, using word size 4 and the entire Fruit Fly (Drosophila melanogaster) genome as input, is shown in Figures 4 and 5. Figure 5 shows that the scalability of Genelndex with number of processors is modest. However, Figure 4 shows that the parallel version of Genelndex can, with 16 processors, analyze the entire Fruit Fly genome in less than 5 minutes. With one processor analysis of the genome takes roughly 40 minutes. This is the critical point--Genelndex allows scientists to analyze entire genomes, in a way that permits interactive and successive analyses as part of the process of studying genome organization. This permits scientists to interactively investigate genomes and makes possible a style of research not possible with a series of batch jobs, each taking the larger portion of an hour. GeneIndex is an open source program, available from [14]. 4. fastDNAml
fastDNAml [15, 16] is a maximum likelihood program for inferring evolutionary relationshipsIphylogenetic treesnof different types of organisms from genetic sequence data. The process of searching for the phylogenetic tree with the highest likelihood value is of necessity heuristic, as the number of possible trees rises dramatically with the number of organisms (taxa) included in an analysis. For example, for 100 taxa, there are 1.7 x 10 ls2 possible differ-
723 10000
8 v
................................................................................................................................................................
o.
1ooo
.......
Linear Speedup
~
Observed Speedup
,
9
, 9
4O
r
30
10
Number of C P U
Figure 4. Wall clock time vs. number of processors, for analysis of all words of length 4 in the complete genome of Drosophila melanogaster (fruit fly) with the parallel version of GeneIndex, running on an IBM SP using Power3+ processors,
0
20
40
60
80
Number of CPU
Figure 5. Speedup vs. number of processors, for analysis of all words of length 4 in the complete genome of Drosophila melanogaster(fruit fly) with the parallel version of GeneIndex, running on an IBM SP using Power3+ processors.
ent trees [ 16]. fastDNAml searches the space of possible trees by building trees incrementally, starting from a tree of three randomly chosen organisms. Taxa are then added, in a random order. Each possible way of adding a taxon is tested, and the tree with the highest likelihood value at each step is used as the basis for the next step. This process is conducted many times with different randomizations of the order of entry of taxa, and the result is a set of trees and associated likelihood values. The high-scoring trees can be used to create a consensus tree. The fastDNAml algorithm is described in detail in [ 15]. The parallel version of fastDNAml and its scalability characteristics have been described in detail in [ 16], and it is available as open source software from [17]. The scalability of fastDNAml is very good. Still, if a scientist wanted to do a fixed number of different randomizations of the order of taxon addition, the best computer performance would be achieved by running multiple serial jobs. In practice this is a poor approach from the viewpoint of a biologist. The process of aligning multiple genetic sequences is still something of an art, and the parallel version of fastDNAml permits the biologist to quickly get some views of the phylogenetic trees being produced by fastDNAml. If the trees are clearly nonsensical, this is a sign to the biologist that perhaps the alignment is incorrect. More generally, the ability to get a sense for the results as they are being processed greatly facilitates the understanding and analysis of the results of maximum likelihood phylogenetic analysis. The ability to perform very large-scale calculations in days, rather than weeks, makes feasible research that would otherwise not be considered.
5. DISCUSSION This paper has briefly described three parallel applications useful in biomedical, biological, and genomic research. The scalability of two of the applications discussed is very good, while
724 the scalability of the third (GeneIndex) is modest. However, all three applications are successful examples of the use of parallel techniques because they enable biomedical researchers (and potentially clinicians) to perform their analyses in a fashion that approaches being interactive, as opposed to waiting an hour, many hours, or days for results. Many of the techniques applied in the example programs described here are well known within the parallel programming community. Useful--and in some cases potentially lifesaving--applications may require only the straightforward application of well known principles and techniques. Parallel applications that biologists can use in a fashion that is interactive or nearly so enable analyses that might lend insight into important problems but which might otherwise not be attempted. In physics, currently many research projects are planned with an expected duration of many years, sometimes spanning more than a decade from initial planning to completion of an experiment. In biology, on the other hand, the nature of interesting problems may change very rapidly. The work of Korber and coworkers in studying the phylogenetics of the HIV-1 complex is a good example [ 18]. The rapidity with which a phylogenetic analysis of the HIV- 1 complex was carried out, as a result of the use of parallel computing techniques, allowed the results this analysis to appear at a critical juncture in the scientific debate on the origins of HIV. As another example, very basic questions about the evolution of arthropods have become significant topics of debate just in the period of the last two years [ 19, 20, 21 ]. There is a clear opportunity to apply massive computing resources to analysis of arthropod evolution, but the calculations must be done in a timely fashion. A large team of researchers will do so using a grid of supercomputers during the IEEE/ACM conference SC2003 [22]. Hundreds to thousands of processors, and thousands of CPU hours, will be applied to this problem. However, the nature of the debate about the evolution of arthropods is such that this analysis is worth pursuing for biologists only if the analysis can be completed very quickly. If the analysis were to be done on a smaller computer system and take months of wall-clock time, the input data would have been superceded by the time the calculations were finished! In this case, then, the critically important factor is a tera-scale application that permits biologists to analyze a massive problem in short amounts of wall-clock time. A challenge in the application of parallel computing techniques to important biological and biomedical problems is the fact that many biologists are unfamiliar with the real possibilities of parallel computing [23, 24]. Indiana University has invested a great deal of effort developing relationships and collaborations with the biomedical research community. Similar efforts have been undertaken successfully at the Aachen University of Technology [25]. Outreach and education efforts are just as important as the actual development of parallel programs to solve biomedical problems. Very large-scale biomedical computing efforts, such as the Encyclopedia of Life and the Biomedical Informatics Research Network (BIRN) [26, 27], aim to implement peta-scale applications that will solve current and important problems. However, as the parallel computing community strives for peta-scale applications in biomedical research, it is critical that we not overlook applications of more modest scale. 
Interactive tera-scale applications might today be of great value to a large portion of the biomedical research community. If the parallel computing community is to aid the biomedical research community in accelerating research with the greatest possible effectiveness, then the application of parallel computing techniques must be pursued at all scales. This paper has described the development of three parallel applications that advance biomedical research (and potentially clinical medical practice) by virtue of us-
725 ing parallel computing to enable biomedical researchers to interactively and iteratively perform large-scale calculations as part of the ongoing research process. Opportunities to accelerate biomedical research abound for parallel computing experts willing to proactively work with the biomedical research community. 6. A C K N O W L E D G E M E N T S This paper is dedicated to Gerald Bernbom, Andrea McRobbie, and all people engaged in the fight against brain cancer. This research was supported in part by the Indiana Genomics Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment Inc. Other sources of support for this research include: Shared University Research grants from IBM, Inc. to Indiana University; Indiana University's IBM Life Sciences Institute of Innovation; and National Science Foundation Grants 0116050 and CDA-9601632. Some of the ideas presented here were developed while the senior author was a visiting scientist at H6chstleistungsrechenzentrum Universit~it Stuttgart. The support and collaboration of HLRS and Michael Resch, Matthias Mtiller, Peggy Lindner, Matthias Hess, and Rainer Keller are gratefully acknowledged. Editorial assistance and assistance with graphics from John Samuel, John Herrin, Malinda Lingwall, and W. Les Teach is gratefully acknowledged. We thank the organizers of ParCo2003, especially Wolfgang Nagel, and the host institution, Technische Universitfit Dresden, for an excellent conference. REFERENCES
[1] [2] [3]
Growth of GenBank. http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html Home page. Elekta AB. http://www.elekta.com/ V. Moskvin, R. Timmerman, C. DesRosiers, L. Papiez, M. Randall, and P. DesRosiers, Effect of air-tissue inhomogeneity on dose distribution with Leksell Gamma Knife: Comparison between measurements, GammaPlan and Monte Carlo simulation, 45th AAPM Annual Meeting, August 10-14, 2003, San Diego [4] F. Salvat, J.M. Fernandez-Varea, J. Baro, and J. Sempau, PENELOPE, an algorithm and computer code for Monte Carlo simulation of electron-photon showers, Informes Tecnicos CIEMAT report n. 799, CIEMAT, Madrid, 1996. [5] RSICC code package CCC-682. http://www-rsicc.ornl.gov/codes/ccc/ccc6/ccc-682.html [6] V.P. Moskvin, C. DesRosiers, L. Papiez, R. Timmerman, M. Randall, and P. DesRosiers, Monte Carlo simulation of the Leksell Gamma Knife(R): I. Source modeling and calculations in homogeneous media, Physics in Medicine and Biology, 47 (2002) 1995-2011. [7] V.P. Moskvin, L. Papiez, R. Timmerman, C. DesRosiers, M. Randall, and P. DesRosiers, Monte Carlo simulation for Leksell Gamma Knife radiosurgery plan verification, Nuclear Mathematical and Computational Sciences: A Century in Review, A Century Anew, Gatlinburg, Tennessee, April 6-11, 2003, on CD-ROM, American Nuclear Society, LaGrange Park, IL, 2003. [8] J. Sempau, A. Sanchez-Reyes, F. Salvat, H. Oulad ben Tahar, S.B. Jiang, and J.M. Fernandez-Varea, Monte Carlo simulation of electron beams from an accelerator head using PENELOPE, Phys. Med. Biol. 46 2001 1163-1186. [9] MIMD Lattice Computation (MILC) Collaboration. http ://www.physics.indiana.edu/sg/milc.html [ 10] The UITS Research SP System http://sp-www.iu.edu/ [ 11] Intel GmbH, Software & Solutions Group. http:www.pallas.com
726 [12] R.B. Cruise, R.W. Sheppard, and V.R Moskvin, Parallelization of the PENELOPE Monte Carlo particle transport simulation package, Nuclear Mathematical and Computational Sciences: A Century in Review, A Century Anew. Gatlinburg, Tennessee, April 6-11, 2003, on CD-ROM, American Nuclear Society, LaGrange Park, IL, 2003. [ 13] RSICC code package CCC-713. http://www-rsicc.ornl.gov/codes/ccc/ccc7/ccc-713.html [ 14] Genelndex. http://www.indiana.edu/rac/hpc/GeneIndex/ [ 15] G. J. Olsen, H. Matsuda, R. Hagstrom, and R. Overbeek, fastDNAml: A tool for construction of phylogenetic trees of DNA sequences using maximum likelihood, Comput. Appl. Biosci. 10 (1994) 41--48. [16] C.A. Stewart, D. Hart, D. K. Berry, G. J. Olsen, E. Wemert, W. Fischer, Parallel implementation and performance of fastDNAml--a program for maximum likelihood phylogenetic inference, Proceedings of SC2001, Denver, CO, November 2001. http ://www. sc2001 .org/papers/pap.pap 191 .pdf [17] Maximum Likelihood Analysis of Phylogenetic Data http://www.indiana.edu/rac/hpc/fastDNAml/ [ 18] B. Korber, M. Muldoon, J. Theiler, F. Gao, R. Gupta, A. Lapedes, B.H. Hahn, S. Wolinsky, T. Bhattacharya, Timing the Ancestor of the HIV- 1 Pandemic Strains, Science 288 (2000) 1789-1796. [19] U.W. Hwang, M. Friedrich, D. Tautz, C. J. Park, and W. Kim, Mitochondrial protein phylogeny joins myriapods with chelicerates, Nature 413 (2001) 154-157. [20] G. Giribet, G.D. Edgecombe, and W. C. Wheeler, Arthropod phylogeny based on eight molecular loci and morphology, Nature 413 (2001) 157-161. [21 ] F. Nardi, G. Spinsanti, J.L. Boore, A. Carapelli, R. Dallai, and F. Frati, Hexapod origins: Monophyletic or paraphyletic?, Science 299 (2003). [22] R. Keller, C. A. Stewart, J. Colbourne, M. Hess, D. Hart, J. Steinbachs, U. Woessner, D.K. Berry, R. Repasky, M. Mueller, H. Li, G.W. Stuart, M. Resch, E. Wernert, M. Buchhorn, H. Takemiya, R. Belhaj, W.E. Nagel, S. Sanielevici, S.t. Kofuji, D. Bannon, N. Nakajima, R. Badia, M.A. Miller, H. Park, R. Stevens, F.-R Lin, J. Brooke, D. Moffett, T.T. Wee, G. Newby, J.C.T. Poole, R. Hamza, M. Papakhian, L. Grundhoeffer, R Cherbas, Global Analysis of Arthropod Evolution, http://www.sc-conference.org/sc2003/tech_hpc.php [23] C.A. Stewart, C.S. Peebles, M. Papakhian, J. Samuel, D. Hart, and S. Simms, High Performance Computing: Delivering Valuable and Valued Services at Colleges and Universities, Proceedings of SIGUCCS, Portland, OR, October 2001. [24] C.A. Stewart, Repasky, R., Hart, D., Papakhian, M., Shankar, A., Wemert,E., Arenson, A., and G. Bernbom, Advanced Information Technology Support For Life Sciences Research, Proceedings of SIGUCCS 2003, Sept. 21-24, 2003, San Antonio, TX. [25] C. Bischof, Pushing Computational Engineering and Science at Aachen University of Technology, Proceedings of ISC2002, June 20-22, 2002, Heidelberg, Germany. [26] Encyclopedia of Life. http://eol.sdsc.edu/ [27] Biomedical Informatics Research Network (BIRN). http://www.nbirn.net/
Minisymposium
Performance Analysis
This Page Intentionally Left Blank
Parallel Computing: SoftwareTechnology,Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 ElsevierB.V. All rights reserved.
729
Big Systems and Big Reliability Challenges* D.A. Reed ~, C. Lu a, and C.L. Mendes a aDepartment of Computer Science, University of Illinois, Urbana, 61801 Illinois - USA Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands and with proposed petaflop system likely to contain hundreds of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. This paper quantifies system reliability using data drawn from current systems and describes possible approaches for ensuring reliable, effective use of future, large-scale systems. We also present techniques for detecting imminent failures that allow applications to execute despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage. 1. INTRODUCTION Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As an example, NCSA is deploying a 17 teraflop Linux cluster containing 2,938 Intel Xeon processors. Even larger clustered systems are being designed- the 40 teraflop Cray/SNL Red Storm cluster and the 180 teraflop IBM/LLNL Blue Gene/L system will contain 10,368 and 65,536 nodes, respectively. As node counts for multi-teraflop systems grow to tens of thousands and with proposed petaflop systems likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. Although the mean time before failure (MTBF) for the individual components is high, the large overall component count means the system itself will fail frequently. Indeed, operation of the 8000 processor IBM/LLNL ASCI White system has revealed that despite the rate of failure decreases as the system has matured, its MTBF is only slightly more than 40 hours. Hardware failures are exacerbated by programming models that have limited support for fault-tolerance. For scientific applications, MPI is the most popular parallel programming model. However, the MPI standard does not specify mechanisms or interfaces for faulttolerance - normally, all of an MPI application's tasks are terminated when any of the underlying nodes fails or becomes inaccessible. Given the standard domain decompositions and data distributions used in message-based parallel programs, there are few altematives to this approach *This work was supported in part by Contract No. 74837-001-0349 from the Regents of University of California (Los Alamos National Laboratory)to William Marsh Rice University, by the National Science Foundation under grant EIA-99-75020, and by the NSF Alliance PACI Cooperative Agreement.
730 without underlying support for recovery. Because a program's dataset is partitioned across the nodes, any node loss implies irrecoverable data loss. In this paper, we describe possible approaches for the effective use of multi-teraflop and petaflop systems. In particular, we present techniques that allow an MPI application to continue execution despite system component failures. One of these techniques is diskless checkpointing, which enables more frequent checkpoints by redundantly saving checkpointed data in memory. Another technique uses low-cost mechanisms to capture data for failure prediction, which enables creation of dynamic schemes for improved application resilience. The remainder of this paper is organized as follows. In w we present empirical data to quantify the reliability of large systems. This is followed by a description of our fault injection techniques in w We discuss fault-tolerance for MPI applications in w and system monitoring issues in w We show how to use such techniques for adaptive control, in w discuss related work, in w and conclude by pointing to our future work in w 2. LARGE SYSTEM RELIABILITY Although component reliabilities continue to increase, so do the sizes of systems being built using these components. Even though the individual component reliabilities may be high, the large number of components can make system reliability low. As an example, Figure 1 shows the expected system reliability for systems of varying sizes when constructed using components with three different reliabilities. In the figure, the individual components have one-hour failure probabilities of 1 • 10 -4, 1 • 10 -5, and 1 • 10+-6 (i.e., MTBFs of 10,000, 100,000 and 1,000,000 hours).
1 hour retlabill~
100 -
~0.99999
+
6O
4O
0
T
System Size
T
~
Figure 1. System MTBF scaling for three component reliability levels.
Even if one uses components whose one-hour probability of a failure is less than 1 • 10 -6, the mean time to system failure is only a few hours if the system is large. The IBM/LLNL Blue Gene/L system, which is expected to have 65,368 nodes, would have an average "up time" of
731
ioo98 ~"
96
~
94
<
Overall Scheduled
...........
i
i
4O
60
i
80
I O0
Week
Figure 2. NCSA Origin Array availability. less than 24 hours if its components had reliabilities this low.t A petascale system with roughly 200,000 nodes would operate without faults for only five hours. Moreover, these estimates are optimistic; they do not include coupled failure modes, network switches, or other extrinsic factors. Because today's large-scale applications typically execute in 4-8 hour scheduled blocks, the probability of successfully completing an execution before a component failed would be quite low. Simply put, proposed petascale systems will require mechanisms for fault-tolerance, otherwise they will be unusable - their frequent failures would prevent any major application from executing to completion on a substantial fraction of the system's resources. As noted earlier, the data in Figure 1 are drawn from a simplistic theoretical model. However, experimental data from existing systems confirm that hardware and software failures regularly disrupt operation of large systems. As an example, Figure 2 depicts the percentage of nodes up and running in the NCSA's Origin array, composed of twelve shared-memory SGI Origin 2000 systems. This data reflects failures that occurred during a two-year period between April 2000 and March 2002. In summary, failures already constitute a substantive fraction of service disruption on today's terascale systems. Given the larger number of components expected in petascale systems, such failures could make those systems unusable. Hence, fault-tolerance mechanisms become a critical component for the success or large high-performance systems. 3. FAULT INJECTION AND ASSESSMENT Because the relative frequency of hardware and software failures on individual components is extremely low, it is only on larger systems that their frequency becomes large enough for testing. However, even in such cases, the errors are not systematic or reproducible. This makes obtaining statistically valid failure data arduous and expensive. Instead, one needs an infrastructure that can be used to generate faults as test cases in a systematic way. We are developing a parameterizable test infrastructure that can inject computation and communication component faults into large-scale clusters. For each fault type, we create models of fault behavior and produce instances of each. Thus, we can inject these faults into running tlBM has stressed the importance of careful engineeringto ensure higher system reliability.
732 applications and observe their effects on the underlying execution. This enables assessment of the effectiveness of the fault-tolerance mechanisms described in w We emulate computation errors via random bit flips in the address space of the application and in the register file. These correspond to memory errors or transient errors in the processor core. We inject three types of application memory faults: text segment errors, stack and heap errors, and register file errors. Communication errors can occur at many levels, ranging from the link transport level, through switches, NICs, communication libraries and application code. Many of the hardware errors can be simulated by appropriate perturbations of application or library code. Hence, we inject faults using the standard MPI library. By intercepting MPI calls via the MPI profiling interface, we can manipulate all MPI messages. We simulate four types of communication faults: redundant packets, packet loss, payload perturbation and software errors. As an example of this fault injection methodology, we generated faults in the execution of a computational astrophysics code based on the Cactus package [1 ], which simulates the tridimensional scalar field produced by two orbiting sources. We injected both computation and communication faults, and classified their effect according to the observed program behavior. The tests were conducted on an x86 Linux cluster. The outcome of an execution then had four possibilities: a crash, a hang, a finished execution producing incorrect data, or a correctly finished execution. Table 1 summarizes a portion of the results, corresponding to the effects of random bit flippings in the computation and communication parts of Cactus. Although the majority of the executions completed correctly, a non-negligible number of injected faults produced some type of incorrect behavior. In particular, errors in the registers produced numerous crashes and hangs. Similarly, errors on MPI_Gather always caused some type of error in the execution, showing that this type of message carries data extremely sensitive in the program context.
Table 1 Results from fault injection experiments with Cactus code.
Injection Location Text Memory Data Memory Heap Memory Stack Memory Regular Registers Floating Point Registers MPI_Algather MPI Gather MPI_Gatherv MPI Isend
Crash 49 12 4 82 139 0 15 17 0 0
Hang Wrong Output Correct Output 12 31 36 43 180 10 0 16 0 0
6 0 10 0 0 10 5 41 23 0
933 959 950 877 189 480 70 0 13 90
733 1
1 spare,/group
0.9 0.8 >, ._
0.7
~
0.6
m
0.5
8
0.4
3
0.3
.13 0
o
3 spares/group ......... 4 spares/group ....
\
\ \
\
x
\ i
\ \
0.2 0.1 0
1O0
200
300
400
500
600
700
800
900
Number of Phases
Figure 3. Probability of successful execution with diskless checkpointing. 4. FAULT T O L E R A N C E IN MPI CODES Historically, large-scale scientific applications have used application-mediated checkpoint and restart techniques to deal with failures. However, the MTBF for a large system can be comparable to that needed to restart the application, which means that the next failure can occur before the application recovers from the last failure. For the foreseeable future, we believe library-based techniques, which combine message fault-tolerance with application-mediated checkpointing, are likely to be the most useful for large-scale applications and systems. Hence, we are extending LA-MPI [2] by incorporating diskless checkpointing [3] as a complement to its support for transient and catastrophic network failures. In this scheme, fault-tolerance is provided by three mechanisms: (a) LAMPI's fault-tolerant communication, (b) application-specific disk checkpoints and (c) librarymanaged, diskless intermediate checkpoints. In diskless checkpointing, an application's nodes are partitioned into groups. Each group contains both the nodes executing the application code and some number of spares. Intuitively, enough extra nodes are allocated in each group to offset the probability of one or more node failures within the group. When an application checkpoint is reached, checkpoint data is written to the local memory in each node, and parity data is computed and stored in the memories of spare nodes assigned to each group. If a node fails, the modified MPI layer directs the application to rollback and restart the failing process on a spare node, using data regenerated from checkpoint and parity sets. To evaluate this approach, we simulated a hypothetical system containing 10,000 computation nodes with 1,000 spare nodes, under different configurations of spares per group. We assumed an inter-checkpoint interval of 45 minutes and a checkpoint duration between 2.4 and 20 minutes, proportional to the size of each group. Figure 3 shows the probability of successful execution (i.e., no catastrophic failure) as a function of application duration, expressed as checkpoint periods. Intuitively, the longer the program runs, the more likely it will encounter a catastrophic crash; assigning two spares per group can allow the application to continue execution for a period three times longer (at 90% success probability).
734 5. SYSTEM M O N I T O R I N G FOR RELIABILITY System failures are frequently caused by hardware component failures or software errors. Others are triggered by external stimuli, either operator errors or environmental conditions (e.g., Feng [4] predicts that the failure rate in a given system doubles with every 10 degree Celsius rise in temperature). Rather than waiting for failures to occur and then attempting to recover from those failures, it is far better to monitor the environment and respond to indicators of impending failure. We are starting to develop an integrated set of fault indicator monitoring and measurement tools, while validating the toolkit and quantifying the relative frequency of failure indicators and modes on Linux clusters. This fault monitoring and measurement toolkit will leverage extant, device-specific fault monitoring software to create a flexible, integrated and scalable suite of capture and characterization tools. We plan to integrate three sets of failure indicators: disk warnings based on SMART protocols, switch and network interface (NIC) status and errors, and node motherboard health, including temperature and fan status. The frequency of data collection will be flexible and adjustable for each kind of indicator, ranging from a sample every few hours to one or more samples per second. Because collecting and analyzing data from every node in a large system can become prohibitively expensive, we plan to adopt low-cost data capture mechanisms. One such scheme uses statistical sampling techniques [5]. In some cases, however, simply reducing the number of nodes for data collection may not be sufficient, as the amount of data produced in each processor could still be excessively large. For these situations, we can implement data reduction using application signature modeling [6], a lossy compression mechanism that preserves the most salient features in the captured data. 6. I N T E L L I G E N T ADAPTIVE CONTROL To respond to changing execution conditions due to system failures, applications and systems must be nimble and adaptive. Performance contracts provide one mechanism for adaptation. Intuitively, a performance contract specifies that, given a set of resources with certain characteristics and for particular problem parameters, an application will achieve a specified performance during its execution [7]. Combining instrumentation and metrics, a contract is said to be violated if any of the contract attributes do not hold during application execution (i.e., the application behaves in unexpected ways or the performance of one or more resources fails to match expectations). Reported contract violations can trigger several possible actions, including identification of the cause (either application or resource) and possible remediation (e.g., application termination, reconfiguration or rescheduling) Using our Autopilot toolkit [8], we have developed a contract monitoring infrastructure [9] that includes distributed sensors for performance data acquisition, actuators for implementing performance optimization decisions, and a decision-making mechanism for evaluating sensor inputs and controlling actuator outputs. In concert with that development, we created a graphical user interface where the user can request visualization of contract outputs or values from specific metrics [10]. 
In addition to the application's computation and communication activity, we are now extending the set of possible metrics by including operating system events, which can be captured by the Magnet toolkit [11].
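Both the failure-indicator collection of Section 5 and the contract sensors above must keep measurement cost low on very large machines. The statistical-sampling idea [5] amounts to polling only a random subset of nodes and extrapolating; the sketch below illustrates this for a temperature indicator. The node list, probe function and threshold are illustrative assumptions.

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Estimate the fraction of nodes whose temperature exceeds a warning
    // threshold by probing only a random sample instead of the whole machine.
    // 'read_temperature' stands in for a device-specific probe.
    double estimate_hot_fraction(std::vector<int> node_ids,
                                 std::size_t sample_size,
                                 double threshold,
                                 double (*read_temperature)(int node)) {
        std::mt19937 rng(std::random_device{}());
        std::shuffle(node_ids.begin(), node_ids.end(), rng);
        if (sample_size < node_ids.size()) node_ids.resize(sample_size);

        std::size_t hot = 0;
        for (int node : node_ids)
            if (read_temperature(node) > threshold) ++hot;
        return static_cast<double>(hot) / node_ids.size();  // sample proportion
    }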
7. RELATED WORK
Several research efforts have addressed the fault-tolerance limitations in MPI. These include Los Alamos MPI (LA-MPI) [2], which provides end-to-end network fault tolerance, compile-time and runtime approaches [12] that redirect MPI calls through a fault-tolerance library based on compile-time analysis of application source code, and message logging and replay techniques [13]. Each of these approaches potentially solves portions of the large-scale fault-tolerance problem, but not all of it. Compile-time techniques require source code access, and message recording techniques require storing very large data volumes. Meanwhile, real-time performance monitoring and dynamic adaptive control have been used in the past mostly in the domain of networks and distributed systems [14]. Although many of those techniques are applicable to large systems, adaptation due to failing resources has more involved constraints: it must deal not only with poor performance, but also with conditions that can potentially abort an application's execution completely.

8. CONCLUSION

As node counts for multi-teraflop systems grow to tens or hundreds of thousands, the assumption of fully reliable hardware and software becomes much less credible. Although the mean time before failure (MTBF) for the individual components is high, the large overall component count means the system itself will fail frequently. To be successful in this scenario, applications and systems must employ fault-tolerance capabilities that allow execution to proceed despite the presence of failures. We have presented mechanisms that can be used to cope with failures in terascale and petascale systems. These mechanisms rest on four approaches: (a) real-time monitoring for failure detection and recovery, (b) fault injection analysis of software resilience, (c) support of diskless checkpointing for improved application tolerance to failures and (d) development of adaptive software subsystems based on the concept of performance contracts. We plan to extend our performance contracts in both spatial and temporal scope: we will create global contracts covering all the executing processors, and validation schemes based on both current and past system state.

REFERENCES
[1] G. Allen et al., The Cactus Code: A Problem Solving Environment for the Grid, Ninth IEEE International Symposium on High-Performance Distributed Computing, Pittsburgh, 2000.
[2] R. Graham et al., A Network-Failure-Tolerant Message-Passing System for Terascale Clusters, International Conference on Supercomputing, New York, 2002.
[3] J. Plank, K. Li, M. Puening, Diskless Checkpointing, IEEE Transactions on Parallel and Distributed Systems, No. 9 (1998).
[4] W. Feng, M. Warren, E. Weigle, The Bladed Beowulf: A Cost-Effective Alternative to Traditional Beowulfs, Cluster 2002, Chicago, 2002.
[5] C. Mendes, D. Reed, Monitoring Large Systems via Statistical Sampling, LACSI Symposium, Santa Fe, 2002.
[6] C. Lu, D. Reed, Compact Application Signatures for Parallel and Distributed Scientific Codes, SC02, Baltimore, 2002.
[7] F. Vraalsen, R. Aydt, C. Mendes, D. Reed, Performance Contracts: Predicting and Monitoring Grid Application Behavior, LNCS No. 2242 (2001).
[8] R. Ribler, J. Vetter, H. Simitci, D. Reed, Autopilot: Adaptive Control of Distributed Applications, Seventh IEEE Symposium on High-Performance Distributed Computing, Chicago, 1998.
[9] K. Kennedy et al., Toward a Framework for Preparing and Executing Adaptive Grid Programs, NGS Workshop, International Parallel and Distributed Processing Symposium, 2002.
[10] D. Reed, C. Mendes, C. Lu, Application Tuning and Adaptation, in "The Grid: Blueprint for a New Computing Infrastructure", 2nd edition, Elsevier (2003).
[11] M. Gardner et al., MAGNET: A Tool for Debugging, Analysis and Reflection in Computing Systems, Third IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003.
[12] G. Bronevetsky et al., Collective Operations in an Application-level Fault Tolerant MPI System, International Conference on Supercomputing, San Francisco, 2003.
[13] J. LeBlanc, J. Mellor-Crummey, Debugging Parallel Programs with Instant Replay, IEEE Transactions on Computers, No. 36 (1987).
[14] K. Nahrstedt, H. Chu, S. Narayan, QoS-aware Resource Management for Distributed Multimedia Applications, Journal on High-Speed Networking, Vol. 8, No. 3-4 (1998).
Scalable Performance Analysis of Parallel Systems: Concepts and Experiences

H. Brunst a,b and W.E. Nagel a

a Center for High Performance Computing, Dresden University of Technology, 01062 Dresden, Germany, {brunst, nagel}@zhr.tu-dresden.de

b Department for Computer and Information Science, University of Oregon, Eugene, OR, USA, [email protected]

We have developed a distributed service architecture and an integrated parallel analysis engine for scalable trace-based performance analysis. Our combined approach makes it possible to handle very large performance data volumes in real time. Unlike traditional analysis tools that do their job sequentially on an external desktop platform, our approach leaves the data at its origin and seamlessly integrates the time-consuming analysis as a parallel job into the high performance production environment.

1. INTRODUCTION

For more than 20 years, parallel systems have been used successfully in many application domains to investigate complex problems. In the early eighties, supercomputing was performed mostly on vector machines with just a few CPUs. The parallelization was based on the shared memory programming paradigm. We should keep in mind that in those times, the performance of applications on such machines was several hundred times higher than on even state-of-the-art mainframes; compared to PCs, the factor was on the order of a few thousand. This meant scalability of performance, and its use was reasonably easy for the end users of those days. The situation has changed significantly. Today, we still want to improve performance, but we have to use thousands of CPUs to achieve performance improvement factors comparable to those of past times. Most systems are now based on SMP nodes which are clustered, based on improved network technology, into large systems. For very big applications, the only working programming model is message passing (MPI [6]), sometimes combined with OpenMP [3] as a recent flavor of the shared memory programming model. Reality has shown that applications executed on such large systems often have serious performance bottlenecks: what is mostly achieved is scalability in parallelism and only sometimes scalability in performance. This paper describes extended concepts and experiences to support performance analysis on systems with a couple of thousand processors. It presents a new tool architecture which has
recently been developed in the ASCI/Earth Simulator context to support applications on very large machines and to ease the performance optimization process significantly.

2. A DISTRIBUTED PERFORMANCE ANALYSIS SERVICE

The distributed performance analysis service described in this paper has recently been designed and prototyped at Dresden University of Technology in Germany. Based on the experience gained from the development of the performance analysis tool Vampir [1, 2], the new architecture uses a distributed approach consisting of a parallel analysis server, which is supposed to be running on a segment of a large parallel production environment, and a visualization client running on a desktop workstation. Both components interact with each other over the Internet by means of a standard socket-based network connection. In the discussion that follows, the parallel analysis server together with the visualization client will be referred to as VNG (Vampir Next Generation). The major goals of the distributed parallel approach can be formulated as follows:

1. Keep performance data close to the location where they were created.
2. Perform event data analysis in parallel to achieve increased scalability, with speedups on the order of 10 to 100.
3. Limit the network bandwidth and latency requirements to a minimum to allow quick performance data browsing and analysis from remote working environments.

VNG consists of two major components, an analysis server (vngd) and a visualization client (vng). Each may be running on a different kind of platform. Figure 1 depicts an overview of the overall service architecture. Boxes represent modules of the two components, whereas arrows indicate the interfaces between the different modules. The thickness of an arrow is supposed to give a rough measure of the data volume to be transferred over an interface, whereas its length represents the expected latency for that particular link. On the left hand side of Figure 1 we can see the analysis server, which is to be executed on a dedicated segment of a parallel computer having access to the trace data as generated by an application. The server can be a heterogeneous program (MPI combined with pthreads), which consists of master and worker processes. The workers are responsible for trace data storage and analysis. Each of them holds a part of the overall trace data to be analyzed. The master is responsible for the communication with the remote clients. It decides how to distribute analysis requests among the workers and, once the analysis requests are completed, merges the results into a response to be sent to the client. The right hand side of Figure 1 depicts the visualization clients running on remote desktop graphics workstations. A client is not supposed to do any time-consuming calculations. Therefore, it has a fairly straightforward sequential GUI implementation. The look and feel is very similar to performance analysis tools like Vampir, Jumpshot [8], and many others. Following this distributed approach, we comply with the goal of keeping the analysis on a centralized platform and doing the visualization remotely. To support multiple client sessions, the server makes use of multi-threading on the master and worker processes. The next section provides detailed information about the analysis server architecture.
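The request path just described, the master forwarding a client request to all workers and merging their partial results, can be sketched with plain MPI as follows. The encoding of a request as a time interval, the bin-based profile and the reduction with MPI_SUM are simplifying assumptions for illustration; the real vngd protocol and its thread handling are richer.

    #include <mpi.h>
    #include <vector>

    // Sketch: rank 0 (the master) broadcasts an analysis request -- here just a
    // time interval -- to all ranks, every rank computes a partial function
    // profile over the traces it owns, and the partial results are summed on the
    // master. All ranks are assumed to call this with the same 'bins'.
    std::vector<double> answer_profile_request(
            double t_begin, double t_end, int bins,
            std::vector<double> (*local_profile)(double, double, int),
            MPI_Comm comm) {
        double request[2] = { t_begin, t_end };
        MPI_Bcast(request, 2, MPI_DOUBLE, 0, comm);              // master -> workers

        std::vector<double> part = local_profile(request[0], request[1], bins);
        std::vector<double> merged(bins, 0.0);
        MPI_Reduce(part.data(), merged.data(), bins,
                   MPI_DOUBLE, MPI_SUM, 0, comm);                // workers -> master
        return merged;                                           // meaningful on rank 0
    }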
Figure 1. Overview of the distributed performance analysis service
3. A NEW PARALLEL ANALYSIS INFRASTRUCTURE
3.1. Requirements

During the evolution of the Vampir project, we identified three requirements with respect to current parallel computer platforms that typically cannot be fulfilled by classical sequential post-mortem software analysis approaches:

1. exploit distributed memory for analysis tasks,
2. process both long (regarding time) and wide (regarding number of processes) program traces in real time,
3. limit the data transferred to the visualization client to a volume that is independent of the amount of traced event data.

3.2. Architecture

Section 2 has already provided a rough sketch of the analysis server's internal architecture, which will now be described in further detail. Figure 2 can be regarded as a close-up of the left part of the service architecture overview. On the right hand side we can see the MPI master process, which is responsible for the interaction with the clients and the control over the worker processes. On the left hand side, m identical MPI worker processes are depicted in a stacked way so that only the uppermost process is actually visible. Every single MPI worker process is equipped with one main thread handling MPI communication with the master and, if required, with other MPI workers. The main thread is created once at the very beginning and keeps running until the server is terminated. Depending on the number of clients to be served, every MPI process has a dynamically changing number of n session threads responsible for the clients' requests. The communication between MPI processes is done with standard MPI constructs, whereas the local threads communicate by means of shared buffers, synchronized by mutexes and condition variables. This permits a low overhead during interactions between the mostly independent components. Session threads can be subdivided into three module categories: analysis modules, event database modules, and trace format modules. Starting from the bottom, the trace format modules include parsers for the traditional Vampir trace format (VPT), a newly designed trace format (STF) by Pallas, and the TAU [4, 7] trace format (TRC). The modular approach allows other third-party formats to be added easily. The database modules include storage objects
for all supported event categories like functions, messages, performance metrics, etc. The final module category provides the analysis capabilities of the server. This type of module performs its work on the data provided by the database modules. In contrast to the worker processes described above, the situation for the vngd master process is slightly different. First of all, the layout with respect to its threads is identical to the one applied on the worker processes. Similar to a worker process, the main thread is responsible for doing all MPI communication with the workers. The session threads, on the other hand, have different tasks: they are responsible for merging analysis results received from the workers, converting the results to a platform-independent format, and handling the communication with the clients, as depicted on the right hand side of Figure 2.

Figure 2. Analysis server in detail: m MPI worker processes, each with n session threads, coordinated by one master process that communicates with the remote clients.

3.3. Loading and distributing the trace data

The main issue in building a parallel analysis infrastructure is how to distribute the accruing work effectively. One way is to identify analysis tasks on dedicated data sets and distribute them among the worker processes. Unfortunately, this approach is rather inflexible and does not help to speed up single analysis tasks, which remain sequential algorithms. Hence, the central idea here is to achieve the work distribution by partitioning the data rather than distributing analysis tasks. Naturally, this requires corresponding distributed data structures and analysis algorithms to be developed. To interactively browse and analyze trace data, the information needs to be in memory to meet the real-time requirements. To quickly load large amounts of data, every worker process (see Figure 1) directly accesses the trace data files via an instance of a trace format support library local to the process, which allows for selective event data input. That is, every worker only has to read the data that it is going to process itself. To support multiple trace file formats (VTF, STF, TAU), we introduced a format unification middleware that provides a standardized interface to internal data structures for third-party trace
file format drivers. The middleware supports function entry/exit events, point-to-point communication events, collective operations, and hardware counter events. Depending on their type and origin (thread), the data is distributed in chunks among the workers. On each worker process, it is attached to a newly created thread that is responsible for all analysis requests concerning this particular program trace data.
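One simple way to realize the thread-to-worker mapping behind this chunk distribution is an interleaved (round-robin) assignment, sketched below; the next subsection describes the mapping actually used in our implementation. The structure and names are illustrative assumptions.

    // Interleaved (round-robin) mapping of trace threads to the m worker
    // processes: all event chunks produced by one thread end up on the same
    // worker, so per-thread analysis needs no further communication.
    inline int worker_for_thread(int thread_id, int num_workers) {
        return thread_id % num_workers;
    }

    // Example: decide which worker has to load a given chunk of events.
    struct EventChunk { int thread_id; long first_event; long num_events; };

    inline int owner_of(const EventChunk& c, int num_workers) {
        return worker_for_thread(c.thread_id, num_workers);
    }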
3.4. Analyzing the trace data in parallel

One of the major event categories handled by performance analysis tools is the category of block-enter and block-exit events. They can be used to analyze the run time behavior of an application on different levels of abstraction like the function level or the loop level. Finer or coarser levels can be generated depending on the way an application is instrumented. As this is a common type of performance data, we will now discuss its parallel processing in detail. The analysis of the above event type requires a consistent stream of events to maintain its inherent stack structure. The calculation of function profiles, call path profiles, timeline views, etc. highly depends on this structure. To create a degree of parallelism to benefit from, we chose a straightforward data/task distribution. Every worker holds a subset t_w of the traces, where w is the worker ID. The trace-to-worker mapping function can be arbitrary but should not correlate with any thread patterns, to avoid load balancing problems. In our implementation, we used an interleaved thread ID to worker mapping, which works best for traces with a large number of threads. To compute a function profile for an arbitrary time interval, the following steps take place. Depending on user input, the display client sends a request over the network to the server. On the server side, the session thread (every client is handled by a different session thread) on the MPI master process receives the request and buffers it in a queue. Request by request, the session thread then forwards the request to the main thread on the MPI master process, from where it can be broadcast to all MPI worker processes. The responsible session thread on every worker can now calculate the profile for the traces it owns. The results are handed back to the MPI master process by the main thread of every MPI worker process. As soon as the information arrives at the respective session thread on the MPI master process, the results are merged and then handed over to the client. By keeping the event data on the worker processes and exchanging only pre-calculated results between server and client, network bandwidth and latency are no longer limiting factors.

4. EVALUATION

To verify this scalable, parallel analysis concept, a test case was chosen and evaluated on our test implementation. The following measurements were taken and compared to sequential results obtained with the commercial performance analysis tool Vampir 4.0:

• wall clock time to fully load a selected trace file,
• wall clock time to calculate a complete function profile display,
• wall clock time to calculate a timeline representation.

The tests are based on a trace file (in STF format) that was generated by an instrumented run of the sPPM [5] ASCI benchmark application. The trace file has a size of 327 MB and holds
approximately 22 million events. Depending on the available main memory, VNG successfully handles files that are ten to a hundred times bigger, but for the reference measurements Vampir also had to load the full trace into main memory, which was the limiting factor. The experiments were carried out on 1, 2, 4, 8, and 16 MPI worker processes plus one administrative MPI process, respectively. The test platform was an Origin 3800 equipped with 64 processors (400 MHz MIPS R12000) and 64 GB of main memory. The client was running on a desktop PC operating under Linux.

4.1. Benchmark results
Table 1 shows the timing results for the three test cases above. Column 1 of Table 1 gives the wall clock reference times measured for the commercial tool Vampir. The experimental results obtained with VNG will be related to those values. The graph displayed in Figure 3 illustrates the speedups derived from Table 1.
Table 1. Experimental results for the test cases, measured in seconds (wall clock time).

              vampir(1)   vng(1+1)   vng(2+1)   vng(4+1)   vng(8+1)   vng(16+1)
Load Op.:       208.00      91.50      43.45      21.26      10.44       5.20
Timeline:         1.05       0.17       0.18       0.17       0.15       0.16
Profile:          7.86       0.82       0.44       0.25       0.16       0.14
The first row in Table 1 gives the loading time for reading the 327 MB test trace file mentioned above. We can see that already the single-worker version of VNG outperforms standard Vampir. This is mainly due to linear optimizations we made in the design of VNG, which is also the reason for the super-linear speedups we observe in Figure 3, where a multi-process run of VNG is compared with standard Vampir. Upon examination of the speedups for 2, 4, 8, and 16 MPI tasks, we see that the loading time typically is reduced by a factor of two if the number of MPI tasks doubles. This shows that scalability is attained for a moderate degree of parallelism. Another important aspect is that the performance data is equally distributed among the nodes. This makes it possible to analyze very large trace files in real time on relatively cheap distributed memory systems, whereas with standard Vampir this is only possible on very large SMP systems. The second row depicts the update time for a timeline-like display. In this case almost no speedup can be observed for an increasing number of MPI processes. This is due to the communication latencies between workers, master and client; most of the very short time measured is spent in the network infrastructure. Row three shows the performance measurements for the calculation of a full featured profile. The sequential time of approximately eight seconds is significantly slower than the times recorded for the parallel analysis server. The timing measurements show that we succeeded in drastically reducing this amount of time: absolute values of less than a second, even for the single-worker version, allow smooth navigation in the GUI. The speedups prove scalability of this functionality as well.
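The speedups plotted in Figure 3 are simply the sequential Vampir reference time divided by the corresponding VNG time from Table 1. The small program below reproduces the values for the load operation; it uses only the numbers given in the table.

    #include <cstdio>

    int main() {
        const double vampir_load = 208.00;                           // Table 1, column 1
        const double vng_load[]  = { 91.50, 43.45, 21.26, 10.44, 5.20 };
        const int    workers[]   = { 1, 2, 4, 8, 16 };
        for (int i = 0; i < 5; ++i)
            std::printf("vng(%d+1): speedup %.1f\n",
                        workers[i], vampir_load / vng_load[i]);
        return 0;
    }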
Figure 3. Speedups for the benchmarks (loading time, timeline, and summary chart, plotted over the number of MPI tasks).

5. CONCLUSION

The numbers above show that we have overcome two major drawbacks of trace-based performance analysis. Our client/server approach keeps performance data as close as possible to its origin; neither costly and time-consuming transportation to remote analysis workstations nor poor remote display performance needs to be endured anymore. The local visualization clients use a special-purpose, low-latency protocol to communicate with the parallel analysis server, which allows for scalable remote data visualization in real time. Distributed memory programming on the server side furthermore allows the application and the user to benefit from the huge memory and computation capabilities of today's cluster architectures without a loss of generality.

REFERENCES
[1] Brunst, H., Nagel, W. E., Hoppe, H.-C.: Group-Based Performance Analysis for Multithreaded SMP Cluster Applications. In: Sakellariou, R., Keane, J., Gurd, J., Freeman, L. (eds.): Euro-Par 2001 Parallel Processing, Vol. 2150 in LNCS, Springer (2001) 148-153
[2] Brunst, H., Winkler, M., Nagel, W. E., Hoppe, H.-C.: Performance Optimization for Large Scale Computing: The Scalable VAMPIR Approach. In: Alexandrov, V. N., Dongarra, J. J., Juliano, B. A., Renner, R. S., Kenneth Tan, C. J. (eds.): Computational Science - ICCS 2001, Part II, Vol. 2074 in LNCS, Springer (2001) 751-760
[3] Chandra, R.: Parallel Programming in OpenMP. Morgan Kaufmann (2001)
[4] Malony, A., Shende, S.: Performance Technology for Complex Parallel and Distributed Systems. In: Kotsis, G., Kacsuk, P. (eds.): Distributed and Parallel Systems - From Instruction Parallelism to Cluster Computing. Proc. 3rd Workshop on Distributed and Parallel Systems, DAPSYS 2000, Kluwer (2000) 37-46
[5] Accelerated Strategic Computing Initiative (ASCI): sPPM Benchmark. http://www.llnl.gov/asci_benchmarks/asci/limited/ppm/asci_sppm.html
[6] Message Passing Interface Forum: MPI: A Message Passing Interface Standard. International Journal of Supercomputer Applications (Special Issue on MPI) 8(3/4) (1994)
[7] Shende, S., Malony, A., Cuny, J., Lindlan, K., Beckman, P., Karmesin, S.: Portable Profiling and Tracing for Parallel Scientific Applications using C++. Proc. SIGMETRICS Symposium on Parallel and Distributed Tools, SPDT'98, ACM (1998) 134-145
[8] Zaki, O., Lusk, E., Gropp, W., Swider, D.: Toward Scalable Performance Visualization with Jumpshot. The International Journal of High Performance Computing Applications 13(3) (1999) 277-288
CrossWalk: A Tool for Performance Profiling Across the User-Kernel Boundary

A.V. Mirgorodskiy a and B.P. Miller a

a Computer Sciences Department, University of Wisconsin, Madison, Wisconsin 53706, USA, {mirg, bart}@cs.wisc.edu
Performance profiling of applications is often a challenging task. One problem in the analysis is that applications often spend a significant amount of their time in the operating system. As a result, conventional user-level profilers can, at best, trace a performance bottleneck to a particular system call. This information is often insufficient for the programmer to deal with the performance problem: system calls can be quite complex, making it hard to pinpoint the cause of the problem. To address this deficiency, we have designed a tool called CrossWalk that is able to do performance analysis across the kernel boundary. CrossWalk starts profiling at the user level, profiles the main function, then its callees, and walks further down the application call graph, refining the performance problem to a particular function. If it determines that this function is a system call, it walks into the kernel code and starts traversing the kernel call graph until it locates the ultimate bottleneck. The key technologies in CrossWalk are dynamic application instrumentation and dynamic kernel instrumentation. For the former, we use an existing library called Dyninst API. For the latter, we have designed a new framework called Kerninst API with an interface modeled after Dyninst. When combined, the two libraries provide a unified and powerful interface for building cross-boundary tools. We demonstrate the usefulness of the cross-boundary approach by analyzing the Squid proxy server with CrossWalk. By drilling down into the kernel, we were able to identify the ultimate cause of Squid's performance problems and remove them by modifying the application's source code.

1. INTRODUCTION

Many applications make heavy use of functions provided by the operating system. Naturally, the performance of such applications depends on how they make use of these functions and how efficiently these functions are implemented by the operating system. For example, I/O is key to the efficiency of many high-performance applications. Network performance is often the constraining factor for tools such as Web and proxy servers, and efficient use of synchronization primitives is crucial for multithreaded applications. Finding performance problems in OS-bound applications has always been a challenging task.
A user-level profiler might locate a region of the application's code where most of the system time is spent, but it might be unable to explain why this is happening or how to fix the problem. For example, if an application spent 90% of its time in the open system call, the profiler might be able to detect that and report it to the programmer. However, open is a complex system call; it can create new files, truncate existing ones or even do network I/O in distributed file systems. Without knowing the exact cause of the problem, the programmer may not know how to fix it. If there were no opaque boundary between the user and kernel spaces, we could continue profiling further down into the kernel. By refining the open bottleneck, the profiler might find that the performance problem is in file truncation that happens as an option in open. With this information, the programmer may be able to modify the application to work around the bottleneck. For example, it may be possible to avoid truncating files on open where not necessary. Alternatively, the programmer could tune certain kernel parameters to make the problem go away. The tuning might be done through a standard operating system interface, obviating the need for kernel recompilation or even a reboot. The goal of this work is to demonstrate that such cross-boundary profiling is feasible and indeed effective in finding performance problems. Our approach is based on the idea of using dynamic instrumentation to traverse the call graph in the search for bottlenecks. This technique has already been studied for call graphs of user applications [3]. We generalize it to seamlessly walk down the joined call graph of the application and the kernel. The key technologies here are dynamic application instrumentation and dynamic kernel instrumentation. For the former, we use an existing library called Dyninst API. For the latter, we have designed a new framework called Kerninst API with an interface modeled after Dyninst. The synergy between the two libraries provides a unified and powerful interface for developing cross-boundary tools. By bringing all the pieces together, we have designed a tool called CrossWalk, which can locate bottlenecks in an application, then seamlessly cross the kernel boundary and find the corresponding bottlenecks in the kernel code. With CrossWalk, a performance bottleneck can be refined from the main application entry point through the application's code and then all the way down into the kernel space to a kernel function that is the ultimate problem. While CrossWalk is still a simple prototype, it has already proved valuable in finding performance problems in our study of the Squid proxy server. Using CrossWalk's results, we were able to boost Squid's performance on a standard workload by more than a factor of 2.5.
2. RELATED WORK

There exist several tools that already accomplish certain parts of our goal and can be combined together. For example, one could profile an application with Gprof [5] and then profile the kernel with Kgmon [7] or Kernprof [6]. Such analysis might find a set of bottlenecks in the user space and another set of kernel-space bottlenecks. However, understanding how the two sets are related to each other is not an easy task. In fact, it is even possible that all the kernel bottlenecks found are induced by other processes and have no connection to the process under study. Therefore, using two disjoint tools is often insufficient for understanding the cause of the problem. While there are no complete tools for cross-boundary analysis, there exist several different technologies for implementing such tools. One can use sampling [5], tracing and profiling
through source-code modification [8, 9, 12], or dynamic instrumentation [3, 14]. In CrossWalk, we use dynamic instrumentation since it works on production systems and provides cycle-accurate timing results with little overhead. After narrowing the choice down to dynamic instrumentation, one can again choose from several application instrumentation frameworks [2, 4, 13, 10]. However, the number of options for kernel instrumentation is smaller [10, 11]. Of the two, only DProbes [10] appears to be currently maintained. While DProbes has sufficient functionality for our purposes, its instrumentation is trap-based, and from our experience the overhead of traps is prohibitively high for performance monitoring.

Figure 1. Walking the call graph: the search starts at main and repeatedly refines the current bottleneck.
3. BASIC TECHNOLOGY

An effective way to search for performance bottlenecks is to walk down the application call graph using dynamic instrumentation. Paradyn uses this technique to look for bottlenecks in user code [3]. The following two sections describe call-graph walking and dynamic instrumentation in greater detail.
3.1. Call-graph walking

Assume that we are searching for a performance bottleneck, that is, a function which takes up more than a certain amount of wall-clock time (e.g., 25% of the total running time of the application). The call-graph approach mimics what an experienced performance analyst would do when searching for a bottleneck. The corresponding step-by-step diagram is shown in Figure 1. First, the method examines all functions called directly from the main application entry point. If the inclusive time spent in such a function is above the threshold, the search marks it as a bottleneck. Next, it tries to refine this bottleneck by examining the immediate callees of that function. If one of them is a bottleneck, the search will descend into that function and continue doing so until it reaches a function that has no callees above the threshold (or no callees at all). That function is the final user-space bottleneck for which we were searching. If the identified function is a system call, CrossWalk's analysis does not stop there. The tool seamlessly walks into the kernel, starts at the system call entry point, descends into its callees if necessary, and keeps doing so until the final in-kernel bottleneck is identified. The result of the analysis is a chain of functions connecting main and the located bottleneck. We collect performance data through dynamic instrumentation. While the application is running, the tool injects code at a function's entry point to start a timer and code at the exit points to stop the timer. The code is removed when it is no longer needed. This approach does not require source code modification or even recompilation. We discuss it in more detail below.
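The refinement loop itself is compact. The following C++ sketch abstracts the instrumentation and static analysis into two callbacks; the function names and the string-based representation of functions are placeholders for illustration, not the CrossWalk API.

    #include <string>
    #include <vector>

    // Starting from main, repeatedly descend into a callee whose inclusive
    // wall-clock share exceeds the threshold (e.g. 0.25 of total running time);
    // stop when no callee qualifies. The returned chain connects main with the
    // located bottleneck and may cross from user code into the kernel.
    std::vector<std::string> refine_bottleneck(
            const std::string& root, double threshold,
            std::vector<std::string> (*callees)(const std::string&),
            double (*inclusive_share)(const std::string&)) {
        std::vector<std::string> chain;
        std::string current = root;
        chain.push_back(current);
        for (;;) {
            std::string next;
            for (const std::string& f : callees(current))
                if (inclusive_share(f) >= threshold) { next = f; break; }
            if (next.empty()) break;       // no callee above threshold: done
            chain.push_back(next);
            current = next;                // may be a system call entry point
        }
        return chain;
    }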
Figure 2. Frameworks for dynamic application and kernel instrumentation: (a) a Dyninst-based system; (b) a Kerninst-based system.

3.2. Application instrumentation: Dyninst API

CrossWalk uses Dyninst API for application instrumentation. Dyninst is a C++ library that allows you to write your own instrumentation tools. The typical structure of such a tool is shown in Figure 2a). The library lets you attach to a running application and perform the following operations:

Instrument: the tool can generate instrumentation code and ask the library to inject it at a specified point in an application. With this facility, CrossWalk puts its timing code into functions.

Browse code resources: the tool can locate the application's modules, functions, and basic blocks, and retrieve control flow information. For example, CrossWalk uses this functionality to locate main and to identify all functions called from a given function in order to walk down the call graph.

Inspect/modify application data: the tool can read and write an application's variables while the application is running. With this facility, CrossWalk peeks at the values of the timers that its instrumentation code created.
3.3. Kernel instrumentation: Kerninst API

Dyninst API allows us to instrument user applications, but to drill into the kernel one needs to use another mechanism. We already had some experience with kernel instrumentation after designing a kernel performance monitor called Kperfmon [14]. With a reasonable amount of effort, we factored out all instrumentation-related functionality into a library, which we called Kerninst API. Kerninst API is closely modeled after Dyninst API. It provides the same functionality, has the same interface and the same syntax where possible, to let the programmer use a single and consistent model for both application and kernel instrumentation. Finally, to enable tools like CrossWalk, we designed Kerninst API to coexist with Dyninst in the same application if necessary. A typical Kerninst-based system is shown in Figure 2b). The system has a few important properties. First, even though the instrumentation tool now injects code into the kernel, the tool itself is a user-space application. Second, Kerninst works on a standard production operating system. There is no need to recompile the kernel or even reboot the machine. The tool simply attaches to the kernel and starts instrumenting it. Third, Kerninst requires the user to have root privileges in the system. Otherwise, allowing a user to insert and execute code in the kernel would be a security hole.
Figure 3. Major building blocks of CrossWalk 4. I M P L E M E N T A T I O N Two major issues in implementing CrossWalk is how to time a function, and how to find a function's callees to walk down the call graph. The similarity of Dyninst and Kerninst APIs let us implement these features in a unified way for both the kernel and user spaces. 4.1. Timing in CrossWalk Figure 3a) shows the sketch of instrumentation that we insert into a function of interest. To provide cycle-accurate results with low overhead, the code uses processor hardware counters for timer readings. It samples the cycle counter at the entry point of the function, samples it again at the exit point and computes the delta. As the code executes, the per-function timer variable accumulates the total number of wall-clock cycles spent in the function. Note that this approach can be easily extended to handle other types of performance metrics, such as CPU time or number of cache misses. CrossWalk uses similar primitives for timing application and kernel code. The only difference is that in the kernel it has to constrain timing to the process under study. If multiple threads execute within a kernel function, the timer should only reflect the time spent there by our process's thread. CrossWalk achieves this goal by wrapping the kernel start/stop primitives with a check for the proper process identifier of the currently running thread. The check is fast, taking only five instructions. 4.2. Identifying a function's callees When a function proves to be the bottleneck, CrossWalk will try to refine it by looking at the function's callees. Both Dyninst and Kerninst API provide a method for identifying callees through static analysis. The method locates all call instructions in the function and extracts the destination addresses from them. While static analysis works in most cases, it fails to analyze indirect calls whose destinations are computed at run time. Such constructs are common in practice: calls through function pointers, callbacks, and C++ virtual functions all get translated to the indirect call instruction. CrossWalk solves this problem in the same way as Paradyn [3]; it instruments the indirect call sites to identify callees at run time. As Figure 3b) shows, when the call-site instrumentation is invoked, a callee address has already been computed, so the instrumentation simply adds it to a hash table. As a result, the hash table will soon contain all callees invoked from the watched sites. This approach has worked well. Without support for following indirect calls, CrossWalk would not be able to advance past the system call entry point, where system call handlers are dispatched through an indirect call.
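The bookkeeping performed by the entry/exit instrumentation can be sketched as follows. The generated instrumentation reads the hardware cycle counter directly (e.g., the TSC on x86 or the %tick register on UltraSPARC); here a standard clock stands in for it so the sketch is self-contained, and all names are illustrative assumptions rather than CrossWalk code.

    #include <chrono>
    #include <cstdint>

    // Stand-in for the hardware cycle counter read by the real instrumentation.
    inline std::uint64_t read_cycle_counter() {
        return static_cast<std::uint64_t>(
            std::chrono::steady_clock::now().time_since_epoch().count());
    }

    struct FunctionTimer {
        std::uint64_t start_cycles = 0;
        std::uint64_t total_cycles = 0;   // accumulated wall-clock cycles in the function
    };

    // Code conceptually inserted at the entry and exit points of a user function.
    inline void on_entry(FunctionTimer& t) { t.start_cycles = read_cycle_counter(); }
    inline void on_exit(FunctionTimer& t)  { t.total_cycles += read_cycle_counter() - t.start_cycles; }

    // Kernel-side variant: only account time when the currently running thread
    // belongs to the process under study -- the check CrossWalk wraps around the
    // start/stop primitives. 'current_pid' stands in for the kernel's own lookup.
    inline void on_kernel_exit(FunctionTimer& t, int current_pid, int watched_pid) {
        if (current_pid == watched_pid)
            t.total_cycles += read_cycle_counter() - t.start_cycles;
    }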
Figure 4. Squid's bottlenecks: (a) bottlenecks in the original version; (b) bottlenecks in the tuned version. Links are dashed where functions were omitted for brevity.

5. EXPERIMENTAL RESULTS: SQUID

A popular technique for optimizing Web transfers is proxy caching. It places an intermediary, a proxy server, between Web servers and their clients and caches server responses to the clients. While the technique takes extra load off Web servers and the network infrastructure, its effectiveness heavily depends on the performance of the proxy server. A study from four years ago [14] found and fixed some performance problems in a popular open-source proxy called Squid [15]. Here, we use CrossWalk to see if the current version of the application can be enhanced further. We ran Squid-2.5.STABLE3 on a Sun Ultra-10 workstation with an UltraSparc-IIi 440MHz processor, 256 MB of RAM, a 9GB IDE hard drive, a 100Mbit network card, and the 64-bit version of Solaris 8. To model typical Web activities, we subjected it to a standard workload known as the Wisconsin Proxy Benchmark [1]. The workload issues concurrent requests from 60 clients to 10 different Web servers through Squid. The benchmark's measure of performance is how much wall-clock time it takes to serve 12000 requests (200 requests per client). CrossWalk attached to Squid and found the chain of functions shown in Figure 4a) to be above the 25% wall-clock time threshold (white nodes belong to the application and gray ones to the kernel). The functions httpReadReply and storeSwapOut (receive an object from a Web server and store it in the on-disk cache) appear as bottlenecks. However, it was neither the network read nor the file write that was the problem; surprisingly, the bottleneck was in the open system call. By skimming through the source code for Squid's storeCreate we realized that these calls to open correspond to opening individual cache files for creating new objects in them. At that point, a user-level profiler would have to stop, but CrossWalk was able to drill further into the kernel and traced the open bottleneck to the ufs_create routine. Apparently, Squid spends most of its time creating new files in the on-disk cache. A possible way to remove the bottleneck that CrossWalk found is to pre-create all files in Squid's cache. Now, when a new object comes from a Web server, Squid can open an existing zero-sized file and save the object in it, avoiding the need to create a new file. An obstacle to this approach is that Squid removes old files from the cache periodically to keep the cache size under control. As a result, it will eventually fall back to the original mode of creating files. This obstacle is minor, however, since one could modify Squid's source to never remove old files, but simply truncate them to zero size. As we discovered, Squid already provides support for this truncation strategy, though it is not enabled by default. When we turned it on by recompiling Squid with appropriate parameters, the running time of the Wisconsin Proxy Benchmark dropped by more than a factor of 2: from 625 seconds for the original scheme (removing files) to 275 seconds for the modified one (truncating files). To optimize Squid further, we ran it under CrossWalk again and discovered the problems shown in Figure 4b). The new user-space bottleneck is the unlinkdUnlink function calling
select. Refinement into the kernel suggests that Squid simply blocks on a condition variable (cv_waituntil_sig) and gets switched out while doing so (swtch). As this chain indicates little kernel activity, the problem must be in the application. We examined the source for unlinkdUnlink and found that it waits until Squid's helper daemon, unlinkd, consumes some file-truncate requests from a filled-up queue. If the requests come in bursts, one might give unlinkd a chance to catch up simply by making the queue longer. When we extended it from 20 to 200 entries and recompiled Squid, the running time of the benchmark dropped by an additional 11%, from 275 to 245 seconds. Again, CrossWalk correctly identified the cause of this performance problem and provided enough information for us to fix it. To estimate the overhead of profiling, we ran the benchmark multiple times with and without CrossWalk and determined that CrossWalk increased the total running time by approximately 15%. To isolate the individual contributions of application and kernel instrumentation, we then modified CrossWalk to perform application-only or kernel-only call-graph walks. We found that the application instrumentation accounted for approximately 60% of the total slowdown and the kernel instrumentation for 40%. Overall, the overhead was reasonably low and did not prevent CrossWalk from finding the correct bottlenecks.
6. FUTURE WORK
We have identified two directions for further improvement of CrossWalk. The first is to support analysis of multithreaded applications, which are becoming increasingly common. Paradyn can already search for user-space bottlenecks in such applications [16]; providing matching support for the kernel-space analysis should be relatively straightforward. The second direction is to account for asynchronous kernel activities. Often, applications spend a lot of time in the kernel, but perform no system calls. A typical example is an out-of-core application, which does not fit into main memory and thus suffers from excessive paging activity. For such an application, CrossWalk would narrow the problem down to a user-level function that takes up most of the time, but it would not be able to refine it further down into the kernel, as the system call entry point will not be the bottleneck. One possible way to account for such activities is to walk the kernel call graph from all entry points simultaneously, but the feasibility of this approach is yet to be studied.

REFERENCES
[1] J. Almeida and P. Cao, "Wisconsin Proxy Benchmark", Technical Report 1373, Computer Sciences Department, University of Wisconsin-Madison, April 1998.
[2] B. Buck and J.K. Hollingsworth, "An API for runtime code patching", Journal of High Performance Computing Applications 14, 4, pp. 317-329, Winter 2000.
[3] H.W. Cain, B.P. Miller, and B.J.N. Wylie, "A Callgraph-Based Search Strategy for Automated Performance Diagnosis", Euro-Par 2000, Munich, Germany, August 2000.
[4] L. DeRose, J.K. Hollingsworth, and T. Hoover, "The dynamic probe class library: an infrastructure for developing instrumentation for performance tools", International Parallel and Distributed Processing Symposium, April 2001.
[5] S. Graham, P. Kessler, and M. McKusick, "gprof: A Call Graph Execution Profiler", SIGPLAN '82 Symposium on Compiler Construction, Boston, June 1982, pp. 120-126.
[6] Kernprof, http://oss.sgi.com/projects/kernprof/
[7] M. McKusick, "Using gprof to Tune the 4.2BSD Kernel", European UNIX Users Group Meeting, April 1984.
[8] B.P. Miller, M. Clark, J.K. Hollingsworth, S. Kierstead, S.-S. Lim, and T. Torzewski, "IPS-2: The Second Generation of a Parallel Program Measurement System", IEEE Transactions on Parallel and Distributed Systems 1, 2, pp. 206-217, April 1990.
[9] B. Mohr, D. Brown, and A. Malony, "TAU: A portable parallel program analysis environment for pC++", CONPAR 94 - VAPP VI, Linz, Austria, September 1994.
[10] R.J. Moore, "Dynamic Probes and Generalised Kernel Hooks Interface for Linux", 4th Annual Linux Showcase & Conference, Atlanta, October 2000.
[11] D.J. Pearce, P.H.J. Kelly, T. Field, and U. Harder, "GILK: A Dynamic Instrumentation Tool for the Linux Kernel", 12th International Conference on Computer Performance Evaluation, April 2002.
[12] D.A. Reed, R.A. Aydt, R.J. Noe, P.C. Roth, K.A. Shields, B. Schwartz, and L.F. Tavera, "Scalable Performance Analysis: The Pablo Performance Analysis Environment", Scalable Parallel Libraries Conference, Los Alamitos, CA, October 1993, pp. 104-113.
[13] M. Ronsse and K. De Bosschere, "JiTI: A Robust Just in Time Instrumentation Technique", Workshop on Binary Translation, Philadelphia, October 2000.
[14] A. Tamches and B.P. Miller, "Using Dynamic Kernel Instrumentation for Kernel and Application Tuning", International Journal of High Performance Computing Applications 13, 3, Fall 1999.
[15] D. Wessels, "Squid Internet Object Cache", http://squid.nlanr.net/Squid/, August 1998.
[16] Z. Xu, B.P. Miller, and O. Naim, "Dynamic Instrumentation of Threaded Applications", 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Atlanta, May 1999.
Hardware-Counter Based Automatic Performance Analysis of Parallel Programs*

F. Wolf a and B. Mohr b

a University of Tennessee, Innovative Computing Laboratory, Knoxville, TN 37996, USA, [email protected]

b Forschungszentrum Jülich, Zentralinstitut für Angewandte Mathematik, 52425 Jülich, Germany, [email protected]

*This work was supported in part by the U.S. Department of Energy under Grants DoE DE-FG02-01ER25510 and DoE DE-FC02-01ER25490 and is embedded in the European IST working group APART under Contract No. IST-2000-28077.

The KOJAK performance-analysis environment has been designed to identify a large number of performance problems on parallel computers with SMP nodes. The current version concentrates on parallelism-related performance problems that arise from an inefficient usage of the parallel programming interfaces MPI and OpenMP while ignoring individual CPU performance. This article describes an extended design of KOJAK capable of diagnosing low individual-CPU performance based on hardware-counter information and of integrating the results with those of the parallelism-centered analysis.

1. INTRODUCTION

The performance of parallel applications is determined by a variety of different factors. The performance of single components frequently influences the overall behavior in unexpected ways. Application programmers on current parallel machines have to deal with numerous performance-critical aspects: different modes of parallel execution, such as message passing, multi-threading or even a combination of the two, and performance on individual CPUs that is determined by the interaction of different functional units. In particular, as the gap between microprocessor and memory speed increases, the understanding of processor-memory interaction becomes crucial to many optimization tasks. As a consequence, advanced performance tools are needed that integrate all these aspects in a single view. The KOJAK performance-analysis environment has been designed to identify a large number of performance problems on typical parallel computers with SMP nodes. Performance problems are specified in terms of execution patterns to be automatically recognized in event traces. The detected patterns are then classified and quantified by type and severity, respectively. The results
are presented to the user in a single integrated view along three interconnected dimensions: class of performance behavior, call path, and thread of execution. Each dimension is arranged in a hierarchy, so that the user can investigate the behavior on varying levels of detail. While KOJAK provides a well-integrated tool to analyze parallelism-related performance problems coming from an inefficient usage of the parallel programming interfaces MPI and OpenMP, it still lacks the ability to investigate CPU and memory performance in more detail. Hardware counters integrated in modern microprocessors are an essential tool for monitoring this aspect of performance behavior. These counters exist as a small set of registers that count occurrences of specific signals related to the processor's function. Monitoring these counters facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This article describes an extended design and implementation of KOJAK capable of diagnosing low application performance based on this type of performance data and shows how the extensions are integrated with the parallelism-centered analysis. After presenting related work in Section 2, we describe KOJAK's overall architecture and the underlying approach in Section 3. Section 4 outlines the extensions made to integrate the new type of performance problems. Finally, Section 5 discusses limitations of the approach and proposes possible enhancements.

2. RELATED WORK

In the past few years, much research has been done on performance analysis with hardware counters. Many tool builders use libraries such as PAPI [1] or PCL [2], which provide a standard application programming interface for accessing hardware performance counters on most modern microprocessors. Higher-level instrumentation tools, such as Dynaprof [3], CATCH [4], SCALEA [5], SvPablo [6] and TAU [7], already provide a mapping of hardware-counter information onto static or dynamic program entities. Also, the event-trace visualization tools VAMPIR [8] and Paraver [9] provide hardware-counter information as part of their time-line views. There are also other automatic end-user tools that use hardware counters to analyze applications. Paradyn [10] was the first automatic performance tool based on a hierarchical decomposition of the search space. It searches for performance problems along various program-resource hierarchies including the call graph. Performance problems are expressed in terms of a threshold and one or more metrics, such as CPU time or message rates. The latest release uses PAPI to find memory bottlenecks. Aksum [11] uses hardware counters in its multi-experiment analysis. A distinctive feature of KOJAK in contrast to Paradyn and Aksum is the uniform mapping of all performance behavior onto execution time, which allows the convenient correlation of different behavior in a single view. Also, Fürlinger et al. [12] propose a design for an automatic online-analysis tool targeting clustered SMP architectures. On the lowest level, the tool records hardware-counter information in a distributed fashion, which is passed on to a hierarchy of agents that transform the low-level information stepwise into higher-level information. However, the design is at too early a stage to draw a qualified comparison.
3. PERFORMANCE ANALYSIS WITH KOJAK
Figure 1: KOJAK overall architecture.

The KOJAK performance-analysis environment is depicted in Figure 1. It shows the different components with their corresponding inputs and outputs. The arrows illustrate the whole performance-analysis process from instrumentation to result presentation. The KOJAK analysis process is composed of two parts: a semi-automatic instrumentation of the user application followed by an automatic analysis of the generated performance data. KOJAK's instrumentation software runs on most major UNIX platforms and works on multiple levels, including source code, compiler, and linker. Instrumentation of user functions is done either during compilation using a compiler-supplied profiling interface, on the source-code level using TAU [7], or on the binary level using DPCL [13, 14]. Events related to MPI are captured using a PMPI wrapper library; OpenMP constructs can be instrumented using OPARI [15] or DPOMP [14]. To get access to hardware counters, the application must be linked to the PAPI (Performance API) library [1]. A more detailed description of the instrumentation process can be found in [16]. Running the resulting executable after linking generates a trace file in the EPILOG format. The EPILOG format is suitable to represent the executions of MPI, OpenMP, or hybrid parallel applications distributed across one or more (possibly large) coupled SMP systems. In addition to coupled SMPs, target systems can also be meta-computing environments as well as more traditional non-coupled or non-SMP systems. After program termination, the trace file is fed into the EXPERT analyzer. The analyzer does not read the raw trace file as generated by the tracing library. Instead, it accesses the events through the EARL abstraction layer [16]. EARL is a high-level interface to access an event trace. Events are identified by their relative position and are delivered as a list of key-value pairs. These pairs represent event attributes, such as time and location. In addition to providing random access to single events, EARL simplifies analysis by establishing links between related events and identifying events that describe an application's execution state. The abstraction layer provides both a Python and a C++ interface and can be used independently of KOJAK for a large variety of performance-analysis tasks. The highly integrated view of performance behavior KOJAK offers to the user is achieved by uniformly quantifying all behavior in terms of execution time. The entire performance space is represented as a mapping of a performance problem, a call path, and a location (i.e., process or thread) onto the fraction of time spent on the problem by that particular thread in that particular call path. This time is called the severity of the tuple (problem, call path, thread).
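One compact way to hold such a result is a map keyed by the three dimensions, as in the C++ sketch below. The concrete key types and the aggregation helper are illustrative assumptions; they are not the internal KOJAK or EXPERT data structures.

    #include <map>
    #include <string>
    #include <tuple>

    // (performance problem, call path, thread id) -> severity in seconds.
    using SeverityKey = std::tuple<std::string, std::string, int>;
    using SeverityMap = std::map<SeverityKey, double>;

    // Aggregating along one dimension (here: total time lost to one problem)
    // is a simple fold over the map, mirroring what the presenter hierarchies do.
    double total_severity(const SeverityMap& s, const std::string& problem) {
        double sum = 0.0;
        for (const auto& entry : s)
            if (std::get<0>(entry.first) == problem)
                sum += entry.second;
        return sum;
    }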
Each of the three dimensions is arranged in a hierarchy: the performance problems in a hierarchy of general and specific ones, the call tree in its natural hierarchy, and the locations in an aggregation hierarchy consisting of the levels machine, SMP node, process, and finally thread. After completion, the analyzer generates an analysis report, which serves as input for the EXPERT presenter (Figure 2). The presenter allows the user to conveniently navigate through the entire search space along all of its dimensions. The automatic analysis can be combined with VAMPIR [8], which allows the user to investigate the patterns identified by KOJAK manually in a time-line display. To do this, the user only needs to convert the EPILOG trace file into the VTF3 format.

4. COVERING INDIVIDUAL CPU PERFORMANCE

The integration of individual-CPU performance affected all levels of the tool environment. The following subsections briefly explain the necessary changes and extensions.
4.1. Trace format and library The EPILOG trace format has been extended to accommodate hardware-counter values and other system metrics, such as memory utilizationt, as part of all region-entry or exit records. A metric-description record can be used to define a system metric. The metric is assigned a name, a description, and a data type (i.e., float or integer). In addition, the user can specify whether the metric is a counter or a sample that applies to an interval or a distinct point in time, respectively. If it refers to an interval, the user can specify whether the interval starts at program start or whether the value covers only the period from the last measurement or to the next measurement. The number and order of metric-description records defines the layout of a metric-value array attached to the entry and exit records. The solution provides a high degree of flexibility, since it is not restricted to a particular set of metrics. However, to improve tool interoperability, EPILOG defines a list of names for common hardware counters with welldefined semantics. Each metric value adds eight bytes to each enter and exit record, which is one half or two thirds of their original length, respectively. Please note that all other event records remain unchanged. To implement the new features of the trace format, a module to access hardware counters has been added to the EPILOG tracing library. The current implementation uses PAPI for the low-level access. However, the flexible module interface allows the easy integration of other hardware counter libraries in future versions of our instrumentation system. The user specifies the desired counters via an environment variable as a colon-separated list of predefined counter names. The module achieves thread safety by creating per-thread event sets. The module interface consists mainly of three functions: e l g _ m e t r i c _ o p e n () ; event_set = elg_metric_create() ; elg m e t r i c a c c u m (event set, v a l u e
array) ;
To initialize the module and check for availability of the requested event set, EPILOG uses the open call. After that, the create call is used to create a per-thread event set. The object returned is then supplied to all accumulate calls to read the counters for a particular thread. tThanks to Holger Brunst for his helpful suggestions.
757
4.2. Abstraction layer The counters or system metrics are integrated in the EARL abstraction layer as additional attributes of enter and exit events and are conveniently accessible using their name as a key. In addition, the trace-file object provides methods to query the number and kind of hardware metrics available in the trace file. The following Python code example shows how to access a metric value of an event. n = t r a c e . g e t _ n m e t s () mobj = trace, g e t _ m e t ( i ) m n a m e = mobj ['name'] e = trace, e v e n t (k) p r i n t e [mname]
# # # # #
get n u m b e r of m e t r i c s get m e t r i c i get n a m e a t t r i b u t e of m e t r i c i get k - t h e v e n t p r i n t v a l u e of m e t r i c i for e v e n t
k
4.3. Analyzer As pointed out in Section 3, the highly integrated view of performance behavior provided by KOJAK comes from the fact that all behavior is uniformly mapped onto execution time. However, in view of highly optimized processor architectures capable of out-of-order execution, it is hard to determine the time penalty introduced by, for example, cache misses. Even estimates are often inaccurate and highly platform dependent. Instead, KOJAK identifies tuples (call path, thread) whose occurrence ratio of a particular hardware event is below or above a certain threshold. The execution time of these tuples delivers an upper bound of the problem's real penalty, allows the user to compare cpu-performance problems to parallelism-related problems, and narrows attention down to the affected call path and thread. We specified two experimental performance problems to be used in the analyzer.
L1 data cache misses per time above average: It computes for every tuple (call path, thread) the cache misses per time. It also computes the total number of misses and divides it by the total execution time to obtain the average miss rate. Then, the analyzer assigns to all tuples whose miss rate is above the average a severity value that is equal to the entire execution time associated with the tuple. The severity of the remaining ones is set to zero. Floating-point operations per time below 25 % peak: This property computes for every tuple (call path, thread) the number of floating point operations per time. Since there is no way of automatically determining the peak flop rate, it must be set at installation time. Then, the analyzer assigns to all tuples whose flop rate is 25 % below the peak a severity value that is equal to the entire execution time associated with the tuple. The severity of the remaining ones is again set to zero. The way how data is displayed by KOJAK requires siblings in the problem hierarchy to be non-overlapping. That means that a thread cannot contribute to two sibling problems during overlapping wall-clock intervals. However, inefficient cache behavior can easily coexist with weak floating-point performance or parallelism-related problems. Since the severity assigned by KOJAK to a tuple (call path, thread) with respect to a counter-related performance problem is always zero or equal to the entire execution time of the tuple, the above requirement could be relaxed to accommodate cache performance on the same level as floating-point performance. On some platforms, not all counters needed to analyze these performance problems can be recorded simultaneously. Also, to limit the trace-file size the user might not want to record
758 all of them in the same run. For this reason, the analyzer automatically checks for availability of certain metrics and ignores the corresponding performance problems if the data are not available.
Figure 2. KOJAK's display of cache behavior for the
ASCI
benchmark SWEEP3D.
Figure 2 shows the cache behavior of the SWEEP3D ASCI benchmark [17] on a Power4 platform. Since PAPI cannot measure floating-point instructions and L1 data cache misses simultaneously on Power4, the analysis was restricted to L1 data cache misses only. The left pane represents the problem (i.e., property) hierarchy, the middle pane represents the call tree, and the fight pane represents the location hierarchy, consisting of the levels machine, SMP node, and process. Since SWEEP3D is a pure MPI application, the thread level is hidden. Each node in the display carries a severity value, which is a percentages of the total execution time. The value appears twice as a number and as a colored icon left to it allowing the easy identification of hot spots. The translation of colors to numbers is defined in the color scale at the bottom. By expanding or collapsing nodes in each of the three trees, the analysis can performed on different levels of granularity. A collapsed node always represents the entire subtree, an expanded node only represents itself not including its children. The left tree shows how much time was spent on a particular performance problem by the entire program, that is across all tuples (call path, thread). In the example, the execution-time fraction of all tuples with above-average L1 Data Cache miss rate is 15.3 %. Selecting this performance problem, as shown in the figure, causes the middle pane to display its distribution across the call tree, whose nodes are labeled with a function name together with the line number from which it was called. Apparently, some of the MPI calls exhibit a miss rate above the average, whereas the computational parts seem to be without major findings. Finally, the right tree shows the severity of the selected call path broken down to different processes. Interesting is that among the identified call paths the two with the highest execution time (i.e., the selected one with 5.8 % and another one with 4,7 %) have also been identified to be the source of a non-negligible Late Sender problem, that is, nearly all of the execution time was
759 actually spent waiting on a message to be received (not shown in the figure). Therefore, the question arises whether a more cache-friendly receive function would be able to significantly speed up the application, since it wouldn't necessarily speed up the delivery of the message it was waiting for most of the time. 5. C O N C L U S I O N As the previous example suggests, KOJAK's ability to analyze parallelism and individualcpv performance problems simultaneously can provide useful insights into the performance behavior of a parallel application and help avoid hastily conclusions that might occur based on lesser integrated data. However, while the way KOJAK defines the severity of counter-based performance problems allows a high level of integration with parallelism-related behavior, it does not give much information on the actual run-time penalty. Also, the classification 'above or below a certain threshold' does not even say how far above or below. Therefore, we believe that the time-based view should be combined with additional views that provide the actual occurrence numbers and rates of the hardware events. Also, the current set of counter-based problems (i.e., cache and floating point) needs to be extended to cover additional aspects of individual-cpu performance, such as, for example, TLB misses. Since often not all desired counters can be recorded simultaneously, methods are needed to combine the data obtained from multiple experiments offline thereby taking into account the limited reproducibility of a single experiment. REFERENCES
[1] [2]
[3] [4]
[5] [6]
[7]
[8]
S. Browne, J. Dongarra, N. Garner, G. Ho, P. Mucci, A Portable Programming Interface for Performance Evaluation on Modern Processors, The International Journal of High Performance Computing Applications 14 (3) (2000) 189-204. R. Berrendorf, H. Ziegler, PCL - The Performance Counter Library: A Common Interface to Access Hardware Performance Counters on Microprocessors, Tech. Rep. FZJ-ZAMIB-9816, Forschungszentrum Jtilich (October 1998). P. Mucci, Dynaprof home page, http:l/www.cs.utk.edu/~mucci/dynaprofl. L. A. DeRose, F. Wolf, CATCH - A Call-Graph Based Automatic Tool for Capture of Hardware Performance Metrics for MPI and OpenMP Applications, in: Proc. of the 8th International Euro-Par Conference, Lecture Notes in Computer Science, Paderborn, Germany, 2002, pp. 167-176. H.-L. Truong, Y. F. G. Madsen, A. D. Malony, H. Moritsch, S. Shende, On Using SCALEA for Performance Analysis of Distributed and Parallel Programs, in: Proc. of the Conference on Supercomputers (SC2000), Denver, Colorado, 2001. L. DeRose, D. A. Reed, SvPablo: A Multi-Language Architecture-Independent Performance Analysis System, in: Proc. of the International Conference on Parallel Processing (ICPP'99), Fukushima, Japan, 1999. S. Shende, A. D. Malony, J. Cuny, K. Lindlan, P. Beckman, S. Karmesin, Portable Profiling and Tracing for Parallel Scientific Applications using C++, in: Proc. of the SIGMETRICS Symposium on Parallel and Distributed Tools, ACM, 1998, pp. 134-145. A. Arnold, U. Detert, W. E. Nagel, Performance Optimization of Parallel Programs: Trac-
760
[9]
[ 10]
[ 11]
[ 12] [13] [14]
[15] [ 16] [ 17]
ing, Zooming, Understanding, in: R. Winget, K. Winget (Eds.), Proc. of Cray User Group Meeting, Denver, CO, 1995, pp. 252-258. European Center for Parallelism of Barcelona (CEPBA), Paraver - Parallel Program Visualization and Analysis Tool - Reference Manual, http://www.cepba.upc.es/paraver/ (November 2000). B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvine, K. L. Karavanic, K. Kunchithapadam, T. Newhall, The Paradyn Parallel Performance Measurement Tool, IEEE Computer 28 (11) (1995) 37-46. C. Seragiotto Jfnior, M. Geissler, G. Madsen, H. Moritsch, On Using Aksum for SemiAutomatically Searching of Performance Problems in Parallel and Distributed Programs, in: Proc. of 11th Euromicro Conf. on Parallel Distributed and Network based Processing (PDP 2003), Genua, Italy, 2003. K. Ffirlinger, M. Gerndt, Distributed Application Monitoring for Clustered SMP Architectures, in: Proc. of the 9th International Euro-Par Conference, Klagenfurt, Austria, 2003. L. DeRose, T. Hoover Jr., J. K. Hollingsworth, The Dynamic Probe Class Library- An Infrastructure for Developing Instrumentation for Performance Tools. L. DeRose, B. Mohr, S. Seelam, An Implementation of the POMP Performance Monitoring Interface for OpenMP Based on Dynamic Probes, in: Proc. of the 5th European Workshop on OpenMP (EWOMP'03), Aachen, Germany, 2003. B. Mohr, A. Malony, S. Shende, F. Wolf, Design and Prototype of a Performance Tool Interface for OpenMP, The Journal of Supercomputing 23 (2002) 105-128. F. Wolf, Automatic Performance Analysis on Parallel Computers with SMP Nodes, Ph.D. thesis, RWTH Aachen, Forschungszentrum Jiilich, ISBN 3-00-010003-2 (February 2003). Accelerated Strategic Computing Initiative (ASCI), The ASCI sweep3d Benchmark Code, http ://www. IIn I. gov/asci_ bench m a rks/.
Parallel Computing: SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
761
O n l i n e P e r f o r m a n c e O b s e r v a t i o n o f L a r g e - S c a l e Parallel A p p l i c a t i o n s A.D. Malony a, S. Shende% and R. BelP a {real o n y , s a m e e r , b e r t i e }@cs. u o r e g o n , edu Department of Computer and Information Science University of Oregon Eugene, Oregon 97405, USA Parallel performance tools offer insights into the execution behavior of an application and are a valuable component in the cycle of application development, deployment, and optimization. However, most tools do not work well with large-scale parallel applications where the performance data generated comes from upwards of thousands of processes. As parallel computer systems increase in size, the scaling of performance observation infrastructure becomes an important concern. In this paper, we discuss the problem of scaling and perfomance observation, and the ramifications of adding online support. A general online performance system architecture is presented. Recent work on the TAU performance system to enable large-scale performance observation and analysis is discussed. The paper concludes with plans for future work. 1. INTRODUCTION The scaling of parallel computer systems and applications presents new challenges to the techniques and tools for performance observation. We use the term performance observation to mean the methods to obtain and analyze performance information for purposes of better understanding performance effects and problems of parallel execution. With increasing scale, there is a concern that standard observation approaches for instrumentation, measurement, data analysis, and visualization will encounter design or implementation limits that reduce their effective use. What drives this concern, in part, is the problem of measurement intrusion, and the fact that simple application of current approaches may result in more perturbed performance data. It is also clear that scaling of standard methods will raise issues of performance data size, the amount of processing time required to analyze the data, and the usability of performance presentation techniques. Concurrently, there is an interest in the online observation of parallel systems and applications for purposes of dynamic assessment and control. With respect to performance observation, we think of performance monitoring as constituting online measurement and performance data access, and performance interaction as additional infrastructure for affecting performance behavior externally. Certainly, there are several motivations for online performance observation, including the control of intrusion via dynamic instrumentation or dynamic measurement. However, the influence of scaling must again be considered when evaluating the benefits of alternative approaches.
762 In this paper, we consider these issues with respect to profiling and tracing methods for online performance observation. Our main interest in this work is to understand how best to scale a parallel performance measurement model and its implementation, and to extend its functionality to offer runtime control and interaction. We present results from the development of scalable online profiling in the TAU performance system. We also have develop online tracing capabilities in TAU, but this work is discussed elsewhere.
2. SCALING, INTRUSION, AND ONLINE OBSERVATION As the starting point for understanding the influences of scaling on performance observation, it is reasonable to consider the standard methods for performance measurement and analysis: profiling and tracing. Profiling makes measurements of significant events during program execution and calculates summary statistics for performance metrics of interest. These profile analysis operations occurs at runtime. In contrast, tracing captures information about the significant events and stores that information in a time-stamped trace buffer. The information can include performance data such as hardware counts, but analysis of the performance data does not occur until after the trace buffer is generated. For both profiling and tracing, it is usually the case that the performance measurements (profile or trace) are generated and kept at the level of application threads or processes. What happens, then, as the application scales? We consider scaling mainly in terms of number of threads of execution. In general, one would expect that the greater the degree of parallelism, the more performance data overall that will be produced. This is because performance is typically observed relative to each specific thread of execution. Thus, in the case of profiling a new profile will be produced for each thread or process. Similarly, tracing will, in general, produce a separate event sequence (and trace buffer) for each thread or process. Certainly, these consequences of scaling have direct impact on the management of performance data (profile or trace data) during a large-scale parallel execution. Scaling, it is expected, will also cause changes in the number, the distribution, and perhaps the types of significant events that occur during a program's run, for instance, with respect to communication. Furthermore, larger amounts of performance data will result in greater analysis time and complexity, and more difficulty in presenting performance in meaningful displays. However, the real practical question is whether our present performance observation methods and tools are capable of dealing with these issues of scale. Most importantly, this is a concem for measurement intrusion and performance perturbation. Any performance measurement intrudes on execution performance and, more seriously, can perturb "actual" performance behavior. While low intrusion is preferred, it is generally accepted that some intrusion is a consequence of standard performance observation practice. Unfortunately, perturbation problems can arise both with only minor intrusion and small degrees of parallelism. How scaling affects intrusion and perturbation is an interesting question. Traditional measurement techniques tend to be localized. For instance, thread profiles are normally kept as part of the thread (process) state. This suggests that scaling would not compound globally what intrusion is occuring locally, even with larger numbers of threads (processes). On the other hand, it is reasonable to expect that the measurement of parallel interactions will be affected by intrusion, possibly resulting in a misrepresentation of performance due to performance perturbation. The bottom line is that performance measurement techniques must be used in an intelligent
763
I il .o
eormance Visualization
/
Performance
Analysis ..........
~ ..........
..................................................
Figure 1. Online Performance Observation Architecture manner so that intrusion effects are controlled as best as possible. But this must involve a necessary and well-understood tradeoff of the need for performance data for solving performance problems against the "cost" (intrusion and possible perturbation) of obtaining that data. Online support for performance observation adds interactivity to the performance analysis process. Several arguments justify the use of online methods. Post-mortem analysis may be "too late," such as when the status of long running jobs needs to be determined to decide on early termination. There may also be opportunities for steering a computation to better results or better performance interactively by observing execution and performance behavior online. Some have motivated online methods as a way to implement dynamic performance observation where both instrumentation and measurement can be controlled at runtime. In this respect, online approaches may offer a means to better manage performance data volume and measurement intrusion. Most of the arguments above assume, of course, that the online support can be implemented efficiently and results in little intrusion or perturbation of the parallel computation. This is more difficult with online methods as they involve more directly coupled mechanisms for access and interaction. Again, one needs to understand the tradeoffs involved to make an intelligent choice of what online methods to use and how. 3. ONLINE P E R F O R M A N C E OBSERVATION A R C H I T E C T U R E The general architecture we envision for online performance observation is shown in Figure 1. The online nature is determined by the ability to access the performance data during execution and make it available to analysis and visualization tools, which are typically external. Additionally, performance interacton is made possible through a performnace control path back into the parallel system and software. Here, instrumentation and measurement mechanisms may be changed at runtime. How performance data is accessed is an important factor for online operation. Different access models are possible with respect to the general architecture. A Push model acts as a producer/consumer style of access and data transfer. The application decides when, what, and how much data to send. It can do so in several ways, such as through files or direct communication. The external analysis tools are consumers of the performance data, and its availability can be signalled passively or actively. In contrast, a Pull model acts as a client/server style of access and transfer. Here, the application is a performance data server, and the external analysis tool decides when to make requests. Of course, doing so requires a two-way communication mechanism directly with the application or some form of performance control component. Combined
764
contexts
I000.
application access ~ I/~, to profile data ~ ~
000
~
i ~r-~ I I
I
~176176
i
J ~ - - - J ~
profile samplesin memory ~ then written to files " [~
[ ParaProf
~ ~ ~ j
storeprofile sample in NFS file system ~.i
ProfileAnalysis Tools-I ParaVis
I
Figure 2. Online Profiling in TAU
Push~Pull models are also possible. Online profiling requires performance profile data, distributed across the parallel application in thread (process) memory, to be gathered and delivered to the profile analysis tool. Profiling typically involves stateful runtime analysis that may or may not be consistent at the time the access is requested. To obtain valid profile data, it may be necessary to update execution state (e.g., callstack information) or make certain assumptions about operation completion (e.g., to obtain communication statistics). Assuming this is possible, online profiling will then produce a sequence of profile samples allowing interval-based and multi-sample performance analysis. The delay for profile collection will set a lower bound on interval frequency. This delay is expected to increase with greater parallelism. Similarly, online tracing requires the gathering and merging of trace buffers distributed across the parallel application. The buffers may be flushed afterwards, thereby allowing only the last trace records since the last flush to be read. Such interval tracing may require "ghost events" to be generated before first event and after the last event to make the trace data consistent. If the tracing system dynamically registers event identifiers per execution thread, it will be necessary to make these identifiers uniform before analysis. (Static schemes do not have this problem, but require instead that all possible events be defined beforehand.) 4. ONLINE PROFILING IN TAU We have extended the TAU performance system [12] to support both online profiling and tracing. Given space constraints, we only describe our approach to online profiling is this paper. Our online tracing work can be found in [ 13]. 4.1. Approach The high-level approach we have taken for online parallel profiling is shown in Figure 2. The TAU performance system maintains profiling statistics on a context basis for each thread in a context [12]. Normally, TAU collects performance profiles at the end of the program run into profile files, one for each thread of execution. For online profiling, TAU provides a "profile dump" routine that, when called by the application, will update the profile statistics for each thread, to bring them to internally consistent states, and then output the profile data to files. The performance data access model we have implemented and used in TAU is a Push model.
765
Application ~ ~
__
~~~~~ t+Perforn Ip TAU ~ '.. Con~ erformance ~ ~ / System J ,,
,.proleO
I I 1 1 1 [~" parallel
Shared ~ Perform~ File Systemj "[ Data Re~ Figure 3. ParaVis Online Profile Analysis and Visualization
The application scenario we want to target is one where there are major phases and/or interations in the computation where one would like to capture the current profile at those time steps. Thus, at these points, the application calls the TAU profile dump routine to output the performance state. Each call of the dump routine will generate a new set of profile files or append to files containing earlier profile dumps. The updating of the profile dump files is used to "signal" the external profile analysis tools. One of the advantages of this approach is that it can be made portable and robust. The only requirement is support for a shared file system, using NFS or some other protocol. It is possible to implement a push model in the TAU performance system using a signal handler approach, but it introduces other system dependencies that are less robust. A valid argument against this approach is that it has problems when the application scales, as the number of files increase and the file system becomes the bottleneck. There are four mechanisms we are investigating to address this problem. First, thread profiles for a context can be merged into a single context profile file. This directly reduces the number of files when there are multiple threads per context. Second, the profile dump routine allows event selection, thereby reducing the amount of profile data saved. The third mechanism is to utilize a data reduction network facility, such as Wisconsin' MRNet [6], to gather and merge thread/context profiles using the parallel communication hardware, before producing output files. This can both address problems with scaling file systems and problems with large number of files, by merging profile data streams in parallel until generating profile output files. Finally, the fourth mechanism is to leverage the more powerful I/O hardware and software infrastructure in the parallel system that one would expect to be present in the system as it is scaled (e.g., parallel file system, multiple I/O processors, clustered file system software, etc.). 4.2. Tools Our work has produced two profile analysis tools that can be used online. ParaProf [8] is the main TAU tool used for offline performance analysis. It is capable of handling profiles from multiple performance experiments and gives various interactive capabilities for data exploration. ParaProf can accept profile data from raw files, our performance database, or through a socket-based interface. The ParaVis tool [9] was developed to experiment with scalable performance analysis and visualization using three-dimensional graphics. The ParaVis arcitecture is shown in Figure 3; see [9] for details. A key result from our work with this tool is the importance of selection and
766 focus in different components of the tool. Also, use of the tool demonstrated the online benefits of being able to see how performance behavior unfolds during a computation. 4.3. Application access to profile data Part of the online performance observation model includes the possibility of the application itself accessing its measured performance data. This is presently supported in TAU at the context level, where a thread can request the profile data for some measured event. Our intention is to extend this capability to include access to performance data on remote application contexts.
5. RELATED W O R K The research ideas and work presented here relate to several areas. There has been a long time interest in the monitoring of parallel systems and aplications. This is due to the general hypothesis that by observing the runtime behavior or performance of the system or application, it is possible to identify aspects of parallel execution that may allow for improvement. Several projects have developed techniques that allow parallel applications to be responsive to program behavior, available resources, or performance factors. The Falcon project [3] is an example of computational steering systems [ 15] that can observe the behavior of an application and provide hooks to alter application semantics. These "actuators" will lead to changes in the ongoing execution. Because computational steering systems enable direct interaction with the application, they are often developed with visualization frontends that provide graphical renderings of application state and objects for execution control. Online performance observation systems look to achieve several advantages for performance analysis. Paradyn [7] works online to search for peformance bottlenecks, while controlling the measurment overhead by dynamically instrumenting only those events that are useful for testing the current bottleneck hypothesis. Thus, the performance analysis done by Paradyn at runtime both collects profile statistics and interprets the performance data to decide on the next coarse of action. Where as Paradyn attempts to identify performance problems, Autopilot [ 1] is an online performance observation and adaptive control framework that uses application sensors to extract quantitative and qualitative performance for automated decision control. While both Paradyn and Autopilot are oriented towards automated performance analysis and tuning, neither address the problem of scalable performance observation or provide capabilties to analyze or visualize large-volume performance information. Indeed, the difficulty of linking application embedded monitoring to data consumers will ultimately determine what amount of runtime information can be utilized. This involves a complicated tradeoff of instrumentation and measurement granularity versus the overhead of application / performance data transport versus the information requirements for desired analysis [4]. Projects such as the Multicast Reduction Network (MRNet) [6] will help in providing efficient infrastructure for data communication and filtering. Similarly, the Peridot [ 10] project is attempting to develop a distributed application monitoring framework for shared-memory multiprocessor (SMP) clusters that can provide scalable trace data collection and online analysis. The system will have selective instrumentation and analysis control, helping to address node- and system-level monitoring requirements. A different approach to scalable observation is taken in [5]. Here, statistical sampling techniques are used to gain representative views of system performance characteristics and behavior. In general, we believe the benefits seen in the application of online computation visualiza-
767 tion and steering, itself requiring demanding monitoring support, could also be realized in the parallel performance domain. Our goal is to consider the problem of online, scalable performance observation as a whole, understanding the tradeoffs involved and designing a framework architecture to address them. 6. CONCLUSION The combination of scalable performance observation and online operation sets a high standard for effective use of present day performance tools. Many performance systems are not built for scale and work primarily offline. Our experience is one of extending the existing TAU performance system to address problems of scale through improved measurement selectivity, new statistical clustering functions, parallel analysis, and three-dimensional visualization. In addition, online support in TAU is now possible for both profiling and tracing using a Push model of data access. We have demonstrated these capabilities for applications over 500 processes. However, it is by no means correct to consider our TAU experience as evidence for a general purpose solution. As with TAU, it is reasonable to expect that other traditional offline tools could be brought online under the fight system conditions. In the wrong circumstances, the approaches may be ineffective either because they process larger volumes of data or require more analysis power. Any solution to scalable, online performance observation will necessarily be application and system dependent, and will require an integrated analysis of engineering tradeoffs that include concems for intrusion and quality of performance data. Our goal is to continue to advance the TAU performance system for scalability and online support to better understand where and how these tradeoffs arise and apply. REFERENCES
[1] R. Ribler, H. Simitci, and D. Reed, "The Autopilot Performance-Directed Adaptive Con[2]
[s] [4]
[s] [6] [7]
[8]
trol System," Future Generation Computer Systems, 18(1): 175-187, 2001. H. Brunst, W. Nagel, and A. Malony, "A Distributed Performance Analysis Architecture for Clusters," International Conference on Cluster Computing, December 2003, to be published. W. Gu, G. Eisenhauer, E. Kraemer, K. Schwan, J. Stasko, J. Vetter, and N. Mallavarupu, "Falcon: On-line Monitoring and Steering of Large-Scale Parallel Programs," Proceedings of the 5th Symposium of the Frontiers of Massively Parallel Computing, pp.422-429, 1995. A. Malony, "Tools for Parallel Computing: A Performance Evaluation Perspective," in Handbook on Parallel and Distributed Processing, J. Blazewicz, K. Ecker, B. Plateau, an D. Trystram (Eds.), 2000, Springer-Verlag, pp. 342-363. C. Mendes and D. Reed, "Monitoring Large Systems via Statistical Sampling," LACSI Symposium, 2002. P. Roth, D. Arnold, and B. Miller, "MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools," Technical report, University of Wisconsin, Madison, 2003. B. Miller, et al., "The Paradyn parallel performance measurement tool", IEEE Computer 28(11), pp. 3 7-46, November 1995. R. Bell, A. Malony, and S. Shende, "ParaProf: A Portable, Extensible, and Scalable Tool for Parallel Performance Profile Analysis," International Euro-Par Conference, pp. 17-26, August 2003.
768 [9]
[10]
[ 11]
[12] [ 13]
[14] [ 15]
K. Li, A. Malony, R. Bell, and S. Shende, "A Framework for Online Performance Analysis and Visualization of large-Scale Parallel Applications," International Conference on Parallel Processing and Applied Mathematics, Czestochowa, Poland, September 2003, to be published. K. Fuerlinger and M. Gerndt, "Distributed Application Monitoring for Clustered SMP Architectures," accepted to EuroPar 2003, Workshop on Performance Evaluation and Prediction, 2003. D. Reed, C. Elford, T. Madhyastha, E. Smirni, and S. Lamm, "The next frontier: Interactive and closed loop performance steering,' Proceedings of the 25th Annual Conference of International Conference on Parallel Processing, 1996 TAU (Tuning and Analysis Utilities). See http://www.acl.lanl.gov/tau. H. Brunst and A. Malony and S. Shende and R. Bell, "Online Remote Trace Analysis of Parallel Applications on High-Performance Clusters," International Symposium on HighPerformance Computing, October 2003, to be published. W.Nagel, A.Amold, M.Weber, H.Hoppe, and K.Solchenbach, "Vampir: Visualization and Analysis of MPI Resources," Supercomputing, 12(1):69-80, 1996. J. Vetter, "Computational Steering Annotated Bibliography," SIGPLANNotices, 32(6), pp. 40-44, 1997.
Parallel Computing: SoftwareTechnology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
769
Deriving analytical models from a limited number of runs* R.M. Badia a, G. Rodriguez a, and J. Labarta ~ ~CEPBA-IBM Research Institute, Technical University of Catalonia, c/Jordi Girona 1-3, 08034 Barcelona, SPAIN We describe a methodology to derive a simple characterization of a parallel program and models of its performance on a target architecture. Our approach starts from an instrumented run of the program to obtain a trace. A simple linear model of the performance of the application as a function of architectural parameters is then derived by fitting the results of a bunch of simulations based on that trace. The approach, while being very simple, is able to derive analytic models of execution time as a function of parameters such as processor speed, network latency or bandwidth without even looking at the application source. It shows how it is possible to extract from one trace detailed information about the intrinsic characteristics of a program. A relevant feature of this approach is that a natural interpretation can be given to the different factors in the model. To derive models of other factors such as number of processors, several traces are obtained and the values obtained with them extrapolated. In this way, Very few actual runs of the application are needed to get a broad characterization of its behavior and fair estimates of the achievable performance on other hypothetical machine configurations. 1. MOTIVATION AND GOAL Obtaining models of the performance of a parallel application is extremely useful for a broad range of purposes. Models can be used to determine whether a given platform achieves the reasonably expected performance for a given program or to identify why this is not achieved. Predictive models can be used in purchasing processes or scheduling algorithms. Models based on simulations can be used to explore the parameter space of a design but can be time consuming and lack the abstraction and possibility of interpretation that analytic models provide. Deriving analytic models for real parallel programs is nevertheless a significant effort [6, 7, 8]. It requires a deep understanding of the program, a lot of intuition of how and where to introduce the approximations, a detailed understanding of the behavior of the parallel platform and how this interacts with the application demand for resources. In this paper we are interested in looking at the possibility of deriving analytical characterizations of an application and models of its performance without the requirement of understanding the application and minimizing the effort to derive the model. The study aims at extending the analysis capabilities of the performance analysis tools DIMEMAS [1] and PARAVER [2] developed at CEPBA. DIMEMAS is a event-driven simulator that predicts the behavior of MPI *This work has been partially funded by the European Commission under contract number DAMIEN IST-200025406 and by the Ministry of Science and Technology of Spain under CICYT TIC2001-0995-CO2-01
770 applications. PARAVER is a performance analysis and visualization tool that supports the detailed analysis, not only of the results of the DIMEMAS simulations but also the behavior of real execution of MPI, OpenMP, MPI+OpenMP and other kind of applications. Both DIMEMAS and PARAVER are based on post-mortem tracefiles. Trace based systems do support very detailed analysis at the expense of having to handle the large amounts of data stored in the trace files. This poses some problems in the analysis process and rises some philosophical questions that we briefly describe in the next paragraphs. Compared to analytic models, simulations are slow as a large program characterization (the tracefile) has to be convolved with the architecture model implemented by the simulator. Furthermore, the point-wise evaluation is slow and it is difficult to interpret its sensitivity to a given parameter. Analytic models are based on a concise characterization of the program. They also have the advantage that are more amenable to a high level interpretation of the parameters in the model and the sensitivity to parameters of the architecture can be derived analytically. A first question that arises when looking at all the data in a tracefile is how much is real information? In this work we used traces from three applications whose respective sizes were 1.7, 4 and 10.2 MB. The question is, can such a large characterization of the program behavior be condensed to a few numbers and still capture the inherent behavior of the program? A second issue we want to address refers to the way the characteristics of the basic components in the architectural model propagate to the application performance. Some components of the architecture model in DIMEMAS are linear, for example the communication time (T = L + S / B W ) which is proportional to the inverse of the bandwidth and proportional to the latency and to the message size. A linear coefficient of relative CPU speed is also used to model target processors different from the one where the trace was obtained. Other components of the DIMEMAS model are highly non-linear, for example, the delays caused by blocking receives or resource contention at the network. The questions is how these two types of components get reflected in the final application execution time. Does it vary linearly with the inverse of the bandwidth? If so, with which proportionality factor? Thus, our objective in this work is to identify a method to evolve the results of a bunch of simulations (from a single trace of a single real execution) to an analytic model, as the one shown in equation 1 where the factors (BW, L,CPUspeed) characterize the target machine and should be independently measured or estimated in order to perform a prediction. BW is the bandwidth of the interconnection network, L is the latency and CPU speed is the relative speed of the processors in the target machine to those in the machine where the trace was obtained.
T = f ( B W ) + g ( L ) + h(CPUspeed)
(1)
Our target is to identify whether a linear expression for each of the terms in equation 1 is adequate. If so, the coefficients of each term would characterize the application and could be given abstract interpretations. As the trace characterizes an instantiation of problem size and number of processors the above model does not include these factors. In order to include them, different traces would be needed and the model modified accordingly. In section 2 we describe the methodology. In section 3 we describe how the methodology is applied and the results we obtained. Finally, section 4 concludes the paper.
771 2. M E T H O D O L O G Y For each application that we want to analyze a real execution is performed to extract the tracefile that feeds the simulator (DIMEMAS). The traces are obtained with the instrumentation package MPIDtrace [9] which relies on DPCL [ 10] to inject probes in a binary at load time. By using this dynamic instrumentation technology the tracing tool can instrument production binaries, not requiting any access to source code. The tracefile obtained contains information of communication requests by each process as well as the CPU demands between those requests. With MPIDtrace it is possible to obtain a trace of a run with more processes that available processors and still get very accurate predictions as shown in [11, 12, 13] . Other previously proposed methods needed the source code and knowledge of the program structure to be able to instrument the application [ 14, 15]. DIMEMAS implements a simple abstract model of a parallel platform. The network is modeled as a set of buses and a set of links connecting each node to them. These parameters of the simulator are used to model in an abstract way the network bisection (number of buses) and the injection mechanism (number of links and whether they are half or full duplex). The communication time is modeled with the simple linear expression: S T = L + BW
(2)
The latency (L) term is in this model constant, independent of the message size (S) and uses the CPU. The transfer time is inversely proportional to the bandwidth (BW) and does use resources of the interconnection network during the whole transfer (one output link, one bus and one input link). Blocking as well as non blocking MPI semantics are implemented by the simulator. An extremely important consideration about the model implemented by DIMEMAS is that it is an abstraction not only of the hardware of the target machine, but also of all the software layers in the MPI implementation. In this work, a bunch of simulations of the same tracefile is launched randomly varying for each simulation the architectural parameters for which we want to characterize the application (i.e., latency and network bandwidth). From the results of the simulations a linear regression is performed that allows us to extract a linear model for the elapsed time of the application against the architectural parameters. The coefficients in the model become the summarized characterization of the application. 2.1. Applications For this work the following applications have been used: NAS BT [3], Sweep3D [5] and RNAfold [4]. All traces were extracted from an IBM SP2 machines, with different processors counts, ranging from 8 to 49 processors. For the NAS BT only ten iterations were traced, since the benchmark is quite large. 3. RESULTS OF THE EXPERIMENTS Three kind of experiments were realized, changing the parameters that are taken into account in the model: experiments varying latency and bandwidth; experiments varying latency, bandwidth and CPU speed; and experiments varying latency, bandwidth and number of processors. In the simulator we considered a simple network model with unlimited busses but a single
772 half duplex link between the nodes and the network. For each experiment we report not only the results but also the experience.
3.1. Latency and bandwidth The objective in this set of experiments was to obtain a linear model for each application which follows an equation like : 7 T - t~ + /f . L + B----~,
where L is the latency and BW is the bandwidth
(3)
This simple expression is quite similar to equation 2. An interpretation based on such similarity can be given to the coefficients, c~ can be interpreted as the execution time of the application under ideal instantaneous communications. This factor will actually represent the execution time of the application under such ideal communications and will account for load imbalances and dependence chains (i.e. pipelines). Coefficient/3 can be interpreted as the number of message sends and receptions in the critical path of the application with infinite bandwidth. Parameter 7 can be interpreted as the total number of bytes whose transference is in the critical path. The interesting thing about the above interpretations is that the coefficients represent a global measure for the whole run and across all processors. The values does not need to match the actual number of messages or bytes sent. The ratios between these coefficients and the total numbers can be good indicators of general characteristics of the application as described by the following indexes: Parallelization efficiency =
Total C P U time P . c~
(4)
Here, Total C P U time is the sum for all processes of all useful computation bursts in the tracefile. When divided by P, which is the number of processors, we have the average value per processor. When divided by c~, it should result in a value close to 1. Values below that mean a poor parallelization due to load imbalance or to dependence chains. Message overhead =
P./3 Total Messages Sent. 2
(5)
To derive this index we considered that under infinite bandwidth, messages in the critical path will pay the latency overhead twice (at the send and at the receive) and also assumed that all processors send the same number of messages. Even with those assumptions this index is a useful estimator of the fraction of messages sent that contribute to the critical path. Values higher than 1 will indicate the existence of dependence chains as for example pipelined executions. Values close to 1 essentially indicate that each processors pays the overhead cost of all the messages it sends and receives. Values below 1 will indicate that the overhead of some messages is not paid by the application. This may happen when some of the startup overheads take place in periods preceding a reception of a message that will block the process. Thus, this is not as positive as might look like because it actually indicates inefficiencies due to load imbalance. We should also be aware that this is a global number, thus local behaviors in different parts of the program may be compensated by other parts. P'7
Transfers overlap = Total Bytes Sent
(6)
773 The result of this calculation represents the fraction of bytes sent that contribute to the critical path (assuming a uniform distribution of messages sent by the different processors). Values less than 1 will indicate overlap between communication and computation. Values above 1 will indicate dependence chains or resource contention. Values close to 1 indicate concurrency in communications but no overlap with computation. Again, this is a global number that may hide local behaviors. 3.2. Initial results An initial set of simulations was obtained by considering the tracefiles of the three applications with 16 processors. The range used in the simulations for the latency was between 2#s and 100ms and for the bandwidth between 2MB/s and 800MB/s. From those results, we performed a linear regression for each application, where the output variable was the predicted elapsed time (by DIMEMAS) and the input variables were the latency and the bandwidth. The results from those regressions were not very good. The confidence interval of the regression for parameter/3 in the NAS BT and Sweep3D models included 0, so those two values should be considered NULL. For RNAfold the confidence interval was also very large although did not include 0. However, a more important fact was that the relative error of the model for some observations was very large (39% for the NAS BT and 54.3% for the Sweep3D). To identify the problem, we analyzed the model values for the NAS BT against the inverse of the bandwidth. Although the visual inspection of the behavior of the application seemed to have a fit the linear regression with the inverse of the bandwidth, a detailed observation showed that different trend lines could fit the application behavior. For example, for large values of the bandwidth, a very different trend line appears (values above 5MB/s). This was also observed on the plot of the relative errors of the model against the bandwidth: for low values of bandwidth, the model had low relative values, but for values of bandwidth above 25MB/s, the relative error was also above 20% and growing. The actual results of the simulation for such a wide dynamic range of the architectural parameters does match a piecewise linear model. Then, we selected the range of values between 80 and 300MB/s, which also are more representative of current and future platforms to repeat the experiments. 3.3. Results with reduced bandwidth range The results for this second bunch of simulations (see results of table 2 for 16 processors) is accurately fitted by the linear model and the error is very small. With these values, we calculated the metrics defined before, and obtained the results shown in table 1. The parallelization efficiency parameter gives good values for the NAS BT and the RNAfold applications, which are well parallelized applications. The lower value for the Sweep3D reflects the fact that this application has a dependency chain between the different processes which reduces its parallelism. Regarding the message overlap, a value slightly larger than 1 is obtained for the Sweep3D. This
Table 1 Metrics evaluation Application NAS BT Class A Sweep3D RNAfold
Parallelization efficiency 0.994 0.792 0.926
Messageo v e r l a p 0.26 1.24 1.03
Transferoverlap 2.82 0.33 0.02
774 is again due to the dependency chain present in this application, which is in fact composed of a series of point to point communications where each depends on the one of a neighbor process. On the case of the transfer overlap it has a large value for the NAS BT application. Analyzing the PARAVER tracefile of a simulation for that application, it was observed that at the end of each iteration there is a bunch of overlapped point to point communication between all processes. As each process is sending several messages at the same time (six for the 16 processors case), a network resource contention arises.
3.4. Latency, bandwidth and CPU speed For the latency, bandwidth and CPU speed experiments, the inverse process was applied. The CPU speed parameter is a reference of the CPU speed of the architecture for which we want to predict against the CPU speed of the architecture were the trace was generated. A value of 1 in this parameter, means that we are predicting for an architecture with the same CPU speed. A value of 2, will mean that we want to predict for a machine with a CPU speed twice than the one of the instrumented architecture. Instead of regressing the results of the simulation, we made the hypothesis that the same equation obtained could be reformulated in the following way: T . ~ +./ 3
.L-4 . "7 . ~
BW
T
c~
CP U speed
+ / 3 . L-~_ ~"Y
(7)
BW
Thus, in this case the data obtained from the simulations was used for validation of the above method. The range of values for the CPU speed parameter was between 0.5 and 10. The results of this experiments were very satisfactory, since all applications fitted in the extended equation model with maximum percentage of relative error between 7% and 13%.
3.5. Latency, bandwidth and number of processors The DIMEMAS simulations do require a trace obtained by running the application with the target number of processes (although the instrumented run can be obtained on less processors by loading each one of them with several processes). Our final objective is to extend the above models to take into account the number of processors as parameter. The approach is to generate several traces with different processor counts, apply to each of them the previously described model and extrapolate the values for c~,/3 and -~ for the desired number of processors. Table 2 shows the results of the model parameters for each of the three applications and different numbers of processors. From the results obtained, we can see that, for each application, Table 2 Parameters obtained for different processor counts Application # Procs c~ /3 NAS BT Class A 9 6.53 289.74 NAS BT Class A 16 3.27 124.54 NAS BT Class A 25 1.57 289.02 NAS BT Class A 36 0.94 323.38 Sweep3D 8 15.46 2089.92 Sweep3D 16 8.84 2425.34 Sweep3D 32 4.91 2930.30 RNAfold 8 11.88 3671.77 RNAfold 16 8.26 3770.68 RNAfold 28 4.07 3311.30
9' 21.8 29.55 37.63 40.37 4.13 9.45 19.10 0.18 0.26 0.17
R2 0.9704 0.9975 0.9982 0.9973 0.9989 0.9994 0.9996 0.9999 0.9996 0.9999
Max err.(%) 0.32% 0.36% 0.70% 1.67% 0.04% 0.05% 0.13% 0.02% 0.09% 0.03%
775 Table 3 Execution time prediction based on model extrapolation Application Bandwidth Simulation time NAS BT Class A 80 1.22 NAS BT Class A 100 1.10 NAS BT Class A 300 0.83 Sweep3D 80 4.22 Sweep3D 100 4.17 Sweep3 D 300 3.92
Model prediction 1.31 1.18 0.83 4.43 4.35 4.16
% Error -8.17 -7.31 0.12 -5.04 -5.26 -6.09
parameter c~ is fairly proportional to the inverse of the number of processors. A first approximation for "7 would be to consider it linear on the number of processors. The slope of such variation depends on the application and thus requires a few traces to estimate it. Parameter/3 also seems to have some linear behavior except for the point corresponding to 16 processors in the NAS BT application. In this trace, many small imbalances in the CPU consumption by each thread result in processes blocking for a small time at the receives. This imbalance is small compared to the duration of the CPU runs but sufficient to allow for several latency overheads to be absorbed by the blocking time. Because of the small relative significance of the imbalance, the model is actually not very sensitive to the fl value. Based on the above considerations we extrapolated the values of c~,/3 and "7 for a number of processors larger than the ones included in Table 2 and made the prediction of the execution time. The results are compared in table 3 with the prediction obtained with DIMEMAS.
4. CONCLUSIONS In this work we have introduced an approach to a methodology that allows to obtain abstract, global, and interpretable information of the behavior of a parallel application from a few (or single) real run. Furthermore, this analysis can be performed without looking at the source code. We consider that the initial results are very interesting, showing the importance and the need for combining simple model methodologies and applying more statistical analysis techniques in performance modeling of parallel application. As future work, we foresee the extension of this work to wider ranges of the input parameters by means of using piecewise linear models. A critical point here would be how we can detect the knots and interpret that knots since it may allow detection of actual cause of problem (due to bottleneck changes). Also we want to automatize the process.
REFERENCES
[1] [2] [3]
Dimemas: performance prediction for message passing applications, http://www.cepba.upc.es/dimemas/ Paraver: performance visualization and analysis, http://www.cepba.upc.es/paraver/ David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, Maurice Yarrow, "The NAS Parallel Benchmarks 2.0", The International Journal of Supercomputer Applications, 1995.
776 [4] [5] [6]
[7]
[8]
[9] [ 10]
[ 11 ]
[ 12]
[13] [ 14] [15]
.I.L. Hofacker and W. Fontana and L. S. Bonhoeffer and M. Tacker and R Schuster, "Vienna RNA Package", http://www.tbi.univie.ac.at/ivo/RNA, October 2002. "The ASCI sweep3d Benchmark Code", http ://www.llnl.gov/asci_benchmarks/asci/limited/sweep3 d/asci_sweep3 d.html D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, and T. von Eicken, "LogP" Towards a Realistic Model of Parallel Computation", in Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993. Mark M. Mathis, Darren J. Kerbyson, Adolfy Hoisie, "A Performance Model of nonDeterministic Particle Transport on Large-Scale Systems", In Proc. int. Conf. on Computational Science (ICCS), Melbourne, Australia, Jun 2003. Adeline Jacquet, Vincent Janot, Clement Leung, Guang R. Gao, R. Govindarajan, Thomas L. Sterling, "An Executable Analytical Performance Evaluation Approach for Early Performance Prediction", in Proc. of IPDPS 2003. MPIDtrace manual, http://www.cepba.upc.es/dimemas/manual_i.htm L. DeRose, "The dynamic probe class library: an infrastructure for developing instrumentation for performance tools". In International Parallel and Distributed Processing Symposium, April 2001. Sergi Girona and Jesfis Labarta, "Sensitivity of Performance Prediction of Message Passing Programs", 1999 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), Monte Carlo Resort, Las Vegas, Nevada, USA, July 1999. Sergi Girona, Jesfis Labarta and Rosa M. Badia, "Validation of Dimemas communication model for MPI collective operations", EuroPVM/MPI'2000, Balatonf'tired, Lake Balaton,Hungary, September 2000. Allan Snavely, Laura Carrington, Nicole Wolter, Jesfis Labarta, Rosa M. Badia, Avi Purkayastha, "A framework for performance modeling and prediction", SC 2002. P. Mehra, M. Gower, M.A. Bass, "Automated modeling of message-passing programs", MASCOTS'94, pgs. 187-192, Jan 1994. P. Mehra, C. Schulbach, J.C. Yan, "A comparison of two model-based performanceprediction techniques for message-passing parallel programs", Sigmetrics'94, pgs. 181190, May 1994.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
777
Performance Modeling of HPC Applications A. Snavely a, X. Gao ~, C. Lee ~, L. Carrington b, N. Wolter b, J. Labarta c, J. Gimenez ~, and E Jones d aUniversity of California, San Diego, CSE Dept. and SDSC, 9300 Gilman Drive, La Jolla, Ca. 92093-0505, USA bUniversity of California, San Diego, SDSC, 9300 Gilman Drive, La Jolla, Ca. 92093-0505, USA cCEPBA, Jordi Girona 1-3, Modulo D-6, 08034 Barcelona, Spain dLos Alamos National Laboratory, T-3 MS B216, Los Alamos, N.M. USA This paper describes technical advances that have enabled a framework for performance modeling to become practical for analyzing the performance of HPC applications. 1. INTRODUCTION Performance models of applications enable HPC system designers and centers to gain insight into the most optimal hardware for their applications, giving them valuable information into the components of hardware (for example processors or network) that for a certain investment of time/money will give the most benefit for the applications slated to run on the new system. The task of developing accurate performance models for scientific application on such complex systems can be difficult. In section 2 we briefly review a framework we developed [1] that provides an automated means for carrying out performance modeling investigations. In section 3 we describe ongoing work to lower the overhead required for obtaining application signatures and also how we increased the level-of-detail of our convolutions with resulting improvements in modeling accuracy. In section 4 we show how these technology advances have enabled performance studies to explain why performance of applications such as POP (Parallel Ocean Program) [2], NLOM (Navy Layered Ocean Model) [3], and Cobalt60 [4] vary on different machines and to quantify the performance effect of various components of the machines. In section 5 we generalize these results to show how these application's performance would likely improve if the underlying target machines were improved in various dimensions (as for example on future architectures). 2. BRIEF REVIEW OF A CONVOLUTION-BASED F R A M E W O R K FOR PERFORMANCE MODELING
In a nutshell, we map a trace of program events to their expected performance established via separately measured or extrapolated machine profiles. We combine single-processor models along with the parallel communications model, to arrive at a performance model for a whole
778 parallel application. We emphasize simplicity in the models (leaving out, for example, second order performance factors such as instruction level parallelism and network packet collisions) while applying these simple models at high resolution. A user of the framework can input the performance parameters of an arbitrary machine (either existing and profiled, or under-design and estimated) along with instruction/memory-access-pattern signatures and communications signatures for an application to derive a performance model. The convolver tools calculate the expected performance of the application on the machine in two steps, first by modeling the sections between communications events and then by combining these models into a parallel model that includes MPI communications. Detailed descriptions of our performance modeling framework can be found in papers online at the Performance Modeling and Characterization Laboratory webpages at ht tp ://www. sdsc. e d u / p m a c / P a p e r s / p a p e r s , html 3. REDUCING TRACING TIME For portability and performance reasons we ported our tracing tool, MetaSim Tracer, to be based on DyninstAPI [5] [6]. Previously it was based on the ATOM toolkit for Alpha processors. This meant applications could only be traced on Alpha-based systems. A more critical limitation was due to the fact that ATOM instruments binaries statically prior to execution. This means tracing cannot be turned on and off during execution. DyninstAPI is available on IBM Power3, IBM Power4, Sun, and Intel processors. It allows code instrumentation via runtime patching. The image of the running program can be modified during execution to add instrumentation. The instrumentation can be dynamically disabled. The opportunity was to enable a feature whereby MetaSim Tracer can sample performance counters by adding instrumentation during sample phases. The program can be de-instrumented between sample phases. Slowdown due just to minimal hooks left in the code to allow re-instrumentation should be greatly reduced between sample phases. An open question remained that we wished to answer before proceeding; whether application traces based on sampling could yield reasonably accurate performance models. Some previous work [7] showed this is possible and in recent work we also demonstrated it can via experiments with the ATOM based version of MetaSim Tracer. In these experiments we turned on and off the processing of instrumented sections (we could not actually turn off the instrumentation in the ATOM version, so we just switched off processing in the callback routines). In this way we were able to explore the verisimilitude of interpolated traces based on sampled data, and we showed that these could indeed be usefully accurate [8]. Encouraged by these results we then implemented a DyninstAPI version of MetaSim Tracer so that the duration and frequency of sampling periods (counted in CPU cycles) is under the control of the user. The user inputs two parameters: 1) SAMPLE = number of cycles to sample 2) INTERVAL - number of cycles to turn off sampling. The behavior of the program when sampling is turned off is estimated by interpolation. Thus MetaSim Tracer now enables a tradeoff between time spent tracing and verisimilitude of the resulting trace obtained via sampling. A regular code may require little sampling to establish its behavior. A code with very random and dynamic behaviors may be difficult to characterize even from high sampling rates. 
Practically, we have found techniques for generating approximated traces via sampling can reduce tracing time while preserving reasonable trace fidelity. Also we found that representing traces by a dynamic CFG decorated with instructions (especially memory instructions) characterized by memory access pattern can reduce the size of stored trace files by three orders of magnitude
779 [9]. These improvements in the space and time required for tracing have now rendered fullapplication modeling tractable. In some cases it is possible to obtain reasonably accurate traces and resulting performance models from 10-~ or even 1 -~ sampling. 4. MODELING APPLICATION PERFORMANCE ON HPC MACHINES 4.1. Parallel Ocean Program (POP) The Parallel Ocean Program (POP) [2] was specifically developed to take advantage of high performance computer architectures. POP has been ported to a wide variety of computers for eddy-resolving simulations of the world oceans and for climate simulations as the ocean component of coupled climate models. POP has been run on many machines including IBM Power3, and IBM Power4 based systems, Compaq Alpha server ES45, and Cray X1. POP is an ocean circulation model that solves the three-dimensional primitive equations for fluid motions on the sphere under hydrostatic and Boussinesq approximations. Spatial derivatives are computed using finite-difference discretizations, formulated to handle any generalized orthogonal grid on a sphere, including dipole, and tripole grids that shift the North Pole singularity into land masses to avoid time step constraints due to grid convergence. The benchmark used in this study is designated ' x l ' (not to be confused with the Cray X1 machine, one of the machines where we ran the benchmark); it has coarse resolution that is currently being used in coupled climate models. The horizontal resolution is one degree (320x384) and uses a displace-pole grid with the pole of the grid shifted into Greenland and enhanced resolution in the equatorial regions. The vertical coordinate uses 40 vertical levels with smaller grid spacing near the surface to better resolve the surface mixed layer. This configuration does not resolve eddies, and therefore it requires the use of computationally intensive subgrid parameterizations. This configuration is set up to be identical to the actual production configuration of the Community Climate System Model with the exception that the coupling to full atmosphere, ice and land models has been replaced by analytic surface forcing. We applied the modeling framework to POP. The benchmark does not run so long as to require sampling. Table 1 shows real vs. model-predicted wall-clock execution times for several machines at several processor counts. We only had access to a small 16 CPU Cray X1. The model is quite robust on all the machines modeled with an average error of only 6.75-~ where error is calculated as (Real Time - Predicted Time)/(Real Time) *100. 4.2. Navy Layered Ocean Model (NLOM) The Navy's hydrodynamic (iso-pycnal) non-linear primitive equation layered ocean circulation model [3] has been used at NOARL for more than 10 years for simulations of the ocean circulation in the Gulf of Mexico, Carribean, Pacific, Atlantic, and other seas and oceans. The model retains the free surface and uses semi-implicit time schemes that treat all gravity waves implicitly. It makes use of vertically integrated equations of motion, and their finite difference discretizations on a C-grid. NLOM consumes about 209 of all cycles on the supercomputers run by DoD's High Performance Computing Modernization Program (HPCMP) including Power3, Power4, and Alpha systems. In this study we used a synthetic benchmark called synNLOM that is representative of NLOM run with data from the Gulf of Mexico and has been used in evaluating vendors vying for DoD TI-02 procurements. 
Even though synLOM is termed a 'benchmark' it is really a representative production problem and runs for more than 1 hour un-instrumented on 28 CPUs (a typical configuration) on BH. Thus, in terms of runtime, it is an order-of-magnitude more challenging to trace than POP xl. We used 1 ~ sampling and the
780 Table 1 Real versus Predicted-by-Model Wall-clock Times for POP Benchmark Cray X1 at 16 processors had real time 9.21 seconds, predicted time 9.79 seconds, error 6.3 percent # o f CPUs
Real Time (sec)
Predicted Time (sec)
Error9
Blue Horizon Power3 8-way SMP Colony switch 16
204.92
214.29
-59
32
115.23
118.25
-3 9
64
62.64
63.03
19
128
46.77
40.60
139
Lemeiux Alpha ES45 4-way SMP Quadrics switch 16
125.35
125.75
09
32
64.02
71.49
-119
64
35.04
36.55
-49
22.76
20.35
119
128
Longhorn Power4 32-way SMP Colony switch 16
93.94
95.15
- 19
32
51.38
53.30
-49
64
27.46
24.45
119
19.65
15.99
169
128
Seaborg Power3 16-way SMP Colony switch 16
204.3
200.07
29
32
108.16
123.10
-149
64
54.07
63.19
-179
128
45.27
42.35
69
resulting models yielded less than 59 error across all of the same machines as in the above POP study. We estimate a full trace would take more than a month to obtain on 28 processors! NLOM is reasonably regular and the low error percentages from 19 sampling do do not seem to justify doing a full trace although the code is important enough to DoD that they would provide a month of system time for the purpose. 4.3. Cobalt60 Cobalt60 [4] is an unstructured Euler/Navier-Stokes flow solver that is routinely used to provide quick, accurate aerodynamic solutions to complex CFD problems. Cobalt60 handles arbitrary cell types as well as hybrid grids that give the user added flexibility in their design environment. It is a robust HPC application that solves the compressible Navier-Stokes equations using an unstructured Navier-Stokes solver. It uses Detached-Eddy Simulation (DES) which is a combination of Reynolds-averaged Navier-Stokes(RANS) models and Large Eddy Simulation (LES). We ran 7 iterations of a tunnel model of an aircraft wing with a flap and endplates with 2,976,066 cells that runs for about an hour on 4 CPUs of BH. We used a 2-step trace method
781 2.5
9POP Performance [] Processor and Memory Subsystem [] Network Bandwidth
1.5
/
B Network Latency
i
m
0.5
PWR3 Colony (BH)
Case1
Case2
Case3
Case4
Alpha ES45 Quadrics
(Tcs) Figure 1. Modeled Contributions to Lemeuix.s (TSC) performance improvement over Blue Horizon on POP xl at 16 CPUs. to ascertain in the first phase that 709 of the time is spent in just one basic block. We then applied 1-~ sampling to this basic block and 10-~ sampling to all the others in the second step of MetaSim tracing. We verified models for 4, 32, 64, and 128 CPUS on all the machines used in the previous study with average of less than 5-~ error.
5. P E R F O R M A N C E SENSITIVITY STUDIES Reporting the accuracy of performance models in terms of model-predicted time vs. observed time (as in the previous section) is mostly just a validating step for obtaining confidence in the model. More interesting and useful is to explain and quantify performance differences and to play 'what if' using the model. For example, it is clear from Table 1 above that Lemeiux is faster across-the-board on POP xl than is Blue Horizon. The question is why? Lemeuix has faster processors (1GHz vs. 375 MHz), and a lower-latency network (measured ping-pong latency of about 5 ms vs. about 19 ms) but Blue Horizon.s network has the higher bandwidth (pingpong bandwidth measured at about 350 MB/s vs. 269 MB/s with the PMaC probes). Without a model one is left with a conjecture 'I guess POP performance is more sensitive to processor performance and network latency than network bandwidth'. With a model that can accurately predict application performance based on properties of the code and the machine, we can carry out precise modeling experiments such as that represented in Figure 1. We model perturbing the Power3-based, Colony switch Blue Horizon (BH) system into the Alpha ES640-based, Quadrics switch system (TCS) by replacing components. Figure 1 represents a series of cases modeling the perturbing from BH to TCS, going from left to fight. The four bars for each case represent the performance of POP x l on 16 processors, the processor and memory subsystem performance, the network bandwidth, and the network latency all normalized to that of BH. In Case 1, we model the effect of reducing the bandwidth of BH's network to that of a single rail of the Quadrics switch. There is no observable performance effect as the POP x 1 problem at this size is not sensitive to a change in peak network bandwidth from 350MB/s to 269MB/s. In Case 2 we model the effect of replacing the Colony switch with the Quadrics switch. There is a significant performance improvement due to the 5 ms latency of the Quadrics switch versus the 20 ms latency of the Colony switch. This is because the barotropic
782
,--,,,)< --4x Processor - - - ~ i - - 1/4 lat & 4 x BW .L
1/4 lat
I000 .~
~ecution
Time
4){ BW BASE
Latency Performm~ce Normalized
Pro:essor P eft ormanc 9
9
Bandwidth Performance Normalized
Figure 2. POP Performance Sensitivity for 128 cpu POP xl. calculations in POP xl at this size are latency sensitive. In Case 3 we use Quadrics latency but Colony bandwidth just for completeness. In Case 4 we model keeping the Colony switch latencies and bandwidths but replacing the Power3 processors and local memory subsystem with Alpha ES640 processors and their memory subsystem. There is a substantial improvement in performance due mainly to the faster memory subsystem of the Alpha. The Alpha can load stride-1 data from its L2 cache at about twice the rate of the Power3 and this benefits POP xl a lot. The last set of bars show the values of TCS performance, processor and memory subsystem speed, network bandwidth and latency, as a ratio to BH's values. The higher-level point from the above exercise is that the model can quantify the performance impact of each machine hardware component. We carried out similar exercise for several sizes of POP problem and for NLOM, Cobalt60, and could do so for any application modeled via the framework. Larger CPU count POP xl problems become more network latency sensitive and remain not-very bandwidth sensitive. As an abstraction from a specific architecture comparison study such as the above, we can use the model to generate a machine-independent performance sensitivity study. As an example, Figure 2 indicates the performance impact on a 128 CPU POP x l run for quadrupling the speed of the CPU and memory subsystem (lumped together we call this processor), quadrupling network bandwidth, cutting network latency by 4, and various combinations of these four-fold hardware improvements. The axis are plotted logscale and normalized to 1, thus the solid black quadrilateral represents the execution time, network bandwidth, network latency, CPU and memory subsystem speed of BH. At this size POP xl is quite sensitive to processor (faster processor and memory subsystem), somewhat sensitive to latency because of the communications-bound with small-messages barotropic portion of the calculation and fairly insensitive to bandwidth. With similar analysis we can 'zoom in' on the processor performance factor. In the above results for POP, the processor axis shows modeled execution time decreases
783 from having a four-times faster CPU with respect to MHz (implying 4X floating-point issue rate) but also implicit in '4X node' is quadruple bandwidth and 1/4th latency to all levels of the memory hierarchy (unfortunately this may be hard or expensive to achieve architecturally!). We explored how much faster a processor would perform relative to the Power3 processor for synNLOM if it had 1) 2X issue rate 2) 4X issue rate, 3) 2X issue rate and 2X faster L2 cache 4) base issue rate of 4*375 MHz but 4X faster L2 cache. Space will not allow the figure here but qualitatively we found synLOM at this size is compute-bound between communication events and would benefit a lot just from a faster processor clock even without improving L2 cache. Not shown but discoverable via the model is that synNLOM is somewhat more network bandwidth sensitive than POP because it sends less frequent, larger messages. With similar analysis we found Cobalt60 is most sensitive to improvements in the processor performance at this input size and this remains true at larger processor counts. The higher-level point is that performance models enable 'what-if' examinations of the implications of improving the target machine in various dimensions. 6. C O N C L U S I O N A systematic method for generating performance models of HPC applications has advanced via efforts of this team and has begun to make performance modeling systematic, time-tractable, and thus generally useful for performance investigations. It is reasonable now to make procurement decisions based on the computational demands of the target workload. Members of this team are now working closely with the Department of Defense High Performance Computing Modernization Program Benchmarking Team to effect TI-05 procurements by just such criteria. It is reasonable now to tune current systems and influence the implementation of near-future systems informed by the computational demands of the target workload; team members are collaborating in the DARPA HPCS program to influence the design of future machines. 7. ACKNOWLEDGEMENTS This work was sponsored in part by the Department of Energy Office of Science through SciDAC award High-End Computer System Performance: Science and Engineering. This work was sponsored in part by the Department of Defense High Performance Computing Modernization Program office through award HPC Benchmarking. REFERENCES
[i] A. Snavely, L. Carrington, N. Wolter, J. Labarta, R. Badia A. Purkayastha, A Framework for Performance Modeling and Prediction, SC2002.
See http-//www, acl. lanl. gov/climate/models/pop/ current release/UsersGuide.pdf [3] A. J. Wallcraft, The Navy Layered Ocean Model Users Guide, NOARL Report 35, Naval Research Laboratory, Stennis Space Center, MS, 21 pp, 1991. [4] See http-//www, cobaltcfd, com/ [5] See www.dyninst.org [6] J. K. Hollingsworth, An API for Runtime Code Patching, IJHPCA, 1994. [2]
784
[7] J. L., Hennessy, D. Ofelt, .Efficient Performance Prediction For Modern Microprocessors.,
[8] [9]
ACM SIGMETRICS Performance Evaluation Review, Volume 28, Issue 1, June 2000. L. Carrington, A. Snavely, N. Wolter, X. Gao, A Performance Prediction Framework for Scientific Applications, Workshop on Performance Modeling and Analysis - ICCS, Melbourne, June 2003. X. Gao, A. Snavely, Exploiting Stability to Reduce Time-Space Cost for Memory Tracing, Workshop on Performance Modeling and Analysis - ICCS, Melbourne, June 2003.
Minisymposium
OpenMP
This Page Intentionally Left Blank
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
787
T h r e a d b a s e d O p e n M P for n e s t e d p a r a l l e l i z a t i o n R. Blikberg a* and T. Sorevikbt aParallab, BCCS, UNIFOB, University of Bergen, NORWAY bDept, of Informatics, University of Bergen, NORWAY Many computational problems have multiple layers of parallelism. Here we argue and demonstrate that utilizing multiple layers of parallelism may give much better scaling than if one restrict oneself to only one level of parallelism. Two important issues in this paper are load balancing and implementation. We provide an algorithm for finding good distributions of threads to outer-level tasks, when these are of uneven size, and discuss how to implement nested parallelism in OpenMP. Two different implementation strategies will be discussed: The preferred one, directive based parallelization, versus explicit thread programming. 1. INTRODUCTION Computational sciences have made quantum leaps forward over the past decades, due to impressive progress in algorithms, computer hardware and software. This has open up the possibility to tackle very complex problems by the assistance of computer simulations, and as more problems are being attacked the appetite for more compute power and more efficient methods seems to increase rather than decrease. As the computational problems get larger their complexity also grows. A standard way to handle the increased complexity is to decompose the problem. This can be done functionally by splitting the problem in smaller problems, or tasks, to be solved independently or by domain splitting, dividing the computational domain into smaller parts. Next these subproblems are solved independently before they eventually are glued together. Such splitting provide an obvious coarse-grained parallelism. This parallelism has to be utilized. However, in many cases this easily available outer-level parallelism is not enough as it may consist of only (relatively) few tasks, and they may often be of unequal size. Thus parallelism has to be applied within each task as well. In this paper we address the challenge of how to implement such multi-level parallelism in OpenMP. We also briefly describe how to allocate threads to tasks in order to achieve a good load balance. The paper is organized as follows: The question of allocating threads to tasks is addressed in Section 2, where we give an algorithm which distributes threads to tasks. In Section 3, we discuss how to implement 2-level parallelism in OpenMP. The effect of 2-level parallelism on *http://www. ii .uib.no/~ragnhild %http://www.ii.uib.no/-tors
788 a real application, a wavelet based data compression code, has been tested and is reported in Section 4. Finally, the conclusions will be given. 2. THE DISTRIBUTION A L G O R I T H M In this section we will present an algorithm for distributing threads to tasks. Different situation may occur depending on the number and sizes of tasks and the number of available threads. One immediately identifies two extreme cases; The case where all tasks need at least one thread, and the case where each thread should take one or more tasks. We first introduce algorithms for these two extreme cases, and then show how they can be combined to deal with all possible cases. Case 1: All tasks need at least one thread The threads should be grouped together in teams, where each team is responsible for doing the work associated with one task. The distribution of threads to tasks can be done in the following way: First assign one thread to each task. Then find the task with highest 'work-tothread-ratio' and assign an extra thread to this task. Repeat until all threads are assigned to a task. A detailed description and analysis of this algorithm is given in [5]. It needs as input P (the number of threads available), N (the number of tasks) and a vector w of size N, giving the work-load estimate for the N tasks. The output is a vector p of size N containing the number of threads assigned to each task. This correspond to the following function call: [p] = Distributel(N, F, w) Case 2" All threads do at least one task The other extreme is when the tasks are many and small such that most tasks have to share one thread and under no circumstances will any task have more than one thread. In some sense this is the dual of the previous problem. Here we will assign tasks to threads. This subproblem correspond to a bin-packing problem. One version of the problem is to assume we have a fixed number of bins (in our case threads) and seek to pack each bin such that the size of the largest bin is minimized. Alternatively, we can assume that the maximum bin-size is fixed and try to minimize the number of bins needed to pack all tasks. In either case the problem is known to be NP-complete, and a good approximate solution can be found by the best fit decreasing method. For detailed description and analysis of the best fit decreasing algorithm see [2]. There are different advantages and problems with the two formulations, but these will not be discussed here. In our computation we have chosen the first formulation. The output of the algorithm is the vector t a s k _ t o _ t h r e a d of size N, telling which thread a task is distributed to. This correspond to the function call:
[ t a s k _ t o _ t h r e a d ] = Distribute2(N, P, w) Our distribution algorithm works by simply sorting the tasks in two sets; one for those tasks large enough to rightfully ask for at least one thread for themselves, and a second set consisting of tasks so small that they must be prepared to share a thread.
789 Algorithm 1" The distribution algorithm /* This algorithm finds a distribution o f the work o f N tasks to P threads. The tasks are divided in 2 sets, 1: ("large" tasks) and S ("small" tasks), where one or more threads are working on the same task in ~, and where each thread has one or more tasks to work on in S. */
[s S, p, t a s k _ t o _ t h r e a d ] = Distribute_All(N,/9, w) ws = {Vi; w~ > ~};
Pc
-
int (WL/w);
/* For tasks E s
-w/p {Vi; w~ __ ~}; : lSl; Ps int (Ws/~);
S-
-
*/
[p] : Distributel(NL, Pc, we) /* For tasks E S: */
[task_to_thread] = Distribute2(Ns, Ps, ws) For this algorithm to work, an estimate of the workload, or weights, of the different tasks is needed. In the cases we have applied the technique to, a reasonable assumption is that the work is proportional to the amount of data. With data stored in 2d arrays workload estimate is than easily available. 3. IMPLEMENTATION OF NESTED PARALLELISM IN OPENMP
Nested parallelism is possible to implement using message passing parallelization. In MPI [6], creating communicators will make it possible to form groups of threads working together in teams and sending messages to other members within the team for fine-grained parallelism, while the coarse-grained parallelism implies communication between communicators. The distribution of work to multiple threads in SMP programming is usually done by the compiler. The programmer's job is only to insert directives in the code to assist the compiler. A more explicit approach, where the programmer explicitly distributes tasks to threads is also possible. The explicit approach gives the programmer full control, but does require a much higher level of programmer intervention. Therefor directive based SMP programming is usually the recommended approach. Below we discuss the possibilities and limitations of the two approaches for multi-level parallelism. 3.1. Directives for nested parallelism Explicit construct for expressing multi-level parallelism is not usually found in directive based, multi-threaded programming for SMP. OpenMP [1 ] does however have constructs for nested parallelism. In OpenMP a parallel region in Fortran starts by the directive !$OMP P A R A L L E L and ends by ~$OMP END PARALLEL. The standard allows these to be nested ~. The more complicated situation where the tasks are split into the sets s and S, and where nesting only should be executed for set s should also be possible to implement by nested directives. An example of how this situation can be implemented using nested directives, and a few changes in the code, is given in Example 1.
*To enable the nesting one has to set the environment variable OMP NESTEDto TRUE or call the subroutine OMP
SET
NESTED.
790
Example 1" call
compute_weights_of_tasks
call
Distribute
/* Distributing threads to tasks, Alg. 1 */ !$ O M P P A R A L L E L
All(N,
SECTIONS
NUM
P, w,
(w) p,
task
to t h r e a d )
T H R E A D S (2)
!$ O M P S E C T I O N /* Set f. */ !$ O M P P A R A L L E L DO P R I V A T E (i) N U M T H R E A D S (N) do i = I, N L !$OMPPARALLEL DO P R I V A T E ( j ) S H A R E D ( i ) N U M T H R E A D S ( p ( i ) ) do j = I, w(i) < W O R K ( j , i) > e n d do e n d do !$OMP S E C T I O N /* Set S */ !$OMP P A R A L L E L P R I V A T E (i, j ) N U M T H R E A D S (P S) t h r e a d = O M P G E T T H R E A D N U M () do i = N L+I, N if ( t h r e a d /= t a s k to t h r e a d ( i ) ) c y c l e do j = I, w(i) < W O R K ( j , i) > e n d do e n d do [$OMP E N D P A R A L L E L !$ O M P E N D
PARALLEL
SECTIONS
The $OMP SECTION directive is used to separate between the two cases. This splitting will in most cases require some restructuring of the code. In the first section the work in set/2 is carried out, and the directives !$OMP PARALLEL DO are nested. All variables used in a parallel region are by default SHARED. We declare the i index as PRIVATE in the outer loop, and as SHARED in the inner loop, since it should be private for each team and shared among the threads within the same team. The NUM_THREADS clause, included in OpenMP 2.0, controls the number of threads in a team. If NUN_THREADS is not set, it is implementation dependent how many threads will work in each (nested) parallel region. Assuming all tasks in set/2 and using Distribute 1 to distribute threads to tasks, the natural choice is to use N threads on the outer loop, and p ( i ) threads on the inner loop. In the second section, the work in set S is carried out, but here there is only one level of parallelization. Each thread checks for each task if the task is supposed to be done by itself, and executes the task if so. When nested parallelism is enabled, the number of threads used to execute nested parallel regions is implementation dependent. As a result, OpenMP-compliant implementations are allowed to serialize nested parallel regions even when nested parallelism is enabled. As of writing (summer 2003), most vendors appears to have an implementation of OpenMP version 2.0 for Fortran available. However, still most vendors seems to have chosen to serialize nested parallel regions.
791
3.2. Explicit thread programming OpenMP also allows a more low level work distribution, where the programmer explicitly assigns work to threads. We illustrate how this technique can be used to obtain 2-level parallelism with the following simple example. Suppose a problem has been divided in set s and ,9, and that set 12 consists of N = 4 independent outer-level tasks, where the work of the tasks is given by the weight vector w = {10, 8, 2, 7}. Feeding this into Distributel with P = 8 will give the distribution p = {3, 2, 1, 2} of threads to the 4 outer-level tasks. We want the threads within a team to divide the work equally among each other, to obtain a good load balance. If one weight unit equals the work associated with one inner-loop iteration, the threads will divide the iterations as shown in Table 1. Table 1 t h r e a d will work on t a s k and do the iterations from j b e g i n to jend. thread
task jbegin jend
0
1
2
3
4
5
1 1 4
1 5 7
1 8 10
2 1 4
2 5 8
3 4 4 1 1 5 2 4 7
6
7
The information needed in the computation is given in Table 1 and is easily computed with the threads-to-task distribution available. We apply a simple service routine for this purpose (f i n d _ j i n d e x e s ) . Implementing the s in Example 1 by explicit thread programming, can now be done as:
Example 2: c a l l f i n d _ j i n d e x e s /* Construct Table 1 */ !$OMP P A R A L L E L P R I V A T E (thread, i, j ) N U M T H R E A D S (N) t h r e a d = O M P G E T T H R E A D N U M () i = t a s k (thread) do j = j b e g i n ( t h r e a d ) , j e n d (thread) < WORK(j,i) > e n d do !$0MP E N D P A R A L L E L
Implementing the complete Example 1 by explicit thread programming, the $OMP SECTION-COnStruct disappears. Instead, it is explicitly decided which threads should do the work in set/~ and which should do the work in set $. The kind of programmer interventions needed for these explicit thread programming changes are to some degree similar to the work needed when parallelizing using MPI [6]. However, no explicit communication is needed as a (virtual-)shared memory is assumed. PARALLEL
4. DATA C O M P R E S S I O N EXPERIMENTS In this section we report on experiences on 2-level parallelism applied to a real application, a wavelet-based data compression routine. The load balancing is done using Algorithm 1 in Sec-
792
--
-9 .... --- -
....
1024 1024
g
ideal s p e e c l - u p
2-level 2-level 2-level 2-level l-level
.
.
.
.
I
nested speedup, all tasks in L work-load corrected speedup, all tasks in L I nested speedup, AIg. 1 work-load corrected speedup, AIg. 1 I speedup __1
'
~
'
/
r
J _~__.
' J
J /./
~/
. _ jr /" - -
//- i/
30
7
/
.'
~
-
-
/~
-o
20
./'- -- 1 '/i ./ ................................................ 512 256
.~./ 5
./ 10
15
20
25 30 Number of threads
35
40
45
50
Figure 1. a) 1792 x 1792 grid divided into 9 pieces, b) Ideal speedup, 2-level work-load corrected (theoretical) speedup, and obtained 2-level nested speedup for the data compression experiments tion 2. For implementation we have used the explicit thread programming technique outlined in Section 3.2. The wavelet-based data compression routine is used in an out-of-core earthquake simulator to minimize memory usage as well as disk-traffic [8]. In our experiments we run the compression routine as a stand alone routine. The routine first transforms the data into wavelet-space using a 2d-wavelet transform, and then it stores only the non-zero wavelet coefficients [7]. The wavelet routine only works for arrays m x n where m and n are integers power of 2. Thus we first divide the array in N blocks of (different) power of 2 sizes. For each of these blocks a 2d-wavelet transform is carded out. As our test case we have chosen a 2d array of size 1792 x 1792, which is divided into 9 pieces of unequal sizes, as shown in Figure 1a). We can not expect perfect load balance due to the integer restriction on the number of threads. If we correct for the work imbalance between the outer-level tasks, and assume no extra parallel overhead and perfect load balancing within an outer-level task, we obtain a sharper bound on the achievable 2-level speedup. The formal definition is as follows: Definition 4.1. The work-load corrected 2-level speedup is defined as Sp - T1/T p where T 1 =
}-~iwiandTp=max(maxicNc(wi/pi),maxi~Ps
threadworki) ~.
In Figure l b) we display the linear speedup and 2-level work-load corrected speedup, together with the speedup achieved for 2-level and l-level parallelization. The runs are done on a dedicated Origin 2000 using MIPSpro Fortran Compilers, Version 7.3.1.3m. We have applied two versions of our distribution algorithm; Algorithm 1 and a simplified version where all tasks are put in set/2 regardless of size. When the simplified version is used, it is not possible to run the nested version on less than 9 threads, which is the reason why the curve starts at this point. The l-level parallelized code reaches its maximum speedup at about
w
is the amount of work distributed to thread i, and can be computed in Distribute2.
793 20 threads, while the 2-level parallelized code increases its speedup up to at least 50 threads, where the speedup is 33. We find these results to be very encouraging. 4.1. An adaptive mesh refinement application We have also applied 2-level nested parallelism to an adaptive mesh refinement [3] application. In contrast to the compression routine test case, which have a static number of tasks, the number of tasks are here dynamically changing. The results can be found in [4]. 5. CONCLUSIONS The main purpose of this paper has been to examine the possible gain of utilizing nested parallelism when available in the problem. Our findings are very encouraging. Using the two levels of parallelism turned out to be imperative for good parallel performance on our test cases. As always good load balancing is essential in achieving good scalability. This becomes more difficult when applying multi-level parallelism. We have presented a simple algorithm which distributes threads to tasks, computing near optimal solution to an NP-complete problem. We also show how 2-level parallelism can be implemented in OpenMP, using explicit thread programming. The work of parallelizing in 2 levels in OpenMP can often be time consuming, mainly because directives appropriate for the team concept are missing. For instance, a team-aware directive like t$OMP TEAMPRIVATE, corresponding to [$OMP THREADPRIVATE, can save the programmer for extra changes in the code. This directive is not a part of OpenMP 2.0, even if nesting was implemented in the compiler. The proper implementation of tasks in set S could have been made simpler and more intuitive with an ON_THREAD clause to the $OMP PARALLEL DO directive. Such a clause could give the programmer admission to dictate which thread should do which loop iteration. For debugging purpose, we also miss to use the [$OMP BARRIER within parallel do-loops. As we see larger SMP-systems becomes more and more common, the scalability of OpenMP becomes more important. Utilizing multi-level parallelism will become an important issue in this context. The extension for OpenMP 2.0 points in the right direction. We are, however, very unhappy with the fact that serializing nested parallelism is still compliant with the OpenMP spec. REFERENCES
[1] [2]
[3] [4] [5] [6]
OpenMP. h t t p : //www. openmp, o r g / . G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi. Complexity and Approximation. Springer, 1999. M. J. Berger and R. J. LeVeque. AMRCLAW, Adaptive Mesh Refinement + CLAWPACK. http ://www. amath, washington, edu/-rj I/amrclaw, 1997. R. Blikberg. Nested parallelism applied to AMRCLAW. In Proceedings ofEWOMP'02, Fourth European Workshop on OpenMP, 2002. R. Blikberg and T. Sorevik. Nested parallelism: Allocation of threads to tasks and OpenMP implementation. Journal of Scientific Programming, 9(2,3): 185-194, 2001. W. Gropp and R. Lusk. Using MPI, portable parallel programming with the Message Passing Interface. The MIT Press, 1994.
794
[7]
[8]
M. Luckfi and T. Sorevik. Parallel wavelet based compression of two-dimensional data. In Proceedings of Algoritmy 2000, 2000. P. Moczo, M. Luckfi, J. Kristek, and M. Kristekovfi. 3D displacement finite differences and a combined memory optimization. Bull. Seism. Soc. Am., 89:69-79, 1999.
Parallel Computing: SoftwareTechnology,Algorithms,Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
795
OpenMP on Distributed Memory via Global Arrays* L. Huang a, B. Chapman a, and R.A. Kendall b adept, of Computer Science, University of Houston, Texas. {leihuang,chapman }@cs.uh.edu bScalable Computing Laboratory, Ames Laboratory, Iowa. [email protected] This paper discusses a strategy for implementing OpenMP on distributed memory systems that relies on a source-to-source translation from OpenMP to Global Arrays. Global Arrays is a library with routines for managing data that is declared as shared in a user program. It provides a higher level of control over the mapping of such data to the target machine and enables precise specification of the required accesses. We introduce the features of Global Arrays, outline the translation and its challenges and consider how user-level support might enable us to improve this process. Early experiments are discussed. The paper concludes with some ideas for future work to improve the performance of this and other approaches to providing OpenMP on clusters and other distributed memory platforms. 1. I N T R O D U C T I O N OpenMP [13] provides a straightforward, high-level programming API for the creation of applications that can exploit the parallelism in widely-available Shared Memory Parallel Systems (SMSs). Although SMSs typically include 2 to 4 CPUs, large machines such as those based on Sun's 6800 architecture may have 100 and more processors. Moreover, OpenMP can be implemented on Distributed Shared Memory systems (DSMs) (e.g. SGI's Origin systems). A basic compilation strategy for OpenMP is relatively simple, as the language expects the application developer to indicate which loops are parallel and therefore data dependence analysis is not required. Given the broad availability of compilers for OpenMP, its applicability to all major parallel programming languages, and the relative ease of parallel programming under this paradigm, it has been rapidly adopted by the community. However, the current language was primarily designed to facilitate programming of modest-sized SMSs, and provides few features for large-scale programming. In fact, current implementations serialize nested parallel OpenMP constructs which can help exploit hierarchical parallelism. Also, OpenMP does not directly support other means of addressing the performance of programs on Non-Uniform Memory Access (NUMA) systems. A more serious problem for the broad deployment of OpenMP code *This work was partially supported by the DOE under contract DE-FC03-01ER25502 and by the Los Alamos National Laboratory Computer Science Institute (LACSI) through LANL contract number 03891-99-23. This work was performed, in part, under the auspices of the U.S. Department of Energy (USDOE) under contract W7405-ENG-82 at Ames Laboratory, operated by Iowa State University of Science and Technology and funded by the MICS Division of the Office in Advanced Scientific Computing Research at USDOE.
796 is the fact that OpenMP has thus been provided on shared memory platforms only (including ccNUMA platforms) and OpenMP parallel programs can therefore not be executed on clusters. Given the ease with which clusters can be built today, and the increasing demand for a reduction in the effort required to create parallel code, an efficient implementation of OpenMP on clusters is of great importance to the community. Efforts to realize OpenMP on Distributed Memory systems (DMSs) such as clusters have to date focused on targeting software Distributed Shared Memory systems (software DSMs), such as TreadMarks[ 1], OMNI/SCASH[ 14]. Under this approach, an OpenMP program does not need to be modified for execution on a DMS: the software DSM is responsible for creating the illusion of shared memory by realizing a shared address space and managing the program data that has been declared to be shared. Although this approach is promising, and work is on-going to improve the use of software DSMs in this context, there are inherent problems with this translation. Foremost among these is the fact that their management of shared data is based upon pages, and that these are regularly updated. Basically, the update is not under the application programmer's control and thus cannot be tuned to optimize the performance of the application. 2. GLOBAL ARRAYS Compared with MPI programming, Global Arrays (GA)[ 10] simplifies parallel programming on DMSs by providing users with a conceptual layer of virtual shared memory. Programmers can write their parallel program for clusters as if they have shared memory access, specifying the layout of shared data at a higher level. GA combines the better features of both messagepassing and shared memory programming models, leading to both simple coding and efficient execution. It provides a portable interface through which each process is able to independently and asynchronously access the GA distributed data structures, without requiring explicit cooperation by any other process. Moreover, GA programming model also acknowledges the difference of access time between remote memory and local memory and forces the programmer to determine the needed locality for each phase of the computation. By tuning the algorithm to maximize locality, portable high performance is easily obtained. Furthermore, since GA is a library-based approach, the programming model works with most popular language environments: currently bindings are available for FORTRAN, C, C++ and Python. 3. TRANSLATION FROM OPENMP TO GA
Global Arrays programs do not require explicit cooperative communication between processes. From a programmer's point of view, they are coding for NUMA (non-uniform memory architecture) shared memory systems. It is possible to automatically translate OpenMP programs into GA because each has the concept of shared data. If the user has taken data locality into account when writing OpenMP code, the benefits will be realized in the corresponding GA code. The translation can give a user advantages of both programming models: straightforward programming and cluster execution. We have proposed a basic translation strategy of OpenMP to GA in [7] and are working on its implementation in the Open64 compiler [12]. The general approach to translating OpenMP into GA is to generate one procedure for each OpenMP parallel region and declare all shared variables in the OpenMP region to be global arrays in GA. The data distribution is specified.
797 We currently assume a simple block-based mapping of the data. Before shared data is used in an OpenMP construct, it must be fetched into a local copy, also achieved via calls to GA routines; the modified data must then be written back to its "global" location after the computation finishes. GA synchronization routines will replace OpenMP synchronizations. OpenMP synchronization ensures that all computation in the parallel construct has completed; GA synchronization will do the same but will also guarantee that the requisite data movement has completed to properly update the GA data structures.
Program test integer a(100,100),j,k j=2 k=3 call sub 1 $OMP PARALLEL SHARED(a) PRIVATE(j) FIRSTPRIVATE(k)
Program test call MPI_INIT0 call ga_initialize0 common/ga_cb 1/g_a !create global arrays for shared variables g_a=ga_create(..) myid--ga_nodeid0 if(myid .eq. 0) then j=2 $OMP END PARALLEL k=3 call sub2 call sub 1 end endif subroutine sub2 call omp_sub 1(k,myid) $OMP PARALLEL call sub2 call ga_terminate0 $OMP END PARALLEL call MPI_FINALIZE(rc) end end subroutine sub 1 subroutine omp_sub 1(k,myid) integer i, j, k, g_a, myid,a(100,100) end common/ga_cb 1/g_a if (myid .eq. 0) call ga_brdcst(MT_INT, k, 1,0) !get a local copy of shared variables call ga_get(g_a,..,a,) !perform computation !put modified shared variables back global arrays call ga_put(g_a,..,a,) end b) Figure 1. OpenMP program(a) and translated GA program (b) . . o
The translated GA program (Fig. 1) first calls MPI_INIT and then GA_INITIALIZE to initialize memory for distributed array data. The initialization subroutines only need to be called once in the GA program. For dealing with sequential part of OpenMP program, the translation
798 may either attempt to replicate work or may let the master process carry out the sequential parts and broadcast the needed private data at the beginning of parallel part. A replicated approach may lead to more barriers and we have currently chosen the latter. All the sequential parts and subroutines (sub 1 in Fig.1 (a)) that do not include any OpenMP parallel construct are executed by the Master process only. However, the calls to subroutines (sub2 in Fig.1 (a)) that contain OpenMP parallel constructs need to be executed by every process, since GA programs generate a fixed number of processes at the beginning of the program, in contrast to the possibility of forking and joining threads during execution of an OpenMP code. All other processes except the Master process are idle during the sequential part. We intend to consider alternative approaches to deal with the implied performance problems as part of our future work. Variables specified in an OpenMP private clause can be simply declared as local variables, since all such variables are private to each process in a GA program by default. Variables specified in an OpenMP firstprivate clause need to be passed as arguments to an OpenMP subroutine (omp_sub 1 in Fig. 1 (b)), and broadcast to all other processes by the Master process. The translation will turn shared variables into distributed global arrays in GA code by inserting a call to the GA_CREATE routine. The handlers of global arrays need to be defined in a common block so that all subroutines can access the global arrays. GA permits the creation of regular and irregular distributed global arrays. If needed, ghost cells are available. The GA program will make calls to GA_GET to fetch the distributed global data into a local copy. After using this copy in local computations, modified data will be transferred to its global location by calling GA PUT or GA ACCUMULATE. GA TERMINATE and MPI FINALIZE routines are called to terminate the parallel region. OpenMP's FIRSTPRIVATE and COPYIN clauses are implemented via the GA broadcast routine GA_BRDCST. The REDUCTION clause is translated by calling GA's reduction routine GA_DGOP. GA library calls GA_NODEID and GA_NNODES are used to get process ID and number of processes, respectively. OpenMP provides routines to dynamically change the number of executing threads at runtime. We do not attempt to translate these since this would amount to redistributing data and GA is based upon the premise that this is not necessary. In order to implement OpenMP loop worksharing directives, the translated GA program calculates the new lower and upper loop bounds in order to assign work to each CPU based on the specified schedule. Each GA process fetches a partial copy of global data based on the array region read in the local code. Several index translation strategies are possible. A simple one will declare the size of each local portion of an array to be that of the original shared array; this avoids the need to transform array subscript expressions [16]. For DYNAMIC and GUIDED schedules, the iteration set and therefore also the shared data, must be computed dynamically. In order to do so, we must use GA locking routines to ensure exclusive access to code assigning a piece of work and updating the lower bound of the remaining iteration set; the latter must be shared and visible to every process. 
However, due to the expense of data transfer in distributed memory systems, DYNAMIC and GUIDED schedules may not be as efficient as static schedules, and may not provide the intended benefits. The OpenMP SECTION, SINGLE and MASTER directives can be translated into GA by inserting conditionals to ensure that only the specified processes perform the required computation. GA locks and Mutex library calls are used to translate the OpenMP CRITICAL and ATOMIC directives. OpenMP FLUSH is implemented by using GA put and get routines to update shared variables. This could be implemented with the GA_FENCE operations if more
799 explicit control is necessary. The GA_SYNC library call is used to replace OpenMP BARRIER as well as implicit barriers at the end of OpenMP constructs. The only directive that cannot be efficiently translated into equivalent GA routines is OpenMP's ORDERED. We use MPI library calls, MPI_Send and MPI_Recv, to guarantee the execution order of processes if necessary. Compared with the Software DSM approach for implementing OpenMP on distributed memory system, our strategy uses precise data transfer among processes instead of page-level data migration. Our approach may overcome the overhead of memory coherence and avoid false sharing or redundant data transfer in page-based software DSM. Furthermore, the explicit control mechanisms of OpenMP (or GA) allow the application developer to tune the synchronization events based on the performance and data requirements directly. 4. P E R F O R M A N C E ISSUES
An interactive translation from OpenMP to GA could achieve better performance by getting some help from users. For example, we currently assume a block distribution for shared arrays: it would help if users provided better data distribution information, or if the GA ghost (halo) interface were used where the algorithm requires it. GA supports a variety of block data distributions. However, some applications need a cyclic data distribution for better load balance, which GA does not support. The SPMD style of OpenMP programming [5, 9] has been investigated by some researchers; it provides better performance on cache-coherent Non-Uniform Memory Access (cc-NUMA) and software DSM systems by systematically applying data privatization. However, systematically applied data privatization may require more programming effort. Translation from SPMD-style OpenMP to a GA program can achieve better performance on DSM systems, but it is complex for the user to write this style of OpenMP program. Relaxing synchronization achieves better performance by allowing sequential work to be executed while waiting for synchronization. It is also useful in GA programs, which have multiple synchronization mechanisms: fences, mutexes, and a global barrier, as described above. For example, at the end of an OpenMP parallel loop construct a set of fences that identify the data synchronization required at the next immediate phase of computation could be used instead of a full barrier (GA_SYNC).
5. BENCHMARKS
We have translated small OpenMP programs into GA and tested their performance and scalability. The experiments shown here compare serial with OpenMP versions of the Jacobi code with a 1152x1152 matrix as input. Fig. 2 (a) gives the performance of the Jacobi OpenMP and GA programs on an Itanium 2 cluster with one 1 GHz 4-CPU node and 23 900 MHz 2-CPU nodes at the University of Houston; the results in Fig. 2 (b) were achieved using an SGI Origin2000 DSM system from NCSA, a hypercube with 128 195 MHz MIPS R10000 processors, in multiuser mode. Fig. 2 (c) shows the performance of this code on a 4x4 Sun cluster (one 4-way 400 MHz UltraSPARC-II E450 and three 4-way 450 MHz E420s) with Gigabit Ethernet connectivity; the serial version was measured on the E450 machine. The results of our experiments show that a straightforward translation from OpenMP to a GA program can achieve good scalability in DSM systems. We expect to get better performance by applying more optimizations.
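For reference, the parallelized kernel is the usual five-point Jacobi relaxation; a typical OpenMP version looks roughly as follows (our own sketch of a standard Jacobi step, not the exact benchmark source; array and variable names are illustrative):

! Illustrative only: one OpenMP Jacobi relaxation step of the kind used in the benchmark.
subroutine jacobi_step(n, h, u, f, unew, error)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: h, u(n,n), f(n,n)
  real(8), intent(out) :: unew(n,n), error
  integer :: i, j
  error = 0.0d0
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j) REDUCTION(+:error)
  do j = 2, n-1
     do i = 2, n-1
        ! five-point stencil update
        unew(i,j) = 0.25d0*(u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - h*h*f(i,j))
        error = error + (unew(i,j) - u(i,j))**2
     end do
  end do
!$OMP END PARALLEL DO
end subroutine jacobi_step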
[Figure 2 (panels a-c): Jacobi benchmark execution time and speedup versus number of processors, comparing the OpenMP and GA versions on the three platforms; see caption below.]
Figure 2. Jacobi program performance on an Itanium 2 cluster (a), SGI Origin 2000 (b), Sun cluster (c)
6. RELATED WORK
There have been a variety of efforts that attempt to implement OpenMP on a distributed memory system. Some of these are based upon efforts to provide support for data locality in ccNUMA systems, where mechanisms for user-level page allocation [11], migration and data distribution directives have been developed by SGI [4, 15] and Compaq [3]. Data distribution directives can be added to OpenMP [6]. However, this will necessitate a number of additional language changes that do not seem natural in a shared memory model. A number of efforts have attempted to provide OpenMP on clusters by using it together with a software distributed shared memory (software DSM) environment [1, 2, 14]. Although this is a promising approach, it does come with high overheads. An additional strategy is to perform an aggressive, possibly global, privatization of data. These issues are discussed in a number of papers, some of which explicitly consider software DSM needs [2, 5, 9, 17]. The approach that is closest to our own is an attempt to translate OpenMP directly to a combination of software DSM and MPI [8]. This work attempts to translate to MPI where this is straightforward, and to a software DSM API elsewhere. While this has similar potential to our own work, GA is a simpler interface and enables a more convenient implementation strategy. GA is ideal in this respect as it retains the concept of shared data.
7. CONCLUSIONS
This paper presents a basic compile-time strategy for translating OpenMP programs into GA programs. Our experiments show good scalability of a translated GA program in distributed memory systems, even with relatively slow interconnects. There are several ways to implement OpenMP on clusters. A direct translation such as that proposed here allows precise control of parallelism and creation of a code version that the user could manually improve if desired. All cluster code would benefit from a modification that explicitly considers how data/work will be
mapped to a machine, and this is no different. Some additional user information would also benefit translation, but this could be outside of the OpenMP standard. We intend to explore these issues further as part of our work on an implementation that will enable us to handle large-scale applications.
REFERENCES
[1] C. Amza, A. Cox et al.: Treadmarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18-28, 1996
[2] A. Basumallik, S-J. Min and R. Eigenmann: Towards OpenMP execution on software distributed shared memory systems. Proc. WOMPEI'02, LNCS 2327, Springer Verlag, 2002
[3] J. Bircsak, P. Craig, R. Crowell, Z. Cvetanovic, J. Harris, C. A. Nelson and C. D. Offner: Extending OpenMP For NUMA Machines. Proceedings of Supercomputing 2000, Dallas, Texas, November 2000
[4] R. Chandra, D.-K. Chen, R. Cox, D. E. Maydan, N. Nedeljkovic and J. M. Anderson: Data Distribution Support on Distributed Shared Memory Multiprocessors. Proceedings of the ACM SIGPLAN '97 Conference on Programming Language Design and Implementation, Las Vegas, NV, June 1997
[5] B. Chapman, F. Bregier, A. Patil and A. Prabhakar: Achieving Performance under OpenMP on ccNUMA and Software Distributed Shared Memory Systems. Special Issue of Concurrency: Practice and Experience, 14:1-17, 2002
[6] B. Chapman and P. Mehrotra: OpenMP and HPF: Integrating Two Paradigms. Proc. Europar '98, LNCS 1470, Springer Verlag, 65-658, 1998
[7] L. Huang, B. Chapman and R. Kendall: OpenMP for Clusters. Proc. EWOMP'03, Aachen, Germany, September 2003
[8] R. Eigenmann, J. Hoeflinger, R.H. Kuhn, D. Padua et al.: Is OpenMP for grids? Proc. Workshop on Next-Generation Systems, IPDPS'02, 2002
[9] Z. Liu, B. Chapman, Y. Wen, L. Huang and O. Hernandez: Analyses and Optimizations for the Translation of OpenMP Codes into SPMD Style. Proc. WOMPAT'03, LNCS 2716, 26-41, Springer Verlag, 2003
[10] J. Nieplocha, R.J. Harrison and R.J. Littlefield: Global Arrays: A non-uniform memory access programming model for high-performance computers. The Journal of Supercomputing, 10:197-220, 1996
[11] D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta and E. Ayguadé: Is Data Distribution Necessary in OpenMP? Proceedings of Supercomputing 2000, Dallas, Texas, November 2000
[12] Open64 Compiler, http://open64.sourceforge.net/userforum.html
[13] OpenMP Architecture Review Board: OpenMP Fortran Application Program Interface, Version 2.0, November 2000
[14] M. Sato, H. Harada and Y. Ishikawa: OpenMP compiler for a software distributed shared memory system SCASH. Proc. WOMPAT 2000, San Diego, 2000
[15] Silicon Graphics Inc.: MIPSpro 7 FORTRAN 90 Commands and Directives Reference Manual, Chapter 5: Parallel Processing on Origin Series Systems. Documentation number 007-3696-003. http://techpubs.sgi.com
[16] Mario Soukup: A Source-to-Source OpenMP Compiler. Master Thesis, Department of Electrical and Computer Engineering, University of Toronto
[17] T.H. Weng and B. Chapman: Asynchronous Execution of OpenMP Code. Proc. ICCS 03, LNCS 2660, 667-676, Springer Verlag, 2003
Performance Simulation of a Hybrid OpenMP/MPI Application with HESSE
R. Aversa a, B. Di Martino a*, M. Rak a, S. Venticinque a, and U. Villano b†
aDII, Seconda Università di Napoli, via Roma 29, 81031 Aversa (CE), Italy
bUniversità del Sannio, Facoltà di Ingegneria, C.so Garibaldi 107, 82100 Benevento, Italy
*The work by B. Di Martino was partially supported by "Centro di Competenza ICT", Regione Campania.
†The work by U. Villano was partially supported by Regione Campania, project 1.41/2000 "Utilizzo di predizioni del comportamento di sistemi distribuiti per migliorare le prestazioni di applicazioni client/server basate su web".
This paper deals with the performance prediction of hybrid OpenMP/MPI code. After a brief overview of the HESSE simulation environment, the problem of the tracing and of the simulation of an OpenMP/MPI application is dealt with. A parallel N-body code is presented as a case study, and its predicted performance is compared to the results measured in a real cluster environment.
1. INTRODUCTION
Recently there has been great interest in the adoption of clusters of symmetric multiprocessors (SMP) for parallel computing. This is mainly due to the availability of inexpensive computing nodes equipped with multiple CPUs, which are rapidly replacing conventional monoprocessor designs as building blocks for networks of workstations. Current workstation clusters are typically clusters of SMPs (hence the name CLUMPS), networked by either high-speed networks and switches, or inexpensive and rather slow commodity networks. In both cases, the architectural shift towards multiple CPUs for each processing element sharing a common memory has given a boost to the adoption of the shared-memory programming paradigm, widely used in the early days of parallel computing and progressively abandoned with the advent of distributed memory message-passing machines [9]. In light of the above, currently both high performance parallel systems and "poor man's" workstation and PC clusters share a two-layer architecture, where the CPUs inside a processing node can communicate efficiently by shared memory, whereas CPUs in different processing nodes have to exchange messages over the system network. Even if special two-tier programming models have been developed for such systems [4], the joint use of MPI and OpenMP is emerging as a de facto standard. In the hybrid MPI-OpenMP model, a single message-passing task communicating using MPI primitives is allocated to each SMP processing element, and the multiple processors with shared memory in a node are exploited by parallelizing loops using OpenMP directives and runtime support. Apart from studying and applying two different standards and models of parallelism (which is annoying), the hybrid MPI-OpenMP model can be relatively easily understood and programmed. After all, both MPI and OpenMP are well-established standards, and solid
documentation and tools are available to assist the program developer. The problem, as almost always in parallel programming, is the resulting performance. There is a wide body of literature pointing out the drawbacks and pitfalls of this model, from the use of SMP nodes, which fail to obtain a performance comparable to that of a traditional monoprocessor cluster with the same total number of CPUs [12], to the adoption of the hybrid programming model, which may lead to poor CPU utilization. Even neglecting the first point, the choice of the programming model is not simple. The use of a hybrid model exploiting coarse parallelism at task level and fine- or medium-grain parallelism at loop level is attractive, but its performance depends heavily on architectural issues (type of CPUs, memory bandwidth, caches, ...) and, above all, on the application structure and code. Depending on the latter, sometimes it may be preferable to use a canonic single layer MPI decomposition, allocating on each node a number of tasks equal to the number of CPUs [6, 7]. In spite of all that, the use of clusters of SMPs and of a hybrid shared-memory and message-passing programming model based on MPI and OpenMP stands as a viable and promising solution for the development of SMP cluster software. However, it provides the programmer with too wide a range of computational alternatives to be explored in order to obtain high performance from the system [5, 15]. The obvious solution is to resort to software development tools, able to predict the performance of the target application at any development step. These have proven to be an effective solution for message-passing parallel software development and tuning [10, 11, 2]. However, the use of predictive techniques is greatly complicated in a hybrid environment because of the higher architecture complexity and the multiplicity of software layers to be taken into account. In the last years, the authors have been active in the performance analysis and prediction field, developing HESSE [13, 14], a simulator of distributed applications executed in heterogeneous systems programmed in PVM or MPI. In the message-passing context, the adoption of trace-based simulation has turned out to be useful and accurate for the modeling and the performance analysis even of complex networked heterogeneous systems. The objective of this paper is to extend the scope of these techniques to hybrid OpenMP/MPI applications, modeling them in the HESSE simulation environment and predicting their performance. The use of HESSE for shared memory or hybrid parallel codes has never been dealt with before. To the authors' knowledge, no other parallel simulator and performance estimator tool is able to cope with hybrid shared memory/message passing applications. After a brief overview of the HESSE simulation environment, this paper will go on to describe how a hybrid OpenMP/MPI application can be instrumented and traced. Then, an application of moderate complexity will be presented, showing how its performance can be predicted and analyzed. Then the overall predicted application performance will be compared to the actual results obtained in the real (i.e., non-simulated) cluster environment, discussing the accuracy of the model used and the effectiveness of the proposed approach.
2. HYBRID MPI-OPENMP CODE MODELING IN HESSE
HESSE (Heterogeneous System Simulation Environment) is a simulation tool that can reproduce or predict the performance behavior of a (possibly heterogeneous) distributed system for a given application, under different computing and network load conditions [13]. The distinctive feature of HESSE is the adoption of a compositional modeling approach.
[Figure 1. HESSE Performance Analysis Process: application tracing produces an application trace; benchmarking and parameter evaluation feed configuration file generation, which produces a command file; input preparation is followed by simulation, yielding a simulated run trace and simulation reports that are then analysed.]
Distributed heterogeneous systems (DHS) are modeled as a set of interconnected components; each simulation component reproduces the performance behavior of a section of the complete system at a given level of detail. Support components are also provided, to manage simulation-support features such as statistical validation and simulation restart. System modeling is thus performed primarily at the logical architecture level. HESSE is capable of obtaining simulation models for DHS systems that can be very complex, thanks to the use of component composition. A HESSE simulation component is basically a hard-coded object that reproduces the performance behavior of a specific section of a real system. More specifically, each component has to reproduce both the functional and the temporal behavior of the subsystem it represents. In HESSE, the functional behavior of a component is the collection of the services that it exports to other components. In fact, components can be connected, in such a way that they can ask other components for services. On the other hand, the temporal behavior of a component describes the time spent servicing requests. Applications are in general described in HESSE through traces. A trace is a file that records all the relevant actions of the program, for a single specific execution. Traces can be obtained by application instrumentation and execution on a host system (e.g., a workstation used for software development), or through prototype-oriented software description languages, such as MetaPL [14]. Parallel application behavioral analysis takes place as shown in Fig. 1. A typical HESSE trace file for a message passing application written in PVM or MPI is a timed sequence of CPU bursts and of requests to the run-time environment. The simulator converts the length of the CPU bursts as measured in the tracing environment to their (expected) length in the actual execution environment, i.e., in the processing node where the code will be finally executed, taking into account CPU availability and scheduling, as well as memory and cache effects. On the other hand, the requests to the run-time environment involving interactions with other processing nodes are managed by simulating the data exchange on the physical transmission medium under predicted network conditions, the transmission protocols, the actual runtime algorithm adopted, etc. The basic mechanism used for conversion from trace times to simulated times is shown in [3]. In order to describe by means of traces the behavior of a hybrid OpenMP/MPI application it is necessary to extend this simple model. In fact, now each trace can contain three types of events:
• "sequential" computation bursts spent executing a single control flow (the master thread) on a single CPU;
• "parallel" computation bursts spent executing an OpenMP parallel construct on (typically) all CPUs in the processing node;
• MPI runtime requests, involving (typically) interactions with tasks executing in other nodes.
[Figure 2. Production of the predicted execution traces: sequential CPU bursts and parallel section/loop timings from the host execution trace are assembled into per-thread CPU bursts in the predicted execution trace.]
Traces can also contain further types of events that are suitably simulated, such as system calls and, in particular, I/O requests. For clarity's sake, we will ignore them in the discussion that follows. Sequential bursts and MPI requests can be simulated exactly as described above. The management of parallel computation bursts, although in theory similar to that of sequential ones, is instead a different matter. In fact, it should be observed that tracing information is collected on a machine which in general is equipped with a number of CPUs different from that of the SMP nodes of the target machine. This may be due to the opportunity to use a monoprocessor development host, or simply to the wish to use the simulator to predict the effect of a different number of CPUs per node. In either case, at tracing time it is necessary to collect low-level timing information on the loops/sections of a parallel OpenMP construct without any knowledge of the number of threads into which the construct will actually be decomposed at run-time (this number will typically be equal to the number of available CPUs in every node). At simulation time, this "elementary" timing information (whose format will be introduced later) will be "assembled" into the right number of threads, also simulating the static or dynamic scheduling of work to threads in the case of work-sharing OpenMP constructs. After that, the duration of the resulting threads will possibly be scaled to reflect the difference in speed between the host and target processors. Once the expected duration of the parallel threads has been found, the duration of a parallel burst is simply the duration of the longest parallel thread. The use of non-synchronized blocks, or of additional synchronization points among threads (for example, the use of a barrier) that may introduce processor idling, has to be explicitly taken into account by the simulated load scheduler.
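As an illustration of this assembly step, the expected duration of a statically scheduled parallel burst could be computed along the following lines. This is our own sketch, not HESSE source code; the round-robin chunk assignment and the multiplicative host-to-target speed factor are assumptions consistent with the description above.

! Illustrative sketch only: expected duration of a parallel burst for a
! statically scheduled loop record, given the per-iteration time measured on
! the host (titer), the number of iterations (niter), the chunk size (chunk),
! the number of simulated threads (nthr) and a host-to-target scaling factor.
function burst_time(titer, niter, chunk, nthr, speed) result(t)
  implicit none
  real(8), intent(in) :: titer, speed
  integer, intent(in) :: niter, chunk, nthr
  real(8) :: t, tthread(nthr)
  integer :: i, it, nchunks, left
  tthread = 0.0d0
  nchunks = (niter + chunk - 1) / chunk
  ! chunks are dealt out round-robin to threads, as in a STATIC,chunk schedule
  do i = 0, nchunks - 1
     it = mod(i, nthr) + 1
     left = min(chunk, niter - i*chunk)
     tthread(it) = tthread(it) + left * titer
  end do
  ! the burst lasts as long as its longest thread, scaled to the target CPU speed
  t = maxval(tthread) * speed
end function burst_time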
[Figure 3. Multitrace structure: each task trace contains, besides workload records for sequential bursts (e.g. "Workload 4.101800e+01") and MPI events (e.g. "MPM_Broadcast 0 8"), OpenMP records such as "OpenMP 65 ParallelFor Static 200 10" (constr_id, construct, schedule, iterations, chunk); the constr_id points into a separate OpenMP construct trace holding the timing, e.g. "65 const 2.412" (duration).]
The derivation by simulation of a predicted execution trace from an application trace is shown graphically in Fig. 2, where the use of faster CPUs in the final execution target than in the host machine where traces are first recorded can be recognized. In order to record the duration of multiple-thread CPU bursts, the original HESSE trace format has been changed, obtaining what we have called a multitrace. A program multitrace is made up of a trace for each task, containing records that correspond to an ordered sequence of events. This is the sequence that actually occurred during the traced execution of the program. As should be clear from the discussion above, the traced events are sequential bursts, recorded in the trace along with the (relative) time information on their host machine, MPI communication services (here no time information is recorded, since their behavior is fully simulated) and OpenMP constructs. An OpenMP trace record is essentially a placeholder within the sequence of traced events, informing that an OpenMP construct has been executed. It contains all the attributes of the construct, which can be deduced from the source, along with a pointer to all the relevant timing information, which is kept in a separate file. Fig. 3 illustrates the adopted structure. In the figure, the highlighted directive is a ParallelFor construct. The main trace file records the type of scheduling chosen in the code, the number of iterations and chunk size, and a construct identifier used as a pointer within the OpenMP trace file to the timing of a single loop iteration. In the "const" case, this is expressed as a constant relative execution time on the tracing machine. Iterations with highly variable execution time are instead modeled statistically (in the current implementation, by execution time mean and variance). In fact, the tracing system records the duration of each loop iteration. This information is post-processed to obtain the compact structure of the OpenMP construct trace file shown in the figure. The necessity of post-processing OpenMP timing information is the main reason for the adoption of a separate repository file for OpenMP construct timings.
3. CASE STUDY: PARALLEL N-BODY IN A CLUSTER OF SMP
In this section the behavior of a hybrid MPI-OpenMP code of moderate complexity (i.e., not a toy example) will be predicted under different working conditions. Then the performance figures thus obtained will be compared to those measured running the program in a real cluster environment. The program used as a case study solves the well-known N-body problem, which
simulates the motion of N particles under their mutual attraction. There is a wide body of knowledge on the parallelization of this problem, and a lot of highly-parallel and optimized algorithms and implementations have been devised through the years (see for example [8]). However, the objective of this paper is not to present an ultimate solution, but just to study by simulation the performance behavior of a code that is not too trivial. The well-understood N-body problem seems particularly fit for that, thanks to the fairly complex communication structure (roughly speaking, because the force on each particle is to be computed knowing the position of the remaining N - 1 bodies). The basis for the development of the hybrid code that will be simulated is an MPI version of a conventional Barnes-Hut tree algorithm. This can be described by the following pseudo-code:

/* main loop */
while (time<end) {
   /* build the octree */
   build_tree(p,t,N);
   /* compute the forces */
   forces(p,t,N);
   /* compute the minimal time step */
   delt=tstep(p,N);
   /* update the positions and velocities of the particles */
   newv(p,N,delt);
   newx(p,N,delt);
   /* update simulation time */
   time=time+delt;
}
At each step, an octal tree (t) whose nodes represent groups of nearby bodies is built, the forces on each particle are computed through a visit in the tree, the evolution time step (delt) is computed, and finally the velocities and the positions of the N particles (kept in the array p) are updated along with simulation time. Further details on the algorithm can be found in [1]. Without entering too deep into the details of the MPI parallelization, the construction of the octal tree works by letting each participating task compute a portion of the whole octree. Then, an allgather allows each task to get a (complete) copy of the octree. Incidentally, this approach makes it possible to obtain a data-parallel parallelization of the remainder of the algorithm, but is not applicable to large-scale problems, where the number of particles is too high to allow the replication of the tree. The system used as testbed for the chosen application is the Cygnus cluster of the Parsec Laboratory, at the Second University of Naples. This is a Linux cluster with 4 SMP nodes equipped with two Pentium Xeon 1 GHz, 512 MB RAM and a 40 GB HD, and a dual-Pentium III 300 frontend. The application was executed in three different conditions: MPI-only (i.e., not hybrid) decomposition into four tasks, MPI-only into eight tasks, and Hybrid (i.e., MPI-OpenMP) into four tasks with two threads each. It should be noted that the first decomposition is expected to lead to the poorest figures, since only four of the eight available CPUs are exploited. On the other hand, it is not easy to foresee which of the two other decompositions will be the best. As mentioned in the introduction, this is just the type of activity that a developer should carry out by means of performance prediction environments.
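For concreteness, in the hybrid decomposition the shared-memory level is exploited inside routines such as forces, roughly along the following lines. This is an illustrative sketch only, not the actual application source; the particle-range arguments and the tree_force routine are hypothetical stand-ins.

! Illustrative only: each MPI task owns a block of particles and the force
! loop over that block is parallelized with OpenMP.  tree_force stands in
! for the Barnes-Hut tree traversal that accumulates the force on particle i.
subroutine forces(p, t, n, ifirst, ilast)
  implicit none
  integer, intent(in) :: n, ifirst, ilast
  real(8), intent(inout) :: p(6, n)      ! particle data (replicated across tasks)
  real(8), intent(in) :: t(*)            ! (replicated) octree, built beforehand
  integer :: i
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
  do i = ifirst, ilast                   ! particles assigned to this MPI task
     call tree_force(p, t, n, i)
  end do
!$OMP END PARALLEL DO
end subroutine forces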
Table 1 shows in the first three rows the running time, measured on the real hardware and simulated, for the above-mentioned decompositions, and the resulting relative error. These results were obtained for a number of particles equal to 32768. Even if this is not the objective of this paper, it turns out that for this problem size (but not necessarily for all the other problem sizes, whose figures are not presented here for brevity) the MPI-8 decomposition outperforms the Hybrid one. Here it is instead important to point out that the relative error is under 5%. The second group of three rows shows the partial times for the forces routine, where most of the shared-memory OpenMP parallelism has been exploited. The errors, substantially in the same range as for the whole application, validate the type of simulation performed in HESSE for the OpenMP code, which is one of the main concerns of the work described here.

Table 1
Real execution and simulation timing

Code section        Cygnus        Simulated     Relative error (%)
MPI - 4             51647.588     52925.23      2.47
MPI - 8             28844.706     27543.547     4.56
Hybrid              30332.709     30038.19      0.97
MPI - 4 / forces    50789.188     52236.66      2.85
MPI - 8 / forces    27828.304     26765.865     3.81
Hybrid / forces     29299.368     27751.985     5.29
4. CONCLUSIONS
The use of a hybrid MPI-OpenMP decomposition is an interesting solution to develop high performance codes for clusters of SMPs. The thesis discussed in this paper is that the wide range of possible design alternatives calls for the use of predictive techniques, which allow the developer to obtain indications on the performance of a given code with reasonable accuracy. The problem of the tracing and of the simulation of OpenMP/MPI code has been dealt with here, and a solution implemented in the HESSE simulation environment has been described. A case study based on a parallel N-body code has also been presented; the small errors between the performance predicted and measured on a real cluster show the effectiveness of the approach adopted for the simulation of OpenMP code.
REFERENCES
[1] R. Aversa, G. Iannello, and N. Mazzocca, An MPI Driven Parallelization Strategy for Different Computing Platforms: A Case Study. LNCS, Vol. 1332 (1996) 401-408.
[2] R. Aversa, A. Mazzeo, N. Mazzocca, and U. Villano, Developing Applications for Heterogeneous Computing Environments using Simulation: a Case Study. Parallel Computing 24 (1998) 741-761.
[3] R. Aversa, A. Mazzeo, N. Mazzocca, and U. Villano, Heterogeneous System Performance Prediction and Analysis using PS. IEEE Concurrency 6, No. 3 (July-Sept. 1998) 20-29.
[4] D.A. Bader and J. Jaja, SIMPLE: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric Multiprocessors. J. of Par. and Distr. Comput. 58, No. 1 (1999) 92-108.
[5] T. Boku et al., Implementation and performance evaluation of SPAM particle code with OpenMP-MPI hybrid programming. Proc. EWOMP 2001, Barcelona (2001).
[6] F. Cappello and D. Etiemble, MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks. Proc. Supercomputing 2000 (2000) 51-62.
[7] E. Chow and D. Hysom, Assessing Performance of Hybrid MPI/OpenMP Programs on SMP Clusters. Submitted to J. Par. Distr. Comput., available at http://www.llnl.gov/CASC/people/chow/pubs/hpaper.ps.
[8] A. Grama, V. Kumar and A. Sameh, Scalable parallel formulations of the Barnes-Hut method for n-body simulations. Parallel Computing 24, No. 5-6 (1998) 797-822.
[9] W.D. Gropp and E. L. Lusk, A Taxonomy of Programming Models for Symmetric Multiprocessors and SMP clusters. Available at http://www-unix.mcs.anl.gov/gropp/bib/papers/1995/taxonomy.pdf
[10] G. Jost, H. Jin, J. Labarta, J. Gimenez and J. Caubet, Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures. Proc. IPDPS'03, Nice, France (2003) 80-89.
[11] J. Labarta, S. Girona, V. Pillet, T. Cortes and L. Gregoris, DiP: a Parallel Program Development Environment. Proc. Euro-Par 96, Lyon, France (1996) Vol. II 665-674.
[12] S. S. Lumetta, A. M. Mainwaring and D. E. Culler, Multi-protocol active messages on a cluster of SMP's. Proc. of the 1997 ACM/IEEE conference on Supercomputing (1997) 1-22.
[13] N. Mazzocca, M. Rak, and U. Villano, The Transition from a PVM Program Simulator to a Heterogeneous System Simulator: The HESSE Project. LNCS, Vol. 1908 (2000) 266-273.
[14] N. Mazzocca, M. Rak, and U. Villano, The MetaPL approach to the performance analysis of distributed software systems. Proc. WOSP 2002, Rome, Italy, ACM Press (2002) 142-149.
[15] Rolf Rabenseifner, Hybrid Parallel Programming on HPC Platforms. Proc. EWOMP '03 (2003) 185-194.
An environment for OpenMP code parallelization
C.S. Ierotheou a, H. Jin b, G. Matthews b, S.P. Johnson a, and R. Hood b
aParallel Processing Research Group, University of Greenwich, London SE10 9LS, UK
bNASA Advanced Supercomputing Division, NASA Ames Research Center, Moffett Field, CA 94035, USA
In general, the parallelization of compute intensive Fortran application codes using OpenMP is relatively easier than using a message passing based paradigm. Despite this, it is still a challenge to use OpenMP to parallelize application codes in such a way that will yield an effective scalable performance gain when executed on a shared memory system. If the time to complete the parallelization is to be significantly reduced, an environment is needed that will assist the programmer in the various tasks of code parallelization. In this paper the authors present a code parallelization environment where a number of tools that address the main tasks such as code parallelization, debugging and optimization are available. The parallelization tools include ParaWise and CAPO which enable the near automatic parallelization of real world scientific application codes for shared and distributed memory-based parallel systems. One focus of this paper is to discuss the use of ParaWise and CAPO to transform the original serial code into an equivalent parallel code that contains appropriate OpenMP directives. Additionally, as user involvement can introduce errors, a relative debugging tool (P2d2) is also available and can be used to perform near automatic relative debugging of an OpenMP program that has been parallelized either using the tools or manually. In order for these tools to be effective in parallelizing a range of applications, a high quality fully interprocedural dependence analysis as well as user interaction are vital to the generation of efficient parallel code and in the optimization of the backtracking and speculation process used in relative debugging. Results of parallelized NASA codes are presented and show the benefits of using the environment.
1. INTRODUCTION
Today the most popular parallel systems are based on either shared memory, distributed memory or hybrid distributed-shared memory systems. For a distributed memory parallelization, a global view of the whole program can be vital when using a Single Program Multiple Data (SPMD) paradigm [2]. The whole parallelization process can be very time consuming and error-prone. For example, to use the available distributed memory efficiently, data placement is an essential consideration, while the placement of explicit communication calls requires a great deal of expertise. The parallelization on a shared memory system is only relatively easier. The data placement may appear to be less crucial than for a distributed memory parallelization and a more local loop level view may be sufficient in many cases, but the parallelization process is still error-prone, time-consuming and still requires a detailed level of expertise. The main goal
for developing tools that can assist in the parallelization of serial application codes is to embed the expertise within automated algorithms that perform much of the parallelization in a much shorter time frame than what would otherwise be required by a parallelization expert doing the same task manually. In addition, the toolkit should be capable of generating generic, portable, parallel source code from the original serial code [1]. In this paper we discuss the tools that have been developed and their interoperability to assist with OpenMP code parallelization, specifically targeted at shared memory machines. These include an interactive parallelization tool for message passing based parallelizations (ParaWise) that also contains dependence analysis capability and many valuable source code browsers; an OpenMP code generation module (CAPO) with a range of techniques that aid in the production of efficient, scalable OpenMP code; and a relative debugger built on P2d2 and capable of handling hundreds of parallel processes.
2. PARAWISE, CAPO AND P2D2 TOOLS
The tools in this environment have been used to parallelize a number of FORTRAN application codes successfully for distributed memory [1, 2, 4] and shared memory [5, 6, 7] systems based on distributing arrays and/or loop iterations across a number of processors/threads. A detailed description of the tools will not be given here but can be found elsewhere [2, 5]. Instead, an overview is presented here.
2.1. Overview
Figure 1 shows an overview of the various tools, their functions and the nature of the interactions between them. Note that the dependence analysis and directive insertion engines provide facilities that are used by other components (e.g. symbolic algebra, proofs etc). The expert assistant and profiling and tracing tools are not completely integrated into the environment and are part of an ongoing and longer term project (see section 4).
2.2. CAPO code generation features
As part of the code generation process CAPO will automatically identify a number of essential characteristics about the code. In particular CAPO will automatically classify the different types of loops in the code, of which there are four basic types. Serial - loops contain a loop-carried true data dependence that causes the serialization of the loop; other possible reasons for a loop to be defined as serial include the presence of I/O or loop exiting statements within the loop body. Covered serial - as with a serial loop but contains, or is contained in, nested parallel loops; if the serial loop can be made parallel then the parallelism may be defined at a higher level. Chosen parallel - parallel loops at which the OMP DO directive is defined; these also include parallel pipeline and reduction loops. Not chosen parallel - parallel loops not selected for application of the OMP DO directive because these loops are surrounded by other parallel loops at a higher nesting level. All of the code generation is automatic and includes identifying parallel loops using the interprocedural dependence analysis to define parallelism at a coarser level and where to place the OMP DO directive; creating PARALLEL regions based on the identified parallel loops; merging consecutive PARALLEL regions into a single region to reduce overheads in thread start up and shut down; detecting and producing the NOWAIT clause on an ENDDO to reduce barrier synchronisations when this is proven legal; and identifying and defining the scoping of all variables in PARALLEL regions such as SHARED, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, THREADPRIVATE, etc.
813
(SERIALCODE) PARAWISE~ IP , [ ANALYSIS DEPENDENCE .... i....Deletion d EXPERT t ENGINEt Info Dependence ASSISTANT !DApEIEND CEGRAPH~
~CO--~DEj
i
!
I
HISTORY , 1
Bou~pit u~!ELATIVEoEBUGRGER n
~ARALLELDEBUGGER IPARALLELPROFILER__ &TRACEVISUALIZER
Z \ (I .REA ) .REAO)
1-THREAD LoopSpeedupandOpenMPRuntimeOverheads
Figure 1. Overview of the environment indicating the interactions between the different tools ables in PARALLEL regions such as SHARED, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, THREADPRIVATE etc. The quality of the parallel source code generated is derived from many of the features provided by ParaWise. For example, the dependence analysis is fully interprg,cedural and valuebased [3 ] (i.e. the analysis detects the flow of data rather than just the memory location accesses) and allows the user to assist with essential knowledge about program variables. There are many reasons why an analysis may fail to determine the non-existence of a dependence accurately. This could be due to incorrect serial code, a lack of information on the program input variables, limited time to perform the analysis and limitations in the current state-of-the-art dependence algorithms. For these reasons it is essential to allow user interaction as part of the process, particularly if scalability is required on a large number of processors. For instance, a lack of knowledge about a single variable that is read into an application can lead to a single assumed data dependence that serializes a single loop, which in turn greatly affects the parallel scalability of the application code. An example to illustrate the benefit of using a quality interprocedural analysis is shown in the next section. 2.3. User interaction Apart from the obvious need for user interaction to enable the removal of loop serializing dependence(s), efficient parallel OpenMP code requires the consideration of other factors. Due to the relative expense in starting threads as part of a PARALLEL region definition and barrier synchronizations within the region it is essential to attempt to place directives so that these overheads are significantly reduced. To this end, sophisticated algorithms have been developed for CAPO to improve the merging of PARALLEL regions and to prove the independence of threads in loops within the same PARALLEL region where a barrier synchronization can been
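The following fragment sketches the kind of output these optimizations aim at (illustrative only, not actual CAPO output): two consecutive parallel loops are merged into one PARALLEL region, and the first END DO carries a NOWAIT, which is legal here because, with the same STATIC schedule on both loops, each thread only reads values of a that it wrote itself.

! Illustrative sketch of a merged PARALLEL region with an elided barrier.
subroutine merged_region(n, a, b, c)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: a(n), b(n), c(n)
  integer :: i
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
!$OMP DO SCHEDULE(STATIC)
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
!$OMP END DO NOWAIT            ! barrier proven unnecessary: same static iteration mapping below
!$OMP DO SCHEDULE(STATIC)
  do i = 1, n
     c(i) = c(i) + a(i)
  end do
!$OMP END DO
!$OMP END PARALLEL
end subroutine merged_region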
These algorithms rely heavily on dependence analysis; if a dependence that inhibits these optimizations has been assumed to exist, then user control is necessary to provide additional information to the tool. Figure 2 shows a sample of code taken from an ocean modelling code [6] (due to space restrictions the sample code also includes the final OpenMP directives discussed later). For the serial code, the variables being read in at run time mean that the static analysis alone has been unable to determine that the variable np (used in statement S2) cannot be equal to nc or nm (used in statement S1) for a given main time step iteration. In this case CAPO will define PARALLEL regions around the k loops in the BAROCLINIC subroutine. If the user provides information to ParaWise such that np can never be equal to nc or nm in a given time step iteration, this results in the removal of some assumed dependencies and allows the i and j loops to be executed as parallel loops. This optimization enables parallelism at an outer level (the j loop) where the PARALLEL region includes 12 nested parallel loops as well as calls to other subroutines. In this case the optimization goes a step further: since the entire BAROCLINIC subroutine is defined within a single PARALLEL region, the interprocedural feature of the tools means that the final PARALLEL region can be defined in some routine higher up in the call graph, in a calling routine, as indicated in Figure 2. Although user interaction is essential, it does allow for the possible introduction of errors in the parallel code. The incentive to enable parallelism can lead to the user being over-optimistic. Apart from the erroneous deletion of a dependence or the incorrect bounds for a variable being added, the user can be presented with a choice of solutions. For example, a loop that is serial due to the re-use of a non-privatizable variable between iterations can be executed in parallel if the variable can be privatized or if the re-use dependence can be proven non-existent. An incorrect choice by the user can lead to erroneous results in the parallel execution.
2.4. Automatic relative debugging As the user is likely to introduce errors into the parallel code, tools are needed in the environment to assist the user in identifying the incorrect decisions made during the parallelization. Automatic relative debugging achieves this by comparing the execution of the serial program with that of the parallelized version, automatically finding the first difference in the parallel program that may have caused an incorrect value that was originally identified. Automation of the search for the first difference relies on the availability of an interprocedural dependence analysis of the program and the ability to determine mappings from the serial to the parallel program [8]. Here, the automatic relative debugging approach is applied to shared-memory OpenMP programs and determines the first PARALLEL region that contains a difference that may have caused an incorrect value that has been identified. Previous work [8] provides a thorough presentation of the algorithm used to determine the first difference between the serial (one processor parallel) execution and an N-processor parallel execution for a distributed memory, message passing based code. The steps (1)- (4) below are repeated until the earliest difference is found: ( 1) Find the possible definition points of the earliest incorrect value observed so far, using dependence analysis information (2) Examine the variable references on the right-hand sides of those definitions to determine a set of suspect variable references to monitor in a re-execution
(3) Instrument the suspect variable references in both serial and parallel versions of the program (4) Execute the instrumented programs, stopping when a difference (i.e. a bad value) is detected at an instrumentation point. If any bad value encountered has not been previously observed then it is used as input for step (1), otherwise debugging is terminated.

      read*,nm,np,nc
      do loop=1,mxiter_time
         call TIMESTEP
         ...
      enddo

      SUBROUTINE TIMESTEP
      nnc=np
      nnm=nc
      nnp=nm
      np=nnp
      nc=nnc
      nm=nnm
      ...
 200  continue
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(jc,LOWLIM,itbtp,itbt,
!$OMP&   nnc0,nnm0,nnp0,c2dtbt,ntbtp0,ntbt2,ic,k,m,bmf,smf,ncon,stf,n)
!$OMP&   SHARED(dy2r,dx2r,grav,np0,dtbt,dchkbd,nm0,dx,nc0,dy,
!$OMP&   dyr,dxr,ntbt,mxpas2,c2dtuv,np,nc,nm,pcyc,c2dtts,gamma)
      call BAROCLINIC
      ...
      if(maxits.and.eb)THEN
         nc=nnp
         np=nnm
         maxits=.false.
      endif
      goto 200

      SUBROUTINE BAROCLINIC
!$OMP DO
      do j=jsta,jend
         do i=ista,iend
            ...
            do k=1,kmc
S1             .... v(k,i-1:i+1,j,nc), v(k,i-1:i+1,j,nm),
     &              v(k,i,j-1:j+1,nc), v(k,i,j-1:j+1,nm)
            enddo
            do k=1,kmc
S2             v(k,i,j,np) ....
            enddo
            ....
         enddo
      enddo

Figure 2. Pseudo code showing the CAPO generated OpenMP directives.

The automated search for the first difference begins after all necessary dependence and directive information is obtained. An incorrect variable at a specific location in the execution is used as the starting point of the search, typically at an output statement where the output information differs between the serial and the parallel execution. The steps above are performed by the relative debugging controller and P2d2; the controller guides backtracking [9] and re-execution based on dependence analysis information and P2d2 retrieves and compares values. When necessary, the controller makes use of directive information by asking the CAPO library to determine, for example, if a variable is PRIVATE or SHARED at a given location in the application code [10]. P2d2 is used to manage control of the serial and parallel executions at the location of the earliest known difference. Two optimizations are implemented to reduce the number of re-executions necessary to search for differences. First, the set of suspect variables determined in step (2) were computed in previous program states and are therefore candidates for examination by tracing their values without the use of re-execution (backtracking). The memory overhead of tracing is undesirable, so a limited form of backtracking that utilizes only the current program state to retrieve values that have not been overwritten (based on dependence analysis) is used. If any value retrieved in this way is found to be incorrect then that value is used as input to step (1), saving a re-execution.
Table 1
Summary of codes parallelized using parallelization tools

Application              LU, SP, BT      FT, CG, MG      CTM             GCEM3D           OVERFLOW
Code size                3K lines        2K lines        16K lines       18K lines        100K lines
                         benchmark       benchmark       105 routines    100 routines     851 routines
Dependence analysis      0.5-1 hr        0.5-1 hr        1 hr            6.5 hrs          25 hrs
Code generation          under 5 mins    under 5 mins    10 mins         30 mins          30 mins
User effort/tuning       1 day           1 day           2 days          14 days          4 days
Total manual time        3 weeks         3 weeks         5 months        1 month          8 months
Performance (cf with     within 5-10%    within 10-35%   better by 30%   factor 8 better  slightly better
manual version)
Sample speed up          BT: 30 on 32    CG: 22 on 32    3.5 on 4        24 on 32         16 on 32
For values of suspect variables that have been overwritten, a second optimization is employed. Such values are treated as if they were incorrect and steps (1) and (2) are speculatively applied to the definition points of each to find if any of their right-hand side values can be retrieved with backtracking. Note that this optimization can be applied recursively for any right-hand side value that has been overwritten. If a bad value is found among the right-hand side values then it is assumed that the overwritten value was also incorrect. The overall search for the first difference with the bad right-hand side value as input for step (1) is then continued. For a CAPO parallelized code the stored dependence graph and other information can be used. This tool can also be applied to manually parallelized OpenMP codes where a dependence analysis and directive extraction are performed on initialization of the tool.
3. RESULTS
A number of codes have been parallelized both manually and using the parallelization tools. The codes include the NAS parallel benchmarks - a suite of well-used benchmark programs; CTM - a NASA Goddard code that is used for ozone layer climate simulations; GCEM3D - a NASA Goddard code used to model the evolution of cloud systems under large scale thermodynamic forces; OVERFLOW - a NASA Ames version that is used to model aerospace CFD simulations. Table 1 summarizes the approximate time taken for the various efforts involved in parallelizing these applications. In nearly all cases the quality of the code generated was at least as good as the manually parallelized version and was achieved with a significantly reduced user effort. Unlike the other codes, the performance for the GCEM3D code was enhanced by using the Paraver [11] profiling tool together with the existing environment tools. For all the larger codes CTM, GCEM3D and OVERFLOW user interaction was required to identify more parallelism, enable effective scalability and produce a significant speed up. This was performed by the authors, taking great care in the decisions made. To examine the effectiveness of the relative debugging facility an error was deliberately introduced into the parallel NAS LU code and the first indication of incorrect output was used as the start point for the debugging algorithm. After 3 re-executions for successively earlier instrumentation points, a variable at a particular location was identified as the first difference in the code. This related directly to the erroneous user interaction deliberately introduced as part of the earlier parallelization, exposing the likelihood that it was incorrect.
4. FUTURE WORK AND CONCLUDING REMARKS
The quality of the code generated yields comparable performance to a manual parallelization effort and, since almost all of the work is automated, the total time to parallelize the application is significantly reduced when using the tools. The authors are working on an "expert assistant" that will guide the user to the reasons why, for example, loops are serialized or variables are non-privatizable, by asking pertinent questions that the user can attempt to answer. The aim of the expert assistant is to try and exploit any parallelism that is not immediately apparent. It is also envisaged that the questions can be prioritized by interacting with a profiling tool that can indicate inefficiencies in the parallel execution, such as loops exhibiting a poor speed up, frequently executed PARALLEL region start/stop overheads and barrier synchronizations within PARALLEL regions. These focus the user on code sections that have a significant effect on parallel performance.
5. ACKNOWLEDGEMENTS
The authors would like to thank their colleagues involved in the many different aspects of this work, including Gabriele Jost and Jerry Yan (NASA Ames), Dan Johnson, Wei-Kuo Tao and Steve Steenrod (NASA Goddard), Emyr Evans, Peter Leggett, Jacqueline Rodrigues and Mark Cross (Greenwich). Finally, the funding for this project from AMTI subcontract No. SK-03N-02 and NASA contract DTTS59-99-D-00437/A61812D is gratefully acknowledged.
REFERENCES
[1] E.W. Evans, S.P. Johnson, P.F. Leggett, M. Cross, Automatic and effective multidimensional parallelisation of structured mesh based codes. Parallel Computing, 26, 677-703, 2000.
[2] C.S. Ierotheou, S.P. Johnson, M. Cross and P.F. Leggett, Computer aided parallelisation tools (CAPTools) - conceptual overview and performance on the parallelisation of structured mesh codes. Parallel Computing, 22, 197-226, 1996.
[3] S.P. Johnson, M. Cross and M. Everett, Exploitation of symbolic information in interprocedural dependence analysis. Parallel Computing, 22, 197-226, 1996.
[4] S.P. Johnson, C.S. Ierotheou and M. Cross, Computer aided parallelisation of unstructured mesh codes. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Editors H.R. Arabnia et al, publisher CSREA, vol. 1, 344-353, 1997.
[5] H. Jin, M. Frumkin, and J. Yan, Automatic generation of OpenMP directives and its application to computational fluid dynamics codes. Intl Symposium on High Performance Computing, Tokyo, Japan, October 16-18, 2000, Lecture Notes in Computer Science, Vol. 1940, 440-456.
[6] C.S. Ierotheou, S.P. Johnson, P.F. Leggett and M. Cross, Using an interactive parallelisation toolkit to parallelise an ocean modelling code. FGCS, vol 19, 789-801, 2003.
[7] H. Jin, G. Jost, D. Johnson, W-K. Tao, Experience on the parallelization of a cloud modeling code using computer-aided tools. NASA Technical report NAS-03-006, 2003.
[8] G. Matthews, R. Hood, S.P. Johnson, and P.F. Leggett, Backtracking and re-execution in the automatic debugging of parallelized programs. Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, Edinburgh, Scotland, July 2002.
[9] H. Agrawal, Towards automatic debugging of computer programs. Ph.D. Thesis, Department of Computer Sciences, Purdue University, West Lafayette, IN, 1991.
[10] G. Matthews, R. Hood, H. Jin, S.P. Johnson, and C.S. Ierotheou, Automatic relative debugging of OpenMP programs. To appear in EWOMP 2003 conference proceedings, 2003.
[11] http://www.upc.cebpa.es/paraver
Hindrances in OpenMP programming
F. Massaioli a
aCASPUR, Inter-university Consortium for SuperComputing Applications, V. dei Tizii 6/b, 00185 Roma, Italy
The OpenMP standard is an impressive success, as it has been widely embraced from its appearance and has become the de facto standard for HPC on SMP systems. However, OpenMP programmers have to face obstacles and difficulties. Most of those limit possible uses of OpenMP in mainstream programming areas.
1. INTRODUCTION
In the early 90's, the High Performance Computing (HPC) community switched from vector supercomputers to workstation clusters and MPP systems, with a significant but partial success. Two relevant sets of users were unable to exploit the technology: those with applications unsuitable to distributed memory architectures and the ones who just needed moderate speedups. In both cases, those users felt the additional work needed to parallelize on message passing architectures was excessive with respect to the moderate advantages. When affordable, RISC based Symmetric MultiProcessor (SMP) systems appeared around 1995, no clear standard for shared memory parallel programming was available. Using sets of proprietary directives meant losing portability. Pthreads allowed, in principle, for portability, but they were more amenable to functional decomposition, while most HPC problems lent themselves naturally to data distribution among workers. Last but not least, debugging shared memory parallel programs was even more difficult than for message passing ones. When the OpenMP standard [1] appeared in 1997, it brought new life to parallel computing. An easy and portable way to exploit multithreading was at last available to FORTRAN programmers; a set of directives for the C and C++ programming languages was published one year later. Good development tools were ready on the market. Many users embraced it and the number of parallel applications and parallel programmers grew quite rapidly.
2. THE OPENMP SUCCESS
The quick spread of OpenMP was a remarkable success. From the very beginning a multiplatform, uniform implementation, the KAP/Pro ToolSet [2], was made available by Kuck and Associates Inc. (now KAI Software Lab, a division of Intel Americas, Inc.). A few system vendors promptly added OpenMP support to their compilers, and all the relevant ones eventually supported the standard. A few ISV compiler vendors introduced OpenMP support in their products. The driving force, however, was the quick adoption of OpenMP as a fundamental programming paradigm by the HPC community.
OpenMP also generated related research efforts, exploring many topics like compiling techniques, runtime library optimizations, adaptiveness to the environment, microscopic and system level benchmarks, and possible extensions. A user group, cOMPunity, keeps users and researchers in touch and can express one vote in the OpenMP Architecture Review Board. The cOMPunity web site [3] is an invaluable reference to all OpenMP related activities. Two main factors are able to explain the OpenMP success. The first and foremost is OpenMP simplicity, which makes it more attractive than traditional multithreading standards. The second is the widespread appearance of multiprocessor systems at affordable prices. OpenMP relies on a small set of simple and well defined directives. An OpenMP program can be a valid serial program by just protecting possible calls to API functions with conditional compilation. Since all variables are in a single address space, parallelization can be performed incrementally, allowing proper administration of development efforts. OpenMP does not depend on the specific threading model of choice on a particular platform. While multithreaded programs written using Pthreads calls are in principle portable to every computing platform, the resulting performance, according to the specific strategies adopted in the particular OS, may be suboptimal. Moreover, subtle discrepancies among different Pthreads implementations used to be a significant burden to the programmer. If a program can be coded using just OpenMP directives and API calls, portability issues are greatly reduced, and performance can be positively affected if the OpenMP compiler/runtime makes effective use of the multithreading model of choice on a specific system. OpenMP has many restrictions with respect to the full capabilities of multithreading: the limitations inherent to the work-sharing concept, the fact that synchronization points must be met by all the threads in a team, and the clear (graphical!) evidence of directives with respect to library calls, reduce the adverse impact of classical concurrent programming mishaps, and make debugging and program correctness verification much easier. This is of paramount importance to those many computational scientists writing parallel programs with no formal education in the troubles and perils of concurrency. Of course, simplicity by no means implies easiness. Efficient scaling of parallel applications on many CPUs is not easier in OpenMP than in other parallel programming models. The work-sharing concept implies synchronization points, both explicit and implicit, which sometimes cannot be dispensed with. Synchronization costs grow with the number of threads, and are crucially affected by system architectures and runtime library internals. Many issues come up to the programmer's attention when using 16 or more processors, even outside of the scope of OpenMP. One notable issue has to do with the huge size of data which can be managed in a single multithreaded process. Exploiting big address spaces effectively can call for a deeper understanding not only of the system architecture, but of the specific processor implementation [5]. Serial or message passing programmers are not exposed to such vagaries. However, most users are not looking for the highest absolute performance; they just aim at moderate (i.e. adequate) speedups, programming productivity, and portability. OpenMP is, to the majority of them, a perfect balance between simplicity and performance.
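A minimal illustration of the point about conditional compilation: the OpenMP sentinel "!$" hides runtime library calls from a non-OpenMP compilation, so the same source remains a valid serial program (the program below is our own example, not taken from the standard):

! Illustrative only: lines starting with "!$" are compiled only when OpenMP is enabled.
program hello_threads
!$ use omp_lib
  implicit none
  integer :: nthreads
  nthreads = 1                        ! serial default
!$ nthreads = omp_get_max_threads()   ! only active under an OpenMP compilation
  print *, 'running with ', nthreads, ' thread(s)'
end program hello_threads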
The second factor behind the rapid acceptance of OpenMP by the HPC community is the growing availability of SMP systems at affordable prices, both in the traditional "big iron" server market and in the low-end, commodity PC market. As a consequence, computational scientists
can easily find significant SMP servers to run their applications, and can buy small and cheap SMP PCs for program development and small simulations. This trend has been driven mainly by the needs of commercial applications. While the typical HPC programmer is concerned with the distribution of identical units of work among processors, with fairly rigid synchronization points, the classic uses of multithreading outside of HPC are mostly related to the programming of complex applications and to the efficiency of access to data in a more asynchronous context. On big servers, where many clients asynchronously send requests to extract information from a common data repository, multithreading allows for more efficient resource utilization, less load on the OS with respect to separate processes, and, of course, exploitation of more than one CPU when available in the system. In personal productivity software, much of the gain arises from the programming simplification inherent in the functional decomposition of the whole program into independent or loosely dependent tasks, both in background and foreground.
Those apparently diverging scenarios are, however, merging. On one hand, HPC programmers exhibit growing interest in more asynchronous models of computation, not only for data-intensive applications, like those found in the life sciences, but also for the simulation of complex interactions, as in immune systems and financial markets [6]. Likewise, commercial software programmers are looking at the HPC work-sharing style to cope with the processing of huge databases. Personal computers already exploit such techniques for the most demanding applications (audio, video and photographic data processing being just the most "spectacular" examples). Finally, the constant demand for more computing power at lower prices is furthering the trend. Simultaneous MultiThreading (SMT) allows aggressively clocked CPUs to keep their pipelines at work by issuing instructions coming from more than one program flow. SMT is already available in some Intel CPUs (under the HyperThreading trademark), and Intel has stated more than once its intent to eventually implement it in all of its PC and server processor product lines. IBM and Sun have publicly announced future processors with SMT technologies. It seems that in a few years, a programmer aiming at squeezing the most computing power out of a CPU will have to resort to multithreading.
Building on a big success and furthering it to new levels is always a difficult task. Does OpenMP have a future? An easy answer could be that OpenMP suits the needs of a significant part of the HPC community, and it is here to stay: don't fix what is not broken, and don't try to push it outside of its design limits. While this makes partial sense, this "royal stagnation" scenario would probably doom OpenMP to a demise, for three good reasons. First, even if many HPC programmers are mostly happy with OpenMP, this does not mean that they find it perfect or complete. There are many rough spots in OpenMP programming: limitations, missing features, excessive freedom left to implementors, inadequate quality of available tools. Disregarding those issues could turn users away. Second, even if many HPC programmers are mostly or completely satisfied with the present status of OpenMP, actual needs change over time, and some new requirements are already emerging. Third, for something to keep going in the computing market, relentless progress is mandatory.
For OpenMP this means that, to keep it alive, new fields of application must be opened, and in a market-driven economy HPC is not an interesting enough market to drive vendor choices. Although 30 years ago HPC had the technical lead in computing, nowadays HPC must exploit the best commercial technologies. The hardware and software trends discussed above give some hope regarding the latter points.
HPC programmers are more and more interested in programming techniques akin to those in use in mainstream programming. Commercial software developers, on the other hand, are starting to exploit traditional HPC programming techniques in their products. Moreover, OpenMP's simplicity appeals to mainstream programmers: multithreading is becoming more and more relevant, it is easier to learn multithreading with OpenMP than directly on Pthreads or WinThreads, and OpenMP is portable from platform to platform. As a matter of fact, some newsstand magazines about Linux have already featured OpenMP [7], even in local editions [8]. Apparently, a convergence of interests is possible. However, strengthening the hold on the HPC market and widening the market to other programmers is a difficult task, calling for extensions to the standard and better tools.
3. STANDARD RELATED ISSUES
The official standard revision is 2.0 for both FORTRAN and C/C++. The Architecture Review Board (ARB) is actively working on a version 2.5, whose main aim is to merge the specifications for the two languages and to clarify some points whose formulation was felt to be inadequate. This is a precious and delicate task. Some proposed extensions are being investigated by the ARB Futures Committee. This is one of the crucial activities for the future of OpenMP. Many extensions and modifications have been proposed in the past by vendors, researchers and users. The ARB is (properly) reluctant to add features to the standard, to avoid hurried steps in wrong directions and damage to OpenMP's invaluable simplicity. However, many problems have been on the table for many years, and some users perceive this reluctance with a sense of frustration, feeling ignored in their troubles. Some of those troubles are worth a quick overview.
One of the most tedious and error-prone parts of parallelizing a code with OpenMP is the qualification of variable scopes in parallel and work-sharing directives. In the simpler cases, the vast majority of the variables have private or shared scope, so that a suitable default(...) clause and a few specific indications do the job. In most cases the situation is not so favorable: some variables are shared everywhere, but the scoping of most variables varies from region to region, or even from parallel loop to parallel loop. A simplified version of the Princeton Ocean Model code, used for climate simulations, took almost three months of work to parallelize [9]. The latest version, as more physical phenomena are taken into account in the simulation, could have up to 5 times the number of loops. Last but not least, default(private) and default(shared) clauses do not help program maintenance, as subtle bugs can creep in when the program is modified or extended. A default(none) clause, followed by a careful declaration of the scope of each variable needed in a parallel section or loop, would be ideal in this respect, but is unviable on complex codes.
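As a generic illustration (a schematic fragment, not taken from the ocean-model code cited above), even a small loop needs several clauses once default(none) is used, which is what makes fully explicit scoping impractical on large codes:

#include <omp.h>

/* Schematic example: with default(none), every variable referenced in
   the parallel region must be given an explicit data-sharing attribute. */
void mat_vec_scale(int n, int m, const double *a, const double *x,
                   double *y, double alpha)
{
    int i, j;
    double s;

#pragma omp parallel for default(none) \
        shared(n, m, a, x, y, alpha) private(j, s)
    for (i = 0; i < n; i++) {
        s = 0.0;
        for (j = 0; j < m; j++)
            s += a[i * m + j] * x[j];
        y[i] = alpha * s;
    }
}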
The proposal of Automatic Variable Scoping [10] would address this point by allowing an auto option to the default() clause that asks the compiler to determine the correct scoping. This proposal has been objected to as a form of automated parallelization, a technology that did not prove viable for real applications. However, automated parallelization proved inefficient because of the difficulty of choosing, at compile time, the loops that can be parallelized most efficiently. Once the parallel loop is designated by the programmer with a directive, determining the correct scoping of variables should not be much more difficult than the usual data dependency analyses performed by optimizing compilers. It is, however, not yet clear whether a new clause is actually needed, or whether such a feature should be added as an optional compiler extension.
The need for nested parallelism was felt from the very beginning of the OpenMP standard. Having one thread build up a new separate team of threads, independent of the others, allows for uneven decomposition among threads of tasks of different computational weight, for the nesting of a parallel do inside a section, for a barrier to affect just part of the threads (OpenMP has no point-to-point or subset synchronizations like, e.g., MPI), etc. Nested parallelism is thus crucial for load balancing, as well as for adapting to dynamically changing computations. It is probably even more relevant to the programming of commercial applications than to HPC programmers. Actually, nested parallelism is defined by the present standard. However, so much freedom has been left to implementors that it is not possible to count on it. Moreover, a few compiler writers misinterpreted the standard (probably misled by an apparent contradiction between two different sections of the document). In short, nested parallelism is standard, but not portable from the programmer's point of view. Part of the nested parallelism functionality can be painstakingly reconstructed by hand using the other OpenMP features, but nothing can be done to restrict the number of threads affected by a barrier.
The only kind of loops supported by OpenMP work-sharing directives are standard counted loops, whose iteration count has to be known before executing them. Pointer-chasing loops are, however, the idiomatic way to walk along a variable-length linked list. They are widely used in mainstream programming, and some HPC applications [6] rely heavily on them. It has been shown [11] that in the construct:

#pragma omp parallel private(p)
for (p = head; p; p = p->next) {
    #pragma omp single nowait
    do_something(p);
}
each loop iteration is taken care of by a single thread. However, threads must agree on which one will take charge of each single nowait execution. This can be done with a minimum of one lock per thread per loop iteration, so if n threads are used, each one has to pass through n locks to receive one unit of work. The overhead thus grows with the number of threads. KSL proposed a new work-sharing directive, taskq, and implemented it in the KAP/Pro ToolSet [2]:

#pragma omp taskq private(p)
for (p = head; p; p = p->next) {
    #pragma omp task
    do_something(p);
}

with the purpose of making it explicit to the compiler and runtime that this is actually a work-sharing operation, thus allowing for less overhead. Preliminary experiments with a synthetic but realistic benchmark seem to confirm that this could be the case, but wide variations are present among the different systems tested.
Up to now, taskq is implemented only in the KAP/Pro ToolSet and Intel compilers, and it has not yet been accepted in the standard. It can probably be emulated in a more efficient way than using single nowait, once again losing simplicity. As many applications make heavy use of linked lists, an elegant solution to this kind of problem would be very significant. Moreover, the balance between simplicity and performance could be critical in those cases, and more serious attempts and experiments on real codes are needed.
Most applications like database and web servers, as well as all personal productivity software, video games, etc., are based on an asynchronous, event-driven model of programming. This model is far less common in scientific computing, but it could be applied to the simulation of complex systems, as well as to data acquisition and data visualization. Support for asynchronous (or loosely synchronized) threads could ease the implementation of software prefetching schemes, from both RAM and disks. Nested parallelism (if correctly and fully supported), taskq constructs, etc., can be used to cope with specific situations, but in most cases they are not enough. Moreover, to write OpenMP applications with loosely synchronized tasks, two more ingredients are necessary. One is point-to-point and one-to-many synchronization [12]. But the most important is reliability and support for the recovery of failed threads. Presently, if an OpenMP thread dies because of an exception, the whole program exits. This behaviour is barely acceptable for personal productivity applications, and completely unacceptable for database servers. It is a crucial extension to open new grounds for OpenMP applications.
4. DEVELOPMENT TOOLS RELATED ISSUES
Presently, most vendor compilers and a few ISV products support OpenMP, but the quality of the implementations varies significantly. Not only can optimization levels be severely affected when using OpenMP directives, but the same directives can exhibit completely different overheads among different compilers/runtimes. Going into deeper detail could be unfair to some vendors, but this is an issue that should be addressed more actively, as it limits performance portability. A few OpenMP constructs (like the parallelization of FORTRAN 90 array syntax) yield disappointing performance, independently of the compiler. One notable example is the reduction clause, often a barely decent performer, and a performance bottleneck when applied to arrays. In a molecular dynamics simulation code [13], which uses Newton's third law to halve the heavy amount of calculations, each thread takes part of the atoms and computes the forces derived from pair and triple interactions. The arrays of partial contributions to the forces must then be summed up to obtain the resulting total force on each atom. As Fig. 1 shows, the reduction clause defined by the standard can easily become the bottleneck of the whole simulation. Significant relief comes from a straightforward implementation of a parallel reduction, and better scaling can be achieved by exploiting efficient BLAS implementations.
Some open source implementations are in the works (see [3] for a list), but they are not completely ready to be used in production. This severely limits a wider adoption of OpenMP, because open source tools are invaluable to spread the use of a technology. The Intel compilers, while proprietary, can fortunately be used at no cost for non-profit software development.
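The kind of hand-coded parallel array reduction mentioned above can be sketched as follows; the names (n, nthreads, fpart, f) are illustrative and not taken from the code of [13]. Each thread is assumed to have already accumulated its contributions into a private partial array; the partial arrays are then summed in parallel over the atom index, so no locks or atomic updates are needed:

#include <omp.h>
#include <stddef.h>

/* Sum nthreads partial force arrays (stored contiguously in fpart,
   each of length n) into the global force array f. */
void sum_partial_forces(int n, int nthreads, const double *fpart, double *f)
{
    int i, t;
    double s;

#pragma omp parallel for private(t, s)
    for (i = 0; i < n; i++) {
        s = 0.0;
        for (t = 0; t < nthreads; t++)
            s += fpart[(size_t)t * n + i];
        f[i] += s;
    }
}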
Figure 1. Different scaling of a molecular dynamics simulation on an IBM Power3 16-way system, using the native OpenMP reduction clause or hand-coded solutions (curves: OMP REDUCTION, OMP DO and DGEMV variants for the guidef77 and xlf compilers, plus the serial runtime, plotted against the number of threads). (Courtesy of S. Meloni)
Beside Intel's KSL, most vendors (with a few notable exceptions among ISVs) seem to completely disregard debugging and correctness assessment. Up to now, no tools have equaled the crucial effectiveness of KSL's Assure verifier (and the forthcoming Intel Thread Checker [4]) at identifying race conditions. These kinds of tools have a surprisingly positive impact on programmers' productivity, and should actually receive more attention, even more than the many performance measurement and profiling tools which are actively developed in proprietary and research laboratories.
5. CONCLUSIONS
OpenMP is a powerful and effective programming paradigm. It has the potential and the appeal to be embraced by a wider programmers' community, outside the HPC field. However, some well-known limitations must be addressed, both at the standard and at the implementation level. Ignoring them would just lead to stagnation and demise. Moreover, the removal of some hindrances in the model could open the way to whole new application fields. As a final remark, OpenMP could benefit from a wider adoption in education. Parallel computing is no longer a subject just for advanced HPC courses; it is discussed in popular programming magazines. OpenMP could be used for a gentler introduction to concurrent programming concepts, in preparation for more flexible (and complex) models. That could as well lead to its use in different ways, with obvious benefits for the OpenMP community.
The author thanks P. Lanucara, S. Meloni and M. Rosati for providing material for this paper. The author is indebted to M. Bernaschi, M. Bull, B. Chapman, B. Kuhn, D. an Mey, S. Salvini and S. Shah for useful discussions and precious insight.
REFERENCES
[1] http://www.openmp.org.
[2] http://developer.intel.com/software/products/kappro/.
[3] http://www.compunity.org.
[4] http://developer.intel.com/software/products/threading/.
[5] F. Massaioli and G. Amati, Achieving high performance in a LBM code using OpenMP, Proc. EWOMP '02, Roma (2002).
[6] M. Bernaschi and F. Castiglione, Computational features of agent-based models, Simulation: Trans. Soc. for Modeling and Simulation International, submitted.
[7] B. Lucini, Advanced OpenMP, Linux Pro No. 35, Future Publishing, Bath (2003).
[8] B. Lucini, Introduzione ad OpenMP, LinuxPRO No. 5, Future Media Italy, Milano (2003).
[9] P. Lanucara and G.M. Sannino, An hybrid OpenMP/MPI parallelization of the Princeton Ocean Model, Proc. ParCo 2001, Napoli (2001).
[10] D. an Mey, A Compromise between Automatic and Manual Parallelization: Auto-Scoping, available at: http://support.rz.rwth-aachen.de/public/AutoScoping.pdf.
[11] T. Mattson and S. Shah, Advanced Topics in OpenMP, Tutorial, SuperComputing '02, Baltimore (2002).
[12] J.M. Bull and C. Ball, Point-to-Point Synchronisation on Shared Memory Architectures, Proc. EWOMP '03, Aachen (2003).
[13] S. Meloni, A. Federico and M. Rosati, Reduction on arrays: comparison of performances among different algorithms, Proc. EWOMP '03, Aachen (2003).
Wavelet-Based Still Image Coding Standards on SMPs using OpenMP*
R. Norcen and A. Uhl
RIST++ & Dept. of Scientific Computing, University of Salzburg, Austria
*This work has been partially supported by the Austrian Science Fund (project FWF-13903).
JPEG2000 and MPEG-4 Visual Texture Coding (VTC) are both wavelet-based and state of the art in still image coding. In this paper we show sequential as well as parallel strategies for speeding up two selected implementations of MPEG-4 VTC and JPEG2000. Furthermore, we discuss the sequential and parallel performance of the improved versions and compare the efficiency of both algorithms.
1. INTRODUCTION
In this work we show how we can improve the runtime performance of the MPEG-4 VTC [3] and JPEG2000 [7] still image coding standards. First, we improve the wavelet decomposition part via a reorganization of the order in which the data is processed. Second, we exploit parallelism within the 2 major coding stages of both algorithms to further speed up the execution: the wavelet-lifting and code-block processing parts in JPEG2000, and the wavelet filtering and zerotree coding in MPEG-4 VTC. The reference software used in our experiments is the MPEG-4 MoMuSys (Mobile Multimedia Systems) Verification Model of Aug. 1999 (ISO/IEC JTC1/SC29/WG11 N2805) and the Jasper JPEG2000 reference implementation (by Michael D. Adams, available at http://www.ece.ubc.ca/~mdadams). We use OpenMP (http://www.openmp.org) to implement our parallel concept for execution on shared-memory multiprocessors. Parallel results are presented for two multiprocessor platforms: an SGI Power Challenge (20 IP25 RISC CPUs, running at 195 MHz) and an SGI Origin3800 (128 MIPS RISC R12000 CPUs, running at 400 MHz).
2. JPEG2000
JPEG2000 [7] is designed to supplement and enhance the existing JPEG standard for still image coding. It provides advanced features such as low bit-rate compression, lossless and lossy coding, resolution and quality scalability, progressive transmission, region-of-interest (ROI) coding, error resilience, and spatial random access in a unified framework. The JPEG2000 image coding standard is based upon the wavelet transform, and operates on independent, non-overlapping blocks of wavelet coefficients (codeblocks) whose bit-planes are arithmetically coded in several passes to create an embedded, scalable bitstream. The major coding stages of the JPEG2000 encoder are: preprocessing and the inter-component transform, which is only needed for dealing with sub-sampled, multi-component
(color) images; the intra-component transform, which is the wavelet decomposition part; tier-1 coding, which comprises three coding passes, context selection and arithmetic coding; and tier-2 coding, where R/D (rate-distortion) allocation and the bitstream formation are performed.
3. MPEG-4 VTC: SCALABLE TEXTURE CODING USING WAVELETS
MPEG-4 Visual Texture Coding (VTC) is used within the MPEG-4 system to encode visual textures and still images. These data are then used by the MPEG-4 system as textures in photo-realistic 3D models and animated meshes, but also simply as still images. The MPEG-4 VTC codec is based on the discrete wavelet transform, and uses sophisticated zerotrees for the encoding of the coefficients. The major coding stages of the MPEG-4 VTC encoder are: preprocessing, which includes image IO and the setup of important data structures; then the discrete wavelet transform part, which also includes arbitrary shape object handling; and finally scalar quantization as well as zerotree coding (ZTE, originally introduced in [6]) and adaptive arithmetic coding of the wavelet coefficients. MPEG-4 VTC offers three different zerotree-based quantization modes: Single Quantization (SQ), Multi Quantization (MQ), and Bi-level Quantization (BQ). A very simplified view of these quantization modes is that the wavelet coefficients are split into groups of bits. In Bi-level Quantization (BQ), the bits of the coefficients are successively transmitted from the most significant bit to the least significant bit, as done in classical zerotree coding [6]. In Single Quantization (SQ), all bits of each wavelet coefficient are clustered and quantized before entropy coding. Multi Quantization (MQ) provides an in-between solution, where chunks of bits are quantized and fed to the entropy coder.
4. CACHE ISSUES
A standard parameter setting is specified and applied to all runtime tests which are presented in the subsequent sections. Within this standard parameter setting, the MoMuSys MPEG-4 VTC and Jasper JPEG2000 implementations perform a 5-level DWT decomposition. Additionally, both codecs use 'lossy' mode to compute the DWT, the compression factor is fixed at a medium value of about 25, and we concentrate on the encoding of very large image data (4096 x 4096 pixels). If not denoted otherwise, MPEG-4 VTC results are delivered using the single quantization mode. The DWT is usually the most demanding coding stage of both algorithms, followed by the encoding of the wavelet coefficients. With MPEG-4 VTC SQ and BQ zerotree coding, the DWT consumes up to 83% of the overall coding time (see Table 2). Only MQ coding reduces the DWT part substantially, to about 26.5%. Table 2 also lists the runtime contributions of the 2 major JPEG2000 coding stages. Here, the wavelet transform part (up to 70%) is the most demanding part of the algorithm. It is quite clear that if we want to improve the overall execution time of both algorithms, we have to target the DWT part and also the coding of the wavelet coefficients. In Figure 1.a, the runtime of the first decomposition level of the MoMuSys MPEG-4 VTC DWT is subdivided into the vertical and horizontal filtering. We see a significant difference between the vertical and horizontal filtering performance, especially in the case of increasing image dimensions. The vertical filtering of large images is up to 5-8 times slower compared to the horizontal filtering.
Figure 1. Origin3800, DWT cache issues: runtime of the 1st vertical decomposition level. (Panels: a) MPEG-4 VTC, b) JPEG2000; curves: 1st horizontal decomposition, 1st vertical decomposition, and, for JPEG2000, 1st vertical decomposition aggregated (64), plotted against image size in x^2 pixels.)
A very similar runtime gap can be observed in the JPEG2000 reference implementation Jasper (see Figure 1.b), despite the more efficient lifting scheme which is employed to perform the wavelet transform. This unexpected runtime behavior of the vertical filtering routine is due to severe cache-miss problems. In Meerwald et al. [5] these cache properties are discussed in detail in the context of JPEG2000, and aggregation is used as a solution to the described problem. Aggregation tries to reorder the memory accesses during the vertical filtering such that the cache can be used more efficiently. In order to achieve this, the filtering of vertical columns is done in a 'parallel' fashion: not sequentially column by column, but with each filtering step performed concurrently for a number of neighboring columns. Aggregation is closely related to loop tiling [8], a technique which is widely used in compiler transformations to increase data locality. We use aggregation to enhance the filtering stage of the MoMuSys MPEG-4 VTC and Jasper JPEG2000 coders, where aggregation j means that j columns are filtered in parallel. Since MPEG-4 VTC includes arbitrary shape handling within the filtering routine, we extend the aggregation technique to include shape handling as well. Aggregation improves the efficiency of the vertical filtering stage significantly (see the aggregated curve in Figure 1.b). Another nice property of aggregation is that the spikes where the vertical filtering shows an especially poor runtime are largely removed, producing a more reliable runtime behavior. On the Power Challenge, employing aggregation reduces the overall MPEG-4 VTC runtime for the Lena image with 4096 x 4096 pixels by a factor of 1.65 and the Jasper coder by a factor of 2.22 (see Table 1, second column). But the encoding of small images (256 x 256 pixels) is also improved with aggregation (MPEG-4 VTC: 1.22x faster, JPEG2000: 1.11x faster). The Jasper coder behaves more efficiently in the aggregated case when compared to the MPEG-4 VTC, especially for bigger images. This effect is mainly due to the more complex structure of the filtering code in the VTC coder, since the VTC has to do adaptive shape administration during filtering. The original MPEG-4 VTC DWT code (BQ, Lena with 4096 x 4096) consumes 84.69% of the overall coding time on the Power Challenge. Aggregation with factor 64 reduces the DWT percentage to about 69.39%. The Jasper coder behaves similarly: on the Power Challenge (Lena with 4096 x 4096), the wavelet transform covers 66.36%, and aggregation with factor 64 reduces the transform stage to 33.63%. Column 2 of Table 2 lists the obtained DWT runtime contributions.
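A minimal sketch of the aggregation idea is given below; the array layout, the trivial filter step and all names are illustrative assumptions rather than code from MoMuSys or Jasper. Instead of filtering one column at a time (which strides through the whole image for every sample), each filtering step is applied to a block of AGGR neighbouring columns, so consecutive memory accesses stay within a few cache lines:

#define AGGR 64   /* number of columns processed "in parallel" */

void vertical_filter_aggregated(float *img, int width, int height)
{
    int c0, c, r;
    for (c0 = 0; c0 < width; c0 += AGGR) {
        int cend = (c0 + AGGR < width) ? c0 + AGGR : width;
        /* One (illustrative) filter step per row, applied to a whole
           block of neighbouring columns before moving down the image. */
        for (r = 1; r < height - 1; r++)
            for (c = c0; c < cend; c++)
                img[r * width + c] -=
                    0.5f * (img[(r - 1) * width + c] +
                            img[(r + 1) * width + c]);
    }
}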
5. PARALLELIZATION USING OPENMP
Traditional parallelization approaches for JPEG such as [1, 2] include tiling the image and distributing the tiles among separate CPUs. As JPEG performs the DCT on 8 x 8 pixel image blocks, this straightforward tile-based parallelization approach does not impair image quality, because tiles are generally much larger than the transform blocks. JPEG2000 and the VTC employ the wavelet transform for image decorrelation, and the wavelet decomposition is usually computed on the entire image, which avoids the annoying compression block artifacts that occur at low bit-rate coding. However, in spite of the quality impact, both algorithms also support the concept of image tiling for operation in low-memory environments. In this case, the wavelet transform is performed on each image tile independently. The parallel processing of independent image tiles, however, leads to a significant rate-distortion loss and severe blocking artifacts as the number of tiles and processors is increased, a disadvantage which is not acceptable in many applications. For this reason we do not follow this simple tile-based parallelization idea but propose to distribute the global wavelet transform, as well as the code-block processing of JPEG2000 and the ZTE coding of MPEG-4 VTC, among several CPUs.
In a first step, we distribute the load of the horizontal and vertical filtering, the most demanding coding stage of both algorithms, over the number of available processing units. This parallelization is done with the necessary OpenMP pragmas. Note that a multiprocessor algorithm does not have to take care of the requirement for overlapping data partitions or data exchange, due to the available shared data space. In a second step, the second most demanding coding stages of Jasper (code-block processing) and MPEG-4 VTC (zerotree coding) are parallelized. The independent code-block processing part of the Jasper coder is simple to parallelize, since there are no data dependencies across different blocks. Each processor can take a number of code-blocks and perform the block processing independently, without any synchronization between CPUs. After each processor has finished the computation of the blocks that were assigned to it, a barrier synchronization is used before the algorithm continues sequentially. The MPEG-4 VTC zerotree coding stage is more difficult to parallelize. Due to the special nature of the progressive bi-level bitplane coding (BQ-ZTE), there is a huge number of data dependencies preventing a useful parallelization. The major problem is the use of binary arithmetic coding within the BQ zerotree symbol processing. Without arithmetic coding, a parallel BQ-ZTE version using separators is possible (see Kutil [4]). The structure of MPEG-4 VTC zerotree coding in single and in multi quantization mode, on the contrary, allows a parallelization. These entropy coding methods are organized in two major stages. In the first stage, each wavelet coefficient is quantized and a zerotree symbol is computed for this coefficient. With minor changes, this part can be performed in parallel on the available CPUs. In the second major stage of SQ and MQ coding, the computed zerotree symbols and quantized coefficient values are binary arithmetic coded and written to the bitstream. This part is intrinsically sequential, which is due to the data dependencies within
the arithmetic coder, but it has only about 20% to 35% runtime contribution to the overall SQ zerotree coding time.
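The code-block parallelization of the Jasper coder described above can be sketched as follows; the data structure and the per-block encoder are hypothetical stand-ins, not the actual Jasper interfaces:

#include <omp.h>

/* Hypothetical per-block data; the real Jasper structures differ. */
typedef struct {
    float  *coeffs;    /* quantized wavelet coefficients of the block */
    int     ncoeffs;
    double  rate;      /* e.g. bits produced by tier-1 coding         */
} codeblock_t;

/* Stand-in for the tier-1 (context modelling + arithmetic) coder. */
static void encode_codeblock(codeblock_t *cb)
{
    int k;
    cb->rate = 0.0;
    for (k = 0; k < cb->ncoeffs; k++)
        cb->rate += (cb->coeffs[k] != 0.0f);   /* placeholder work */
}

void encode_all_codeblocks(codeblock_t *blocks, int nblocks)
{
    int i;
    /* Code-blocks are independent, so no synchronization is needed
       inside the loop; the implicit barrier at the end of the loop
       corresponds to the synchronization before the sequential
       tier-2 stage. */
#pragma omp parallel for schedule(dynamic)
    for (i = 0; i < nblocks; i++)
        encode_codeblock(&blocks[i]);
}

Dynamic scheduling is an assumption here; it simply balances blocks of unequal cost across the threads.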
Figure 2. Speedup of the 1st DWT decomposition level for the Lena image with 4096 x 4096 pixels on the SGI Power Challenge. (Panels: a) horizontal versus vertical decomposition, b) vertical versus aggregated vertical decomposition, plotted against the number of CPUs.)
Figure 2.a shows the speedup of the first DWT decomposition level for the 4096 x 4096 Lena image on the Power Challenge. The horizontal decomposition step shows a very good speedup; the original vertical decomposition step, however, shows a bad speedup. Aggregation not only lifts the sequential performance significantly, it also improves the parallel efficiency substantially. This effect is due to the poor cache exploitation and bus congestion of the original filtering code. The speedup values in Figure 2.a are related to the runtimes of the 1-processor case of the 3 different corresponding algorithms. Figure 2.b, on the contrary, relates the Power Challenge runtimes to the runtimes of the original vertical filtering routine (= aggregation 1). Here, the aggregated version shows a superlinear speedup (more than 20 for 10 CPUs), and we see a sequential speedup (1 CPU) of more than 4. With the improved (= aggregated) filtering and the parallelization of the wavelet transform as well as the SQ ZTE coding stage, the MPEG-4 VTC SQ execution time is reduced by a factor of about 4.43 (10 CPUs) on the Power Challenge (see Table 1), whereas without improved filtering only 2.75 is achieved with the same number of processors. Table 1 lists the results for the JPEG2000 system and the different MPEG-4 VTC quantization modes. We see that the improved Jasper performs more efficiently than the improved MPEG-4 VTC. There are two reasons for this behavior: first, the more efficient DWT scheme of Jasper (no expensive arbitrary shape object handling, and the DWT is implemented via lifting); second, the more efficient wavelet coefficient coding of JPEG2000. Figure 3 plots the overall achieved speedups of the MPEG-4 VTC and the JPEG2000 coder with respect to the relative runtime of each algorithm and with respect to the absolute runtime of the original code. The 10-CPU results from these runs are the ones listed in Table 1. Since the overall percentage of parallel code for the VTC and JPEG2000 coders is similar (see Table 2), the achieved speedups are quite comparable. The curves showing superlinear speedups are the ones related to the corresponding sequential runtimes without aggregation.
Table 1
Absolute speedup: Lena 4096 x 4096, different aggregation, with respect to the sequential code without aggregation (P.C. = Power Challenge).
absolute speedup | 1CPU, aggr.64 | 10CPU, aggr.64 | 10CPU, aggr.1
VTC, P.C., SQ    | 1.65          | 4.43           | 2.75
VTC, P.C., MQ    | 1.38          | 3.10           | 2.38
VTC, P.C., BQ    | 1.96          | 3.92           | 2.39
JPEG2000, P.C.   | 2.22          | 4.93           | 3.34
Figure 3. JPEG2000 versus MPEG-4 VTC (Single Quantization): overall speedup for the Lena image with 4096 x 4096 pixels. (Panels: a) Power Challenge, b) Origin3800; curves show the speedup with respect to the runtimes with aggregation 1 and with aggregation 64, plotted against the number of CPUs.)
We want to discuss our parallel results with respect to Amdahl's law for the theoretical speedup, speedup = (s + p)/(s + p/N) with s + p = 1, where s and p denote the sequential and parallel fractions of the code and N the number of processors. When analyzing the MPEG-4 VTC (SQ) runtimes without aggregation obtained on the Power Challenge, where 86.2% of the code is parallelized, we get an expected theoretical speedup of 4.46 for 10 processing units (see Table 2). Actually, our code achieves a speedup of 2.75. With aggregated filtering, the percentage of parallel code decreases to about 77.1%, and with it the possible theoretical speedup, to 3.26. Nevertheless, the theoretically possible speedup is matched more closely (2.67) than in the non-aggregated case. The JPEG2000 reference implementation behaves similarly in parallel (see Table 2). By looking at the results presented in Table 2, we can conclude that the parallel efficiency of the cache-optimized code is significantly better than that of the original code.
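As a check, plugging the non-aggregated VTC (SQ) figures from Table 2 into this formula, with s = 0.138, p = 0.862 and N = 10,

\mathrm{speedup} = \frac{1}{s + p/N} = \frac{1}{0.138 + 0.862/10} \approx 4.46,

which reproduces the theoretical value quoted above.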
6. CONCLUSION
The runtime performance of the MoMuSys MPEG-4 VTC and the JPEG2000 reference implementation Jasper is improved significantly by implementing aggregated vertical filtering. A parallel version can further speed up the execution of both algorithms to some extent, and the aggregated version's parallel efficiency is better. However, the scalability is very limited due to the relatively large amount of inherently sequential execution parts. Therefore, massively parallel systems are not a suitable (or sensible) architecture for these new image coding standards.
Table 2
lena4096, VTC and JPEG2000: runtime contributions. The two rightmost columns give the theoretical versus practical speedup for a 10-processor environment (P.C. = Power Challenge, Or. = Origin3800, t.s. = theoretical speedup, p.s. = practical speedup).
                        | % DWT | % ZTE | % Par.tot. | 10CPU: t.s. | 10CPU: p.s.
VTC, P.C., SQ, aggr.1   | 67.3% | 18.9% | 86.2%      | 4.46        | 2.75
VTC, P.C., MQ, aggr.64  | 47%   | 28.5% | 75.5%      | 3.12        | 2.38
VTC, P.C., BQ, aggr.1   | 83.2% | 0%    | 83.2%      | 3.98        | 1.98
VTC, P.C., SQ, aggr.64  | 45.5% | 31.5% | 77.1%      | 3.26        | 2.67
JPEG2000, P.C., aggr.1  | 69.9% | 14.7% | 84.6%      | 4.19        | 3.34
JPEG2000, P.C., aggr.64 | 33.6% | 32.9% | 66.5%      | 2.49        | 2.22
JPEG2000, Or., aggr.1   | 59%   | 20.8% | 79.8%      | 3.55        | 2.72
JPEG2000, Or., aggr.64  | 29.1% | 36%   | 65.1%      | 2.41        | 2.23
REFERENCES
[1] G.W. Cook and E.J. Delp. An investigation of scalable SIMD I/O techniques with application to parallel JPEG compression. Journal of Parallel and Distributed Computing, 30:111-128, 1996.
[2] J.J. Falkemeier and G. Joubert. Parallel image compression with JPEG for multimedia applications. In J.J. Dongarra et al., editors, High Performance Computing: Technologies, Methods & Applications, number 10 in Advances in Parallel Computing, pages 379-394. North Holland, 1995.
[3] ISO/IEC 14496-2. Information technology - coding of audio-visual objects - part 2: Visual, December 1999.
[4] R. Kutil. A significance map based adaptive wavelet zerotree codec (SMAWZ). In S. Panchanathan, V. Bove, and S.I. Sudharsanan, editors, Media Processors 2002, volume 4674 of SPIE Proceedings, January 2002.
[5] P. Meerwald, R. Norcen, and A. Uhl. Cache issues with JPEG2000 wavelet lifting. In C.-C. Jay Kuo, editor, VCIP'02, volume 4671 of SPIE Proceedings, San Jose, CA, USA, January 2002. SPIE.
[6] Jerome M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans. on Signal Process., 41(12):3445-3462, December 1993.
[7] D. Taubman and M.W. Marcellin. JPEG2000 - Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, 2002.
[8] M.E. Wolf and M.S. Lam. A data locality optimization algorithm. In E. Krause and W. Jäger, editors, Proc. of ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 30-44, June 1991.
Minisymposium
Parallel Applications
Parallel Solution of the Bidomain Equations with High Resolutions
X. Cai, G.T. Lines, and A. Tveito
Simula Research Laboratory, P.O. Box 134, N-1325 Lysaker, Norway
Department of Informatics, University of Oslo, P.O. Box 1080, Blindern, N-0316 Oslo, Norway
This paper is concerned with the parallel solution of the Bidomain equations and an associated forward problem, which can be used to simulate the electrical activity in the heart and torso. For achieving high-resolution simulations on parallel computers, a scalable and parallelizable numerical strategy must be used. We therefore present an advanced parallel preconditioner for a 2 x 2 block linear system that arises from finite element discretizations. The scalability of the preconditioner is studied by numerical experiments that involve up to 81 million unknowns in the block linear system.
1. INTRODUCTION
Normal electrical activity in the heart is very important for the heart's pumping function. In order to study the relationship between disorders in the heart and the measurable ECG signals on the body surface, researchers have developed advanced mathematical models involving ordinary differential equations (ODEs) and partial differential equations (PDEs). Numerical strategies have thus been devised to enable computer simulations, see e.g. [1, 2, 3, 7]. Due to factors such as the complexity of the equations and the irregular shapes of the domains, high resolutions in both time and space are needed to achieve sufficiently accurate computational results. One of the desired goals is to run full-scale 3D simulations with 0.1 ms temporal resolution and 0.2 mm spatial resolution in the heart domain. This gives rise to computations on a 3D grid with about 5 x 10^7 grid points, involving several thousand time steps. Clearly, efficient numerical strategies are required for doing such large-scale simulations. An important property of a desirable numerical strategy is good scalability, meaning that the computational effort per degree of freedom should remain roughly constant, independent of the growth of the problem size. Scalability is also essential for a parallel solution of the mathematical model, i.e., the overall computational demand should also remain roughly constant independent of the number of processors used. Following the work of [8] on designing an order-optimal sequential block preconditioner for Krylov iterative methods, we present in this paper its parallel counterpart, which preserves the same efficiency. The scalability of this parallel preconditioner is studied through numerical experiments, which indicate that the desired spatial resolution of 0.2 mm is achievable even on a moderately large parallel computing system.
The remainder of the paper is organized as follows. First, Section 2 gives a brief description of the mathematical model involving the Bidomain equations and an elliptic PDE in the torso. Then, Section 3 explains a numerical strategy for the mathematical model, which needs to solve a 2 x 2 block linear system per time step. Thereafter, Section 4 presents the design of a parallel block preconditioner for achieving scalable performance when solving such a 2 x 2 block linear system, whereas Section 5 is devoted to issues about parallelization. Finally, Section 6 shows the measurements of some numerical experiments before Section 7 gives a few concluding remarks.
2. THE MATHEMATICAL MODEL
A well-established mathematical model for describing the electrical activity in the heart is the Bidomain equations, see [9]. In this paper, the domain of the heart is denoted by H, and we consider the following form of the Bidomain equations involving the pair of primary unknowns: the transmembrane potential v and the extracellular potential u_e:

\chi C_m \frac{\partial v}{\partial t} + \chi I_{ion}(v, s) = \nabla \cdot (M_i \nabla v) + \nabla \cdot (M_i \nabla u_e) \quad \text{in } H, \qquad (1)

0 = \nabla \cdot (M_i \nabla v) + \nabla \cdot \left( (M_i + M_e) \nabla u_e \right) \quad \text{in } H. \qquad (2)

In (1), the nonlinear function I_{ion}(v, s) represents the ionic current, where s denotes a vector of state variables describing the state of the cell membrane. There exist many different models of I_{ion}; almost all of them rely on solving an ODE system to find the s vector, see e.g. [4, 10]. That is, an ODE system of the form ds/dt = F(v, s) models the electrical behavior of the cardiac cells. Moreover, \chi is a surface-to-volume scaling factor, C_m denotes the capacitance of the membrane, and M_i denotes the intracellular conductivity tensor. Similarly, M_e denotes the extracellular conductivity tensor in (2).
Figure 1. A schematic 2D slice of the entire solution domain \Omega = H \cup T.
In order to compute the electrical potential on the body surface, we supplement the Bidomain equations (1)-(2) with a forward problem, which is the following elliptic PDE describing the propagation of the electrical signal in the torso T exterior to the heart:

-\nabla \cdot (M_o \nabla u_o) = 0 \quad \text{in } T, \qquad (3)
where u_o is the electrical potential in the torso and M_o denotes the associated conductivity tensor. Therefore, our mathematical model consists of three PDEs (1)-(3), for which the entire solution domain \Omega = H \cup T is depicted in Figure 1. As for boundary conditions, we have

M_i \frac{\partial}{\partial n}(v + u_e) = 0, \quad u_e = u_o, \quad M_e \frac{\partial u_e}{\partial n} - M_o \frac{\partial u_o}{\partial n} = 0 \quad \text{on } \partial H,

M_o \frac{\partial u_o}{\partial n} = 0 \quad \text{on } \partial T.
In the temporal direction, the mathematical model is to be solved for a time period with known initial values for v, s, u_e, and u_o.
3. NUMERICAL STRATEGY
The basic idea behind discretizing (1) is to split this nonlinear PDE into two parts:

\frac{\partial v}{\partial t} = -\frac{1}{C_m} I_{ion}(v, s) \quad \text{and} \quad \chi C_m \frac{\partial v}{\partial t} = \nabla \cdot (M_i \nabla v) + \nabla \cdot (M_i \nabla u_e), \qquad (4)
where the first part is to be solved together with an ODE system modeling the state variables s. In the temporal direction, the simulation time period for the entire mathematical model is divided into discrete time levels 0 = t_0 < t_1 < t_2 < \cdots. For l \geq 1, the solutions from the previous time level, v^{l-1}, u_e^{l-1}, u_o^{l-1}, s^{l-1}, are used as the starting values. Using a \theta-rule, where 0 \leq \theta \leq 1, we can construct a flexible numerical strategy whose work per time step consists of solving an ODE system twice, separated by the solution of a system of PDEs. We refer to [8] for a detailed explanation of the strategy, and remark that \theta = 1/2 gives rise to a second-order accurate temporal discretization. In summary, the computational work for the time step t_{l-1} \to t_l consists of the following sub-steps:
• First, a chosen ODE system

\frac{dv}{dt} = -\frac{1}{C_m} I_{ion}(v, s), \qquad \frac{ds}{dt} = F(v, s) \qquad (5)
is solved for t \in (t_{l-1}, t_{l-1} + \theta(t_l - t_{l-1})], using the initial values v^{l-1} and s^{l-1}. The results are an intermediate transmembrane potential solution \tilde{v}^{l-1} and an updated \tilde{s}^{l-1} vector.
• Second, the three PDEs (1)-(3) are solved simultaneously, where (1) only uses its second part in the splitting given in (4). The temporal discretization is of the following form:

\chi C_m \frac{\tilde{v}^l - \tilde{v}^{l-1}}{\Delta t} = (1-\theta)\left(\nabla \cdot (M_i \nabla \tilde{v}^{l-1}) + \nabla \cdot (M_i \nabla u_e^{l-1})\right) + \theta\left(\nabla \cdot (M_i \nabla \tilde{v}^{l}) + \nabla \cdot (M_i \nabla u_e^{l})\right), \qquad (6)

0 = (1-\theta)\left(\nabla \cdot (M_i \nabla \tilde{v}^{l-1}) + \nabla \cdot ((M_i + M_e) \nabla u_e^{l-1})\right) + \theta\left(\nabla \cdot (M_i \nabla \tilde{v}^{l}) + \nabla \cdot ((M_i + M_e) \nabla u_e^{l})\right), \qquad (7)

0 = (1-\theta)\, \nabla \cdot (M_o \nabla u_o^{l-1}) + \theta\, \nabla \cdot (M_o \nabla u_o^{l}). \qquad (8)
We remark that \tilde{v}^l, u_e^l, u_o^l are the unknowns to be found. The spatial discretization, using, e.g., finite elements, will give rise to a 2 x 2 block linear system:

\begin{bmatrix} \chi C_m I + \theta \Delta t A_v & \theta \Delta t \bar{A}_v \\ \theta \Delta t \bar{A}_v^T & \theta \Delta t A_u \end{bmatrix} \begin{bmatrix} \tilde{v}^l \\ u^l \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix}. \qquad (9)
In the above block linear system, I and A_v represent the mass matrix and the stiffness matrix associated with M_i inside H, respectively. Moreover, we have combined the unknown values of u_e^l and u_o^l into one vector u^l. The sparse matrix A_u arises from discretizing (2) and (3) together. That is, we solve a combined elliptic problem -\nabla \cdot (M \nabla u) = 0 in \Omega, using M = M_i + M_e in H and M = M_o in T, see [8] for more details. The \bar{A}_v matrix is the same as A_v, except that \bar{A}_v is padded with some additional zero columns to allow the multiplication \bar{A}_v u, whereas \bar{A}_v^T is the transpose of \bar{A}_v. By extending the proof that is given in [5] to also include the torso domain, we can deduce that the 2 x 2 block matrix in (9) is symmetric and positive semi-definite. We remark that this property arises from the symmetry and positive definiteness of the conductivity tensors M_i, M_e and M_o. Experiments have shown that the conjugate gradient (CG) method is an appropriate choice for (9). A suitable preconditioner is thus important for achieving rapid convergence. We present a scalable parallel preconditioner in Section 4.
• Third, the chosen ODE system (5) is solved again for t \in (t_{l-1} + \theta(t_l - t_{l-1}), t_l], using the initial values \tilde{v}^l and \tilde{s}^{l-1}. The computational results are stored in v^l and s^l.
Since the above numerical strategy splits the solution of (1) into different parts, we denote it the "semi-implicit" approach.
4. A PARALLEL PRECONDITIONER
We know from [8] that the following diagonal block system

\begin{bmatrix} \chi C_m I + \theta \Delta t A_v & 0 \\ 0 & \theta \Delta t A_u \end{bmatrix} \qquad (10)
can work as an efficient preconditioner for (9). Here, we remark that the notation for the matrices I, A_v and A_u is the same as in (9). The rapid convergence of this preconditioner is due to the fact that (10) is spectrally equivalent to the 2 x 2 block matrix in (9), see [8]. During each preconditioning operation, the inverse of the diagonal block matrix (10) can be found by inverting \chi C_m I + \theta \Delta t A_v and \theta \Delta t A_u separately. It can be shown that the number of needed CG iterations, using the above preconditioner, is constant independent of the number of unknowns in (9).
To make the above preconditioner parallel while maintaining its scalability, we suggest a layered design. First, one or several additive Schwarz iterations (see [6]) work as a parallel solver for both \chi C_m I + \theta \Delta t A_v and \theta \Delta t A_u. Let us demonstrate this for the latter system. One additive Schwarz iteration for A_u can be expressed as

A_u^{-1} \approx \sum_{i=0}^{P} A_{u,i}^{-1}, \qquad (11)
where it is assumed that the body domain \Omega is partitioned into P overlapping subdomains \Omega_i, 1 \leq i \leq P. Thus, A_{u,i} denotes a subdomain matrix that arises from a discretization restricted to \Omega_i. Note that A_{u,0} is associated with a discretization on a very coarse global grid. The use of A_{u,0}^{-1}, also called coarse grid correction, is to ensure convergence independent of the number of subdomains P, see e.g. [6]. Second, as an approximate subdomain solver, we use multigrid V-cycles. This is because the complexity of such cycles is linear with respect to the number of subdomain unknowns. The construction of the required hierarchy of subdomain grids is described in Section 5. In summary, the combination of additive Schwarz iterations on the "global layer" and multigrid V-cycles on the "subdomain layer" constitutes our parallel preconditioner. Scalability with respect to the number of unknowns arises from the spectral equivalence between (9) and (10), together with using multigrid as subdomain solvers. Convergence independent of the number of subdomains is due to Schwarz iterations with coarse grid correction.
5. PARALLEL COMPUTING
Let P denote the number of processors; we adopt the approach of explicit domain partitioning that divides the entire computational work among the processors. Recall that the body domain \Omega consists of the heart H and the torso T, so we partition both H and T into P pieces, see Figure 2. Processor i is responsible for the composite subdomain \Omega_i = H_i \cup T_i. In addition, we also introduce a certain amount of overlap between the subdomains to enable the additive Schwarz iterations that are used in the parallel preconditioner.
Figure 2. A schematic view of domain partitioning used for the parallel computation.
During parallel computation, the work on processor i consists of local operations that are restricted to H_i and T_i. That is, local finite element discretizations are carried out independently on the processors. The ODE system is also solved independently by the processors on their local heart subdomain points. No communication between processors is needed for these two tasks. However, during the parallel iterations for solving the global 2 x 2 block linear system, subdomain local operations need to be interleaved with inter-processor communication. Recall also that we want to run multigrid V-cycles as subdomain solvers in the additive Schwarz iterations for approximating (\chi C_m I + \theta \Delta t A_v)^{-1} and A_u^{-1}. This requires two associated hierarchies of subdomain grids. One approach to achieving this is that we start with a global H grid and a global T grid, both of medium resolution. Then, we partition the two
global grids into P overlapping parts, respectively. Afterwards, the subdomain H and T grids are refined several times on each subdomain, giving rise to a hierarchy of subdomain H grids and a hierarchy of subdomain T grids. Note that a hierarchy of subdomain \Omega grids is just the union of the associated hierarchies of subdomain H and T grids.
6. NUMERICAL EXPERIMENTS
In Table 1, we show the number of CG iterations needed for achieving convergence when solving the 2 x 2 block linear system (9) with \theta = 1/2. We have used the parallel preconditioner that is described in Section 4. The number of local grids in the subdomain grid hierarchy is listed in the first column, whereas the summed number of v and u unknowns is shown in the second column. We can observe convergence scalability by noting that the number of CG iterations remains almost constant, independent of the number of unknowns or P. We also remark that the largest system size in Table 1 corresponds to over 39 million grid points in H, quite close to a spatial resolution of 0.2 mm.
Table 1
The number of CG iterations needed to solve the block linear system (9), using the parallel block preconditioner. Convergence is claimed when the L2-norm of the global residual is reduced by a factor of 10^4.
grid levels | # unknowns (v + u) | P=2 | P=4 | P=8 | P=16 | P=32 | P=64
2           | 302,166            | 10  | 12  | 11  | 9    | 9    | 10
3           | 1,552,283          | 10  | 11  | 11  | 9    | 9    | 10
4           | 10,705,353         | -   | -   | -   | 13   | 13   | 14
5           | 81,151,611         | -   | -   | -   | 14   | 15   | 15
Table 2 is meant to reveal the parallel efficiency of the preconditioned parallel CG iterations. The wall-clock time measurements are obtained on an SGI Origin 3800 system. As a comparison, we also list the corresponding wall-clock time measurements of a so-called "decoupled" strategy, which first solves (1), and then a combined system for (2)-(3). In terms of solving linear systems, the "decoupled" strategy first solves a linear system (\chi C_m I + \theta \Delta t A_v) v = b_v and then a linear system A_u u = b_u, each system only once per time step. Note that the wall-clock time measurements of the "decoupled" strategy in Table 2 are the sum of solving these two linear systems. The measurements in Table 2 show that it is possible for the advanced "semi-implicit" strategy to achieve the same level of computational efficiency as the simple but, accuracy-wise, much inferior "decoupled" strategy. We remark that the scalability of the "semi-implicit" strategy is due to the parallel block preconditioner. We also remark that the measurements in Table 2 are obtained on a heavily loaded SGI Origin system with many users competing for the same resource, so care should be taken when interpreting the speedup results.
Table 2
Wall-clock time measurements (in seconds) of two different strategies for solving the linear system(s) for one time step.
grid levels | # unknowns (v + u) | "decoupled" strategy: P=16, P=32, P=64 | "semi-implicit" strategy: P=16, P=32, P=64
3           | 1,552,283          | 92.66, 41.46, 26.60                    | 32.66, 46.08, 44.11
4           | 10,705,353         | 248.65, 144.65, 50.77                  | 399.64, 224.15, 148.28
5           | 81,151,611         | 4406.61, 2538.32, 1412.63              | 4094.97, 2915.88, 1902.91
7. CONCLUDING REMARKS
We have presented a layered design of a parallel preconditioner for the 2 x 2 block system (9). This parallel preconditioner enables scalable performance of the advanced "semi-implicit" numerical strategy for the mathematical model (1)-(3). Even for a 2 x 2 block system involving more than 81 million unknowns, the solution is obtained within a reasonable amount of time on a moderately large parallel system.
ACKNOWLEDGEMENT
We acknowledge the support from the Research Council of Norway through a grant of computing time (Programme for Supercomputing).
REFERENCES
[1] D. B. Geselowitz and W. T. Miller. A bidomain model for anisotropic cardiac muscle. Annals of Biomedical Engineering, 11:191-206, 1983.
[2] R. M. Gulrajani. Models of electrical activity of the heart and computer simulation of the electrocardiogram. Critical Reviews in Biomedical Engineering, 16(1):1-66, 1998.
[3] G. Huiskamp. Simulation of depolarization in a membrane-equation-based model of the anisotropic ventricle. Transactions on Biomedical Engineering, 45:847-855, 1998.
[4] C. H. Luo and Y. Rudy. A dynamic model of the cardiac ventricular action potential. Circulation Research, 74:1071-1096, 1994.
[5] M. Pennacchio and V. Simoncini. Efficient algebraic solution of reaction-diffusion systems for the cardiac excitation process. Journal of Computational and Applied Mathematics, 145:49-70, 2002.
[6] B. Smith, P. E. Bjorstad, and W. D. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996.
[7] J. Sundnes, G. T. Lines, P. Grottum, and A. Tveito. Electrical activity in the human heart. In H. P. Langtangen and A. Tveito, editors, Advanced Topics in Computational Partial Differential Equations - Numerical Methods and Diffpack Programming. Springer, 2003.
[8] J. Sundnes, G. T. Lines, K.-A. Mardal, and A. Tveito. Multigrid block preconditioning for a coupled system of partial differential equations modeling the electrical activity of the heart. Computer Methods in Biomechanics and Biomedical Engineering, 5:397-411, 2002.
[9] L. Tung. A Bi-domain model for describing ischemic myocardial D-C potentials. PhD thesis, MIT, Cambridge, MA, 1978.
[10] R. L. Winslow, J. Rice, S. Jafri, E. Marban, and B. O'Rourke. Mechanisms of altered excitation-contraction coupling in canine tachycardia-induced heart failure, II, model studies. Circulation Research, 84:571-586, 1999.
Balancing Domain Decomposition Applied to Structural Analysis Problems*
P.E. Bjorstad and J. Koster
Parallab, Bergen Center for Computational Science, Hoyteknologisenteret i Bergen, Thormohlensgate 55, University of Bergen, N-5008 Bergen, Norway
*This work has been supported in part by the Norwegian Research Council through the NOTUR Transfer of Technology Projects (NFR project no. 132756/431).
This paper summarizes the present status of a library of iterative substructuring solvers for the parallel solution of large sparse linear systems of equations that arise from large-scale industrial finite-element applications. The iterative solution schemes have been applied to a series of model problems but also to real-life test cases that include structural analysis problems from the automobile industry (incl. crankshafts) and ship structure analysis (incl. ship sections made of plates, shells and beams). The library provides a single framework that includes Schur complement techniques and Schwarz procedures. The preconditioners include well-known incomplete factorization variants as well as more advanced techniques (such as, for example, Balancing Domain Decomposition). In this paper, we show some of the performance of the software on a real-life industrial problem.
1. INTRODUCTION
The domain decomposition solver SALSA [4] is a combined direct/iterative solver for large sparse systems of linear equations on distributed-memory computers. Though the coefficient matrices may be general in nature, SALSA has been developed primarily for solving positive definite systems from finite element problems, such as solid elasticity, shell, and plate problems, possibly on disconnected subdomains. The SALSA software package features state-of-the-art domain decomposition techniques. These include iterative substructuring techniques with 1-level and 2-level Neumann-Neumann preconditioners [6, 10, 11, 12, 15]. The 2-level method features a variety of coarse spaces. Some of these coarse spaces are designed for specific classes of problems, but others are computed algebraically and no additional problem information is required. The software contains other domain decomposition techniques as well, but in this paper we restrict the discussion to iterative substructuring and the Neumann-Neumann preconditioners.
2. ITERATIVE SUBSTRUCTURING IN SALSA
The SALSA package assumes a partitioning of the overall problem into non-overlapping subdomains [14]. It accepts a partitioned problem from the user, or computes one internally from a problem that is provided in assembled or elemental format (using METIS [9]). Let the
846 physical domain be decomposed into N nonoverlapping subdomains i, i = 1 , . . . , N, and let the stiffness matrix K (~) and fight-hand side f(~) for subdomain i be defined as:
h.(~) ~-(~) ~ " R I
~ ~R R
(1)
,
Here, the matrix K ~ ) contains the contributions of the internal nodes, the matrix ~(i) the ~"RR contributions of the interface nodes, and the off-diagonal matrix "fc(i) contains the links between '~RI internal nodes and the interface nodes, re(i) is the transpose o f ~ R1 n'(~) The right-hand side vector '"IR f(i) is defined accordingly. SALSA allows a flexible mapping of the subdomains so that a good balance of the processor work load can be achieved. For example, if the number of subdomains is larger than the number of processors, multiple subdomains can be assigned to the same processor. If the number of subdomains is smaller than the number of processors or if the subdomains differ widely in work and/or storage, more than one processor can be assigned to a single subdomain that subsequently perform the operations for this subdomain in parallel. In the remainder of this paper, we assume (unless stated otherwise) a one-to-one mapping of subdomains to processors. Substructuring algorithms contain a preprocessing phase in which the internal unknowns in each substructure are eliminated. This is done by solving the N sparse linear systems K ~ ) 9 v~i) = f~i) for v~i) in parallel. SALSA uses by default the multifrontal direct sparse method MUMPS for solving these linear systems (but work is on-going to have iterative solvers as an option as well). MUMPS is a software package that (like SALSA) was originally developed as part of the European Framework IV project PARASOL [2, 3] and that is still being further extended. The preprocessing phase computes for each substructure i a fight-hand side vector of the form 9~(i) = f}~) - - ~(i) . v~i) 9 Vector 9~(i) is defined for the interface variables of substructure i. R -'~RI R The preprocessing phase also computes for each substructure a Schur complement matrix
•
"~
i
~~'RR
~
"~RI
"
~IR
(2) 9
Each matrix S i can be computed explicitly as a dense matrix (using MUMPS). Alternatively, the action of S~ on a vector is available indirectly through evaluation of the fight-hand side of Equation (2). After the preprocessing of the subdomain interiors, SALSA solves the interface problem N
N
i=1
i=1
(3)
for interface vector u n, by using an iterative (Krylov subspace) method (see e.g., [8]). Here, Ri denotes the prolongation operator associated with subdomain i. In the sequel, a vector xi denotes the interface vector x (that is defined for all interface variables) restricted to the interface variables of subdomain i, i.e., x i = RTx. The interface matrix S in Equation (3) is not assembled. This provides a source of parallelism that can be exploited in the construction of the preconditioner and the Krylov subspace iteration.
847 For example, a matrix-vector product y = S 9x is computed as N
i=1
This requires N matrix-vector products S~x~, one for each subdomain, that can be computed in parallel, followed by one communication step in which the local contributions are summed. After the interface solution vector u R is computed, some postprocessing is still necessary to compute the final solution vectors: N independent linear systems K~I). w~i) - _ ~~IR ( i ) . RTuR are solved in parallel for local vectors w~~). For each system, this requires a matrix-vector product to compute the right-hand side and (provided the direct solver MUMPS was used in the preprocessing phase) a forward/back substitution with the factors of K}/). Finally, the vectors @) and w~i) are summed to obtain the interior solution vectors u~i). The pre- and postprocessing phases are so-called embarassingly parallel, i.e., the operations for a substructure do not depend on data from any other substructure that is not yet available. The iteration on the interface problem is implemented in parallel but requires synchronization of the processors at each step of the iteration (for the dot-products, matrix-vector products, and application of the preconditioner). The number of iterations that is needed strongly depends on the problem and the preconditioner that is used. In the next section, we describe the more important features of the Neumann-Neumann preconditioners that are available in SALSA. 3. THE N E U M A N N - N E U M A N N P R E C O N D I T I O N E R S
The SALSA solver provides l-level and 2-level Neumann-Neumann preconditioners for the Krylov subspace iteration. We refer the reader to [4, 14] for an introduction to NeumannNeumann preconditioning and to [6, 10, 11, 12] for some detailed analyses. The l-level method preconditions an interface residual vector x by N
y -- ~
l~iDiS~lDi t~Tx"
/=1
Here, the matrices D i are diagonal matrices that define a partition of unity, i.e., }-~'~iU=1 RiDi = I. The l-level method requires the solution to the Neumann problems Six i = Yi with Neumann matrix Si. SALSA optionally factorizes the Neumann matrices using either MUMPS or LAPACK, depending on the sparsity of the matrix. However, the Neumann matrices can be (and often are) singular. To avoid this, SALSA first regularizes the Neumann matrices by increasing the diagonal entries with a small constant a times the largest diagonal entry (see also [7]). We note that such modification changes the preconditioner only, and not the (user-provided) problem. The 2-level Neumann-Neumann preconditioner requires the solution to a coarse problem. Several options are implemented in SALSA to construct a coarse matrix that enables the iteration to converge quickly for a range of problems. Mandel [12] showed that the coarse problem can be formulated in terms of N local coarse spaces Z/ = span{z/k 9k - 1 , . . . , ni}, one for each substructure i, where each local coarse space must satisfy Ker S i c Zi. The SALSA solver contains efficient coarse operators for problems for which the local coarse spaces Z/(or
848 rigid body motions) are known in advance. This is for example the case for diffusion equation and (2D and 3D) elasticity problems. A more flexible strategy for symmetric positive definite problems is to construct the local coarse spaces algebraically from selected eigenvectors of the local Schur operators [5]. This requires the computation of the eigenvectors that correspond to selected eigenvalues of a Neumann problem posed on each subdomain. In general, larger eigenvector based coarse spaces are more expensive to construct but improve convergence of the iteration (typically Preconditioned Conjugate Gradients - PCG). Hence, it is important to find a balance between the dimension of the coarse space and the number of iterations. In [5], a new coarse operator is proposed that adjusts the number of computed eigenvectors adaptively to the number of small eigenvalues found (for each Si individually). The underlying idea is to compute (a possible cluster of) the smallest eigenvalues and 'ignore' the larger eigenvalues. In the current implementation of SALSA, an eigenvalue A is considered small if it is not larger than A := O . m a x , , m i n ~ i i .
(4)
i
Here, 0 is a user-defined constant. The threshold A is calculated from the largest of the smallest eigenvalues for each of the matrices/
4)
9s .
where i, j are neighbouring subdomains or subdomains that have a common neighbour subdomain. The construction of the coarse matrix can be formulated as a sequence of (dense) matrix-matrix products and SALSA makes use of higher level BLAS routines where possible. 4. E X P E R I M E N T A L RESULTS We have evaluated the numerical behaviour and the performance of the SALSA algorithms and software with numerous experiments on various types of finite element problems. These include model problems like solid, plate, and shell elasticity problems (see [5]) and large scale
849 industrial problems. Space limitations do not permit us to describe these results in detail in this paper. Therefore, we decided to illustrate the behaviour and performance of SALSA on one relatively small but challenging test case that comes from an industrial structural analysis application. The experimental results were obtained on an IBM p690 Regatta Turbo at Parallab with thirty-two 1.3 Ghz power4 processors. The machine was under a normal production load when the experiments were carried out. The SALSA software is written in Fortran 90 and uses MPI for message passing between the processors. Experimental results for SALSA on an SGI Origin 2000 were published in [4, 5].
Figure 1. The MT1 tubular joint test case provided by Det Norske Veritas, Oslo.
DNV tubular joint (MTI). The test case was provided by Det Norske Veritas (Oslo, Norway). The problem comes from the modelling of a tubular joint and contains a mixture of solid, shell, and transitional finite elements. The problem was partitioned by Det Norske Veritas into 58 subdomains of widely varying sizes and shapes. The problem contains 97,470 degrees of freedom; 13,920 lie on the interfaces between the subdomains. The SALSA solver was given the partitioned stiffness matrices from Equation (1). Righthand sides were generated randomly. No other information was provided to the solver. The subdomains were mapped onto the processors in a cyclic way; i.e., subdomain i is assigned to processor ( / - 1) mod #procs. The regularization parameter c~ was set to 10 -12. The PCG iteration was stopped when the/2-norm of the residual was reduced by a factor 106. The experimental results in Table 1 were obtained with the ADAPT (6) coarse space constructor (see Section 3). The first set of results shows the elapsed times for 1 to 12 processors. One observes that the scalability is satisfactory for smaller numbers of processors but deteriorates for larger numbers of processors. This can contributed to several things. First (and most obvious), for larger numbers of processors, the global reductions have larger impact on the overall performance. Second, some of the processors are idle after they have finished the factorizations of the interior matrices. They need to wait until all the other processors have finished their factorizations before they can proceed with the construction of the preconditioner (due to some processor synchronization). If the overall elapsed time decreases, the idle times caused by this processor synchronization appear to have a larger impact. Finally, as mentioned earlier, the distribution of the subdomains over the processors is cyclic and thus not optimized according
850 Table 1 Results for the 58-subdomain MT1 test case using the A D A P T (6) coarse variant. #procs denotes the number of processors; #iter denotes the number of PCG iterations; 'coarse' denotes the dimension of the coarse space; 'pre' denotes the elapsed time (in seconds) for the preprocessing phase that includes the elimination of the internal unknowns, the construction of the Schur complement matrices, and the construction of the 2-level preconditioner; 'PCG' denotes the elapsed time for the PCG iteration; 'total' denotes the total elapsed time for solution required by SALSA. The times for postprocessing are relatively small and not listed here. #procs 1 2 4 8 12 1
1 1 1 1 1 1 1
1
0 0.005 0.005 0.005 0.005 0.005 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.010 0.050
coarse #iter 1518 1518 1518 1518 1518 1236 1350 1416 1446 1518 1596 1644 1764 3042
pre 32 29.0 31 16.5 9.7 31 31 6.1 31 5.3 134 26.7 78 30.2 35 28.5 35 28.2 32 29.0 28 30.8 25 31.7 22 32.3 15 50.4
times PCG 16.6 8.7 4.3 2.1 1.8 64.2 41.2 17.8 17.8 16.6 16.1 14.6 12.6 11.2
total 45.7 25.3 13.9 8.3 7.2 91.0 71.4 46.3 46.0 45.7 46.9 46.2 45.0 61.7
to the sizes and work related to each of the subdomains. A more clever mapping of the subdomains such that each of the processors would have approximately the same amount of work in the preprocessing phase may well reduce the idle times caused by the mentioned processor synchronization and thus improve the overall performance of the code on this problem. The order of the coarse matrix for the first set of results is quite large (1518). In order to see if this can be reduced, we experimented with the parameter 0 while keeping the number of processors constant (see Equation (4)). The results are shown in the second set of results in Table 1 for one processor. As expected, the results show that the size of the coarse space grows with increasing values of 0. A larger coarse space means that the time complexity of building the preconditioner increases and this becomes clearly visible for 0 > 0.007. On the other hand, a larger coarse space improves the convergence of the iteration but the time per iteration increases. The optimal value for 0 for the MT1 test case lies somewhere between 0.003 and 0.010. This experiment shows that the value of 0 must be well tuned in order to preserve both the convergence and the performance of the overall solution process. A way to compute a good value for 0 automatically would be of great help for the user of SALSA and we are currently looking into this. Finally, we note that some internal default settings in the SALSA code determine that SALSA computes 27 Schur complement matrices explicitly, while the other 31 matrices are represented implicitly.
851
5. CONCLUDING REMARKS Iterative substructuring induces a high level of (natural) parallelism throughout the solution process. In the past several years, the iterative substructuring software package SALSA has proven capable of solving different kinds of problems in academics, engineering and industry, in a robust, yet efficient way in a distributed memory environment. The eigenvector-based coarse space formulations that have been designed for the 2-level Neumann-Neumann preconditioner can be constructed efficiently in parallel and work well on various kinds of finite element discretizations. We plan to continue the development of the domain decomposition solver and use it as a platform to investigate algorithmic modifications and their use in practical situations. This will primarily focus on the Balancing Neumann-Neumann part, but also address enhancements of the basic performance and the integration of new functionalities that are of interest to end-users of the software. In this paper, we have limited the discussion primarily to linear systems that are symmetric positive definite. An unsymmetric variant of the Neumann-Neumann preconditioners (based on [1 ]) has however been prototyped and its numerical behaviour and preliminary performance is currently being investigated on some model problems. In addition, we are also looking into new formulations for the 2-level coarse space. Finally, we are also interested in looking more closely at the behaviour of hybrid versions of the available coarse space variants. Results on these will be published elsewhere. ACKNOWLEDGEMENT We acknowledge the work by Piotr Krzy~anowski on the construction of the coarse space formulations and his work on an earlier version of the software. REFERENCES
[1]
[2]
R Alart, M. Barboteu, R Le Tallec, and M. Vidrascu. Additive Schwarz method for nonsymmetric problems: application to frictional multicontact problems. In N. Debit, M. Garbey, R. Hoppe, J. P6riaux, D. Keyes, and Y. Kuznetsov, editors, Domain Decomposition Methods in Sciences and Engineering. DDM.org, 2000. Thirteenth International Conference, Lyon, France. P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, and J. Koster. A fully asynchronous multifrontal solver using distributed dynamic scheduling. Technical Report RAL-TR-1999059, Rutherford Appleton Laboratory, Chilton, Didcot, England, 1999. To appear in SIAM
J. Matrix Anal. Appl.
[3] R R. Amestoy, I. S. Duff, J.-Y. L'Excellent, and J. Koster. MUMPS: a general purpose
[4]
[5]
distributed memory sparse solver. In Proceedings' of PARA2000, the Fifth International Workshop on Applied Parallel Computing, pages 122-131. Springer-Verlag, June 18-21 2000. Lecture Notes in Computer Science 1947. P. E. Bjorstad, J. Koster, and R Krzy~anowski. Domain decomposition solvers for large scale industrial finite element problems. In Proceedings of PARA2000, the Fifth International Workshop on Applied Parallel Computing, pages 374-384. Springer-Verlag, June 18-21 2000. Lecture Notes in Computer Science 1947. R E. Bjorstad and R Krzy~anowski. A flexible 2-level Neumann-Neumann method for
852
[6]
[7]
[8] [9]
structural analysis problems. In Parallel Processing and Applied Mathematics PPAM 2001, Warsaw, Poland, September 2001. Springer-Verlag. Lecture Notes in Computer Science 2328. Y.-H. De Roeck and P. Le Tallec. Analysis and test of a local domain decomposition preconditioner. In R. Glowinski, Y. Kuznetsov, G. Meurant, J. P6riaux, and O. Widlund, editors, Fourth International Symposium on Domain Decomposition Methods for Partial Differential Equations, pages 112-128. SIAM, 1991. M. Dryja and O. B. Widlund. Schwarz methods of Neumann-Neumann type for threedimensional elliptic finite element problems. Comm. Pure Appl. Math., 48(2):121-155, 1995. G.H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 3rd edition, 1996. G. Karypis and V. Kumar. MEqqS. A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices. University of Minnesota, Department of Computer Science / Army HPC Research Center, September 1998. U R L h t t p ://www. cs. umn. e d u / ~ k a r y p i s .
[10] P. Le Tallec, J. Mandel, and M. Vidrascu. Balancing domain decomposition for plates. In Domain Decomposition Methods in Scientific and Engineering Computing, University Park, 1993, pages 515-524. Amer. Math. Soc., Providence, 1994. [ 11 ] P. Le Tallec, J. Mandel, and M. Vidrascu. A Neumann-Neumann domain decomposition algorithm for solving plate and shell problems. SIAM J. Numer. AnaL, 35(2):836-867, 1998. [12] J. Mandel. Balancing domain decomposition. Comm. Numer. Methods Engrg., 9(3):233241, 1993. [ 13] J. Mandel and P. Krzy• Robust Balancing Domain Decomposition. August 1999. Presentation at The Fifth US National Congress on Computational Mechanics, University of Colorado at Boulder, CO. [14] B. F. Smith, P. E. Bjorstad, and W. D. Gropp. Domain Decomposition. Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, Cambridge, 1996. [ 15] M. Vidrascu. Remarks on the implementation of the Generalised Neumann-Neumann algorithm. In C.-H. Lai, P. E. Bjorstad, M. Cross, and O. B. Widlund, editors, Domain Decomposition Methods in Sciences and Engineering. DDM.org, 1999. Eleventh International Conference, London, UK.
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
853
Multiperiod Portfolio Management Using Parallel Interior Point Method L. Halada a*, M. Lucka b, and I. Melichercik c aInstitute of Informatics, Slovak Academy of Sciences, Dubravska 9, 845 07 Bratislava, Slovakia bInstitute for Software Science, University of Vienna, Liechtensteinstrasse 22, A-1090 Vienna, Austria CDepart. of Economic and Financial Modeling, Faculty of Mathematics, Physics and Informatics, Mlynska dolina, 840 00 Bratislava, Slovakia Computational issues occurring in finance industry demand high-speed computing when solving problems such as option pricing, risk analysis or portfolio management. In order to respond adequately on changes in the market it is necessary to evaluate information as fast as possible and to update the appropriate portfolio changes. As a consequence at present one can observe increasing interest in the development of multi-period models of portfolio management. These models introduce intermediate reallocation opportunities connected to transaction costs, which affect the composition of the portfolio at each decision instant. Stochastic programs provide an effective framework for sequential decision problems with uncertain data, when uncertainty can be modelled by a discrete set of scenarios. In this paper we present an algorithm for solving a three-stage stochastic linear program based on the Birge and Qi factorization of a constraint matrix product in the frame of the primaldual path-following interior point method. We outline the parallelization of the method for distributed-memory machines using Fortran/MPI and the linear algebra package LAPACK. 1. I N T R O D U C T I O N Problems of portfolio management can be viewed as multi-period dynamic decision problems where transactions take place at discrete points in time. At each point in time a portfolio manager has to make the decision taking into account market conditions (e.g. exchange rates, interest rates) and the contemporary composition of the portfolio. Using this information the manager could sell some assets from the portfolio and using the cash from selling and other possible resources he/she buys new assets. We present an example of a portfolio management problem. It is a model for allocation of financial resources to bond indices in different currencies. In each currency we have one index that consists of bonds issued in this currency. The whole portfolio is evaluated in the base currency. The risk one faces when making the decision is twofold: interest rate risk and exchange *This work was supported by the Special Research Program SFB F011 "AURORA"of the Austrian Science Fund FWF and VEGA Agency (1/9154/02) and (1/ 1004/04), Slovakia.
854 rate risk (future interest rates and exchange rates are uncertain). The stochastic properties are represented in the form of a scenario tree. The scenarios contain future possible developments of interest rates and exchange rates. The objective is to maximize the expected value of the portfolio at the time horizon taking into account future reallocation opportunities connected to transaction costs. When one deals with several currencies, the realistic scenario trees are "bushy" and the number of scenarios grows exponentially with the number of stages. Thus, the computation of such problems could be extremely large and computationally intractable. Approaches for solving these problems usually either take advantage the problems' matrix structure or decompose the problem into smaller subproblems. In the literature we can also see a considerable research efforts to develop efficient parallel methods for solving this problem on parallel computer architectures [1],[2],[3],[4]. In our paper we demonstrate a parallel interior point algorithm (IPM) for solving three-stage stochastic linear problem which comes from a three-period models of portfolio management. The paper is organized as follows: Section 2 establishes the problem formulation. Section 3 presents the application of the IPM to the three-stage stochastic programs. The last section discuss the issues of parallel implementation. 2. P R O B L E M F O R M U L A T I O N
The stochastic properties are represented in the form of a scenario tree. Denote by fT, 1 _< r _< T the set of nodes at time r. For any ca C 5%, 1 < ~- < T, there is a unique element a(ca) = ca' E 9c~_1, which is the unique predecessor of ca. Denote the decision variables of this process as b~~) ( J ) , ca~-c ~%" The amount of the index j bought in period 7,
S(r) J (CAT), car C U~-: The amount of the index j sold in period r, h ~ ) ( J ) , ca~ c 3%: The amount of the indexj held in period r, and constants as c (~ Initial cash available, h~~ Composition of the initial portfolio, 7] (ca~), car E ~-: The exchange rate base currency/j-th currency (7~~)
1 Vr),
o(r) j ( J ) , car C ~ " The value of the j-th index in the j-th currency, ~]~) ( J )
ca~ E ~ : Bid price of the j-th index in the base currency computed by
F!~) (car) _ v]~)(ca~)7]~)(ca~)(l ",3
~3
,,
X(~) J (car), ca~ 6 .T~: Ask price of the j-th index in the base currency computed by
X(',-) j (J)
:
o ('r) (ca-,-)9,]~-)(j-) ( 1 + 6]~k)
( ~ k > 0 and ~3~bid> 0 are transaction costs for buying and selling). Constraints
The amount of bought and sold units of index should be nonnegative. We forbid short positions. Therefore the number of hold units is nonnegative.
855
~(~-) ~j (w'r)_~O,
b~-) (w~-) > O,
hj(~-)( J )
> O
Vl<~-
Possible restrictions for selling A typical investor is conservative. He/she is not willing to sell a big part of the portfolio. Therefore we add constraints allowing to sell only/3 part of any asset or/3 part of the whole portfolio.
Inventory balance and cash-flow accounting for the Period 1 hJ~
1)~j
hJ 1) =
Vj.
z__., J
J
Inventory balance and cash-flow accounting for the Period T, where I ~(~-) (~-) hJ~--l)(a(w~-))+bJr)(w~)-~j (w ~-) - hj (coT) Vw "r E Y~ Vj. J
< ~- < 7":
J
Risk reduction The terminal wealth calculation is given by:
WT(J-)-
Z :Jz-)(w~)hj(7__:)(a(w~r)) J
V J - E Yr.
To reduce the risk one can add the constraint forbidding the terminal wealth to fall below some proper constant C:
WT(J)
> C
W ~ E J~-.
Objective function The objective function maximizes the expected terminal wealth. It can be written as: Maximize }-~-JeYT 7r(wT-)WT(wT-) , where 7r(w7) is the probability of the scenario J - . Now, if we express the objective function in the form cTx, where x is the vector of all decision variables, our aim is to find the solution of the problem Maximize cTx, subj. toA (3)x - b. In the three-stage stochastic model the constraint matrix of the whole problem A (3) and the corresponding vector b have the following form: / A~~)
2)
A?
A(3) _
9
o
b~2)
A?
,b-
h(;)
o
~(3)
(2)
\ -,- N(3)
N(3)
~N(3)
856 where the matrices A(k2), k = 1, 2, ...N (3) represent a two-stage problem in the frame of the whole three-stage problem. The right-hand side vector b is split on sub-vectors in accordance with matrix A (a). The k-th two-stage problem determined by the matrix A~2), k = 1, 2 , . . . , N (3) has the form (' 4(2) ~k,0 k,1
(1) k,1
(1) k,2
~r(2) "~k,2 7.(2)
A~2) =
(1)
k,3
k,3 9
\
where ""k,O 4(9) is an 'trek, _(9)0
.
T(;;
o
4( 1)
k,Mk
"~k,M k ]
_(9)0 X 'ttk,
matrix and A(1) __(1)j X "tek, _(1)j matrices. We suppose also that ~ ~k,j are tt~k. re(l) _(1) for all matrices 4(~) k,j < - - 'tek,j " ' k , j ' J = 1.2. . .M k. Matrices . . q~(2)k.jhave the size conformable to
matrices ~'k,O 4(2) and ''k,j a(1) 9 We assume moreover, that the matrices A~3)' a(2) and all matrices ~a(1) ~ ~k,O ~k,j have full row rank and ~" ~ k(~) ~ (1) for every k and j. , j -<- 't~k,j 3. A P P L I C A T I O N OF THE I N T E R I O R P O I N T M E T H O D
One approach for solving the problem defined above is to use the interior point method (IPM). We have chosen the Mehrotra's Predictor Corrector algorithm MPC defined in [5], p. 198. This algorithm, since 1990, has been the basis for most interior point software 9 Given (x ~ yO, z o) with x ~ > 0, z ~ > 0, it finds the iterates (x k+l, yk+l, zk+l), k = 0, 1, 2..., by solving the system 0
AT
A Z
0 0
,)(,..:,)(.c) 0
Ay aff
X
Az aff
=
f b
(1)
,
r.
where r b - b - A x , re = c - z - A T y , r~ -- - X Z e ; X and Z are diagonal matrices with diagonal entries x and z, respectively. Calculating the centering parameters ozpri
=
argmax{ct
e [0.1]'x k + a A x afI > O}
OLdual aff
:
argmax{ct
e [0.1]" z k + c t A z a f f > O}
aff
~aff
--
'
--
'
--
~p~i A _ a f : ~ T : ~ k
(X k "I" ~ a f f Z - . x ~
'
'
_e~IAzaff)/n
) ~,,~ + OLaf f
and setting a - ( P a i l ~ P ) 3, where # - x T z / n , the linear system (1) is solved again with the fight-hand side r b = O, r e = O, r , = a p e - A x a f f A z a f f e for the solution (Ax ee, Ay ee, Azee). Computing the search direction and step to boundary from ( A x k, A y k, A z k)
--
(AxafI
A y a:f, A z a I / ) + ( A x ce, A y ee, Azee).
OZpri x
---
a r g m a x { c ~ > O; x k + c~Ax k > 0}.
c~d~at max
-m
argmax{c~ > m O; z k + c~Az k > O}
857
pri 1), o~d~al and setting the ~~.pri = rain(0.99 9 ~max, k (xk+I yk+1 zk+l) are established as xk+l
:
=
rain(0.99
*
"" d u a l
(~max ~
1) ,
the values of
xk J-- ~kpri lkx k _ dual
From the computational point of view the most time consuming part of this algorithm is solving of the system (1) with different right-hand sides. Therefore an effective parallelization of this process is very suitable. With respect to this, let us express (1) as follows: m y aff
:
/kx afy
=
Z - I ( X A T A y aff + r. - Xrc) ,
mzaff
--
x-lr #
(ADAT)-I(% + AZ-I(Xr~-
_
2
r,)),
-1ZAxaff
where D = Z - ~ X . The crucial step for finding the unknown vectors is to solve the first equation. For our three-stage stochastic problem, it means to solve
(A(3)D(3)(A(3))t) A y = (% + A ( 3 ) Z - I ( X r c - r,)) : r (3),
(2)
A. It
where matrix A (3) stands as matrix has been proven in [6] that the inversion of the matrix A (3)D (3) (A(a)) t can be computed by the Sherman-Morrison-Woodbury formula as follows 9
(A(3)D(3)(A(3))t)-~_ (T~(3))-~- (T~(3))-~U(S)(G(3))-~ (V(3))t(~-~(3))-1 where
T4(a)= Diag(i(o3) ' R~2) R~2)~ ...~ .(2) -~,~N(3))
9-(3) t ) -'- N(3) )
U (3) = T(3) ~N(3)
0(3) (A(o3))t ) G(3) =
-A(o3)
0
'
and N (3)
O (~) =
RT)-
(D;~)) -~ +
{A(3)~tA(3) ,-~o ,..o
(2) - 1 T ( 3 ) + Z(T~ ~))t (n~) .~
k:l
,
(3)
858 Thus, the solution (A(3)D(3)(A(3)) t) Ay = r (3) can be expressed by the inversion on the basis of the validity (3) as Ay = p(3) _ s(3) while T2~(3)p(3) = G(3)q(3) =
5(3),
(V(3))tp(3),
(4) (5)
7~(3)s(3) =
U(3)q(3).
(6)
The equations (4)-(6) represent the decomposition of the original problem into three subproblems. An advantage of such a decomposition is that R(3) is the block-diagonal matrix, where the diagonal matrix element R~2) with corresponding fight-hand side represent the basic equation of the two-stage stochastic problem [7], [1]. The parallel three-stage procedure has been summarized in the paper [6]. The detail parallel procedure for solving the system (2) has been published in [8]. 4. PARALLEL IMPLEMENTATION The implementation of the IPM method based on three-stage algorithm rely on basic algorithms of linear algebra: Cholesky decomposition, solving a system of linear equations, matrixvector and matrix-matrix multiplication, and summation of matrices or vector, respectively. These algorithms are in the core of every multistage stochastic model and have a profound influence on the performance. For solving of linear algebra problems we have used the program package LAPACK [9], because the LAPACK library has been designed for high-performance workstations and shared-memory multiprocessors [10]. Parallel implementation of the threestage algorithm in the frame of the IPM method is based on the Message Passing Interface (MPI) [111. The solving of the system (2) is targeting distributed-memory parallel computers and relies on the Single Program Multiple Data (SPMD) model. The computational structure of both threestage and two-stage algorithms for solving the above mefioned system of linear equations, is very similar. Most computations are independent and can be performed in parallel. Collective communications are required only in two computational steps in every two-stage linear system and also twice in the three-stage linear system, where the collective gathering from all processes takes place. The careful analysis of the parallel three-stage algorithm is shown and explicitly described in the paper [8]. As it is shown there, all "block" rows j = 1, 2 , . . . , M k in solving the two-stage procedure A~2), can be processed in parallel. The same is true for k = 1, 2 , . . . , N (3) "block" rows in the matrix A (3), so two levels of parallel processing are possible. The above mentioned paper presents also the performance results achieved by experiments on the Beowulf cluster, University of Vienna. The algorithm based on the parallel BQ decomposition used in the frame of the MPC algorithm for three-stage stochastic problems, was implemented in the Fortran 90 programming language and executed on cluster of SMP's. The performance results of one of the experiments are illustrated on the Fig. 1. Almost linear speed-up can be observed for smaller number of processors; the slow-down achieved for 16 processors was caused by increasing overhead. The size of the test problem was too "small" for this number of processors and the overhead overcame the execution. The performance results for larger problems and further implementation details will create the subject of our next paper.
859
o= 40 .=.
3o
Number of processors
Figure 1. Execution times of experiments where the matrix A (3) has on the main diagonal N (3) - 9 block matrices A~2). Every matrix A~2), k - 1, 2, ...N (3) has again M k = 9 matrices A(1) of the size 3 x 6, j - 1, 2 ' ' " .M k, on the main diagonal. The size of the linear system (2) in k,j was in this case 273 x 546. 5. CONCLUSIONS We have presented a parallel method used for solving of the linear programs raised from portfolio management problems. The algorithm is based on the BQ factorization technique for three-stage stochastic programs in the context of the interior point method. Because the structure of the corresponding matrix for both three-stage and two-stage stochastic problems is regular, parallel execution in both hierarchical levels is possible. The algorithm is scalable and enables to solve large linear programs raising from the portfolio management problems.
REFERENCES
[1]
[2] [3] [4]
[5] [6]
E. R. JESSUP, D. YANG, AND S. A. ZENIOS, Parallelfactorization of structured matrices arising in stochastic programming, SIAM J.Optimization, Vol. 4, No.4, 1994, pp. 833846. G. CH. PFLUG, A. SWIETANOVSKY,Selected parallel optimization methods for financial management under uncertainty, Parallel Computing 26, 2000, pp. 3-25. SOREN S. NIELSEN, STAVROS A. ZENIOS, Scalable parallel Benders decomposition for stochastic linear programming, Parallel Computing 23, 1997, pp. 1069-1088. H. W. MORITSCH AND G. CH. PFLUG, Using a Distributed Active Tree in Java for the Parallel and Distributed Implementation of a Nested Optimization Algorithm, In IEEE Proceedings of the 5th Workshop on High Performance Scientific and Engineering Computing with Applications (HPSECA-03), October 2003, Kaohsiung, Taiwan, ROC. STEPHEN. J. WRIGHT, Primal-Dual Interior Point Methods, SIAM 1997, ISBN 0-89871382-X. G. CH. PFLUG, L. HALADA, A Note on the Recursive and Parallel Structures of the Birge and Qi Factorization for Tree Structured Linear Programs, Computational Optimization and Applications, Vol. 24, No. 2-3, 2003, pp. 251-265.
860 J.R. BIRGE AND L. QI, Computing block-angular Karmarkarprojections with applications to stochastic programming, Management Sci. 34, 1988, pp. 1472-1479. [8] S.BENKNER AND L.HALADA AND M.LUCKA, Parallelization Strategies of Three-Stage Stochastic Program Based on the BQ Method, Parallel Numerics'02, Theory and Applications. Edited by R.Trobec, EZinterhof, M.Vajtersic, and A.Uhl, pp.77-86, October,23-25, 2002 [9] E. ANDERSON, Z. BAI, C. H. BISCHOF, S. BLACKFORD, J. W. DEMMEL, J. J. DONGARRA, J. DU CROZ, A. GREENBAUM, S. HAMMARLING, A. MCKENNEY, AND D. C. SORENSEN, LAPACK Users" Guide, 3rd edn. SIAM Press, Philadelphia PA, 1999. [10] L. S. BLACKFORD AND J. J. DONGARRA, Lapack Working Note 81, Revised version 3.0, June 30, 1999. [ 11 ] MESSAGE PASSING INTERFACE FORUM. MPI: A Message-Passing Interface Standard. Vers. 1.1, June 1995. MPI-2: Extensions to the Message-Passing Interface, 1997. [7]
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
861
Performance of a parallel split operator method for the time dependent Schr6dinger equation T. Matthey a* and T. Sorevik bt aParallab, UNIFOB, Bergen, NORWAY bDept, of Informatics, University of Bergen, NORWAY
In this paper we report on the parallelization of a split-step algorithm for the Schrrdinger equation. The problem is represented in spherical coordinates in physical space and transformed to Fourier space for operation by the Laplacian operator, and Legendre space for operation by the Angular momentum operator and the Potential operator. Timing results are reported and analyzed for 3 different platforms 1. INTRODUCTION For large scale computation efficient implementation of state-of-the-art algorithm and highend hardware are an absolute necessity in order to solve the problem in reasonable time. Current high-end HPC-systems are parallel system with RISC processors. Thus requirements for efficient implementation are scalable parallel code to utilize large number of processors, and cache-aware numerical kernels to circumvent the memory bottleneck on todays RISC processors. In this paper we report on our experiences with optimizing a split-step operator algorithm for solving the time dependent Schrrdinger equation. A vast number of quantum mechanical problem require the solution of this equation, such as femto- and attosecond laser physics [2], quantum optics [9], atomic collisions [1 ] and in cold matter physics [8], just to name a few. In all cases this is time consuming tasks, and in many cases out of bounds on todays systems. Many different numerical discretization schemes have been introduced for the solution of this equation. In our view the most promising candidate for returning reliable solutions within reasonable time, is the split-step operator technique combined with spectral approximation in space. The resulting algorithm for is briefly outlined in section 2 where we also explain how it is parallelized. In section 3 we describe the efficient sequential implementation of the numerical kernels, and report timings for different versions of matrix multiply where one matrix is real while the other is complex. The core of the paper is section 4 where we report parallel speed-up on 3 different platforms and discuss our problems and successes. *http://www. ii .uib.no/~matthey %http://www.ii.uib.no/-tors
862 2. T H E A L G O R I T H M A N D ITS P A R A L L E L I Z A T I O N
The time dependent Schr6dinger equation (1) can be written as t) = H ~ ( x , t)
0
(1)
where the Hamiltonian operator, H, consists of the Laplacian plus the potential operator V(x, t)
H
(x, t)
t) + V(x,
t).
(2)
The problems we are targeting are best described in spherical coordinates. Transforming to spherical coordinates and introducing the reduced wave function (b = r ~ gives the following form of the Hamiltonian 02
H =
L2
20r2 + ~
+ V(r, O, O, t),
(3)
where L 2 is the angular momentum operator. To highlight the basic idea we make the simplifying assumption that V is time-independent as well as independent of qS. The same basic algorithm and the same parallelization strategy holds without these assumption, but the details become more involved. For H being a time independent linear operator the formal solution to (1) becomes
r
O, O, t,+,) = e~tUe(r, O, O, t,),
where At = in+ 1
-
-
(4)
t n. Splitting H into H = A + B we get
(I)(T, 0, qS, tn+l) = EAt(A+B)(~(T, O, r tn).
(5)
If we assumed that A and B commute we could write (5)
0,
=
0,
t.),
(6)
which would allow us to apply the operators separately ~ and greatly simplify the computation. Unfortunately they do not commute and in that case we will have a splitting error. The straightforward splitting of (6) leads to a local error of O(At2). This can be reduced to O(At 3) locally if a Strang [10] splitting is applied.
(I)(T, 0, qS, tn+l) -- cAt/2Ac AtB CAt/2Af~(T, O, ~, tn).
(7)
This introduction of"symmetry" eliminates a term in the error expansion and increases the order of the method. More elaborated splitting schemes, involving many terms do also exist. When the Strang splitting is done repeatedly in a time loop, we notice that except for the start and end two operations with e At/2a follow each other. These can of course be replaced by one operation of e Ata. Hence provided some care is taken with startup and clean up the Strang splitting can be implemented without extra cost. Combining the three ingredients; split-operator *For simplicitywe here split the Hamiltonianin only two operators. For more operators we can apply the formalism recursively, which is done in our implementation.
863 technique, spherical coordinates and spectral approximation in space was first suggested by Hermann and Fleck [7]. The reason why the split-operator technique is so extremely attractive in our case is the fact that the individual operators, the Laplacian and the angular momentum operator have well known eigenfunctions. The Fourier functions, e ik~, and the spherical harmonics, tPlm(0, r respectively. Thus expanding 9 in these eigenfunctions makes the computation not only efficient and simple, but exact as well. The split-step algorithm is outlined in Algorithm 1. Algorithm 1 The split-step algorithm /* i n i t i a l i z a t i o n */ for n : O , n s t e p s - i
P ,-- FFT(F~) k = scale with eigenvalues of A F n + l / 2 +-- I F F T ( ! 9) F = scale with eigenvalues of B ,2
end
for
F, F,/~ are all matrices of size n~ x nz, all discrete representation of 9 in coordinate space, Fourier space or Legendre space, respectively. For data living in the right space, time propagation reduces to simple scaling by the appropriate eigenvalues and the step size, and consequently fast as well as trivially parallelized. The computational demanding part are the transforms. Each transform is a global operation on the vector in question, and therefor not easily parallelized. But with multiple vectors in each directions, there is a simple outer level parallelism. We can simply parallelize the transforms by assigning nz/N p transforms to each processor in the radial direction, or n~/Np in the angular direction; Alp being the number of processors. With this parallelization strategy the coefficient matrix needs to be distributed column-wise for the Fourier transform and row-wise for the Legendre transform. Consequently, between a Fourier transform and a Legendre transform, we need to redistribute the data from "row-splitting" to "column-splitting" or visa versa. Our code is parallelized for distributed memory system, using MPI. As seen in Figure 1 the redistribution requires that all processors gather Np - 1 blocks of data of size n,~nz/N~ from the other N~ - 1 processors. Thus Np - 1 communication steps are needed for each of the Np processors. These can however be executed in Np - I parallel steps where all the processors at each step send to and receive from different processors. This can easily be achieved by letting each processor sending block (i + p) (rood Np) to the appropriate processor, p = 0, 1 , . . . , Np - 1, at step i. We have implemented this algorithm using point-to-point send and receive with a packing/unpacking of the blocks of the column- (row-) distributed data. In this case each block is sent as one item. This is implemented in a separate subroutine where non-blocking MPI-routines are used to minimize synchronization overhead. All that is needed to move from the sequential
864
Figure 1. Color coding shows the distribution of data to processors in the left and right part of the figure. The middle part indicates which blocks might be sent simultaneously (those with the same color). version of the code to the parallel one, is inserting the redistribution routine after returning to coordinate space and before transforming to new spectral space. The communication described above correspond to the collective communication routine MPI NLLTONLL provided each matrix block could be treated as an element. This is accomplished using the MPI-derived data type [6]. This approach is also implemented .... The algorithm is optimal in the sense that it sends the minimum amount of data and keeps all the processors busy all the time. What is beyond our control, is the optimal use of the underlying network. A well tuned vendor implementation of MPI_ALLTOALL might outpace our hand-coded point-to-point by taking this into account. In the parallel version the n z x n z coefficient matrix for the Legendre transform is replicated across all the participating processors. IO is handled by one processor only, and the appropriate broadcast and gather operations are used to distribute data to and gather data from all processors when needed. All remaining computational work, not covered here, is point-wise and can be carried out locally, regardless of whether the data are row- or column-splitted. For nr and nz being a multiple of Np the load is evenly distributed and any sublinear parallel speed-up can be contributed to communication overhead. 3. SEQUENTIAL OPTIMIZATION The time consuming parts of our computation are the forward and backward Fourier transform and the forward and backward Legendre transform. For the Fourier transform we do of course use the FFT-algorithm. We prefer to use vendor implemented FFT routines whenever they are available. But find it very inconvenient that no standard interface for these kernel routines are defined such as for BLAS [3]. This makes porting to new platforms more laborious than necessary. One possibility is to use the portable and self tuning FFTW-library [4, 5] and we have used this for the SGI Altix system and our IBM Pentium III cluster. The discrete Legendre transform is formulated as a matrix-vector product. When the transform is applied to multiple vectors these can be arranged in a matrix and we get a matrix-matrix product. For this purpose we use BLAS. A minor problem is that the transform matrix is a real matrix while the coeffisient vectors are complex. Thus we are faced with multiplying a real matrix with a complex matrix. The BLAS standard [3], however, require both matrices to be of the same type. There is two solutions two this incompatibility of datatypes. We can either split the complex matrix, B, in a real and complex part and do A = CB - C(X
+ iY) = CX + iCY
(8)
865 which requires two calls to DGEMM, or we can cast the coefficient matrix C to complex and do A = CB = (comp)(C)
(9)
9 t3
and make one call to ZGEMM. Note that C (The Legendre transformation matrix) is constant throughout the computation while/3 is new for each new transformation. Thus the splitting of B into real and imaginary part as well as the merging of X and Y to a complex A have to be done at each step, while a casting of C from real to complex is done once and for all before we start the time marching loop. Our timings on the IBM p690 Regatta system show a small advantage for the second
m
cA
104
trippel do-loop l complex matrices real matrices
---
8
. . . . . . . . trippel do-loop---~ complex .matrices I
,
. . . . . / . /
~o 103
.=_ ~9 101
g ._
.o_
/
oE
o
102
oE
o
§ 101
102 103 N u m b e r of grid points, n r, in angular direction.
104
Figure 2. Timings for the 3 versions of the matrix multiply as a function of n~. n z = 32
10101
102 Number of grid points, % in radial direction.
103
Figure 3. Timings for the 3 versions of the matrix multiply as a function of n z. n~ = 2048
strategy on small matrices. While working with real matrices appears to be slightly better for larger matrices. This probably reflect the fact that the arithmetic is less in the first case (8). However, this case is also more memory consuming. 4. E X P E R I M E N T A L RESULTS Our problem is computational demanding because many time steps are needed. Each time step is typically performed in 0.1-1.0 second on a single CPU. The outer loop over time steps is (as always) inherently sequential. Thus the parallelization have to take place within a time step. All arithmetic are embarrassingly parallel, provided the data is correctly distributed. To achieve this two global matrix "transpose" are needed in each step. The amount of data to be sent is n~n z, while the amount of computation is O ( n ~ n z ( l o g n~ + nz) ). We consider this to be medium grained parallelism, with a communication to computation ratio which should scale well up to a moderate number of CPU (20-30) for typical values of n~ (1000-10000) and n z (10-100). The code is written in Fortran 90, and MPI is used for message passing. We have run our test cases on the following 3 platforms:
866 9 A 32 CPU shared memory IBM p690 Regatta turbo with 1.3 GHz Power4 processor. ESSL scientific library version 3.3 and MPI that comes with the Parallel Operating Environment (POE) version 3.2 under AIX 5.1. 9 A 32 node Pentium III cluster with dual 1.27 GHz Pentium III nodes. 100 Mb switched Ethernet between nodes; Intel(R) 32-bit Compiler, version 7.0; MPICH 1.2.5; Intel Math Kernel Library for Linux 5.1 for BLAS; FFTW version 3.0. 9 8 CPU (virtual) shared memory, SGI Altix with 900 MHz Itanium 2. Inte164-bits Fortran Compiler version 7.1, Intel Math Kernel lib 5.2, SGI mpt 1.8, FFTW version 3.0. Doing reliable timing proved to be a big problem. The Regatta and the SGI Altix are both virtual shared memory systems, which in essence mean that they not only have a complicated memory structure, but there is no guaranty that all CPUs have their chunck of data laid out equally. The systems are also true multi-user systems in the sense that all processes from all users have in principal the same priority to all resources. On a heavily loaded system this means that some processes inevitably will loose the fierce battle for resources. When this happens to one process in a carefully load balanced parallel application, a devastating performance degradation happens at first synchronization point, where every one have to wait for the poor fellow who lost the battle. Our application needs to synchronize at each redistribution. In practice we found the elapsed runtime to be quite unpredictable on loaded systems. On rear occasions we might get the predicted runtime, but it was not unlikely to see a factor 2 in slowdown. Only when running on a dedicated system, timing become predictable, but even here differences of 10 % on identical runs were likely to happen. On the cluster the memory system of each node should be simpler and competition between processes did not take place within a compute node. However, network resources were subject to competition. On the cluster we observed the same unpredictability as on the other systems. Here it didn't help to have dedicated access to the system. We believe this to be a consequence of the interconnect. For our application this gets easily saturated and packets get lost. The ethernet protocol than requires the packet to be resent, bringing the network into a viscous circle of further saturation and higher packet losses. These problems should be kept in mind when reading and interpreting the reported results. In Figure 4 we report on the pure communication time for the 3 platforms with the two different modes of communication. The most obvious observation is the huge difference between the cluster and the two SMP-platforms. We conclude that "Fast Ethernet" is a contradiction in terms. You may either have fast interconnect or you may have Ethernet. But the two thing never coexist. A second observation is that for the MPICH on the cluster and for IBM's MPI on the Regatta, differences between our hand coded point-to-point and the MP I_ALLTOALL seem to be small, while on the SGI Altix the all-to-all seems to be substantial faster. In Figure 5 we show the speedup numbers for the 3 different platforms. The Regatta as well as the SGI Altix do very well on our test case, while the scaling on the cluster is quite poor. This all comes down to communication speed. Detailed timing show that the arithmetic scales linearly on all platforms. 
The communication does not scale that well, but as long as it only constitute a minor part of the total elapsed time, the overall scaling becomes quite satisfying, and this is the case on the regatta and the Altix.
867 SGI Altix [ .............Pentium III cluster -IBM p690 Regatta
-
101
104
z
~"~// 9
/~~ t-i
E E
8
SGI Altix P2P SGI Altix A2.A ............ Pentium III cluster .......... Pentium III cluster -IBM p690 Regatta - - IBM p690 Regatta -
-
,..
..,"
.,""
,.""
-
-
P2P A2A P2P A2A
//
103
10 0
Number of processor(s)
101
'"1o o
Number of processor(s)
10
~
Figure 4. Time spent on communication Figure 5. Speedup numbers for the 3 different for the different platforms and communica- platforms for the same problem. tion modes for a problem of size n r x n z = 4096 x 64. Point-to-point (P2P) communication is represented by solid lines, while dashed lines are used for M P I A L L T O A L L (A2A) ACKNOWLEDGEMENTS
It is with pleasure we acknowledge Jan Petter Hansen's patiently explanation of the salient features of the Schr6dinger equation to us. We are also grateful for Martin Helland's assistance with the computational experiments for the two versions of the matrix multiply.
REFERENCES
[1] B.H. Bransden and M.R.C. McDowell. Charge Exchange and the Theory of Ion-Atom Collisions. Clarendon, 1992. Jean-Claude Diels and Wolfgang Rudolph. Ultrashort laser pulse phenomena. Academic Press, 1996. [3] J. J. Dongarra, Jeremy Du Cruz, Sven Hammerling, and Iain S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1): 1-17, 1990. [4] M. Frigo. FFTW: An adaptive software architecture for the FFT. In Proceedings of the ICASSP Conference, volume 3, page 1381, 1998. [s] M. Frigo. A fast fourier transform compiler. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'99), 1999. [6] William Gropp, Ewing Lusk, and Antony Skjellum. USING MPI, Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994. [7] Mark R. Hermann and J. A. Fleck Jr. Split-operator spectral method for solving the time-dependent schr6dinger equation in spherical coordinates. Physical Review A, 38(12):6000-6012, 1988. [8] W. Ketterle, D. S. Durfee, and D. M. Stamper-Kurn. Making, probing and understanding bose-einstein condensates. In M. Inguscio, S. Stringari, and C. E. Wieman, editors, [2]
868 Bose-Einstein Condensation of Atomic GasesL Proceedings of the International School of Physics, "Enrico Fermi", Cource CXL. IOS Press, 1999. [9] Marlan O. Scully and M. Suhail Zubairy. Quantum Optics. Cambridge University Press, 1997. [ 10] Gilbert Strang. On the construction and comparison of difference scheme. SIAM Journal of Numerical Analysis, 5:506-517, 1968.
Minisymposium
Cluster Computing
This Page Intentionally Left Blank
Parallel Computing: SoftwareTechnology,Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 ElsevierB.V. All rights reserved.
871
D e s i g n a n d i m p l e m e n t a t i o n o f a 512 C P U c l u s t e r for g e n e r a l p u r p o s e supercomputing B. Vinter a aNordic DataGRID Facility, Odense, Denmark This paper describes the considerations when designing a 512 CPU cluster, the final design and the experiences from building and running the system. The cluster, Horseshoe, is the largest supercomputer in Denmark and services more than 400 researchers for their high-performance computing needs. The price/performance ration is a key element in every choice that is made and the need for high performance at a low cost is a driving motivation throughout the paper. 1. INTRODUCTION In 2001 it was decided that a group of researchers from the University of Southern Denmark, SDU, should apply for funding to build what would be the largest cluster, and fastest supercomputer, in Denmark. The group received a grant of 850.000 Euro to build a cluster that should have a size of 512 nodes. Such a large cluster for such a low price would require us to build from the same principles that are used when a department or a research-group builds it's own small cluster at a bare bones budget. At the time we were confident that this could be scaled to the desired size but nobody had ever described the design of such a large machine using only off the shelf components as a general purpose computer, a 1000 CPU machine had been described for a custom purpose however[2]. We put a large effort in the design of the system and finally got a working system, the path to which is described in this paper. First the design of the individual nodes is described, then the interconnection network and finally the construction of the systems and conclusions. 2. N O D E D E S I G N 2.1. Uni-Processors or Multi-Processors
Small multiprocessors, in particular dual-CPU machines, are often very attractive from a price perspective since the price of a dual machine if far less than that of two individual workstations. Previous work [1], has shown however that the performance of a cluster of multiprocessors increases by using more nodes while keeping the number of CPUs in the system constant. The reason for this is primarily that the CPUs within one machine has to compete for access to shared resources, most importantly memory and network. The competitions leads to increased waiting time for the CPUs and thus lower performance as shown in figure 1 that shows a LU decomposition using 32 CPUs on clusters of 2, 4 and 8 way SMPs. Since the CPU itself is much faster than memory the cost of sharing memory becomes very pronounced and we early on decided to use uni-processors.
Figure 1. Relation between SMP size and cluster performance (LU decomposition on 32 CPUs using clusters of dual-, quad- and oct-processor SMP nodes)

2.2. CPU choice
The first thing to establish when choosing a CPU architecture for a cluster is of course whether any of the target applications require a special architecture, which naturally limits the choice of architecture. In our case no such constraints existed, so we had a free choice among all CPUs. The primary design criterion then becomes a question of optimising the price/performance ratio with respect to a number of limiting and boundary conditions.

System size: the overall system size cannot grow too large, since this influences the physical size of the system, and since many applications will not scale beyond a fairly small number of CPUs, typically 16-32. This means that we cannot use the very cheapest CPUs, although they do have a set of desirable parameters, including a good ratio of CPU speed to memory and network speed.

Power consumption: the power consumption of modern CPUs varies widely, and the overall power consumption is important for two reasons. First of all, a large power consumption requires a larger air-conditioning system, which in turn leaves less money for computers, and secondly we have to pay for the electricity afterwards. In Denmark electricity costs 0.20 Euro/kWh, which means that a 100 W CPU costs approximately 175 Euro/year in electricity, or approximately 90,000 Euro/year for a 512 CPU system. So the difference between 100 W/system and 200 W/system is significant.

Below is a table with the contending CPUs and their specifications as of 2001. The price listed is for a complete system including 1 GB main memory, and the price/performance ratio is calculated from the floating point performance, since this is of most interest to us.
CPU         SpecInt   SpecFp   Price (Euro)   Price/Perf (SpecFp)
P3          454       292      700            2.4
P4          515       543      930            1.7
Athlon      496       426      670            1.6
Itanium     370       711      6670           9.4
Alpha       380       514      6670           13.0
Power604e   248       330      10670          32.3
Ultra III   377       326      4000           12.3
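To make the arithmetic behind the table and the electricity estimate explicit, the following small C sketch reproduces the two calculations; the figures (SpecFp scores, system prices, 0.20 Euro/kWh, 100 W per system) are simply the 2001 values quoted above, not new data.

#include <stdio.h>

/* Reproduces the price/performance and electricity-cost estimates
 * used in the text. All numbers are the 2001 figures quoted above. */
int main(void)
{
    const char *cpu[]     = { "P3", "P4", "Athlon", "Itanium", "Alpha" };
    const double specfp[] = { 292, 543, 426, 711, 514 };
    const double price[]  = { 700, 930, 670, 6670, 6670 };   /* Euro per system */

    for (int i = 0; i < 5; i++)
        printf("%-8s price/perf = %.1f Euro per SpecFp\n",
               cpu[i], price[i] / specfp[i]);

    /* Electricity: 100 W running continuously, 0.20 Euro/kWh, 512 nodes. */
    double kwh_per_year  = 0.100 * 24 * 365;          /* ~876 kWh per node  */
    double euro_per_node = kwh_per_year * 0.20;       /* ~175 Euro per node */
    printf("electricity: %.0f Euro/node/year, %.0f Euro/year for 512 nodes\n",
           euro_per_node, euro_per_node * 512);
    return 0;
}

Running this reproduces the roughly 175 Euro/year per node and 90,000 Euro/year for the full system quoted in the text.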
From the price/performance ratios it is evident that only the P4 and Athlon CPUs were realistic contenders for the CPU architecture. This holds notwithstanding the fact that when buying 512 machines the rebates that can be achieved on the more expensive systems will be much higher than on the lower priced systems. There is no doubt that the gap in price/performance ratio could be narrowed, but it would still be quite significant, no less than a whole factor.

Figure 2. Thermal throttling results on Gaussian Elimination (execution time versus run time at room temperatures between 20C and 50C)

Athlon had two problems at the time the decision was made: the thermal protection was based on sampling, which in extreme cases could allow the CPU to burn out, and none of the large vendors were backing Athlon CPUs. Thus, for security and for a choice of vendors, the Intel P4 was chosen.
2.2.1. Thermal throttling
A major concern with respect to the Intel Pentium 4 CPU was persistent rumours on various web sites that the thermal throttling, which is used to protect the CPU from overheating [8], was active most of the time, in effect reducing the performance to half of what was expected. To verify these claims ourselves we were given a Pentium 4 machine from Intel for testing. We ran various benchmarks, both micro-benchmarks and application benchmarks, at various room temperatures. The results were very uplifting: for all benchmarks running at temperatures up to, and including, 40C there were no signs of the thermal throttling being active; at 45C we start to see a performance decrease, and at 50C the decrease is pronounced. Figure 2 shows the results for the Gaussian Elimination benchmark; the complete set of results can be found at www.imada.sdu.dk/~hma/intel.
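A simple way to check for throttling oneself (a minimal sketch, not the benchmark code used by the authors) is to time repeated runs of a compute-bound kernel at a given room temperature; a sustained drop in per-iteration throughput as the machine heats up indicates that throttling has kicked in.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N 512          /* matrix size (illustrative)          */
#define ITERATIONS 20  /* repeated runs to expose throttling  */

/* Wall-clock time in seconds. */
static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Plain Gaussian elimination (no pivoting) on a diagonally dominant matrix. */
static void eliminate(double *a)
{
    for (int k = 0; k < N; k++)
        for (int i = k + 1; i < N; i++) {
            double f = a[i * N + k] / a[k * N + k];
            for (int j = k; j < N; j++)
                a[i * N + j] -= f * a[k * N + j];
        }
}

int main(void)
{
    double *a = malloc(sizeof(double) * N * N);
    for (int it = 0; it < ITERATIONS; it++) {
        for (int i = 0; i < N * N; i++)
            a[i] = (double)rand() / RAND_MAX + 1.0;
        for (int i = 0; i < N; i++)
            a[i * N + i] += N;           /* make the matrix well conditioned */
        double t0 = now();
        eliminate(a);
        printf("iteration %2d: %.3f s\n", it, now() - t0);
    }
    free(a);
    return 0;
}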
2.3. Memory considerations
The memory architecture was next to be decided. At the time of the decision RAMBUS memory (RDRAM) was very popular, while ECC memory was the standard for servers and workstations and DDR RAM the standard solution for PCs. ECC RAM was easily eliminated: a system of the proposed size will have nodes failing once in a while even with ECC RAM, so we had to design for fault tolerance anyway, and ECC RAM would only result in fewer such faults - but at a high price. RDRAM was much faster on paper; in benchmarks we got on the order of 10% better performance, and this improvement did not justify a cost of approximately four times that of DDR RAM. In addition, the use of RDRAM over DDR RAM would result in an increased power consumption of 12 W/GB. Thus standard DDR RAM was chosen.
2.4. System board
The considerations for system boards were fairly simple. Less is more: all we need is simple graphics, a hard-disk controller and an onboard network interface. The board must be supported by Linux - and remain supported for at least 3 years. Low power is also a highly preferable feature. A large number of chipsets matched our requirements; however, the Intel i845 based boards provided all the features in one chipset, which has the nice property that all we need is support for the i845, and not specific support for another NIC or HD controller. At the same time the i845 was priced very attractively compared to similar chipsets.
2.5. Casing
The casing of the machines is quite important when you want to put 512 of them in the same room! Correct airflow is all-important, as the machines will be stacked very closely; an incorrect airflow would mean overheating of the system. In addition to good airflow we needed the machines to be small in order to fit 512 of them into the room allocated for the machine. Rack-mounted machines would be the obvious choice for this, but rack machines do have a number of problems in our scenario:
- None of the large vendors provides the chosen CPU/memory/system-board configuration in rack mounting.
- Rack cases are quite expensive.
- The cluster has a contracted lifetime of 3 years; after this time it would be nice to use the machines for students, which is hard if they are rack-mounted.
2.6. Node configuration
The resulting nodes are from Compaq (now HP). The machines use the Intel i845 chipset and have a 40 GB hard drive for local scratch space, 1 GB main memory, a 2 GHz Intel P4 Northwood* CPU and an onboard Intel Ethernet Express Pro 100 NIC.

*The 0.13 micron Northwood technology uses less power than the earlier 0.18 micron P4 CPUs.

3. NETWORKING
In a cluster, of course, the interconnection network is immensely important! Most large clusters use a custom cluster interconnect to ensure high performance. These interconnects, including cLAN [5], Myrinet [3], Scalable Coherent Interface [4] and Quadrics [7], all provide high-bandwidth, low-latency communication. Unfortunately they are also all quite expensive; below is a table listing the performance and pricing in 2001, when we had to choose our interconnect.

Interconnect   Speed        Price ($)
FastEther      100 Mb/s     50
GbEther        1 Gb/s       700
cLAN VIA       1.25 Gb/s    1200
Myrinet        2 Gb/s       2000
SCI            6.4 Gb/s     2500
Quadrics       2.7 Gb/s     4000
It is obvious that FastEther has at least an order of magnitude poorer bandwidth than the alternatives; however, Gb Ethernet is close to the price of a node, and the other interconnects are much more expensive than a compute node. Thus, in order to justify the price of Gb Ethernet, our applications would have to double their measured performance when running on Gb Ethernet compared to FastEther. An experiment on a set of the most used applications shows that they run at an average of 80% CPU utilization, which easily shows that doubling their performance is impossible - thus we chose FastEther for the interconnect.
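The 80% utilization argument is essentially an Amdahl-style bound: if only 20% of the wall-clock time is spent communicating, then even an infinitely fast interconnect can speed the application up by at most 1/0.8 = 1.25, far short of the factor of two needed to pay for Gb Ethernet. A minimal sketch of this calculation (the 80% figure is the measurement quoted above; the rest is generic):

#include <stdio.h>

/* Upper bound on overall speedup when only the communication part
 * (fraction comm = 1 - cpu_utilization) is accelerated by 'factor'. */
static double bound(double cpu_utilization, double factor)
{
    double comm = 1.0 - cpu_utilization;
    return 1.0 / (cpu_utilization + comm / factor);
}

int main(void)
{
    double u = 0.80;                                  /* measured CPU utilization */
    printf("10x faster network: at most %.2fx speedup\n", bound(u, 10.0));
    printf("infinitely fast network: at most %.2fx speedup\n", bound(u, 1e9));
    return 0;
}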
Figure 3. Network topology for 480 CPUs
3.1. Switch and topology considerations
Once Fast Ethernet was selected we needed to find a network topology that would span 512+ machines. There was no single switch with 512 ports, so a multi-level switch hierarchy was needed in any case. One approach is to keep the number of switches at a minimum and interconnect all switches directly. This approach ensures that we pass through only two levels of switches, which gives the lowest possible latency. Unfortunately, the channels between the switches could not be larger than 1 Gb at that time, which would mean that a large switch would result in a large number of connections sharing the same Gb connection to another switch - a scenario that could be a problem for many applications. The alternative is to accept a third level of switches, which results in poorer latency but much better aggregate bandwidth. The price becomes a crucial parameter once again: the design using three levels of switches is much cheaper than the two-level design, it also eliminates the concerns about bandwidth limitations, and the extra latency can to a large extent be handled in software. Most large switches are layer-3 switches, which include a large set of control features that are not needed for a topology as simple as ours. Since we would need several levels of switches anyway, from a performance and price perspective we chose to use small, non-managed switches, each connected via Gb Ethernet to a Gb Ethernet backbone switch. The choice of switches became the HP2324 for the FastEther switches, which are then connected to an HP4108 as the Gb backbone. The layout for 480 nodes is shown in figure 3.

4. BUILDING AND INSTALLING THE SYSTEM
The shelves and electricity were set up by professionals, and after that a group of seven students took over. They set up the BIOS on each machine (this could be done by reading a configuration from a floppy disk), labelled the machine, shelved it and made an Ethernet cable to the correct switch. Overall, the construction of the machine took the seven students less than 25 hours, including assembling more than 520 Ethernet cables! The Linux operating system was installed by installing one machine with the Debian distribution of our choice and setting that one machine up as all nodes should be configured. The remaining machines then had Linux installed using ka-boot [6].

Figure 4. The final system

5. UNFORESEEN PROBLEMS
Building and setting up the cluster presented few real problems. The most important problems we faced were power and cooling issues. The electricity installation was designed to handle twice the theoretical power usage, partly to be on the safe side, partly to prepare for more power-consuming machines when the cluster is replaced. As it turned out, this was not enough to power the 512 switch-mode power supplies after a power failure. The problem was solved by increasing the power capacity by another 30%, which could be done by a fuse upgrade. Cooling was another problem: the machines we received had two types of hard drives (we had made a positive list of hard drives that were acceptable), and one of these runs 12C hotter than the other! Another problem turned out to be that the 62 kW air-conditioning unit has an 8 kW fan motor, and the air speed from it is too high for the machines to draw in the cold air. So while there was sufficient cooling, some of the machines were running slightly hotter than anticipated. This was fixed by adding some diffusion cylinders to the air-conditioning exit.

6. CONCLUSIONS
We have succeeded in designing and building a 512 CPU cluster computer, which is a fully functional production machine. The theoretical performance of the machine is 4 Tflops, which is quite impressive considering the total budget of 850.000 Euro! The machine has been such a success that it has been extended by an additional 140 machines. The extension is based on 2.66 GHz P4's and uses Gb Ethernet rather than Fast Ethernet, for two reasons: first of all, a specific set of our users had problems running their applications efficiently over the 100 Mb/s network, and secondly because now, a year later, Gb Ethernet had become much cheaper than at the time of the original design.

ACKNOWLEDGEMENTS
A large number of people have been involved in building and running this cluster. Apart from a number of graduate students who did all the actual work, Frank Jensen has at all times ensured that the design reflected what the users need, and Brian Truelsen and Claus Jeppesen are solely responsible for the ungrateful task of keeping the system up and running and the users happy - they do a great job!

REFERENCES
[1] B. Vinter, O. Anshus, T. Larsen and J. Bjorndalen, Using Two-, Four- and Eight-Way Multiprocessors as Cluster Component, Proceedings of the 2nd Communicating Process Architectures, pp. 129-146, IOS Press, 2001.
[2] Genetic-Programming.com, 1,000-Pentium Beowulf-Style Cluster Computer for Genetic Programming, www.genetic-programming.com/machine1000.html, 1999.
[3] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic and W. Su, Myrinet: A Gigabit-per-Second Local Area Network, IEEE Micro, pp. 29-38, Feb. 1995.
[4] K. Alnaes, E.H. Kristiansen, D.B. Gustavson and D.V. James, Scalable Coherent Interface, CompEuro 90, 1990.
[5] Giganet, Giganet cLAN, www.giganet.com/products/.
[6] Cyrille Martin and Olivier Richard, Parallel launcher for clusters of PCs, ParCo 2001.
[7] Quadrics, www.quadrics.com.
[8] Stephen H. Gunther, Frank Binns, Douglas M. Carmean and Jonathan C. Hall, Managing the Impact of Increasing Microprocessor Power Consumption, Intel Technology Journal Q1, 2001.
Experiences Parallelizing, Configuring, Monitoring, and Visualizing Applications for Clusters and Multi-Clusters

O.J. Anshus a, J.M. Bjorndalen a, and L.A. Bongo a

aDepartment of Computer Science, University of Tromso, Norway

To make it simpler to experiment with the impact different configurations can have on the performance of a parallel cluster application, we developed the PATHS system. PATHS uses "wrappers" to provide a level of indirection to the actual run-time location of data, making the data available from wherever threads or processes are located. A wrapper specifies where data is located, how to get there, and which protocols to use. Wrappers are also used to add or modify methods accessing data. Wrappers are specified dynamically. A "path" is comprised of one or more wrappers, and sections of a path can be shared among two or more paths. By reconfiguring the LAM-MPI Allreduce operation we achieved performance gains of 1.52, 1.79, and 1.98 on two-, four- and eight-way clusters, respectively. We also measured the performance of the unmodified Allreduce operation when using two clusters interconnected by a WAN link with 30-50 ms round-trip latency. Configurations which resulted in multiple messages being sent across the WAN did not add any significant performance penalty to the unmodified Allreduce operation for packet sizes up to 4 KB; for larger packet sizes the performance of the Allreduce operation deteriorated rapidly.

To log and visualize the performance data we developed EventSpace, a configurable data collection, management and observation system used for monitoring the low-level synchronization and communication behavior of parallel applications on clusters and multi-clusters. Event collectors detect events, create virtual events by recording timestamped data about the events, and then store the virtual events in a virtual event space. Event scopes provide different views of the application by combining and pre-processing the extracted virtual events. Online monitors are implemented as consumers using one or more event scopes. Event collectors, event scopes, and the virtual event space can be configured and mapped to the available resources to improve monitoring performance or reduce perturbation. Experiments demonstrate that a wind-tunnel application instrumented with event collectors has insignificant slowdown due to data collection, and that monitors can reconfigure event scopes to trade off between monitoring performance and perturbation. The visual views we generated allowed us to detect anomalous communication behavior and load balance problems.

1. INTRODUCTION
A current trend in parallel and distributed computing is that compute- and I/O-intensive applications are increasingly run on cluster and multi-cluster architectures. As we add computing resources to a parallel application, one of the fundamental questions is how well the application scales.
There are two main ways of scaling an application when processors are added: speedup, where the goal is to solve a problem faster, and scaling up the problem, where the goal is to solve a larger problem (or get a more fine-grained solution to a given problem) in a fixed time by adding computing resources (see also Amdahl [1] vs. Gustafson [13]). As the complexity and problem size of parallel applications and the number of nodes in clusters increase, communication performance becomes increasingly important. Of eight scalable scientific applications investigated in [29], most would benefit from improvements to MPI's collective operations, and half would benefit from improvements in point-to-point message overhead and reduced latency.

Scaling an application when it is mapped onto different cluster and multi-cluster architectures involves controlling factors such as balancing the workload between the processes in the system, controlling inter-process communication latency, and managing interaction between the processes. In doing so, one of the main questions is understanding how an application is mapped to the given architecture. This requires an understanding of which computations are done where, where data is located, and when control and data flow through the system. We show that the performance of collective operations improves by a factor of 1.98 when using better mappings of computation and data to the clusters [4]. Point-to-point communication performance can also be improved by tuning configurations.

In order to tune the performance of collective and point-to-point communication, fine-grained information about the application's communication events is needed to understand how the application behaves. An example of this is the Message Passing Interface (MPI) [17] collective functions, which have scaling problems if the algorithms of the functions are not mapped properly to the cluster architecture. Understanding why the functions do not scale as well in some configurations as in others is difficult without exact knowledge of how the functions are implemented and mapped to the cluster architecture. Even if we discover the reason for the scaling problems, we may not be able to remedy the problem without modifying the source code of the communication library, as the mechanisms intended to aid the mapping of the application to the cluster are either insufficient or not implemented [28]. In other cases, implementations of a communication layer or middleware may not have been tested on a cluster of the same size or configuration as the cluster an application is deployed on. This may expose new problems which also require intimate knowledge of the implementation to find and resolve [2].
2. PATHS
A middleware system or communication library usually provides the user with an API that abstracts away low-level details about how communication is implemented and where objects and shared data are located. Although this simplifies programming distributed applications, it makes it difficult for the programmer to determine why some of the functionality provided by the API does not scale well, and where bottlenecks occur. Even if the user discovers a bottleneck or the reason for scaling problems, which may require an intimate understanding of the API implementation [2], it may be difficult or impossible to solve the problems without modifying the implementation.

PATHS allows configuring and mapping an application's communication paths to the available resources. PATHS uses the concept of wrappers to add code along the communication paths, allowing for various kinds of processing of the data along the paths. PATHS uses the PastSet [30] distributed shared memory system; in PastSet, tuples are read from and written to named elements. The PATHS system provides a method of specifying how the application is mapped onto clusters, focusing on the location of computations and data. The specification can be changed by modifying meta-data and meta-code, allowing an application's mapping to be studied, tuned, and remapped to a given cluster without recompiling the application code. PATHS allows the user to specify what type of instrumentation should be used where, and the user can also add new types of instrumentation to the system. PATHS simplifies studying how different configurations influence the performance of an application when it is mapped onto a cluster or multi-cluster, and thus simplifies experimenting with multiple factors and configurations.
3. EXPERIENCES MAPPING APPLICATIONS TO DIFFERENT CLUSTERS AND MULTI-CLUSTERS
We have experimented with the reconfiguration of several benchmarks, including global reduction, Monte Carlo Pi, a wind tunnel, video distribution, and the ELCIRC river simulation of the Columbia River. The benchmarks were typically mapped onto three different clusters, consisting of 2-, 4- and 8-way SMP nodes. We also ran a multi-cluster configuration where the PATHS system was used to bind the three clusters together, as the nodes in each of the clusters did not have direct connectivity to nodes in the other clusters. Space limitations prevent us from describing the experiments and their results here in any detail.

Global Reduction and Monte Carlo Pi: We did experiments with two benchmarks: the global reduction benchmark (a benchmark of PastSet's equivalent of MPI Allreduce), and the Monte Carlo Pi benchmark (an embarrassingly parallel benchmark). Experimenting with multiple factors and configurations can help expose the factors that are most important in a particular cluster, and which combinations of factors lead to performance bottlenecks that should be avoided. An example is [5], where the sum wrapper was found to scale badly and become a performance bottleneck once it was used by more than 3 or 4 threads. This resulted in more time being spent in the sum wrapper than in sending messages between the 8-way nodes. Performance can be improved significantly without removing or modifying components: instead of rewriting the implementation of the sum operations, the sum operators were arranged hierarchically, allowing groups of 3-4 threads to compute a partial sum which was then forwarded to a sum operator further down in the hierarchy. This improved the performance of computing partial sums from the threads on a node significantly; a sketch of this kind of hierarchical reduction is shown below.

For some configurations, sending more messages on the network than the minimum required may improve performance. One of the reasons for this is that there is sometimes a trade-off between the number of messages sent and the parallelism in handling these messages. An example is [5], where sending more messages than the minimum required to implement a global reduction sum reduced the latency by nearly a factor of 2 (on the 2W cluster) compared to sending the minimum number of messages. Different clusters may need different configuration strategies: the results in [5] show that a strategy which performed best in one cluster was not the best strategy for another cluster.
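The following is a minimal MPI sketch of such a hierarchical (two-level) reduction, not the PATHS/PastSet implementation itself: ranks first reduce within their node, then the node leaders combine the partial sums and the result is broadcast back. The constant PROCS_PER_NODE and the assumption that ranks are numbered node by node are made only for this illustration.

#include <stdio.h>
#include <mpi.h>

#define PROCS_PER_NODE 4   /* assumption: ranks are placed node by node */

/* Two-level global sum: intra-node reduce, inter-node allreduce among
 * node leaders, then intra-node broadcast of the final result. */
static double hierarchical_sum(double local, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Communicator for the ranks sharing a node. */
    MPI_Comm node;
    MPI_Comm_split(comm, rank / PROCS_PER_NODE, rank, &node);

    int node_rank;
    MPI_Comm_rank(node, &node_rank);

    /* Communicator containing only the node leaders (node_rank == 0). */
    MPI_Comm leaders;
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    double partial = 0.0, total = 0.0;
    MPI_Reduce(&local, &partial, 1, MPI_DOUBLE, MPI_SUM, 0, node);

    if (leaders != MPI_COMM_NULL) {
        MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, leaders);
        MPI_Comm_free(&leaders);
    }
    MPI_Bcast(&total, 1, MPI_DOUBLE, 0, node);
    MPI_Comm_free(&node);
    return total;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double sum = hierarchical_sum((double)rank, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %.0f\n", sum);
    MPI_Finalize();
    return 0;
}

The point of the two-level structure is the same as in the text: only one message per node crosses the interconnect, while the intra-node partial sums are computed in parallel.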
The Wind tunnel application is a Lattice Gas Automaton particle simulation; details of our experiments can be found in [3] and [8,9]. The application has linear speedup on each of the 3 clusters, and for a multi-cluster configuration using both Tromso clusters. Combining the 2W cluster, which is located in Odense, Denmark, with any of the Tromso clusters gave us sublinear speedups. We tried a number of experiments and located some of the factors contributing to the bottlenecks, but have yet to identify them all.

Video distribution: By careful configuration we managed to support 2016 clients without dropping frames [3].

LAM-MPI Configuration: For efficient support of synchronization and communication in parallel systems, these systems require fast collective communication support from the underlying communication subsystem. One approach is defined by the Message Passing Interface (MPI) standard [17]. Among the set of collective communication operations, broadcast is fundamental and is used in several other operations such as barrier synchronization and reduction [19]. Thus, it is advantageous to reduce the latency of broadcast operations on these systems. In one of the experiments we did [31], we used PATHS to improve the execution times of the collective Allreduce operation in the PastSet tuple space system. To get a baseline, we compared this with the performance of the equivalent operation in LAM-MPI (Allreduce). We found that we could configure the PastSet Allreduce to be 1.83 times faster than LAM-MPI [14,24]. The reason for this is that MPI uses static trees to reduce and broadcast the values. If these trees are not well matched to the cluster topology, performance will suffer, as can be seen in figure 1, where many messages are sent across nodes in the system. The broadcast operation has a better mapping for this cluster, though.

Figure 1: Log-reduce tree for 32 processes mapped onto 8 nodes.

By adding an experimental configuration system to LAM-MPI we were able to first replicate the performance of LAM-MPI, showing that adding our configuration system did not impact the performance of LAM-MPI when running identical configurations. Then, experiments were conducted to try to find configurations which improved the performance compared to the original mappings [4]. Simple changes to the configuration resulted in performance matching the PastSet Allreduce.
4. MONITORING
In this section we describe EventSpace [9,7], an approach and a system for online monitoring of the communication behavior of parallel applications. For low-level performance analysis [8,27] and prediction [32,11], large amounts of data may be needed. For some purposes the data must be consumed at a high rate [25]. When the data is used to improve the performance of an application at run-time, low perturbation is important [22,26]. To meet the needs of different monitors the system should be flexible and extensible. Also, the sample rate, latency, perturbation, and resource usage should be configurable [22,11,27]. Finally, the complexity of specifying the properties of such a system must be handled. The EventSpace system is designed to scale with regard to the number of nodes monitored, the amount of data collected, the data observation rate, and the introduced perturbation. Complexity is handled by separating instrumentation, configuration, data collection, data storage, and data observation.

Our approach, illustrated in figure 2, is to have a virtual event space that contains traces of an application's communication (including communication used for synchronization). Event scopes are used by consumers to extract and combine virtual events from the virtual event space, providing different views of an application's behavior. The output from event scopes can be used in applications and tools for adaptive applications, resource performance predictors, and run-time performance analysis and visualization tools. When triggered by communication events, event collectors create a virtual event and store it in the virtual event space. A virtual event comprises timestamped data about the event.

Figure 2: EventSpace overview.

In EventSpace, an application is instrumented when the configurable communication paths are specified. Event collectors are implemented as PATHS wrappers integrated in the communication system. They are triggered by PastSet operations invoked through the wrapper. The virtual event space is implemented using PastSet; there is one trace element per event collector. We have also experimented with 3D visualizations in VRML. Initial experiments suggest that this may be a useful method of visualizing an application's behaviour. Extensive 3D visualizations can be found in the Avatar system [20,21].
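As an illustration of the event-collector idea (a hypothetical sketch, not the actual PATHS/PastSet wrapper API), a wrapper can time an operation it forwards and append a timestamped virtual event to a per-collector trace:

#include <stdio.h>
#include <sys/time.h>

/* A virtual event: timestamps recorded around one communication operation. */
struct virtual_event {
    double enter, leave;   /* seconds since the epoch */
    const char *op;        /* name of the wrapped operation */
};

#define TRACE_LEN 4096
static struct virtual_event trace[TRACE_LEN];  /* one trace per collector */
static int trace_count;

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Event-collector wrapper: forwards the call to the next stage of the
 * path and records a timestamped virtual event around it. */
static int collect(const char *opname, int (*next)(void *), void *arg)
{
    struct virtual_event ev = { now(), 0.0, opname };
    int result = next(arg);                 /* the wrapped operation */
    ev.leave = now();
    if (trace_count < TRACE_LEN)
        trace[trace_count++] = ev;          /* store into the trace  */
    return result;
}

/* Example wrapped operation standing in for a PastSet read or write. */
static int dummy_op(void *arg)
{
    (void)arg;
    return 0;
}

int main(void)
{
    collect("dummy_op", dummy_op, NULL);
    printf("%s took %.1f us\n", trace[0].op,
           (trace[0].leave - trace[0].enter) * 1e6);
    return 0;
}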
5. EVENTSPACE MONITORING EXPERIMENTS
To demonstrate the feasibility of the approach and the performance of the EventSpace prototype, we monitored a wind tunnel application running on the 4W and 8W clusters. We used eight matrices, each split into slices, and the slices were split between 140 threads. Using PastSet, each thread exchanged the border entries of its slices with the threads processing neighbor slices. We scaled the problem size to fill the 128 MB memory on each of the 4W nodes. Slices were exchanged approximately every 300 ms (and thus virtual events are produced at about 3.3 Hz).

Event Collecting Overhead: The overhead introduced to the communication path by a single event collector wrapper is measured by adding event collector wrappers before and after it. The average overhead, calculated using the timestamps recorded in these wrappers, is 1.1 µs on a 1.8 GHz Pentium 4 and 6.1 µs on a 200 MHz 8W node. This is comparable to the overheads reported for similar systems [22,27]. The slowdown due to data collection is insignificant.
Figure 3. Communication and computation times. a) The 'four threads per CPU' configuration (96 threads in total); b) part of figure 3a) (250th step).
Event Scope Overhead: We measured how fast virtual events can be observed from the virtual event space, and the slowdown the wind-tunnel application experiences as a result. The rate at which virtual events can be extracted from the virtual event space decreases from 2000 Hz to 2 Hz as the number of concurrent extractions increases from one to around 2000. The amount of data extracted goes from 36 bytes to about 79 KB. The slowdown of the wind tunnel application is always insignificant; for all experiments the average execution time for the wind tunnel is 612 ± 14 seconds. We concurrently consumed events located on different nodes. When consuming sequentially from all nodes, there is no difference in sample rates and slowdown. However, when more processing is added to the communication paths, the concurrent version is faster, because we then have better overlap of communication and computation.

The event scopes used to monitor the collective operation resulted in a slowdown of 1.17. We discovered that an event scope actually perturbed the wind-tunnel more than the compute-intensive monitoring threads running on each node of the clusters. By reconfiguring the event scopes to use fewer resources the slowdown was reduced to 1.08, but the trade-off was a 50% reduction in the observation rate. When running four monitors concurrently, the slowdown was about the same as the largest slowdown caused by a single monitor. However, for the consumers, the observation rates were reduced by 10-50%.

6. VISUALIZATION EXPERIMENTS
In this section we provide examples of how the data in a virtual event space can be used for analyzing the communication behavior of the wind tunnel parallel application. We used the same clusters as before, two in Tromso (4W, 8W) and one (2W) in Odense in Denmark. Communication between Tromso and Odense goes over the departments' Internet backbone, with about 30-40 ms latency and a widely varying bandwidth (well below 34 Mbit/s). The wind-tunnel had linear scalability when run on the 4W cluster, and the perturbation due to monitoring could not be measured. We observed that when 200 steps were executed, having 4 threads per CPU (each thread did an equal amount of work) was about 5% faster than having 1 thread per CPU (the same problem size was used for both configurations). When the number of steps was increased to 500, they were equally fast.

In figure 3a) there is one horizontal bar for each thread, indicating when it is using the communication system (black) and when it is computing (light gray). On the horizontal axis elapsed time is shown. We can see thick black stripes starting at the lower threads and going upward. By using the step information displayed when pointing at a bar, we can see that most threads are one step ahead of the thread below (neighbors can only be one step apart). By looking at the completion times (where the bars end) we can see a wavefront, where the threads higher up finish earlier than the threads further down. By highlighting some steps we found that after 100 steps the threads had gotten roughly equally far, after 200 steps we could see the wavefront shape starting to emerge, and after 400 steps it was clearly visible. An emerging wavefront is shown in figure 3b), where the 250th step is highlighted. In the background the same black stripes as seen in figure 3a) are shown. The threads above the stripe are not affected by the wavefront.

Figure 4: Read operation times for worker thread 20, on elements from the thread above (solid), and from the thread below (dotted).

Figure 4 shows how thread 20 changes, at around the 400th step, from spending most of its time waiting for data from the thread above (solid line) to waiting for data from the thread below (dotted line). By highlighting the 400th step in the communication-computation view, we can see that this is where the wavefront hits worker thread 20. Also, the time per step is slightly increased after the 400th step. The four-threads-per-CPU configuration slows down as the computation proceeds, due to the communication behavior of the wind-tunnel application.
7. RELATED WORK
There are many approaches and tools for parallelizing, configuring, monitoring, and visualizing the behavior of parallel applications. We do not have the space here to go into details and contrast existing work with ours; we have given several references earlier in the paper. Other interesting approaches are: performance analysis tools [16,27], NetLogger [27], a firmware based distributed shared virtual memory (DSVM) system monitor for the SHRIMP multicomputer [15], network performance monitoring tools [11,18], Infopipes [6], Gscope [12], the Prism debugger for MPI [23], the JAMM monitoring sensor management system [26], and the Network Weather Service [32]. Other monitoring systems, including configurable ones, are described in [11,22,27,32].

8. CONCLUSIONS
Small and simple changes to a configuration influence scaling and latency. It is hard to find good configurations analytically or by computation. Instead, it is demonstrated that by starting with what is believed to be a good configuration, one can run a number of experiments using the PATHS system to identify configurations with better performance.

The LAM-MPI implementation [10] of the MPI standard is documented to, out of the box, use configurations which do not scale well. A configuration mechanism has been added to LAM-MPI and used to double the performance of the MPI Allreduce function. Thus, the ability to tune the configuration with knowledge about the application and the cluster topology, as opposed to relying on the implementation to make important optimization choices, is found to be important.

To get information about the behavior of an application we developed EventSpace on top of PATHS. Using data from EventSpace we have visualized the behavior of several applications, including a wind tunnel, and used it to find communication and load bottlenecks. We find that the ability to configure, monitor and visualize the behavior of parallel applications is very useful when experimenting to achieve better performance.

REFERENCES
[1] AMDAHL, G. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. In Proceedings of the AFIPS Spring Joint Computer Conference, Atlantic City, New Jersey, USA (1967), AFIPS Press, Reston, Virginia, USA, pp. 483-485.
[2] ARPACI-DUSSEAU, A. C., ARPACI-DUSSEAU, R. H., CULLER, D. E., HELLERSTEIN, J. M., AND PATTERSON, D. A. Searching for the Sorting Record: Experiences in Tuning NOW-Sort. In Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT 98), USA (1998), pp. 124-133.
[3] BJORNDALEN, J. M., ANSHUS, O., LARSEN, T., BONGO, L. A., AND VINTER, B. Scalable Processing and Communication Performance in a Multi-Media Related Context. Euromicro 2002, Dortmund, Germany (September 2002).
[4] BJORNDALEN, J. M., ANSHUS, O., VINTER, B., AND LARSEN, T. Configurable Collective Communication in LAM-MPI. Proceedings of Communicating Process Architectures 2002, Reading, UK (September 2002).
[5] BJORNDALEN, J. M., ANSHUS, O., LARSEN, T., AND VINTER, B. PATHS - Integrating the Principles of Method-Combination and Remote Procedure Calls for Run-Time Configuration and Tuning of High-Performance Distributed Application. In Norsk Informatikk Konferanse (Nov. 2001), pp. 164-175.
[6] BLACK, A. P., HUANG, J., KOSTER, R., WALPOLE, J., AND PU, C. Infopipes: An abstraction for multimedia streaming. Multimedia Systems, special issue on multimedia middleware (2002), Volume 8, Issue 5, pp. 406-419, Springer Verlag.
References 7-32 are available at http://www.cs.uit.no/forskning/DOS/hpdc/papers/2003/parcorelwork.pdf.
Cluster Computing as a Teaching Tool

O.J. Anshus a, A.C. Elster b, and B. Vinter c

aUniversity of Tromso, Norway
bNorwegian University of Science and Technology, Norway
cUniversity of Southern Denmark, Denmark

This paper summarizes some of our experiences from designing and teaching five different courses on operating systems and parallel computing in which we use clusters. The courses were taught over the last 3-5 years at several universities, including the University of Texas at Austin, the Norwegian University of Science and Technology, the University of Tromso, the University of Southern Denmark, and the University of Oslo. We have designed and taught several courses on operating systems and parallel programming. Typically, we have found that the students need introductory courses on computer architecture and operating systems to benefit the most from a course on parallel computing using clusters. As a result, all three authors have independently ended up needing a package of three courses: computer architecture, operating systems, and parallel computing, in that order.

1. INTRODUCTION
Clusters are sets of loosely coupled computers administered and employed in parallel to solve shared tasks. Clusters of "Common Off-The-Shelf" (COTS) components constitute a promising class of cost-effective parallel computers, allowing continued harvesting of advances in processor and system technologies. Several factors interfere with the efficient utilization of clusters for parallel computing. The parallelization strategy used by an application could result in load imbalances due to unequal load assignment or excess communication among some nodes. In addition, load imbalances can arise due to hardware inequalities: it is very likely that a cluster, and even more so several clusters, is made up of old and new hardware as machines are upgraded. Additionally, multiprogramming in combination with independent scheduling decisions on each of the nodes can result in additional slowdown of a parallel application trying to take advantage of the distributed autonomous resources. All of these issues are behind our focus on defining and teaching very hands-on courses. In real usage, the applications and clusters behave in more complex ways than we anticipate. Our approach is therefore primarily experimental, in order to capture the complex interactions between real workloads and modern technology. We teach the students to design, implement, measure, and evaluate software and hardware. Exercises include matrix multiplication, distributed image manipulation, and implementation of distributed shared memory.
As an example of a larger project, one of the authors has developed, together with a colleague, John Markus Bjorndalen, and students, a distributed and parallel robot system. We use it as an instrument to demonstrate the principles and practice of distributed computing in general, and high-performance parallel computing in particular. The students use and program computers with different performance characteristics, and must learn to utilize the resources in a distributed system in practice. In particular, they must find ways of utilizing a compute cluster to support the (Lego) robots with compute performance. The robots need processing cycles to solve the given tasks, including finding the position of other robots when their position has been encrypted to avoid detection. The system has been used in three robot competitions. The system is simple enough to be understood by students, but complex enough to provide an interesting distributed system with specialized services, including a cluster for high performance computing. The system is visual, and invites both students of the course and other students to speculate on how it is done.

One of the authors, in addition to covering traditional operating system issues, uses MPI in her operating systems course to teach synchronization issues, and the concept of a cluster is introduced already at this level. Another author has, together with a colleague, Tore Larsen, and students, developed an operating systems course where the students, through 5-6 projects, are guided through the development and implementation of a small working operating system, from the boot block to a system with demand paging. In this course well known synchronization mechanisms like locks, semaphores, monitors, and message passing are implemented. However, the mechanisms are used only to some extent due to time limitations. Building on the operating system courses and the insight the students have gained from working with the internals of fundamental abstractions, the follow-up parallel computing courses can start with more actual use of these abstractions in applications.

All three authors teach parallel computing courses using a mixture of models, problems to be solved, and platforms to be utilized, including bag of tasks, divide and conquer, optimization issues, shared memory architectures, message passing approaches (PVM, MPI), and more advanced memory abstractions (remote memory, associative memory, virtual and distributed shared memory). The hardware architectures used include SMPs, SIMD, and CC-NUMA. Languages and systems used include Linux, BrickOS, PVM, LAM-MPI, MPICH, Java-MPI, Pthreads, Java, Java RMI, C, Linda, and TSpaces.

We have found that the clusters used should comprise a minimum of eight processors, and preferably sixteen or more. This is because many problems scale satisfactorily from two to eight processors, while sixteen often clearly shows a flattening curve with less and less gain from using more and more processors. We have also found it interesting to demonstrate for the students the effect of going from uniprocessors to two-, four- and eight-way systems. The trade-offs between processor cycles, processor-memory busses, I/O busses, interconnects, and networks can easily be experienced. We have used both 100 Mbit Ethernet and gigabit networks (Giganet, Myrinet).
Actually being able to let the students experience for themselves the performance impact of using polling vs. blocking communication primitives makes it easier to discuss the overhead penalties introduced by interrupt handling and by starting blocked threads and processes. It is an eye opener when students find for themselves that a 100 Mbit Ethernet can have latency as good as a gigabit network.

In our experience, it is simpler for the supporting staff, the students, and the teachers if a cluster is reserved for exclusive use by a cluster course. However, a large cluster can usually not be reserved for a single course, and instead we have found that reserving a subset of a larger cluster also works well.
Having control over a subset of the nodes at least some of the time will reduce the perturbations from other applications and make it simpler for the students to do repeatable performance measurements.

The students need introductory courses on computer architecture and operating systems to benefit the most from a course on parallel computing using clusters. All three authors have independently ended up needing a package of three courses: computer architecture, operating systems, and parallel computing, in that order.

The first part of this paper describes how students in operating systems courses as well as advanced courses on parallel computing may use a network of workstations and MPI to learn about synchronization and parallelization issues. We then describe how an advanced course on distributed and parallel systems uses clusters to support cheap robots with compute cycles. The robots used are made by Lego, and the system demonstrates the principles and practice of distributed computing in general, and high-performance parallel computing in particular. Finally, we describe our experiences with teaching a graduate-level course on cluster computing. In this course, we use SMP clusters. Software used for the projects in this course includes Java-MPI, Java RMI, RPC, TSpaces, and Linda.
2. RELATED WORK
In [1] the authors discuss the issues arising from the need to develop undergraduate courses in high-performance computing, and their first attempt at meeting the challenges. They found that "a successive set of courses preparing students for high performance computing is typically needed". In contrast to us, they have considered a broader set of issues; we are only reporting on a limited set of courses, and with a very hard-core computer science approach. We both use problem-based, project-oriented approaches. While we have focused on using Linux as the basic programming platform, they have made an effort to support Windows, Mac OS, and Unix. While we understand the need to support a variety of platforms, we have focused on the narrower issue of providing as much depth as we can in a limited set of courses. This has so far meant choosing a single platform.

[3] proposes to organize a cluster course into the components system architecture, parallel programming, and parallel algorithms and applications. Basically, we follow the same overall idea, but with a strong focus on core computer science students. Our courses deal with both the nuts and bolts of the architecture and with how to utilize them effectively when building distributed and parallel programs. Several papers, including [4,5,6], describe experiences that partly overlap with and partly differ from ours when teaching parallel computing using clusters. Space limitations prevent us from being more detailed. Many resources are available for teaching cluster computing; for example, the IEEE Computer Society Task Force on Cluster Computing (TFCC) [2] provides online educational resources.
3. OPERATING SYSTEMS AND PARALLEL COMPUTING: TEACHING PROCESS SYNCHRONIZATION AND PARALLEL PARADIGMS USING MPI ON A NETWORK OF WORKSTATIONS
Operating system classes are often students' first encounter with the concepts of resource allocation and process synchronization. Since task-based parallelism is based on essentially the same concepts, the stretch to cluster computing is not so far-fetched. The first part of the course is taught in the more traditional way, where the students get to build a simple operating system on PCs inside a given framework. It teaches the students about interrupt handling and all the other lower levels of operating systems. After a lecture on embedded operating systems, the students are then introduced to the six basic MPI commands: send, receive, broadcast, gather, MPI_Init and MPI_Finalize. With this they are able to write a simple parallel program.

One fun exercise that we have used is to have the students write a distributed image manipulator. Here the MPI program reads in an image, distributes it to a given number of processes (given as a flag when running MPI), manipulates it in parallel in some simple way, then gathers the result on one node and prints the final image out to a file. To challenge the students, we have them figure out how to deal with xbm images (i.e. convert them to pixel data). We also made each student do a slightly different image manipulation (e.g. contract (eliminating pixels), expand (adding pixels), bit- or color-reverse half the image, etc.). This is not necessary, but was a good way to make the students feel ownership of their implementation. A minimal sketch of such a scatter-manipulate-gather program is given at the end of this section.

The operating system courses use our regular Linux workstations. Our system staff allowed us to install the public domain MPI implementation from Argonne National Labs called mpich on these machines. We made several mini clusters with 4-8 processors, some of which were dual-processor systems. Details regarding MPI can be found at [7]. Course details from a class Elster taught at the University of Texas at Austin can be found at [8].

The more advanced courses use an AMD-based cluster that has around 40 nodes. This system is currently split into an 8-node interactive system and a 30+ node batch system (we sometimes add and subtract nodes to use for other purposes). The Parallel Computing course is a 4th-year class. These students are in a 5-year Masters program in computer engineering. Bachelor degrees are not given at NTNU, but some of the students transfer in with 3-year degrees from other colleges. The course teaches MPI programming in addition to focusing on both serial and parallel optimizations. Pacheco's Parallel Programming with MPI [9] and Gerber's The Software Optimization Cookbook [10] are used as supporting textbooks. There are several other texts on MPI as well as on-line tutorials, but we find Pacheco's text to be the favored one from a pedagogical standpoint. Additional material in the course is taken from Goedecker & Hoisie: Performance Optimization of Numerically Intensive Codes [11]. The course webpage for our spring 2003 Parallel Computing course can be found at [12]. These webpages include pointers to several MPI tutorials (both by us and by others) as well as descriptions of several student programming exercises. The latter include simple ring-style programming exercises as well as timing exercises and more advanced exercises like parallelizing the Game of Life using more advanced MPI features such as virtual topologies (e.g. MPI_Cart_create).
The students also do a pure optimization programming assignment (serial) and then combine the two in their final programming assignment, where they try to beat an ATLAS BLAS routine [13].
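Returning to the image-manipulation exercise described earlier, the following minimal MPI sketch in C illustrates the scatter-manipulate-gather structure; it is not the course's handout code, and for simplicity it inverts a synthetic grayscale buffer whose size is assumed to divide evenly among the processes.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define WIDTH  512
#define HEIGHT 512   /* assumed divisible by the number of processes */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = (WIDTH * HEIGHT) / nprocs;       /* pixels per process */
    unsigned char *image = NULL;
    unsigned char *part  = malloc(chunk);

    if (rank == 0) {
        /* Rank 0 would normally read the image from a file; here we
         * just fill it with a gradient so the example is self-contained. */
        image = malloc(WIDTH * HEIGHT);
        for (int i = 0; i < WIDTH * HEIGHT; i++)
            image[i] = (unsigned char)(i % 256);
    }

    /* Distribute the image, manipulate each part in parallel, gather back. */
    MPI_Scatter(image, chunk, MPI_UNSIGNED_CHAR,
                part,  chunk, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)
        part[i] = 255 - part[i];                 /* invert the pixels */

    MPI_Gather(part,  chunk, MPI_UNSIGNED_CHAR,
               image, chunk, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("processed %d pixels on %d processes\n", WIDTH * HEIGHT, nprocs);
        /* Rank 0 would write 'image' back to a file here. */
        free(image);
    }
    free(part);
    MPI_Finalize();
    return 0;
}

The number of processes is given as a flag at launch time, e.g. mpirun -np 4, as described above.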
4. DISTRIBUTED AND PARALLEL ROBOT SYSTEM
As part of an advanced course on cluster architecture and programming, over the last three years we have developed a system where the students program cheap Lego robots and supporting computers [14]. We use the distributed and parallel robot system as an instrument to demonstrate the principles and practice of distributed computing in general, and high-performance parallel computing in particular. The students use computers with different performance characteristics, and must learn to utilize the resources in a distributed system in practice. In particular, they must find ways of utilizing a compute cluster to support the robots with compute performance. The robots need lots of processing cycles to solve the given tasks, including finding the position of other robots when their position has been encrypted to avoid detection.

Robots with small on-board computers cooperate or compete inside an arena according to a set of rules. A dedicated support computer per robot enhances the robots' low processing, memory, and I/O performance. A compute cluster and a file server provide even higher performance. The robots are under constant surveillance by a positioning system comprised of a computer with a video camera. The camera positioning system enhances the robots' very limited on-board sensors and provides the robots with positioning information. To report the progress and the state of the system, a scoreboard computer monitors many activities and reports them on a scoreboard through a projector. A control and management computer starts, stops, and manipulates the system according to user input. Finally, an infrastructure computer can be added, making the system independent of other infrastructures and networks by providing network services (DNS, DHCP, gateway to other networks). The code is developed using Linux as a platform, and C as the language. A cross compiler is used to create code for the robots. The robots use the BrickOS operating system, which supports multiple threads.
Figure 1. Distributed and Parallel Robot System (a LAN connecting the compute cluster blade server, file server, scoreboard, control and management computer, development laptops, handheld computers via a wireless access point, and a DNS/DHCP/gateway (GPRS, ++) node)
Overall, the system is simple enough to be understood by students, but complex enough to provide an interesting distributed system with specialized services, including a cluster for high performance computing. The system is visual, and invites both students of the course and other students to speculate on how it is done. Even though a single powerful server could handle several of these functions, consolidating them would be counterproductive with regard to making the system both technically and pedagogically interesting and useful. Distributing the functions gives many opportunities to explain how the system is built, and why it is built as it is. In particular, the role of the compute cluster is simple to demonstrate by running two demonstrations, one with the cluster and the other without: the behavior of the robots changes dramatically when they must wait longer for solutions to time-critical and processor-intensive computations. The system has been used for mandatory student projects and in three robot competitions. It is our experience that the robot system attracts a great deal of attention from people both outside and inside academia, and furthers the interest in the established solution and in the technology behind it.
5. GRADUATE COURSE ON CLUSTER COMPUTING
This course takes the approach of introducing clusters as emulators of classic supercomputer architectures and shows how to program them efficiently. The course starts out using shared memory, then goes on to PVM, MPI, remote memory, distributed structured shared memory, and distributed shared virtual memory. The students should have a thorough understanding of operating systems, their nature, their limitations and the cost associated with activating different operating system components. Just as important is a good knowledge of computer architecture; it is imperative to understand the memory hierarchy and the I/O bus. Of course, the students must not be new to programming.

SMP architectures are included in the course for several reasons: they provide a simple introduction to parallel programming and parallel architectures, it is often convenient to develop a version for shared memory architectures before developing one for other architectures, and finally, many clusters comprise shared memory nodes. The Parallel Virtual Machine (PVM) API allows us to discuss programs where the individual threads have their own tasks to solve, and where communication is done using explicit message passing instead of shared memory. Using the Message Passing Interface (MPI) API, we discuss massively parallel processors (MPPs) and Single Instruction Multiple Data (SIMD) programming. With MPI we focus more on synchronous programs with a set of processes that all perform the same operations, but on different portions of the application data. Real MPP architectures discussed include the Intel Paragon and the latest ASCI machines.

Remote memory allows us to eliminate the concept of explicit message passing. Remote memory is based on the idea that one processor may read and write the memory of another processor, but a processor may only cache data from its own memory blocks. From a programming perspective, access to other processors' memory simplifies things, but performance is likely to suffer if one is not careful. As a link to "real" supercomputers we discuss the Cray T3E. We also introduce an alternative memory model where the location of data is completely hidden and the contents of a data block are used for addressing instead. This memory model, which is called associative memory, is only seen in supercomputers as pure-software implementations.
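As a concrete, purely illustrative example of the remote memory model discussed above, the MPI-2 one-sided operations let one process write directly into memory exposed by another; the sketch below is not course material, just a minimal demonstration of the idea (it assumes at least two ranks).

#include <stdio.h>
#include <mpi.h>

/* Minimal illustration of the remote-memory model: rank 0 writes directly
 * into a window of memory exposed by rank 1, without rank 1 posting a
 * matching receive. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) {
        if (rank == 0)
            fprintf(stderr, "needs at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    double local = 0.0;                 /* memory exposed to remote access */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);              /* open an access epoch             */
    if (rank == 0) {
        double value = 42.0;
        /* Write 'value' into rank 1's window at displacement 0. */
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);              /* close the epoch: put is complete */

    if (rank == 1)
        printf("rank 1 received %.1f via remote write\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}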
893 shared virtual memory programming. The cc-NUMA class of machines is an important one since they are the scalable architecture that is easily programmed and well suited for porting existing parallel applications to. We discuss the SGI Origin 3000 and the Sequent NUMA-Q to see how these machines are physically built, and how they can be emulated using clusters. The students can chose whether they want to use C, FORTRAN or Java. Of course, it is indeed a problem for high-performance computing using Java that there is no support for the IEEE floating-point standard, and that Java does not use Intel's native 80+ bit floating point hardware support [ 15]. For the exercises and project on distributed shared memory system, we use Linda and TSpaces. 6. CONCLUSIONS Giving students hand-on experiences using low-cost clusters is an invaluable way for them to get a feel for the issues related to operating systems, computer architecture, parallel programming as well as parallel computing in general. Our courses are hands-on in order to capture the complex interactions between real workloads and modern technology. Students need courses on architecture and operating systems before being able to gain the full benefits from our courses. The architecture and operating system courses show them the nuts and bolts, while the cluster courses build on this, introducing new nuts and bolts, but all the time trying to preserve the students understanding of all what is going on at all lower abstraction levels. For computer science focused cluster programming courses, this is in our opinion important. Due to scaling issues, a cluster of sixteen or more processors will show up any bad scaling behavior better than smaller clusters. We have found it advantageous to reserve some parts of the clusters for teaching projects alone to provide the students with a stable and repeatable environment for experiments. REFERENCES [1]
Stewart, Kris, Zaslavsky, Ilya, Building the infrastructure for high performance computing in undergraduate curricula: ten Grand Challenges and the response of the NPACI education center, Conference on High Performance Networking and Computing, Proceedings of the 1998 ACM/IEEE conference on Supercomputing, San Jose, CA, Pages: 1 - 8. [2] IEEE Task Force on Cluster Computing (TFCC), http://www.ieeetfcc.org/. [3] Amy Apon, Rajkumar Buyya, Hai Jin, and Jens Mache, Cluster Computing in the Classroom: Topics, Guidelines, and Experiences, Proceedings of the 1st International Symposium on Cluster Computing and the Grid (CCGRID 2001). [41 Barry Wilkinson, Tanusree Pai, and Meghana Miraj, A Distributed Shared Memory Programming Course, Proceedings of the 1st International Symposium on Cluster Computing and the Grid (CCGRID 2001). [5] G.Capretti, M.R. Lagan'A, L.Ricci, RCastellucci, S.Puri, ORESPICS: a friendly environment to learn cluster programming, Proceedings of the 1st International Symposium on Cluster Computing and the Grid (CCGRID 2001). [6] Joel Adams, Chris Nevison, and Nan C. Schaller, Parallel Computing to Start the Millennium, SIGCSE 2000 3/00 Austin, TX, USA. [7] http ://www- unix. mcs.anl.gov/mpi/mpich/
894
[8] [9] [ 10] [ 11]
[12] [13]
http:ilwww.ece.utexas.edul~elstedee360p-fO01 Peter S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc. (1997) Richard Gerber, The Software Optimization Cookbook, Intel Press (2002) Stefan Goedecker and Adolfy Hoisie, Performance Optimization of Numerically Intensive Codes, Society for Industrial & Applied Mathematics (2001) http:llwww.idi.ntnu.nol~elsterlsif8044-sp031 and http:llwww.idi.ntnu.nol~elsterl tdt4200-sp04/ Automatically Tuned Linear Algebra Software (ATLAS) http://math-atlas.sourceforge. net/
[14] http:llwww.cs.uit.nolforskninglDOSIhpdclresearchldprobots.html [ 15] W. Kahan, J.D. Darcy, How Java.s Floating-Point Hurts Everyone Everywhere, ACM 1998 Workshop on Java for High-Performance Network Computing, Stanford University, 1998, USA.
Minisymposium
Mobile Agents
This Page Intentionally Left Blank
Parallel Computing: Software Technology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
897
Mobile Agents Principles of Operation* A. Genco ~ ~University of Palermo (Italy), [email protected] In this paper we discuss the mobile agent technology and summarize their features, principles of operation and implementation elements. Some development tools are also discussed. The paper starts from a general description of Mobile Agents as an advanced software paradigm which extends Object Oriented Programming. Then it discusses principles of operation starting from agent intelligent behaviour and continuing with Mobility, Communication, Coordination, and Fault Tolerance. Finally, as for related topics, Monitoring, Performances, and Security issues are discussed. 1. I N T R O D U C T I O N A mobile agent is a launched program, which in general leaves from a host, and sails across the internet autonomously. Its mission can be to find useful information or to perform given transactions on behalf of its owner. Mobile agents can typically behave with intelligence and self-directed autonomy. Mobile agents should accomplish their task communicating with other agents, and behaving reactively and proactively. A main advantage of mobile agents is in that they are a general solution for distributed systems without any particular vocation for a given application field. They are an improvement of object oriented programming and are not a specific solution for parallel and distributed computing. Mobile agents can be used as a general approach to the design of middleware systems for user or application access to any computing resource in the net. Therefore, mobile agents are particularly suitable to arrange internet computing communities for hardware and software resource sharing. In other words, we can say mobile agents are an enabling technology for GRID computing, as they can provide any support mechanisms to them. Some mobile agents platforms, such as IBM Aglets [19], Concordia [18], Voyager [34], Odyssey [ 13], Grasshopper [8] and Jade [4], are available to implement agent based distributed systems and applications, and therefore, mobile agents can also be considered a ready-to-use technology. 2. P R I N C I P L E S OF OPERATION Many researchers are facing system operation and implementation aspects of mobile agents. A wide literature is available dealing with artificial intelligence, communication, coordination, *This work has been partly supported by the CNR (National Research Council of Italy) and partly by the MIUR (Ministry of the Instruction, University and Research of Italy).
898 mobility, fault tolerance. Here we can give just a short overview of the highlights of these topics.
2.1. Intelligent agents A community made up of autonomous individuals can be equipped with knowledge processing features so as to gain the capability of operating to the end of a given goal or role. These capabilities are strongly related to the concept of intelligence. We deal with agent system intelligence, and define the concept of intelligence, basing it on the following group features: 9 agent capability of autonomously operating; 9 social capability (i.e. the capability of communicating and task sharing); 9 group reactivity to extemal events; 9 proactivity (the capability of involving other agents and cooperating with them to achieve a goal; or the capability of planning their activity on a group knowledge and experience base). The discussion on agent intelligence should consider many aspect and implementation techniques. Here we give only two classifications which are considered the basis of agent typology. The first one is by [24] and deals with the combination of three main features which are: Cooperation, Learning, and Autonomy. Their partial overlapping produces four agent categories as shown in Fig. 1. The second classification is due to [9] and is based on three different agent behaviours. Different behaviour combinations produce three different agent types as shown in Fig. 2. Nevertheless, as asserted in [5] it is more useful to look at agent group behaviour, instead of individual agent behaviour. This is relevant to establish whether an agent system can be considered intelligent or not. An evidence of that can be observed in some natural society model as the one of ants.
Collaborative Agents
Smart Agents O
Cooperation
Learning
Autonomy Interface Agents
Figure 1. Nwana's agent classification
Collaborative Agents
899
Deliberative Agents
Hybrid Agents
0
Reflective behaviour
Reactive behaviour
Meditative behaviour
Reactive Agents
Figure 2. Davis' agent classification
2.2. Mobility Mobile agents are agents with enhanced migration capabilities. They can carry their execution status by saving and restoring process information. This capability is called Strong Migration and it makes the difference from normal agents which can migrate without carrying their execution status [32]. Fig. 3 shows a classification of the different aspects dealing with process migration. Strong migration is at the root of a tree in which the main parts of a running process are represented. Strong migration enables a mobile agent to pursue its role as the result of an execution which can start in a place and continue in another one. 2.3. Communication Mobile agents need to exchange messages for cooperation purposes as well as for efficient task accomplishment. However, reliable communication between mobile agents is difficult to achieve because of frequent agent migrations. Agent location is often unknown to other agents, thus preventing reliable message delivery. The MASIF standard has been proposed by the OMG to provide interface specifications for agent communication [8]. Broadcast and forwarding schemes have been also proposed for agent message passing. According to the Home-Proxy method, as in the IBM Aglets [ 19], a sender can use a lookup service to get the receiver home place address before sending a message. The Follower-proxy method, as in Voyager [34], is similar to the Home-Proxy, apart that a message is forwarded up to the receiver from its home place. An agent which uses the E-Mail method [20], sends a message to the receiver home place and the receiver has to check its home place for messages periodically. The Blackboard communication, as in AMBIT [7], requires available storage in each place an agent can visit and use to read or write messages. According to the Broadcast method, an agent simply broadcasts a message to all places in a distributed system. Actually, different communication types can be given by the composition of different communication and synchronisation mechanisms, where the most relevant communication feature is location dependency. Also relevant topics in agent communication are Agent Communication Language (ACL) and Knowledge Sharing Effort, which allow agents to communicate according to common ontology and peer to peer protocols.
900
1
U
BUm/
Figure 3. Classification of the different migration elements 2.4. Coordination Coordination is a key issue in mobile computing, because it allows programmers to independently manage single application components and component interaction. A complete programming model is made up of two parts: a computational model and a coordination model. Computational models deal with single-threaded computational entities, while coordination models deal with asynchronous interaction between communicating entities. Computational and coordination models can be integrated in one programming language, as well as they can be implemented by independent languages. A coordination model is made up of three basic elements:
9 Coordinables (entities whose interactions are managed by a coordination model); 9 Coordination media (abstractions and data structures for interaction, e.g. tuple spaces); 9 Coordination laws. Dedicated languages, such as KQML [ 11 ], can implement an agent coordination model by suitably managing agent communications. However, these solutions focus on peer to peer communication and lack overall agent environment interactions management. Therefore, the need for a specific mobile agent coordination model arises. Mobile agents can be unaware of the collaborative context they run within, so a coordination model should be able to avoid chaotic and anarchic situations, as well as to manage interactions between agents actions. To this end, some platforms and tools can be used, such as IBM Aglets [ 19], ARA [ 16], ftMAIN [21 ], JavaSpace [30], MARS [6].
901 2.5. Fault tolerance Increasing complexity of distributed applications makes fault management issues a central problem. Fault occurrence probability increase with operation complexity, execution time, and number of processors involved. It is obvious that fast fault detection and fixing can raise considerably system efficiency and reliability. Traditional solutions, such as SNMP [28] and RMON [25], present a serious drawback: they are not scalable enough. Moreover, fault detection in mobile agents environment is not an easy task. For example, due to slow communication, a delaying agent can be misunderstood as a lost agent, thus giving rise to false fault detection [Fischer 1983]. Another problem to be faced is in that a malfunction in a network resource may cause several alarms and fault chains which can involve as many users as the ones accessing that resource. Some fault management techniques were developed dealing with hardware and software faults, such as the use of specific coding [ 17] or network setup options[22]. Here we can give a short list of the main malfunctions which can provoke fault tolerance system interventions: Node (site) fault, Agent system fault, Agent fault, Network fault, Message lost or message alteration. Most solution in literature are specific for a given system or they cover just a few sides of mobile agent systems. Here we give a short list of such solutions:
9 [14, 15]: according to the TACOMA system, a rear guard agent is created in the place where a migrating agent is monitored. 9 [23]: An a priori routing plan is performed for each migrating agent in a TACOMA system. 9 [ 1, 29]: An agent route is split by worker nodes. Each time an agent migrates towards a new worker node, the agent is cloned in supplemental observer nodes. 9 [33]: a 2PC protocol based algorithm is proposed to achieve more reliable migration. 9 [26]: James and A 3 fault tolerance solutions for parallel distributed computing. 3. RELATED TOPICS We shortly discuss Monitoring, Performances and Security as related topics to Mobile Agent principles of operation. These topics should be discussed deeper. However, due to paper size restrictions, we suggest the reader to consult the referred bibliography for details. 3.1. Monitoring Agent monitoring techniques are very useful because they allow users to know which agents are running on their host as well as resources they are accessing. A large number of mobile agent environments rely on Java technology. However, Java Virtual Machine hides platform dependent information thus preventing online agent monitoring. Some Java extension have been developed to this end, which overcome the drawback, such as JVM Profiler Interface (JVMPI) and Java Native Interface (JNI) [31 ]. M A P I - Monitoring Application Programming Interface- was included in the S O M A Secure Open Mobile Agent- to arrange a monitoring component for mobile agent systems implemented in JAVA [2, 3]. SOMA uses MAPI to receive monitoring data about both the
902 local and remote resources agents are using. This way SOMA can monitor the global status of distributed resources by means of ad-hoc monitoring mobile agents. Some results of SOMA mobile agent monitoring show the advantages of Java based online monitoring along with the need of enabling dynamic overhead tuning by means of frequently collected data. RMON - Remote MONitoring- [ 10], and other recent tools, provide remote monitoring of network traffic. 3.2. Performances Performances can be evaluated according to different indexes and parameters. It is impossible to say what platform can show the absolute best performances. [27] considers two indexes that are turnaround time and network load. The platforms taken into consideration are: IBM Aglets, Mitsubishi Concordia, Objectspace Voyager, General Magic Odyssey, AdAstra Jumping Beans, IKV Grasshopper, and Siemens James. In some cases a benchmark test crashed, and therefore, some results lack in the performance comparison table (Tab. 1). Table 1 Platform Performance Comparison Platform Execution Time (secs) 1 agency, 1 lap, 5 agencues, 1 lap, 100 KB data No data JAMES ODISSEY SWARM GRASSHOPPER VOYAGER AGLETS CONCORDIA JUMPING BEANS
0.7 0.89 1.01 1.47 1.67 1.73 2.56 5.90
3.17 4.11 3.46 5.37 5.49 7.10 4.78
Network Load (KBytes) 5 agencies, 1 lap, No data Without caching 48.67 117.15 103.59
With caching 17.66 95.10 31.90
128.55 135.84 210.01
28.38 109.25 111.12
3.3. Security Mobile agent approach to distributed application development usually gives more efficient and scalable results than traditional client/server approach, but it shows serious security drawbacks. During its migrations across the net unsafe communication channels, mobile code is potentially exposed to many risks. Moreover, mobile agent approach allows malicious agents to run on a host without user awareness. A suitable security framework is thus required to provide faithful agents safe migration and execution, and to prevent malicious agents from host resource crunching. The security problem has three main sides which are communication security, Server protection against agent attack, and agent protection against server attack. As far as communication is concerned an attack can be in terms of data capture or modification, and for instance, it could be performed by an agent which simulates to be the right receiver of a message. Mobile agent system security is a problem not so different from then general Internet security problem and therefore security techniques are quite similar, and often based on encrypting algorithms, digital signatures, and biometrical devices for access control.
903 REFERENCES
[ 1] [2]
[3]
[4]
[5] [6] [7] [8] [9]
[ 10] [ 11 ] [12]
[ 13] [14]
[ 15]
[ 16]
[ 17]
Baumann, F. Hohl and H. Rothermel. Mole - Concept of a Mobile Agent System. Technical Report, TR-1997-15, Universit~it Stuttgart Fakult~it Informatik, Germany 1997. P. Bellavista, A. Corradi, and C. Stefanelli, Protection and Interoperability for Mobile Agents: A Secure and Open Programming Enviroment, IEICE Transactions an Communications, Vol E83-B, No. 5, May 2000. P. Bellavista, A. Corradi, and C. Stefanelli, An Integrated Managment Enviroment for Network Resources and Services, IEEE Journal on Selected Areas in Communication, Vol. 18, No. 5, May 2000. Bellifemine F., Caire G., Poggi A., Rimassa G. JADE: a White Paper, exp - Volume 3 n. 3, http://exp, telecomitalialab.com, 2003 R.A. Brooks, Intelligence without Representation, in Artificial Intelligence Vol.47, pp. 139-159, 1991 G. Cabri, L. Leonardi, F. Zambonelli. Implementing Agent Auctions using MARS, http://sirio.dsi.unimo.it/MOON/papers/papers.html, 2000 L. Cardelli and D. Gordon. Mobile Ambients, Foundations of Software Science and Computational Structures, LNCS No.1378, Sprinter, pp. 140-155,1998. Stefan Covaci, Grasshopper The First Reference Implementation of the OMG MASIF Mobile Agent System Interoperability Facility, http://www.omg.org/docs/98-04-05.pdf D.N. Davis, Reactive and Motivational Agents: Towards a Collective Minder, in J.P. Miiller, M.J. Wooldridge & N.R. Jennings (editors): Lecture Notes in Artificial Intelligence 1193, Intelligent Agents III, Proceeding of ECAI?96 Workshop, ISBN 3-54062507-0, Springer, 1997 L. Deri, and S. Suin, Effective Traffic Measurement Using Ntop, IEEE Communications, Vol. 38, No. 5, May 2000. Finin T. et al: KQML as an agent communication language, in Bradshaw J (Ed): 'Software Agents' MIT Press (1997). M. Fischer, N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty process, proc. of the Second ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 1-7, Atlanta, Georgia, Mar. 1983. GeneralMagicInc. Odissey. 1997. available at http://www.genmagic.com/agents D. Johansen, R.van Renesse and F.B. Schneider. Operating System Support for Mobile Agents. Proc. of the 5th IEEE Workshop on Hot Topics in Operation Systems, Orcas Island, USA. May 1994 D. Johansen, R.van Renesse and F.B. Schneider. An Introduction to the TACOMA Distribuited System, version 1.0. Report Institute of Mathematical and Physical Science, Department of Computer Science, University of Tromtf, Norway, 1995. Holger Peine, Application and Programming Experience with the Ara Mobile Agent System, preprint of an article accepted for publication in IEEE Software - Practice and Experience, 2002 Kliger S, Yemeni S Yemini Y, Ohsie D, Stolfo S. A coding approach to Event Correlation. Proc. of Fourth International Symposium on Integrated Network Management. 1995 ;IFIE 266-277.
904 [18] Reuven Koblick, Concordia, Communication of the ACM, volume 42, Number 3, 1999, Pages 96-97 [ 19] D. Lange and M. Oshima, Programming and Deploying Java Mobile Agents with Aglets, Addison Wesly, 1998. [20] A. Lingnau and O. Drobnik, Agent-User Communication: Requests, Result, Interactions, in Lecture Notes in Computer Science (1477), Sprinter, pp. 209-221, 1998. [21 ] A. Lingnau, ffMAIN: using Tcl and the Tcl Httpd Web Server to implement a mobile agent infrastructure, www.tu-harburg.de/skf/tcltk/papers2000/paper.pdf, 2000 [22] Mansfield G., Ouchi M.,Jayanthi K., Kimura Y., Ohata K., Nemoto Y., Techniques for Automated Network Map Generation using SNMP. Proc. of INFOCOM'96. 1996; 473480. [23] Y. Minsky, R.van Renesse, F.B. Schneider and S.D.Stoller. Cryptography Support for Fault Tolerant Distribuited Computing. Proc. of the 7th ACM SICOPS European Workshop, ACM Press, September 1996. [24] H.S. Nwana, Software Agents: An Overview, Knowledge Engineering Review, 1996 [25] Perkins DT. RMON Remote Monitoring ofSNMP-Managed LANs. Prentice Hall: Englewood Cliffs, NJ, 1999. [26] L.M.Silva, P. Sim6es, G.Soares, P. Martins, V. Batista, C. Renato, L. Almeida, N. Stohr. JAMES: A Platform of Mobile Agents for the Management of Telecommunication Networks, Proc. of IATA '99, Stockholm, Sweden, August 1999. [27] Silva L. M. Silva, G. Soares, E Martins, V. Batista, L. Santos, Comparing the performance of mobile agent system: a study of benchmarking, ELSEVIER, 2000 [28] Stallings W., SNMP, SNMPv2, SNMPv3, and RMOM 1 and 2, Addison-Wesley: Reading, a 1999. [29] M. Strasser and K. Rothermel. Reliability Concepts for Mobile Agents. International Journal of Cooperative Information System (IJCIS), 7(4): pp. 355-382, 1998. [30] Sun Microsystems, The JavaSpace Specification, http://www.sun.corn/jini/specs/jsspec.html, 1999 [31 ] Sun Microsystems - Java Virtual Machine Profiler Interface (JVMPI), http://java.sun.com/products/j dk/1.3/doe s/guide/j vmpi/j vmpi.html. [32] Torsten Illmann, Michael Weber, Frank Kargl, Tilmann Kruger - Migration of Mobile Agents in Java: Problems, Classification and Solutions- http://cia.informatik.uni-ulm.de/. [33] M. Vogler, T. Kuklemann and M.L. Moschgath. An Approach for Mobile Agent Security and Fault Tolerance Using Distribuited Transactions. In Processing 1997 International Conference on Parallel and Distribuited Systems (ICPADS '97). IEEE Computer Society. December 1997. [34] Thomas Wheeler, Developing Peer Applications with Voyager, http://www.recursionw.com
Parallel Computing: SoftwareTechnology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
905
Mobile Agent Application Fields* F. Agostaro a, A. Genco a, and S. Sorce a aUniversity of Palermo (Italy), [email protected], {agostaro, sorce }@studing.unipa.it The paper discusses the highlights of three mobile agent application fields which are: Parallel and Distributed Computing, Data Mining and Information Retrieval, Networking. An overview of the development platforms is also discussed. 1. I N T R O D U C T I O N Mobile Agents are a recent paradigm for software design which extends object oriented programming features. An agent can perform its task autonomously; a mobile agent can carry out complex tasks which require the agent to migrate from a network place to another one. Mobile agent application fields are many. We can guess they will be more in the next future. In many cases Mobile Agents can replace web services; in other cases mobile agents and web services can be an effective solution together. 2. P A R A L L E L AND DISTRIBUTED C O M P U T I N G Probably, no novel middleware system will ever be capable of making the internet a valid communication subsystem for parallel computing, especially if one is dealing with a number crunching application, a synchronous algorithm and frequent communication between processes. If time constraints are also a requirement, there is no better option than a dedicated parallel machine. Nevertheless, a wide number of parallel applications can be rearranged according to the distributed computing paradigm, thus making them right for execution in heterogeneous network systems with best effort communication. Many packages were developed to this end, such as PVM, MPI, Linda, Express, and so on, each giving an effective solution to the load balance problem, even performed in a dynamical fashion. Mobile agents allow programmers to arrange new load balance strategies in a more reliable way than the Remote Procedure Call paradigm. A distributed system can use mobile agents to perform process migration with both data and code shipping. On the base of their strong migration capability, we can say mobile agents are the most suitable solution when the communication subsystem is the one provided by the internet and no guarantee can be given about a constant bit rate for data interchange. Here we shortly discuss some platforms which implement mobile agents for parallel distributed computing. *This work has been partly supported by the CNR (National Research Council of Italy) and partly by the MIUR (Ministry of the Instruction, University and Research of Italy).
906 2.1. BOND
BOND [Marinescu 2002] is a Java based, FIPA [FIPA SC00001L and XC00086D] compliant agent framework which is used to implement various kind of application such as MPEG video streaming, stock selection, partial differential equation solving, user group schedule management. Key features of the BOND Agent framework are: 9 multi-plane state machine agent model, 9 component-based architecture, 9 Python [Rossum 2003] based agent description language [Shread 2002], 9 strong emphasis on introspection, 9 visual modelling and software verification, 9 dynamic agent behaviour (agent assembly, mobility, surgery, trimming, and lazy loading of strategies). BOND is a framework for message passing in a distributed system. BOND provides capabilities to create, deploy, migrate code at different locations, and also provides means for communication between all the components of the system. BOND infrastructure is based on an architecture and communication mechanisms that provide messaging capabilities such that every object in the environment can send and receive messages from other local or remote objects. 2.2. N O M A D S
NOMADS [Suri 2000] is a mobile agent system that supports strong mobility and safe Java agent execution. The NOMADS environment is composed of two parts: an agent execution environment called "Oasis" and a new Java compatible Virtual Machine (VM) called "Aroma". The combination of Oasis and the Aroma VM provides key enhancements over today Java agent environments. Oasis is an execution environment which includes the AROMA virtual machine. It consists of two independent parts: a front end program for interaction and maintenance, and a backend environment for execution. The maintenance program allows system manager to perform administration tasks such as account creation, resource use restriction and so on. Each agent runs in a different virtual machine. The Oasis process holds all virtual machine instances for agent execution, policy manager and dispatcher. AROMA VM is a Java compliant virtual machine which allows a running agent to perform strong migration with the capability of capturing the execution status of individual thread, thread group, all threads in a virtual machine. WYA (While You're Away) is a NOMADS distributed system which uses idle workstations to run stand alone or parallel computations. WYA provides accounting information to allow workstation owners to have some economic benefits by letting their computation resources be shared.
907 2.3. J P V M JPVM (Java-PVM) [Ferrari 1997] needs to be mentioned because mobile agent platform are usually implemented by means of java programming. JPVM is a Java class library which implements all the standard PVM functions and it does not implements any mobile agent system. However, JPVM can be a good basis for mobile agent systems implementation, thus enabling a parallel virtual machine to perform process strong migration and dynamic load balance strategies.
3. DATA MINING AND INFORMATION RETRIEVAL Data mining and information retrieval systems often rely on centralised management. According to this approach, data are collected from remote sources to one central host, thus rising network load and communication cost. Mobile agents provide remote resources local access capabilities, so they are an interesting option for data mining and information retrieval system design. A mobile agent approach reduces data collection from remote sites and therefore also reduces communication costs. According to Glitho et al. [Glitho 2002] a mobile agent is sent to a server with the end of retrieving and processing information locally. SmartSystem is an information retrieval statistical system with a space-vector model to evaluate similarities among documents. SmartSystem runs within a stationary agent as a multifunction interface which provides the following facilities: Textual query which returns a list of relevant documents, Full document retrieval, similarity evaluation in each couple of listed documents. JAM [Stolfo 1997] is an agent based system which works as an operating system extension. JAM is a meta-learning system which supports network broadcasting of learning and metalearning agents towards distributed data bases. JAM implements a collection of distributed applications for learning and classification in a data-site network where each data-site includes: a local data base, one or more learning agents, one or more meta-learning agents, a local user configuration, some visual interfaces. JAM data sites are designed to cooperate each with the others exchanging classification and learning agents. Yang et al [Yang 1998 A] proposed an implementation of intelligent, customizable mobile software agents for document classification and retrieval which was carried out by means of the Voyager platform [Yang 1998 B]. Different approaches were used, as for instance, TFIDF (Term Frequency- Inverse Document Frequency), Bayesian and DistAl (neural network classifier), whose performances were evaluated by means of genetic alcorithms. An agent is sent to a remote site where it retrieves the documents (journal paper abstracts and news articles) which comply with a query. Then the documents are sent to the local site, and the agent is finally killed. The documents are finally classified by the user as interesting or not. This allow a data set to be created for agent training purpose. 3.1. DDM Some DDM (Distributed Data Mining) systems [Kargupta 2003] use client-server architectures and some mobile agents. Among these we can include: BODHI, PADMA, and KEPLER [Krishnaswamy 2000]. BODHI (Besizing Knowledge through Distributed Heterogeneous Induction) is being written in Java. Its main goal is to create a communication system and a runtime environment for collected data analysis. These should be independent from specific platform, particular learning
908 algorithm or given knowledge representation. PADMA (Parallel Data Mining Using Agents) uses an approach similar to JAM where agents are deployed according to source localisation. PADMA agents perform their data mining task in parallel, without merging results. KEPLER integrates different automatic learning algorithms. It can be improved by adding any learning algorithm into the system. It behaves as a "plug-in" and does not includes a decision mechanism. An external intervention is required each time a new algorithm must be selected for a given data mining session. 4. NETWORKING Traditional network management systems are based on a centralized approach, since network "intelligence" resides in a small number of specialized nodes. This kind of approach is not flexible enough (e.g. routing algorithms cannot be easily modified), and presents some serious drawbacks, especially for extensibility and scalability purposes. Moreover, network congestion is likely to occur near intelligent nodes. These drawbacks can be overcome by a different network design approach, based on decentralized management. A traditional centralised model was proposed by the IETF (Internet Engineering Task Force) and the ISO (International Standard Organisation). In the following we deal with a decentralised model which relies on cooperating mobile agents. These are instructed to store routing information across dynamic networks, which can be used to direct application agents inte optimal paths. The active network paradigm is a solution to the same problem, which can be more efficient, provided that an application can use router devices to run its own strategies. Mobile agent have the advantage of running in central memory and CPU, so they can implement more complex and effective logic. 4.1. Mobile agent based research works A wide discussion should be needed on various solution proposed in literature, each showing advantages and drawbacks. Here we can only review the main features, without entering in details. Sahai e Morin [Sahai 1998] proposed a Mobile network Manager (MNM) application to be run in laptop computer as an administrator assistant to remotely control a network, by means of mobile agents for distributed management. Many solution have to face scalability problems. Among these, Liotta et al. [Liotta 1999] deal with a mobile agent based management architecture. A multi-level strategy is adopted which allow simple monitoring mobile agents to be cloned in order to reduce deployment cost. 4.2. Mobile agent based active networks An active network can be programmed by end users [Psounis 1999], thus enabling intermediate nodes to perform computations in the OSI application layer. By this way end users can control network services according to information coded in their packets. Many active network architectures currently use code mobility [Fuggetta 1998], a paradigm which is very similar to mobile agents. The difference between active networks and mobile agent based network management, mainly is in terms of protocol encapsulation and maintenance services. A basic difference is that active networks run their management code at the network layer, Mobile agents are application programs obviously running at the application
Criterion Implementation Language Architecture I Components
April Platform in C, agents in April Agents as lightweight ,,processes" running within an April invocation, any process can also act as agent station
Transport & Proprietary Communicaprotocol based tion Technol- on TCP/IP between special ogy communication daemons, transport uses communication mechanism, limited IIOP support
ASDK Java Aglets as mobile agents, aglet Context as host/ interface to runtime environment
Proprietary Agent Transfer Protocol (ATP) based on TCP/IP accessible via customisable Agent Transfer Communicati on Interface (ATCI)
D' Agents Platform in C, agents in TCL Agents move from agent daemon to another, one per computer, each agent runs in its own interpreter
Proprietary protocol based on TCPiIP or on e-mail
Grasshopper Java
Odyssey Java
MASIF pliant
MASIF larity
com-
TCP/IP, RMI, CORBA IIOP, MAF HOP, TCPIIP+SSL, RMI+SSL supported by internal ORB
Voyager Java simi.
Via transpor API witk implementat ion of RMI (default), IIOP DCOM
Distributed mobile objects as agents connected by an ORB, Space/ Subspace/ Super-space concept for broadcast messaging Proprietary protocol based on TCPiIP and CORBA supported by internal ORB
910 layer. With reference to active network terminology, a mobile agent for network management can be seen as a particular active packet, and an agent host node a particular active node. Minaret al. [Minar 1998] proposed a dynamic network management model by the use of a mobile agent population which are defined by five main properties: 1. Strong migration; 2. primitive language level provided by the infrastructure for agent migration towards a near node; 3. agents must be small and can be used to arrange complex agents; 4. agent cooperation by memory sharing for complex operation; 5. agents can select and use all the resources of whatever node it is running in. Connectivity maintenance agents can be specialised according to an objective function. For instance, some management agents could perform economical actions to encourage selected traffic models [Gibney, 1998]. Some agents could be specific for low latency route maintenance even if this would require huge bandwidth. Agents could adapt network infrastructure to changing requirements. For instance, the usage cost of a partially idle gateway could be lowered down to increase traffic. Network management agents can be specialised according to user goals. This can be done, provided that users know what connections they need, and are capable of instructing application mobile agents to arrange a personal connectivity. 5. PLATFORMS The Mobile Agent Platform Assessment Report By the MIAMI Project [Bross 2000] gives an overview of the most popular mobile agent platforms available. Here we report Table 1 from that document which summarises the features of six platforms: April [McCabe 1993], ASDK [Lange 1998], D'Agents [Gray 1998], Grsasshopper [Covaci 1998], Odissey [GeneralMagicInc 1997], Voyager [Wheeler 2003]. We feel the need to add the JADE platform [Bellifemine 2003] to the list. JADE is FIPA compliant and provides useful java tools to implement agent behaviours ant ontology driven interactions. JADE is from Telecom Italia Lab. Finally, the Java 2 Platform Micro Edition, by SUN Microsystems [J2ME] is one of the most popular java platforms which enables mobile devices, such as cellular phones and PDA, to be used as execution environment for mobile agent distributed, mobile and ubiquitous computing applications. REFERENCES
[Bellifemine 2003] Bellifemine F., Caire G., Poggi A., Rimassa G. JADE: a White Paper, exp - Volume 3 - n. 3, http://exp, telecomitalialab.com, 2003 [Broos 2000] R. Broos, B. Dillenseger, P. Dini, T. Hong, A. Leichsenring, M. Leith, E. Malville, M. Nietfeld, K. Sadi, M. Zell, Mobile Agent Platform Assessment Report By the MIAMIProject, Edited for the CLIMATE Sub-cluster Agent Platforms, Editors A. Guther, M. Zell, http://www.ee.surrey.ac.uk/CCSR/ACTS/Miami/
911 [Covaci 1998] Stefan Covaci, Grasshopper The First Reference Implementation of the OMG MASIF Mobile Agent System Interoperability Facility, http://www.omg.org/docs/98-04-05.pdf [Ferrari 1997] Adam J. Ferrari, JPVM: Network Parallel Computing in Java, Technical Report CS-97-29, Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA, December 8, 1997 [FIPA SC00001L] FIPA- Foundation for Intelligent Physical Agents -Abstract Architecture Specification, SC00001 L,, www.fipa.org [FIPA XC00086D] FIPA - Foundation for Intelligent Physical Agents - Ontology Service Specification XC00086D, http://www.fipa.org, 2000 [Fuggetta 1998] A. Fuggetta, G. P. Picco, e G. Vigna, Understanding Code Mobility, IEEE Transaction on Software Engineering, 24(5), Maggio 1998. [GeneralMagicInc 1997] GeneralMagicInc. Odissey. 1997. available at http://www.genmagic.com/agents [Gibney 1998] M.A. Gibney, N.R. Jennings, Dynamic Resource Allocation by Market-Based Routing in Telecommunications Networks, Intelligent Agents for Telecommunication Applications, Proceedings of the Second International Workshop on Intelligent Agents for Telecommunication (IATA'98) [Glitho 2002] Glitho R.H., E. Olougouna, S. Pierre "Mobile Agents and their use for information retrieval: a brief overview and an elaborate case study" IEEE Network, Jan/Feb 2002, 34-41 [Gray 1998] Robert S. Gray, David Kotz, George Cybenko, Daniela Rus, D 'Agents: Security in a multiple-language, mobile-agent system, 1998, Lecture Notes in Computer Science [J2ME] http ://java. sun.com/j 2me/ [Kargupta 2003] Hillol Kargupta, DBWORLD Distributed Data Mining Bibliography, Date: Fri, 26 Sep 2003 21:29:28 -0500 (CDT http://www.csee.umbc.edu/~hillol/DDMBIB/) [Krishnaswamy 2000] S. Krishnaswamy, A. Zaslavsky, S.W.Loke An architecture to support distributed data mining services in e-commerce environments', Workshop on Advanced Issues of E-Commerce and Web/based Information Systems, 2000 [Lange 1998] D. Lange and M. Oshima, "Programming and Deploying Java Mobile Agents with Aglets ", Addison Wesly, 1998. [Liotta 1999] A.Liotta ,G.Knight, G.Pavlou, On the performance and scalability of decentralised monitoring using mobile agents, Proceedings of the 10th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management (DSOM'99),Ottobre 1999, pp.3-18. [Marinescu 2002] Dan C. Marinescu, Internet-Based Workflow Management: Towards a Semantic Web, ISBN: 0-471-43962-2,WILEY-INTERSCIENCE, 2002 - Bond 2.2 Released on December 1st, 2001 [McCabe 1993] F. G. McCabe. The April reference manual. Technical report, Department of Computing, Imperial College, London, 1993 [Minar 1998] N. Minar, K. H. Kramer, P. Maes, "Cooperating Mobile Agents for Dynamic Network Routing", MIT Media Lab, Cambridge, USA, 1998. (http://nelson.www.media.mit.edu/people/nelson/research/routes-coopagents/) [Psounis 1999] K. Psounis, "Active Networks: Applications, Security, Safety and Architecture" IEEE Communications Surveys, First Quarter 1999. [Rossum 2003] Guido van Rossum, Python Tutorial Release 2.3.3,, PythonLabs, Fred L. Drake, Jr., editor, Email: [email protected], December 19, 2003
912 [Sahai 1998] A. Sahai, C. Morin, Enabling a mobile network manager trough mobile agents, Proceedings of the Second International Workshop on Mobile Agents (MA'98), LNCS, vol. 1477, September 1998, pagg. 249-260. [Shread 2002] Paul Shread, Blueprint For A Decentralized World, February 7, 2002, http ://www.gridcomputingplanet.com/features/article.php/970541 [Stolfo 1997] S. Stolfo, A.L. Prodromidisz, S. Tselepis, W. Lee, D. W. Fan, P.K. Chan JAM: Java Agents for Meta-Learning over Distributed Databases, In Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining. Newport Beach, CA, pp. 74-81 Au- gust 1997 [Suri 2000] Niranjan Suri, Jeffrey M. Bradshaw, Maggie R. Breedy, Allan R.Ditzel, Gregory A. Hill, Brian R. Pouliot, David S. Smith, NOMADS: Toward an Environment for Strong and Safe Agent Mobilit, Proceedings of the Fourth International Conference on Autonomous Agents, 2000 [Yang 1998 A] J.Yang, V.Honavar, L. Miller, J. Wong "Intelligent Mobile agents for information retrieval and knowledge discovery from distributed data and knowledge sources", 99-102, 1998 [Yang 1998 B ] J.Yang, E Pai, V. Honavar, L. Miller, Mobile Agents for Document Classification ad Retrieval: a Machine Learning Approach, (1998), iastate.edu/~honavar/Paper...emcsr98.ps [Wheeler 2003] ThomasWheeler, Developing Peer Applications with Voyager, http://www.recursionw.com
Parallel Computing: SoftwareTechnology, Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
913
Mobile Agents and Grid Computing* F. Agostar@, A. Chiello, A. Genco ~, and S. Sorce a aUniversity of Palermo (Italy), Dipartimento di Ingegneria Informatica, Viale delle Scienze, edificio 6, 90128 P a l e r m o - ITALY, [email protected], {agostaro, chiello, sorce }@studing.unipa.it This paper deals with mobile agents as an effective solution for grid service provision. A short overview is first introduced on the Grid paradigm and the most known research activities in the field. Then mobile agents are discussed and a comparison with the RPC method is made as far as the most effective solution to minimise network overload and fault occurrences is concerned. 1. I N T R O D U C T I O N Grid technology has brought a new way of conceiving computational resources as an integrated system made up of a set of interconnected computers, sometimes remotely located [Foster 2002]. The common feature of all grid architectures is a coordinated and controlled resource sharing between the members of a dynamic multi-institutional virtual community. Community members agree on a set of sharing rules and permissions, by which resources to be shared and members to be enabled to access these resources, can be defined. Resource sharing can take place by using a suitable protocol architecture for interoperability. The resources to be shared in a grid environment can be both physical resources and services, being a service a networkenabled entity which provides some capability. In a service-oriented view, interoperability can be achieved by a standard way of defining service interfaces and protocols to be used for service invocation. According to the key features required by a grid environment, three kinds of protocols are relevant: 9 C o n n e c t i v i t y P r o t o c o l s , for communication and authentication; 9 R e s o u r c e P r o t o c o l , for individual resource access negotiation, initiation, monitoring, con-
trol, account and payment; 9 C o l l e c t i v e P r o t o c o l s , for a coordinated use of many individual resources.
An Open Grid Service Architecture (OGSA) has been proposed, based on the issues outlined above. OGSA relies on a Web Service Description Language (WSDL) for service description [Foster 2004]. WSDL was developed by W3C [W3C 2001 ], and relies on XML, the eXtensible Markup Language [Strauss 2003]. A WSDL document is an XML document made of a set of definitions. The most important components of a WSDL service description are: *This work has been partly supported by the CNR (National Research Council of Italy) and partly by the MIUR (Ministry of the Instruction, University and Research of Italy).
914 9 m e s s a g e definitions, which are abstract definitions of the data to be transmitted; 9 p o r t type definitions, which are abstract definitions of operation to be performed on an
input message, to get an output message; 9 b i n d i n g definitions, which define the concrete data format and protocols to be used for
service invocation. The message and port type definitions are independent of the binding definitions. Therefore, changing concrete data formats and protocols does not affect the part of the WSDL document which contains message and port type definitions. The WSDL description of a service is also distinct from the service implementation. This is strictly related to the platforms on which the service resources reside, and allows easy overlay to local environment. This implies, for example, that a service implementation can rely on facilities provided by the specific platform on which is executed, thus increasing implementation efficiency. In this context, Grid resource discovery mechanisms play a central role. By means of these mechanisms, an implementation of a service can exploit lower level capabilities available on a host. If we apply this argument in a recursive fashion, we are able to construct a higher level service by suitably composing lower level services, thus obtaining a virtualization of resource behaviours. This virtualization requires not only standard service description, but also standard semantics for service interactions. OGSA fulfils this requirement with Grid Service: this is a Web service which provides a set of well-defined interfaces and follows a set of conventions, to provide uniform service interactions semantic 2. M A GRID SERVICE PROVISION
Mobile Agents technology can play a relevant role in grid computing. MA paradigm exhibits good heterogeneous systems adaptation capabilities and provides user-customized remote setup procedures. In particular, the MA paradigm allows network load reduction, dynamic load balancing [Genco 1996] and asynchronous user-defined task execution [Genco 2003 A] [Marques 2001]. Grid computing, as a paradigm for distributed complex services provision, needs mobile agents because this technology provides strong migration and allows a service provision process to migrate and carry its execution status with it. This is very relevant when arranging complex services, along with geographically distributed components to be bound by application logic. A complex service may need to start in a network place, where some facilities are available, and then to continue in another place where other facilities can be exploited, and so on. We could draw the graph of a complex service as the one in Fig. 1. A user calls a Master Service Provider (MSP) for a complex web service which can be performed as a multi-service procedure [Fou 2001]. Sub services are located in some Service Provider (SP) places, and therefore, many places may need to be visited. There are two main options for arranging a multi-service procedure. The one in Fig. 1. takes advantage by the use of mobile agents because, except for the initial user service request to MSP, network is loaded just for short hops, without requiting bidirectional communication. The other traditional one in Fig. 2. makes use of the well known Remote Procedure Call (RPC) method, by which MSP is charged with the task of calling each involved SP.
915
MSP
~
SP4
SP1A5
~
"~
I
SP7
"~
~I~P8All
SP3
SP2 SP6A9
SP 10
User Place
Fig. 1. Mobile agent gid service route
Actually, MA and RPC can be used together, as for instance, in the case a mobile agent needs to call a remote procedure. This may occur when migration is not required, or simply because there is no agent that was programmed for that procedure logic, or more, because some specific synchronisation is required as the one of an RPC. In other cases one can use a mixed approach. For instance, mobile agents can be used to perform the service provision strategy in Fig. 2, or an RPC can be recursively used to implement the service route in Fig. 1. Nevertheless, a relevant advantage of mobile agents is in being more effective for service discovery. Before using any RPC sequence, MSP needs to know where services are currently located. Probably, some centralised table could efficiently provide a service linking mechanism. Unfortunately, such a strategy entails MSP overload because of table updating to be frequently performed. Moreover, some inconsistency risk is always standing, as for instance, in the case some links are temporarily down. Some services might be cancelled or added at any time, even immediately after the MSP service table has been updated. If an inconsistency arises in the table, some RPC requests may fail, thus triggering error procedures. We could continue with the list of faults an RPC strategy may entail. Nevertheless RPC keeps on being a reliable strategy which is used in many distributed applications. We believe the mobile agent paradigm can always give a better solution, especially when the scenario is the one of the changing internet connections. A mobile agent can follow its route in Fig. 1. both performing dynamic routing [Genco 2003 B] and service provision at the same time. To this end, each SP should maintain and update a local table for Service Proximity Location (SPT) as it would be a cache memory (Fig. 3.). Service sequences are often similar among them, or they can be grouped by similarity. This allow an SP to hold the addresses of those services which are most frequently requested
916
MSP
~
_
SP 4
9
SP 7
SP8^ll
SP 2 SP 6 ^ 9
--0
SP 10
User Place
Fig. 2. RPC grid service provision
after the ones in its place. According to a routing policy, an SPT can be extended to store SP addresses for services of some hops ahead. A service discovery mission will be needed only in a few cases, when a service entry is missing in the table.
SP ~lllllllllllllll~lllllllllli~
I Agent Server
|
'
I | |
z
i
i
Service Proximity Table (SPT) Service id 1 Service id 2
|
SPT
Service id n Default service location
Fig. 3 . - SP Service Proximity Table
address address address address address
917 2.1. Distributed service location
Each SP holds those services it owns and allows to be shared within a grid community. Nevertheless, due to performance improvement reasons, an SP may be requested to host other services from different owners. Once analysed the most frequent service provision paths, it could be convenient to store some copies of a service object in some suitable SP places. If we assume the path in Fig. 1. be a frequent one, some interventions can be made to minimise the number of hops a mobile agent has to do. The service path in Fig. 1 requires eleven hops. By simply copying three service objects in different places, the same service can be carried out by eight hops (Fig. 4.), thus reducing network load at the cost of service size local memory occupation.
MSP
~
~
SP 7
.)
8^9
SF
b sp,o,l User Place
Fig. 4. Mobile agent servicepath reduction
This kind of performance improvement can be performed only if we are using mobile agents. An RPC service provision sequence cannot be reduced without involving an MSP as a service repository. Such a policy would be not so effective because the involved MSP would become a bottleneck. 3. CONCLUSIONS Mobile agents turned out to be the most effective paradigm to be used when distributed services are to be provided. Mobile agents allow a complex service to be managed by a suitable logic for sub services sequence binding. Logic can follow a distributed service in each place a sub service is provided.
918
Apart the cases of specific synchronisation requirements, mobile agents are more effective than the RPC method, because they are capable of minimising network load and also reduce inconsistency faults. REFERENCES
[Foster 2002] Foster, I.; Kesselman, C.; Nick, J.M.; Tuecke, S.; Grid services for distributed system integration, Computer, Volume: 35, Issue: 6, June 2002, Pages :37 - 46 [Foster 2004] Ian Fosterl,2 Carl Kesselman3 Jeffrey M. Nick4 Steven Tueckel The Physiology of the Grid An Open Grid Services Architecture for Distributed Systems Integration, http ://www.globus.org/research/papers/ogsa.pdf. [Fou 2001 ] John Fou, Web Services and Mobile Intelligent Agents, Combining Intelligence with Mobility, Ooctober 2001, http://www.webservicesarchitect.com/content/articles/fou02.asp [Genco 1996] Genco A.; G. Lo Re. 1996. The Egoistic Approach to Parallel Process Migration into Heterogeneous Workstation Network. Journal of Systems Architecture- The Euromicro Journal, ID n: JSA-030 [Genco 2003 A] Alessandro Genco, Mobile Agents Principles of Operation, Proceedings of ParCo 2003, 2 - 5 September 2003, Dresden [Genco 2003 B] Alessandro Genco, Mobile Agents and Knowledge Discovery in Ubiquitous Computing, Proceedings of ParCo 2003, 2 - 5 September 2003, Dresden [Marques 2001] Marques, P.; Simoes, P.; Silva, L.; Boavida, F.; Silva, J., Providing applications with mobile agent technology, Open Architectures and Network Programming Proceedings, 2001 IEEE, 27-28 April 2001, Pages: 129- 136 [Strauss 2003] Strauss, F.; Klie, T., Towards XML Oriented Internet Management, Integrated Network Management, 2003. IFIP/IEEE Eighth International Symposium on, 24-28 March 2003, Pages:505 - 518 [W3C 2001] Web Service Description Language (WSDL) 1.1, W3C Note 15 march 2001, http ://www.w3.org/TR/2001/NOTE-wsdl-20010315
Parallel Computing: SoftwareTechnology,Algorithms, Architectures and Applications G.R. Joubert, W.E. Nagel, F.J. Peters and W.V. Walter (Editors) 9 2004 Elsevier B.V. All rights reserved.
919
Mobile Agents, Globus and Resource Discovery* F. Agostaro a, A. Genco ~, and S. Sorce a aUniversity of Palermo (Italy), [email protected], {agostaro, sorce }@studing.unipa.it In this paper we deal with the grid technology and some related problem are discussed. In the first part of the document we discuss an overview of grids, in terms of application fields and needed protocols. Next we discuss the Globus project, which is nowadays the de facto standard for grid environments, with emphasis on its resource discovery components. Finally we deal with resource discovery as an application of the Mobile Agents paradigm and show how this approach can be used for general, scalable and multi-purpose resource search and location. 1. INTRODUCTION The technology standards of recent parallel architectures like CrayT3E has not been overcome yet. However the rapid development of network technologies and infrastructures in the last few years has brought a new way of conceiving computational resources. This can be identified by an integrated system made up of a set of interconnected computers, sometimes remotely located. This is the grid approach. The term grid is a very generic one, since grid architectures can be successfully employed in many different contexts, ranging from high performance solutions of numerical problems to artificial intelligence applications [Foster 2001], [Foster 2002]. The common feature of all grid architectures is a coordinated and controlled resource sharing between the members of a dynamic multi-institutional virtual community. The members of such a community agree on which resources will be shared and which members of the community will be enabled to access these resources, thus defining a set of sharing rules and permissions. Grid computing should give the opportunity of looking at a set of computing resources owned by the members of the community as the ones to be used by distributed parallel application. In this field several projects are in progress, being Globus [Foster 1997] the most known and the one standing the capability of becoming a standard, mostly in the area of integrated grid services provision. In the Globus Project computing communities are arranged in "Virtual Organizations" made up of groups who are building experimental and production grid infrastructures for their own purposes. Regardless the software tool used for resource sharing, the main problem users have to deal with is that all they need is available somewhere in the net; they only need to discover it. The mobile agent paradigm could be helpful to solve this problem. An agent is a software entity which should be capable of acting autonomously in the information virtual space. A mobile agent is an agent which can perform its task sailing in the net and thus completing in a *This work has been partly supported by the CNR (National Research Council of Italy) and partly by the MIUR (Ministry of the Instruction, Universityand Research of Italy).
According to a given discovery strategy, mobile agents can be instructed to search among a set of addresses and to single out those sites which fit the member requirements in terms of hardware or software resources.

2. THE GRID TECHNOLOGY

Resource sharing can take place by using a suitable protocol architecture, in order to implement interoperability between the members of a virtual community. Since the rules and policies for resource sharing can change dynamically within a grid, these protocols should be lightweight, so that resource sharing rules can easily be established and changed. Sharing relationships between the members of a community are not necessarily client-server ones. Interesting applications of this model have been developed in CAD communities, since grid architectures provide the opportunity of sharing dedicated software, knowledge repositories, databases and optimisation algorithms, as well as CPU cycles and hardware resources, among community members. A suitable access interface makes access to the distributed resources transparent to the users. In this way, each designer can use advanced state-of-the-art analysis, design and optimisation tools at a relatively low cost. Moreover, in the last few years companies and enterprises have realized that purchasing some services from an external service provider is often cheaper than designing specific dedicated architectures. This trend has further increased the development of grid computing technology. Some kinds of grids are listed below:

• Computational Grids, which include distributed supercomputing, for large-scale numerical problems, and high-throughput computing grids, aiming at the scheduling of a large number of independent (or at least loosely related) tasks.
• Data Grids, aiming at knowledge repository and database sharing.
• Service Grids, which provide users with services of different kinds. Service Grids include on-demand computing grids, aiming at providing requesting users with short-term computational resources, and collaborative computing grids, which improve communication and interaction between different users.

The resources to be shared in a grid can be viewed either as physical resources or as services. We term "service" a network-enabled entity which provides some capability. Of course, a service is supported by physical resources. In a service-oriented view, the interoperability requirement can be achieved by a standard way of defining service interfaces and protocols to be used for service invocation. According to the key features required by the grid environment, three kinds of protocols are relevant:

1. Connectivity Protocols: these are related to communication and authentication purposes; in particular, authentication solutions for grid environments should have the following features:
   (a) Single sign-on: each user should be able to access grid resources by signing on only once;
   (b) Delegation: each user should be able to create programs which can negotiate grid resource access on his behalf, potentially with some restrictions;
   (c) Integration with local security solutions: each user should be able to interoperate with the local security solutions of resource providers;
   (d) User-based trust relationships: an authenticated user should be able to use resources from different providers together without any interaction between the providers.

2. Resource Protocols: these are related to individual resource access negotiation, initiation, monitoring, control, accounting and payment, and rely on Connectivity Protocols. Resource Protocols include Information Protocols, which are used to get information about the state of a given individual resource, and Management Protocols, which are used for sharing instantiation and monitoring.

3. Collective Protocols: these are related to the coordinated use of many individual resources, and rely on Connectivity and Resource Protocols. Collective Protocols provide implementations of many sharing behaviours, ranging from general to highly specific applications.

3. THE GLOBUS PROJECT
An open-source implementation of these protocols is the Globus project, which is now the de facto standard [Foster 1998]. This project, conceived by Foster and Kesselman [Foster 1997], aims at the design of a framework for integrated resource management for high performance computation purposes. Globus is focused on applying grid concepts to scientific and engineering computing. Within the Globus project, some low-level tools have been developed. These implement the basic functions for communication, authentication, network management and data access, which are used to create high-level services for high performance computation (e.g. programming tools and schedulers). The low-level toolset is made up of several modules, each defining an interface through which the high-level services can call the corresponding function. Each module also defines an implementation by which the high-level functions can be built in a platform-independent way. Existing modules are:

• a module for resource discovery and allocation, which provides mechanisms to set up application requests, to locate suitable resources, and to allocate them to the applications;
• a module for efficient implementation of the basic communication functions;
• a module for real-time monitoring of the configuration of the distributed system;
• an authentication interface, which provides user and resource authentication;
• a module for process generation, which aims at starting a computation on a machine on which the necessary resources have been detected and allocated;
• a data access module, which provides the functions for managing high speed access to persistent data.
This set of modules can be conceived as a virtual machine oriented to high-performance computing. In order to use grid services, Globus is released as a toolkit, that is a set of components which can be used either independently or together to develop grid applications and programming tools. The components of the Globus Toolkit are [GLOBUS]:

• The Globus Resource Allocation Manager (GRAM) provides resource allocation and process creation, monitoring, and management services. GRAM implementations map requests expressed in a Resource Specification Language (RSL) into commands to local schedulers and computers.
• The Grid Security Infrastructure (GSI) provides a single-sign-on, run-anywhere authentication service, with support for local control over access rights and mapping from global to local user identities. Smartcard support increases credential security.
• The Monitoring and Discovery Service (MDS) is an extensible Grid information service that combines data discovery mechanisms with the Lightweight Directory Access Protocol (LDAP). MDS provides a uniform framework for providing and accessing system configuration and status information such as compute server configuration, network status, or the locations of replicated datasets.
• Global Access to Secondary Storage (GASS) implements a variety of automatic and programmer-managed data movement and data access strategies, enabling programs running at remote locations to read and write local data.
• Nexus and globus_io provide communication services for heterogeneous environments, supporting multimethod communication, multithreading, and single-sided operations.
• The Heartbeat Monitor (HBM) allows system administrators or ordinary users to detect failure of system components or application processes.

For each component, an application programmer interface (API) written in the C programming language is provided for use by software developers. Command line tools are also provided for most components, and Java classes are provided for the most important ones.

4. RESOURCE DISCOVERY IN GLOBUS
In the Globus project, resource discovery is supported by the MDS toolkit component and takes place within "Virtual Organizations" (VOs), made up of groups who are building experimental and production grid infrastructures for their own purposes. There is great variety among VOs, since their size can range from 2 to O(1000s) involved institutions and their lifetime can last from minutes to years. Each VO has its own directory service with which participating systems register, so that others may discover them. MDS is composed of the Grid Index Information Service (GIIS), which contains data about the participating systems, and the Grid Resource Information Service (GRIS), which registers the system on which it is running with the GIIS.
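Since MDS publishes GIIS and GRIS information through LDAP, any LDAP client can browse it. The following Java sketch uses the standard JNDI API to list the entries below a VO branch; the host name is hypothetical, and the port (2135) and base DN (mds-vo-name=local, o=grid) are the customary MDS defaults of that Globus generation rather than values taken from this paper.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class MdsQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical GIIS host; 2135 and the base DN below are the usual
        // MDS defaults of that era, not values taken from the paper.
        String url = "ldap://giis.example.org:2135";
        String base = "mds-vo-name=local, o=grid";

        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);

        DirContext ctx = new InitialDirContext(env);
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);

        // List everything below the VO branch; a real client would narrow
        // the filter to the object classes it actually needs.
        NamingEnumeration<SearchResult> results =
                ctx.search(base, "(objectClass=*)", controls);
        while (results.hasMore()) {
            SearchResult entry = results.next();
            System.out.println(entry.getNameInNamespace());
        }
        ctx.close();
    }
}
```

A real client would restrict the search filter to the entries it is interested in, for instance host or queue information, instead of retrieving the whole subtree.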
Figure 1. A sample Virtual Organization composed of resources registered in three domains (red squares).

4.1. The GIIS

A Grid Index Information Service may be installed and run on one or more systems, so that people (or applications) can search the GIIS for participating systems and query their configuration data. The GIIS resolves user queries against VO resources, thus defining a specific view of them. The GIIS describes a class of servers which gather information from multiple GRIS servers. Each GIIS is optimised for particular queries and its operating principles are similar to those of web search engines. Users or applications can find a GIIS server in different ways. Each GIIS can:
• run on a well-known host/port;
• use DNS server records;
• be part of a referral tree;
• be indexed by another GIIS.

4.2. The GRIS

A Grid Resource Information Service is installed and runs on each resource. It provides resource-specific information (e.g. load, process information, storage information, etc.). The GRIS allows a "white pages" approach for the lookup of resource information, and a "yellow pages" approach for the lookup of resource options. In order to index the contents of a GRIS, a GIIS must first find it. Possible options are:
• a GIIS can be configured with GRIS hostnames;
• a GRIS registers with a GIIS during startup;
• a GIIS can walk a referral tree to find a GRIS.
The MDS component can support any of these approaches or combinations of them, the right one depending on the organization's requirements.

5. RESOURCE ALLOCATION IN GLOBUS

In Globus, resource allocation is performed by the Globus Resource Allocation Manager (GRAM) API. This provides a common standard for expressing resource requirements, relying on the Resource Specification Language (RSL). This way, different users can formulate their resource requests according to one formalism, despite local heterogeneity. GRAM also provides a common interface for exploiting remote resources. Within a Globus grid environment, this interface is commonly used for remote job submission and control (Fig. 2), which is the most widely supported resource sharing modality in Globus. GRAM provides neither scheduling nor resource brokering capabilities, nor accounting and billing features. However, GRAM services can easily be composed to set up scheduling and brokering mechanisms, and a wide variety of metaschedulers and resource brokers relying on GRAM services have been proposed [Czajkowski 2001b].
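As a rough illustration of the RSL formalism, the following sketch assembles a request string using the classic GRAM RSL relation syntax; the executable and the attribute values are made up for the example, and the submission step itself (for instance through the globusrun client or the Java CoG Kit) is only hinted at in the comments.

```java
public class RslExample {
    public static void main(String[] args) {
        // Hypothetical job: attribute names follow the classic GRAM RSL
        // syntax "& (attribute = value) ...", but the exact set of supported
        // attributes depends on the local GRAM and scheduler configuration.
        String rsl = "& (executable = \"/bin/hostname\")"
                   + " (count = 4)"
                   + " (maxTime = 10)";

        // A real request would be handed to a GRAM gatekeeper, e.g. through
        // the globusrun command-line client or the Java CoG Kit; here we only
        // show how such a request is expressed.
        System.out.println(rsl);
    }
}
```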
Figure 2. GRAM components.

6. MOBILE AGENTS AND RESOURCE DISCOVERY
In Globus, existing VOs are largely independent and are not yet "linked together" for shared use. Specifically, there is no universal GIIS that one can search to find all of the VOs and their resources. One can start one's own GIIS and tell others where it is, or contact existing VOs to ask where their GIIS is [GLOBUS]. Resource discovery is therefore necessary, because users typically know little about the ensemble of a VO's resources; it is also difficult, because of the heterogeneous and dynamic resource set and because of the rich set of possible queries.
The development of an enhanced resource management infrastructure is needed to support network scheduling, advance reservations, and policy-based authorization; this is the subject of ongoing research [GLOBUS]. On the other hand, mobile agents typically behave with intelligence and self-directed autonomy to carry out their function, and they are able to communicate with each other. Mobile agents can also accomplish resource brokering as well as local scheduling and resource accounting and billing; therefore the mobile agent paradigm may be helpful for resource discovery. Along this line, a large number of mobile agent based solutions have been proposed, so we think that the mobile agent paradigm could be usefully exploited in future Globus releases.

REFERENCES
[Czajkowski 2001a] K. Czajkowski, S. Fitzgerald, I. Foster, C. Kesselman, Grid Information Services for Distributed Resource Sharing, Proceedings of the Tenth IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), IEEE Press, August 2001.
[Czajkowski 2001b] K. Czajkowski, A.K. Demir, C. Kesselman, M. Thiebaux, Practical Resource Management for Grid-based Visual Exploration, Proceedings of the Tenth International Symposium on High Performance Distributed Computing (HPDC-10), IEEE Press, August 2001.
[Foster 1998] I. Foster, C. Kesselman, The Globus Project: A Status Report, Proc. IPPS/SPDP '98 Heterogeneous Computing Workshop, pp. 4-18, 1998.
[Foster 1997] I. Foster, C. Kesselman, Globus: A Metacomputing Infrastructure Toolkit, Intl. J. Supercomputer Applications, 11(2):115-128, 1997.
[Foster 2001] I. Foster, C. Kesselman, S. Tuecke, The Anatomy of the Grid: Enabling Scalable Virtual Organizations, International J. Supercomputer Applications, 15(3), 2001.
[Foster 2002] I. Foster, C. Kesselman, J. Nick, S. Tuecke, The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, Open Grid Service Infrastructure WG, Global Grid Forum, June 22, 2002.
[GLOBUS] The Globus Project F.A.Q. - http://www.globus.org/about/faq/general.html
[Sahai 1998] A. Sahai, C. Morin, Mobile Agents for Location Independent Computing, Proceedings of the Second ACM International Conference on Autonomous Agents, 1998.
A Mobile Agent Tool for Resource Discovery*

F. Agostaro a, A. Genco a, and S. Sorce a

aUniversity of Palermo (Italy), [email protected], {agostaro, sorce}@studing.unipa.it

In this paper we present a mobile agent based tool for arranging communities whose members want to share computing resources. Such a tool enables community members to suitably arrange their own parallel virtual machine, using resources available within the community. Mobile agents are used to search among available addresses inside the community, and are instructed to select the ones which correspond to the users' requirements. We also propose a possible implementation of the system with reduced capabilities, along with a graphical user interface used to set up servers and instruct agents.

1. INTRODUCTION

Grid technology has brought a new way of exploiting computational resources. Nowadays these can be seen as an integrated system made up of a set of interconnected computers, sometimes remotely located. A common feature of grid computing is coordinated and controlled resource sharing between the members of a dynamic, multi-institutional virtual community. The members of such a community agree on which resources will be shared and which members of the community will be enabled to access these resources, thus defining a set of sharing rules and permissions. Resource sharing can take place by using a suitable protocol architecture, in order to pursue interoperability between the members of a virtual community. In particular, the grid approach allows the configuration of a parallel machine which uses computing resources located on a network.

Parallel processing in a PC network is widely discussed in the literature. Some works deal with platforms such as PVM [Geist 1994], Linda [Bjornson 1988], and so on; some others with the load balancing problem and tools such as MPI [Gropp 1994] or P4 [Butler 1993]; and others with resource lookup and distributed virtual machine setup [Foster 1999a], [Foster 1999b]. There are several existing projects aiming at the design of a framework for the integrated management of high performance computation oriented services on distributed systems. Within most of these projects, some tools have been developed to implement the basic functions for communication, authentication, network management and data access. In this field, Globus [Foster 1998], [Foster 2002] is the best known and the one most likely to become a standard, mostly in the area of integrated grid services provision. In Globus, computing communities are arranged in "Virtual Organizations" made up of groups who are building experimental and production grid infrastructures for their own purposes.

*This work has been partly supported by the CNR (National Research Council of Italy) and partly by the MIUR (Ministry of Instruction, University and Research of Italy).
In this paper we mainly deal with the mobile agent technology as a means to arrange parallel virtual machines. To this end, a Virtual Parallel Computing Community is set up to allow its users to share computing resources and to use them. Mobile agents are used to search among available addresses within the community, and are instructed to select the ones which correspond to the users' requirements. The mobile agent technology is used because of its capability to store the execution state, migrate to the next server and restart the execution there [Sahai 1998].

2. PROBLEM FORMULATION

Regardless of the chosen software tool, users need to know the physical location of the machines involved in order to configure a parallel virtual machine. In fact, users with computational needs and users with available resources, even if actually connected, do not know each other's availability or needs (Fig. 1). This problem can be solved by setting up a community for resource sharing and by publishing availability and network locations. Members agree on which resources will be shared and which members of the community will be enabled to access them, thus defining a set of sharing rules and permissions.
Figure 1. Users and resources, even if connected, do not know each other's availability or needs.

In the following we propose a resource discovery system, by means of a Virtual Parallel Computing Community (hereafter VPCC) which uses the mobile agent technology. According to a chosen discovery strategy, mobile agents are used to search among a set of addresses and are instructed to single out those sites which fit the member requirements in terms of hardware and software resources. Users who become members of a Community publish their availability or needs (Fig. 2), so that they can interact according to a peer-to-peer resource provision protocol (Fig. 3).

3. SYSTEM COMPONENTS

Users who want to become members of the Community need to install agent-based software on their machines, to let their resources be shared and to enable them to look up and use community resources. The Community has three main components:
Figure 2. Availability and needs are published on the community.
Figure 3. Users and resources interact peer to peer.

1. the client, that is the mobile agent: this is the community's key component and acts for users who need computing resources;
2. the server application, which resides on the machines of the users whose resources are available and receives the mobile agents of the system;
3. a database to store and manage the network addresses and hardware information of the machines shared by their owners.

4. OPERATING PRINCIPLES

When a user starts acting as a Community member, a copy of the Agent Server is launched, waiting for clients' agents. The server notifies the database of its availability to receive tasks, by sending its hardware configuration in terms of CPU type and clock, available disk space, maximum number of executable tasks and maximum CPU load allowed. The database receives the server information and saves it in a table (Table of Availability) by updating it with the server entry (Fig. 4).

When a Community member declares his needs for computing resources, an agent is generated and instructed with the client's requirements (hard disk space, maximum number of executable tasks, connection speed, minimum CPU clock and maximum CPU load). The agent first steps into the database and collects the IP addresses of the machines matching the client's requirements, by suitably filtering the database. Once a list of available addresses is obtained, the agent steps into the network to test the actual availability of the collected servers.
Figure 4. The database is updated with server information (CPU, clock, hard disk, number of tasks, maximum load).

This is necessary because one or more of the servers which have notified the database of their availability may no longer be available, or may even be disconnected. For each IP address the agent tests whether the server matches the client's requirements. If the test succeeds, the agent books the server for a time interval long enough to complete the test round. If the test fails, it proceeds with the next server. Once the test round has ended, if the bookings placed by the agent allow the configuration required by the client, they are confirmed and the agent goes back to the client with the list of IP addresses, saved in a configuration file which can be used to configure the parallel virtual machine. Some further details are also reported, which can be used to plan a good load balancing strategy. If the bookings do not satisfy the client's requirements, the user is notified of the request failure.

The agent is the key component of the system, because it takes care of the server availability tests and of the decision whether or not to include servers in the virtual machine, according to the user's requirements. The database acts as an intermediary between clients and servers. For the servers it acts as a shop window in which they display their resources, while for the clients it is the starting point of the parallel virtual machine configuration search.
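As a minimal sketch of the behaviour just described, the following Java fragment shows the per-server test and booking decision an agent might apply during its round; the field names mirror the requirement and configuration items listed above, but the classes themselves are hypothetical, since the paper does not publish its data model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical data model: field names mirror the requirement and
// configuration items described in the text, not a published API.
record Requirements(int minClockMHz, int maxCpuLoadPercent,
                    int minFreeDiskMB, int tasksToRun) {}

record ServerStatus(String address, int clockMHz, int cpuLoadPercent,
                    int freeDiskMB, int freeTaskSlots) {}

public class DiscoveryAgent {

    /** The per-server test the agent performs at each visited address. */
    static boolean matches(ServerStatus s, Requirements r) {
        return s.clockMHz() >= r.minClockMHz()
            && s.cpuLoadPercent() <= r.maxCpuLoadPercent()
            && s.freeDiskMB() >= r.minFreeDiskMB()
            && s.freeTaskSlots() >= r.tasksToRun();
    }

    /** One test round over the addresses collected from the database. */
    static List<String> testRound(List<ServerStatus> candidates,
                                  Requirements r, int hostsNeeded) {
        List<String> booked = new ArrayList<>();
        for (ServerStatus s : candidates) {
            if (matches(s, r)) {
                booked.add(s.address());   // book this server for the round
            }
            if (booked.size() == hostsNeeded) {
                break;                     // enough hosts found, stop the round
            }
        }
        // Bookings are confirmed only if the required configuration is
        // reachable; otherwise the client is notified of the failure.
        return booked.size() >= hostsNeeded ? booked : List.of();
    }
}
```

In the real system the booking itself is a remote interaction with the visited server, and the confirmation pass is a second round over the booked addresses, as described above.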
5. AN IMPLEMENTATION

In order to try out the solutions described above, we propose a feasible implementation of the system, along with a graphical user interface used to set up servers and instruct agents. Users who want to become members of the community have to install the agent-based software tool to allow their machines to act as clients or servers, according to their needs or availability. When the tool is launched, the "VPCC Setup" window is shown (Fig. 5). In this screen members decide whether to supply computing resources, acting as servers, or to request resources, acting as clients.

Users who want to act as servers select the "Supply" button. The "VPCC - Server setup" window then opens, allowing them to enter all the technical data of the local machine (Fig. 6). When all specifications have been entered, clicking on "OK" makes the application send the information to the database and start the Agent Server, which waits for clients' agents.

Users who want to act as clients select the "Request" button on the "VPCC - Setup" window (Fig. 5). The "VPCC - Client setup" window then opens, allowing them to enter all the specifications needed to instruct the client's agent (Fig. 7). Once all data have been entered, clicking on the "OK" button launches the mobile agent, which searches among the addresses listed in the database. The agent's "owner" then waits for the agent to come back. At the end of its search, the agent returns the address list of available and suitable machines, or a message reporting that no suitable machines are available.
Figure 5. First screen, where the user decides whether to supply or request computing resources.
Figure 6. Server setup window, used to enter the configuration of the supplied resources (user ID, CPU type and clock, memory, operating system, network connection speed, maximum tasks per user, maximum disk space per user, maximum CPU load).

6. TOOL OPERATING TEST

We used the tool to try to arrange a parallel machine made up of two hosts, each complying with the requirements shown in Fig. 7. Notice that we implemented a tolerance parameter for the connection time between client and server. This allows the configuration of a parallel virtual machine with reduced capabilities, if not all available machines match the client requirements. The community was composed of the following machines:

1. Pentium III 500 MHz, 256 Mb RAM, 60 Gb hard disk, static IP address, running the database and a copy of the Agent Server on Windows XP Pro;
2. Pentium II 333 MHz, 256 Mb RAM, 20 Gb hard disk, dynamic IP address, running a copy of the Agent Server on Windows 98 SE;
Figure 7. Client agent setup window, used to instruct the agent (user ID, maximum ping time, number of hosts, maximum tasks per host, minimum clock, minimum performance index, maximum CPU load per host, disk space per host).

3. Pentium III 1000 MHz, 256 Mb RAM, 40 Gb hard disk, static IP address, running a copy of the Agent Server on Windows 2000 Pro;
4. Pentium III 1000 MHz, 512 Mb RAM, 20 Gb hard disk, dynamic IP address, running the client software on Windows XP Pro.

In our tests, the client had to wait less than 5 minutes to obtain the configuration file listing the available addresses (Fig. 8). This is a good result, since the client does not know anything about the available hosts when it decides to configure its own parallel virtual machine.

Agentlog.log:
24/07/2003 - 15:41:35 - Agent started
Local IP 151.29.128.44
Migrating to DB server
24/07/2003 - 15:41:48 - Migration OK
Local IP 147.163.45.18
Querying database... done
List of 03 addresses to be checked
Migrating to 1st address
24/07/2003 - 15:42:42 - Migration OK
Local IP 147.163.45.18
Performance check... done
Suitable performances
Booking... done
Migrating to 2nd address
24/07/2003 - 15:43:28 - Migration OK
Local IP 213.255.28.157
Performance check... done
Performances OUT OF BOUNDS
Migrating to 3rd address
24/07/2003 - 15:45:03 - Migration OK
Local IP 147.163.3.30
Performance check... done
Suitable performances
Booking... done
End of address list
Migrating to 1st address
24/07/2003 - 15:45:21 - Migration OK
Local IP 147.163.45.18
Confirm booking... done
Migrating to 3rd address
24/07/2003 - 15:46:09 - Migration OK
Local IP 147.163.3.30
Confirm booking... done
Coming back home
24/07/2003 - 15:46:27 - Migration OK
Local IP 151.29.128.44
Generating text file... done
AGENT TERMINATED

Figure 8. The agent migration log.
7. CONCLUSIONS AND FUTURE WORK

In this work we presented a mobile agent based system for resource discovery in a virtual parallel computing community. The proposed system turned out to be efficient, fault tolerant,
and capable of dynamically adapting itself to environment changes. We want to point out that mobile agents are only used to search among the available addresses within the Community. Future work on these topics will include security and authentication features and enhanced management of booked servers.

REFERENCES
[Bjornson 1988] R. Bjornson, N. Carriero, D. Gelernter, J. Leichter, Linda, the Portable Parallel, Yale University Department of Computer Science, 1988.
[Butler 1993] R. Butler, E. Lusk, Monitors, Messages, and Clusters: The p4 Parallel Programming System, Argonne National Laboratory, 1993.
[Foster 1998] I. Foster, C. Kesselman, The Globus Project: A Status Report, Proceedings of the IEEE Heterogeneous Computing Workshop, March 1998.
[Foster 1999a] I. Foster, C. Kesselman, Computational Grids, Chapter 2 of "The Grid: Blueprint for a New Computing Infrastructure", Morgan Kaufmann, 1999.
[Foster 1999b] I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt, A. Roy, A Distributed Resource Management Architecture that Supports Advance Reservations and Co-Allocation, International Workshop on Quality of Service, 1999.
[Foster 2002] I. Foster, C. Kesselman, J. Nick, S. Tuecke, The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, Open Grid Service Infrastructure WG, Global Grid Forum, 2002.
[Geist 1994] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine, MIT Press, 1994.
[Gropp 1994] W. Gropp, E. Lusk, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994.
[Sahai 1998] A. Sahai, C. Morin, Mobile Agents for Location Independent Computing, Proceedings of the Second ACM International Conference on Autonomous Agents, 1998.
Mobile Agents and Knowledge Discovery in Ubiquitous Computing*

A. Genco a

aUniversity of Palermo, [email protected]

In this paper we discuss a knowledge discovery strategy to be performed by mobile agents in an Augmented Reality (AR) scenario. AR entities are implemented by mobile agents which perform all the behaviours of cooperating entities. Among other tasks, mobile agents implement a resource discovery strategy, which is aimed at providing lacking entities with missing methods and knowledge rules. A three layer entity description model and a cooperation mechanism are discussed which allow knowledge and methods to be shared between entities in augmented reality.

1. INTRODUCTION

The evolution of information management models has led us to the Ubiquitous Computing (UC) paradigm [Weiser 1991], which extends the distributed computing idea. According to the UC view, computers will be perceived as an artificial expansion of reality. Interaction with the digital world will be possible at any time and anywhere, in a fashion as similar as possible to natural interaction. Interaction between these devices and natural elements will take place according to both natural and artificial intelligence laws. AR is a relevant topic in UC, since UC arises from the integration of digital devices into natural reality. An AR system is a distributed system which has to manage two basic classes of resources: physical resources and logical resources [Beigl 2001] [Kangas 2002] [McCarthy 2001] [Schmalstieg 2002]. Physical resources are digital devices working in a real environment; logical resources are software entities which act within a hidden computational grid [Want 2002] [Streitz 2001] and interact according to some distributed paradigm. An AR layered model is then suitable to allow entities to interact according to different projections in virtual or physical relation spaces. According to such a model, a cooperation mechanism and a semantic routing strategy are needed to share virtual and physical resources. Entity cooperation can be negotiated by means of some message passing communication protocol, driven by an upper-level execution logic of the entities. Entity semantic description can be performed by a wide variety of formalisms (semantic networks, frames [Baker 1998], hypertext [Shipman 2002], etc.), programming tools and paradigms (object oriented programming, scripting languages, logic programming, etc.). The key issue is that both declarative and procedural programming should be adopted. To meet this requirement, some semantic network based representation tools have been proposed [Ali 1993A], [Ali 1993B].

*This work has been partly supported by the CNR (National Research Council of Italy) and partly by the MIUR (Ministry of Instruction, University and Research of Italy).
We chose to follow a hybrid implementation approach, and to improve execution performance by procedural methods triggered by declarative clauses.

2. THREE LAYER ENTITY STACK

We started from the idea that knowledge and methods in AR are distributed over AR entities. To this end we conceived an AR entity description model, which is based on a three layer stack [Genco 2003] and allowed us to distinguish the semantic from the physical world and to bind them by means of some resource management middleware (Fig. 1).
Fig. 1. Three Layer Entity Interaction Model
Three layers host three different projection descriptions of an AR entity, which are:

• a semantic projection for knowledge maintenance and management;
• a middleware projection for procedural logic implementation;
• a physical projection for physical resource management.

The semantic projection of an entity uses a common ontology and peer-to-peer communication protocols to interact with the semantic items of other entities. An overall execution logic implements knowledge processing and method activation. The semantic and middleware projections account for the non-visible part of an AR entity, and include knowledge bases and collections of methods. The physical projection accounts for the physical resources in the visible part of an AR entity. It acts through a sequence of exposition-perception cycles and allows an entity to interact with the physical projections of other entities by means of multimedia devices. Each of these layers works in a multithreaded environment, thus letting entity projections interact with many other entities at a time.
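A minimal sketch of how the three projections could be exposed to the middleware is given below; the interface and method names are purely illustrative assumptions, since the paper defines the projections conceptually rather than as an API.

```java
// Hypothetical interface names: the paper describes the three projections
// but does not prescribe a programming interface for them.
interface SemanticProjection {
    boolean verify(String predicate);   // knowledge processing
    void cooperate(String peerEntity);  // peer-to-peer semantic interaction
}

interface MiddlewareProjection {
    void invoke(String methodName);     // procedural methods triggered by the semantics
}

interface PhysicalProjection {
    void expose();                      // one exposition step of the exposition-perception cycle
    void perceive();                    // the matching perception step
}

/** An AR entity is the composition of its three projections. */
interface AREntity extends SemanticProjection, MiddlewareProjection, PhysicalProjection {}
```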
3. THE SEMANTIC PROJECTION STRUCTURE

The semantic projection is split into a set of logical sub-layers, which describe the behaviour semantics of an entity. Each layer is equipped with a specific logic and a mechanism for linking the middleware projection and its procedural methods.
Fig. 2. Semantic Layer Structure
Fig. 2 describes the main logical layers into which the semantic projection is split. The semantic projection also includes an overall execution logic and a knowledge base. The main logical sub-layers are:

• Maintenance: entity update;
• Consistency: entity prototype and reason of being. At no time can an entity accomplish a task or a cooperation request if it does not comply with the rules in its consistency layer;
• Vocation: logic planning and coordination;
• Ability: this layer contains the access rules to the local knowledge base, also to be used when an entity is calling for cooperation;
• Survival: knowledge, rules and methods dealing with security and fault tolerance issues;
• Instinct: the default reaction behaviour of an entity, to be called by the sub-layer selection logic.

Each entity behaviour is implemented by a FIPA compliant [FIPA SC and XC] agent, and includes one main permanent role and some transient tasks which can be changed for specific actions. At any time, a logical sub-layer is chosen by the selection logic, which is part of the overall execution logic in the semantic projection. Layers can interact with each other, either through the selection logic or directly, between adjacent layers.
4. COOPERATION AND CREATION MECHANISMS

We suppose AR resources are to be shared among entities. Therefore, an entity can be created without a full set of knowledge and methods; these can be supplied by other entities somewhere in AR. Entities can even be conceived whose logical resources are pure knowledge, made up of a simple rule and no methods. We call these entities semantic cells. A semantic cell is just a prototype, which is unable to run procedural executions itself, since it is not provided with any method. Nevertheless, when a procedural execution is required for task accomplishment, a semantic cell can use the cooperation mechanism and ask for help from other entities. Generally speaking, an entity which needs to use a missing resource (hereafter LE, Lacking Entity) delivers a mobile agent with the task of discovering that resource in the AR environment.

In the following we discuss a cooperation mechanism to discover and share knowledge and methods among entities. The mechanism is efficient, fault tolerant, and capable of dynamically adapting itself to AR environment changes. It also includes semantic routing, thus allowing entities, semantic cells, and methods to be bound in a semantic process, and it ensures a reduction of AR redundancy.

An AR entity usually undertakes the execution of a method as a result of some reasoning or knowledge process. This led us to represent entity knowledge in a hybrid declarative-procedural fashion, by means of a sequence of clauses, each establishing a relationship among terms. Driven by its logic, entity execution starts from a verification process on its knowledge base. Procedural logic is then linked to the knowledge base by attaching methods to rule predicates. The execution of a method is automatically activated when the corresponding predicate is verified. Moreover, rule predicates may be expanded by other rules. A semantic gap occurs when expansion rules, along with any attached methods, are not found in an entity's knowledge base. Two cases may occur:

1. the expansion rules and methods are not available in the whole AR system;
2. the expansion rules and methods are owned by other entities.

In the first case, knowledge and methods must be provided to the LE from outside. In the second case, cooperation can be requested from an entity which is supposed to contain the requested knowledge and methods (hereafter FE, Friend Entity). Then, the predicate verification process goes on in the knowledge base of the selected FE. However, such a cooperation can start only if the LE knows which other entity owns the requested rules and methods. To this end, each entity in our model is equipped with an FE Table (FET), i.e. a set of couples <predicate, friend entity>, which allows unexpanded predicates to be bound to entities which are expected to contain expansion rules for them.

Unfortunately, AR environments are strongly dynamic: entity knowledge and methods might change over time. This entails that an LE might have a reference in its FET to an FE which is expected to contain expansion rules for some predicates, but actually does not. Once an FE has queried its ability layer and found such a fault, it can recursively behave as an LE, and then exhibit a semantic gap. The opposite situation may occur as well: an LE does not contain a reference to an FE because this FE acquired its knowledge elements and methods only after the LE FET was set up. In this case, knowledge and methods need to be searched for across the whole AR system.
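The following sketch illustrates the hybrid declarative-procedural scheme under simple assumptions: predicates are plain strings, the knowledge base stores their truth values, and a method attached to a predicate is run as soon as the predicate is verified; a missing predicate is reported as a semantic gap so that the cooperation mechanism can take over. All class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical classes illustrating the hybrid declarative-procedural scheme:
// a rule binds a predicate to an attached method that fires once the predicate
// has been verified against the local knowledge base.
public class HybridKnowledgeBase {

    private final Map<String, Runnable> attachedMethods = new HashMap<>();
    private final Map<String, Boolean> facts = new HashMap<>();

    void addRule(String predicate, Runnable method) {
        attachedMethods.put(predicate, method);
    }

    void assertFact(String predicate, boolean value) {
        facts.put(predicate, value);
    }

    /**
     * Verifies a predicate; returns false ("semantic gap") when neither a fact
     * nor an expansion for it is known locally, so that the entity can fall
     * back on the cooperation mechanism described in the text.
     */
    boolean verify(String predicate) {
        Boolean value = facts.get(predicate);
        if (value == null) {
            return false;                          // semantic gap: ask a friend entity
        }
        if (value && attachedMethods.containsKey(predicate)) {
            attachedMethods.get(predicate).run();  // procedural method triggered
        }
        return value;
    }
}
```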
Once an FE has been located, it can react to an LE request in three ways:
1. The FE can update the LE's FET by adding a couple whose second element is its own name. This way, the FE becomes available for direct cooperation with the LE. This solution appears to be the simplest and quickest. However, it can entail a bottleneck effect, since an FE can be invoked for cooperation by a large number of LEs.
2. The FE can update the LE by providing the requested knowledge and methods. This solution overcomes the bottleneck effect but, when there are many LEs, it can entail overload due to the LE update process. Moreover, the same update may be needed on a large number of LEs, thus resulting in system redundancy. Finally, some delay also occurs in cooperation due to the update completion time.
3. The FE can generate a new entity equipped with the requested knowledge and methods. After that, the LE's FET has to be updated by adding an entry with the newly created entity's name. This way a new entity becomes available for direct cooperation with the LE. This last approach requires a creation mechanism by which an entity (creator) can generate another entity, but it increases system efficiency. In order to overcome security drawbacks, the initial knowledge and methods of the created entity are arranged as a subset of those of the creator entity. After that, created entities can evolve independently of their creators.

5. SEMANTIC ROUTING

We now describe our semantic routing strategy, based on the ideas outlined above. Each FET includes a given number of entries which allow an LE to start a direct cooperation with an FE. One additional entry is also available to select a default FE, which can be invoked to reach other entities not yet known to the LE. When a semantic gap occurs, an LE first seeks the missing predicate in its FET. If found, a cooperation starts with the selected FE. If not found, the resource discovery process goes on by means of the default FE. If the semantic gap still cannot be filled, the LE can ask its creator for the missing knowledge and methods. At this point, if the creator can supply what is requested, it can act according to one of the three modalities previously outlined, namely:

• it can behave as a cooperating FE (Fig. 3a);
• it can update the LE with the requested knowledge and methods (Fig. 3b);
• it can create a new entity to be used as a cooperating FE (Fig. 3c).

If, instead, a creator lacks the requested knowledge and methods, it can invoke its own creator in turn, so that creator calls may take place in a recursive fashion. In this case, when the creator hierarchy has been queried up to a root creator, the semantic gap is shown to a system manager outside the system, who is requested to fill the AR system's semantic gap by providing the root creator with the requested knowledge and methods. To implement this feature, we think each AR system should be equipped with a super-root original entity, to be invoked by the general system manager to generate a given number of hierarchy top creators. This operation resembles account creation in usual operating systems.
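A compact sketch of the routing order just described is given below: the FET is tried first, then the default friend entity, then the creator hierarchy, and a failure at the root corresponds to the case reported to the system manager. The class is hypothetical and deliberately ignores FET updates and entity creation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of the routing order described above: FET entry first,
// then the default friend entity, then the creator hierarchy, and finally the
// system manager when even the root creator cannot fill the gap.
public class SemanticRouter {

    private final Map<String, String> fet = new HashMap<>(); // <predicate, friend entity>
    private final String defaultFriend;                      // additional FET entry
    private final SemanticRouter creator;                    // null only for a root creator

    public SemanticRouter(String defaultFriend, SemanticRouter creator) {
        this.defaultFriend = defaultFriend;
        this.creator = creator;
    }

    public void register(String predicate, String friendEntity) {
        fet.put(predicate, friendEntity);
    }

    /**
     * Tries to fill a semantic gap for the given predicate. canCooperate is a
     * stand-in for the real test "does this friend entity own the expansion
     * rules and accept the cooperation?".
     */
    public boolean fillGap(String predicate, Predicate<String> canCooperate) {
        String friend = fet.get(predicate);
        if (friend != null && canCooperate.test(friend)) {
            return true;                                      // direct cooperation with an FE
        }
        if (defaultFriend != null && canCooperate.test(defaultFriend)) {
            return true;                                      // the default FE fills the gap
        }
        if (creator != null) {
            return creator.fillGap(predicate, canCooperate);  // climb the creator hierarchy
        }
        return false;                                         // root reached: report to the system manager
    }
}
```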
Fig. 3a. A creator proposes itself as a cooperating entity
Fig. 3b. A creator provides the lacking entity with the required knowledge and methods
Fig. 3c. A creator generates a new entity for cooperation with the lacking entity
The routing mechanism can be further explained by detailing Fig. 3a. Creators are represented by triangles. Cij is the j-th entity created by the creator Ci; the notation Cijk can be defined recursively. In Fig. 3a a generic entity Cij falls into a semantic gap. It calls its default FE, C1411, for cooperation. This in turn calls its default entity. However, cooperation cannot be established in this case, and therefore C1411 calls its creator, C141, for cooperation. This mechanism goes on up to the creator C1, which owns the requested knowledge and methods in its knowledge base. Finally, the top creator can start a cooperation directly with the requesting entity Cij.

6. CONCLUSIONS

We presented a mobile agent based cooperation model between entities in augmented reality. On the basis of a three projection AR entity description model, a cooperation mechanism and a semantic routing strategy for knowledge and method sharing in AR have been discussed. These turned out to be efficient, fault tolerant, and capable of dynamically adapting themselves to AR environment changes.

REFERENCES
[Ali 1993A] S.S. Ali, A Propositional Semantic Network with Structured Variables for Natural Language Processing, Proceedings of the Sixth Australian Joint Conference on Artificial Intelligence, November 17-19, 1993.
[Ali 1993B] S.S. Ali, S.C. Shapiro, Natural Language Processing Using a Propositional Semantic Network with Structured Variables, Minds and Machines, 3(4), November 1993. Special Issue on Knowledge Representation for Natural Language Processing.
[Baker 1998] C.F. Baker, C.J. Fillmore, J.B. Lowe, The Berkeley FrameNet Project, Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, 1998.
[Beigl 2001] M. Beigl, H. Gellersen, A. Schmidt, Mediacups: Experience with Design and Use of Computer-Augmented Everyday Artifacts, Computer Networks 35, 2001, Elsevier.
[FIPA SC00001L] FIPA - Foundation for Intelligent Physical Agents, Abstract Architecture Specification, SC00001L, www.fipa.org.
[FIPA XC00086D] FIPA - Foundation for Intelligent Physical Agents, Ontology Service Specification, XC00086D, http://www.fipa.org, 2000.
[Genco 2003] A. Genco, HAREM: the Hybrid Augmented Reality Exhibition Model, to appear in Proceedings of the 2nd WSEAS International Conference on E-ACTIVITIES 2003, Singapore, December 7-9, 2003.
[Kangas 2002] K. Kangas, J. Roning, Using Mobile Code to Create Ubiquitous Augmented Reality, Wireless Networks 8, 2002, Kluwer.
[McCarthy 2001] J. McCarthy, The Virtual World Gets Physical, IEEE Internet Computing, Nov.-Dec. 2001.
[Schmalstieg 2002] D. Schmalstieg, G. Hesina, Distributed Applications for Collaborative Augmented Reality, Proceedings of IEEE Virtual Reality, 24-28 March 2002, pages 59-66.
[Shipman 2002] F. Shipman, J.M. Moore, P. Maloor, H. Hsieh, R. Akkapeddi, Semantics Happen: Knowledge Building in Spatial Hypertext, Proceedings of the Thirteenth Conference on Hypertext and Hypermedia, College Park, Maryland, USA, 2002, pages 25-34.
[Streitz 2001] N. Streitz, The Role of UC and the Disappearing Computer for Computer-Supported Cooperative Work, Groupware, IEEE Proceedings, Seventh International Workshop on, 6-8 Sept. 2001.
[Weiser 1991] M. Weiser, The Computer for the 21st Century, Scientific American, Sep. 1991, www.ubiq.com/hypertext/weiser/SciAmDraft3.html.
[Want 2002] R. Want, T. Pering, G. Borriello, K. Farkas, Disappearing Hardware, IEEE Pervasive Computing, Volume 1, Issue 1, Jan.-March 2002, pages 36-47.
Author & Subject Index
A U T H O R INDEX Adamidis, RA. Agostaro, F. Aldinucci, M. Almeida, F. Aloisio, G. Alonso, J.M. Amar, A. Andersson, U. Andonov, R. Anshus, O.J. Athanasaki, M Avellino, G. Aversa, R. Bachhiesl, P. Backfrieder, W. Badia, R.M. Baiardi, F. Bamha, M. Bartol, T.M. Beco, S. Behrens, J. Bell, R. Benkner, S. Benner, E Bergamaschi, L. Berti,G. Bidmon, K. Bjomdalen, J.M. Bjorstad, P.E. Blasi, E. Blikberg, R. Bodin, F. Boenisch, T.E Bongo, L.A. Bonhomme, A. B6rner, S. Boucaud, Ph. Boulet, P. Bourneb, E Brim, L. Brunst, H.
379 905,913,919,927 63,617 387,525 599 339 31 517 525 879,887 233 635 803 535 705 769 169 47 685 635 177 761 705 251 275 705 493 879 845 599 787 355 551 879 55 501 355 31 661 297 737
Cabibbo, N. Cafaro, M. Cai, X. Campa, S. Carrington, L. Castaings, W. Castillo, M. Cern~i, I. Chapman, B. Chaumette, S. Chiello, A. Cirou, B. Ciullo, P. Clemantis, A. Coppola, M. Cornelis, F. Coulaud, O. Counilh, M.C. Cruise, R. Culloty, J. Dafas, P. D'Agostino, D. Danelutto, M. De Bosschere, K. De Pietri, R. de Sande, F. Deerinck, T.E. Deister, F. Dekeyser, J.-L. de Luca, S. Di Martino, B. Di Santo, M. Diverio, T.A. Di Carlo, F. Di Renzo, F. Dorneles, R.V. Dorta, I. Drews, F. Drosinos, N. Dussere, M. Eicker, N.
355 599 837 617 777 403 251 297 795 135,305,321 913 569 617 159 617 39 151 569 719 395 677 159 63,617 39, 467 355 387 685 331 31 355 803 609 283,543 355 355 543 185 219 233 151 559
946 Ekman, P. Ellisman, M.H. Elssel, K. Elster, A.C. Epicoco, I Esnard, A. Exbrayat, M. Fearing, Ch. Feng, W. Fenner, J. Ferrero (Jr.), J.M. Fingberg, J. Fiore, S. Fleuren, M.J. Ford, W.C Froehlich, D. Funk, W.
517 685 243 371,887 599 151 47 509 653 705 339 705 599 347 685 403 103
Gao, X. 777 Gava, F. 95 Geisler, S. 431 Genco, A. 897, 905, 913,919, 927, 935 Gianuzzi, V. 159 Gimenez, J. 777 Gomoluch, J. 677 Goumas, G. 233 Grange, P. 135 Gribble, C. 13 Guerri, D. 169 Gundersen, G. 119 Haan, O. Haas, EW. Halada, L. Hallgren, T. Hansen, C. Hart, D. Hejtm~nek, L. Hellmuth, O. Hem~ndez, V. Hess, M. Hickey, D. H61big, C.
177 551 853 3 71 13 719 297 363 339 551 509 283,543
Holmgren, S. Hood, R. Hossfeld, F. Huang, L. Huedo, E.
475 811 3 795 579
Iben, U. Ierotheou, C.S. Isaila, F.
379 811 559
Jin, H. Johansson, H. Johnson, S.P. Jones, D. Jones, P. Juckeland, G. Jung, M. Kaldass, H. Kao, O. Karagiorgos, G. Kelz, M. Kendall, R.A. Kluge, M. Knoth, O. K6chel, E K6hnlein, J. Kohring, G. K611ing, S. Konagaya, A. Korch, M. Kornblueh, L. Koster, J. Koziris, N. Kozlenkov, A. Kr~imer, W. Krammer, B. Kranzlmiiller, D. Krischok, B. Labarta, J. Laskowski, E. Le Dimet, F.X. Le6n, C.
423,811 475 811 705 777 501 267 355 219 225 535 795 501 363 313 587 705 501 669 209 177 845 233 677 283 493 143 551 769, 777 449 403 185
947 Lee, C. Lef6vre. L. Li, H. Lines, G.T. Lippert, T. Llorente, I.M. Lonardo, A. Loulergue, F. Lu, C. Lucka, M. Lukyanov, M. Maebe, J. Malony, A.D. Mancini, M. Maraschini, A. Marinotto, A.L. Martinez, A. Massaioli, F. Matthews, G. Matthey, T. Mazauric, C. Mediavilla, E. Melichercik, I. Mendes, C.L. M6trot, B. Micheli, J. Middleton, S.E. Migliardi, M. Miller, B.P. Miller, W. Mirgorodskiy, A.V. Missirlis, N. Mocavero, S. Mohr, B. Molt6, G. Monserrat, M. Montero, R.S. Morenas, V. Moreno, L.M. Mori, E Moritsch, H. Moschny, T. Moskvin, V. Mtiller, M.S.
777 55 719 837 559 579 355 79,95,127 729 853 355 39 761 159 635 543 275 819 811 861 403 387 853 729 135 355 705 87 745 291 745 225 599 753 339 339 579 355 525 169 71 559 719 493
Munz, C.-D. Nagel, W.E. Narumi, T. Navaux, P.O.A. Norcen, R.
379 501,737 669 543 827
Ohno, Y. Oscoz, A. Oster, P.
669 387 517
Pacini, F. Papiez, L. Parker, S. Paschedag, N. Pattillo, J.M. Pekurovsky, D. Pene, O. P6rez, M. Pesciullesi, E Pfarrhofer, R. Pfltiger, S. Picinin Jr., D. Pimentel, F. Pini, G. Pleiter, D. Podesta, R. Poirriez, V. Prem, H. Pschernig, E.
635 719 13 355 685 661 355 525 617 535 501 543 291 275 355 87 525 643 193
Quintana-Orti, E.S. Quintana-Orti, G.
251 251
Rabenseifner, R. Rak, M. Ranaldo, N. Rapuano, F. Rasin, I. Rauber, T. Ravazzolo,R. Reed, D.A. Rehse, U.
379 803 609 355 291 23,209 617 729 291
948 Reilein, R. Renner, E. Rerrer, U. Resch, M.M. Ricci, L. Riedel, M. Rieger, H. Rizzi, R.L. R6ding, H. Rodriguez, C. Rodriguez, G. Rojas, A. Roman, J. Ronsse, M. Riinger, G.
23 363 219 493 169 313 331 543 501 185,525 769 185 569 39 23,485
Saiz, J. 339 Sartori, L. 355 Scarpa, M. 143 Schaubschl~iger, Ch. 143 Schifano, F. 355 Schmidt, J.G. 705 Schr6der, W. 363 Schroeder, M. 677 Schulze, R.W. 457 Seidl, S. 501 Sejnowski, T.J. 685 Serafini, T. 259 Shende, S. 761 Sheppard, R.W. 719 Shindyalov, I. 661 Simma, H. 355 Sips, H.J. 111 Slaby, E. 103 Snavely, A. 777 Sorce, S. 905, 913, 919, 927 Sorensen, K.A. 331 Sorevik, T. 787, 861 Srinivasa Raghavan, N.R. 643 St6gner, H. 535 Stfiben, H. 347 Steihaug, T. 119 Stewart, C.A. 719 Stiles, J.R. 685 Stougie, B. 39
Strey, A.
201
Taiji, M. Tan, G. Terracina, A. Theeuwen, F. Tichy, W.F. Tomko, K. Torquati, M. Tran, V.D. Trautmann, S. Tremel, U. Tripiccione, R. Tudruj, M. Tveito, A. Tzaferis, F.
669 423 635 31 559 509 617 403 485 331 355 449 837 225
Uhl, A.
193,535,827
Vaglini, L. van Reeuwijk, K. Vandierendonck, H. Vanneschi, M. Venticinque, S. Vign6ras, E Vik, T. Villano, U. Vinter, B. Vivini, P. Volkert, J. Voss, H.
169 111 467 617 803 135,305,321 371 803 871,887 355 143 243
Wakatani, A. Wallin, D. Walsh, P. Weatherill, N.P. William, T. Wilsey, EA. Wloch, R. Wolf, F. Wolke, R. Wolter, N. Wrona, F. Wu, S.
415 475 395 331 501 509 501 753 363 777 379 423
949
Xian, F.
423
Yahyapour, R.
627
Zanghirati, G. Zanni, L. Zegwaard, G.F. Zima, H.P. Zimeo, E. Zoccolo, C.
259 259 347 441 609 617
SUBJECT INDEX A-calculus
127
Action Potential Propagation Active messages Adaptive control Adaptive execution Adaptive multilevel substructuring (AMLS) Adaptive scheduling Additive Schwarz algorithms All-to-all communication Analytical models Automatic OpenMP code generation Automatic parallelization Automatic performance analysis
339 509 729 579 243 579 837 861 769 811 233 753
Balancing domain decomposition 845 BenchIT 501 Benchmarking 501, 517 Beowulf cluster 291 Bidomain equations 837 Biharmonic problem 267 Bioinformatics 677, 719 Biomechanics 705 BLAST 653 Block matching 193 Block preconditioning 837 Bulk Synchronous Parallelism 95, 127 Cache simulation 475 CAD/CAE 103 Cardiac tissues 339 Cavitating Flows 3 79 CFD 371 Chemistry-transport modeling 363 Cluster 169, 431 Cluster computer 283,543,871 Cluster Computing 423,559 Cluster File Systems 551 Cluster of SMPs 485 Codebook generation 415
Communication Communication middleware Compiler tools Components Computational Biology Computational Chemistry Computational steering Configure Conjugate gradient Consistency CORBA COTS Custom designed architectures Customizable protocol
897 509 23 31, 617 719 669 151 879 275 55 31 871 355 321
DAG 569 Data and task parallelism 111 Data assimilation 403 Data decomposition 543 Data Mining 905 Data races 39 Data redundancy 661 Data transmission time 457 Debugging 143 Debugging tool 493 Design patterns 617 Diesel Engine 599 Diffusion Method 225 Distributed algorithms 297 Distributed Computing 31,677 Distributed Dataflow 31 Distributed Shared Memory 55, 169 Distributed Virtual Environments 587 Distribution algorithm 787 Divisible tasks 219 Domain decomposition 363 DReAMM 685 Dynamic feature extraction 431 Dynamic instrumentation 745 Eigenvalue Ethernet EU 5th Framework Event-Oriented
243 871 34 7 79
952 Explicit Substitution Fault injection software Field computers Finance, portfolios, investment Full-system profiling Functional Programming
127 729 457 853 745 79, 95
Genetic Algorithms 313 Genome sequencing 653 Global Arrays 795 Globus Toolkit 599 Gradient projection methods 259 Grid 609, 627 Grid Computing 579, 599,643,705,913,919 Grid Workload Management 635 Hardware performance counter High accuracy High performance High Performance Computing Hydraulics models Imperative features Inspector Instrumentation Inter-bank dispersion Interior-point methods Interoperability Introspection Itanium-2 Iterative methods Iterative substructuring
753 283 617 135,339 403 95 569 39 467 853 617 441 517 225 845
Jagged arrays 119 Java 111, 119, 135,305,321 Job Description Language 635 Join skew 47 JPEG 2000 827 Kahn Process Networks Kernel performance analysis
31 745
KIVA3 Knowledge discovery
599 935
Laplace's differential equation 457 Large scale scientific visualization 13 Lattice Boltzmann method 291 Lattice QCD 559 Linear systems of equations 283,845 Linear time-invariant systems 251 Linux cluster 313 Load balancing 219,225,363,379 M-VIA 509 MATLAB 535 MCell 685 MDICE 535 Medical simulation 705 Mesh simplification 159 Message passing 169, 509 Metacomputer 395 Metacomputing 87 Micro-GA 599 Migratable-threads 87 Mixed task and data parallelism 23 Mobile Agents 897, 905,913,919, 927, 935 Mobility 305, 897 Model reduction 251 Modeling 777 Molecular dynamics 669 Momentum project 347 Monitor 879 Monte Carlo simulation 535 MPEG-4 827 MPI 177, 193,485,493,509, 803 Multidisciplinary simulation 331 Multigrid method 267 Multiprocessor tasks 23 Multithreading standards 819 Negative cycles Nested parallelism Network-attached storage Networking
297 787 423 905
953 Neural network Numerical simulation Numerical simulations
201,395 543 355
Object exchange 135 Object-oriented programming 119 Object-oriented simulation 331 OpenMP 177, 193,201,787, 803,827 OpenMP translation 795 Operational semantics 63 Optimal tiling 525 Optimization 103,599 Parallel algorithms 159, 371,853 Parallel algorithms in control 251 Parallel architectures 449 Parallel Computing 737, 905,927 Parallel databases 47 Parallel debugging 811 Parallel decomposition techniques 259 Parallel dynamic programming 525 Parallel File System 559 Parallel Finite Volume Approach 379 Parallel Genetic Algorithm 103 Parallel Isosurface Visualization 13 Parallel languages 111 Parallel machines 355 Parallel performance 661 Parallel preconditioning 275 Parallel processing 143,441 Parallel programming 63,493,617 Parallel programming languages 795 Parallel Runge-Kutta solvers 209 Parallel simulation 331 Parallel video server 423 Parallel Volume Visualization 13 Parallelization 403 Parallelization environment 811 ParaStation 559 Patterns 643 PDEs 475 Performance 777 Performance counters 517 Performance evaluation 209,501,803
Performance measurement 501,551,753 Performance modeling 769 Phase-field method 291 Pipeline 47 PNN algorithm 415 Power-aware supercomputing 653 Preconditioned conjugate gradient method 267 Program analysis 143 Programming environments 209 Programming tools 819 Protein interaction 677 Protein structure comparison 661 Radiation oncology 719 Reconfigurable systems 449 Remote Method Invocation (RMI) 87, 305 Remote visualization 159 Resource discovery 919,927 Resource management 627 RNA secondary structure 525 Scalable Performance Analysis 737 Scheduling 219,233,609, 627 Schr6dinger equation 861 SCI interconnection network 485 Scientific computing 151 Secure multicasting 587 Security 321 Semantics 79 Service discovery 913 Shallow water 403 Shared Memory Parallelisation 177 Simulation 803 Simulation optimization 313 Skeletons 63, 617 Skewed-associative cache 467 SMP 569, 827 Software architecture 609 Software development 819 Sparse eigenvalue problems 275 Sparse matrix data structures 119 Special-purpose computer 669 Spectral methods 861
954 Supercomputing Support vector machines Symmetric multiprocessor Synaptic transmission System reliability
777 259 201 685 729
Task scheduling TESLA protocol Threads Tiling transformation Tools Tracing
449 587 441 233 501 737
Ubiquitous computing UMTS Unified Modeling Language (UML) User Interface
935 347 643 635
Video coding Video database Virtual interface architecture Visualization VQ compression XOR-based hash functions
193 431 509 371,879 415 467