Listing 1.2. Wire-up code for our numerical integration demo (fragment):
  i = Pollarder::Factory().get();
  return i->integrate(0, 1);

Fig. 4. Exemplary use of Pollarder. Notice how similar the componentization and wire-up are, even though the applications have to perform very different computations.
The heterogeneous networks we use in our multi-cluster setup have proven to be problematic for MuCluDent, as the slowest node determines the whole system's performance. With overlapping computation and communication, a load balancer can reduce the time needed for computation on slower nodes, but to compensate for high-latency networks, the ghost zone would have to be enlarged. Enlarged ghost zones on all nodes would be undesirable (they come at the expense of increased overhead and would be unnecessary between nodes sharing a low-latency network), but handling a locally increased ghost zone width only between selected nodes turned out to be overly complex. We plan to smooth this out with a HAP-capable parallelization for MuCluDent, but, unlike our other parallelizations, this implementation is not yet able to perform load balancing. As MuCluDent's computational load is distributed very unevenly across the simulation grid, we could not gather meaningful benchmark results for this parallelization so far.

We did, however, test the HAP pattern with a demo application which implements a simple numerical integration for one-dimensional functions. Figure 4 shows a small code excerpt. For the test run we used three dual-Opteron nodes from our RA cluster.
[Fig. 5 graphic: the detected hierarchy of sub-clusters, labeled with middleware (MPI) and diameter according to our distance measure – mirz.uni-jena.de (diameters 0.99 and 0.24), mipool.uni-jena.de (1.02), inf-ra.uni-jena.de (0.62), enclosed by a cluster of diameter 1.71; node and CPU counts omitted.]
Fig. 5. Cluster analysis from a test run on our multi-cluster system. The initial maximum diameter was 0.7, the diameter multiplier was 1.6.
For comparison we integrated f(x) = x² on the interval [0, 1] once using a "flat" MPI parallelization with six processes (two on each machine) and once with a stacked HAP parallelization that used three MPI processes which forwarded their sub-intervals to a threaded parallelization. Although the problem scales well, the HAP parallelization turned out to be 31% faster than the flat parallelization. This is because, due to the low number of samples (2000), the initial scatter of interval borders and the final gather of results dominated the running time, and a reduced number of MPI processes sped them up. Still, this substantiates our claim that HAP may benefit a system's efficient usage. As the flat parallelizations in the MuCluDent project suffer from the same problem, namely that communication may be the dominating factor in a multi-cluster, we expect a comparable gain for our geometric decomposition codes.

Figure 5 shows the result of Pollarder's environment detection using our cluster analysis algorithm. For the test, 20 nodes from the Unix pool were used along with 10 from the Linux pool and three from the RA cluster (two of which were dual Opterons). Despite its early stage, our prototype was able to reliably detect the system's structure, including the two dual Opterons on the right. An interesting observation is the sub-cluster of diameter 0.24 in the Unix pool (mirz.uni-jena.de). Initially this seemed to be a bug in our algorithm, but it turned out that the nodes in this sub-cluster have gigabit Ethernet, in contrast to the other Unix pool nodes, which only use Fast Ethernet.
8 Summary and Outlook
Complexity and variety of contemporary grid systems have become major challenges for scientific computing. We have presented a new approach to grid application componentization, specially targeted at adaptive parallelizations. The presented design patterns can break down an application's functionality into small, reusable components. Our prototype suggests that these patterns are
generic enough to be employed in a variety of applications, ranging from loosely coupled problems like simple function integration to tightly coupled geometric decomposition codes. The Model-Parallelization-Balancer pattern takes care of coarse-grained adaptation, while Hierarchical Adaptive Parallelization can decompose complex parallelizations into smaller sub-parallelizations. This is especially important in the face of increasingly popular combined multi-core and MPI cluster setups. A factory takes over environment discovery and assembles the application's components, thereby enabling self-adaptation to multiple environments and relieving the user from manual interaction. While the adaptation provided by the factory is static in nature, the balancer in the MPB pattern can provide dynamic adaptation at runtime. Despite being only a prototype, our current implementation has already proven itself in a real application and is able to reliably detect even complex multi-cluster setups.
The Design and Evaluation of MPI-Style Web Services
Ian Cooper and Yan Huang
School of Computer Science, Cardiff University, United Kingdom
{i.m.cooper,yan.huang}@cs.cardiff.ac.uk
Abstract. This paper describes how Message Passing Web Services (MPWS) can be used as a message passing tool to enable parallel processing between WS-based processes in a web-service-oriented computing environment. We describe the evaluation tests performed to assess the point-to-point communications performance of MPWS compared to mpiJava wrapping MPICH. Following these evaluations we conclude that using web services to enable parallel processing is a practical solution for coarse-grained parallel applications, and that, due to inter-message pipelining, the MPWS system can, under certain conditions, improve on the communication times of mpiJava.
1 Introduction
A workflow is a series of processing tasks, each of which operates on a particular data set and is mapped to a particular processor for execution. In a loosely coupled web service environment, a workflow can itself be presented as a web service and invoked by other workflows. Web service standards and technologies provide an easy and flexible way of building workflow-based applications, encouraging the re-use of existing applications and the creation of large and complex applications from composite workflows. BPEL4WS is commonly used for web-service-based scientific workflow compositions [1], but users are limited to applications with non-interdependent processes. Furthermore, issues relating to the unsatisfactory performance of SOAP messaging have tended to inhibit the wide adoption of web service technologies for high performance distributed scientific computing. In spite of the performance concerns, the use of web service architectures to build distributed computing systems for scientific applications has become an area of much active research. Recently developed workflow languages have started addressing the problem of intercommunicating processes. Grid Services Flow Language (GSFL) [2] is one example; it provides the functionality for one currently executing Grid service to communicate directly with another concurrently executing Grid service. Another example is Message Passing Flow Language (MPFL) [3], which specifies an XML-based language that enables web-service-based workflows to be described using MPI-style send and receive commands. Neither of the examples mentioned above has presented a workflow engine, and currently there is no workflow engine that supports MPI-style
direct message passing; the GSFL paper describes an implementation using OGSA notification ports in a subscriber-producer methodology, but the MPFL remains a draft language with no implementation details. In this paper, we investigate the potential and suitability of using a web services infrastructure to support parallel applications that require MPI-like message passing. We look at various methods and tools that can be used to implement these message exchange patterns (MEPs) and assess the suitability of previous work, within the web service framework, for this emerging workflow use. We then propose an implementation for Message Passing Web Services (MPWS) and present performance results comparing MPWS against mpiJava [4], a leading HPC Java implementation [5]. We have used mpiJava as it is a tool for distributed computing rather than for use within a cluster environment; MPWS combines distributed, loosely coupled services to form a temporary, tightly coupled application with a similar goal. There has also been much research comparing mpiJava to other HPC systems [6].
2 Background and Related Research
In the context of parallel computing and MPI, message passing refers to the act of cooperatively passing data between two or more separate workers or processes [7]. Thus, message passing is used in parallel scientific applications to share data between cooperating processes. It enables applications to be split into concurrently running subtasks that have data interdependencies. In a service-oriented scenario, this can be translated to the act of sending data from one executing service to another, concurrently executing, service. The problem here is that a service can be concurrently invoked many times; once a service is invoked, there must be a way of determining which instance of the service needs to receive the message. SOAP-based web services communicate via SOAP messages, and these messages are exchanged in a variety of patterns. Within the WS framework there is normally a simple Message Exchange Pattern (MEP) that involves either a request only, or a request and response message. The normal invocation of a service during the execution of a workflow is for the workflow manager to request a service and then, when the service has completed, a response is returned to the workflow manager. It can be seen that this requires mediation by the workflow manager at every step of the workflow process. Kut and Birant [8] have suggested that web services could become a tool for parallel processing and present a model, using threads to call web services in parallel, that allows web services to perform parallel processing tasks. This model can be extended (as shown in Fig. 1) to allow these services to exchange data directly, which removes the need for the workflow manager to intervene every time a process transfers data [2]. Currently there is no standard for directly passing data from one service to another running service.
Fig. 1. Extending the use of parallel executing services to perform message passing
Alternative MEPs are in various stages of research; in-only patterns are in common usage in most web service platforms, and research has been undertaken into a single request multiple response (SRMR) MEP [9]. In this framework for SRMR, an agent is used to relay the service call, and a centralized web service collects the responses. Research into the use of web services in parallel computations is presented by Puppin et al. [10], who developed an approach for wrapping MPI nodes within web services. Their paper shows that the performance of wrapped MPI nodes can be comparable with MPI running in a cluster environment, although many more computers are required for the wrapped MPI version. In our research, we focus on developing and evaluating web services that are capable of MPI-like communication with other services; the performance of SOAP messaging is a key issue in determining whether MPWS can be made comparable in performance with other distributed message-passing systems. There is a problem when it comes to sending the data within a SOAP message. SOAP uses XML, and if true XML formatting is to be used, i.e. listing each entity of the data within a tagged element, the space overhead for the message is potentially very large. The most efficient method of encoding data is to serialize it into a binary representation. The Java language has an in-built facility to transform objects to their binary encoded representation; this is the mechanism that mpiJava uses to encode its objects before sending them to a socket. The problem is that we cannot translate a binary file directly to string format, as there are not enough characters available. There are four solutions available to this problem: binary-to-character encoding [11], packaging, binary XML encoding [12], and linking [11]. Packaging, such as SOAP with Attachments (SwA) [13] or Message Transmission Optimization Mechanism (MTOM) [14], allows data to be transmitted externally to the SOAP envelope. A comparison of transmission speeds using SOAP with Attachments and true XML formatting is given in [15]. MTOM also stores the data within the object model. MTOM has been chosen as the transmission protocol for these messages as it is SOAP based, yet it increases the speed of data transmission by allowing attachments while keeping the data accessible in the object model. MTOM does not have the
coding overheads of either the binary-to-character or the binary XML encoding, and it stays within the SOAP communication protocols, unlike linking.
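To make the encoding discussion concrete, the following is a minimal, self-contained Java sketch of the built-in binary serialization referred to above (the mechanism mpiJava relies on before writing objects to a socket). The class and method names are ours, not taken from mpiJava or MPWS; the point is only that the result is an opaque byte array that cannot be embedded directly in XML as text, which is why packaging mechanisms such as MTOM are attractive.

    import java.io.*;

    // Round-trips an object through Java's built-in binary serialization.
    // Illustrative helper; not part of mpiJava or MPWS.
    public final class BinaryEncoding {

        static byte[] toBytes(Serializable payload) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(payload);        // built-in binary encoding
            }
            return bos.toByteArray();            // opaque bytes, not valid XML text
        }

        static Object fromBytes(byte[] data) throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(data))) {
                return ois.readObject();
            }
        }

        public static void main(String[] args) throws Exception {
            double[] block = {1.0, 2.0, 3.0};
            byte[] wire = toBytes(block);
            System.out.println("encoded " + wire.length + " bytes");
            double[] back = (double[]) fromBytes(wire);
            System.out.println("first element after round trip: " + back[0]);
        }
    }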
3 The Design of a Message-Passing Web Service
The challenge is to design a tool which combines a tightly coupled programming concept like MPI with the distributed, loosely coupled architecture of SOAP web services; to do this we need to adhere to WS and SOAP messaging standards whilst providing an efficient form of communication between services. MPWS is designed to address three areas: the creation of a set of services, the initialisation of those services so they are aware of each other, and the communication between the services. The creation of a set of services is achieved by the workflow manager; its role is to accept jobs, normally specified using an XML-based workflow language such as MPFL, then find a collection of suitable services for those jobs and invoke them all within a unique communication domain. A communication domain is a collection of service instances which are involved in the same composite application and can communicate directly with each other; this means that each service instance must be aware of all other service instances in the domain. Based on the job definition, the workflow manager will discover and select a group of suitable Message Passing (MP) web services using standard WS techniques, then generate a communication domain ID for the workflow application. The workflow manager can then specify the rank number and invoke a run method for each MP service involved. The initialisation of the service is performed in the invocation of the run method; the input data for the application, as well as the binding information for the services to work together, is passed to each individual service involved in the same workflow application. The binding information includes the communication domain ID, the rank number for the service, and a list of service endpoint references, each associated with a particular rank ID. Knowing the rank number as well as the service endpoint references allows the service to perform point-to-point message passing with all other services in the same communication domain. An MP web service can participate in multiple applications concurrently, so in order to solve the problem of identifying which service invocation is to be addressed, there is one communication domain established for each application instance; this is associated with a unique identifying number, the communication domain ID. Each MP web service instance belongs to a communication domain, and each service instance has an associated resource; this resource is identified by the communication domain ID, is initiated for the particular communication domain, and stores the binding information and messages sent to that service instance. WS-Resources are defined in the WSRF specifications [16]; they allow for the concept of state within web services. A resource is uniquely identifiable and accessible via the web service [17]. The use of resources provides message buffers for an MP web service. Instead of sending
and receiving the messages synchronously, the message is sent to the resource associated with the receiving web service instance; the receiving web service can then retrieve a particular message from the corresponding resource. A message is associated with a communication domain ID and a message tag; this ensures that the message can be identified within a communication domain. MPWS has been designed to conform to WS standards and to SOAP messaging standards, to allow the use of loosely coupled services in a traditionally tightly coupled MPI coding style. To this end we have designed MPWS to support multi-layer interfaces: the upper layer as a WS layer, and the lower layer as a message-passing (MP) layer. With the web service layer, an MP web service supports WSDL standards, providing loosely coupled services which can be easily published, discovered and reused. There are two main methods exposed via the web services interface:
– Run method: this mainly consists of a sequence of instructions so that it performs one or more particular tasks. Since an MP web service normally involves cooperation with other MP web services for a particular application, setting up communication domains is the first task when the run method is invoked.
– Store method: this receives messages sent from other MP web services and stores them to the resource associated with the MP web service instance.
With the message-passing layer, an MP web service is able to conduct message-passing communication with other MP web services by supporting message-passing interfaces, including send, receive, broadcast, and sendReceive. The message-passing interfaces are not exposed via WSDL, but are low-level interfaces that can only be invoked via the WSDL-level methods. For example, inside a run method body, there may be instructions such as sending data to a particular MP web service or receiving data from a particular MP web service, and these can be carried out by directly invoking the methods provided within the message-passing programming package; MTOM is used as the transmission protocol in this layer. Fig. 2(a) gives an example which shows a send operation scenario between two MP web services, A and B. A communication domain was initiated with the communication domain ID equal to 3303. Service A sends a message to service B within the communication domain. The send method from the MP service is called to send the message to service B. This is done by invoking the store method provided by service B. When the store method is called, it stores the message it received into the resource associated with the domain ID 3303. Although service B has received the message and stored it within one of its associated resources, the message cannot be used unless a receive method is called. The receive method retrieves this message from the resource (ID = 3303) associated with the service instance; the tag name associated with the message is used to identify the particular message within the communication domain (Fig. 2(b)).
Fig. 2. An example of sending a message from Service A to Service B
The use of the resource to provide a buffering service for message passing encourages the adoption of the asynchronous fire-and-forget style [18] of message sending, which is supported in AXIS 2.1.1. The fire-and-forget send method returns immediately after the existence of the receiving host is confirmed, providing increased performance over the sendReceive or sendRobust styles.
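As a rough illustration of the binding information described in this section, the following hypothetical Java class sketches what an MP web service instance might hold after its run method is invoked: the communication domain ID, its own rank, and an endpoint reference for every rank in the domain. All names here are ours, not MPWS classes, and the actual web service call behind a send (invoking the receiver's store method, e.g. over MTOM) is deliberately stubbed out.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of per-instance binding information in an MP web service.
    public final class CommunicationDomainBinding {
        private final String domainId;                      // e.g. "3303"
        private final int myRank;
        private final Map<Integer, String> endpointByRank;  // rank -> service endpoint

        public CommunicationDomainBinding(String domainId, int myRank,
                                          Map<Integer, String> endpointByRank) {
            this.domainId = domainId;
            this.myRank = myRank;
            this.endpointByRank = new HashMap<Integer, String>(endpointByRank);
        }

        // A send in this design is an invocation of the receiver's store method,
        // carrying the domain ID and a message tag so the receiver's WS-Resource
        // can buffer the message until a matching receive(tag) retrieves it.
        public void send(int destRank, String tag, byte[] payload) {
            String endpoint = endpointByRank.get(destRank);
            // The SOAP/MTOM call to the store operation at 'endpoint' is omitted
            // here, as it is specific to the web service toolkit being used.
            System.out.printf("store(domain=%s, tag=%s, %d bytes) -> %s%n",
                    domainId, tag, payload.length, endpoint);
        }

        public static void main(String[] args) {
            Map<Integer, String> eprs = new HashMap<Integer, String>();
            eprs.put(0, "http://hostA/services/MPService");
            eprs.put(1, "http://hostB/services/MPService");
            CommunicationDomainBinding domain =
                new CommunicationDomainBinding("3303", 0, eprs);
            domain.send(1, "columnBlock", new byte[16]);
        }
    }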
4 The Evaluation

4.1 Testing
Many benchmark suites have been devised and put forward as the definitive parallel computing benchmark tests ([19], [20]); many of these are designed to test the underlying hardware or the collective communications features of the message passing tools. The purpose of the tests performed on MPWS and mpiJava is to find the speed of the communication implementations, not the capabilities of the network. The ping pong test is used in most of the benchmark suites as a simple bandwidth and latency test. Getov et al. [21] used a number of variations of the ping pong test to compare the performance of MPI and Java-MPI, and Foster and Karonis [22] use the ping pong test to evaluate MPICH-G, a grid-enabled MPI. We decided to use two variations of the ping pong test. The first, PingPong, transfers data from one process to another and then back again. In this test, there is an even number of processors within the communication domain, paired up to concurrently pass data to and from each other, see Fig. 3(a). In this figure the messages are represented by the solid arrows; the time taken for the message to be sent from one service to a second service and then back again is measured as the round trip time. The second test is the Ping*Pong test [21], which involves sending multiple messages from one service to a second service before the second service returns a message, as seen in Fig. 3(b). This test differentiates between the intra-message pipeline effect, where the message is broken into smaller parts by the system and processed through a pipeline to speed up the communication, and the inter-message pipeline effect, where the system does not have to wait for one message to complete its transfer before starting to process the next message [21]. The ping*pong test may show a more realistic view of the system's performance, as it emulates many real applications of message passing (such as a matrix multiplication).
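Below is a minimal sketch of the PingPong measurement as we read it, written against the mpiJava 1.2 API used in the evaluation (MPI.Init, COMM_WORLD Send/Recv, MPI.Wtime). The message size, the pairing of ranks, and the single-repetition timing are our simplifications, not the authors' actual harness.

    import mpi.MPI;

    // PingPong sketch: even rank r exchanges a byte buffer with rank r+1 and
    // reports the round-trip time. Assumes an even number of ranks.
    public class PingPong {
        public static void main(String[] args) throws Exception {
            args = MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();
            byte[] buf = new byte[200 * 1024];          // message size: our choice
            int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

            if (partner >= 0 && partner < size) {
                double t0 = MPI.Wtime();
                if (rank % 2 == 0) {                    // "ping" side
                    MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.BYTE, partner, 0);
                    MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.BYTE, partner, 0);
                } else {                                // "pong" side
                    MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.BYTE, partner, 0);
                    MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.BYTE, partner, 0);
                }
                if (rank % 2 == 0) {
                    System.out.println("rank " + rank + ": round trip "
                            + (MPI.Wtime() - t0) + " s");
                }
            }
            MPI.Finalize();
        }
    }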
Fig. 3. Communication Diagram for PingPong, Ping*Pong and matrix multiplication tests
As a further, more realistic test, a one-dimensionally blocked parallel matrix multiplication application is used. This application is based on a simple parallelisation of the matrix multiplication problem. The communications for the matrix multiplication application are shown in Fig. 3(c); each arrow represents a portion of the matrix being sent from rank(i) to another processor. It is important to note that while the order of the sends for each rank is fixed, a rank can start sending its data as soon as it has received data from the preceding rank. For the matrix multiplication application, the actual multiplication calculations are extremely time consuming and dilute the performance of the communications with variances in processor utilisation at the time of testing. We have therefore omitted the calculation part of the application and present only the communication part.
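For reference, here is one possible reading of the block exchange in Fig. 3(c), again against the mpiJava API: every rank ends up holding every other rank's block, realized below with a simple deadlock-free Sendrecv schedule. The fixed per-rank send ordering and the pipelining between consecutive sends that the text emphasizes are not reproduced here; the block size is also assumed.

    import mpi.MPI;

    // All-to-all exchange of matrix blocks (communication part only, as in the
    // paper's test). Schedule and block size are our assumptions.
    public class BlockExchange {
        public static void main(String[] args) throws Exception {
            args = MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int p = MPI.COMM_WORLD.Size();
            int blockLen = 500 * 500;                   // elements per block (assumed)
            double[] myBlock = new double[blockLen];
            double[][] received = new double[p][];
            received[rank] = myBlock;

            double t0 = MPI.Wtime();
            for (int step = 1; step < p; step++) {
                int dest = (rank + step) % p;           // my block goes here this step
                int src = (rank - step + p) % p;        // whose block arrives this step
                received[src] = new double[blockLen];
                MPI.COMM_WORLD.Sendrecv(myBlock, 0, blockLen, MPI.DOUBLE, dest, step,
                        received[src], 0, blockLen, MPI.DOUBLE, src, step);
            }
            if (rank == 0) {
                System.out.println("communication time: " + (MPI.Wtime() - t0) + " s");
            }
            MPI.Finalize();
        }
    }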
4.2 Evaluation Results and Discussion
Versions of each test have been written and evaluated both as a web service, running on Tomcat 5.5.20 using AXIS 2.1.2, and in Java using the mpiJava API (V1.2 wrapping MPICH 1.2.6); all code was written in Java 1.6.0. The MPWS evaluation tests were undertaken on a public network of university machines, all of which are prone to unforeseen activity. The tests were done during low-usage hours to reduce inconsistencies, and all graphs show minimum timings to reduce the impact of the network on the results; the error bars show maximum timings over the set of tests. The Linux machines used for the testing have twin Intel Pentium 4, 2.8 GHz processors; in order to eliminate the discrepancies between the different handling of threads in the MPWS and mpiJava systems, both systems were restrained to using only one processor on each machine. The graphs in Fig. 4 and Fig. 5 show the timings of MPWS and mpiJava running the ping pong tests. The results show the expected communications overhead of the SOAP message, which degrades the performance for smaller messages, but they also show that over a message data size threshold of approximately 200 Kbytes (or n = 160) the extra communication overhead has been absorbed by the total MPWS communication time, making the MPWS and MPI systems run at a relatively similar speed. The graph in Fig. 5 concentrates on the timings for smaller message sizes, allowing the reader to easily compare the two systems.
Fig. 4. Times of Ping Pong test MPWS and mpiJava
Fig. 5. Times of Ping Pong test MPWS and mpiJava; small message sizes
The ping pong test shows that for large message sizes the MP web services are an acceptable alternative to mpiJava, but below data sizes of around 125 Kbytes the system's overheads are very noticeable. This is not really unexpected, as there are the overheads of the SOAP headers and the HTTP protocol to consider. The results for the ping*pong test are shown in Fig. 6; it can be seen that the threshold (n = 130) at which MPWS absorbs the overhead of the SOAP messages is slightly lower than with the PingPong test. More significant is the tendency for MPWS to outperform the version using mpiJava's standard send; we put this down to the inter-message pipeline effect and the buffer handling of the two different systems. The parallel matrix multiplication communication results are shown in Fig. 7; they consistently show that MPWS performs the communications faster than mpiJava at matrix sizes above the overhead threshold. We again put the results of the matrix test down to the application of the system buffers in the MPWS and mpiJava implementations, and the inter-message pipeline effect. In the ping*pong test, both the inter-message pipelining of the send and the receive were being tested, but in the matrix multiplication test, each of the consecutive sends from every processor is received by a different processor.
Fig. 6. Times for the Ping*Pong test MPWS and mpiJava
Fig. 7. Times for the Matrix Multiplication test MPWS and mpiJava
In MPWS, the main message buffering occurs in the receiving processor. This distributes the message buffering process at the time of high utilisation.
5 Conclusion and Further Work
From the tests we have discovered that, despite using MTOM, the overhead of SOAP messaging is still a problem which affects the performance of MPWS when message sizes are small. However, when the message sizes reach a threshold, the MPWS and mpiJava systems run at a relatively similar speed. We also found that the inter-message pipeline effect is a noticeable feature in MPWS applications that use consecutive sends; it is even more so in those applications whose consecutive sends are received by a distributed selection of processors. From the above observations, we conclude that MPWS is an effective tool for coarse-grained parallel applications, such as a parallel matrix multiplication, implemented in a service-oriented environment. The next steps will be to consider the design of other send styles, such as ssend (synchronous send), and to evaluate MPI-style collective communication functionality such as broadcast, gather and scatter, and all-reduce.
References 1. Akram, A., Meredith, D., Allan, R.: Evaluation of bpel to scientific workflows. In: CCGRID 2006: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2006), pp. 269–274. IEEE Computer Society, Washington (2006) 2. Krishnan, S., Wagstrom, P., von Laszewski, G.: Gsfl: A workflow framework for grid services (2002) Preprint ANL/MCS-P980-0802 3. Huang, Y., Huang, Q.: Ws-based workflow description language for message passing. In: 5th IEEE International Symposium on Cluster Computing and Grid Computing, Cardiff, Wales, U.K. (2005) 4. Carpenter, B., Fox, G., Ko, S., Lim, S.: mpiJava 1.2: API Specification (October 1999), http://www.npac.syr.edu/projects/pcrc/mpiJava/mpiJava.html 5. Baker, M., Carpenter, B., Shafi, A.: An Approach to Buffer Management in Java HPC Messaging. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 953–960. Springer, Heidelberg (2006) 6. Lee, H.K., Carpenter, B., Fox, G., Lim, S.B.: Benchmarking hpjava: Prospects for performance. In: 6th Workshop on Languages, Compilers and Run-time Systems for Scalable Computers (March 2002) 7. Gropp, W.: Tutorial on MPI: The Message-Passing Interface 8. Kut, A., Birant, D.: An approach for parallel execution of web services. In: Proceedings - IEEE International Conference on Web Services, pp. 812–813. IEEE Computer Society, Los Alamitos (2004) 9. Ruth, M., Lin, F., Tu, S.: Adapting single-request/multiple-response messaging to web services. In: Computer Software and Applications Conference, 29th Annual International, vol. 2, pp. 287–292 (2005) 10. Puppin, D., Tonellotto, N., Laforenza, D.: How to run scientific applications over web services. In: International Conference Workshops on Parallel Processing. ICPP 2005 Workshops, pp. 29–33 (2005)
11. Harrington, B., Brazile, R., Swigger, K.: Ssrle: Substitution and segment-run length encoding for binary data in xml. In: 2006 IEEE International Conference on Information Reuse and Integration, September 2006, pp. 11–16 (2006) 12. Bayardo, R.J., Gruhl, D., Josifovski, V., Myllymaki, J.: An evaluation of binary XML encoding optimizations for fast stream based XML processing. In: WWW 2004: Proceedings of the 13th international conference on World Wide Web, pp. 345–354. ACM Press, New York (2004) 13. Barton, J.J., Thatte, S., Nielsen, H.F.: Soap messages with attachments. W3c note, W3C (December 2000) 14. The Apache Software Foundation: MTOM Guide -Sending Binary Data with SOAP. 1.0 edn. (May 2005), http://ws.apache.org/axis2/1 0/mtom-guide.html 15. Ying, Y., Huang, Y., Walker, D.W.: Using soap with attachments for e-science. In: Proceedings of the UK e-Science All Hands Meeting 2004 (August 2004) 16. Czajkowski, K., Ferguson, D.F., Foster, I., Frey, J., Graham, S., Sedukhin, I., Snelling, D., Tuecke, S., Vambenepe, W.: The ws-resource framework version 1.0. Technical report, Globus Alliance and IBM (2004) 17. Graham, S., Karmarkar, A., Mischkinsky, J., Robinson, I., Sedukhin, I.: Web Services Resource 1.2 (WS-Resource) Public Review Draft 01. OASIS, June 10 (2005) 18. Jayasinghe, D.: Invoking web services using apache axis2 (December 2006), http://today.java.net/pub/a/today/2006/12/13/ invoking-web-services-using-apache-axis2.html (Accessed August 2007) 19. Luszczek, P., Dongarra, J., Koester, D., Rabenseifner, R., Lucas, B., Kepner, J., McCalpin, J., Bailey, D., Takahashi, D.: Introduction to the hpc challenge benchmark suite. Technical report, icl.cs.utk.edu (March 2005) 20. Intel: Intel mpi benchmarks. Technical report, Intel (June 2006) 21. Getov, V., Gray, P., Sunderam, V.: Mpi and java-mpi: contrasts and comparisons of low-level communication performance. In: Supercomputing 1999: Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM), p. 21. ACM Press, New York (1999) 22. Foster, I., Karonis, N.: A grid-enabled mpi: Message passing in heterogeneous distributed computing systems. In: IEEE/ACM Conference on Supercomputing, 1998. SC 1998, pp. 46–46. IEEE Computer Society, Los Alamitos (1998)
Automatic Data Reuse in Grid Workflow Composition
Ondrej Habala, Branislav Simo, Emil Gatial, and Ladislav Hluchy
Institute of Informatics, Slovak Academy of Sciences, Dubravska 9, 84507 Bratislava, Slovakia
{Ondrej.Habala,Branislav.Simo,Emil.Gatial,Ladislav.Hluchy}@savba.sk
Abstract. Many papers, research projects, and software products have tackled the problem of automatic composition of a workflow of computer processes which computes certain data or performs a specific task. In recent years this has also gained popularity in grid computing, especially in connection with semantic description of resources usable in the workflow. However, most of the works dealing with semantically-aided workflow composition propose solutions only for workflows of processes, without the data necessary to execute them. We describe the design of a system which will be able to find not only the processes, but also the content for their execution, based solely on the list of available resources and a description of the required target of the workflow. The solution is based on our previous work in the project K-Wf Grid, utilizes semantic description of resources by means of ontologies, and operates on a SOA-based grid composed of web services. It is being developed in the context of a project called SEMCO-WS.¹
Keywords: Semantic grid, SOA, web services, automated workflow composition.
¹ This work is supported by projects SEMCO-WS APVV-0391-06, int.eu.grid EU 6FP RI-031857, VEGA 2/7098/27.
1 Introduction

Many papers, projects, and software solutions [1-3] have tackled the problem of automatic composition of a workflow of computer processes. This type of automation is very attractive especially in software engineering applied to scientific research, where complicated simulations and parameter studies often require tens of single steps in order to obtain the solution required by the scientist. Since the inception of grid computing, workflow composition of grid jobs into complex workflows has also gained prominence with its apparent usefulness, long history of previous works not applied specifically to the grid, and robust mathematical theory based mainly on directed acyclic graphs. In recent years, advances in the semantic web have been applied also in grid computing – creating the semantic grid [4] – and specifically in the area of semantically-aided composition of workflows of grid tasks. However, most of the many works on this topic have concerned themselves only with the composition of a
workflow of computer processes – represented by grid jobs, calls to web service interfaces, or other custom tasks – solving the "how", and have omitted the "what" of this problem, namely the data on which these processes operate. This has been left to the user. While the sought-after result is a system in which the user enters a description of the data he/she requires and the system composes a workflow able to compute it, most of the existing solutions create only a workflow able to solve a class of problems, and the selection of the one unique member of this class, via entering the correct data, is left to the user.

We have designed, and begun to implement, a system which solves also the "what" of automated workflow composition. The proposed system is based on previous work done in the context of the project K-Wf Grid [8], and extends it with tools which are able to determine exactly which data is necessary for which process in the composed workflow in order to get – at the end – the data which the user has described as his/her target. The system is based on semantic description of data and grid services by ontologies. The workflows are modeled as Petri nets, this being a legacy of K-Wf Grid which offers very good means to model data (as Petri net tokens). The system interacts with the user only to the extent absolutely necessary to acquire data or services which are required for the solution but currently not available in the grid.

The rest of this paper first presents the project K-Wf Grid and its results, and establishes a frame of reference for our own work. Then we briefly present the project SEMCO-WS [9], and the main part of the paper describes the proposed solution of automatically composing workflows with not only processes, but also data.
2 Results of K-Wf Grid

The project Knowledge Based Workflow System for Grid Applications – K-Wf Grid – started in September 2004 and ended in February 2007. It was very successful in attaining its goals, and the final review in March 2007, as well as a public showcase at the Cracow Grid Forum '06, were a success. The consortium was composed of six partners, and the work was very well focused on one goal – automated composition of workflows of grid services using semantic support, accessible through a comfortable web-based graphical interface. For the purpose of this paper, we will discuss only the parts of the K-Wf Grid middleware connected with application workflow composition. The middleware has been tested on three pilot applications. Each application went through several stages, beginning with integration and ending with a successful workflow execution. A K-Wf Grid application is a set of web or grid (WSRF) services. Any application has first to be integrated into the system's knowledge base [10]. The process of integration is mostly automatic [11]; the application expert has only to annotate the WSDL documents of the application's services with markup denoting the input and output structures used in service calls. Following the integration, the application may be used in the system. The user enters a textual description of his/her problem, and a tool [12] connected to the knowledge base finds any available targets – service results – relevant to this problem description. The user selects one or more of the found targets, and thus establishes a context for a workflow.
The workflow is formed and executed in several stages. We have to remember that it is always modeled as a Petri net, so the appropriate terms are activities (for processes), tokens (for data), and data places – for inputs and outputs of the activities. It begins with an abstract description of a problem, having only one activity, and one output (the target of the workflow). This simple workflow is then expanded [13] using descriptions of available services, stored in the knowledge base. The result is a workflow of service classes – descriptions of service interfaces, without actual grounding (concrete service providers). Then in another step [14] the available service providers for each service are found, and the workflow is now composed of a set of activities, each of which presents several possible choices of a concrete web/grid service for its execution (firing, in terms of Petri nets). The final pre-execution step is scheduling, during which the scheduler [5] selects for each activity one service, assumed to be the best one (considering a metric for evaluating different properties of services) for the workflow. After this workflow construction, the workflow (assuming that the system was able to find all necessary service classes and grounded services for these classes) may be executed. During execution, one or more input data structures may be necessary – the user will be asked to provide data description. Although this step is also comfortable, and the user may use custom web forms developed by the application developer, it is still necessary for the user to know the application and be able to judge which data will be necessary to produce the target he/she wishes to obtain. Following the workflow execution, the data may be downloaded from grid storage; or, if the application contains also visualization tools and custom services able to cooperate with grid middleware (so-called job packagers), it is transformed into easily readable form and made accessible through the K-Wf Grid portal.
3 Semantic Workflow Composition in SEMCO-WS

The project SEMCO-WS is a small national project with a consortium of four members, all from Slovakia. It started in February 2007 and is scheduled to end in January 2010. The project is trying to expand and refine several components of K-Wf Grid (and to add other features not present in K-Wf Grid). While the whole process of workflow construction and execution in K-Wf Grid was observable through a graphical workflow visualization tool, the user could edit only data tokens, not the workflow structure. SEMCO-WS will include an improved version of the visualization tool, supporting also complete workflow editing. The process of knowledge base management will also be supported by comfortable graphical tools, and the knowledge base itself will be decentralized. Most importantly for this paper, the process of data selection, left to the user in K-Wf Grid, will be fully automated. The user will be asked only to provide data which is not available and described in the knowledge base. The design of this part of the SEMCO-WS middleware is explained in the following chapter.
4 Adding Automatic Data Selection to Workflow Composition

To be able to propose not only the activities of a workflow, but also the content of the initial tokens in it, based only on the content of the final output token (which represents the target data of the workflow), the system has to be able to
1. know the content of any data present in the system,
2. infer the necessary input token parameters, based on the parameters of the output token to be generated by any activity,
3. decompose output structures of web and grid services into separate tokens, and
4. compose input structures for web and grid services from existing tokens.
The solutions of these partial problems are described below.

4.1 Content of Data – Metadata

Any data piece available to the workflow composition system is represented by a token. If it is a file, it can be its content (for smaller files), its URL, LFN, or any other identifier. It can be a URL of a database and an identifier of an item in the database. The actual content of a token is application dependent, and the system does not need to be able to read it. Any token has first to be either entered by a user or generated by an application activity, and it is processed only by another application activity or a user, so the application dependence is fully hidden in the application domain. The workflow system only needs to know the metadata of the token, to be able to evaluate its content for possible use in a workflow. The metadata, represented by OWL and OWL-S constructs, is composed of a generic, application-independent part and an application-dependent part. This layering of ontologies was also used in K-Wf Grid, where the application ontologies used a common base ontology layer with basic facilities for describing services, files, resources, computers, clusters, etc. The OWL standard is used mainly because it has already been incorporated into the K-Wf Grid middleware, upon which our system is based. Alternatively, another suitable ontology representation language could be used, or even the WSMO language [15], developed specifically for modeling web services. The requirement that the system has to be able to use data computed in the past for current workflows also implies that all tokens created by any application have to be stored in a database. Since in K-Wf Grid and in SEMCO-WS the workflows (including tokens) are described in an XML dialect called the Grid Workflow Description Language [7], a simple XML database is sufficient for token storage and later lookup. When the system identifies a data piece based on its metadata, the ontology will also contain the identifier of the token representing the data, and the token can be retrieved from the XML database using this identifier. Each newly created token has to be described by its metadata. This can be done in two ways – either the metadata is generated by another application component (a simple web service or other module), or it may be computed using a set of mathematical and logical formulas. As we will discuss below, it is necessary to be able to infer the properties of input data from required output parameters, so it may be also
possible – at least for a subclass of activities – to describe the inverse transformation and infer properties of output tokens from known properties of input tokens.

4.2 Backtracking from Required Output Token to the Necessary Input Tokens

The process of constructing a workflow in K-Wf Grid has been sufficiently described before [13], [14], [6]. This process did not provide for reusing existing data, and always assumed that the whole workflow chain has to be computed anew, even if some partial results from a previous workflow could be reused and could replace parts of the newly created workflow. The workflow construction process used backtracking, from the final activity to the initial activities of the workflow. In SEMCO-WS, the process will also include backtracking from the final token to the initial tokens of the workflow. We can abstract tokens and activities as data providers, with the difference that tokens provide data and require no inputs, while activities also provide data but require input data – other tokens. So the process of workflow construction can be described using this algorithm:

    Program construct_workflow (token_metadata_list)
      // The input is a list of metadata descriptions of all
      // tokens we wish to produce with the constructed workflow
      Variables:
        workflow        // list of components of the constructed workflow
        token
        activity
        token_metadata  // member of token_metadata_list
      1. Foreach token_metadata in token_metadata_list
         a. token ← find in token_db based on token_metadata
         b. If token ≠ nil
              workflow = workflow + token
            Else
              Find activity able to produce token
              workflow = workflow + activity
              token_metadata_list = token_metadata_list +
                  token_metadata of all input tokens of activity
      2. If token_metadata_list is not empty, goto 1
      3. Output workflow

So we see that we first try to find the data we need, and if it cannot be found (it was not yet computed or entered into the system), we find an activity which can compute such data. Of course, then we have to find the correct input data for this activity too, and the process repeats itself. The K-Wf Grid incarnation of this algorithm looked only for activities, and essentially omitted step 1a of the algorithm. In the 1b-else clause, we are looking for an activity able to produce a token with certain parameters. To be able to do this, we also need descriptions of the capabilities
of activities, i.e. what are the possible parameters of tokens the activity is able to produce. If this description also includes the parameters of the token to be produced, the activity may be used to produce it.

4.3 From Tokens and Activities to Data Structures and Services

Our Petri net model of a workflow operates with activities, places, and tokens. For purposes of management of SOA applications, we need to be able to transform these concepts to web service calls and input/output structures of web service interfaces (data structures). While the transition from activities to web service calls is straightforward, tokens and data structures do not map directly. Any activity may require more than one piece of data – so on its input are several tokens, coming from several input places, but the underlying service can consume only one input structure. Also, the output data structure of the service may contain several data fragments – values, references to files, etc. – and we want to store them as separate tokens, since they may later be used separately as inputs to other activities. For the purpose of constructing input data structures for services, and decomposing their output structures into separate tokens, we have extended the annotation of the WSDL documents of an application's services with additional elements. These elements (contained inside the definition of data structures) contain XSLT code, which may be used to automatically compose the structure from several XML fragments identical with tokens, and also to automatically split the output structure into the fragments which then become the output tokens of the activity. The composition and decomposition process is quite straightforward; an activity may have several input and output places, but the actual web service hidden behind the activity has only one input and one output, both represented by an XML structure. Upon activity firing, the input tokens (XML fragments) are concatenated together and then transformed into the format of the web service input structure using the XSLT provided in the WSDL document. Similarly, the output structure is then filtered by XSLT into distinct output tokens – one XSLT document for each output token.

4.4 The Workflow Composition Process

The whole process of automated workflow construction using also existing data has been decomposed into several steps (see Fig. 1):
1. Initial data has to be entered by the user; he/she is a domain expert, so it is easy (using custom web forms which are part of the application) for him/her to create the token, as well as describe it with metadata; both token and metadata are stored.
2. When the GWES looks for components of the workflow, it queries the knowledge base for any existing tokens which can provide the necessary data.
3. If such a token is found, it is extracted from the token database; otherwise an activity is used (not shown in Fig. 1, since it is not the focus of this paper; composition of services into a workflow has been sufficiently described in other works).
Fig. 1. The data location/creation process
4. During execution of the workflow, GWES has to compose an input structure for the actual application service from the input tokens of the service's activity. This is done automatically, using the XSLT transformations present in the WSDL document of the service.
5. The composed input structure is used in a call to the application service; the service replies with an output structure.
6. The obtained output structure is decomposed into single tokens, which represent the data items contained in the structure. This is also done automatically, using XSLT transformations inside the WSDL document (see Chapter 4.3 for details; an illustrative sketch of such a transformation follows at the end of this section).
7. The created tokens have to be annotated; this is another application-specific step of the process, but (as discussed above) we can avoid helper services by using mathematical and logical formulas which describe the transformation of the parameters of input tokens (which are known) into the parameters of output tokens. Alternatively, if such formulas cannot be used because of the complexity of the transformation, a helper service can be used. The formulas or the URL of the helper service can also be found inside the annotated WSDL document of the application service.
8. The created tokens (extracted from the output of the application service in step 6) are stored in the token database for later reuse.
9. The metadata of the new tokens (created in step 7) is stored in the system's knowledge base.

The cycle may begin another iteration from step 2, or – if the workflow has no more activities which can be fired – end. We have not discussed a situation which may arise when the composition of a workflow is halted by a requirement for data which cannot be found in the database nor produced by any known activity. In such a case, the user may be asked to provide the data. Alternatively, he/she may abort the workflow composition process, enter a new application service into the system, and restart the workflow composition. So the situation can be resolved by adding either the data or a service which can produce it.
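To make steps 4 and 6 concrete, here is a small, self-contained Java sketch of applying an XSLT stylesheet to a service's output structure in order to cut out the XML fragment that becomes one output token, in the spirit of the WSDL-embedded XSLT described in Chapter 4.3. The stylesheet, element names, and class are invented for illustration and are not taken from K-Wf Grid or SEMCO-WS.

    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;
    import java.io.StringReader;
    import java.io.StringWriter;

    // Applies one XSLT stylesheet to an XML output structure, yielding the
    // fragment that would be stored as one token. Purely illustrative.
    public final class TokenExtractor {

        static String applyStylesheet(String xml, String xslt) throws TransformerException {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        }

        public static void main(String[] args) throws Exception {
            String output = "<result><lfn>lfn:/grid/out1.dat</lfn><count>42</count></result>";
            // One stylesheet per output token: this one keeps only the <lfn> element.
            String xslt =
                "<xsl:stylesheet version='1.0'"
                + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:template match='/result'>"
                + "<token><xsl:copy-of select='lfn'/></token>"
                + "</xsl:template>"
                + "</xsl:stylesheet>";
            System.out.println(applyStylesheet(output, xslt));
        }
    }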
5 Conclusions

We have shown that fully automated construction of workflows of web and grid services can also reuse existing data, provided that the process is supported by a semantic annotation layer and the application services are annotated. Three key components of the annotation are formulas for the transformation of input metadata into output metadata, formulas for the inverse transformation, and descriptions of the capabilities of application services. The design presented in this paper is currently entering the implementation phase. The project SEMCO-WS is continuing the work of K-Wf Grid in some areas where the semantic support of grid workflow composition can be further improved, and the inclusion of automatic data reuse is one of them. The prototype of this system is to be ready by 2009, and when SEMCO-WS finishes in 2010, the system will be ready to be used and further improved by other researchers.
References 1. Bubak, M., Gubala, T., Kapalka, M., Malawski, M., Rycerz, K.: Grid Service Registry for Workflow Composition Framework. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3038, pp. 34–41. Springer, Heidelberg (2004) 2. VDS – The GriPhyN Virtual Data System (Accessed January 2008), http://www.ci.uchicago.edu/wiki/bin/view/VDS/VDSWeb/WebMain 3. Krishnan, S., Wagstrom, P., von Laszewski, G.: GSFL: A Workflow Framework for Grid Services. In: Preprint ANL/MCS-P980-0802, Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, 1L 60439, U.S.A. (2002) 4. Semantic Grid Community Portal (Accessed January 2008), http://www.semanticgrid.org/ 5. Wieczorek, M., Prodan, R., Fahringer, T.: Comparison of Workflow Scheduling Strategies on the Grid. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 792–800. Springer, Heidelberg (2006) 6. Hoheisel, A., Der, U.: An XML-based Framework for Loosely Coupled Applications on Grid Environments. In: Sloot, P.M.A., Abramson, D., Bogdanov, A.V., Gorbachev, Y.E., Dongarra, J., Zomaya, A.Y. (eds.) ICCS 2003. LNCS, vol. 2657, pp. 245–254. Springer, Heidelberg (2003) 7. Alt, M., Gorlatch, S., Hoheisel, A., Pohl, H.W.: A Grid Workflow Language Using HighLevel Petri Nets. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 715–722. Springer, Heidelberg (2006) 8. Knowledge-based Workflow System for Grid Applications (K-Wf Grid). EU 6th FP Project, 2004-2007 (Accessed January 2008), http://www.kwfgrid.eu 9. Semantic composition of Web and Grid Services (SEMCO-WS). Slovak APVV project, 2007-2009 (Accessed January 2008), http://semco-ws.ui.sav.sk/ 10. Kryza, B., Pieczykolan, J., Babik, M., Majewska, M., Slota, R., Hluchy, L., Kitowski, J.: Managing Semantic Metadata in K-Wf Grid with Grid Organizational Memory. In: Bubak, M., Turala, M., Wiatr, K. (eds.) Proceedings of Cracow Grid Workshop – CGW 2005, Krakow, Poland, November 20-23 2005, pp. 66–73 (2005) 11. Habala, O., Babik, M., Hluchy, L., Laclavik, M., Balogh, Z.: Semantic Tools for Workflow Construction. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3993, pp. 980–987. Springer, Heidelberg (2006)
12. Laclavik, M., Seleng, M., Hluchy, L.: User Assistant Agent (EMBET): Towards Collaboration and Knowledge Sharing in Grid Workflow Applications. In: Cracow 2006 Grid Workshop: K-Wf Grid, pp. 122–130 (2007) ISBN 978-83-915141-8-4 13. Gubala, T., Herezlak, D., Bubak, M., Malawski, M.: Semantic Composition of Scientific Workflows Based on the Petri Nets Formalism. In: Proc. of e-Science 2006, Amsterdam (2006) ISBN-0-7695-2734-5 14. Dutka, L., Kitowski, J.: AAB – Automatic Service Selection Empowered by Knowledge. In: Bubak, M., Turała, M., Wiatr, K. (eds.) Proceedings of Cracow Grid Workshop – CGW 2005, ACC-Cyfronet USTs, ACC-Cyfronet UST, November 20-23 2005, p. 58 (2006) 15. Web Service Modeling Ontology (Accessed March 2008), http://www.wsmo.org/
Performance Analysis of GRID Middleware Using Process Mining∗ Anastas Misev1 and Emanouil Atanassov2 1
University Sts Cyril and Methodius, Faculty of Natural Sciences & Mathematics Institute of Informatics, Skopje, Macedonia 2 Bulgarian Academy of Sciences, Institute for Parallel Processing, Sofia, Bulgaria [email protected], [email protected]
Abstract. Performance analysis of the GRID middleware used in a production setting can give valuable information to both GRID users and developers. A new approach to this issue is to use the process mining techniques. Analyzing logs of the middleware activities, performed on the SEE-GRID pilot production Grid infrastructure, objective qualitative and quantitative information on what actually happens can be obtained. Using the appropriate tools like ProM to apply the process mining algorithms, many interesting findings and conclusions can be drawn. In this paper we describe our approach and show some of our conclusions. Keywords: Grid, middleware, performances, process mining.
1 Introduction
Performance analysis of the GRID middleware can give valuable information to both GRID users and developers. Users gain by better understanding the workflow that is followed during a job's lifecycle and the possible obstacles. Since the Grid middleware usually presents alternative ways to accomplish the same final result, the relevant performance information enables the users to optimize their choices and improve their throughput. Developers can benefit by locating the bottlenecks and other problematic points during the job lifecycle and trying to modify the middleware appropriately. They can also compare various implementations. The performance of the GRID middleware can be analyzed from various aspects. As seen in [1], [2], analysis can be performed on the MDS, OGSA-DAI, etc. All of this focuses mostly on the developers' view of the middleware. In this work, we analyze the performance using the logging and bookkeeping data obtained from the Logging and Bookkeeping (L&B) Service. In this way, we try to quantify the perception of reliability that the users get when they look at the final outcome (success/failure) of their jobs. ∗
This paper is based on the work done at the Institute for Parallel Processing at the Bulgarian Academy of Sciences, during the one month stay, supported by the FP6 project: Bulgarian IST Centre of Competence in 21 Century (BIS-21++), Contract no.: INCO-CT-2005-016639.
2 Description of the Logging and Bookkeeping Service and Database
The Logging and Bookkeeping (L&B) service [3] tracks jobs managed by the gLite WMS (workload management system) or the Resource Broker (RB). It gathers events from various WMS/RB components in a reliable way and processes them in order to give a higher-level view: the status of a job. Virtually all the important data are fed to L&B internally from various gLite middleware components, transparently from the user's point of view. Three main features of the system are event delivery, notifications, and security and access control. For a deeper understanding of them, refer to [3], [4]. All the data that the service receives is stored in a relational database. The diagram of the database is shown in Fig. 1.
Fig. 1. Database structure of the L&B database
The unique user id, along with the cert data from the X509 certificate, is stored in the Users table. Each job is assigned a unique identifier that is used as a foreign key in the related tables and is stored in the Jobs table. For each job, several events are created in the Events table. Each event has the job id, the sequence number, an event code and a time stamp. Two more tables relate to the Events table: Short_fields and Long_fields. Both of them store pairs of (name, value) data, related to the event by the job id and event sequence number. The Short_fields table contains shorter values, strings of up to 255 characters. For the longer values (the entire JDL for a particular job, the CE name, the name of the queue, the names of the files accompanying the job, etc.), which can be up to 16 million characters, the database uses the Long_fields table, with the same reference to the Events table (job id and event sequence number). When the job's lifetime ends, a record is added to the States table, referencing the job's final sequence number and a large string field containing a detailed description of the job lifetime.
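As an illustration of how these tables fit together, the following Python sketch reconstructs the ordered event trace of a single job. It is not taken from the gLite sources; the table and column names (events and short_fields, with columns jobid, seq, code, time_stamp, name, value) are assumptions derived from the description above, not the exact L&B schema.

import sqlite3

def job_trace(db_path, job_id):
    # Illustrative sketch: table and column names are assumed from the textual
    # description of the L&B schema, not copied from the actual database.
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute("SELECT seq, code, time_stamp FROM events "
                "WHERE jobid = ? ORDER BY seq", (job_id,))
    trace = []
    for seq, code, stamp in cur.fetchall():
        # Attach the short (name, value) pairs recorded for this event.
        cur.execute("SELECT name, value FROM short_fields "
                    "WHERE jobid = ? AND seq = ?", (job_id, seq))
        trace.append({"seq": seq, "code": code, "time": stamp,
                      "fields": dict(cur.fetchall())})
    con.close()
    return trace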
3 The Rationale for L&B-Based Performance Analysis
Various approaches have been proposed to tackle the performance analysis of distributed systems. The work done by Margalef et al. [5] proposes three different approaches: static, run-time and dynamic. In that context, L&B-based analysis is a static, post-mortem analysis. As such, it has many advantages, but also some disadvantages. The main advantage is that it does not introduce any overhead to the production system, since the analysis is done off-line. Analyzing trace files can require lots of time, but since time is not an issue in off-line analysis, a more comprehensive and in-depth analysis can be performed, helping even non-expert users to fine-tune their applications. The main disadvantage of this approach is that such analyses require a high level of detail in the log files, which in turn requires lots of resources (both processing and storage) for their manipulation. Also, since the analysis is static and post-mortem, it cannot cope with dynamic application behavior that occurs during each execution. Process mining has not yet been used as a tool for performance analysis of GRID middleware. Other applications of the technique have proven very useful [7], [11].
4 Short Overview of Process Mining
Process mining techniques allow for extracting information from event logs [6], [8], [9]. They target the automatic discovery of information from an event log. This information can be used to deploy new systems that support the execution of business processes, or as a feedback tool that helps in auditing, analyzing and improving already enacted business processes. The main benefit of process mining techniques is that information is objectively compiled. In other words, process mining techniques are helpful because they gather information about what is actually happening according to an event log of an organization, and not what people think is happening in that organization. The type of data in an event log determines which perspectives of process mining can be discovered and specifies the type of questions that can be answered using the mining process:
1. The control flow perspective can be mined if the logs contain tasks executed by a process. The key elements in this perspective are processes and cases (process instances). This represents the “How?” questions.
2. If the log provides information about the persons/systems that executed the tasks, the organizational perspective can be discovered, giving answers to the “Who?” questions.
3. When the log contains more details about the tasks, like the values of data fields that the execution of a task modifies, the case perspective (i.e. the perspective linking data to cases) can be discovered. This relates to the “What?” questions.
We have chosen the ProM framework [10], [11] for several reasons: it is open-source, Java-based, and it has a big variety of available plug-ins. It is extensible with new plug-ins, if required. The included plug-ins can be used either on logs only, called discovery plug-ins, or on logs and process models, called conformance and extension plug-ins. The discovery plug-ins can be used to discover the process elements from the log files only. They can then depict the process in various formats (Petri nets for example). The conformance plug-ins rely on both log data and a process model. They can be used to test the conformance of the data in the log to the proposed process model. The extension plug-ins also require both logs and a process model, but they discover information that enables enhancing the process model. The ProM framework uses its own format to store the log data and additional attributes. The format is called MXML and is based on XML. Along with ProM, there is an open-source tool called ProMImport [12] that enables conversion from various well-known log formats into MXML.
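To make the conversion step concrete, the Python sketch below writes a minimal MXML-style log from already extracted job traces. The element names (WorkflowLog, Process, ProcessInstance, AuditTrailEntry, WorkflowModelElement, EventType, Timestamp, Originator) follow the commonly documented MXML layout and should be read as an approximation of what ProMImport produces, not as a drop-in replacement for it.

from xml.sax.saxutils import escape, quoteattr

def write_mxml(jobs, path):
    """jobs: dict mapping a job id to a list of (status, originator, iso_timestamp)
    tuples, ordered by event sequence number."""
    with open(path, "w") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<WorkflowLog>\n<Process id="L-and-B">\n')
        for job_id, events in jobs.items():
            out.write('  <ProcessInstance id=%s>\n' % quoteattr(str(job_id)))
            for status, originator, stamp in events:
                out.write('    <AuditTrailEntry>\n')
                out.write('      <WorkflowModelElement>%s</WorkflowModelElement>\n'
                          % escape(status))
                out.write('      <EventType>complete</EventType>\n')
                out.write('      <Timestamp>%s</Timestamp>\n' % escape(stamp))
                out.write('      <Originator>%s</Originator>\n' % escape(originator))
                out.write('    </AuditTrailEntry>\n')
            out.write('  </ProcessInstance>\n')
        out.write('</Process>\n</WorkflowLog>\n')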
5 Application of Process Mining on the L&B Log Data
For the purpose of our analysis, we use the job identifiers as process instances (or cases) and events as audit trail entries. We also use the status field of the job (from the Events table) as the model element (or state a job can be in) and the combination of program and host name as the originator (the entity performing the process element). For future research, we will add more attributes to the analysis (CE name, detailed status of the job, queue and VO name, etc.). After we have imported the log data into the MXML format and loaded the log into the framework, we can proceed with log filtering. Log filtering enables us to select only the data that is relevant to the analysis. For example, we can define that only logs for jobs starting with the REGJOB event will be used. Also, we can select that we will analyze only complete instances, so we can define another filter that will include only jobs with a particular event as the last event (DONE, CLEAR, CANCEL…). It is possible to perform more advanced filtering using the advanced filtering tab. For example, we can retain only events reported by specific originators, which helps us reduce the data for some of the analyses. This is important in our case since some of the events are reported to the service by multiple originators. Also, we can use the remapping filter to remap sub-jobs to a parent job.
5.1 Log Summary
We have started the mining process of the L&B logs with the simple log summary plug-in. It gives an overview of the number of jobs (process instances) and events (audit trail entries). For each of the model elements, a frequency is calculated and shown. Also, the model elements that are first in the audit trails (Starting log events) and last (Ending log events) are shown with the frequency of their occurrence. Finally, each of the originators is shown, along with the frequency of its occurrences in the audit trails. From here we can get a basic notion about the data we are mining. For example, we can instantly see how many of the process instances finished successfully by looking
at the Ending log events. We can also see the workload performed by various originators (program-service and host name combinations).
5.2 Heuristic Miner
Another appropriate plug-in for analyzing data that is less structured or has instances that follow several different paths of execution is the Heuristic Miner. Using the tool, a heuristic network can be produced depicting the control flow in the given process model. A simplified example is shown in Fig. 2. The numbers in the boxes represent the number of occurrences of the specific event and the numbers on the links represent the frequency and the absolute number of occurrences of a specific transition.
Fig. 2. Heuristic network
Using this network, we can easily recognize the frequencies of various transitions in the job's lifespan. The network can also be converted to a Petri net, one of the most common formalisms used to represent workflows.
5.3 Petri Net Performance Analysis
Once we have a Petri net from the log, we can perform additional analysis. Especially useful is the Petri net performance analysis. As a result of this analysis, an interactive diagram is produced, helping in identifying the bottlenecks in the process model. Different color coding of the places (circles) of the Petri net marks the different time needed in each one of them, as shown in Fig. 5 (blue means low waiting time, yellow middle and purple high). Also, by selecting two transitions in the net, the tool shows the statistics (min, max and average time needed from one to the other).
5.4 Performance Sequence Diagram
The performance sequence diagram plug-in can be especially helpful if you want to know which behavior in your processes is common, which behaviors are rare and which behavior may result in extreme situations (e.g. instances with extremely high throughput times). An example of the output is given in Fig. 4. We have used this output to identify the most common sequences of events that occur during a job's lifetime, along
with their basic statistics. The diagram can be a full diagram, showing all the instances in time, or a pattern diagram, grouping similar sequences into patterns (by variable parameters). It also has a rich set of filtering options allowing us to select sets of process instances, or even individual ones. This plug-in, if used by end-users, can help them visually identify the problems with their job submissions.
Fig. 3. LTL plug-in
5.5 Conformance Analysis
The conformance analysis plug-in requires both log data and a process model (a Petri net for example). It replays the entire log and checks the conformance of each job with the model. It offers two perspectives: log and model. The log perspective illustrates each separate job and indicates the ones that do not conform to the model. The model perspective shows the Petri net and indicates the non-conformant points. It can also show the number of times an activity should occur (according to the model) but actually did not, and vice versa.
5.6 LTL Plug-In
The Linear Temporal Logic (LTL) plug-in checks the validity of LTL formulas on the analyzed log. It has a rich set of options and predefined formulas. As a result, it divides the set of process instances into those that satisfy the formula and those that do not. The example shown in Fig. 3 shows the conformance of the processes to the formula “eventually activity RUNNING then DONE then CLEAR”.
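The “eventually RUNNING then DONE then CLEAR” property used in the example can also be checked directly on a status trace. The following Python sketch is a plain re-implementation of that single formula for illustration, not a use of the ProM LTL engine; the example job names and traces are hypothetical.

def eventually_then(trace, *pattern):
    """True if the statuses in `pattern` occur in `trace` in that order
    (not necessarily consecutively), e.g. ("RUNNING", "DONE", "CLEAR")."""
    it = iter(trace)
    return all(any(status == wanted for status in it) for wanted in pattern)

# Partition process instances by conformance to the formula (hypothetical data).
jobs = {"job1": ["REGJOB", "RUNNING", "DONE", "CLEAR"],
        "job2": ["REGJOB", "RUNNING", "ABORT"]}
satisfying = {j for j, t in jobs.items()
              if eventually_then(t, "RUNNING", "DONE", "CLEAR")}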
6 Some Important Findings about Middleware Performance Derived from the Process Mining We have performed the analysis on different subsets of the L&B data. At the beginning, mostly for performance reasons, we have analyzed jobs from several users,
user by user. Subsequently, we have made a filtered data set from the whole database. We must note that some of the plug-ins require much more time when working with large datasets, especially the ones that perform log replay. Most of the results that follow are from the overall analysis.
6.1 Percentage of Successful Jobs
From the performed analysis, we can conclude that the underlying infrastructure (SEE-GRID [13]) performs satisfactorily. The overall percentage of successful jobs is around 70%. We identified several factors that influence the success rate of jobs. First of all, there is the human factor. Analysis of logs of jobs submitted by experienced users shows a greater percentage of success. If we analyze filtered logs from more experienced users, we can conclude that up to 80% end either with status DONE or with status CLEAR (retrieved output); a sketch of this computation is given after the list below. We have to perform a deeper analysis, which requires additional data attributes added to the analyzed logs (like status code, exit code, etc.), to better understand the percentage of finished jobs and, more importantly, to discover the reasons why the other jobs failed. Other factors include:
1. the “quality” of the Grid sites – usually larger sites in terms of number of CPUs have better support
2. software versions – the installation of a new middleware version or revision usually causes some hiccups
3. lack of resources or inappropriate job scheduling mechanisms – a large percentage of failures are caused by jobs waiting in the queue for too long. The so-called proxy renewal mechanism did not work reliably until newer versions of the middleware solved the problem.
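A minimal sketch of the success-rate computation referred to above, under the assumption that each job trace ends with its final L&B status and that DONE and CLEAR count as success:

from collections import Counter

def success_rate(jobs, ok_states=("DONE", "CLEAR")):
    """jobs: dict mapping a job id to its ordered list of statuses.
    Returns the fraction of jobs whose last recorded status counts as success."""
    final = Counter(trace[-1] for trace in jobs.values() if trace)
    ok = sum(final[s] for s in ok_states)
    total = sum(final.values())
    return ok / total if total else 0.0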
6.2 Patterns of Job Control Flow in the Middleware
Using the performance sequence diagram, we can obtain useful data about the patterns of events that the jobs follow during their lifetime. Analyzing an RB log consisting of
Fig. 4. Patterns of job control flow
around 18500 jobs, we have identified 85 different patterns of behavior. As shown in Fig. 4, most of the jobs that finish successfully follow the first and the third pattern. They do so in an average time of 27 hours and 11 hours respectively, with the former length being due to user intervention to pick up the results. This can be used as a good reference for the validity lengths of the proxy certificates. Another conclusion that we can draw from these results is the relatively large number of jobs following the fifth pattern (Pattern 4 in Fig. 4). Around 600 jobs have failed after waiting in the queues for an average of more than 280 hours. Using this data and examining the specific instances we could identify the reason for such failures.
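The grouping into behavioral patterns can be approximated by keying jobs on their exact status sequence. The Python sketch below is an illustration rather than the performance sequence diagram algorithm itself; it counts how often each pattern occurs and its average duration, assuming each event carries a Unix timestamp.

from collections import defaultdict

def behavior_patterns(jobs):
    """jobs: dict mapping a job id to a list of (status, unix_time) pairs.
    Groups jobs that follow the same sequence of statuses and reports, per
    pattern, the number of occurrences and the average duration in hours."""
    groups = defaultdict(list)
    for trace in jobs.values():
        if len(trace) < 2:
            continue
        pattern = tuple(status for status, _ in trace)
        duration = trace[-1][1] - trace[0][1]
        groups[pattern].append(duration)
    return {p: (len(d), sum(d) / len(d) / 3600.0) for p, d in groups.items()}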
Fig. 5. Performance analysis with bottlenecks (details)
6.3 Bottlenecks in the Job Lifetime
Using the performance sequence diagram, we have analyzed jobs from a single user (for performance's sake). Out of 470 jobs, almost 20% of them have been waiting in the queues for an average time between 57 and 65 hours. All of them finished with ABORT (mostly due to expired proxies). Using the Petri net performance analysis we could also locate the point with the biggest waiting time (shown in purple in Fig. 5). We can see that jobs spend most of their time waiting to start running. The other two bottlenecks are the running time (which greatly depends on the job itself) and the time before the output is retrieved (shown in yellow), which involves human interaction.
7 Future Works
The work presented in this paper is only the beginning of a deeper and wider performance analysis of the GRID middleware. Several issues that we will tackle soon include: building a custom import filter, based on the ProMImport framework, to import the data directly from the L&B database; extending the data that is imported into the framework with additional elements (CE name, matching process results, some JDL attributes, etc.) to enable even more analyses; enabling a direct connection to the L&B web service interface, so that users can select particular jobs from within the ProM framework and get even more data from the service directly; and proposing a more
intuitive interface (possibly web-based) to the L&B data, to enable users to get a better understanding. Waiting times in the queues can be quite long. Some solutions to these problems that we will investigate further are:
− providing separate queues for various types of jobs, which could strengthen the user's perception of the GRID,
− providing an end-to-end mechanism for job prioritization.
8 Conclusion
Using process mining to analyze GRID middleware is not a new idea, but very little has been done to actually analyze the platform. Using the L&B database as a source of logging data was a natural choice. After researching the appropriate tools, the ProM tool was chosen, mostly for the features mentioned above. The initial results of the mining process are presented in this paper. A very important conclusion from the analysis is that the underlying infrastructure performs satisfactorily. With an overall job success rate of around 70%, it is quite near the EGEE [14] average of 79% [15]. Since user experience affects the percentage of successful jobs, educating the users about the underlying technology will increase the overall performance. The more aware the users are of the possibilities of the infrastructure and of the ways to evaluate certain sites, the better the success rate will be. In this context, measuring the success rate of each site can help users choose only the set of sites that promise higher throughput.
References 1. Zhang, X., Schopf, J.M.: Performance Analysis of the Globus Toolkit Monitoring and Discovery Service. In: MDS2, Proceedings of the International Workshop on Middleware Performance (MP 2004), part of the 23rd International Performance Computing and Communications Conference (IPCCC) (2004) 2. Jackson, M., Antonioletti, M., Chue Hong, N., Hume, A., Krause, A., Sugden, T., Westhead, M.: Performance Analysis of the OGSA-DAI Software. In: Proceedings of the UK e-Science All Hands Meeting, Nottingham, UK (September 2004) 3. EGEE User’s Guide, Service Logging And Bookkeeping (L&B) (2007), https://edms.cern.ch/document/571273/1 4. Kouril, D., Krenek, A., Matyska, L., Mulac, M., Pospısil, J., Ruda, M., Salvet, Z., Sitera, J., Skrabal, J., Vocu, M.: Advances in the L&B Grid Job Monitoring Service (2007) (visited 06.08.2007), http://lindir.ics.muni.cz/dg_public/lb2.pdf 5. Margalef, T., Jorba, J., Morajko, O., Morajko, A., Luque, E.: Different Approaches to Automatic Performance Analysis of Distributed Applications. In: Getov, V., et al. (eds.) Performance Analysis and Grid Computing. Springer, Heidelberg (2004) 6. van der Aalst, W.M.P., Weijters, A.J.M.M. (eds.): Process Mining. Special Issue of Computers in Industry, vol. 53. Elsevier Science Publishers, Amsterdam (2004)
7. Rozinat, A., de Jong, I.S.M., Gunther, C.W., van der Aalst, W.M.P.: Process Mining of Test Processes: A Case Study, BETA Working Paper Series, WP 220, Eindhoven University of Technology, Eindhoven (2007) 8. Alves de Medeiros, A.K., Günther, C.W.: Process Mining: Using CPN Tools to Create Test Logs for Mining Algorithms. In: Sixth Workshop and Tutorial on Practical Use of Colored Petri Nets and the CPN Tools, Aarhus, Denmark (October 2005) 9. Process mining (2007), http://www.processmining.org/ 10. ProM tool (2007), http://is.tm.tue.nl/~cgunther/dev/prom/ 11. Alves de Medeiros, A.K., Weijters, A.J.M.M. (Ton): ProM tutorial, Technische Universiteit Eindhoven, The Netherlands (November 2006) 12. ProMimport, http://is.tm.tue.nl/~cgunther/dev/promimport/ 13. SEE-GRID – South Eastern Europe GRID-enabled eInfrastructure Development (2007), http://www.see-grid.eu/ 14. EGEE – Enabling Grids for E-sciencE (2007), http://www.eu-egee.org/ 15. Monitoring and visualization tool for LCG (statistics for February 2008), http://gridview.cern.ch/GRIDVIEW/job_index.php
Bi-criteria Pipeline Mappings for Parallel Image Processing Anne Benoit1 , Harald Kosch2 , Veronika Rehn-Sonigo1 , and Yves Robert1 1
LIP, ENS Lyon, 46 Allée d'Italie, 69364 Lyon Cedex 07, France {Anne.Benoit,Veronika.Sonigo,Yves.Robert}@ens-lyon.fr 2 University of Passau, Innstr. 43, 94032 Passau, Germany [email protected]
Abstract. Mapping workflow applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline graphs. Several antagonistic criteria should be optimized, such as throughput/period and latency (or a combination). Typical applications include digital image processing, where images are processed in steady-state mode. In this paper, we study the bi-criteria mapping (minimizing period and latency) of the JPEG encoding on a cluster of workstations. We present an integer linear programming formulation for this NP-hard problem, and an in-depth performance evaluation of several polynomial heuristics. Keywords: pipeline, workflow application, multi-criteria, optimization, JPEG encoding.
1 Introduction
This work considers the problem of mapping workflow applications onto parallel platforms. This is a challenging problem, even for simple application patterns. For homogeneous architectures, several scheduling and load-balancing techniques have been developed, but the extension to heterogeneous clusters makes the problem more difficult. Structured programming approaches rule out many of the problems which the low-level parallel application developer is usually confronted with, such as deadlocks or process starvation. We therefore focus on pipeline applications, as they can easily be expressed as algorithmic skeletons. More precisely, in this paper, we study the mapping of a particular pipeline application: we focus on the JPEG encoder (baseline process, basic mode). This image processing application transforms numerical pictures from any format into a standardized format called JPEG. This standard was developed almost 20 years ago to create a portable format for the compression of still images, and new versions have been created since then (see http://www.jpeg.org/). Meanwhile, several parallel algorithms have been proposed [9]. JPEG (and later JPEG 2000) is used for encoding still images in Motion-JPEG (later MJ2). These standards are commonly employed in IP-cams and are part of many video applications in the world of game consoles. Motion-JPEG (M-JPEG) has been adopted and further developed to several
other formats, e.g., AMV (alternatively known as MTV), which is a proprietary video file format designed to be consumed on low-resource devices. The manner of encoding in M-JPEG and subsequent formats leads to a flow of still image coding, hence pipeline mapping is appropriate. We consider the different steps of the encoder as a linear pipeline of stages, where each stage gets some input, has to perform several computations and transfers the output to the next stage. The corresponding mapping problem can be stated informally as follows: which stage to assign to which processor? We require the mapping to be interval-based, i.e., a processor is assigned an interval of consecutive stages. Two key optimization parameters emerge. On the one hand, we target a high throughput, or short period, in order to be able to handle as many images as possible per time unit. On the other hand, we aim at a short response time, or latency, for the processing of each image. These two criteria are antagonistic: intuitively, we obtain a high throughput with many processors to share the work, while we get a small latency by mapping many stages to the same processor in order to avoid the cost of inter-stage communications.
Fig. 1. Steps of the JPEG encoding: Source Image Data, Scaling, YUV Conversion, Subsampling, Block Storage, FDCT, Quantizer (with Quantization Table), Entropy Encoder (with Huffman Table), Compressed Image Data
2 Framework
Principles of JPEG encoding. Here we briefly present the mode of operation of a JPEG encoder (see [14] for further details). The encoder consists of seven pipeline stages, as shown in Fig. 1. In the first stage, the image is scaled to have a multiple of an 8x8 pixel matrix, and the standard even claims a multiple of 16x16. In the next stage a color space conversion is performed from the RGB to the YUV color model. The sub-sampling stage is an optional stage, which, depending on the sampling rate, reduces the data volume: as the human eye can resolve luminosity more easily than color, the chrominance components are sampled more rarely than the luminance components. Admittedly, this leads to a loss of data. The last preparation step consists in the creation and storage of so-called MCUs (Minimum Coded Units), which correspond to 8x8 pixel blocks in the picture. The next stage is the core of the encoder. It performs a Fast Discrete Cosine Transformation (FDCT) (e.g., [15]) on the 8x8 pixel blocks, which are interpreted as a discrete signal of 64 values. After the transformation, every point in the matrix is represented as a linear combination of the 64 points. The quantizer reduces the image information to the important parts. Depending on the quantization factor and quantization matrix, irrelevant frequencies are reduced. Thereby quantization errors can occur, which are noticeable as quantization noise or block artifacts in the encoded image. The last stage is the entropy encoder, which performs a modified Huffman coding.
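For reference, the transform applied in the FDCT stage to each 8x8 block f(x, y) is the standard two-dimensional DCT-II of baseline JPEG, with the quantizer subsequently dividing each coefficient F(u, v) by the corresponding entry of the quantization table:

F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)\, \cos\frac{(2x+1)u\pi}{16}\, \cos\frac{(2y+1)v\pi}{16}, \qquad C(0) = \frac{1}{\sqrt{2}},\; C(k) = 1 \text{ for } k > 0.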
Applicative framework. From a theoretical point of view, we consider a pipeline of n stages S_k, 1 ≤ k ≤ n. Tasks are fed into the pipeline and processed from stage to stage, until they exit the pipeline after the last stage. The k-th stage S_k first receives an input from the previous stage, of size δ^{k−1}, then performs a number w^k of computations, and finally outputs data of size δ^k to the next stage. These three operations are performed sequentially. The first stage S_1 receives an input of size δ^0 from the outside world, while the last stage S_n returns the result, of size δ^n, to the outside world; thus these particular stages behave in the same way as the others. From a practical point of view, we consider the applicative pipeline of the JPEG encoder as presented in Fig. 1 and its seven stages.
Target platform. We target a platform with p processors P_u, 1 ≤ u ≤ p, fully interconnected as a (virtual) clique. There is a bidirectional link link_{u,v}: P_u → P_v between any processor pair P_u and P_v, of bandwidth b_{u,v}. The speed of processor P_u is denoted as s_u, and it takes X/s_u time-units for P_u to execute X floating point operations. We enforce a linear cost model for communications: it takes X/b_{u,v} time-units to send (resp. receive) a message of size X to (resp. from) P_v. Communication contention is taken care of by enforcing the one-port model [3].
Bi-criteria interval mapping problem. We seek to map intervals of consecutive stages onto processors [13]. Intuitively, assigning several consecutive tasks to the same processor will increase their computational load, but may well dramatically decrease communication requirements. We search for a partition of [1..n] into m ≤ p intervals I_j = [d_j, e_j] such that d_j ≤ e_j for 1 ≤ j ≤ m, d_1 = 1, d_{j+1} = e_j + 1 for 1 ≤ j ≤ m − 1 and e_m = n. The optimization problem is to determine the best mapping, over all possible partitions into intervals, and over all processor assignments. The objective can be to minimize either the period, or the latency, or a combination: given a threshold period, what is the minimum latency that can be achieved? And the counterpart: given a threshold latency, what is the minimum period that can be achieved? The decision problem associated to this bi-criteria interval mapping optimization problem is NP-hard, since the period minimization problem is NP-hard for interval-based mappings (see [2]).
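Under this cost model, the period and latency of a concrete interval mapping can be evaluated directly. The following Python sketch is an illustration of the definitions above, not code from the paper; it assumes delta[0..n], w[1..n] (w[0] unused), speeds s and bandwidths b, with the fictitious endpoints "in" and "out" included in b.

def period_and_latency(intervals, delta, w, s, b):
    """intervals: list of ((d, e), proc) pairs in pipeline order covering the
    stages 1..n; delta[k] is the size of the data produced by stage k (delta[0]
    is the initial input), w[k] the work of stage k, s[p] the speed of
    processor p, and b[p][q] the bandwidth of the link from p to q."""
    n = len(w) - 1
    procs = ["in"] + [p for _, p in intervals] + ["out"]
    period, latency = 0.0, 0.0
    for j, ((d, e), p) in enumerate(intervals):
        prev_p, next_p = procs[j], procs[j + 2]
        comp = sum(w[k] / s[p] for k in range(d, e + 1))
        t_in = delta[d - 1] / b[prev_p][p]     # receive the interval's input
        t_out = delta[e] / b[p][next_p]        # send the interval's output
        period = max(period, t_in + comp + t_out)
        latency += t_in + comp                 # accumulate along the chain
    latency += delta[n] / b[intervals[-1][1]]["out"]
    return period, latency

The period is the largest cycle-time over the processors, while the latency accumulates input communications and computations along the chain plus the final output transfer, matching the quantities bounded by the linear program below.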
3 Linear Program Formulation
We present here an integer linear program to compute the optimal interval-based bi-criteria mapping on Fully Heterogeneous platforms, respecting either a fixed latency or a fixed period. We consider a framework of n stages and p processors, plus two fictitious extra stages S_0 and S_{n+1} respectively assigned to P_in and P_out. First we need to define a few variables. For k ∈ [0..n+1] and u ∈ [1..p] ∪ {in, out}, x_{k,u} is a boolean variable equal to 1 if stage S_k is assigned to processor P_u; we let x_{0,in} = x_{n+1,out} = 1, and x_{k,in} = x_{k,out} = 0 for 1 ≤ k ≤ n. For k ∈ [0..n], u, v ∈ [1..p] ∪ {in, out} with
u ≠ v, z_{k,u,v} is a boolean variable equal to 1 if stage S_k is assigned to P_u and stage S_{k+1} is assigned to P_v: hence link_{u,v}: P_u → P_v is used for the communication between these two stages. If k = 0 then z_{k,u,v} = 0 whenever u ≠ in, and if k = n then z_{k,u,v} = 0 whenever v ≠ out. For k ∈ [0..n] and u ∈ [1..p] ∪ {in, out}, y_{k,u} is a boolean variable equal to 1 if stages S_k and S_{k+1} are both assigned to P_u; we let y_{k,in} = y_{k,out} = 0 for all k, and y_{0,u} = y_{n,u} = 0 for all u. For u ∈ [1..p], first(u) is an integer variable which denotes the first stage assigned to P_u; similarly, last(u) denotes the last stage assigned to P_u. Thus P_u is assigned the interval [first(u), last(u)]. Of course 1 ≤ first(u) ≤ last(u) ≤ n. T_opt is the variable to optimize, so depending on the objective function it corresponds either to the period or to the latency. We list below the constraints that need to be enforced. For simplicity, we write Σ_u instead of Σ_{u∈[1..p]∪{in,out}} when summing over all processors.
First there are constraints for processor and link usage. Every stage is assigned a processor:
∀k ∈ [0..n+1], Σ_u x_{k,u} = 1.
Every communication either is assigned a link or collapses because both stages are assigned to the same processor:
∀k ∈ [0..n], Σ_{u≠v} z_{k,u,v} + Σ_u y_{k,u} = 1.
If stage S_k is assigned to P_u and stage S_{k+1} to P_v, then link_{u,v}: P_u → P_v is used for this communication:
∀k ∈ [0..n], ∀u, v ∈ [1..p] ∪ {in, out}, u ≠ v, x_{k,u} + x_{k+1,v} ≤ 1 + z_{k,u,v}.
If both stages S_k and S_{k+1} are assigned to P_u, then y_{k,u} = 1:
∀k ∈ [0..n], ∀u ∈ [1..p] ∪ {in, out}, x_{k,u} + x_{k+1,u} ≤ 1 + y_{k,u}.
If stage S_k is assigned to P_u, then necessarily first_u ≤ k ≤ last_u. We write this constraint as:
∀k ∈ [1..n], ∀u ∈ [1..p], first_u ≤ k·x_{k,u} + n·(1 − x_{k,u}) and last_u ≥ k·x_{k,u}.
Furthermore, if stage S_k is assigned to P_u and stage S_{k+1} is assigned to P_v ≠ P_u (i.e., z_{k,u,v} = 1) then necessarily last_u ≤ k and first_v ≥ k + 1 since we consider intervals. We write this constraint as:
∀k ∈ [1..n−1], ∀u, v ∈ [1..p], u ≠ v, last_u ≤ k·z_{k,u,v} + n·(1 − z_{k,u,v}) and first_v ≥ (k+1)·z_{k,u,v}.
The latency of the schedule is bounded by T_latency:
Σ_{u=1}^{p} Σ_{k=1}^{n} ( Σ_{t≠u} (δ^{k−1}/b_{t,u}) z_{k−1,t,u} + (w^k/s_u) x_{k,u} ) + Σ_{u∈[1..p]∪{in}} (δ^n/b_{u,out}) z_{n,u,out} ≤ T_latency,
where t ∈ [1..p] ∪ {in, out}. It remains to express the period of each processor and to constrain it by T_period:
∀u ∈ [1..p], Σ_{k=1}^{n} ( Σ_{t≠u} (δ^{k−1}/b_{t,u}) z_{k−1,t,u} + (w^k/s_u) x_{k,u} + Σ_{v≠u} (δ^k/b_{u,v}) z_{k,u,v} ) ≤ T_period.
Finally, the objective function is either to minimize the period T_period respecting the fixed latency T_latency, or to minimize the latency T_latency with a fixed period T_period. So in the first case we fix T_latency and set T_opt = T_period. In the second case T_period is fixed a priori and T_opt = T_latency. With this mechanism the objective function reduces to minimizing T_opt in both cases.
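For concreteness, the program can be transcribed almost literally with an off-the-shelf MILP modeller. The Python sketch below uses the PuLP library and the mode that minimizes the period under a latency bound; it is a condensed illustration of the formulation above (the boundary fixings of z and y are implied by the other constraints and omitted), and the parameter containers (delta[0..n], w[1..n], speeds s, and a bandwidth table b defined for every ordered pair of processors including the fictitious "in" and "out" endpoints) are assumptions about how the data is supplied.

from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary, LpInteger

def build_ilp(n, procs, delta, w, s, b, latency_bound):
    ALL = list(procs) + ["in", "out"]
    prob = LpProblem("bicriteria_interval_mapping", LpMinimize)
    x = LpVariable.dicts("x", (range(n + 2), ALL), cat=LpBinary)
    z = LpVariable.dicts("z", (range(n + 1), ALL, ALL), cat=LpBinary)
    y = LpVariable.dicts("y", (range(n + 1), ALL), cat=LpBinary)
    first = LpVariable.dicts("first", procs, 1, n, LpInteger)
    last = LpVariable.dicts("last", procs, 1, n, LpInteger)
    T = LpVariable("T_period", lowBound=0)
    prob += T                                    # objective: minimize the period
    prob += x[0]["in"] == 1
    prob += x[n + 1]["out"] == 1
    for k in range(1, n + 1):                    # real stages never sit on in/out
        prob += x[k]["in"] == 0
        prob += x[k]["out"] == 0
    for k in range(n + 2):                       # every stage gets one processor
        prob += lpSum(x[k][u] for u in ALL) == 1
    for k in range(n + 1):                       # one link used, or the communication collapses
        prob += (lpSum(z[k][u][v] for u in ALL for v in ALL if u != v)
                 + lpSum(y[k][u] for u in ALL)) == 1
        for u in ALL:
            prob += x[k][u] + x[k + 1][u] <= 1 + y[k][u]
            for v in ALL:
                if u != v:
                    prob += x[k][u] + x[k + 1][v] <= 1 + z[k][u][v]
    for k in range(1, n + 1):                    # stages of P_u form [first(u), last(u)]
        for u in procs:
            prob += first[u] <= k * x[k][u] + n * (1 - x[k][u])
            prob += last[u] >= k * x[k][u]
    for k in range(1, n):
        for u in procs:
            for v in procs:
                if u != v:
                    prob += last[u] <= k * z[k][u][v] + n * (1 - z[k][u][v])
                    prob += first[v] >= (k + 1) * z[k][u][v]
    prob += (lpSum(delta[k - 1] / b[t][u] * z[k - 1][t][u]
                   for u in procs for k in range(1, n + 1) for t in ALL if t != u)
             + lpSum(w[k] / s[u] * x[k][u] for u in procs for k in range(1, n + 1))
             + lpSum(delta[n] / b[u]["out"] * z[n][u]["out"] for u in list(procs) + ["in"])
             ) <= latency_bound                  # latency of the whole schedule
    for u in procs:                              # cycle-time of each processor
        prob += (lpSum(delta[k - 1] / b[t][u] * z[k - 1][t][u]
                       for k in range(1, n + 1) for t in ALL if t != u)
                 + lpSum(w[k] / s[u] * x[k][u] for k in range(1, n + 1))
                 + lpSum(delta[k] / b[u][v] * z[k][u][v]
                         for k in range(1, n + 1) for v in ALL if v != u)
                 ) <= T
    return prob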
4 Overview of the Heuristics
The problem of bi-criteria interval mapping of workflow applications is NP-hard [2], so in this section we briefly describe polynomial heuristics to solve it. See [2] for a more complete description or refer to the Web at: http://graal.ens-lyon.fr/~vsonigo/code/multicriteria/ In the following, we denote by n the number of stages, and by p the number of processors. We distinguish two sets of heuristics. The heuristics of the first set aim to minimize the latency respecting an a priori fixed period. The heuristics of the second set minimize the counterpart: the latency is fixed a priori and we try to achieve a minimum period while respecting the latency constraint.
4.1 Minimizing Latency for a Fixed Period
All the following heuristics sort processors by non-increasing speed, and start by assigning all the stages to the first (fastest) processor in the list. This processor becomes used.
H1-Sp-mono-P: Splitting mono-criterion. At each step, we select the used processor j with the largest period and we try to split its stage interval, giving some stages to the next fastest processor j′ in the list (not yet used). This can be done by splitting the interval at any place, and either placing the first part of the interval on j and the remainder on j′, or the other way round. The solution which minimizes max(period(j), period(j′)) is chosen if it is better than the original solution. Splitting is performed as long as we have not reached the fixed period or until we cannot improve the period anymore.
H2-Sp-bi-P: Splitting bi-criteria. This heuristic uses a binary search over the latency. For this purpose at each iteration we fix an authorized increase of the optimal latency (which is obtained by mapping all stages on the fastest processor), and we test if we get a feasible solution via splitting. The splitting mechanism itself is quite similar to H1-Sp-mono-P, except that we choose the solution that minimizes max_{i∈{j,j′}}(Δlatency/Δperiod(i)) within the authorized latency increase to decide where to split. While we get a feasible solution, we reduce the authorized latency increase for the next iteration of the binary search, thereby aiming at minimizing the mapping's global latency.
H3-3-Sp-mono-P: 3-splitting mono-criterion. At each step we select the used processor j with the largest period and we split its interval into three parts. For this purpose we try to map two parts of the interval on the next pair of fastest processors in the list, j′ and j″, and to keep the third part on processor j. Testing all possible permutations and all possible positions where to cut, we choose the solution that minimizes max(period(j), period(j′), period(j″)).
H4-3-Sp-bi-P: 3-splitting bi-criteria. In this heuristic the choice of where to split is more elaborate: it depends not only on the period improvement, but also on the latency increase. Using the same splitting mechanism as in H3-3-Sp-mono-P, we select the solution that minimizes max_{i∈{j,j′,j″}}(Δlatency/Δperiod(i)).
Here Δlatency denotes the difference between the global latency of the solution before the split and after the split. In the same manner, Δperiod(i) denotes the difference between the period before the split (achieved by processor j) and the new period of processor i.
4.2 Minimizing Period for a Fixed Latency
As in the heuristics described above, first of all we sort processors according to their speed and map all stages on the fastest processor.
H5-Sp-mono-L: Splitting mono-criterion. This heuristic uses the same method as H1-Sp-mono-P with a different break condition. Here splitting is performed as long as we do not exceed the fixed latency, still choosing the solution that minimizes max(period(j), period(j′)).
H6-Sp-bi-L: Splitting bi-criteria. This variant of the splitting heuristic works similarly to H5-Sp-mono-L, but at each step it chooses the solution which minimizes max_{i∈{j,j′}}(Δlatency/Δperiod(i)) while the fixed latency is not exceeded.
Remark. In the context of M-JPEG coding, minimizing the latency for a fixed period corresponds to a fixed coding rate, and we want to minimize the response time. The counterpart (minimizing the period respecting a fixed latency L) corresponds to the question: if I accept to wait L time units for a given image, which coding rate can I achieve? We evaluate the behavior of the heuristics with respect to these questions in Section 5.
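A simplified variant of the splitting idea behind H1/H5 can be sketched in Python as follows (an illustration, not the authors' implementation): instead of pre-selecting the bottleneck processor, it tries every cut of every current interval, with the next unused processor placed before or after the cut, and keeps the split that most reduces the global period. The period_of argument abstracts the period computation, for instance the period part of the helper sketched in Section 2.

def greedy_split(n, procs_by_speed, period_of, target_period):
    # Start with all stages [1..n] on the fastest processor and repeatedly
    # apply the single interval split that most reduces the global period,
    # until the target period is met, no unused processor remains, or no
    # split helps anymore.
    mapping = [((1, n), procs_by_speed[0])]
    unused = list(procs_by_speed[1:])
    while unused and period_of(mapping) > target_period:
        q, best = unused[0], None
        for i, ((d, e), p) in enumerate(mapping):
            for cut in range(d, e):                   # split [d..e] after stage `cut`
                for left, right in ((p, q), (q, p)):  # new processor before or after
                    trial = (mapping[:i] + [((d, cut), left), ((cut + 1, e), right)]
                             + mapping[i + 1:])
                    if best is None or period_of(trial) < period_of(best):
                        best = trial
        if best is None or period_of(best) >= period_of(mapping):
            break                                     # no split improves the period
        mapping, unused = best, unused[1:]
    return mapping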
5 Experiments and Simulations
In the following experiments, we study the mapping of the JPEG application onto clusters of workstations.
Fig. 2. LP solutions strongly depend on fixed initial parameters: (a) interval mappings of the seven stages obtained for fixed periods P_fix = 310, 320 and 330; (b) interval mappings obtained for fixed latencies L_fix = 330, 340 and 370.
Influence of fixed parameters. In this first test series, we examine the influence of fixed parameters on the solution of the linear program. As shown in
Fig. 2, the division into intervals is highly dependent on the chosen fixed value. The optimal solution to minimize the latency (without any supplemental constraints) obviously consists in mapping the whole application pipeline onto the fastest processor. As expected, if the period fixed in the linear program is not smaller than the latter optimal mono-criterion latency, this solution is chosen. Decreasing the value of the fixed period forces the stages to be split among several processors, until no more solutions can be found. Fig. 2(a) shows the division into intervals for a fixed period. A fixed period of T_period = 330 is sufficiently high for the whole pipeline to be mapped onto the fastest processor, whereas smaller periods lead to splitting into intervals. We would like to mention that for a period fixed to 300, no solution exists anymore. The counterpart, with a fixed latency, can be found in Fig. 2(b). Note that the first two solutions find the same period, but for a different latency. The first solution has a high value for the latency, which allows more splits, hence larger communication costs. Comparing the last lines of Fig. 2(a) and (b), we state that both solutions are the same, and we have T_period = T_latency. Finally, expanding the range of the fixed values, a sort of bucket behavior becomes apparent: increasing the fixed parameter at first has no influence, and the LP still finds the same solution until the increase crosses a certain bound and the LP can find a better solution. This phenomenon is shown in Fig. 3.
Fig. 3. Bucket behavior of LP solutions: (a) optimal latency as a function of the fixed period; (b) optimal period as a function of the fixed latency.
Assessing heuristic performance. The comparison of the solution returned by the LP program, in terms of optimal latency respecting a fixed period (or the converse), with the heuristics is shown in Fig. 4. The implementation is fed with the parameters of the JPEG encoding pipeline and computes the mapping on 10 randomly created platforms with 10 processors. On platforms 3 and 5, no valid solution can be found for the fixed period. There are two important points to mention. First, the solutions found by H2 are often not valid, since they do not respect the fixed period, but they have the best latency/period ratio. Fig. 5(b) plots some more details: H2 achieves good latency results, but the fixed period of P = 310 is often violated. This is a consequence of the fact that the fixed period value is very close to the smallest feasible period. When the tolerance for the period is bigger, this heuristic succeeds in finding low-latency solutions. Second, all solutions, LP and heuristics, always keep the stages 4 to 7 together (see Fig. 2
for an example). As stage 5 (DCT) is the most costly in terms of computation, the interval containing these stages is responsible for the period of the whole application. Finally, in the comparative study H1 always finds the optimal latency for the fixed period, and we therefore recommend this heuristic for latency optimization under a period constraint. For minimizing the period under a fixed latency, H5 is the one to use, as it always finds the LP solution in the experiments. This is a striking result, especially given the fact that the LP integer program may require a long time to compute the solution (up to 11389 seconds in our experiments), while the heuristics always complete in less than a second and find the corresponding optimal solution.
Fig. 4. Behavior of the heuristics (compared to the LP solution) on the ten random platforms: (a) fixed P = 310, latency of LP, H1, H2, H3 and H4; (b) fixed L = 370, period of LP, H5 and H6.
Fig. 5. MPI simulation results: (a) simulated versus theoretical latency per heuristic; (b) H2 versus the LP solution in the period/latency plane.
MPI simulations on a cluster. This last experiment performs a JPEG encoding simulation. All simulations are run on a cluster of homogeneous Optiplex GX 745 machines with an Intel Core 2 Duo 6300 at 1.83 GHz. Heterogeneity is enforced by increasing and decreasing the number of operations a processor has to execute. The same holds for bandwidth capacities. For simplicity we use an MPI program whose stages have the same communication and computation parameters as the JPEG encoder, but we do not encode real images (hence the name simulation, although we use an actual implementation with MPICH). In this experiment the same random platforms with 10 processors and fixed parameters as in the theoretical experiments are used. We measured the latency of the simulation, even for the heuristics with fixed latency, and computed the average over all random platforms. Fig. 5(a) compares the average of the theoretical
results of the heuristics to the average simulative performance. The simulative behavior nicely mirrors the theoretical behavior, with the exception of H2 (see Fig. 5(b)). Here once again, some solutions of this heuristic are not valid, as they do not respect the fixed period.
6 Related Work
The blockwise independent processing of the JPEG encoder allows simple data parallelism to be applied for efficient parallelization. Many papers have addressed this fine-grain parallelization opportunity [5,12]. In addition, parallelization of almost all stages, from color space conversion, over DCT, to the Huffman encoding has been addressed [1,7]. Recently, with respect to the JPEG2000 codec, efficient parallelization of wavelet coding has been introduced [8]. All these works target the best speed-up with respect to different architectures and possibly varying load situations. Optimizing the period and the latency is an important issue when encoding a pipeline of multiple images, as for instance for Motion JPEG (M-JPEG). To meet these issues, one has to solve, in addition to the above mentioned work, a bi-criteria optimization problem, i.e., optimize the latency as well as the period. The application of coarse grain parallelism seems to be a promising solution. We propose to use an interval-based mapping strategy allowing multiple stages to be mapped to one processor, which is the most flexible way to meet the domain constraints (even for very large pictures). Several pipelined versions of the JPEG encoding have been considered. They rely mainly on pixel or blockwise parallelization [6,10]. For instance, Ferretti et al. [6] use three pipelines to carry out concurrently the encoding on independent pixels extracted from the serial stream of incoming data. The pixel and block-based approach is however useful for small pictures only. Recently, Shee et al. [11] consider a pipeline architecture where each stage presents a step in the JPEG encoding. The targeted architecture consists of Xtensa LX processors which run subprograms of the JPEG encoder program. Each program accepts data via the queues of the processor, performs the necessary computation, and finally pushes it to the output queue into the next stage of the pipeline. The basic assumptions are similar to our work; however, no optimization problem is considered and only runtime (latency) measurements are available. The schedule is static and set according to basic assumptions about the image processing, e.g., that the DCT is the most complex operation in runtime.
7 Conclusion
In this paper, we have studied the bi-criteria (minimizing latency and period) mapping of pipeline workflow applications, from both a theoretical and practical point of view. On the theoretical side, we have presented an integer linear programming formulation for this NP-hard problem. On the practical side, we have studied in depth the interval mapping of the JPEG encoding pipeline on a cluster of workstations. Owing to the LP solution, we were able to characterize
a bucket behavior in the optimal solution, depending on the initial parameters. Furthermore, we have compared the behavior of some polynomial heuristics to the LP solution and we were able to recommend two heuristics with almost optimal behavior for parallel JPEG encoding. Finally, we evaluated the heuristics running a parallel pipeline application with the same parameters as a JPEG encoder. The heuristics were designed for general pipeline applications, and some of them were aiming at applications with a large number of stages (3-splitting), thus a priori not very efficient on the JPEG encoder. Still, some of these heuristics reach the optimal solution in our experiments, which is a striking result. A natural extension of this work would be to consider further image processing applications with more pipeline stages or a slightly more complicated pipeline architecture. Naturally, our work extends to JPEG 2000 encoding, which offers, among others, wavelet coding and more complex multiple-component image encoding [4]. Another extension is for the MPEG coding family, which uses lagged feedback: the coding of some types of frames depends on other frames. Differentiating the types of coding algorithms, a pipeline architecture again seems to be a promising solution architecture.
References 1. Agostini, L.V., Silva, I.S., Bampi, S.: Parallel color space converters for JPEG image compression. Microelectronics Reliability 44, 697 (2004) 2. Benoit, A., Rehn-Sonigo, V., Robert, Y.: Multi-criteria Scheduling of Pipeline Workflows. In: HeteroPar 2007, Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks. IEEE Computer Society Press, Los Alamitos (2007) 3. Bhat, P., Raghavendra, C., Prasanna, V.: Efficient collective communication in distributed heterogeneous systems. Journal of Parallel and Distributed Computing 63, 251 (2003) 4. Christopoulos, C., Skodras, A., Ebrahimi, T.: The JPEG2000 still image coding system: an overview. IEEE Trans. on Consumer Electronics 46, 1103 (2000) 5. Falkemeier, J., Joubert, G.: Parallel image compression with JPEG for multimedisa applications. High Performance Computing: Technologies, Methods and Applications, Advances in Parallel Computing, 379–394 (1995) 6. Ferretti, M., Boffadossi, M.: A Parallel Pipelined Implementation of LOCO-I for JPEG-LS. In: 17th International Conference on Pattern Recognition (ICPR 2004), vol. 1, pp. 769–772 (2004) 7. Kumaki, T., et al.: Acceleration of DCT Processing with Massive-Parallel MemoryEmbedded SIMD Matrix Processor. IEICE Trans. on Information and Systems LETTER- Image Processing and Video Processing E90-D, 1312 (2007) 8. Meerwald, P., Norcen, R., Uhl, A.: Parallel JPEG2000 Image Coding on Multiprocessors. In: IPDPS 2002. IEEE Computer Society Press, Los Alamitos (2002) 9. Monnes, P., Furht, B.: Parallel JPEG Algorithms for Still Image Processing. In: Southeastcon 1994. Creative Technology Transfer - A Global Affair. Proceedings of the 1994 IEEE, pp. 375–379 (1994)
10. Papadonikolakis, M., Pantazis, V., Kakarountas, A.P.: Efficient high-performance ASIC implementation of JPEG-LS encoder. In: Proceedings of the Conference on Design, Automation and Test in Europe (DATE 2007). IEEE Communications Society Press (2007) 11. Shee, S.L., Erdos, A., Parameswaran, S.: Architectural Exploration of Heterogeneous Multiprocessor Systems for JPEG. International Journal of Parallel Programming 35 (2007) 12. Shen, K., Cook, G., Jamieson, L., Delp, E.: An overview of parallel processing approaches to image and video compression. In: Image and Video Compression, Proc. SPIE, vol. 2186, pp. 197–208 (1994) 13. Subhlok, J., Vondran, G.: Optimal latency-throughput tradeoffs for data parallel pipelines. In: ACM SPAA 1996, pp. 62–71. ACM Press, New York (1996) 14. Wallace, G.K.: The JPEG still picture compression standard. Commun. ACM 34, 30 (1991) 15. Wen-Hsiung, C., Smith, C., Fralick, S.: A Fast Computational Algorithm for the Discrete Cosine Transform. IEEE Trans. on Communications 25, 1004 (1977)
A Simulation Framework for Studying Economic Resource Management in Grids Kurt Vanmechelen, Wim Depoorter, and Jan Broeckhove University of Antwerp, BE-2020 Antwerp, Belgium [email protected]
Abstract. Economic principles are increasingly being regarded as a way to address conflicting user requirements, to improve the effectiveness of grid resource management systems, and to deliver incentives for providers to join virtual organizations. Because economic resource management mechanisms can encourage grid participants to reveal the true valuations of their jobs and resources, the system becomes capable of making better scheduling decisions. A lot of exploratory research into different market mechanisms for grids is ongoing. Since it is impractical to conduct analysis of novel mechanisms on operational grids, most of this research is being carried out using simulation. This paper presents the Grid Economics Simulator (GES) in support of such research. The key design goals of the framework are enabling a wide variety of economic and non-economic forms of resource management while simultaneously supporting distributed execution of simulations and exhibiting good scalability properties.
1 Introduction
Conducting research into resource management systems (RMS) on real grids is difficult for two main reasons. The first one relates to the costs involved in setting up and maintaining such a system. The second is the need to test new RMS's under a variety of different load patterns and infrastructural arrangements, which is all but impossible to achieve with a real grid system. The large scale on which grid RMS's need to be studied exacerbates these problems. The only viable option for researchers then is to resort to simulation. While there exist a number of general purpose simulators for grids, they have limited support for economic resource management systems (ERMS). There is a need for such support however, as it allows easy comparison between different economic and non-economic approaches and enables researchers to focus on the mechanism design and implementation of the chosen approach, while leveraging the strength of the existing general purpose framework in setting up the grid environment, running the simulation and monitoring the desired metrics.
2 Related Work
To provide some background on existing simulators and their capabilities we describe a number of them [1,2,3,4,5,6,7] here. For a more elaborate overview one can consult [8]. The Bricks simulator was designed as a performance evaluation system to analyse different scheduling approaches for High Performance Computing systems in a global setting [1]. Two of the most interesting features of Bricks are the use of a scripting language to describe the configuration and parameters of the simulation and its ability to incorporate external components such as NWS into simulations. Bricks has also been used to evaluate fixed cost-based scheduling approaches [9]. The framework dictates a centralized approach for resource management however, limiting its general applicability. Development has ceased and the framework is no longer available from the official project site. MicroGrid can create virtual Globus environments of arbitrary composition and allows for the execution of real applications [2]. As such, it is actually an emulator rather than a simulator. This makes MicroGrid interesting for optimizing grid applications with regards to the target configuration of the grid or conversely allow designers of grids to play with various parameters to optimize the grid architecture. Since MicroGrid is an emulator running real applications, it is very time intensive. It is also difficult to test new resource management approaches as all of them have to be compatible with Globus. Active development seems to have halted after 2004. SimGrid [3] is an extensive toolkit for the simulation of distributed applications and is written in C. The toolkit started out with a central scheduling approach and was subsequently adapted to allow for decentralized scheduling [10]. Later on, it was extended in order to allow developers to implement distributed services in the simulator and transfer them to a real world grid without code modification. Development is ongoing with the addition of MPI support and modifications to the networking layer. SimGrid focuses heavily on the network aspects of grids and less on scheduling strategies. To accommodate for economic resource management, substantial modifications would have to be made to make the simulated entities economic aware and to support the required interaction patterns. While SimGrid has been used in combination with economic scheduling approaches [11], the auctions were performed outside of the framework, with SimGrid only executing the resulting schedule. GridSim is written in Java on top of the SimJava 2.0 basic discrete event infrastructure, dating from 2002. GridSim allows for packet-level simulation of the network and also offers components oriented towards data grids. Additionally, it supports advance reservations, workload traces, an output statistics framework and background network traffic. GridSim has been used to simulate a NimrodG like deadline and budget constrained scheduling system [4] and an auction environment [5]. Development is ongoing with the latest release dating from September 2007. OptorSim [6] is a discrete event simulator that has been developed to simulate data access optimization algorithms in grids. In this regard, it takes inter-site
bandwidth into account for data transfers between grid sites. The simulator’s focus is on overall optimization of grid resources rather than intra-site or per-user optimization. This allows OptorSim to simplify two aspects; all users are modeled as a single Users entity and the worker nodes at each grid site are represented by a single entity as well. The simulation model is based on a simplification of the architecture proposed by the European DataGrid (EDG). OptorSim has been used to evaluate cost-based replication-aware algorithms for Resource Brokers and Replica Optimization Services (ROS). The latest version was released in October 2006. jCase [7] is a tool for evaluating combinatorial auctions through simulation. It has been applied to the field of grid resource management and supports multiple algorithms for price determination and solvers for determining the optimal set of winners in a combinatorial auction. As such, it is one of the few simulation tools that support research into ERMS’s. jCase however, is not a general purpose framework and specifically targets combinatorial auctions. Currently it also lacks support for simulations of dynamic systems over time.
3 GES Overview
The Grid Economics Simulator (GES) is a discrete event simulator that has been developed for the evaluation of various economic approaches in their ability to efficiently organize a resource market. This section will present an overview of the simulator's architecture, operation and features.
3.1 Key Abstractions
Since the focus of GES is on economic grid resource management, we will describe the key abstractions from an economic point of view. It is important to note however that GES also supports non-economic resource management in which case aspects such as billing and pricing are omitted. The consumer represents a grid user that wants to execute computational jobs. Each consumer has a queue of jobs that need to be executed and for which resources must be acquired from providers through participation in the market. A consumer is provided with a budgetary endowment that may be replenished periodically. In every simulation step, consumers are billed with the usage rate prices for all resources that are allocated to their jobs at that particular moment. Every provider hosts a number of CPU and disk resources that are supplied to the computational market. Providers interact with consumers to agree upon a price for the execution of a job. When agreement is reached, the provider will bill the consumer. The execution of a job may start immediately or in the future. Once a resource is allocated to a job, it remains allocated until the job completes. The market brings together consumers and providers. It also dictates the interaction pattern used for negotiating resource allocations. A market has a bank facility that keeps accounts for each consumer and provider. The bank also handles all transactions necessary for paying the bills associated with resource usage.
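These abstractions can be pictured with a minimal Python sketch (class and attribute names here are illustrative, not GES's actual API): a consumer owns a job queue and a budget, a provider prices its capacity, and the bank settles the per-step bills.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    length: float           # processing requirement in time steps
    allocated_to: str = ""  # provider id once an agreement is reached

@dataclass
class Consumer:
    name: str
    budget: float
    jobs: list = field(default_factory=list)

@dataclass
class Provider:
    name: str
    cpu_capacity: float  # collective processing capacity
    usage_rate: float    # price charged per capacity unit per step

class Bank:
    """Keeps an account per participant and settles bills."""
    def __init__(self, participants):
        self.accounts = {p.name: getattr(p, "budget", 0.0) for p in participants}

    def transfer(self, consumer, provider, amount):
        self.accounts[consumer.name] -= amount
        self.accounts[provider.name] += amount

# Per-step billing: a consumer pays the usage rate for the capacity
# currently allocated to its jobs.
def bill_step(bank, consumer, provider, allocated_capacity):
    bank.transfer(consumer, provider, provider.usage_rate * allocated_capacity)
```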
A market follows either a spot market or a future market allocation paradigm. The former is characterized by immediate dispatching of a job to a resource, while the latter supports advance reservation. A more in-depth explanation of these allocation paradigms is given in Section 3.5.
3.2 Simulation Parameters
All simulated entities are characterized by a number of parameters. The most important ones relate to the number of consumers and providers participating in the simulation, the number of jobs and their induced workload, the budgets of the consumers, and the number of CPUs and their collective processing capacity. For increased flexibility, values for these parameters may be chosen in a multitude of ways and at different grouping levels as supported by the configuration layer described in the next subsection. For example, the average number of jobs in the simulation $N^{aj}$ is related to the normalized total system load $L$ and the average normalized load of a job $l^{aj}$ by $L = N^{aj} \times l^{aj}$. Therefore it is possible to choose two of these three parameters to fully specify the load of a scenario. When we want to simulate the arrival of jobs over time we can also use traces of job arrivals $T^j$ or approximate them using arrival distributions $D^j$. The previous discourse is also applicable on a consumer group level as well as on the individual consumer level. For consumer group $i$, for example, we can choose values for the average number of jobs in the group $N_i^{aj}$ and the average normalized load of a job $l_i^{aj}$. The translation of these averages into concrete values for each consumer can be done in a straightforward way by distributing them equally over all consumers in the group, but also by means of a chosen random distribution.
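As a concrete illustration of the relation $L = N^{aj} \times l^{aj}$, the following hypothetical helper derives whichever of the three parameters is left unspecified; the function and its interface are ours, not part of GES's configuration layer.

```python
def complete_load_spec(L=None, n_aj=None, l_aj=None):
    """Given two of (total system load L, average job count N^aj,
    average normalized job load l^aj), derive the third via L = N^aj * l^aj."""
    if L is None:
        return n_aj * l_aj, n_aj, l_aj
    if n_aj is None:
        return L, L / l_aj, l_aj
    return L, n_aj, L / n_aj

# e.g. a total load of 1.5 spread over jobs of average normalized load 0.005
L, n_aj, l_aj = complete_load_spec(L=1.5, l_aj=0.005)  # n_aj = 300 jobs
```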
3.3 Architecture
An overview of the GES architecture is given in the layered diagram of figure 1. Each layer is mapped to a package in the simulator’s codebase. Key design goals of the architecture are extensibility and reusability. This “extend-and-refine” philosophy can be found throughout the whole simulation core layer and its components. The domain layer contains base classes for all domain entities such as Consumer, Provider, Job, GridResource and GridEnvironment. The Bank entity is situated in the economic layer. Support for traditional forms of resource management is provided through the non-economic layer. Class extension is heavily used from the domain layer up to the specific RMS implementation. For instance, a Consumer class of the Domain layer only keeps track of job status metrics, while an EconomicConsumer also keeps track of budgetary metrics. Existing components can be easily extended when new RMS algorithms are added to the framework. An overview of the different RMS systems that are currently supported by GES is given in section 3.5.
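The extend-and-refine idea can be sketched as follows (illustrative classes, not the simulator's real code): the economic layer refines the plain domain consumer with budgetary metrics.

```python
class Consumer:
    """Domain-layer consumer: only tracks job status metrics."""
    def __init__(self, name):
        self.name = name
        self.completed_jobs = 0

    def on_job_completed(self, job):
        self.completed_jobs += 1


class EconomicConsumer(Consumer):
    """Economic-layer refinement: additionally tracks budgetary metrics."""
    def __init__(self, name, budget):
        super().__init__(name)
        self.budget = budget
        self.total_spent = 0.0

    def pay(self, amount):
        self.budget -= amount
        self.total_spent += amount
```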
Fig. 1. Overview of the architecture of GES
Examples of reusability can be found in the Economic layer which provides components for accounting, billing and transactions, the Future Market layer which hosts reservation mechanisms for preemptible and non-preemptible workloads, the Auctions layer that supports pluggable protocols for auctioning, and the Tendering layer where new negotiation strategies can be plugged in. Simulations can be distributed over multiple processing nodes through the distribution layer. This layer interfaces with compute resources that host a Jini-enabled compute service, clusters fronted by a Sun Grid Engine head node, or clusters with a passwordless SSH setup. Currently, distribution is supported at the granularity of a simulated scenario. Possibilities for distributed execution of the individual entities in the simulation are planned for future releases. The gui layer allows the user to create, run and monitor live market scenarios. A screenshot of the user interface is given in figure 2. A persistency framework
Fig. 2. Screenshot of the GES UI
allows for storing both scenario configurations and configurations of the UI layout. Aggregated metrics over simulation runs and over a selection of simulated entities (e.g. a collection of consumers) are supported in the form of means, variances, standard deviations and box plots. After data collection and analysis, data can be directly exported from the simulator’s UI to standard data formats such as csv or graphical formats such as eps and png.
3.4 Operational Overview
A simulation runs for a number of time steps. Each time step consists of a number of phases. For the spot markets (see 3.5), these phases are listed on figure 3. First a central controller updates the joblist and budget of the consumer. Then, depending on the market mechanism used, the consumers, providers or both are instructed to start negotiations. In order to execute jobs, the consumer accepts a bill and sends it to the bank in phase 4. In phase 5 all monetary transactions take place. Finally providers are instructed to execute the relevant jobs. When these are finished, the consumer is notified in phase 7.
Fig. 3. Overview of a simulation step in GES
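A schematic rendering of one spot-market time step, with the seven phases in the order described above; all method names are placeholders rather than GES's API.

```python
def simulation_step(controller, market, consumers, providers, bank):
    # Phase 1: the central controller refreshes job lists and budgets.
    for c in consumers:
        controller.update_joblist_and_budget(c)

    # Phases 2-3: depending on the market mechanism, consumers and/or
    # providers are instructed to start negotiations; each agreement carries a bill.
    agreements = market.negotiate(consumers, providers)

    # Phase 4: consumers accept the bills and forward them to the bank.
    bills = [consumer.accept_bill(agreement) for consumer, agreement in agreements]

    # Phase 5: all monetary transactions are settled.
    for bill in bills:
        bank.settle(bill)

    # Phase 6: providers execute the jobs covered by the agreements.
    finished = [provider.execute_allocated_jobs() for provider in providers]

    # Phase 7: consumers are notified about completed jobs.
    for jobs in finished:
        for job in jobs:
            job.owner.notify_completed(job)
```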
3.5 RMS Frameworks
GES comes with built-in support for a number of reference and experimental resource management systems, both non-economic and economic. The noneconomic RMS’s are provided as a reference and for the purpose of comparison. We have implemented an offline central scheduler that can be initialized with different non-economic scheduling policies: – An Earliest Deadline First policy that schedules in the jobs of the consumer with the earliest deadline first. – A Priority policy where jobs are processed in order of the consumer’s configured priority level. – A Round Robin policy scheduling jobs from different consumers in a round robin fashion. – A FIFO policy that schedules jobs in first-in-first-out manner as they arrive. – The DONE policy that aims to maximize the number of consumers that meet their deadline. It follows a greedy approach, scheduling in consumer requests in order of increasing workload. When planning in an individual job, the CPU with the largest available remaining processing capacity is selected and the job is planned in as close as possible to its deadline.
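As an illustration of the last item, a compact sketch of the DONE policy's greedy rule (the data structures are ours, not the GES implementation): consumer requests are served in order of increasing workload, each job goes to the CPU with the largest remaining capacity, and is planned as late as its deadline allows.

```python
def done_schedule(consumer_requests, cpu_free_time, horizon):
    """Greedy DONE-style planner (illustrative, not the GES code).

    consumer_requests: list of (total_workload, [(job_length, deadline), ...])
    cpu_free_time:     dict cpu_id -> free processing time within the window
    Returns [(job_length, cpu_id, start_time)] for the jobs that were planned in.
    """
    plan = []
    # Serve consumer requests in order of increasing workload.
    for _, jobs in sorted(consumer_requests, key=lambda r: r[0]):
        for length, deadline in jobs:
            # Select the CPU with the largest remaining capacity.
            cpu = max(cpu_free_time, key=cpu_free_time.get)
            if cpu_free_time[cpu] < length or length > min(deadline, horizon):
                continue  # this job cannot be planned in before its deadline
            # Plan the job as close to its deadline as possible.
            start = min(deadline, horizon) - length
            plan.append((length, cpu, start))
            cpu_free_time[cpu] -= length
    return plan
```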
The economic RMS’s implemented in GES are divided into two separate branches. The first one encompasses the spot markets while the second one incorporates the future markets. Spot markets are characterized by very dynamic price setting and quick reaction to changing conditions, but also suffer from the exposure problem [12]. Future markets with support for advance reservation and co-allocation solve this problem at the expense of increased complexity and reaction time to changing market conditions. In spot markets, consumers have to negotiate per job for execution rights, while in future markets they can do this for an entire application consisting of multiple jobs. The spot markets that are implemented in GES are the following: – A Selective Tendering market with congestion control [13], where consumers request quotes from a group of selected providers. If a consumer is unable to obtain an allocation after requesting a certain number of quotes, it backs off and tries again at a later point in time. – An Auction market which supports double auctions as well as English, Dutch, First-Price Sealed-Bid and Vickrey auctions [14]. – A Commodity market that uses a Walrasian Auctioneer [15] for pricing. Multiple price adjustment schemes can be used, ranging from a routine based on Smale’s method [16] to various optimization routines delivered by the Matlab Optimization Toolbox, which are interfaced through RMI. – An implementation of the market mechanisms used in Tycoon [17]. The future markets supported by GES are: – The CBS [18], a centralized brokering system where consumers have to direct their application processing requests to a central broker entity that will negotiate with the providers. The broker aims to maximize the total value generated by fulfilling the consumers’ requests. – The DAS market [18] is a decentralized auctioning system where each provider holds auctions for selling its resources over the scheduling window. A consumer will place a sealed bid for each of its jobs at potentially multiple providers. These providers then calculate the winners of the auctions using a greedy heuristic. Multiple rounds can be held in order to schedule in as many consumers as possible.
4 Case Study: Value Realization for Users with Hard Deadlines
In this case study, we will use GES to study the difference in value realization between different RMS’s for consumers which assign a hard deadline to the execution of their application. We compare the economic DAS and CBS markets with a non-economic, deadline-based scheduler that adopts the DONE policy. We varied the processing capacity of the Grid while measuring realized value, infrastructure utilization, price levels and resource shares. For each sample point, we requested 100 runs in order to monitor the variance in the output metrics as a
result of the use of stochastic variables. In total, 5700 simulations were necessary for the data collection. The simulation was run for 2016 simulated time steps with a grid environment hosting 300 consumers and 20 providers. Consumers were divided into three groups with different deadline ranges and associated valuation factors (V Fdeadline ) as shown in figure 4 (left). Every consumer hosted between 210 and 390 jobs with each job having a processing requirement between 1 and 80 time steps. Each consumer’s valuation was determined by multiplying a base valuation of 10000 credits with a load dependent factor and the V Fdeadline factor. For this setup, we assumed consumers to bid truthfully and consequently equated each consumer’s bid with its valuation. The total processing capacity in the system was 1250, which was uniformly distributed over the providers in the environment. The processing capacity of each individual CPU varied between 0.5 and 1. We ran our simulations on the CalcUA cluster at the university of Antwerp which hosts 256 Opteron 250 nodes using GES’ distribution layer. Our experiment took 10069 seconds on the cluster, yielding a speedup of 85. This speedup closely corresponded to the amount of nodes available to us on the cluster. The right graph in figure 4 shows the percentagewise increase of realized consumer value compared to the DONE RMS when varying infrastructural capacity. As can be observed from the graph, both the CBS and DAS markets compare favourably to the non-economic approach.
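The valuation rule used in this setup can be written out directly; the load-dependent factor below is a placeholder for whatever concrete factor was used in the experiments.

```python
BASE_VALUATION = 10_000  # credits, as stated in the setup

def consumer_valuation(load_factor, vf_deadline):
    """Valuation = base valuation x load-dependent factor x VF_deadline.
    With truthful bidding the consumer's bid equals this valuation."""
    return BASE_VALUATION * load_factor * vf_deadline

# e.g. a consumer from a high-value group (VF_deadline = 9) with load factor 1.2
bid = consumer_valuation(1.2, 9)  # 108000 credits
```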
Fig. 4. Valuation factors for the three consumer groups (left) and the value increase for the CBS and DAS markets compared to the DONE RMS, under varying capacity (right)
Tables 1 and 2 show the results for the CBS market and the DONE RMS, respectively. Standard deviation is shown only for the utilization and value metrics due to space considerations. Although the non-economic approach attains higher utilization levels as a consequence of its preference for smaller workloads, it does not realize as much value for users as the economic approach. The high-value consumer groups are allotted a greater share in the CBS market because of their larger budgetary endowment and valuations. In the non-economic approach, the low-value group is given the largest share because it has the execution window with the least amount of resource competition. Table 1 shows that cost levels
Table 1. Output metrics under varying capacity for the CBS market

Cap.   Util.(%)      Value(%)      ShareI(%)  ShareII(%)  ShareIII(%)  CostI  CostII  CostIII
2000   79.28±1.41    95.54±0.78    37.23      37.60       25.17        1.90   1.59    0.85
1000   81.82±1.08    58.82±0.90    56.96      28.07       14.97        6.25   2.87    0.89
200    83.45±1.40    12.59±0.23    56.72      29.20       14.08        7.02   3.35    1.05

Table 2. Output metrics under varying capacity for the DONE RMS

Cap.   Util.(%)      Value(%)      ShareI(%)  ShareII(%)  ShareIII(%)
2000   83.99±1.30    92.94±2.12    32.29      33.34       34.37
1000   92.40±1.51    48.06±1.62    29.68      31.49       38.84
200    93.08±1.67    9.45±0.69     29.16      29.58       41.26
per unit of workload adjust to the degree of congestion in the system and the budgetary capabilities of the different consumer groups.
5 Summary and Future Work
Economic forms of resource management offer great opportunities for building grids that deliver incentives for provider participation and that try to maximize realized consumer value. There is a need for general purpose simulators with economic support to assist research in this field. We have introduced the Grid Economics Simulator and illustrated its extensibility by describing its architecture and operation and by providing an overview of the different supported RMS’s. We demonstrated the capabilities of GES with a case study highlighting various aspects of the framework. While GES in its current form has proven to be very useful in our research [15,18,19], we are planning for the inclusion of additional features. The first is the inclusion of network abstractions. This is a necessary step for more realistic simulation and to enable planned future research towards bandwidth pricing. In addition, we wish to be able to import traces from workload databases such as the Grid Workloads Archive [20]. This would allow us to use more realistic user and job profiles in simulations.
References 1. Takefusa, A., Matsuoka, S., Nakada, H., Aida, K., Nagashima, U.: Overview of a performance evaluation system for global computing scheduling algorithms. In: Proceedings of HPDC 1999, pp. 97–104. IEEE Computer Society, Los Alamitos (1999) 2. Song, H.J., Liu, X., Jakobsen, D., Bhagwan, R., Zhang, X., Taura, K., Chien, A.: The microgrid: a scientific tool for modeling computational grids. Sci. Program. 8(3), 127–141 (2000)
3. Casanova, H.: Simgrid: a toolkit for the simulation of application scheduling. In: Proceedings of CCGrid 2001, pp. 430–437. IEEE Computer Society, Los Alamitos (2001) 4. Buyya, R.: Economic-based Distributed Resource Management and Scheduling for Grid Computing. PhD thesis, Monash University, Australia (2002) 5. Assun¸ca ˜o, M.A., Buyya, R.: An evaluation of communication demand of auction protocols in grid environments. In: Proceedings of GECON 2006, pp. 24–33. World Scientific, Singapore (2006) 6. Cameron, D.G., Millar, A.P., Nicholson, C., Carvajal-Schiaffino, R., Stockinger, K., Zini, F.: Analysis of Scheduling and Replica Optimisation Strategies for Data Grids Using OptorSim. Journal of Grid Computing 2(1), 57–69 7. Schnizler, B.: Resource Allocation in the Grid; A Market Engineering Approach. PhD thesis, University of Karlsruhe (2007) 8. Sulistio, A., Yeo, C.S., Buyya, R.: A taxonomy of computer-based simulations and its mapping to parallel and distributed systems simulation tools. Softw. Pract. Exper. 34, 653–673 (2004) 9. Takefusa, A., Casanova, H.: A study of deadline scheduling for client-server systems on the computational grid. In: Proceedings of HPDC 2001, pp. 406–415 (2001) 10. Legrand, A., Lerouge, J.: Metasimgrid: Towards realistic scheduling simulation of distributed applications. Technical Report 2002-28, LIP (2002) 11. Das, A., Grosu, D.: Combinatorial auction-based protocols for resource allocation in grids. In: Proceedings of PDSEC 2005, IEEE Computer Society, Los Alamitos (2005) 12. Bykowsky, M.M., Cull, R.J., Ledyard, J.O.: Mutually destructive bidding: The fcc auction design problem. Journal of Regulatory Economics 17(3), 205–228 (2000) 13. Depoorter, W.: Establishment of agency as an effective market based resource allocation method using ges. Master’s thesis, University of Antwerp (2007) 14. Vanmechelen, K., Broeckhove, J.: A comparative analysis of single-unit vickrey auctions and commodity markets for realizing grid economies with dynamic pricing. In: Veit, D.J., Altmann, J. (eds.) GECON 2007. LNCS, vol. 4685, pp. 98–111. Springer, Heidelberg (2007) 15. Stuer, G., Vanmechelen, K., Broeckhove, J.: A commodity market algorithm for pricing substitutable grid resources. Fut. Gen. Comput. Syst. 23(5), 688–701 (2007) 16. Smale, S.: A convergent process of price adjustment and global newton methods. Journal of Mathematical Economics 3(2), 107–120 (1976) 17. Feldman, M., Lai, K., Zhang, L.: A price-anticipating resource allocation mechanism for distributed shared clusters. In: Proceedings of EC 2005, British Columbia, ACM Press, New York (2005) 18. Vanmechelen, K., Depoorter, W., Broeckhove, J.: Economic grid resource management for CPU bound applications with hard deadlines. In: Proceedings of CCGrid 2008, IEEE Computer Society, Los Alamitos (in press, 2008) 19. Vanmechelen, K., Stuer, G., Broeckhove, J.: Pricing substitutable grid resources using commodity market models. In: Proceedings of GECON 2006, pp. 103–112. World Scientific, Singapore (2006) 20. Iosup, A., Li, H., Jan, M., Anoep, S., Dumitrescu, C., Wolters, L., Epema, D.H.J.: The Grid Workloads Archive. FGCS (submitted, 2007)
Improving Metaheuristics for Mapping Independent Tasks into Heterogeneous Memory-Constrained Systems Javier Cuenca1 and Domingo Giménez2
1 Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, Spain [email protected] 2 Departamento de Informática y Sistemas, Universidad de Murcia, 30071 Murcia, Spain [email protected]
Abstract. This paper shows different strategies for improving some metaheuristics for the solution of a task mapping problem. Independent tasks with different computational costs and memory requirements are scheduled in a heterogeneous system with computational heterogeneity and memory constraints. The tuned methods proposed in this work could be used for optimizing realistic systems, such as scheduling independent processes onto a processors farm. Keywords: processes mapping, metaheuristics, heterogeneous systems.
1 Introduction
In this work the problem of mapping independent tasks to the processors in a heterogeneous system is considered. The tasks are generated by a processor and sent to other processors which solve them and return the solutions to the initial one. So, a master-slave scheme is used. The master-slave scheme is one of the most popular parallel algorithmic schemes [1], [2]. There are publications about optimal mapping master-slave schemes in parallel systems [3], [4], [5], but in those works the optimal mappings are obtained only under certain restrictions, and memory constraints are not considered. In our approach each task has a computational cost and a memory requirement. The processors in the system have different speeds and a certain amount of memory, which imposes a restriction on the tasks which it can be assigned. The goal is to obtain a task mapping which leads to a low total execution time. To obtain the optimum mapping in the general case is an NP problem [6], and heuristic methods may be preferable. In our previous work [7], the basic scheduling problem was explained together with some possible variants. To solve them, different metaheuristics (Genetic Algorithm, Scatter Search, Tabu Search and
This work has been partially supported by the Consejería de Educación de la Región de Murcia, Fundación Séneca 02973/PI/05.
GRASP) [8], [9] were proposed. In this work these metaheuristics are improved in different ways in order to reduce the time to perform the task mapping and to obtain a better solution. The paper is organized in the following way: in section 2 the basic scheduling problem is explained; in section 3 some metaheuristics for the solution of the proposed scheduling problem are analysed; and, finally, section 4 summarizes the conclusions and outlines future research.
2 Scheduling Problem
Of the different scheduling problems introduced in our previous work [7], this paper studies, as an example, the problem with fixed arithmetic costs and no communications in depth. In this problem, given $t$ tasks, with arithmetic costs $c = (c_0, c_1, \ldots, c_{t-1})$ and memory requirements $i = (i_0, i_1, \ldots, i_{t-1})$, and $p$ processors with the times to perform a basic arithmetic operation $a = (a_0, a_1, \ldots, a_{p-1})$, and memory capacities $m = (m_0, m_1, \ldots, m_{p-1})$, from all the mappings of tasks to the processors, $d = (d_0, d_1, \ldots, d_{t-1})$ ($d_k = j$ means task $k$ is assigned to processor $j$), with $i_k \le m_{d_k}$, find $d$ with which the following minimum is obtained:

$$\min_{\{d \,/\, i_k \le m_{d_k}\ \forall k=0,1,\ldots,t-1\}} \ \left\{ \max_{j=0,1,\ldots,p-1} \left\{ a_j \sum_{l=0,1,\ldots,t-1;\ d_l=j} c_l \right\} \right\} \qquad (1)$$

where the minimum of the mapping times which satisfy the memory constraints is obtained, and for each mapping the time is that of the processor which takes most time in the solution of the tasks it has been assigned. There is a maximum of $p^t$ assignations (with the memory constraints the number of possibilities may decrease), and it is not possible to solve the problem in a reasonable time by generating all the possible mappings. An alternative is to obtain an approximate solution using some heuristic method. This possibility is considered in this paper.
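Objective (1) can be evaluated directly for a candidate mapping; the following sketch (ours, not the authors' code) is also the natural fitness function for the metaheuristics discussed next.

```python
def mapping_time(d, c, mem_req, a, m):
    """Modelled execution time of mapping d (d[k] = processor of task k).

    c[k]: arithmetic cost of task k, mem_req[k]: its memory requirement,
    a[j]: time per basic operation on processor j, m[j]: its memory capacity.
    Returns None if the mapping violates a memory constraint.
    """
    p = len(a)
    if any(mem_req[k] > m[d[k]] for k in range(len(c))):
        return None
    load = [0.0] * p
    for k, proc in enumerate(d):
        load[proc] += c[k]
    return max(a[j] * load[j] for j in range(p))

# Example: 3 tasks on 2 processors
print(mapping_time([0, 1, 1], c=[8, 4, 4], mem_req=[2, 1, 1], a=[0.5, 0.25], m=[4, 2]))
# -> 4.0 (processor 0: 0.5 * 8 = 4.0; processor 1: 0.25 * 8 = 2.0)
```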
3 Application of Metaheuristics to the Scheduling Problem
In this section the application of metaheuristic methods to the version of the scheduling problem previously described is analysed. The methods considered are: Genetic Algorithm (GA), Scatter Search (SS), Tabu Search (TS) and GRASP (GR). The four metaheuristics are analysed from the same perspective, identifying common routines and element representations. The goal is to obtain a mapping with an associated modelled time close to the optimum, but with a low assignation time, because this time is added to the execution time of the routine. A general metaheuristic scheme is considered [10]. One such scheme is shown in algorithm 1. Each of the functions that appears in that scheme works in a different way depending on the metaheuristic chosen:
Algorithm 1. General scheme of a metaheuristic method.

    Initialize(S);
    while not EndCondition(S) do
        SS = ObtainSubset(S);
        if |SS| > 1 then
            SS1 = Combine(SS);
        else
            SS1 = SS;
        end
        SS2 = Improve(SS1);
        S = IncludeSolutions(SS2);
    end
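The same scheme in executable form, with the five routines passed in as callables so that each metaheuristic can specialize them (a sketch of Algorithm 1, not the authors' implementation):

```python
def metaheuristic(initialize, end_condition, obtain_subset,
                  combine, improve, include_solutions):
    """Generic driver mirroring Algorithm 1. Each argument is a callable
    implementing one of the routines described below; include_solutions
    also receives the current set S, which Algorithm 1 leaves implicit."""
    S = initialize()
    while not end_condition(S):
        SS = obtain_subset(S)
        SS1 = combine(SS) if len(SS) > 1 else SS
        SS2 = improve(SS1)
        S = include_solutions(S, SS2)
    return S
```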
– Initialize. To create each individual of the initial set S, this function assigns tasks to processors with a probability proportional to the processor speed.
  • GA works with a large initial population of assignations.
  • SS works with a reduced number of elements in S. This could produce a lower time for this method than that of the GA.
  • TS works with a set S with only one element.
  • GR: In each iteration the cost of each candidate is evaluated, and a number of candidates are selected to be included in the set of solutions.
– ObtainSubset: In this function some of the individuals are selected randomly.
  • GA: The individuals with better fitness function (equation 1) have more likelihood of being selected.
  • SS: It is possible to select all the elements for combination, or to select the best elements (those with better fitness function) to be combined with the worst ones.
  • TS: This function is not necessary because |S| = 1.
  • GR: One element from the set of solutions is selected to constitute the set SS (|SS| = 1).
– Combine: In this function the selected individuals are crossed, and SS1 is obtained.
  • GA, SS: The individuals can be crossed in different ways. One possibility is to cross pairs of individuals by exchanging half of the mappings, obtaining two descendants.
  • TS, GR: This function is not necessary.
– Improve:
  • GA: A few individuals are selected to obtain other individuals, which can differ greatly. This process is done by using mutation operators. The aim is to diversify the population to avoid falling into local optima.
  • SS: This function consists of a greedy method which works by evaluating the fitness value of the elements obtained with the p possible processors (with memory constraints) in each component, in order to search for a better element in its neighborhood.
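As an illustration of the greedy Improve step used by SS (and reused by GR), the following sketch repeatedly tries to move one task away from the processor that currently determines the modelled time; the fitness argument is a function such as the mapping-time evaluation sketched earlier. This is our rendering, not the original code.

```python
def greedy_improve(d, c, mem_req, a, m, fitness):
    """d: mapping as a list (task k -> processor d[k]), assumed feasible.
    Reassign one task away from the processor with the highest completion
    time whenever that lowers the fitness (modelled time)."""
    improved = True
    while improved:
        improved = False
        current = fitness(d)
        # Processor that takes the most time under mapping d.
        loads = {}
        for k, proc in enumerate(d):
            loads[proc] = loads.get(proc, 0.0) + c[k]
        worst = max(loads, key=lambda j: a[j] * loads[j])
        for k, proc in enumerate(d):
            if proc != worst:
                continue
            for j in range(len(a)):
                if j == worst or mem_req[k] > m[j]:
                    continue  # respect the memory constraints
                candidate = d[:k] + [j] + d[k + 1:]
                if fitness(candidate) < current:
                    d, improved = candidate, True
                    break
            if improved:
                break
    return d
```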
  • TS: Some elements in the neighborhood of the current element are analysed, excluding those in a list of previously analysed tabu elements.
  • GR: This function consists of a local search to improve the element selected. Some greedy method can be used, or all the elements in the neighborhood of the selected one can be analysed.
– IncludeSolutions: This function selects some elements of SS2 to be included in S for the next iteration.
  • GA: The best individuals from the original set, their descendants and the individuals obtained by mutation are included in the next S.
  • SS: The best elements are selected, as well as some elements which are scattered with respect to them, to avoid falling into local minima.
  • TS, GR: The best element from those analysed is taken as the next solution.
– EndCondition:
  • GA, SS, TS, GR: The convergence criterion could be a maximum number of iterations, or that the best fitness value from the individuals in the population does not change over a number of iterations.
3.1 Basic Experimental Tuning of the Metaheuristics
Experiments with different tasks and systems configurations have been carried out, obtaining similar results. The experiments, whose results are shown beyond, has the following configuration: The size of each task has been randomly generated between 1000 and 2000, the arithmetic cost is n3 , and the memory requirement n2 . The number of processors in the system is the same as the number of tasks. The costs of basic arithmetic operations has been randomly generated between 0.1 and 0.2 μsecs. The memory of each processor is between half the memory needed by the biggest task and one and a half times this memory. Preliminary results for the proposed problem in section 2 have been obtained using the following parameter values, whereas with other close values the results would be similar. – GA: • Initialize: The population has 80 elements; the elements in S are initially generated randomly assigning the tasks to the processors, with the probability proportional to the processor speed. • Combine: Each pair of elements is combined with half of the components of each parent; in each combination the best parent and the best descendant are included in the population. • Improve: the probability of mutation is 1/5. • EndCondition: the maximum number of iterations is 800, and the maximum number of iterations without improving the optimum solution is 80. – SS: • Initialize: S has 20 elements. The initialization is that in GA.
• Combine: The combination is that in GA. • Improve: Each element is improved with a greedy method, which works by selecting for the processor with highest execution time a task which could be assigned to another processor reducing the fitness function (equation 1). • IncludeSolutions: The elements with lowest cost function and those most scattered with respect to the best ones (using a 1-norm) are included in the reference set. • EndCondition: The maximum number of iterations is 400, and the maximum number of iterations without improving the optimum solution is 40. – TS: • Improve: The neighborhood has 10 elements, obtained by taking the tasks assigned to the processor with most cost and reassigning them to other processors. The tabu list has 10 elements. • EndCondition: The maximum number of iterations is 200, and the maximum number of iterations without improving the solution is 20. – GR: • Initialize: The initial set has 20 elements. The elements are generated as in GA and SS. • ObtainSubset: The element selected from S is chosen randomly, with more probability for the elements with better fitness function (equation 1). • Improve: The element is improved with the greedy method used in SS. • EndCondition: The number of iterations is 20. Table 1 compares the mapping time and the simulated time obtained, in a PC Centrino 2.0 GHz., with each of the heuristics, and those with a backtracking, for those problem sizes where the backtracking obtains a solution using a reasonable mapping time. Those cases where the corresponding method does not obtain the optimal solution are in bold. In almost all the cases the metaheuristics provide the best solution and use less time than a backtracking. Table 1. Comparison of backtracking and the metaheuristics. Mapping time and modelled execution time (in seconds), varying the number of tasks.
        Back             GA               SS               TS               GR
tasks   map.    simul.   map.    simul.   map.    simul.   map.    simul.   map.    simul.
4       0.025   3132     0.051   3132     0.065   3132     0.010   3132     0.019   3132
8       0.034   4731     0.028   4731     0.132   4731     0.015   4731     0.024   4731
12      0.058   1923     0.021   1923     0.158   1923     0.016   2256     0.029   1923
13      0.132   1278     0.055   1278     0.159   1278     0.016   1376     0.024   1278
14      0.791   1124     0.081   1124     0.192   1124     0.017   1124     0.027   1135
For big systems and using the different heuristics, satisfactory mappings are obtained in a reduced time. In Table 2 the mapping and the simulated times for big systems are shown. Those cases where the best solution of modelled time is obtained for each problem size appear in bold. GA and SS are the methods that need more mapping time to obtain a good solution with the parameters considered. GR and TS use much less time and obtain the best solution for almost all the cases. TS needs less time than GR, but its solutions are not always as good. Therefore, GR is the method which behaves best. Following these results with the preliminary tunings, a deeper study on how to improve those metaheuristics is now underway. For example, the next subsection shows how advanced tunings can be applied to the Genetic Algorithm. Table 2. Comparison of the metaheuristics for big systems. Mapping time and modelled execution time (in seconds), varying the number of tasks.
        GA               SS               TS               GR
tasks   map.    simul.   map.    simul.   map.    simul.   map.    simul.
25      0.139   1484     0.259   1450     0.010   1450     0.045   1450
50      0.413   1566     0.429   1900     0.015   1757     0.078   1524
100     0.592   1903     0.834   1961     0.022   3018     0.158   1460
200     0.825   3452     1.540   3452     0.079   3452     0.293   3452
400     3.203   3069     2.682   3910     0.375   3069     0.698   3069

3.2 Advanced Tuning of the Genetic Algorithm
Various tuning possibilities have been studied in order to improve the GA method. The most significant are:
– In the routine Combine:
  • T1. It is possible to change the heredity method. Instead of a descendant inheriting strictly each half of its components from each parent, each component is inherited pseudo-randomly, giving more probability to the parent with the best fitness value (the fitness value of a solution is the modelled execution time of the processor that needs more time to finish its assigned tasks) (equation 1).
  • T2. Another possibility of changing the heredity method consists of choosing each component of a descendant from the less loaded of the two processors proposed by its parents. The load of a processor $r$, $W_r$, is the product of the cost of performing an arithmetic operation in $r$ and the sum of the costs of the tasks assigned to $r$:

$$W_r = a_r \sum_{\{l=0,1,\ldots,t-1;\ d_l = r\}} c_l \qquad (2)$$

In other words, if for the $i$-th component the task is assigned in parent A to processor $r$, which has a load of $W_r$, and in parent B to processor $q$, which has a load of $W_q$, then in the descendant the component $i$ will be $r$ if $W_r < W_q$, or $q$ otherwise.
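A sketch of the T2 combination rule under the same representation (a mapping is a list d with d[k] the processor of task k); this is our illustration, not the authors' implementation.

```python
def combine_t2(parent_a, parent_b, c, a):
    """Each gene is taken from whichever parent assigns that task to the
    less loaded processor, with loads W_r computed as in equation (2)."""
    def loads(d):
        w = [0.0] * len(a)
        for k, proc in enumerate(d):
            w[proc] += c[k]
        return [a[j] * w[j] for j in range(len(a))]

    load_a, load_b = loads(parent_a), loads(parent_b)
    child = []
    for k in range(len(parent_a)):
        r, q = parent_a[k], parent_b[k]   # processors proposed by each parent
        child.append(r if load_a[r] < load_b[q] else q)
    return child
```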
– T3. In the routine Improve it is possible to introduce a hybrid approach, using a steered mutation instead of a pure mutation. In the solution to be improved, each task assigned to an overloaded processor (a processor is overloaded if its load (equation 2) is greater than the average load of all the processors) is reassigned randomly to another processor. Therefore, this routine mutates the solution to another where the total loads of the most overloaded processors have been reduced.
– T4. In the routine ObtainSubset, where the solutions that will be combined are chosen, it is possible to choose these solutions pseudo-randomly, giving more probability to the solutions with better fitness.
In the first column of Table 3 the times obtained with the base case (the original GA) on a PC Centrino 2.8 GHz are shown for different numbers of tasks. In the second column, the times obtained when the T1 tuning is applied are shown. The solutions are obtained more quickly than in the base case (less mapping time), but these solutions are worse than the previous ones (more simulated execution time). This could be because T1 is a greedy tuning that leads the algorithm to a local minimum. In the third column, the times obtained when the T2 tuning is applied are shown. Now, the time to obtain the solutions is very similar to the base case, but the solutions for some of the problems are better. So this tuning could be an interesting improvement. In the fourth column the times obtained when the T3 tuning is applied to the routine Improve are shown. The solutions are worse than in the base case, that is, a steered mutation does not work as well as we thought (a deeper study is given below). In the fifth column, the times obtained when the T4 tuning is applied to the routine ObtainSubset are shown. The times are better in some cases. It converges faster than the base case, using less mapping time. Since the improvements T2, T3 and T4 seem interesting and they affect different parts of the algorithm, it could be appropriate to combine them. In this way in the sixth and seventh columns of Table 3 the results when using T2 with the other tunings are shown. Combining T2 with T3 or T4 the results do not improve those obtained only with T2 and they need more mapping time. Therefore, it is better to apply just T2. Finally, combining T3 and T4 the results do not improve any of them.

Table 3. Comparison of the different tunings applied to the Genetic Algorithm, varying the number of tasks

        basic GA        T1              T2              T3              T4              T2+T3           T2+T4           T3+T4
tasks   map.   simul.   map.   simul.   map.   simul.   map.   simul.   map.   simul.   map.   simul.   map.   simul.   map.   simul.
50      0.13   1646     0.02   2277     0.05   1524     0.08   1715     0.09   1715     0.05   1524     0.06   1524     0.08   1715
100     0.25   2068     0.09   2581     0.13   1460     0.14   2230     0.25   2000     0.17   1460     0.16   1460     0.14   2230
150     0.47   2422     0.19   2908     0.19   2039     0.25   2464     0.36   2418     0.22   2039     0.22   2039     0.25   2464
200     0.41   3452     0.28   3717     0.31   3452     0.31   3452     0.33   3452     0.34   3452     0.34   3452     0.33   3452
400     1.56   3069     1.19   4184     1.19   3069     1.67   3069     1.42   3069     1.20   3069     1.25   3069     1.72   3069
1600    12.10  3680     10.50  4061     11.77  1735     11.38  3882     12.08  3482     12.56  1735     11.28  1735     12.09  3882
In order to better understand the behavior of the algorithm with the different tunings, Figs. 1, 2 and 3 show the evolution of the best solution from the newly generated individuals per iteration, in each case, along all the iterations, for the problem of mapping 1600 tasks.
Fig. 1. Evolution of the best solution from the new generated individuals per iteration for a problem size of 1600 tasks. Without tuning (T0) applied to the routine Combine, with T1 and with T2.
Regarding the routine Combine (Fig. 1), if the tuning T1 is applied, the restriction of inheriting each component from the best parent confers a more greedy tendency on the algorithm. It falls into local minima, with worse solutions than in the base case, from which it can seldom exit. However, with the tuning T2 each component of a descendant can come from any of the parents, so a bigger
Fig. 2. Evolution of the best solution from the new generated individuals per iteration for a problem size of 1600 tasks. Without tuning (T0) applied to the routine Improve and with T3.
Fig. 3. Evolution of the best solution from the new generated individuals per iteration for a problem size of 1600 tasks. Without tuning (T0) applied to the routine ObtainSubset and with T4.
mixture of genetic code is produced, causing more diversity of descendants and so allowing the algorithm to exit from local minima easily. The tendency, from the first iteration, is to improve the best solution because the most overloaded processors are unloaded in each step. In the routine Improve (Fig. 2), with the tuning T3 the mutation operation is steered towards better solutions quickly, but this kind of mutation prevents the genetic code of the descendant from differing a lot from those of the parents. In this way, if the algorithm falls into a local minimum it is very difficult to get out of it, because it does not have a pure mutation. If the tuning T4 is applied to the routine ObtainSubset (Fig. 3), the algorithm progresses slowly but surely, because in each iteration only the best solutions are chosen to have descendants and few false moves are made.
4 Conclusions and Future Works
The paper presents some improvements on previous proposals for the application of metaheuristic techniques to task-to-processor mapping problems, where the tasks are independent and have various computational costs and memory requirements, and the computational system is heterogeneous in computation and with different memory capacities (communications are not yet considered). The metaheuristics considered have been: the Genetic Algorithm, which is a global search method; Scatter Search, also a global search method, but with improvement phases; Tabu Search, a local search method with the search guided by historic information; and the GRASP method, a multiple local search method. The
parameters and the routines have been tuned and the experiments to obtain satisfactory versions of the metaheuristics have been carried out, mainly with the Genetics Algorithm where some detailed tuning techniques have been studied. In future works advanced tunings, like those applied to the Genetic Algorithm in this work, will be applied to the other metaheuristics. On the other hand, different characteristics of the heterogeneous systems will be considered: variable arithmetic cost in each processor depending on the problem size, variable communication cost in each link,... Other general approximations (dynamic assignation of tasks, adaptive metaheuristics,...) will also be studied. The tuned methods proposed in this work will be used for optimizing realistic systems, such as scheduling independent processes or mapping MPI jobs onto a processors farm.
References 1. Wilkinson, B., Allen, M.: Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 2nd edn. Prentice-Hall, Englewood Cliffs (2005) 2. Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, 2nd edn. Addison-Wesley, Reading (2003) 3. Banino, C., Beaumont, O., Legrand, A., Robert, Y.: Sheduling strategies for master-slave tasking on heterogeneous processor grids. In: Fagerholm, J., Haataja, J., J¨ arvinen, J., Lyly, M., R˚ aback, P., Savolainen, V. (eds.) PARA 2002. LNCS, vol. 2367, pp. 423–432. Springer, Heidelberg (2002) 4. Pinau, J.F., Robert, Y., Vivien, F.: Off-line and on-line scheduling on heterogeneous master-slave platforms. In: 14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2006), pp. 439–446 (2006) 5. Brucker, P.: Scheduling Algorithms, 1st edn. Springer, Heidelberg (2007) 6. Lennerstad, H., Lundberg, L.: Optimal scheduling results for parallel computing. SIAM News, 16–18 (1994) 7. Cuenca, J., Gim´enez, D., L´ opez, J.J., Mart´ınez-Gallary, J.P.: A proposal of metaheuristics to schedule independent tasks in heterogeneous memory-constrained systems. In: CLUSTER (2007) 8. Hromkovic, J.: Algorithmics for Hard Problems, 2nd edn. Springer, Heidelberg (2003) 9. Dr´eo, J., P´etrowski, A., Siarry, P., Taillard, E.: Metaheuristics for Hard Optimization. Springer, Heidelberg (2005) 10. Raidl, G.R.: A unified view on hybrid metaheuristics. In: Almeida, F., Blesa Aguilera, M.J., Blum, C., Moreno Vega, J.M., P´erez P´erez, M., Roli, A., Sampels, M. (eds.) HM 2006. LNCS, vol. 4030, pp. 1–12. Springer, Heidelberg (2006)
A2 DLT: Divisible Load Balancing Model for Scheduling Communication-Intensive Grid Applications M. Othman , M. Abdullah, H. Ibrahim, and S. Subramaniam Department of Communication Technology and Network, University Putra Malaysia, 43400 UPM Serdang, Selangor D.E., Malaysia [email protected], [email protected]
Abstract. Scheduling an application in a data grid is significantly complex and very challenging because of the heterogeneous nature of the grid system. Divisible Load Theory (DLT) is a powerful model for modelling data-intensive grid problems where both communication and computation loads are partitionable. This paper presents a new divisible load balancing model known as adaptive ADLT (A2DLT) for scheduling communication-intensive grid applications. This model reduces the maximum completion time (makespan) as compared to the ADLT and Constraint DLT (CDLT) models. Experimental results showed that the model can balance the load efficiently, especially when communication-intensive applications are considered. Keywords: Divisible Load Theory, Data Grid, Load Balancing.
1 Introduction
In a data grid environment, many large-scale scientific experiments and simulations generate very large amounts of data in distributed storage, spanning thousands of files and data sets [1]. Due to the heterogeneous nature of the grid system, scheduling applications in such an environment, either data- or communication-intensive, is significantly complex and challenging. Grid scheduling is defined as the process of making scheduling decisions involving allocating jobs to resources over multiple administrative domains [2]. The DLT has emerged as a powerful model for modelling data-intensive grid problems [3]. The DLT model exploits the parallelism of a divisible application, which is continuously divisible into parts of arbitrary size, by scheduling the loads of a single source onto multiple computing resources. Load scheduling in a data grid has been addressed using the DLT model with the additional constraint that each worker node receives the same load fraction from each data source [4]. Most of the previous models do not take into account the communication time. In
The author is also an associate researcher at the Lab of Computational Science and Informatics, Institute of Mathematical Research (INSPEM), University Putra Malaysia.
order to achieve high performance, we must consider both communication and computation times [5,6]. In [7], CDLT is used for scheduling decomposable data-intensive applications and the results are compared with those of a genetic algorithm. The same constraint that was suggested in [4] is tested: each worker node receives the same load fraction from each data source. They considered the communication time, but not in dividing the load: first they divided the load using the DLT model and then added the communication time to the makespan. Later, the ADLT model was proposed for scheduling such applications and compared with the CDLT model, giving better performance [8]. In this paper, the A2DLT model is proposed as an improvement of the ADLT model. The objective of the model is to distribute loads over all sites in such a way as to achieve an optimal makespan for large-scale jobs.
2 Scheduling Model
In [5], the target data-intensive application model can be decomposed into multiple independent subtasks and executed in parallel across multiple sites without any interaction among subtasks. Let us consider job decomposition by decomposing input data objects into multiple smaller data objects of arbitrary size and processing them on multiple virtual sites. High Energy Physics (HEP) jobs are arbitrarily divisible at event granularity and at intermediate data product processing granularity [1]. In this research we assume that a job requires a very large logical input data set (D) consisting of N physical datasets, and that each physical dataset (of size Lk) resides at a data source (DSk, for all k = 1, 2, . . . , N) of a particular site. Fig. 1 shows how the logical input data (D) is decomposed onto networks and their computing resources. The scheduling problem is to decompose D into datasets (Di for all i = 1, 2, . . . , M) across M virtual sites in a Virtual Organization (VO) given its initial physical decomposition. Again, we assume that the decomposed data can be analyzed on any site.
2.1 Notations and Definitions
All notations and their definitions used throughout this paper are shown in Table 1. 2.2
Cost Model
The execution time cost ($T_i$) of a subtask allocated to site $i$ and the turn around time ($T_{Turn\ Around\ Time}$) of a job can be expressed as

$$T_i = T_{input\_cm}(i) + T_{cp}(i) + T_{output\_cm}(i, d)$$

and
Fig. 1. Data decomposition and their processing

Table 1. Notation and Definition

Notation   Definition
M          The total number of nodes in the system
N          The total number of data files in the system
Li         The loads in data file i
Lij        The loads that node j will receive from data file i
L          The sum of loads in the system, where $L = \sum_{i=1}^{N} L_i$
αij        The amount of load that node j will receive from data file i
αj         The fraction of L that node j will receive from all data files
wj         The inverse of the computing speed of node j
Zij        The link between node i and data source j
Ti         The processing time in node i
$$T_{Turn\ Around\ Time} = \max_{i=1}^{M}\{T_i\},$$

respectively. The input data transfer ($T_{input\_cm}(i)$), computation ($T_{cp}(i)$), and output data transfer to the client at the destination site $d$ ($T_{output\_cm}(i, d)$) are presented as $\max_{k=1}^{N}\{l_{ki}/Z_{ki}\}$, $d_i \cdot w_i \cdot ccRatio$ and $f(d_i)/Z_{id}$, respectively. $Z_{ij}$ is the network bandwidth between site $i$ and $j$, $w_i$ is the computing time to process a unit dataset of size 1 MB at site $i$, the function $f(d_i)$ is the output data size and $ccRatio$ is the non-zero ratio of computation to communication. The turn around time of an application is the maximum among all the execution times of the subtasks. The problem of scheduling a divisible job onto $M$ sites can be stated as deciding the portion of the original workload ($D$) to be allocated to each site, that is, finding a distribution of $l_{ki}$ which minimizes the turn around time of the job.
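Under the reading of the cost model given above (transfer time as size divided by bandwidth, and $d_i = \sum_k l_{ki}$ — both assumptions on our part), the turnaround time of a candidate distribution can be computed as follows:

```python
def turnaround_time(l, Z, Z_out, w, f_out, cc_ratio):
    """l[k][i]: load site i receives from source k; Z[k][i]: bandwidth source k -> site i;
    Z_out[i]: bandwidth from site i to the destination d; w[i]: time per unit data at site i;
    f_out: output size as a function of input size. Assumes d_i = sum_k l[k][i]."""
    N, M = len(l), len(w)
    times = []
    for i in range(M):
        d_i = sum(l[k][i] for k in range(N))
        t_in = max(l[k][i] / Z[k][i] for k in range(N))  # slowest input transfer
        t_cp = d_i * w[i] * cc_ratio                     # computation
        t_out = f_out(d_i) / Z_out[i]                    # output transfer to the client
        times.append(t_in + t_cp + t_out)
    return max(times)                                    # turnaround = slowest site
```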
The proposed model uses this cost model when evaluating solutions at each generation.
3 ADLT Scheduling Model
In all the literature related to divisible load scheduling, an optimality criterion [6] is used to derive an optimal solution. In order to obtain an optimal makespan, it is necessary and sufficient that all the sites that participate in the computation complete at the same time. Otherwise, load could be redistributed among the sites and this would improve the processing time. This optimality principle is used in the design of the load distribution strategy. The communication time fraction is added into the ADLT model, and the final fractions of the model are shown below:

$$CM_{i,j} = \frac{\frac{1}{w_j}}{\sum_{x=1}^{M}\frac{1}{w_x}} + \frac{\frac{1}{Z_{i,j}}}{\sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}} \qquad (1)$$

$$\alpha_j = \frac{CM_{i,j}}{\sum_{i=1}^{N}\sum_{j=1}^{M} CM_{i,j}} \qquad (2)$$

and

$$\alpha_{i,j} = \frac{CM_{i,j}}{\sum_{i=1}^{N}\sum_{j=1}^{M} CM_{i,j}}\, L_i. \qquad (3)$$
Details of this model and their derivation can be found in [8].
4 Proposed A2DLT Model
In the ADLT model, the fraction equations (1), (2) and (3) are taken separately for each source, see [8]. In addition, the node speed and link speed fractions are also taken separately, which yields the node speed fraction

$$\frac{\frac{1}{w_j}}{\sum_{x=1}^{M}\frac{1}{w_x}} \qquad (4)$$

while the link speed fraction for each link is given as

$$\frac{\frac{1}{Z_{i,j}}}{\sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}}. \qquad (5)$$

Again, the summations of these fractions are also taken separately at each source. The loads are divided using these fractions and finally the makespan is calculated. In the proposed model, we must balance the load of the whole system (that is, all sources). In other words, the node speed and link speed fractions are now normalized together over the whole system, which yields the node speed and link speed fractions

$$\frac{\frac{1}{w_j}}{\left(\sum_{x=1}^{M}\frac{1}{w_x}\right) + \sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}} \qquad (6)$$

and

$$\frac{\frac{1}{Z_{i,j}}}{\left(\sum_{x=1}^{M}\frac{1}{w_x}\right) + \sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}}, \qquad (7)$$

respectively. Finally, the new fraction is given as

$$CM_{i,j} = \frac{\frac{1}{w_j}}{\left(\sum_{x=1}^{M}\frac{1}{w_x}\right) + \sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}} + \frac{\frac{1}{Z_{i,j}}}{\left(\sum_{x=1}^{M}\frac{1}{w_x}\right) + \sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}}, \qquad (8)$$

$$\alpha_j = \frac{CM_{i,j}}{\sum_{i=1}^{N}\sum_{j=1}^{M} CM_{i,j}}, \qquad (9)$$

and

$$\alpha_{i,j} = \frac{CM_{i,j}}{\sum_{i=1}^{N}\sum_{j=1}^{M} CM_{i,j}}\, L_i. \qquad (10)$$
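Equations (8)–(10), as reconstructed above, translate directly into a short routine that computes how much load each node receives from each source (a sketch under our reading of the model, not the authors' code):

```python
def a2dlt_fractions(w, Z, L):
    """w[j]: inverse computing speed of node j; Z[i][j]: bandwidth of the link
    between data source i and node j; L[i]: load held by data source i.
    Returns alpha[i][j], the load node j receives from source i (eq. 10)."""
    M, N = len(w), len(Z)
    denom = sum(1.0 / w[j] for j in range(M)) + \
            sum(1.0 / Z[x][y] for x in range(N) for y in range(M))
    CM = [[(1.0 / w[j] + 1.0 / Z[i][j]) / denom for j in range(M)]
          for i in range(N)]
    total = sum(CM[i][j] for i in range(N) for j in range(M))
    return [[CM[i][j] / total * L[i] for j in range(M)] for i in range(N)]
```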
5 Numerical Experiments
To measure the performance of the proposed A2 DLT model against the previous models, randomly generated experimental configurations are used, see [7,8]. The network bandwidth between sites is uniformly distributed between 1Mbps and 10Mbps. The location of n data sources (DSk ) is randomly selected and each physical dataset size (Lk ) is randomly selected with a uniform distribution in the range of 1GB to 1TB. We assumed that the computing time spent in a site i to process a unit dataset of size 1MB is uniformly distributed in the range 1/rcb to 10/rcb seconds where rcb is the ratio of computation speed to communication speed. We examined the overall performance of each model by running them under 100 randomly generated Grid configurations. We varied the parameters, ccRatio (0.001 to 1000), M (20 to 100), N (20 to 100), rcb (10 to 500) and data file size (1 GB to 1 TB). When both the number of nodes and the number of data files are 50, the results are collected and shown in Fig. 2. The results showed that the makespan of the proposed model is better than the other models, especially when the ccRatio is less than 1 (communicationintensive applications). Thus, the proposed model balances the load among the nodes more efficiently. From Table 2, the results show that the A2 DLT is 34% better than CDLT in terms of makespan. While the A2 DLT is better than ADLT by 25%. These results showed that A2 DLT is the best among CDLT and ADLT models.
Fig. 2. Makespan for A2 DLT, ADLT and CDLT models (N =50, M =50 and ccRatio=0.001 to 1000)
Table 2. Percentage makespan improvements of A2DLT against the CDLT and ADLT models

ccRatio    CDLT (%)   ADLT (%)
0.001      49         49
0.01       53         43
0.1        30         20
1          5          -13
Average    34         25
Fig. 3. Makespan vs. Data file Size for A2 DLT, ADLT and CDLT models (N =100, M =100 and ccRatio=0.001)
When we compare the A2DLT model to CDLT and ADLT for different data file sizes, the A2DLT model produces better results as the data file size increases. The result is shown in Fig. 3. The impact of the ratio of output data size to input data size is also shown in Fig. 4. The A2DLT model performs better for communication-intensive applications that generate small output data compared to the input data size (low oiRatio). For computation-intensive applications, the ratio of output data size to input data size does not affect the performance of the algorithms much, except when ccRatio is 1000.
Fig. 4. The impact of output data size to input data size: (a) oiRatio > 0.5, (b) oiRatio = 0: no output or small output size
6 Conclusion
Previously, the ADLT model reduced the makespan for scheduling divisible load applications. In this paper, an improved version of the ADLT model, known as the A2DLT model, is proposed. The new model reduces the makespan and balances the load better than the ADLT model, especially for communication-intensive applications. The experimental results showed that the A2DLT model improves the makespan by an average of 34% and 25% compared to the CDLT and ADLT models, respectively. With such an improvement, the proposed model can be integrated into existing data grid schedulers in order to improve their performance.
References 1. Jaechun, N., Hyoungwoo, P.: GEDAS: A Data Management System for Data Grid Environments. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3514, pp. 485–492. Springer, Heidelberg (2005)
2. Venugopal, S., Buyya, R., Ramamohanarao, K.: A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing. ACM Computing Surveys 38(1), 1–53 (2006) 3. Robertazzi, T.G.: Ten Reasons to Use Divisible Load Theory. IEEE Computer 36(5), 63–68 (2003) 4. Wong, H.M., Veeravalli, B., Dantong, Y., Robertazzi, T.G.: Data Intensive Grid Scheduling: Multiple Sources with Capacity Constraints. In: Proceeding of the IASTED Conference on Parallel and Distributed Computing and Systems, Marina del Rey, USA (2003) 5. Mequanint, M.: Modeling and Performance Analysis of Arbitrarily Divisible Loads for Sensor and Grid Networks. PhD Thesis. Dept. Electrical and Computer Engineering, Stony Brook University, New York USA (2005) 6. Bharadwaj, V., Ghose, D., Robertazzi, T.G.: Divisible Load Theory: A New Paradigm for Load Scheduling in Distributed Systems. Cluster Computing 6, 7– 17 (2003) 7. Kim, S., Weissman, J.B.: A Genetic Algorithm Based Approach for Scheduling Decomposable Data Grid Applications. In: Proceeding of the International Conference on Parallel Processing. IEEE Computer Society Press, Washington (2004) 8. Othman, M., Abdullah, M., Ibrahim, H., Subramaniam, S.: Adaptive Divisible Load Model for Scheduling Data-Intensive Grid Applications. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 446–453. Springer, Heidelberg (2007)
Evaluation of Eligible Jobs Maximization Algorithm for DAG Scheduling in Grids
Tomasz Szepieniec1 and Marian Bubak1,2
1 Academic Computer Centre CYFRONET AGH, ul. Nawojki 11, 30-950 Kraków, Poland
2 Institute of Computer Science, AGH, al. Mickiewicza 30, 30-059 Kraków, Poland
[email protected], [email protected]
Phone: (+48 12) 617 43 35, Fax: (+48 12) 633 80 54
Abstract. Among many attempts to design DAG scheduling algorithms that would face grid environment requirements, the strategy of number of eligible jobs maximization seems promising. Therefore, this paper presents the results of thorough analysis and evaluation of this strategy and its implementation called PRIO. We have analysed a large space of random DAGs and various resources parameters to compare results of PRIO algorithm with standard critical path length prioritization, FIFO prioritization as well as with quasi-optimal solution. Results of this comparison, in terms of the makespan and robustness, are supplemented by a theoretical and specific case analysis. We conclude with an assessment of usefulness of the current implementation of eligible jobs maximization strategy. Keywords: DAG, list scheduling, application scheduling, Internet-based computing, eligible jobs maximization.
1 Introduction
Modern, large scale applications require efficient execution on available resources. The structure of many of these applications may be represented by DAGs. Many attempts at DAG-based application scheduling for grid environments were enhancements of existing solutions [1,2]; however, one may argue that they address the requirements only partially or require knowledge about the environment that is hard to obtain. The uncertainty level typical for Internet-based computing [3] precludes an accurate identification of a critical path. The observation, made from the perspective of a user or an application scheduler in large computation environments, that free resources quickly become busy if not allocated immediately, provides the foundation for the idea of keeping an application ready to use resources immediately when they become available. In the case of DAGs, keeping applications 'ready' means scheduling jobs in a way which maximises the number of jobs that are eligible for mapping to new resources when they appear. An implementation of this strategy was the subject of a series of papers [3,4,5,6].
An algorithm was introduced to provide an Internet-Computing (IC) optimal schedule for a large class of DAGs. However, in practice an algorithm that may be applied to all possible DAGs is needed, so a heuristic algorithm was designed to provide an IC-optimal schedule if it exists, while for the remaining DAGs the heuristics take steps to enhance the eligible jobs ratio [7]. The heuristics were implemented and integrated with Condor DAGMan under the name PRIO Tool. This implementation was used in [7] for a comparison with FIFO ordering. The results of that evaluation were promising; however, since FIFO ordering is a rather basic strategy, we believe that a comparison with stronger algorithms is still needed to understand the usefulness of Malewicz's heuristics. The goal of this paper is to provide a comparison with more advanced techniques and a better understanding of the usability of the PRIO algorithm for DAG scheduling in contemporary grids. Specifically, we focused on the following aspects:
– assessment of the PRIO algorithm's applicability to real grid systems; we tried to identify possible strengths and weaknesses of this algorithm;
– statistical evaluation of the algorithm in comparison with the standard DAG job prioritization method, FIFO ordering and optimal schedules, in order to clarify how successful the algorithm is and how robust the results it provides are;
– case analysis of a few typical results to understand the characteristics of schedules produced by the PRIO prioritization.
Before we describe the results of the analysis in Section 3, we give a brief overview of related works in Section 2. Next, we introduce the practical evaluation by describing a grid model implemented for simulation in Section 4. The methodology we have used is given in Section 5. Statistical evaluation and case analysis are presented in Sections 6 and 7. Finally, we summarize our evaluation in Section 8.
2 Related Works
An extensive overview of DAG scheduling algorithms related to grids is given in [8]. In a grid environment, new challenges are faced. The most important ones are the heterogeneity of resources and dynamic changes of several parameters of the environment, such as resource availability and transfer times between nodes. Several attempts were made to adapt previous DAG scheduling heuristics, like HEFT [9] and FCP [10], to the grid environment [1,2]; however, none of them fully addresses the new requirements. An important issue in list scheduling algorithms, like HEFT or SDC [11], is their sensitivity to the method used to evaluate job performance on nodes [13]. From a practical point of view, in a large environment it is complicated to gather the relevant data, even if we use a less sensitive hybrid algorithm [12]. None of the above-mentioned heuristics considers the heterogeneity of the network parameters. For applications which are sensitive to them, clustering heuristics would be a better solution [8]. Finally, some of the proposed algorithms, e.g. JDCS [14], provide sophisticated mechanisms such as back-trace techniques to reduce the data
preloading delay, but they usually require rarely available services, like grid performance prediction or resource reservation, and depend heavily on their quality. Currently available evaluations of PRIO [7] are limited to a comparison with FIFO-based ordering. Since the latter method in practice means 'no prioritization', we consider such an evaluation insufficient; however, PRIO proved to be substantially better than FIFO. It is worth mentioning that the comparison was made on DAGs with almost equal execution times (random variations of up to 30 per cent) and on nearly homogeneous resources.
3 Analysis of the Eligible Jobs Maximization Algorithm
The heuristics mentioned in the previous section, like HEFT, focus directly on reducing the makespan. By contrast, PRIO maximises the number of eligible jobs in each step, in the hope that this strategy increases resource usage and, eventually, decreases the overall DAG makespan. PRIO Tool is an implementation of a heuristic algorithm proposed by Malewicz et al. [3]. The heuristic builds on the results of research on algorithms that find optimal IC schedules for some classes of DAGs [3,4,5,6]. PRIO takes advantage of the idea that it is possible to derive an IC-optimal schedule for a complex DAG by decomposing it into simple components, scheduling each component independently, and then combining the resulting schedules [3]. The tool provides a prioritization of DAG jobs, which is typically the first stage of list scheduling algorithms. Mapping jobs to resources is expected to be done according to the prioritized list. The target platform of PRIO is application scheduling in grids with concurrent schedulers. In such environments, if heavily used, resources that become available are allocated immediately to jobs that are eligible. So, from the point of view of an application-level scheduler, resources are lost if there are no eligible jobs available at the moment when the resources appear. The aim of PRIO is to minimise the probability of this kind of situation, called a gridlock. An important advantage of the PRIO algorithm is that it does not require estimates of the time needed to complete either jobs or transfers. The only input data that PRIO requires is the structure of a DAG. Therefore, in environments in which there are no means of obtaining estimates of node and link efficiency (in some cases this efficiency could vary for different jobs and/or change over time), or in which we do not know the jobs' reference execution times, PRIO would be a better choice than methods that are very sensitive to the accuracy of performance data [13]. However, in many cases, taking into consideration only the structure of the DAG weakens the results. In particular, it is worth noting that the sequence of jobs produced by the PRIO algorithm is, in fact, intended as a completion order, while it is at job completion time that new eligible jobs are triggered. PRIO proposes this order for submission, which implicitly assumes that all jobs and transfers take the same time or, at least, that the order of completion remains the
same. This limits the heuristic's applicability to heterogeneous environments and its usage for DAGs composed of jobs with different execution times. In such a case a schedule built according to PRIO would suffer from two problems:
– the aim of maximising eligible jobs would be missed as a result of changes in the jobs' completion sequence;
– the gridlock probability would increase when long jobs are scheduled while short jobs would have provided new eligible jobs earlier.
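To make the notion of eligible jobs concrete, a minimal sketch follows; representing the DAG as a dictionary of predecessor sets is an illustrative assumption of this sketch.

    def eligible_jobs(predecessors, completed):
        # predecessors: dict mapping each job to the set of jobs it depends on
        # completed: set of jobs whose execution has finished
        return {job for job, preds in predecessors.items()
                if job not in completed and preds <= completed}

    # example: dag = {'a': set(), 'b': {'a'}, 'c': {'a'}, 'd': {'b', 'c'}}
    # eligible_jobs(dag, {'a'}) yields {'b', 'c'}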
4 Simulation Method and Related Grid Model
Since the PRIO algorithm was designed for application-level scheduling, in the simulation environment we try to model the resources which are available to a single user in a computational grid organized similarly to the WLCG/EGEE grid [15]. In WLCG the resources consist of about 250 computation clusters, called 'sites', ranging in size from several to thousands of CPUs. Jobs at each site are scheduled by a Local Resource Management System (LRMS) according to a local policy of supporting a set of virtual organizations, which are mapped onto queues at the LRMS. A resource broker service chooses suitable services according to the job owner's membership in a virtual organization and the job requirements. This decision is made immediately, according to a resource list that is typically ranked by the expected waiting time in a local queue. In such a grid architecture, requesting a resource for a job at the moment when it becomes eligible introduces unacceptable overheads. A more reasonable solution is to use lazy scheduling, in which resources are requested in advance according to the expected need, but allocation to jobs is done at the moment when resources become available. When no jobs are ready for execution at this time, the resources are freed by the application, since keeping resources that are not actively used is usually against the rules. Such conditions create room for application scheduling in which the environment is seen as a stream of resources ready for job allocation and usage. We use a simple, discrete-event simulator, similar to the one applied in the first evaluation of the PRIO algorithm. Resources are modelled as a probabilistic stream of free workers (CPUs) parameterized by the average time between resources-appear events and the average number of workers available in such events. Computational resources are parameterized by a heterogeneity level ranging from homogeneous resources to a level at which some workers can be 4 times more efficient than others, which is the value observed in WLCG. Since in our evaluation we do not focus on brokering resources to a specific site but on prioritization only, data transfers are not modelled separately; we consider them included in the overall cost of executing a job and in the heterogeneity of resources. We also assume that the resource broker takes resources in random order, regardless of their efficiency.
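A minimal sketch of such a worker stream is given below; the use of exponential distributions and the function signature are assumptions of this sketch, since the model above only specifies the two averages and the heterogeneity level.

    import random

    def resource_stream(mean_tbe, mean_size, heterogeneity, horizon):
        # yields (time, [worker speeds]) events: free workers appearing in the grid
        t = 0.0
        while t < horizon:
            t += random.expovariate(1.0 / mean_tbe)          # time between resources-appear events
            count = max(1, int(random.expovariate(1.0 / mean_size)))
            # worker efficiency drawn between 1 and the heterogeneity index (up to 4 in WLCG)
            speeds = [random.uniform(1.0, heterogeneity) for _ in range(count)]
            yield t, speeds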
5 Evaluation Set-Up
The aim of our evaluation was to understand PRIO usability by comparing its results with those of other solutions for a wide range of cases. Below we describe the prioritization methods chosen for the comparison; in each case the motivation for adding the algorithm to our comparison is provided.
1. BTIME – a classic approach that prioritizes jobs according to the maximal sum of execution costs on a path from the current job to one of the sinks of the DAG, based on estimated computing and communication costs of the DAG jobs (a sketch of this bottom-level computation is given after Table 1). A comparison with this approach is the most interesting, as this method is commonly used.
2. FIFO – jobs are scheduled in the order in which they are made eligible. When more than one job becomes eligible at the same time, the order from the DAG definition is taken. The reason for adding this algorithm to the comparison was twofold: (1) this model represents no prioritization, so we could measure the added value of PRIO, and (2) this algorithm was used in previous PRIO evaluations, so it gave us an opportunity to validate our results.
3. Quasi-optimal – a post mortem search for the optimal prioritization based on the exact availability of resources. We applied an algorithm that tests all possible prioritizations, with some optimizations that speed up the process (e.g. detecting schedules already tested). However, it was still necessary to limit both the execution time and the solution buffer size. In cases when the algorithm was not able to complete within the defined time, we took as its result the best of the partial result of the optimal algorithm and the results of the other algorithms in the comparison. The motivation for adding this algorithm was to estimate the room for further improvement.
Random DAG generation was done using a modified DAGGEN tool [16]. We generated DAGs of different characteristics, parametrized by fatness (FAT), density of communication (DENS), regularity (REG), jumps between levels (JUMP), difference in cost between jobs (CCR) and the overall size of the DAG. The values of the parameters used are shown in the upper part of Table 1. We had 3000 different configurations for DAG generation, each used 10 times in the process of generating DAGs.

Table 1. Values for DAG parameters and environment parameters

  Parameter       Values
  FAT             0.05, 0.1, 0.2, 0.3, 0.5
  DENS            0.05, 0.1, 0.2, 0.3, 0.4
  REG             0.01, 0.05, 0.1, 0.2, 0.4
  JUMP            1, 3, 5
  CCR             0, 1, 2, 3
  size of DAGs    10, 30
  TBE             5, 10, 20, 40, 80, 160
  RS              1, 3, 5, 7, 9, 11
  HI              1, 2, 3, 4
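A minimal sketch of the bottom-level computation underlying BTIME; the successor and cost dictionaries are illustrative assumptions of this sketch, since the exact cost model is not spelled out above.

    def btime_priority(successors, exec_cost, comm_cost):
        # bottom level: maximal sum of costs on a path from a job to a sink of the DAG
        memo = {}
        def bottom_level(job):
            if job not in memo:
                memo[job] = exec_cost[job] + max(
                    (comm_cost.get((job, s), 0) + bottom_level(s) for s in successors[job]),
                    default=0)
            return memo[job]
        # jobs with the largest bottom level get the highest priority
        return sorted(successors, key=bottom_level, reverse=True)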
Table 2. Ratio of wins for each prioritization algorithm in nontrivial cases
                     PRIO    BTIME   FIFO
  summary of wins    59.0%   93.7%   8.6%
  individual wins    3.9%    34.0%   0.9%
Table 3. Normalized average makespan and its standard deviation achieved by each method in nontrivial cases

                       PRIO    BTIME   FIFO
  average makespan     1.111   1.097   1.129
  standard deviation   0.164   0.158   0.174
Resources are characterized by the average time between resources-appear events (TBE), the average resource size (RS) in each event and the heterogeneity index (HI) of the system, defined as the maximum performance ratio between two machines in the system. The list of parameter values used is collected in the lower part of Table 1. There were 90 parameter combinations, each applied 10 times to every generated DAG. In total, 2.7M simulations were performed for every prioritization method. It is worth mentioning that in the mapping process we mapped all the eligible jobs in prioritization order, so the overall mapping order could differ from the prioritization list. As we focus on the evaluation of prioritization algorithms, the simplest method of resource mapping was used. To make the competition fair, the efficiency of the workers (heterogeneity index) mapped onto the same job remains the same for all methods. The simulations were performed on the Zeus cluster at ACC CYFRONET AGH. Limiting the optimal algorithm's execution time to 1 minute, we were able to collect all the results in less than 72 hours using 150 cores of 4-core Intel Xeon 2.33 GHz processors.
6 Statistical Analysis of Results
From the analysis presented here, we excluded trivial cases, by which we mean schedules that produced the same results for all algorithms. We had two classes of such cases: (1) resources were so limited that the whole process became almost sequential, (2) the availability of resources allowed all eligible jobs to be scheduled virtually immediately. In our experiment 82% of all cases were classified as trivial. Table 2 presents a summary of the comparison, assuming that we are interested only in the best algorithm for each case. For every case we chose the algorithm that provides the smallest makespan, which we call the winner. In many cases more than one method provided the best result. In general, the most successful prioritization method was BTIME. PRIO proved to be substantially better than FIFO, but failed to provide results comparable to BTIME.
Table 4. Selected parameter's value correlation with makespan and range of changes of average makespan results for PRIO and BTIME

                                      PRIO                       BTIME
                                      Correlation  Max. change   Correlation  Max. change
  density                             0.2          0.1%          97.6         0.0%
  allowed jumps between levels        0.99         2.6%          0.99         3.0%
  allowed jumps between levels        0.99         2.6%          0.99         3.0%
  performance ratio between jobs      -0.95        1.4%          -0.85        1.3%
  heterogeneity of environment        -0.95        1.4%          -0.85        1.3%
[Plot: average normalized makespan (y-axis, 1.0 to 1.22) versus resources usage ratio (x-axis, 0 to 0.9) for the FIFO, PRIO and BTIME prioritizations.]
Fig. 1. Average normalized makespan as a function of the resource usage ratio
For a better understanding of the usability of the PRIO algorithm we analysed the parameters in the class of the 3.9% of cases in which PRIO provided the best result. The results show the same distribution of parameters as in the whole set of nontrivial cases. Therefore, we can assume that the success of the PRIO algorithm depends on the specific structure of the DAG and its relation to the resource stream. The rest of the analysis is based on makespan values normalized with respect to the quasi-optimal results. Normalization was done so that long and short schedules have the same impact on the presented data. A general summary of the average makespan of each algorithm and its standard deviation is presented in Table 3. We see that PRIO is situated between the other two algorithms both in makespan and in robustness measured in terms of the standard deviation of the makespan [17]. Additionally, we observe that each prioritization method produced an average makespan about 10% worse than the quasi-optimal one. Taking into account also a standard deviation that exceeds 15%, we get a picture of substantial room for further improvement in this field. We also analysed how the simulation parameters influence the results; even where there was a strong correlation between parameter values and makespan, the overall impact on the makespan within the range of parameters we evaluated does not exceed 3% (Table 4).
Table 5. Average stall ratio
                        PRIO    BTIME   FIFO
  average stall ratio   0.412   0.408   0.415
  standard deviation    0.263   0.261   0.259
Figure 1 presents the average makespan as a function of the average resource usage, in order to check how the schedules depend on a value that reflects the availability of resources. We note that in every class of each parameter the BTIME prioritization outperformed PRIO. Additionally, we can conclude that the scheduling of DAGs prioritized by the evaluated algorithm depends strongly on the specific structure of a DAG and on resource availability. Another question that we tried to answer in this evaluation was whether the PRIO algorithm was able to provide more eligible jobs. If so, the ratio of 'stalls', defined as the number of cases where there were no eligible jobs when a resource event appeared in the system, should be smaller than for the other algorithms. Unfortunately, as we can see in Table 5, the stall ratio was higher than in the BTIME case. This may be connected with the fact that bad scheduling decisions lead to stalls while waiting for a job whose execution takes longer than average. The large standard deviation is caused by the varying parameters of the resource flow.
7 Typical Cases Analysis
In the previous section we presented the results of an evaluation which shows that the PRIO prioritization provides results statistically worse than the simple BTIME prioritization. In this section we analyse two cases taken from the experiment, illustrated in Figure 2, in order to better understand the results. In both graphs, the number of jobs remaining for submission and the number remaining to complete their execution are shown as functions of time. The space between these two lines is filled to better illustrate the running jobs. In the left graph we can observe that an application scheduled according to the PRIO prioritization gained more resources and completed more jobs in the first stage (t < 150) of the computation process. This was enough to start the next stage substantially earlier and save about 6% of the overall makespan. So, in this case the PRIO algorithm provided results in line with expectations. In the right graph we can observe that for most of the time both prioritizations were equally efficient. The proposed sequences, although different in the second half of the makespan, did not cause differences in allocating the resources that were appearing. What is more, PRIO gained two slots in the last stage, substantially before the other algorithm. Surprisingly, in the end the schedule based on the BTIME prioritization wins, because with these priorities it was possible to start the second-to-last job without waiting for the completion of the others still running. So, in this particular case the strategy that PRIO is built on clearly provides bad solutions. Summing up, we should note that the overall results depend on subtle relations between job timing and resource-appear events.
Fig. 2. Traces of two different schedules according to the PRIO and BTIME prioritizations. The graphs present the number of jobs remaining for submission (lower line) and for completion (upper line) as a function of time. The dashed area between them illustrates jobs in execution. Overlapping dashed areas do not imply the same set of jobs!
Concerning the applicability of the strategy used in PRIO, we conclude that gaining more resources gives good results as long as it does not happen only in the final part of the schedule. In the final stage of DAG execution it is important to keep jobs running concurrently, avoiding a situation in which the last jobs are left to run alone at the end.
8 Conclusions
In this paper we have considerably extended the knowledge about the heuristics that maximize eligible jobs (the PRIO algorithm) by comparing it with other heuristics and analysing the applicability of this algorithm in real usage. The main conclusion is that the tested implementation of the eligible jobs maximization strategy is not mature enough to be used in the described environment. At this moment, simpler mechanisms, like the BTIME heuristics, provide better schedules. However, there is significant room for improvement in which the strategy of maximizing eligible jobs could be useful for improving existing solutions. To achieve this, we will continue our research towards eliminating the weaknesses of the PRIO algorithm identified in this paper.
Acknowledgments. We would like to thank Grzegorz Malewicz for providing us with the implementation of the PRIO tool and the simulator on which the presented evaluation was obtained. Simulations were processed on the Zeus cluster at ACC CYFRONET AGH. This work was partly supported by the EU IST CoreGRID Project.
References 1. You, S.Y., Kim, H.Y., Hwang, D.H., Kim, S.C.: Task Scheduling Algorithm in GRID Considering Heterogeneous Environment. In: Proc. of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2004, Nevada, USA, June 2004, pp. 240–245 (2004) 2. Ma, T., Buyya, R.: Critical-Path and Priority based Algorithms for Scheduling Workflows with Parameter Sweep Tasks on Global Grids. In: Proc. of the 17th International Symposium on Computer Architecture and High Performance Computing, Rio de Janeiro, Brazil (October 2005) 3. Malewicz, G., Rosenberg, A., Yurkewych, M.: Towards a Theory for Scheduling Dags in Internet-Based Computing. IEEE Transactions on Computers 55(6), 757– 768 (2006) 4. Rosenberg, A.L.: On scheduling mesh-structured computations for Internet-based computing. IEEE Trans. Comput. 53, 1176–1186 (2004) 5. Rosenberg, A.L., Yurkewych, M.: Guidelines for scheduling some common computation-dags for Internet-based computing. IEEE Trans. Comput. 54, 428– 438 (2005) 6. Cordasco, G., Malewicz, G., Rosenberg, A.L.: Advances in IC-Scheduling Theory: Scheduling Expansive and Reductive Dags and Scheduling Dags via Duality. IEEE TPDS 18(11) (November 2007) ISSN: 1045-9219 7. Malewicz, G., Foster, I., Rosenberg, A., Wilde, M.: A Tool for Prioritizing DAGMan Jobs and Its Evaluation. In: 15th IEEE International Symposium on High Performance Distributed Computing (HPDC-15), pp. 156–167 (2006) 8. Dong, F., Akl, S.G.: Scheduling Algorithms for Grid Computing: State of the Art and Open Problems. Technical Report of Queen’s University School of Computing, 2006-504 (January 2006) 9. Topcuoglu, H., Hariri, S., Wu, M.Y.: Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems 13(3), 260–274 (2002) 10. Radulescu, A., van Gemund, A.J.C.: On the Complexity of List Scheduling Algorithms for Distributed Memory Systems. In: Proc. of 13th International Conference on Supercomputing, Portland, Oregon, USA, ovember 1999, pp. 68–75 (1999) 11. Shi, Z., Dongarra, J.J.: Scheduling workflow applications on processors with different capabilities. Future Generation Computer Systems 22(6), 665–675 (2006) 12. Sakellariou, R., Zhao, H.: A Hybrid Heuristic for DAG Scheduling on Heterogeneous Systems. In: Proc. of 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), Santa Fe, New Mexico USA, April 2004, pp. 111–123 (2004) 13. Zhao, H., Sakellariou, R.: An Experimental Investigation into the Rank Function of the Heterogeneous Earliest Finish Time Scheduling Algorithm. In: Kosch, H., B¨ osz¨ orm´enyi, L., Hellwagner, H. (eds.) Euro-Par 2003. LNCS, vol. 2790, pp. 189– 194. Springer, Heidelberg (2003) 14. Dong, F., Akl, S.G.: A Joint Data and Computation Scheduling Algorithm for the Grid. In: Kermarrec, A.-M., Boug´e, L., Priol, T. (eds.) Euro-Par 2007. LNCS, vol. 4641, pp. 587–597. Springer, Heidelberg (2007) 15. Wordwide LHC Grid Computing. Web page: http://lcg.web.cern.ch/LCG/ 16. Daggen – Synthetic DAG Generation. http://www.loria.fr/∼ suter/dags.html 17. Canon., L.C., Jeannot, E.: A Comparison of Robustness Metrics for Scheduling DAGs on Heterogeneous Systems. In: 6th Int. Workshop on Algorithms, Models and Tools – HeteroPar 2007, Austin (2007)
Parallel Path-Relinking Method for the Flow Shop Scheduling Problem
Wojciech Bożejko1 and Mieczysław Wodecki2
1 Wrocław University of Technology, Institute of Computer Engineering, Control and Robotics, Janiszewskiego 11-17, 50-372 Wrocław, Poland
[email protected]
2 University of Wrocław, Institute of Computer Science, Joliot-Curie 15, 50-383 Wrocław, Poland
[email protected]
Abstract. The use of scheduling algorithms in parallel computing environments is discussed in this paper. A parallel path-relinking approach based on the scatter search metaheuristic is proposed for the flow shop problem with the Cmax and Csum criteria. The results obtained are very promising: superlinear speedup is observed for some versions of the parallel algorithm.
1 Introduction
The main issue discussed here is the problem of using scheduling algorithms in parallel environments, such as multiprocessor systems, clusters or local networks. On the one hand, the sequential character of the scheduling algorithms' computation process is an obstacle to designing sufficiently effective parallel algorithms. On the other hand, parallel computations offer essential advantages for solving difficult problems of combinatorial optimization. We consider the permutation flow shop scheduling problem, a classic NP-hard problem of combinatorial optimization, which can be described as follows. A number of jobs are to be processed on a number of machines. Each job must go through all the machines in exactly the same order and the job order must be the same on each machine – the machines are ordered as a linear chain. Each machine can process at most one job at any point of time and each job may be processed on at most one machine at any time. The objective is to find a schedule that minimizes the sum of the jobs' completion times (the F ||Csum problem) or the maximal job completion time (the F ||Cmax problem). Garey, Johnson & Seti [4] show that F ||Cmax is strongly NP-hard for more than 2 machines. A branch and bound algorithm was proposed by Grabowski [5]. Its performance is not entirely satisfactory, however, as it experiences difficulty in solving instances with 20 jobs and 5 machines. Thus, there exist two mutually non-conflicting approaches which allow one to solve large-size instances
in acceptable time: (1) approximate methods (mainly metaheuristics), and (2) parallel methods. As regards parallel metaheuristics dedicated mainly to homogeneous multiprocessor systems (such as mainframe computers and specialized clusters), a parallel variant of the scatter search method, currently one of the most promising methods of combinatorial optimization, has been designed and studied experimentally in application to flow shop scheduling problems with the Cmax and Csum criteria. In some cases the effect of superlinear speedup has been observed. Although the algorithms have not been executed with a huge number of iterations, a new best solution has been obtained for the Csum flow shop problem on the benchmark instances of Taillard [13]. This work is a continuation of the authors' research on constructing efficient parallel algorithms to solve hard combinatorial problems ([2,3,15]). In the following, we present a parallel algorithm based on the scatter search method which not only speeds up the computations but also improves the quality of the results.
2 The Problems
The flow shop problem with makespan criterion. We consider, as the test case, the strongly NP-hard problem well known in scheduling theory as the permutation flow shop problem with the makespan criterion, denoted by F ||Cmax. Consciously skipping the long list of papers dealing with this subject, we only refer the reader to recent reviews and the best up-to-now algorithms [8,6,9]. The problem is introduced as follows. There are n jobs from a set J = {1, 2, . . . , n} to be processed in a production system having m machines, indexed by 1, 2, . . . , m, organized in a line (sequential structure) – ordered as a linear chain. A single job reflects the manufacturing of one final product (or sub-product). Each job is performed in m subsequent stages, in a way common to all tasks. Stage i is performed by machine i, i = 1, . . . , m. Every job j ∈ J is split into a sequence of m operations O1j, O2j, . . . , Omj performed on the machines in turn. Operation Oij reflects the processing of job j on machine i with processing time pij > 0. Once started, a job cannot be interrupted. Each machine can execute at most one job at a time, and each job can be processed on at most one machine at a time. The sequence of loading jobs into the system is represented by a permutation π = (π(1), . . . , π(n)) on the set J. The optimization problem is to find the optimal sequence π* such that

  C_{max}(\pi^*) = \min_{\pi \in \Pi} C_{max}(\pi),   (1)

where Cmax(π) is the makespan for permutation π and Π is the set of all permutations. Denoting by Cij the completion time of job j on machine i, we have Cmax(π) = Cm,π(n). The values Cij can be found using the recursive formula

  C_{i\pi(j)} = \max\{C_{i-1,\pi(j)}, C_{i,\pi(j-1)}\} + p_{i\pi(j)}, \quad i = 1, \ldots, m, \; j = 1, \ldots, n,   (2)

with initial conditions C_{i\pi(0)} = 0, i = 1, \ldots, m, and C_{0\pi(j)} = 0, j = 1, \ldots, n.
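Recursion (2) can be transcribed directly into a small function; the machine-major layout of the processing-time matrix p[i][j] is an assumption of this sketch.

    def makespan(p, pi):
        # p[i][j]: processing time of job j on machine i; pi: permutation of job indices
        m, n = len(p), len(pi)
        C = [[0.0] * (n + 1) for _ in range(m + 1)]   # C[i][k]: completion of the k-th job of pi on machine i
        for k in range(1, n + 1):
            job = pi[k - 1]
            for i in range(1, m + 1):
                C[i][k] = max(C[i - 1][k], C[i][k - 1]) + p[i - 1][job]
        return C[m][n]   # Cmax = completion of the last job on the last machine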
Notice that the problem of transforming a sequential algorithm for scheduling problems into a parallel one is nontrivial, because of the strongly sequential character of the computations carried out by (2) and by other known scheduling algorithms.
The flow shop problem with Csum criterion. The objective is to find a schedule that minimizes the sum of the jobs' completion times. The problem is denoted by F ||Csum. Thanks to special properties (blocks of the critical path, [6]), the problem with the Cmax objective is regarded as an easier one; unfortunately, there are no similar properties that could speed up computations for the F ||Csum flow shop problem. There are plenty of good heuristic algorithms for solving the F ||Cmax flow shop problem, with the objective of minimizing the maximal job completion time. Constructive algorithms (LIT and SPD from [14]) have low efficiency and can only be applied to a limited range of instances. Smutnicki [12] provides a worst-case analysis of the known approximate algorithms. Bożejko and Wodecki [3] proposed a parallel genetic algorithm, and Reeves and Yamada [11] a hybrid algorithm consisting of elements of tabu search, simulated annealing and path relinking methods. The results of the latter algorithm, applied to the Taillard benchmark tests [13], are the best known in the literature nowadays. The flow shop problem with the sum of jobs' completion times criterion can be formulated using the notation from the previous paragraph. We wish to find a permutation π* ∈ Π such that

  C_{sum}(\pi^*) = \min_{\pi \in \Pi} C_{sum}(\pi), \quad \text{where} \quad C_{sum}(\pi) = \sum_{j=1}^{n} C_{m\pi(j)},

where Ciπ(j) is the time required to complete job j on machine i in the processing order given by the permutation π.
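Analogously, a sketch for the Csum objective reuses the same recursion and sums the completion times on the last machine.

    def total_flowtime(p, pi):
        # sum of the completion times of all jobs on the last machine (Csum)
        m, n = len(p), len(pi)
        C = [[0.0] * (n + 1) for _ in range(m + 1)]
        total = 0.0
        for k in range(1, n + 1):
            job = pi[k - 1]
            for i in range(1, m + 1):
                C[i][k] = max(C[i - 1][k], C[i][k - 1]) + p[i - 1][job]
            total += C[m][k]
        return total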
3 Multi-thread Search: Scatter Search Method
The main idea of the scatter search method is presented in [7]. The algorithm is based on maintaining and evolving a set of so-called starting solutions. In the classic version, a linear combination of the starting solutions is used to construct a new solution. In the case of a permutational representation of solutions, taking a linear combination of permutations yields an object which is not a permutation. Therefore, in this paper a path relinking procedure is used to construct a path from one solution of the starting set to another solution from this set. The best element of such a path is chosen as a candidate to be added to the starting solution set.
3.1 Path Relinking
The base of the path relinking procedure, which connects two solutions π1 , π2 ∈ Π, is a multi-step crossover fusion (MSXF) described by Reeves and Yamada [11]. Its idea is based on a stochastic local search, starting from π1 solution,
to find a new good solution, where the other solution π2 is used as a reference point. The neighborhood N(x) of a permutation (individual) x is defined as the set of permutations that can be obtained from x by exactly one adjacent pairwise exchange operator, which swaps the positions of two adjacent jobs of the solution represented by x. The distance measure d(π, σ) is defined as the number of adjacent pairwise exchanges needed to transform permutation π into permutation σ; this measure is known as Kendall's τ.

Algorithm 2. Path-relinking procedure
Let π1, π2 be reference solutions. Set x = q = π1;
repeat
  For each member yi ∈ N(x), calculate d(yi, π2);
  Sort yi ∈ N(x) in ascending order of d(yi, π2);
  repeat
    Select yi from N(x) with a probability inversely proportional to the index i;
    Calculate Csum(yi);
    Accept yi with probability 1 if Csum(yi) ≤ Csum(x), and with probability
      PT(yi) = exp((Csum(x) − Csum(yi)) / T) otherwise (T is the temperature);
    Change the index of yi from i to n and the indices of yk, k = i+1, . . . , n from k to k−1;
  until yi is accepted;
  x ← yi;
  if Csum(x) < Csum(q) then q ← x;
until some termination condition is satisfied;
return q   { q is the best solution lying on the path from π1 to π2 }

The termination condition was exceeding 100 iterations of the path-relinking procedure.
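A minimal Python sketch of this path-relinking scheme follows; it is not the authors' C++ implementation, and the cost callable, the temperature T and the 100-iteration limit are parameters of the sketch.

    import math, random

    def adjacent_neighbors(x):
        # all permutations reachable by one adjacent pairwise exchange
        return [x[:i] + [x[i + 1], x[i]] + x[i + 2:] for i in range(len(x) - 1)]

    def kendall_tau_distance(a, b):
        # number of adjacent transpositions needed to turn a into b (inversion count)
        pos = {job: i for i, job in enumerate(b)}
        ranks = [pos[job] for job in a]
        return sum(1 for i in range(len(ranks)) for j in range(i + 1, len(ranks))
                   if ranks[i] > ranks[j])

    def path_relinking(pi1, pi2, cost, T=1.0, max_iter=100):
        x = q = list(pi1)
        for _ in range(max_iter):
            ys = sorted(adjacent_neighbors(x), key=lambda y: kendall_tau_distance(y, pi2))
            while True:
                # selection probability inversely proportional to the index in the sorted list
                weights = [1.0 / (i + 1) for i in range(len(ys))]
                i = random.choices(range(len(ys)), weights=weights)[0]
                y = ys[i]
                if cost(y) <= cost(x) or random.random() < math.exp((cost(x) - cost(y)) / T):
                    break
                ys.append(ys.pop(i))        # move the rejected neighbor to the end of the list
            x = y
            if cost(x) < cost(q):
                q = x
        return q

Here cost would be the Csum (or Cmax) of a permutation, for example computed with the recursion of Section 2.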
3.2 Parallel Scatter Search Algorithm
The parallel algorithm was designed to execute on two machines:
– a cluster of 152 dual-core Intel Xeon 2.4 GHz processors connected by Gigabit Ethernet with 3Com SuperStack 3870 switches (for the F ||Csum problem),
– a Silicon Graphics SGI Altix 3700 Bx2 with 128 Intel Itanium2 1.5 GHz processors and cache-coherent Non-Uniform Memory Access (cc-NUMA), NUMAflex4 craylinks in a fat tree topology with a bandwidth of 4.3 Gbps (for the F ||Cmax problem),
both installed in the Wrocław Center of Networking and Supercomputing. Both supercomputers have distributed memory, where each processor has its local cache memory (in the same node) which is accessible in a very short time (compared with the time of access to the memory in another node). Taking into consideration this type of architecture, we chose a client-server model for the scatter
Fig. 1. Executing concurrent path-relinking procedures in the set S
search algorithm proposed here, where the calculations of the path-relinking procedures are executed by processors on local data and communication takes place rarely, to create a common set of new starting solutions. The process of communication and evaluation of the starting solution set S is controlled by processor number 0. We call this model global. For comparison, a model without communication was also implemented, in which independent scatter search threads are executed in parallel; the result of such an algorithm is the best solution among those generated by all the searching threads. We call this model independent. The algorithms were implemented in C++ using the MPI (mpich 1.2.7) library and executed under the OpenPBS batching system, which measures the processor usage times.

Algorithm 3. Parallel scatter search algorithm for the SIMD model without shared memory
parfor p := 1 to number of processors do
  for i := 1 to iter do
    Step 1.
    if (p = 0) then {only processor number 0}
      Generate a set of unrepeated starting solutions S, |S| = n.
      Broadcast the set S among all the processors.
    else {other processors}
      Receive from processor 0 the set of starting solutions S.
    end if;
    Step 2.
    For randomly chosen n/2 pairs from S apply the path relinking procedure to generate a set S′ of n/2 solutions which lie on the paths.
    Step 3.
    Apply the local search procedure to improve the value of the cost function of the solutions from the set S′.
    Step 4.
    if (p ≠ 0) then
      Send the solutions from the set S′ to processor 0
    else {only processor number 0}
      Receive the sets S′ from the other processors and add their elements to the set S
      Step 5.
      Leave in the set S at most n solutions by deleting the worst and repeated solutions.
      if |S| < n then
        Add new random solutions to the set S such that the elements of S do not duplicate and |S| = n.
      end if;
    end if;
  end for;
end parfor.
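A hedged sketch of the communication skeleton of Algorithm 3, using mpi4py as a stand-in for the paper's C++/MPI code; generate_set, path_relink, local_search and merge are placeholder callables, and in this sketch the starting set is generated once rather than in every iteration.

    import random
    from mpi4py import MPI

    def random_pairs(S):
        # n/2 random pairs drawn from the starting set S
        idx = random.sample(range(len(S)), len(S) // 2 * 2)
        return [(S[idx[k]], S[idx[k + 1]]) for k in range(0, len(idx), 2)]

    def parallel_scatter_search(iterations, generate_set, path_relink, local_search, merge):
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        S = generate_set() if rank == 0 else None          # Step 1: processor 0 builds the starting set
        for _ in range(iterations):
            S = comm.bcast(S, root=0)                      # broadcast the starting set to all processors
            S1 = [path_relink(a, b) for a, b in random_pairs(S)]   # Step 2: solutions on the paths
            S1 = [local_search(x) for x in S1]                     # Step 3: improve them locally
            gathered = comm.gather(S1, root=0)                     # Step 4: collect S' on processor 0
            if rank == 0:
                S = merge(S, [x for part in gathered for x in part])  # Step 5: keep at most n solutions
        return S if rank == 0 else None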
3.3 Computer Simulations
Tests were based on 50 instances with 100, . . . , 500 operations (n × m = 20×5, 20×10, 20×20, 50×5, 50×10) due to Taillard [13], taken from the OR-Library [10]. The results were compared with the best known ones, taken from [10] for F ||Cmax and from [11] for F ||Csum. For each version of the scatter search algorithm (global or independent) the following metrics were calculated:
– ARPD – Average Percentage Relative Deviation from the benchmark's cost function value, where

  PRD = \frac{F_{ref} - F_{alg}}{F_{ref}} \cdot 100\%,

where Fref is the reference criterion value from [10] for F ||Cmax and from [11] for F ||Csum, and Falg is the result obtained by the parallel scatter search algorithm. There were no situations where Fref = 0 for the benchmark tests.
– ttotal (in seconds) – the real time of executing the algorithm for the 50 benchmark instances from [13],
– tcpu (in seconds) – the sum of the times consumed on all processors for the 50 benchmark instances from [13].
Table 1. Values of ARPD for the parallel scatter search algorithm for the F ||Cmax problem (global model). The total number of iterations over all processors is 9600.
Processors 1 2 4 8 16 iter =9600 2 iter = 4800 iter = 2400 8 iter = 1200 iter = 600
20 × 5 20 × 10 20 × 20 50 × 5 50 × 10
0.000% 0.097% 0.039% 0.007% 0.345%
0.000% 0.060% 0.035% -0.001% 0.104%
0.000% 0.072% 0.061% -0.015% 0.113%
0.000% 0.131% 0.062% -0.001% 0.123%
0.096% 0.196% 0.136% 0.007% 0.272%
average
0.098%
0.029%
0.046%
0.063%
0.142%
ttotal (h:min:sec) tcpu (h:min:sec)
30:04:40 30:05:02
15:52:13 31:44:21
7:40:51 30:41:54
3:35:47 28:45:30
1:42:50 27:24:58
Table 2. Values of APRD for parallel scatter search algorithm for the F ||Cmax problem (independent model). The sum of iterations’s number for all processors is 9600. n×m
Processors 1 2 4 8 16 iter =9600 2 iter = 4800 iter = 2400 8 iter = 1200 iter = 600
20 × 5 20 × 10 20 × 20 50 × 5 50 × 10
0.000% 0.097% 0.039% 0.007% 0.345%
0.000% 0.080% 0.062% 0.000% 0.278%
0.000% 0.066% 0.048% 0.007% 0.148%
0.000% 0.039% 0.031% 0.007% 0.238%
0.096% 0.109% 0.031% 0.000% 0.344%
average
0.098%
0.084%
0.054%
0.063%
0.097%
ttotal (h:min:sec) tcpu (h:min:sec)
30:04:40 30:05:02
14:38:29 29:16:14
6:58:59 27:54:19
3:15:34 26:03:33
1:32:46 24:41:24
Table 3. Values of APRD for parallel scatter search algorithm for the F ||Csum problem (independent model). The sum of iterations’s number for all processors is 16000. n×m
Processors 1 2 4 8 16 iter =16000 2 iter = 8000 iter = 4000 8 iter = 2000 iter = 1000
20x5 20x10 20x20 50x5 50x10 average
0.000 0.000 0.000 0.904 0.913 0.363
0.007 0.000 0.000 1.037 0.986 0.406
0.000 0.000 0.000 0.906 1.033 0.388
0.006 0.000 0.000 0.903 0.989 0.380
0.016 0.000 0.000 0.933 1.110 0.412
ttotal (h:min:sec) tcpu (h:min:sec)
75:27:40 75:25:48
37:40:08 75:02:51
18:38:23 74:10:18
9:06:24 72:19:26
4:28:57 70:57:24
Table 4. Values of APRD for parallel scatter search algorithm for the F ||Csum problem (global model). The sum of iterations’s number for all processors is 16000. n×m
Processors 1 2 4 8 16 iter =16000 2 iter = 8000 iter = 4000 8 iter = 2000 iter = 1000
20x5 20x10 20x20 50x5 50x10
0.000 0.000 0.000 0.993 1.103
0.000 0.000 0.000 0.677 0.648
0.000 0.000 0.000 0.537 0.474
0.008 0.004 0.000 0.449 0.404
0.007 0.000 0.000 0.764 0.734
average
0.419
0.265
0.202
0.173
0.301
ttotal (h:min:sec) tcpu (h:min:sec)
75:23:44 75:20:42
41:19:51 77:57:57
23:28:19 75:46:07
14:30:03 74:38:51
7:23:50 73:13:35
Flow shop problem with makespan Cmax criterion. Tables 1 and 2 present the results of the parallel scatter search method for a total number of iterations (summed over all processors) equal to 9600. The cost of the computations, understood as the sum of the times consumed on all processors, is about 7 hours for all 50 benchmark instances of the flow shop problem. The best results (average percentage deviations from the best known solutions) are obtained by the 2-processor version of the global model of the scatter search algorithm (with communication), which is 70.4% better than the average 1-processor implementation (0.029% vs 0.098%). Because the time consumed on all processors is only a little longer than the time of the sequential version, we can say that the speedup of this version of the algorithm is almost linear. For the 4- and 8-processor implementations of the global model, and for the 2-, 4- and 8-processor implementations of the independent model, the average ARPD results are better than the ARPD of the 1-processor version, whereas the times consumed on all processors (tcpu) are shorter. So these algorithms obtain better results at a smaller computational cost – the speedup is superlinear. This anomaly can be understood as a situation in which the sequential algorithm traverses the solution space along a worse path than the one the parallel algorithm is able to choose. More about superlinear speedup can be found in the book by Alba [1].
these implementations (74:38:51 and 73:13:35, hours:minutes:seconds) was smaller than the total execution time of the sequential algorithm (75:20:42). Such a situation takes place only for the global model of the scatter search algorithm – independent searches are not as effective, both in results (ARPD) and in speedup. A new best solution was discovered for the flow shop problem with the Csum criterion during the computational experiments: the new upper bound for the tai50 instance is 88106 (the previous one was 88215, from [11]). Though it was not the purpose of this research, the results obtained by the proposed algorithm are on average only 0.05% worse (4 processors, independent model) than the best results for the Cmax problem, obtained by Nowicki and Smutnicki [9]. For the Csum problem the results are 0.17% worse (8 processors, also the independent model) than the best known ones, obtained by the algorithm of Reeves and Yamada [11].
4 Conclusions
An approach to the parallelization of scheduling algorithms for the flow shop problem has been described here. In multiple-thread search, represented here by a parallel scatter search, parallelization increases the quality of the obtained solutions while keeping the cost of computations comparable. Superlinear speedup is observed in the cooperative (global) model of parallelism. The parallel scatter search skeleton can easily be adapted to solve other NP-hard problems with a permutational solution representation, such as the traveling salesman problem (TSP), the quadratic assignment problem (QAP) or single machine scheduling problems.
References 1. Alba, E.: Parallel Metaheuristics. Wiley & Sons Inc., Chichester (2005) 2. Bo˙zejko, W., Wodecki, M.: Solving the flow shop problem by parallel tabu search. In: Proceedings of PARELEC 2004, pp. 189–194. IEEE Computer Society, Los Alamitos (2004) 3. Bo˙zejko, W., Wodecki, M.: Parallel genetic algorithm for the flow shop scheduling problem. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Wa´sniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 566–571. Springer, Heidelberg (2004) 4. Garey, M.R., Johnson, D.S., Seti, R.: The complexity of flowshop and jobshop scheduling. Mathematics of Operations Research 1, 117–129 (1976) 5. Grabowski, J.: A new algorithm of solving the flow-shop problem, Operations Research in Progress, pp. 57–75. D. Reidel Publishing Company (1982) 6. Grabowski, J., Pempera, J.: New block properties for the permutation flow shop problem with application in tabu search. Journal of Operational Research Society 52, 210–220 (2000) 7. James, T., Rego, C., Glover, F.: Sequential and Parallel Path-Relinking Algorithms for the Quadratic Assignment Problem. IEEE Intelligent Systems 20(4), 58–65 (2005) 8. Nowicki, E., Smutnicki, C.: A fast tabu search algorithm for the permutation flow shop problem. European Journal of Operational Research 91, 160–175 (1996)
9. Nowicki, E., Smutnicki, C.: Some aspects of scatter search in the flow-shop problem. European Journal of Operational Research 169, 654–666 (2006) 10. OR-Library: http://people.brunel.ac.uk/∼ mastjjb/jeb/info.html 11. Reeves, C.R., Yamada, T.: Genetic algorithms, path relinking and the flowshop sequencing problem. Evolutionary Computation 6, 45–60 (1998) 12. Smutnicki, C.: Some results of the worst-case analysis for flow shop scheduling. European Journal of Operational Research 109(1), 66–87 (1998) 13. Taillard, E.: Benchmarks for basic scheduling problems. European Journal of Operational Research 64, 278–285 (1993) 14. Wang, C., Chu, C., Proth, J.: Heuristic approaches for n/m/F/ΣCi scheduling problems. European Journal of Operational Research, 636–644 (1997) 15. Wodecki, M., Bo˙zejko, W.: Solving the flow shop problem by parallel simulated annealing. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Wa´sniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 236–247. Springer, Heidelberg (2002)
A Fast and Efficient Algorithm for Topology-Aware Coallocation
Valentin Kravtsov1, Martin Swain2, Uri Dubin1, Werner Dubitzky2, and Assaf Schuster1
1 Technion – Israel Institute of Technology, Technion City, 32000, Haifa, Israel
svali [email protected]
2 University of Ulster, Cromore Road, Coleraine, BT52 1SA, Northern Ireland, UK
Abstract. Modern distributed applications require coallocation of massive amounts of resources. Grid level allocation systems must efficiently decide where these applications can be executed. To this end, the resource requests are described as labeled graphs, which must be matched with equivalent labeled graphs of available resources. The coallocation problem described in the paper has real-world requirements and inputs that differ from those of a classical graph matching problem. We propose a new algorithm to solve the coallocation problem. The algorithm is especially tailored for medium to large grid systems, and is currently being integrated into the QosCosGrid system’s allocation module.
1 Introduction
The problem we are tackling is a maximal allocation of a labeled requests graph to a labeled offers 1 graph. This problem is also referred as graph matching. The allocation must satisfy both the constraints of the nodes (machines) and the constraints of the links (network). In our setup, the allocation can be nonoptimal in terms of the allocation size; however, all allocations must obey all computing and network constraints. The motivation for our work comes from real-world scientific applications, including complex systems simulations. Complex systems simulations include highly parallel applications such as large cellular automata; molecular dynamics simulations; combinations of coarse and fine-grained parallel applications, such as distributed evolutionary algorithms for optimizing parameters; techniques such as parallel tempering, where molecular dynamics simulations are combined with Monte Carlo algorithms; and agent-based models where both the frequency of communication between agents and the number of agents is highly variable and may change with time [1]. Such applications rely on the coallocation of large numbers of reliable resources. This requirement has traditionally been met by supercomputing facilities, but some applications researchers are now looking to computational grids as a more economic computing resource. 1
In this paper we use the terms “offers” and “available machines” interchangeably, assuming that only available machines are offered by resource providers.
Fig. 1. A parallelized agent-based model (left) with the agent interactions represented as a graph (right)
Quasi-opportunistic supercomputing is a new approach, designed to enable the execution of demanding parallel applications on massive nondedicated resources in grid environments [2]. Fig. 1 shows an approach for parallelizing an agent-based model in which each agent interacts with others within a certain distance and must be aware of the agents that are within that distance. In Fig. 1, on the left, each black dot represents an agent, the light gray circle represents the distance for definite interactions, and the outer circle indicates possible future interactions. These interactions can be represented by a graph, as shown on the right, and it is this graph which depicts the properties and the topology of the required resources. The matching methods in the literature can be divided into two broad categories: the first contains exact matching methods that require a strict correspondence between the two objects being matched, or at least between their subparts. Algorithms that solve these problems for general graphs are exponential in the worst case. The second category comprises inexact matching methods, where a matching can occur even if the two graphs being compared are structurally different, relaxing the given constraints to some extent [3]. Our case can be seen as a mixture of both categories: as in exact matching, we must not violate any of the constraints, but as in inexact matching, nonoptimal allocation sizes are permissible. Forgoing this optimality requirement allows us to provide an efficient algorithm for resource coallocation, which in practice delivers results that are reasonably close to the optimum. The problem described above can be very hard to solve in real grids, even with heuristic algorithms. As real-world grids may consist of thousands of machines, and we are planning to simultaneously allocate hundreds to thousands of jobs, the number of edges in the offers and requests graphs might be of the order of 10^6. Thus, even light heuristic algorithms which are linear in the product of the numbers of edges are almost useless when dealing with a computation time of O(10^12). To reduce the problem complexity, we propose a simplified version, which we call the clustered topology-aware coallocation problem (CTAAP). In this problem, the offered machines are aggregated into a relatively small number of
homogeneous clusters. Each cluster contains identical machines, interconnected by identical links. This formulation does not account for the differences between the machines in the clusters, but it significantly reduces the problem size. Unfortunately, even the reduced problem is still NP-complete, with no approximation available. In this paper we propose a new heuristic algorithm, the CTAAP-Solver, which solves the CTAAP problem. In our solution, we execute graduated assignment graph matching [4] once and use its output as a starting point for a greedy search procedure. During this greedy search, we repeatedly execute an algorithm for weighted bipartite graph matching, steering it towards a feasible solution that does not violate any constraint. This paper is organized as follows. Related work is summarized in Section 2. In Section 3 we discuss the problem definition and its intractability; in this section we also formalize the problem and give the details of our CTAAP-Solver algorithm and its complexity. Experimental results are given in Section 4.
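A minimal sketch of the aggregation step that turns a flat list of offered machines into homogeneous clusters with capacities; the attribute set (cpu, memory, cluster id) is an illustrative assumption of this sketch.

    from collections import defaultdict

    def cluster_offers(machines):
        # machines: iterable of (cpu, memory, cluster_id); identical machines of a cluster are grouped
        clusters = defaultdict(int)
        for cpu, mem, cid in machines:
            clusters[(cid, cpu, mem)] += 1
        # one entry per homogeneous cluster, with its capacity (number of available machines)
        return [{"cluster": cid, "cpu": cpu, "memory": mem, "capacity": cap}
                for (cid, cpu, mem), cap in clusters.items()]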
2 Related Work
Exact matching. Most of the algorithms for exact graph matching are based on some form of tree search with backtracking. The first important algorithm of this family was published by Ullmann [5] in 1976. Ullmann's algorithm is widely known and, despite its age, it is still widely used and is probably the most popular graph matching algorithm. A more recent algorithm for both isomorphism and subgraph isomorphism is the VF algorithm by Cordella et al. [6]. The authors define a heuristic that is based on the analysis of the sets of nodes adjacent to the ones already considered in the partial mapping. This heuristic is quick to compute, leading to a significant improvement over Ullmann's and other algorithms in many cases. However, the worst-case running time of Ullmann's algorithm is Θ(N!N^3), and that of the VF algorithm is Θ(N!N).
Inexact matching. Tree search with backtracking can also be used for inexact matching. In [7], the A* algorithm is used with a fast and simple heuristic that takes into account only the future cost of unmatched nodes. A radically different approach is to cast graph matching, which is inherently a discrete optimization problem, into a continuous, nonlinear optimization problem. One of the pioneering works of this approach is that of Fischler and Elschlager [8]. In [9], a new matching algorithm based on a probabilistic relaxation framework is proposed, which introduces the definition of a Bayesian graph edit distance. Gold and Rangarajan [4] presented the graduated assignment graph matching (GAGM) algorithm. In this algorithm a technique known as graduated nonconvexity is employed to avoid poor local optima [3]. However, the inexact matching algorithms that we are aware of do not guarantee that constraints will not be violated.
3 3.1
277
The Topology-Aware Coallocation Algorithm Topology-Aware Coallocation: Definition and Analysis
Our coallocation model assumes an "à la operating systems" scheduling system, meaning that the time axis is divided into discrete (potentially long and of varying length) time slots, and the decision about which processes to execute in a certain time slot is made repeatedly by an allocation management system. In this paper, we consider only a certain time slot in which the quantitative values of computing resources are assumed to be constant. Thus we ignore the time index in the following discussion. The mathematical model will be defined in the next subsection, while here we will discuss the intractability of the presented problem. Matching that does not account for link constraints – known also as bipartite graph matching – is a well-studied problem [10], with a variety of efficient (polynomial) solving algorithms [11], [12]. However, matching that takes into account the links between the nodes, which is a general graph-matching problem, becomes NP-hard. Even the simplified form of the problem defined above – the CTAAP, where offered computing machines are grouped into homogeneous clusters – is still NP-complete, with no approximation, even for a constant number of clusters. This can be shown by reducing the independent set (IS) problem to the CTAAP. IS is defined as follows:

– Input: Graph G = (V, E) and a positive integer k.
– Question: Is there a subset V′ ⊆ V of size k such that no two vertexes in V′ are joined by an edge in E?

The reduction is as follows: given a graph G of n vertexes, we will treat it as a requests graph. We will create an offers graph with two clusters, one of size k with no links between the nodes, and another cluster with n − k nodes, all of which are interconnected and also connected to all the nodes of the first cluster. There is a solution of CTAAP that can allocate all the requests to offers iff there is a solution to the IS problem. Not only is IS NP-complete [13], but, as was shown in [14], no polynomial time algorithm can approximate it within a factor of n/2^((log n)^(1−ε)) for any ε > 0, unless NP = ZPP. Given that fact, it is clear that CTAAP cannot be solved even approximately by polynomial time algorithms.
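The reduction can be made concrete with a few lines of code. The following C sketch builds the two-cluster offers side for a given graph G and target size k; the sample graph, sizes and struct names are made up for illustration, and the code only restates the construction above in executable form.

#include <stdio.h>

#define MAXN 64

/* Offers side of the constructed CTAAP instance (m = 2 clusters). */
struct offers {
    int cap[2];        /* capacities: k and n-k                              */
    int bhat[2][2];    /* cluster-to-cluster connectivity (1 = link, 0 = no) */
};

/* Given |V(G)| = n and the target independent-set size k, build the offers
 * graph used by the reduction; G itself (with unit node and link demands)
 * is used unchanged as the requests graph.                                 */
static struct offers reduce_is_to_ctaap(int n, int k)
{
    struct offers o;
    o.cap[0] = k;                        /* cluster 0: k isolated machines   */
    o.cap[1] = n - k;                    /* cluster 1: fully interconnected  */
    o.bhat[0][0] = 0;                    /* no links inside cluster 0        */
    o.bhat[0][1] = o.bhat[1][0] = 1;     /* cluster 1 reaches cluster 0      */
    o.bhat[1][1] = 1;
    return o;
}

int main(void)
{
    /* Requests graph: a 4-cycle with unit bandwidth on every edge; we ask
     * whether an independent set of size k = 2 exists.                      */
    int n = 4, k = 2, edges = 0;
    int b[MAXN][MAXN] = {{0}};
    b[0][1] = b[1][0] = b[1][2] = b[2][1] = 1;
    b[2][3] = b[3][2] = b[3][0] = b[0][3] = 1;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            edges += b[i][j];

    struct offers o = reduce_is_to_ctaap(n, k);
    printf("requests: %d tasks, %d unit-bandwidth links\n", n, edges);
    printf("offers: clusters of capacity %d (isolated) and %d (connected)\n",
           o.cap[0], o.cap[1]);
    printf("all tasks allocatable  <=>  G has an independent set of size %d\n", k);
    return 0;
}

Any allocation that places all n tasks must put exactly k of them into the isolated cluster, and those k tasks can carry no edge between them, which is precisely the independent-set condition.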
3.2 Clustered Topology-Aware Coallocation: Model Formalization
Specifying the topology request. The request for n tasks is presented as a graph GR = (V, E), where |V | = n and vi denotes a request for a single resource (machine). A positive vector C = [c1 , c2 , . . . , cn ] represents request properties, where ci denotes the minimal quantitative properties for the required computational resource vi (e.g. FLOPS). Different quantitative properties might be described by multiple property vectors. For example, if vector C represents the minimal
CPU requirements, then the minimal memory requirements of the n tasks are represented by the positive vector M = [m_1, m_2, ..., m_n]. The properties of edges e ∈ E are represented by an n-by-n adjacency matrix B, where b_ij refers to the connectivity level between a user's tasks v_i and v_j. Usually, this matrix is symmetric, and ∀i, 1 ≤ i ≤ n: b_ii = 0. Matrix B represents the communication bandwidth between tasks as estimated by the user.

Specifying the resource offer. Analogously, an offer of m clusters of identical machines is denoted as a graph Ĝ_O = (V̂, Ê), where |V̂| = m. The individual properties of the identical machines in the clusters are denoted as Ĉ = [ĉ_1, ĉ_2, ..., ĉ_m] (CPU) and M̂ = [m̂_1, m̂_2, ..., m̂_m] (memory). A capacity vector CÂP = [câp_1, câp_2, ..., câp_m] denotes the number of available machines in each cluster; an m-by-m adjacency matrix B̂ represents the edges' properties (e.g., the currently available communication bandwidth within and between the m clusters in the grid), assuming identical connectivity properties between all the machines in each cluster. In the offers graph there are usually self-loops. The self-loop of the node v̂_i denotes the connectivity level between the machines in cluster i: b̂_ii ≠ 0. The bandwidth could be estimated between two adjacent (physically connected) clusters or between two distant but connected clusters using maximum flow techniques.

The allocation matrix. We are interested in finding the n-by-m allocation matrix X, in which the term x_ij = 1 represents an allocation of a requested task v_i to an offered resource v̂_j. Several constraints must hold for a correct coallocation:

    ∀i, 1 ≤ i ≤ n:   Σ_{j=1}^{m} x_ij ≤ 1,                                    (1)
denoting that one requested task can be mapped to at most one offered resource;

    ∀j, 1 ≤ j ≤ m:   Σ_{i=1}^{n} x_ij ≤ câp_j,                                (2)

denoting that an offered cluster j can serve at most câp_j tasks;

    ∀i, j, 1 ≤ i ≤ n, 1 ≤ j ≤ m:   x_ij c_i ≤ ĉ_j  ∧  x_ij m_i ≤ m̂_j,        (3)

denoting that the individual (computation/memory) properties of a request must fit the properties of the matched offer;

    ∀i, k, j, l, 1 ≤ i, k ≤ n, 1 ≤ j, l ≤ m:   x_ij b_ik x_kl ≤ b̂_jl,         (4)

denoting that the pairwise (connectivity) properties of any pair of requests must fit the properties of the matched offers pair; and

    ∀i, j, 1 ≤ i ≤ n, 1 ≤ j ≤ m:   x_ij ∈ {0, 1},                             (5)
denoting that the decision is binary, where 1 indicates an allocation of a requested task v_i to an offered resource v̂_j. Different objective functions will express different "global welfare" schemas. Here we are interested in maximizing the system utilization:

    max  Σ_{i=1}^{n} Σ_{j=1}^{m} x_ij.
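For reference, the five constraints translate directly into a feasibility check. The following C sketch (array sizes and the toy instance in main are made up for illustration) verifies constraints (1)-(5) for a candidate allocation matrix; it illustrates the model above and is not part of the CTAAP-Solver implementation.

#include <stdio.h>

#define N 3   /* requested tasks  */
#define M 2   /* offered clusters */

static int feasible(int x[N][M],
                    int c[N],  int m[N],              /* request properties  */
                    int ch[M], int mh[M],             /* offer properties    */
                    int cap[M],
                    int b[N][N], int bh[M][M])
{
    for (int i = 0; i < N; i++) {                 /* (1) at most one offer   */
        int row = 0;
        for (int j = 0; j < M; j++) {
            if (x[i][j] != 0 && x[i][j] != 1) return 0;    /* (5) binary     */
            row += x[i][j];
        }
        if (row > 1) return 0;
    }
    for (int j = 0; j < M; j++) {                 /* (2) cluster capacity    */
        int col = 0;
        for (int i = 0; i < N; i++) col += x[i][j];
        if (col > cap[j]) return 0;
    }
    for (int i = 0; i < N; i++)                   /* (3) per-node properties */
        for (int j = 0; j < M; j++)
            if (x[i][j] && (c[i] > ch[j] || m[i] > mh[j])) return 0;
    for (int i = 0; i < N; i++)                   /* (4) pairwise bandwidth  */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < M; j++)
                for (int l = 0; l < M; l++)
                    if (x[i][j] && x[k][l] && b[i][k] > bh[j][l]) return 0;
    return 1;
}

int main(void)
{
    /* Three unit tasks forming a path, two clusters of capacity 2 each. */
    int c[N]   = {1, 1, 1}, m[N] = {1, 1, 1};
    int ch[M]  = {2, 2},    mh[M] = {2, 2},   cap[M] = {2, 2};
    int b[N][N]  = {{0,1,0},{1,0,1},{0,1,0}};
    int bh[M][M] = {{1,1},{1,1}};
    int x[N][M]  = {{1,0},{1,0},{0,1}};
    printf("allocation %s all constraints\n",
           feasible(x, c, m, ch, mh, cap, b, bh) ? "satisfies" : "violates");
    return 0;
}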
3.3 An Algorithm for Clustered Topology-Aware Coallocation
Our algorithm consists of three procedures. The first one finds a weights matrix X by executing a modified version of the graduated assignment graph matching [4] algorithm. Matrix X contains values between 0 and 1, which denote the "profitability" of each allocation X_ij. An extra row and column are added to hold the slack variables (this augmented matrix is denoted by X̂). By incorporating slack variables, the graph matching algorithm can handle outliers (spurious or missing nodes or links) in a statistically robust manner. As β is constantly increased, only one number in each row and up to câp_j numbers in each column approach 1, while all the others approach 0.

Input: vectors C, M, Ĉ, M̂, CÂP, matrixes B, B̂, edge compatibility function F(e_r, e_o) → R | e_r ∈ G_R, e_o ∈ G_O
Output: weights matrix X

β ← β_0;  X̂ ← 1 + ε;
while β ≤ β_f do
    while X does not converge AND #iterations ≤ I_0 do
        Q_ij ← (c_i > ĉ_j ∨ m_i > m̂_j) ? 0 : Σ_{k=1}^{N} Σ_{l=1}^{M} X̂_kl F(e_ik, e_jl);
        X̂_ij ← exp(β Q_ij);
        while X̂ does not converge AND #iterations ≤ I_1 do
            X̂_ij ← X̂_ij / Σ_{k=1}^{M+1} X̂_ik;                        // update X̂ by normalizing across rows
            X̂_ij ← min(1, X̂_ij / Σ_{k=1}^{N+1−(câp_j−1)} smallest_k);  // normalizing across columns,
                                                                        // smallest_k stands for the k-th smallest element in column j
        end
    end
    β ← β · β_r;
end
return X;

Algorithm 1. Step 1 – inexact graph matching

In the second procedure we address equations 1-3 and 5 only (i.e., computational and capacity constraints). Discarding equation 4, we have an instance of a weighted bipartite matching problem, modeled as follows: G′ = (V′, E′), where V′ = V ∪ V̂, and E′ = {(v_i, v̂_j) | v_i ∈ V ∧ v̂_j ∈ V̂ ∧ c_i ≤ ĉ_j ∧ m_i ≤ m̂_j}, where we use weights computed by procedure 1: w((v_i, v̂_j)) = X_ij. To solve the
maximum weighted bipartite problem, we use a slightly modified version of the LEDA implementation [15]. The resulting allocation suits constraints 1-3 and 5 but might violate constraint 4.

Input: vectors C, M, Ĉ, M̂, CÂP, weights matrix X
Output: allocation matrix A

A ← solve_maximum_weighted_bipartite_matching(C, M, X, Ĉ, M̂, CÂP);
return A;

Algorithm 2. Step 2 – weighted bipartite graph matching

In the last procedure, we have to make sure that no connectivity constraints were violated by procedure 2. To do so, we analyze all the allocation pairs (X_ij and X_kl), counting how many connectivity-violating allocation pairs each allocation X_ij appears in. If no connectivity violations were detected, the algorithm terminates. Otherwise, the "worst" allocation X_ij (the one that appears in the most violating pairs) is removed from the allocation matrix, from then on forcing X_ij = 0, and step 2 is repeated.

Input: allocation matrix A, vectors C, M, Ĉ, M̂, CÂP and matrixes B, B̂
Output: final allocation matrix

problem_cost ← ZERO_MATRIX;
curr_alloc ← {(i, j) | i ∈ {1..n} ∧ j ∈ {1..m} ∧ A_ij = 1};
forall (i, j) ∈ curr_alloc do
    forall (k, l) ∈ curr_alloc ∧ (k, l) ≠ (i, j) do
        if B_ik > B̂_jl then
            // Allocation violates edge constraints
            problem_cost_ij++;
            problem_cost_kl++;
        end
    end
end
if problem_cost is ZERO_MATRIX then
    return A;
else
    (i, j) ← index of the biggest number in problem_cost;
    A_ij ← 0;
    Go To Procedure 2;
end

Algorithm 3. Step 3 – cleanup
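As a concrete illustration, the cleanup step can be rendered as a short C routine that counts, for every current allocation, the number of connectivity-violating pairs it participates in and reports the worst one. The routine and the toy instance in main are illustrative only and not the authors' implementation.

#include <stdio.h>

#define N 3   /* tasks    */
#define M 2   /* clusters */

/* Returns the flat index i*M+j of the allocation appearing in the most
 * violating pairs, or -1 when no pair violates the bandwidth constraint. */
static int worst_allocation(int a[N][M], int b[N][N], int bh[M][M])
{
    int cost[N][M] = {{0}};
    int worst = -1, worst_cost = 0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            if (a[i][j])
                for (int k = 0; k < N; k++)
                    for (int l = 0; l < M; l++)
                        if (a[k][l] && !(k == i && l == j) && b[i][k] > bh[j][l])
                            cost[i][j]++;      /* this pair violates (4) */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            if (a[i][j] && cost[i][j] > worst_cost) {
                worst_cost = cost[i][j];
                worst = i * M + j;
            }
    return worst;
}

int main(void)
{
    /* Tasks 0 and 1 need bandwidth 2, but clusters 0 and 1 are linked with 1. */
    int a[N][M]  = {{1,0},{0,1},{0,1}};
    int b[N][N]  = {{0,2,0},{2,0,0},{0,0,0}};
    int bh[M][M] = {{2,1},{1,2}};
    int w = worst_allocation(a, b, bh);
    if (w < 0) printf("no connectivity violations\n");
    else       printf("drop allocation (task %d, cluster %d) and rerun step 2\n",
                      w / M, w % M);
    return 0;
}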
3.4 Algorithm Complexity
In order to analyze the complexity of the entire algorithm, we will analyze each one of its three steps. Here we will assume that the number of clusters in the grid is M and the number of jobs is N .
Step 1: the normalization across rows takes O(N^2 M^2), while the normalization across columns takes O(N^2 M^2 log N). The overall complexity of step 1 is O(N^2 M^2 log N).
Step 2: the constructed graph G′ has O(N + M) vertexes and O(NM) edges. Using the algorithm proposed in [15], the overall complexity of this step is O((N + M)^2 log(N + M) + NM(N + M)) = O(N^2 log(N + M) + N^2 M), assuming that N ≫ M.
Step 3: as there is a maximal number of N allocations, the analysis of the correctness of all given allocation pairs takes O(N^2) time.
Overall: In the worst case, steps 2 and 3 are repeated N·M times; thus the overall time complexity in the worst case is O(N^3 M log(N + M) + N^3 M^2). However, we expect the average performance to be much better. It is also important to note that the algorithm is polynomial in the number of clusters only, regardless of the number of actual machines in those clusters.
4 Experimental Results
In order to evaluate the performance of the CTAAP-Solver algorithm, we performed a series of experiments to estimate both its quality and speed. The results for the CTAAP-Solver algorithm are compared to the optimal results calculated by an integer programming technique. The integer programming model that we used is based on the five equations listed above. The fourth, quadratic equation was replaced by a series of equivalent linear equations. The system of five equations, including the modified equation 4, was fed into the integer programming solvers GLPK [16] and CBC [17], which provided an optimal solution. The following values for the constants were used in all the experiments: β_0 = 0.5, β_f = 10, β_r = 1.075, I_0 = 4, and I_1 = 30. Fig. 2 describes the results of the first experiment, in which the CTAAP-Solver algorithm results are compared with the optimal solution. The requests graph of 50 nodes has computing and network properties set to random values in the range of 1...100 (all the random numbers mentioned in this text have a uniform distribution in the given range). The offers graph consisted of 5 homogeneous clusters, each with a random capacity in the range of 1...11. In five successive tests, each composed of 100 independent runs, we increased the offered properties ranges from 1...100 to 1...200, then to 1...300, etc. The results depicted in Fig. 2 show that as the chances of a single request to be mapped to a single available resource increase, our algorithm performs better. The "range" itself is of no importance: a request can be mapped to an offer iff the offer's properties are not lower than the request's properties. Only the order of the requests and offers is important, and not the values themselves. In the first point of the graph, the chances of a single requested machine to be mapped to a specific available machine are 50.5% (both offer and request are integers, randomized in a range of 1...100). In the second test, these chances increase
Fig. 2. The success rate of the CTAAP-Solver algorithm
(as the offer is randomized in the range of 1...200, but the request is still randomized in the range of 1...100) and thus become 1/2 + 1/2 · 0.505 ≈ 75%, while in the third experiment they are 2/3 + 1/3 · 0.505 ≈ 83%, and so on. Another experiment, the results of which are given in Fig. 3, compares the runtime of the CTAAP-Solver algorithm with the runtime of one of the best open-source integer programming solvers – CBC2 [17]. In this experiment the size of the requests and offers graphs was constantly increased. The computing and network properties were random values in the range of 1...100. The offers graph consisted of 5 homogeneous clusters, each with a random capacity between 1 and 1/5 of the size of the requests graph. The computing and network properties of the offers were random numbers in the range of 1...200.
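The mapping probabilities quoted above are easy to verify by exhaustive enumeration. The short C program below computes P(offer ≥ request) for a request drawn uniformly from 1...100 and offers drawn uniformly from increasingly large ranges; it reproduces the 50.5%, 75% and 83% figures.

#include <stdio.h>

/* Exact probability that a uniform offer in 1..r_off satisfies a uniform
 * request in 1..r_req, computed by enumerating all pairs.                 */
static double p_mappable(int r_req, int r_off)
{
    long ok = 0;
    for (int req = 1; req <= r_req; req++)
        for (int off = 1; off <= r_off; off++)
            if (off >= req)
                ok++;
    return (double)ok / ((double)r_req * r_off);
}

int main(void)
{
    for (int scale = 1; scale <= 5; scale++)
        printf("offers in 1..%d: P(offer >= request) = %.1f%%\n",
               scale * 100, 100.0 * p_mappable(100, scale * 100));
    return 0;
}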
Fig. 3. Left: a comparison of the runtime of the CTAAP-Solver algorithm vs. CBC2 optimized exhaustive search. Right: the runtime of the CTAAP-Solver algorithm.
5 Conclusions
Here we have presented a new algorithm that provides a fast and efficient solution to the topology-aware coallocation problem. This algorithm is currently used as an important building block in the QosCosGrid scheduling system.
Acknowledgments. The work in this paper was supported by EC grant QosCosGrid IST FP6 STREP 033883.
References

1. Charlot, M., et al.: The QosCosGrid project: Quasi-Opportunistic Supercomputing for Complex Systems Simulations. Description of a general framework from different types of applications. In: Ibergrid Conference, Centro de Supercomputacion de Galicia (GESGA) (2007)
2. Kravtsov, V., Carmeli, D., Schuster, A., Yoshpa, B., Silberstein, M., Dubitzky, W.: Quasi-Opportunistic Supercomputing in Grids, Hot Topic Paper. In: IEEE International Symposium on High Performance Distributed Computing, Monterey Bay, California, USA (2007)
3. Conte, D., Foggia, P., Sansone, C., Vento, M.: Thirty Years of Graph Matching in Pattern Recognition. International Journal of Pattern Recognition and Artificial Intelligence 18(3), 265–298 (2004)
4. Gold, S., Rangarajan, A.: A Graduated Assignment Algorithm for Graph Matching. IEEE Trans. Pattern Anal. Mach. Intell. 18(4), 377–388 (1996)
5. Ullmann, J.R.: An algorithm for subgraph isomorphism. J. Assoc. Comput. Mach. 23, 31–42 (1976)
6. Cordella, L.P., Foggia, P., Sansone, C., Tortorella, F., Vento, M.: Graph matching: a fast algorithm and its evaluation. In: 14th Int. Conf. Pattern Recognition, pp. 1582–1584 (1998)
7. Gregory, L., Kittler, J.: Using graph search techniques for contextual colour retrieval. In: Joint IAPR Int. Workshops SSPR and SPR, pp. 186–194 (2002)
8. Fischler, M., Elschlager, R.: The representation and matching of pictorial structures. IEEE Trans. Computing 22, 67–92 (1973)
9. Myers, R., Wilson, R.C., Hancock, E.R.: Bayesian graph edit distance. IEEE Trans. Patt. Anal. Mach. Intell. 22, 628–635 (2000)
10. Lovasz, L., Plummer, M.D.: Matching Theory. Elsevier Science Publishing Company, New York (1986)
11. Blum, N.: A Simplified Realization of the Hopcroft-Karp Approach to Maximum Matching in General Graphs. Univ. of Bonn, Computer Science V, 85232-CS (2001)
12. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms, The Bellman-Ford algorithm, pp. 588–592. MIT Press and McGraw-Hill, New York, USA (2001)
13. Pardalos, P.M., Xue, J.: The maximum clique problem. Journal of Global Optimization 4(3), 301–328 (1994)
14. Khot, S.: Improved Inapproximability Results for MaxClique, Chromatic Number and Approximate Graph Coloring. In: Proceedings of the 42nd IEEE Symposium on the Foundations of Computer Science, p. 600, Washington, DC, USA (2001)
15. Mehlhorn, K., Naeher, S.: The LEDA Platform of Combinatorial and Geometric Computing. Cambridge University Press, Cambridge (1999)
16. Andrew, M.: GNU Linear Programming Kit, Version 4.22, http://www.gnu.org/software/glpk/glpk.html
17. Bonami, P., et al.: An Algorithmic Framework For Convex Mixed Integer Nonlinear Programs. Research Report RC 23771, IBM T. J. Watson Research Center, Yorktown, USA (2005)
View-OS: A New Unifying Approach Against the Global View Assumption

Ludovico Gardenghi¹, Michael Goldweber², and Renzo Davoli¹

¹ Dept. of Computer Science, University of Bologna, Bologna, Italy
  {garden,renzo}@cs.unibo.it
² Dept. of Mathematics and Computer Sciences, Xavier University, Cincinnati, OH
  [email protected]
Abstract. One traditional characteristic of operating systems is that all the processes share the same view of the environment. This global view assumption (GVA) means that for processes running on the same computer, the same pathname points to the same file, the processes share the same network stack and therefore the same IP addresses, the routing characteristics are identical, etc. There have been many proposals for “bending” the GVA for either individual processes or for the system as a whole. Some of these proposals include microkernels or specialized virtual machines. Most proposals are for system administrators, others are tailored to specific applications. A View-OS is our unifying solution for altering the GVA. It allows a user to partially or completely redefine the behavior of an arbitrary subset of the system calls called from his processes, thus altering his view of the environment in terms of file system, communication, devices, access control etc. We have implemented it with a system-call, partial, modular virtual machine called *MView. Each divergence from the standard view may be implemented in a specific module. Hence instead of always having to load a complete kernel (e.g. Usermode Linux), the overhead of a per-process definition of the environment depends on the degree of divergence from the standard global view.
1 Introduction: A Change in Perspective
Modern operating systems make processes run inside “sandboxes” that isolate them and provide protection with respect to other processes and resources. When a process needs a system service (e.g. more memory, new processes, communication with existing ones, access to the file system or to the network, I/O with devices) the boundary of these sandboxes is crossed via a system call. Hence from the perspective of a process, the set of system calls made available by the operating system is the only “window” through which a process can “see the
world” beyond itself. All possible interaction between processes and resources is mediated through this facility. Actually, other kinds of communication are available. For instance, in a UNIX-like system, processes may use shared memory or send/receive signals. Shared memory allocation, as well as signals, however, are only made available through system calls. Similarly, the structure and content of file systems as well as communication with devices, other processes, and other hosts can be defined exclusively in terms of “what responses are given to the process by system calls.” This set of answers represents the view of the world for the process. All processes whose system calls are serviced by the same kernel, therefore, share the same view—the global view assumption, or GVA. Consider the following: a process running on system A has all its system calls answered by the kernel running on system B. This process would work the same as if it were really running on B: it would see B’s file system, its network address and routing, and so on. The view is provided to the process by system B’s kernel. We define a system where each process is allowed to create its own unique view or perspective as a View-OS [1] [2]. In a View-OS each process makes use of the services and resources offered by the kernel to define its own process-specific view. For instance, an individual view may contain:

– A modified view of the file system from that shown by the kernel. This might include added, removed or changed subsets, different permissions, or be physically deployed on remote hosts.
– New network interfaces with ad-hoc addresses, hidden addresses, or different routing and filtering rules.
– More generally, a different set of visible and accessible devices, each with its own set of permissions and semantics. Notably, both the permissions and semantics may be different from those provided by the kernel.

This is accomplished in a View-OS by providing each user-mode process with the capability to redefine system call behavior or even to define new system calls.
1.1 Security Issues
A View-OS, like any traditional global view operating system, must insure that there is no danger to system security. Process views cannot be constructed without regard to security issues. A process can build a view only by relying on the set of resources the kernel would “normally” make available to it. No global changes to the system can be made while inside a personal view if those changes are not allowed under normal conditions. Key to the View-OS concept is that a process could however see local changes as if they were global ones. Some examples. Disk image mounting. Assume a user owns a disk image file. This user is able to read its contents. Mounting it as a file system is nothing but interpreting the contents according to some file system structure. In a View-OS, the user could mount the image inside a personal view of the file system namespace.
Remote file systems. Similarly, a user may want to connect a remote subtree from another host to the local file system namespace, using some network transport. As long as the user has enough privileges on the network and on the remote subtree, there is no reason for not allowing the user to extend her personal perspective of the local host with a new portion of file system. The remainder of this paper is as follows. In section 2 we provide an overview of tools, models and architectures that have influenced View-OS, pointing out similarities and differences between them and our project. In section 3 we expand on the basic ideas of a View-OS, focusing on the fundamental concepts of the proposal. Section 4 closes the paper with some observations on using our *MView View-OS in the field, our conclusions and future directions for this project.
2 Other Models and View-OS
In this section we present a comparison between View-OS and a selection of tools, models and techniques for virtualization. In particular our goal is to illustrate how specific aspects of each such tool/technique can be captured by a View-OS. Virtual machines [3] [4] (VMs) are the oldest[5] and most used tools able to change the perspective of processes. Processes running inside a virtual machine have the same view they would have if running on a different system. VMs may virtualize physical architectures[6] [7] [8] [9] [10] [11] [12] [13] [14] or abstract ones[15], at various levels (typically a whole hardware architecture is emulated, thus allowing the user to install and run an operating system and all its applications). The important point here is that the perspective is completely changed with respect to the real, underlying one; moreover, it is shared by all the processes inside the VM. A View-OS allows a greater degree of flexibility: if needed, the perspective change may affect only a subset of the processes and then only a subset of their views. Moreover, a View-OS offers a lighter approach when it is not necessary to boot an entire operating system. In a paravirtualization, the VM monitor is a light layer called the hypervisor. As with virtual machines, each virtualized system (or domain) needs to boot a kernel; the hypervisor just provides scheduling between domains. The shared devices are managed by a specific, privileged “Domain 0.” Paravirtualization is the key idea of Xen[16]. Xen provides good support for device drivers, inherited from those running in Domain 0. Since Xen implements entire virtual machines, the VM management must be done by root. As with “classical” virtual machines, a View-OS can be used to overcome the issues addressed by paravirtualization while sharing the concept that device drivers and existing applications can be inherited from existing systems or lower (possibly virtual) layers. In microkernel systems, requests made to the kernel are sent as messages to specific servers. Each server is responsible for a specific task (e.g. file system management, network, memory). In a typical microkernel each server manages requests for every process, thus giving the usual global view. However, it is possible to have different servers for different groups of processes, thus creating
more than one perspective. This is where microkernels and a View-OS are similar. There are, however, two very important differences.

– As stated by the name, a microkernel is a kernel architecture. Its aim is the same as a monolithic kernel: providing services to processes. Using a pool of servers instead of a single, bigger process is only a matter of cleaner design and failure isolation. Usually, processes are not able to start new, personal servers. Server management is a privileged action that can be only performed by the kernel.
– Microkernels come as completely new and different operating systems. They have their set of system calls, their device drivers, their ABI, etc. This is a serious obstacle toward the effective usability of microkernel prototypes: the usually limited hardware compatibility and the need to port existing applications to the new system (or to write new ones from scratch) greatly reduce the user base.

Alternatively, View-OS concepts, as demonstrated by our working prototypes, may be implemented gradually on existing systems, thus keeping binary compatibility with existing applications and relying on existing operating systems for device drivers. The Exokernel [17] architecture is an attempt to loosen the tight link between interface and implementation of the various abstractions given by the operating system. Its approach is to move the physical resource management into user space (or application level) and provide a low-level interface from a minimal kernel to “untrusted” libraries. A View-OS also may allow this. Provided that a user process has enough authorizations to access a device in raw mode, it may exploit View-OS by inserting a personal, custom driver between itself and the device. The View-OS approach adds these capabilities on top of an existing kernel, allowing the simultaneous usage of some kernel and some user device drivers, together with a gradual migration from the former to the latter depending on the availability of new user-level drivers. Plan 9 [18] was a very important research project by Bell Labs. In Plan 9 each process has its own name space. A process can change, add, or remove entities from its name space without affecting other processes. Everything in Plan 9 is accessible through names: networking, file system, GUI windows, etc. Thus a change in the name space implies a change in the process view. Unfortunately, Plan 9 has its own kernel, thus the number of supported architectures and available device drivers is quite limited. Moreover, since Plan 9 has a very different system model from other modern operating systems, the porting of applications from other operating systems to Plan 9 is often difficult. A View-OS provides processes with some of the features of Plan 9 while retaining compatibility with standard (Linux) kernels and applications. System Call Interposition. This technique is used to monitor every system call generated by a process. Its main goal is to create a sort of “jail,” “sandbox,” or “guarded environment” for process execution, and to keep track of potential malicious activities by processes. System Call Interposition is often used only to deny dangerous operations (e.g. access to sensitive data). Blocking or denying
system calls becomes a trivial sub-case of system call redefinition and, thus, easy to implement with a View-OS. FUSE [19] is a mechanism, available in recent Linux releases, that allows the kernel to rely on user-mode programs for file system support. It suffers from the GVA, as file systems are mounted with the usual mount semantics and affect the whole system. For this reason, this special purpose virtualization is commonly restricted from users or allowed with specific limitations. A View-OS provides the same features as FUSE, with the additional ability of limiting the visibility only to a subset of processes, and allowing regular users to mount real, virtual, local or remote file systems with no interference between each other. Moreover, our implementation has source-level compatibility with existing FUSE modules. “Minor” partial virtualities. There are a number of classic and well-known tools from UNIX-like systems that are good examples of specific applications which modify the GVA using different techniques: chroot, a system call operating on the filesystem namespace; fakeroot, a user-space access control hack based on dynamic library preloading; /dev/tty and /proc/self, files with a different meaning for each process. This enumeration, while far from complete, is hopefully representative. There are also many partial, ad-hoc tools which are neither interoperable nor integratable. Regardless, our goal is to show that a View-OS is a very strong step toward providing a unified framework for discussing and implementing various methodologies and strategies for modifying the GVA.
3 View-OS: Relaxing the Global View Assumption
UMView (and KMView) are proof-of-concept prototypes of a View-OS. In this section we describe a View-OS—its goals and capabilities—in more general terms. We will, nevertheless, refer to *MView when we wish to provide a practical example of a given concept. All the software has been released under the GPL free license and is available in the standard Debian distribution. The Virtual Square wiki [20] provides access to the software, technical documents and examples. A View-OS allows users to change the perspective of their processes. The basic idea is to redirect each system call to a monitor, or hypervisor. The running process does not have immediate access to the “real” services given by the kernel through system calls; each system call request is “intercepted” and checked by the hypervisor. The hypervisor then decides on one of two behaviors depending on the specific system call and on its parameters. If the hypervisor decides that the system call refers to an unmodified part of the perspective/view, it just asks—or makes the process ask—the underlying level to execute the call. A View-OS, defined in this manner, is a natural fit for nesting, as its interface is the same toward both its upper and lower levels. For this reason, letting the process run the system call “as it is” may mean asking the real kernel to execute the request, or asking a lower View-OS instance to check for a possible change in perspective. From the point of view of the hypervisor, there is no difference between these two cases.
On the other hand, the hypervisor may elect to “trigger” and implement a change in some portion of the view of a process. This may mean executing an existing system call with altered semantics or the execution of a new system call not supported by the underlying level. For example the View-OS may implement a new system call to open a disk image file, read its content, parse it according to a specific file system format, and let the calling process believe it is accessing a real disk with real files and directories. A View-OS allows a user to “boot” a minimalist “kernel” and configure it to manage different file systems, network stacks and services other than the real ones (i.e. the ones made available by the kernel used to boot the machine). The same may also apply to device drivers: the underlying kernel exports a raw view of a device and the hypervisor takes care of using the correct driver for it. This is very similar to the microkernel and exokernel approaches, as described in section 2. What the View-OS concept changes is that the microkernel and exokernel approaches, in addition to the other approaches previously described, are no longer mutually exclusive. Depending on the specific issue, one can choose a more monolithic approach or a more modular one. Both the system administrator and the user may cooperate and have more flexibility in choosing the best option for each service, basing their decision on performance, security needs, and software availability. For instance, if the current monolithic kernel does not yet provide a driver for a given device—it may be under development—the system administrator may choose to use a user-level driver via a View-OS hypervisor. As soon as a more official driver becomes available for the operating system kernel, it may be plugged in and used. Similarly, an inability to update the current kernel or the need for better performance may lead the administrator to delegate some services to the hypervisor or to the kernel. Our prototypes, *MView, are designed to work on a GNU/Linux system and are potentially able to work with every peripheral that is supported by the Linux kernel, allowing the user to run non-modified versions of GNU/Linux software. We denote this flexible approach as a millikernel. That is, this solution lies between the two extreme solutions: microkernels (everything but a minimalist message-passing engine must be outside the kernel) and monolithic kernels (everything that is not a user-created process lives inside the kernel). In the remaining part of this section we will describe some of the most promising View-OS areas of application.
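To make the interception idea more tangible, the following C sketch shows one common way to interpose on system calls under Linux: a monitor process traces a child with ptrace(2) and regains control at every system-call entry. This is only an illustration of the mechanism discussed above, restricted to x86_64 and merely printing system-call numbers; it is not necessarily how *MView implements its hypervisor.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* ask to be traced         */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }
    int status;
    waitpid(child, &status, 0);                  /* stop at the initial exec */
    int entering = 1;
    while (1) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run to next syscall
                                                        entry or exit stop   */
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;
        if (entering) {                          /* system-call entry stop    */
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            /* A View-OS style hypervisor would decide here whether to let
             * the call through or to virtualize it; we only report it.      */
            printf("syscall %lld\n", (long long)regs.orig_rax);
        }
        entering = !entering;
    }
    return 0;
}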
3.1 Security
A View-OS allows one to implement the required granularity for one of the most theoretically important security principles: the principle of the minimum privilege. At present, UNIX-like systems provide two main authorization mechanisms. The first (and older) one is the usual owner/group permissions system. Recently, the privileges traditionally associated with the superuser have been split into
distinct units known as capabilities, which can be independently enabled and disabled. Capabilities may include the ability to bypass some permission checks on the file system, to kill other users’ processes, to invoke privileged network operations, to change attributes and priority for a process, and so forth. While this mechanism tries to capture the need for greater granularity, it is not useful for regular users, as capabilities only refer to global administrative activities. Nonadministrative users are not able to define their own capabilities nor associate them with their processes. With networking, the security situation is even worse. A given system has a fixed network stack and a fixed set of network interfaces. There are no useful flexible control mechanisms that can be used to allow or deny network-related operations to single users or processes. If a network interface is up, everyone can see it and everyone can open ports and listen for packets on it.¹ These UNIX authorization mechanisms do not comply with the minimum privilege principle, both for users and administrators. A View-OS allows an administrator to give a process exactly the minimum set of (file, network, . . . ) resources that are necessary to complete its task. An administrator may want to refine capabilities for a certain process. For instance, she may decide that a given process may ignore file system permissions but only on a subtree. Users may be given personal TCP/IP addresses, so it becomes easier to apply shaping or filtering rules. While this is currently possible, it requires keeping track of every TCP connection and every single UDP packet. Quite often, users would like to separate and isolate different groups of activities (e.g. work, leisure, experiments). A game should not be able to access e-mail files/directories. If a user wants to try a new, unstable application, she also wants to protect her data from accidental deletions. A View-OS addresses all these problems by allowing users and superusers to describe a minimal set of resources around every process. Finally, with a View-OS, many, if not all, of the operations that require “set user ID” executables (e.g. the mount command) may be converted to regular (non-suid) operations. Hence one can remove many potential security holes: executables run by regular users but with superuser privileges.
3.2 Flexibility
Dealing with malicious or broken software is not the only field where a View-OS proves useful. The open-ended flexibility of a View-OS allows users to build their world around themselves. Our unification approach also helps in making different virtualizations cooperate and integrate. Transformations away from the global view can be relative to the file system, network, devices, and a group of other less frequently used components of a process view.
¹ There are some packet filtering infrastructures such as IPTables which allow the system administrator to enforce various policies but, typically, the granularity on users and processes is too coarse.
While we have described the individual virtualizations made possible by an implementation of a View-OS, it is instructive to consider more complex View-OS applications. The aim is to give an idea of how one is able to combine different, simple, but interoperable virtualizations to create useful structures. It is worth pointing out that View-OS was designed as part of a bigger virtualization framework, named Virtual Square [21] [22]. Also part of this framework is VDE [23], a virtual networking tool able to connect virtual and real machines together. A VDE is often used to combine different View-OS instances and link them to real networks.

A remote encrypted file system. Let’s assume that a (very paranoid) user keeps a Second Extended disk image on her home computer and that a portion of this file system is encrypted using EncFS. This user may want to access this disk image while on another computer. The traditional approach would require her to copy the whole disk image onto the local host and then mount it with superuser privileges via the loopback device. With a View-OS she may combine three different modules: a ssh module to reach the remote file system on her home machine; an ext2 module to mount the disk image; and, finally, an encfs module to decrypt its content. None of these operations requires root access, and none of the other local users can see the contents of the disk image.

Partitioning images and devices at user level. The basic idea is to let a user manage her disk images (or her removable storage devices) entirely in user space. In the case of removable media, we assume that the kernel grants exclusive read/write privileges on the device to that user. A specific *MView module allows one to see a file or device as a disk, to partition it (using the usual Linux tools) and to mount the new partitions entirely at user level. The kernel does not have to support every single (strange) kind of file system that any user may want to use, but simply delegates the parsing of the raw device content to another layer, i.e. a View-OS.

Per-process IP stack or address. Not only IP addresses, but even network stacks may be assigned on a per-process basis. This is useful to test new, experimental implementations or to use different optimizations and tuning for different kinds of applications. The combination of VDE with a network stack module for the View-OS level makes this possible. The increased level of isolation and granularity makes mobility and server reorganization much easier, as a single daemon may have its own specific IP address and can keep it even if it has to be moved onto a different physical host.
3.3 Fast Prototyping
A key advantage of the View-OS approach to virtualization is the possibility to create very light environments with single, focused/specific alterations to the global view. This makes a View-OS very useful every time a designer has to build and test prototypes for applications or protocols and doesn’t want (or can’t) alter the configuration of the running system. Usually, whole-system virtual machines
have to be created, configured and, if needed, interconnected. This “heavy” approach can often be lightened using a View-OS. Examples illustrating how a View-OS allows for fast and lightweight prototyping include copy-on-write for configuration files, providing a light framework for testing network protocols, verifying the effectiveness of new system calls, and the testing of new file system implementations.
4 Conclusions
The founding idea behind a View-OS is the change in perspective that we made while examining a system. Instead of focusing on the operating system and its kernel, which sees processes as a uniform set of objects, we consider a process as the main actor that makes use of operating system services as a way to build up its own perspective. The term we use to denote the classical way of looking at a system is GVA (global view assumption). The goal of a View-OS is to relax this very limiting approach. Furthermore, we endeavored to provide a unified approach encompassing different techniques, concepts, and paradigms, both in operating systems architectures and virtual machine design, toward our goal of relaxing the GVA. Instead of using tool A to relax one portion of a view, and tool B (which might not even interoperate with A) to relax a different portion of one’s view, one need only use a View-OS—a unified paradigm for relaxing any portion (or portions) of a view, all accomplished in user-mode. UMView (and KMView) are working prototypes of a View-OS. UMView is implemented as a System Call Virtual Machine and allows regular Linux programs to run on regular Linux kernels, with the added benefit of allowing processes to create their own personal perspective/view by themselves—no superuser intervention is needed. We believe that having a working, usable implementation is as important as creating a good model. It is a future goal not only to speed up the performance of *MView, but to meaningfully measure the overhead these View-OS implementations impose. It is a non-trivial task to simply determine what to measure for this. We can report that an “empty” KMView environment, i.e. one with no perspective-altering modules loaded, yielded an ad hoc measured 20% loss of performance with respect to an unaltered system. This overhead, though not so low, is quite encouraging considering that KMView is a proof-of-concept prototype and not yet intended as a production environment. Another future goal is to explore the educational potential of a View-OS. If one believes that the best way to learn about something is to build one, then implementing View-OS modules becomes an excellent educational pursuit. Since everything is executed in user mode, students can safely explore system behavior by redefining it in every conceivable way: coherent or not, safe or unsafe, without compromising the integrity of the rest of the system.
References

1. Davoli, R.: The View-OS project, http://www.sf.net/projects/view-os
2. Davoli, R., Goldweber, M., Gardenghi, L.: UMView: View-OS implemented as a system call virtual machine. In: 7th Usenix Symposium on Operating Systems Design and Implementation, Poster Session, Seattle, WA (November 2006)
3. Smith, J., Nair, R.: Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann, San Francisco (2005)
4. Smith, J.E., Nair, R.: The architecture of virtual machines. IEEE Computer 38(5), 32–38 (2005)
5. Adair, R., Bayles, R., Comeau, L., Creasy, R.: A virtual machine system for the 360/40. Technical report, IBM Cambridge Scientific Center Report 320-2007, Cambridge, Mass. (May 1966)
6. Bellard, F.: Qemu, a fast and portable dynamic translator. In: USENIX 2005 Annual Technical Conf., FREENIX Track (2005)
7. Bartholomew, D.: Qemu: a multihost, multitarget emulator. Linux J. (145) (2006)
8. Qumranet Inc.: KVM: Kernel-based virtualization driver (2006), http://kvm.qumranet.com
9. Gavare, A.: GXemul project, http://gavare.se/gxemul/
10. Microsoft (formerly from Connectix): Virtual PC, http://www.microsoft.com/windowsxp/virtualpc/
11. VMware, Inc.: VMware, http://www.vmware.com/
12. Lawton, K.: Bochs project home page, http://bochs.sourceforge.net
13. Biallas, S.: PearPC project, http://pearpc.sourceforge.net
14. Morsiani, M., Davoli, R.: Learning operating system structure and implementation through the MPS computer system simulator. In: Proc. of the 30th SIGCSE Technical Symp. on Computer Science Education, New Orleans, pp. 63–67 (1999)
15. Goldweber, M., Davoli, R.: The Kaya project and the μMPS hardware emulator. In: Proc. of ITiCSE 2005, Conf. on Innovation and Technology in Computer Science Education, Lisbon (2005)
16. Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Pratt, I., Warfield, A., Barham, P., Neugebauer, R.: Xen and the art of virtualization. In: Proc. of the ACM Symp. on Operating Systems Principles (October 2003)
17. Engler, D.R., Kaashoek, M.F., O’Toole, J.: Exokernel: an operating system architecture for application-level resource management. In: SOSP 1995: Proc. of the 15th ACM Symposium on Operating Systems Principles, pp. 251–266. ACM Press, New York (1995)
18. Pike, R., Presotto, D., Dorward, S., Flandrena, B., Thompson, K., Trickey, H., Winterbottom, P.: Plan 9 from Bell Labs. Computing Systems 8(3), 221–254 (Summer 1995)
19. Szeredi, M.: FUSE: filesystem in user space, http://fuse.sourceforge.net
20. Davoli, R.: Virtual square wiki page, http://wiki.virtualsquare.org/
21. Davoli, R.: Virtual square. In: Proc. of OSS2005, Open Source Software 2005, Genova (2005)
22. Davoli, R., Goldweber, M.: Virtual square in computer science education. In: Proc. of ITiCSE05, Conf. on Innovation and Technology in Computer Science Education, Lisbon (2005)
23. Davoli, R.: VDE: virtual distributed ethernet. In: Proc. of Tridentcom 2005, Trento (2005)
Evaluating Sparse Data Storage Techniques for MPI Groups and Communicators

Mohamad Chaarawi and Edgar Gabriel

Parallel Software Technologies Laboratory, Department of Computer Science,
University of Houston
{mschaara,gabriel}@cs.uh.edu
Abstract. In this paper we explore various sparse data storage techniques in order to reduce the amount of memory required for MPI groups and communicators. The idea behind the approach is to exploit similarities between the objects and thus store only the difference between the original process group and the resulting one. For each technique, we detail the memory saved compared to the currently used implementations, and present a runtime decision routine capable of choosing dynamically the most efficient technique for each scenario. Furthermore, we evaluate the performance impact of the new structures using point-to-point benchmarks as well as an application scenario over InfiniBand, Myrinet and Gigabit Ethernet networks.
1 Introduction
The memory footprint of a process running the MPI equivalent of ’hello world’ can reach tens of Megabytes on today’s platforms. Some of the factors contributing to the large memory footprint are related to optimizations within the MPI library code base, such as using statically allocated memory or having many different code paths in order to optimize a particular operation. The larger fraction of the memory utilized by an MPI library is however allocated dynamically at runtime and depends on system parameters such as the network interconnect and application parameters such as the number of processes. While reducing the memory footprint of an MPI process was not considered a high priority for a while, recent hardware developments force us to rethink some concepts used within communication libraries. Machines such as the IBM Blue Gene/L [4] have the capability to run MPI jobs consisting of more than 100,000 processes. At the same time, each node only has 512 MB of main memory available, leading to 256 MB for each MPI process. A similar problem occurs on commodity clusters due to the increasing number of cores per processor [2], giving the end-users the possibility to run parallel jobs consisting of a large number of MPI processes, while at the same time the main memory per core remains constant at best. For platforms facing the problem outlined above, an MPI library should avoid internal structures having a high dependency on the number of MPI processes. In this paper we are exploring various sparse data storage techniques in
order to reduce the amount of memory required for MPI groups and communicators. The idea behind the approach is to exploit similarities between the objects. Instead of storing the entire list of processes which are part of a new group or communicator, the approach presented in this paper stores only the difference between the original communicator and the resulting one. These techniques might not only be relevant for a very large number of processes, but will also be beneficial for applications having a moderate number of processes that, however, generate a very large number of communicators [1]. Much of the work on optimizing memory usage within MPI libraries has focused so far on the networking layer. Panda et al. show in [12] the benefits of using the Shared Receive Queue features of InfiniBand in order to reduce the memory usage of large scale applications. Shipman et al. [11] introduce a new pipelining protocol for network fabrics dealing with registered memory, which further reduces the memory utilization for these network interconnects. The work done in [5] focuses on controlling the number of unexpected messages a process can handle in order to limit the memory usage of the MPI library. The remainder of the paper is organized as follows: Sec. 2 briefly discusses the current implementation of groups and communicators in Open MPI, and presents three different sparse data storage techniques. Sec. 3 evaluates the performance impact of the techniques detailed in the previous section using point-to-point benchmarks as well as the High Performance Linpack (HPL) benchmark. Finally, Sec. 4 summarizes the paper and presents the currently ongoing work in this area.
2 Alternative Storage Formats for Groups and Communicators
In most MPI libraries available today, each MPI group contains the list of its member processes. In Open MPI [6], each entry of the list is a pointer to the corresponding process structure, while in MPICH2 [8] the corresponding list contains the ranks of the processes in MPI_COMM_WORLD. The position in this array indicates the rank of the process in this group. An MPI communicator typically contains pointers to either one MPI group for intra-communicators, or two groups for inter-communicators. While this approach guarantees the fastest access to the process structure for a communication – and thus minimal communication latencies – the information stored in those arrays is often redundant. For example, in case a numerical library creates a duplicate of MPI_COMM_WORLD in order to generate a unique communication context, the communicator structure will contain a process list whose information is redundant with that of the original communicator. For a 100,000-process job, this list will take 8 × 100,000 bytes of memory, assuming that the size of a pointer is 8 bytes. Therefore, three alternative storage formats have been evaluated in this study in order to minimize the redundant information between different process groups and thus minimize the memory footprint of the corresponding structures. In order to evaluate the benefits and disadvantages of these storage formats, we have
implemented all formats in Open MPI. For this, the group structure and the group management functions had to be adapted. The following subsections give some details on each format.

PList Format: The PList storage format is the original storage format containing a list of pointers to the process structures of the group members. The implementation of this format is unchanged compared to the original version.

Range Format: For this format, the included processes in a group are described by ranges of process ranks, e.g. having n consecutive processes starting from rank r. The syntax of this storage format has been derived from the MPI_Group_range_excl/incl functions in MPI [9]. Thus, the group structure in Open MPI has been extended by a list holding the required number of <base rank, number of processes> pairs in order to describe the members of the new group. The base ranks stored in the group-range list correspond to the ranks of the processes in the original group. Thus, the group structure also needs to store a pointer to the original group and increase the reference counter of that object accordingly. While this storage format can be applied to any group/communicator, it will be most memory efficient if the list of ranks included in the new process group can be described by a small number of large blocks.

Strided Format: In some cases, the included processes in a group follow a regular pattern, e.g. a new group/communicator includes every n-th process of the original group. Three integers are required in the group structure in order to support this format: grp_offset contains the rank of the process where the pattern starts; grp_last_elt is the rank of the last process in the pattern; grp_stride describes how many processes from the original group have to be skipped between two subsequent members of the new group. The group creation function for strided groups can automatically determine all three parameters. In case no regular pattern can be determined, the strided group creation function will indicate that it cannot be used for this particular process group. Similarly to the range format, a pointer to the original group is required to be able to determine the process structures.

Bitmap Format: The main idea behind this storage format is to use a bit-array of the size of the original communicator/group. The bit at position i indicates whether the process with rank i in the original group is a member of the resulting group or not. The main restriction of this storage format is that the ranks of the included processes in the new group have to be monotonically increasing in order to be able to uniquely map the rank of a process from one group to another group.
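To make the four variants more concrete, the following C sketch shows one possible group descriptor holding them in a tagged union, together with the strided rank translation. All type and field names are invented for this illustration and do not reflect Open MPI's actual group structure; in particular, the sketch assumes that grp_stride denotes the distance between consecutive member ranks in the parent group.

#include <stdint.h>
#include <stddef.h>

struct proc;                               /* opaque per-process structure     */

enum grp_mode { GRP_PLIST, GRP_RANGE, GRP_STRIDED, GRP_BITMAP };

struct grp_range { int base_rank; int nprocs; };   /* <base rank, count> pair  */

struct grp {
    enum grp_mode  mode;                   /* flag: which storage is in use    */
    int            size;                   /* number of member processes       */
    struct grp    *parent;                 /* original group (sparse formats)  */
    union {
        struct proc      **plist;          /* dense list of process pointers   */
        struct {                           /* range format                     */
            struct grp_range *ranges;
            int               nranges;
        } range;
        struct {                           /* strided format                   */
            int offset, last_elt, stride;
        } strided;
        uint8_t           *bitmap;         /* one bit per parent-group member  */
    } u;
};

/* Rank translation for the strided format: rank r in this group corresponds
 * to rank offset + r*stride in the parent group (under the stride assumption
 * stated above).                                                             */
static inline int strided_rank_in_parent(const struct grp *g, int r)
{
    return g->u.strided.offset + r * g->u.strided.stride;
}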
2.1 Group Management Operations
In addition to the functionality outlined in the previous subsections, each storage format provides a function which estimates the amount of main memory required by this format, given the list of process members. Whenever a new group or communicator is created, these functions are queried in order to decide which of the storage techniques will be applied. Table 1 summarizes the formulas used to estimate the memory consumption of each format.

Table 1. Memory consumption of each storage format

    Storage format    Memory consumption
    PList             number of processes × sizeof(void *)
    Range             number of ranges × 2 × sizeof(int)
    Strided           3 × sizeof(int)
    Bitmap            number of processes / 8

The current runtime decision logic will then pick the storage format requiring the least amount of memory. A flag in the group structure indicates which storage format is being used for a particular group. The groups used by MPI_COMM_WORLD and MPI_COMM_SELF are always stored using the PList format. The most performance-sensitive functions with respect to group and communicator management are those returning a pointer to the process structure for a given tuple of group and rank, as well as those translating ranks between two groups,
e.g. in case one group has been derived from the other group. Unfortunately, we cannot detail the corresponding formulas in this paper due to space limitations; please refer to [3]. The most severe restriction of the current approach is that, as of now, all storage formats assume that a group is derived from a single parent group. This is not the case for the MPI_Group_union and MPI_Group_difference operations as well as for MPI_Intercomm_merge. For these functions, the implementation will automatically fall back to the PList format. For similar reasons, the current implementation does not support inter-communicators.
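The selection logic summarized in Table 1 can be sketched as a small routine that evaluates the estimate of each format and returns the cheapest one. The function below is a hypothetical stand-in for the per-format estimation functions described above, not Open MPI's actual decision code; note that the bitmap estimate is based on the size of the parent group.

#include <stdio.h>
#include <stddef.h>

enum grp_format { FMT_PLIST, FMT_RANGE, FMT_STRIDED, FMT_BITMAP };

static const char *fmt_names[] = { "PList", "Range", "Strided", "Bitmap" };

/* new_nprocs: members of the new group; parent_nprocs: size of the original
 * group; nranges: contiguous rank blocks; strided_ok: whether one
 * <offset, stride> pattern describes all members.                           */
static enum grp_format pick_format(size_t new_nprocs, size_t parent_nprocs,
                                   size_t nranges, int strided_ok,
                                   size_t *bytes)
{
    size_t cost[4];
    cost[FMT_PLIST]   = new_nprocs * sizeof(void *);
    cost[FMT_RANGE]   = nranges * 2 * sizeof(int);
    cost[FMT_STRIDED] = strided_ok ? 3 * sizeof(int) : (size_t)-1;
    cost[FMT_BITMAP]  = (parent_nprocs + 7) / 8;     /* one bit per parent rank */

    enum grp_format best = FMT_PLIST;
    for (int f = FMT_RANGE; f <= FMT_BITMAP; f++)
        if (cost[f] < cost[best])
            best = (enum grp_format)f;
    *bytes = cost[best];
    return best;
}

int main(void)
{
    size_t bytes;
    /* Every second process of a 100,000-process group: one stride pattern,
     * 50,000 members, and 50,000 single-process "ranges".                   */
    enum grp_format f = pick_format(50000, 100000, 50000, 1, &bytes);
    printf("chosen format: %s (%zu bytes)\n", fmt_names[f], bytes);
    return 0;
}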
3 Performance Evaluation
This section evaluates the performance implications of the various storage formats presented in the previous section. Two separate sets of tests have been conducted, namely a point-to-point benchmark in order to quantify the effect on latency, and an application benchmark. The machines used for the tests were the shark cluster at the University of Houston and the IBM BigRed cluster at Indiana University. Shark consists of 24 dual-core 2.2 GHz AMD Opteron nodes connected by a 4x InfiniBand and a Gigabit Ethernet network interconnect. BigRed, which was mainly used for the point-to-point benchmarks, consists of 768 IBM JS21 Blades, each having two dual-core PowerPC 970 MP processors, 8 GB of memory, and a PCI-X Myrinet 2000 adapter. Within the scope of this analysis we used up to 512 MPI processes on 128 nodes of BigRed.
3.1 Point-to-Point Benchmark
In order to evaluate the effect of the alternative storage formats for groups and communicators on the point-to-point performance of Open MPI, we created a new test within the latency test suite [7]. The basic idea behind the latency test suite is to provide building blocks for ping-pong benchmarks, such as different data type constructors, communicator constructors, or data transfer primitives. This allows users to set up their own point-to-point benchmarks, e.g. by mimicking a particular section of their applications. The new test case developed within this project creates a hierarchy of communicators. Starting from the processes in MPI COMM WORLD, the test excludes all odd-ranked elements of the communicator. Using the resulting communicator, the benchmark keeps creating new communicators, excluding the odd-ranked elements, until a communicator consisting of only one or two processes has been created. For each new communicator, a ping-pong benchmark is executed between the first and the second process in one case, and between the first and the last process in another case. An additional overhead on the communication latency is expected when executing with the sparse formats, since getting the actual process pointer for each data transfer operation requires some additional computation and lookup operations. This effect is expected to increase with the depth of the hierarchy of groups depending on each other.
Table 2. Results of the point-to-point benchmark running 48 processes over InfiniBand

          PList            Range            Strided          Bitmap
  Level   0-first  0-last  0-first  0-last  0-first  0-last  0-first  0-last
  0       3.7      3.7     3.7      3.7     3.7      3.7     3.7      3.7
  1       3.7      3.7     3.74     3.74    3.74     3.71    3.74     3.74
  2       3.7      3.7     3.74     3.79    3.74     3.74    3.74     3.79
  3       3.74     3.7     3.74     3.85    3.74     3.74    3.79     3.79
  4       3.74     3.7     3.85     3.85    3.8      3.8     3.8      3.85
  5       3.74     3.7     3.9      3.85    3.8      3.8     3.9      3.9
For each implementation of the groups (plist, range, strided, bitmap), the test was executed five times. The reported results show the minimum latency over all executed tests, i.e. the best achievable result on the corresponding cluster. Times are given in μs.

The results on shark over the InfiniBand network interconnect are shown in Table 2. The level of each communicator, shown in the first column of the table, indicates the number of indirections required to look up the process structure. For level 0 (= MPI COMM WORLD) the latency is independent of the storage format, since this communicator always uses the PList format. Furthermore, there is no performance difference for level 0 whether the ping-pong benchmark is executed between the first and the second process, or between the first and the last process of the communicator. As expected, the latency is mostly constant for the original PList format, and thus independent of the communicator used. For the other formats, the latency does increase with the level of the communicator, i.e. the number of indirections required to look up the process structure. In order to quantify the overhead, let us consider the highest overhead observed in our measurements, which adds 0.2 μs to the original latency. Accessing the process structure for that particular communicator level requires 5 indirections of the algorithm described in Section 2.1. Thus, the average overhead per level can be estimated to be up to 0.04 μs on this architecture.

For the bitmap and the range formats, we would also have expected to see a slight increase in latency when executing the ping-pong benchmark between the first and the last process, compared to the first and the second process. The reason is that the cost of the rank-translation algorithms for these two formats should increase with the rank being translated, since we have to walk linearly through the list of participating processes. However, due to the fact that our maximum job size is only 48 processes and that the number of processes decreases by a factor of two with each level of communicator, we could not observe the expected effect in these benchmarks. There are slight differences in the performance of the alternative storage formats, with strided being slightly faster than the bitmap and the range formats. The reason is probably the rank-translation algorithm, which only requires applying a simple formula for the strided format, compared to a slightly more complex algorithm for the other two sparse storage formats.
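The difference between the rank-translation algorithms mentioned above can be illustrated as follows. This is our own hedged sketch in Java, not Open MPI's implementation: it shows why the strided translation is a constant-time formula while the range and bitmap translations walk linearly through the stored description.

public class RankTranslation {

    /** Strided: parentRank = offset + rank * stride (constant time). */
    static int stridedToParent(int rank, int offset, int stride) {
        return offset + rank * stride;
    }

    /** Range: scan the (base rank, count) pairs until the rank falls inside one. */
    static int rangeToParent(int rank, int[][] ranges) {
        int remaining = rank;
        for (int[] r : ranges) {              // r[0] = base rank, r[1] = count
            if (remaining < r[1]) return r[0] + remaining;
            remaining -= r[1];
        }
        throw new IllegalArgumentException("rank not in group");
    }

    /** Bitmap: count set bits until the (rank+1)-th member is found. */
    static int bitmapToParent(int rank, boolean[] bitmap) {
        int seen = 0;
        for (int parentRank = 0; parentRank < bitmap.length; parentRank++) {
            if (bitmap[parentRank] && seen++ == rank) return parentRank;
        }
        throw new IllegalArgumentException("rank not in group");
    }
}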
Table 3. Results of the point-to-point benchmarks running 48 processes over Gigabit Ethernet

          PList            Range            Strided          Bitmap
  Level   0-first  0-last  0-first  0-last  0-first  0-last  0-first  0-last
  0       51.55    51.84   51.61    51.39   51.11    52.14   51.55    51.2
  1       51.61    52.45   51.65    52.34   52.09    52.95   52.05    52.8
  2       51.7     52.7    51.59    53.8    51.55    52.64   51.75    52.75
  3       51.09    52.3    51.45    53.19   51.4     52.4    51.15    51.86
  4       50.8     51.45   51       51.81   51.75    51.8    51.25    52.6
  5       51.7     51.5    51.86    51.4    51.55    51.14   51.61    51.7
Table 4. Results of the point-to-point benchmark running 512 processes over Myrinet

          PList            Range            Strided          Bitmap
  Level   0-first  0-last  0-first  0-last  0-first  0-last  0-first  0-last
  0       6.25     7.34    6.04     7.45    6.25     7.40    6.10     7.25
  1       6.25     7.40    6.29     7.65    6.29     7.44    6.29     9.05
  2       6.09     7.30    6.25     8.00    6.29     7.45    6.29     9.80
  3       6.20     7.30    6.35     8.00    6.29     7.39    6.35     10.14
  4       6.21     7.20    6.35     8.06    6.40     7.55    6.60     10.44
  5       6.25     7.25    6.45     8.30    6.40     7.55    6.75     10.50
  6       6.75     7.25    7.19     8.00    6.85     7.40    7.56     10.19
  7       6.75     7.25    7.20     8.05    6.89     7.55    7.95     10.05
  8       6.76     7.36    7.40     8.00    7.06     7.55    8.80     9.30
Table 3 summarizes the performance results on shark over Gigabit Ethernet. In our measurements, no performance effects could be observed that could be directly related to the sparse data storage formats used for groups and communicators. The reason is that the perturbation of the measurements over this particular switch was higher than the expected overhead, assuming that the overhead due to the different storage formats is in the same range as for the InfiniBand results presented previously.

Table 4 presents the results obtained for a 512 process run on 128 nodes of BigRed. In order to ensure that the same network protocol is used for all communicator levels, the 0-first tests have been modified such that the first MPI processes on the first two nodes are used for the first three communicators (levels 0, 1, and 2). First, we would like to analyze the results obtained using the plist format – the default Open MPI approach – on this machine. The results for this storage format are presented in the first two columns of Table 4. The most fundamental observation on this machine is that the latency shows a substantially larger variance depending on the nodes used for the ping-pong benchmark. Furthermore, there is a noticeable increase in the communication latency when executing the ping-pong benchmark between
rank 0 and the last process in the communicator, compared to the results obtained in the 0-first tests. In order to explain this effect, we made several verification runs confirming the results shown above. Furthermore, we verified this behavior for the plist format with support for sparse storage formats disabled in Open MPI. Since we can positively exclude any effects due to the sparse storage formats for these results, we think that the most probable explanation involves caching effects when accessing the process structures of processes with a higher rank.

With respect to the sparse storage formats, the results indicate a behavior similar to that obtained on the shark cluster over InfiniBand: the redirections required to look up the process structure lead to a small performance penalty when using the sparse storage techniques. In order to estimate the overhead introduced by the sparse storage techniques, we compare the latency obtained for a particular communicator level with a sparse storage technique to the latency obtained on the same communicator level with the plist storage format. This overhead is then divided by the number of indirect lookup operations required for that communicator level. Since the results show a larger variance than on the shark cluster, we provide an upper bound for the overhead by reporting only the maximum values observed in this set of tests and the average obtained over all levels. In the 0-first tests, the range format introduces an average overhead of 0.057 μs per level; the maximum overhead found was 0.08 μs. The penalty on the latency per level when using the strided format in these tests was 0.04 μs, while the highest overhead observed with this storage format was 0.1 μs. The bitmap format once again shows the highest overhead, with an average penalty of 0.118 μs per level and a maximum overhead of up to 0.255 μs. In contrast to the results obtained on the shark cluster, the 0-last tests show a significant additional overhead for the range and the bitmap formats, due to the fact that the rank-translation operation involves a linear scan over all participating processes in that communicator. The bitmap format has an average overhead of more than 0.8 μs per level in these tests, while the average overhead for the range method increases to 0.197 μs. As expected, the strided format does not show any sensitivity to the rank being used in the rank-translation operation.
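The per-level overhead figures quoted above can be reproduced directly from the tables. The helper below is our own illustration of that arithmetic (it is not part of the benchmark suite); as an example it uses the Range and PList 0-first columns of Table 4 as reconstructed above.

public class OverheadPerLevel {
    /** Per-level overhead: (sparse latency - plist latency) / number of indirections. */
    static double[] overhead(double[] sparseLatency, double[] plistLatency) {
        double[] perLevel = new double[sparseLatency.length];
        for (int level = 1; level < sparseLatency.length; level++) {
            perLevel[level] = (sparseLatency[level] - plistLatency[level]) / level;
        }
        return perLevel;
    }

    public static void main(String[] args) {
        // Range 0-first and PList 0-first columns of Table 4 (BigRed, values in μs).
        double[] range = { 6.04, 6.29, 6.25, 6.35, 6.35, 6.45, 7.19, 7.20, 7.40 };
        double[] plist = { 6.25, 6.25, 6.09, 6.20, 6.21, 6.25, 6.75, 6.75, 6.76 };
        double[] perLevel = overhead(range, plist);
        double max = 0, sum = 0;
        for (int level = 1; level < perLevel.length; level++) {
            max = Math.max(max, perLevel[level]);
            sum += perLevel[level];
        }
        // Close to the values quoted in the text (about 0.057 μs average, 0.08 μs maximum).
        System.out.printf("avg %.3f max %.3f%n", sum / (perLevel.length - 1), max);
    }
}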
3.2 Application Benchmark
In order to determine the impact of the new storage formats on the performance of a real application scenario, we executed multiple test cases of the HPL benchmark [10]. A major requirement for the application benchmark chosen for this subsection is that the code has to create sub-communicators which expose a benefit of the sparse storage techniques. HPL organizes processes in a 2-D Cartesian process topology. Three different types of communicators are created by HPL: (I) a duplicate of MPI COMM WORLD, (II) a row communicator for each row of the 2-D Cartesian topology, and (III) a column communicator for each column of the 2-D Cartesian topology. For (I), the new group creation functions will choose the range format for process numbers larger than 64, and the bitmap
format otherwise. Similarly, communicator (II) can be represented by a single range of processes, while communicator (III) is best described by the strided format. Assuming a 90,000 process run of HPL organized in a 300 × 300 process topology, the default implementation of groups and communicators in Open MPI would take 724,800 bytes per process to store the lists of participating processes for all three communicators (90,000 × 8 bytes for the duplicate of MPI COMM WORLD plus 2 × 300 × 8 bytes for the row and column communicators, assuming 8-byte pointers). Using the sparse storage techniques described in this paper, the memory consumption for that scenario can be reduced to 58 bytes per process.

In the following, we present performance results for 48 process test cases using shark. Table 5 summarizes the measurements over the InfiniBand network interconnect. Four different test cases have been executed, namely two problem sizes (24,000 and 28,000), each executed with two different block sizes (160 and 240). Since the latency does show some dependence on the storage format used for MPI groups, we would expect to see minor increases in the execution time for highly latency-sensitive applications. However, none of the test cases executed in this subsection shows a significant performance degradation related to the storage format.

Table 5. Execution time of the HPL benchmark on 48 processes using InfiniBand, in seconds

  Size   Block size  PList   Range   Strided  Bitmap
  24000  160         65.81   65.81   65.84    65.87
  24000  240         69.05   69.08   69.22    69.14
  28000  160         99.78   99.73   99.84    99.81
  28000  240         104.25  104.31  104.27   104.29

4 Summary
In this paper, we introduced various storage formats for groups and communicators in order to minimize the memory footprint of the corresponding structures. The main idea behind these formats is to store only the difference between the original group and the newly created one. Three different formats – range, strided and bitmap – have been implemented in Open MPI. In addition to the memory consumption of each format, the paper also evaluates the performance impact of the sparse storage formats. Using a modified ping-pong benchmark, we determined the performance overhead due to the new data storage formats to be up to 0.04 μs per hierarchy level of the communicator. This overhead is negligible for most scenarios, especially when taking into account that many applications only derive communicators directly from MPI COMM WORLD. Our tests using the HPL benchmark did not show any measurable overhead due to the new storage formats. The techniques detailed in this paper will be available with Open MPI version 1.3.

Acknowledgments. This research was funded in part by a gift from the Silicon Valley Community Foundation, on behalf of the Cisco Collaborative Research
Initiative of Cisco Systems and was supported in part by the National Science Foundation through TeraGrid resources provided by Indiana University.
References

1. Open MPI users mailing lists (2007), http://www.open-mpi.org/community/lists/users/2007/03/2925.php
2. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The Landscape of Parallel Computing Research: A View from Berkeley. Technical report, EECS Department, University of California, Berkeley (2006)
3. Chaarawi, M.: Optimizations of group and communicator operations in Open MPI. Master's Thesis, Department of Computer Science, University of Houston (2006)
4. Gara, A., et al.: Overview of the Blue Gene/L system architecture. IBM Journal of Research and Development 49(2/3), 195–212 (2005)
5. Farreras, M., Cortes, T., Labarta, J., Almasi, G.: Scaling MPI to short-memory MPPs such as BG/L. In: ICS 2006: Proceedings of the 20th Annual International Conference on Supercomputing, pp. 209–218. ACM, New York (2006)
6. Gabriel, E., Fagg, G.E., Bosilca, G., Angskun, T., Dongarra, J.J., Squyres, J.M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R.H., Daniel, D.J., Graham, R.L., Woodall, T.S.: Open MPI: Goals, concept, and design of a next generation MPI implementation. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 97–104. Springer, Heidelberg (2004)
7. Gabriel, E., Fagg, G.E., Dongarra, J.J.: Evaluating dynamic communicators and one-sided operations for current MPI libraries. International Journal of High Performance Computing Applications 19(1), 67–79 (2005)
8. Gropp, W., Lusk, E., Doss, N., Skjellum, A.: A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing 22(6), 789–828 (1996)
9. Message Passing Interface Forum: MPI: A Message Passing Interface Standard (June 1995), http://www.mpi-forum.org
10. Petitet, A., Whaley, R.C., Dongarra, J., Cleary, A.: HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers, Version 1.0, http://www.netlib.org/benchmark/hpl/
11. Shipman, G.M., Brightwell, R., Barrett, B., Squyres, J.M., Bloch, G.: Investigations on InfiniBand: Efficient network buffer utilization at scale. In: Cappello, F., Herault, T., Dongarra, J. (eds.) PVM/MPI 2007. LNCS, vol. 4757, pp. 178–186. Springer, Heidelberg (2007)
12. Sur, S., Koop, M.J., Panda, D.K.: High-performance and scalable MPI over InfiniBand with reduced memory usage: an in-depth performance analysis. In: SC 2006: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 105. ACM, New York (2006)
Method of Adaptive Quality Control in Service Oriented Architectures

Tomasz Szydlo and Krzysztof Zielinski

Department of Computer Science, AGH University of Science and Technology
[email protected], [email protected]
Abstract. The Internet era is very attractive for growing small companies and start-ups because of the simplicity of selecting services that can be composed into complex applications. This model requires a novel approach to accounting and to providing the Quality of Service defined by a Service Level Agreement. Based on these needs, we propose a method of service adaptation that can dynamically change the offered quality with respect to the customer's preferences and budget.
1 Introduction

Software systems are becoming larger and far more complex than ever. At the same time, they have to provide assured quality and performance, and time to market has shortened significantly. Most of the basic functionalities of modern software are reusable components that might be shared between projects. Analyzing many projects from 2003 to 2006, IBM research has found that most of them follow the same architectural template. The result of these findings is the Service-Oriented Solution Stack (S3) [5], which provides a detailed architectural definition of an SOA across nine layers, from the business process layer to the operational systems layer. These layers are crossed by the integration, quality of service, information architecture and governance layers. Each of them has a logical and a physical aspect: the logical aspect includes architectural elements and design decisions, while the physical aspect is related to the implementation technology.

A Service Level Agreement (SLA) is the contract between provider and customer. The SLA defines the terms and conditions of the service quality that the provider delivers to the customers. Most important in an SLA is the Quality of Service (QoS) information, which consists of several criteria such as execution time, availability, and many more. The SLA also contains financial information, such as the price for using the service and the way in which penalties are compensated.

With such complex systems, it is very difficult or even impossible to analyze and tune the application manually to fulfil SLA requirements. Any change can influence the financial condition of the company, because any deviation from business agreements has to be compensated. Additionally, growing companies insist on more flexible ways of accounting and payment for service
usage. It is very convenient, from the customer's point of view, to point out which QoS metrics are more important than others, and what maximum budget might be spent on using the service. The aim of the service provider is then not only to assure QoS, but also to change it dynamically so as not to exceed the budget. We think that service oriented architecture allows for building applications which can adapt to changes in the execution environment.

The structure of this paper is as follows. Section 2 discusses related work. In Section 3, the concept of adaptive quality control that we propose is presented. A motivating scenario is presented and evaluated in Section 4. Finally, conclusions and future work are sketched in Section 5.
2 Related Work

In this section, we cover related work on QoS-driven adaptable architectures and approaches.

2.1 Autonomic Computing

A system is called adaptive if it is able to modify itself in response to changes in the environment. Such a system must be aware of context information; this is mostly achieved by monitoring modules and the implemented sensors. As a reaction to the changes, the system modifies itself through executing modules equipped with a number of effectors. Planning and evaluating what to do when changes are noticed is described by the adaptation logic. This approach is investigated further below, but it derives from Autonomic Computing, an initiative started by IBM in 2001 whose main goal is to create self-managing systems that overcome growing complexity and reduce the effort of maintenance.
Fig. 1. Monitor-Analyze-Plan-Execute loop
In an autonomic system, the operator does not influence the system directly, but defines general policies which determine system behaviour. IBM has defined the following four areas of usage:
− Self-configuration is the autonomic configuration of components;
− Self-healing automatically discovers and corrects faults;
− Self-optimization automatically monitors and controls resources to ensure good functioning with respect to the defined requirements;
− Self-protection is the ability to identify attacks and protect the system from them.
The adaptation process might be divided along two orthogonal aspects:
− Behavioural versus architectural adaptability. Adaptation is behavioural when the behaviour of the service can be modified without modifying its structure, e.g. by tuning some parameters. In contrast, architectural adaptability takes place when the structure of the system is modified, e.g. by switching between instances of the service.
− Run-time versus design-time adaptation. Adaptation actions may be executed at run-time or at design-time. The adaptation is run-time when it can be performed during execution, and design-time when the selection of services is done at the design stage.

2.2 QoS Frameworks

Several approaches have been proposed for QoS-driven service selection. Zeng [4] considers service selection as a global optimization problem solved using linear programming; the target function that is optimised is a linear combination of QoS metrics. Similar approaches incorporate a modified Dijkstra's algorithm [7], constraint logic programming, or a knapsack formulation. This approach is only applicable to design-time adaptation because it does not recalculate the solution during composite service execution. We share Kokash's [6] opinion that the quality of these solutions depends strongly on the user weights for each QoS metric, which are not trivial to establish correctly. Abdelzaher et al. [8] show how a control-theoretic approach can be used to achieve quality of service guarantees. They demonstrate that a software system can be approximated by a linearized model and controlled by actuators and sensors.
3 Concept of Adaptive Quality Control

An analysis of existing frameworks in terms of QoS reveals that this term is used interchangeably to describe quality from the provider's point of view as well as from the client's point of view. We have therefore decided to distinguish Quality of Experience (QoE) from Quality of Service (QoS).
Fig. 2. Different ways of perceiving quality
QoS is the quality of the provided services; QoE is the quality observed by the customer. The idea is presented in Fig. 2. It is quite common that these values are completely different. Consider calculating an account balance at the end of the month. The accounting service we are using is 99.9% available, but once a month it is disabled for one hour of maintenance. If we access this service during that hour, from our point of view its availability is 0.0%. Assuming that QoS is a vector of values for given metrics, QoE is defined as:
QoE = Σ_{i=1..n} QoS_{t_i} × (1/n),   for n → ∞
where n is the number of invocations of the composite service and t_i is the time at which invocation i takes place. This leads us to monitor QoE and, on that basis, adapt the composite service to fulfil the requirements of the agreement. Further research led us to the idea of a QoS controller, depicted in Fig. 3, responsible for providing the desired QoE described by the SLA. Service oriented architecture decomposes a composite service into a set of inter-working simple base services. A set of services with the same functionality will be called an abstract service. During the invocation of the composite service, we can replace a simple service with any other that belongs to the same abstract service set. We can think of this idea as of an interface and the classes that implement it.
Fig. 3. Control loop
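As a concrete illustration of the QoE definition above and of the monitoring step in the control loop, the sketch below keeps a running average of the per-invocation QoS vectors. This is our own minimal example, not the authors' implementation; metric order and normalization are assumptions left to the caller.

public class QoeMonitor {
    private final double[] sum;   // per-metric sum of observed QoS values
    private long n;               // number of invocations observed so far

    public QoeMonitor(int metricCount) {
        this.sum = new double[metricCount];
    }

    /** Record the QoS vector observed for one invocation of the composite service. */
    public void record(double[] observedQoS) {
        for (int i = 0; i < sum.length; i++) sum[i] += observedQoS[i];
        n++;
    }

    /** Current QoE: the average over all recorded QoS vectors. */
    public double[] qoe() {
        double[] qoe = new double[sum.length];
        for (int i = 0; i < sum.length; i++) qoe[i] = (n == 0) ? 0.0 : sum[i] / n;
        return qoe;
    }
}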
In our work we distinguish four kinds of services:
− Abstract service (Si): a service which is described by its functionality, but whose instance is not pointed out;
− Service (Iik): a concrete functionality provided by a provider;
− Abstract composite service (S): a composite service that contains at least one abstract service;
− Composite service (C): a composite service that can be executed because all of its base services are concrete instances.
A composite abstract service consists of several abstract services S = {S1, S2, .., Sn}, each having several instances Si = {Ii1, Ii2, .., Iiki}. In this work we assume that the services Si are executed sequentially. This is a very strong assumption, applicable only to simple services; the problem of evaluating concurrent services [3] with structured activities is out of the scope of this paper. However, the idea of the QoS controller elaborated later is applicable to this type of composite services as well. To make good use of service adaptation, we have to collect the history of previous invocations of base services. Taking these data into account, we can predict more or less accurately the quality of an invocation. Analyzing it deeper, we have found that the quality of a provided service is strictly related to the context of the execution environment. Due to this fact, one can, for example, use services located in the part of the globe where it is currently night and hence servers are not overloaded.
Assuming that we have a history of Iik invocations, the expected quality of further invocations might be calculated using regression analysis; in the simplest case it can be a simple moving average. Before making any assumptions about the expected quality, we can also ask the base service provider for its execution environment context and, on that basis, analyse only the subset of the invocation history where the context information was the same.

3.1 Quality of Service Metrics

Quality of Service attributes include metrics like throughput, response time and availability, but the exact definition and measurement process must be well defined to give consumer and provider a common understanding. While the WSDL language has been defined for describing a Web service's functional aspects, there is no universal language for describing metrics and the way of measuring them. Moreover, many of the well-known metrics, like availability, do not have a formal definition. For example, availability might be described as a percentile value, as the number of positive invocations during the last 50 invocations, or as the availability during the last hour. Metrics are applicable not only to simple services but also to composite ones, so for each metric an algorithm must be provided for calculating its value for a service composition [1]. In this paper we consider only services invoked in a sequence. We can distinguish quantitative metrics, which have numerical values or might be described by values, and qualitative ones, which are described in words. Nevertheless, for every metric an equation or algorithm must be provided to recalculate it to the <0;1> range. QoS is a vector of values for the given metrics. If service a is better than service b from the point of view of a given metric, the metric value of service a is greater than the value for service b; otherwise, for evaluation we have to use inversions of the metric values.

3.2 QoS Controller

We have designed a QoS controller which continuously monitors the deviation of the current user QoE from the agreed SLA and, on that basis, tunes the service QoS by selecting the service instances that are invoked. The deviation of the provided quality of service is described as follows:

ΔQoS = QoE_SLA − QoE
Before the execution of an abstract service, it is decided which instance to invoke. The decision process takes into account the services already invoked, the correction from the feedback loop, and the influence of each abstract service instance on the possible overall quality. The total number of invocation possibilities is ∏_{i=1..n} |Si|, but only when none of the abstract services has been invoked so far. Assuming that services S1,..,Sk have already been invoked, the total number p of possible invocations is ∏_{i=k+1..n} |Si|. Before the invocation of any base service, all possible invocations are evaluated as presented in Fig. 4. For the services that have not been invoked yet, the mean value of Iik over historical invocations is used for calculating the QoS of the composition. From the set {C1,..,Cp} of possible invocations, we select one that fulfils:
Fig. 4. Execution model
∀i : sign(QoS_i(Cx) − QoE_i) = sign(ΔQoS_i)
This guarantees that if any metric value in the SLA is greater than in the QoE, a composite service will be selected whose QoS contains that metric large enough to balance the difference, and analogously small enough when the metric value in the SLA is less than in the QoE.

3.3 Variable QoS

Providing a contracted quality of service is a very competitive task, as is providing a service with variable quality that does not exceed a specified budget. Our idea is to estimate the number of invocations to the end of the accounting period based on the number of invocations so far. Having this information and the amount of money left to spend, we can change the offer to a cheaper one. The customer has priorities as to which metrics are more important than others. Providing exact weights for each metric is a multi-dimensional decision problem and it is not a trivial task; we have found that decomposing the problem of assigning weights makes it easier to deal with. The Analytic Hierarchy Process (AHP) [2] is a technique based on mathematics and human psychology for prioritizing the elements of a decision problem. For each pair of metrics, the user specifies which one is preferred, in the form of a fraction between 1/9 and 9/1. The result of AHP is a vector of weights wi for each metric i. We will refer to the fitness factor of a QoS vector, taking into account the importance of metrics, as:

fitness(QoS) = Σ_i wi × QoS_i
The customer chooses the set of SLAs in which he is interested. After estimating the cost per single invocation, the system chooses the SLA with the best fitness and a price per invocation lower than the calculated one.

3.4 Service Level Agreement

A client agreement is represented as a tuple: SLA = (QoE_SLA, price, penalties, time), where QoE_SLA is the quality of user experience, price is the amount of money which the client has to pay for each service invocation, penalties is the price which the provider will pay in the case of any deviation from the QoE, and time is the accounting period. The price for using service S is defined as follows:
bill = n × price − missed × penalties

missed = 0,                                if QoE ≥ QoE_SLA
missed = n × max_i (QoE_SLA_i − QoE_i),    if QoE < QoE_SLA
where n is the number of invocations. The accounting algorithm might be illustrated by a simple example. Let us say that the availability agreed in the SLA is 50%, and at the end of the month, out of 20 invocations, only 6 were successful. It means that the customer was overcharged unfairly for 4 invocations, which has to be compensated. It is company policy that defines how the value of price is related to the value of penalties. Accounting for variable QoS is quite different, because the client agreement is represented as a set of SLAs together with the metric weights calculated using AHP from a set of preferences: SLA_var = (preferences, {SLA1,.., SLAl}, budget)
The system dynamically changes the SLA to ensure that the bill will not be greater than the budget. In every accounting period, the client is charged independently for every SLA contract, but the total sum is not greater than the budget. The client is eligible to receive compensation if any of the SLAs is violated.
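The accounting rule above can be sketched as follows. This is our own hedged illustration, not the authors' code; it computes the number of missed invocations from the worst per-metric shortfall and checks it against the 50% availability example given earlier (6 successful invocations out of 20).

public class SlaAccounting {

    /** missed = 0 if QoE meets the SLA, otherwise n * max_i(QoE_SLA_i - QoE_i). */
    static double missedInvocations(long n, double[] qoeSla, double[] qoe) {
        double worstShortfall = 0.0;
        boolean violated = false;
        for (int i = 0; i < qoeSla.length; i++) {
            double shortfall = qoeSla[i] - qoe[i];
            if (shortfall > 0) violated = true;
            worstShortfall = Math.max(worstShortfall, shortfall);
        }
        return violated ? n * worstShortfall : 0.0;
    }

    /** bill = n * price - missed * penalties */
    static double bill(long n, double price, double penalties, double[] qoeSla, double[] qoe) {
        return n * price - missedInvocations(n, qoeSla, qoe) * penalties;
    }

    public static void main(String[] args) {
        // Agreed availability 0.5, observed availability 6/20 = 0.3:
        // the customer is compensated for roughly 20 * (0.5 - 0.3) = 4 invocations.
        double[] qoeSla = { 0.5 };
        double[] qoe = { 6.0 / 20.0 };
        System.out.println(missedInvocations(20, qoeSla, qoe));
    }
}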
4 Motivating Scenario

Let us assume that we run an Internet website that provides information for amateur pilots, and that we want to include a very accurate weather forecast for the nearest airports on the main page. The composite service consists of two base services: a service which takes a city name as input and returns the airports in the near proximity, and a second service that takes the names of the places and returns the weather forecast. We have to define a metric for describing the quality of the provided data. Moreover, our portal is a growing business, hence we have a limited budget to maintain the system, so we do not want to spend more than an assumed price for the service. For this type of service it is better, from the customer's point of view, to obtain very accurate data than to have high availability of poor information. To verify our concepts we have developed a simulation environment that is flexible enough to implement different adaptation strategies.

4.1 Metrics

We assume that base services are invoked in a sequence. Below we describe the metrics used in this example scenario and the algorithms for calculating their values for composite services.

Availability. Availability is the probability that a service is accessible. For our composite service, availability is the product of the availabilities of the base services.

Execution time. Execution time is the time between invoking a service and receiving the response. For a composite service, execution time is the sum of the execution times of the base services.

Data Quality. For this example we have introduced a Data Quality metric which defines the quality of the received data. Data quality simply means the accuracy, in kilometres, of
the weather information. The value of this metric for the composite service is the minimum value over the base services. As mentioned before, our service is very specific, so the customer has to decide what is more important and what is not:
− availability is three times as important as execution_time;
− availability is five times less important than data_quality;
− execution_time is five times less important than data_quality.
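To make the weighting step concrete, the sketch below turns the three pairwise preferences listed above into a weight vector. It uses the common geometric-mean approximation of AHP rather than the exact eigenvector method, so the resulting numbers (roughly 0.20, 0.10 and 0.70) are our own computation and may differ slightly from the weights shown in Fig. 5.

public class AhpWeights {

    /** Approximate AHP weights: normalized geometric means of the pairwise matrix rows. */
    static double[] weights(double[][] pairwise) {
        int n = pairwise.length;
        double[] w = new double[n];
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            double product = 1.0;
            for (int j = 0; j < n; j++) product *= pairwise[i][j];
            w[i] = Math.pow(product, 1.0 / n);   // geometric mean of row i
            total += w[i];
        }
        for (int i = 0; i < n; i++) w[i] /= total;  // normalize so the weights sum to 1
        return w;
    }

    public static void main(String[] args) {
        // Rows/columns: 0 = availability, 1 = execution_time, 2 = data_quality.
        double[][] prefs = {
            { 1.0,     3.0, 1.0 / 5 },   // availability: 3x execution_time, 1/5 of data_quality
            { 1.0 / 3, 1.0, 1.0 / 5 },   // execution_time: 1/5 of data_quality
            { 5.0,     5.0, 1.0     }    // data_quality dominates both
        };
        double[] w = weights(prefs);
        System.out.printf("%.2f %.2f %.2f%n", w[0], w[1], w[2]);  // about 0.20 0.10 0.70
    }
}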
Fig. 5. Evaluated metrics weights
The results of the metric weight evaluation with AHP are presented in Fig. 5. The customer agreed to use three SLAs. Each contract has the same data quality metric, but differs in response time and availability. In Table 1, the contracts are listed in order with respect to the metric weights.

Table 1. SLA contracts

  SLA  Price per 100 invocations  Response time [ms]  Availability  Data quality [km]
  0    1                          100                 0.9           5
  1    0.5                        300                 0.8           5
  2    0.2                        200                 0.7           5
4.2 Evaluation

The customer noticed that his website gets three thousand visits per month. As the website becomes more popular, the number of invocations may significantly increase. In this situation the client decided to buy the service with variable QoS and a maximum budget of 30 €. Detailed statistics of user invocations are presented in Fig. 6. The larger number of page visits during the third week can be explained by a national flying contest taking place. The total number of invocations was not the expected three thousand but almost five thousand. During the month, the system tried to estimate the number of invocations until the end of the accounting period, as depicted in Fig. 7. Because the number of invocations was greater than expected, the system had to switch SLAs to keep the total sum below the assumed 30 €. We can notice that on the 14th day the system decided to switch the SLA one level down, but the increased number of visits stayed for three days, so the system decided to switch one more level down. The website traffic then started coming back to its normal state, so the system came back to the best SLA, as depicted in Fig. 8.
To calculate the expected number of invocations to the end of the month, the system estimates it using a moving average over the last 7 days: the mean number of visits per day is multiplied by the number of days remaining to the end of the period.

Fig. 6. Number of invocations per day
Fig. 7. Estimating the number of invocations to the end of the month (estimated vs. real number of invocations)
Fig. 8. Selected SLA
Fig. 9. Partial bill (variable QoS vs. constant QoS)
Fig. 10. Convergence of availability
The SLA has been changed several times during the accounting period, hence the QoS metrics had to converge to the values contracted in the selected SLA. One can notice that the convergence pattern in Fig. 10 is very similar to the one known from control theory.

4.3 Discussion

Without variable QoS, the customer would be asked to pay 50 €, which is a lot more than the assured 30 €. With variable QoS activated, the system is able to accommodate the busy week on
the website. As stated in the contracted SLAs, the data quality was the same during the whole month, but availability and response time changed as many times as the SLA changed. Fig. 9 shows the partial bill over the whole month. The proposed approach does not take into account deviations caused by Internet connections; a possible extension to incorporate them is to design remote invocation monitoring deployed at the customer side. Secondly, the accounting method might be unfair, especially when the number of composite service invocations is very low or when the invocations are not spread equally over the accounting period.
5 Conclusion

In this paper, we have presented a novel approach to managing composite services. With the growing popularity of the Internet, providing services with guaranteed QoS has become increasingly important. Our algorithm integrates statistical methods for predicting the quality of service of composite service invocations, as well as automatic adaptation strategies that keep to the consumer's budget by changing the SLA during execution. The presented case study verified the usability of our method. Many remaining issues are worth further research. The most interesting is how this model behaves in real applications. Another interesting aspect is how to improve the accounting algorithm to prevent unfair charges. Finally, further examples, experimental tests and practical experience are needed to find the true potential of applying adaptive quality control to different classes of composite services. This remains an important focus for our future research.
References

1. Menascé, D.A.: Composing Web Services: A QoS View. IEEE Internet Computing 8(6), 88–90 (2004)
2. Forman, E.H., Selly, M.A.: Decision By Objectives – How To Convince Others That You Are Right. World Scientific Publishing Co. Pte. Ltd., Singapore (2001)
3. Cardoso, J., Sheth, A.P., Miller, J.A., Arnold, J., Kochut, K.: Quality of service for workflows and web service processes. Journal of Web Semantics 1(3), 281–308 (2004)
4. Zeng, L., Benatallah, B., Ngu, A.H.H., Dumas, M., Kalagnanam, J., Chang, H.: QoS-Aware Middleware for Web Services Composition. IEEE Transactions on Software Engineering 30(5), 311–327 (2004)
5. Arsanjani, A., Zhang, L.-J., Ellis, M., Allam, A., Channabasavaiah, K.: S3: A Service-Oriented Reference Architecture. IT Professional 9(3), 10–17 (2007)
6. Kokash, N.: A Service Selection Model to Improve Composition Reliability. In: Proceedings of the International Workshop on AI for Service Composition (2006)
7. Gu, X., Nahrstedt, K., Chang, H., Ward, C.: QoS-Assured Service Composition in Managed Service Overlay Networks. In: ICDCS 2003: Proceedings of the 23rd International Conference on Distributed Computing Systems, Washington, DC, USA, p. 194 (2003)
8. Abdelzaher, T., Stankovic, J., Lu, C., Zhang, R., Lu, Y.: Feedback performance control in software services. IEEE Control Systems Magazine 23(3), 74–90 (2003)
Ontology Supported Selection of Versions for N-Version Programming in Semantic Web Services

Pawel L. Kaczmarek

Department of Computer Systems Architecture, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology
[email protected]
Abstract. The Web Services environment provides capabilities for effective N-version programming, as there exist different versions of software that provide the same functionality. N-version programming, however, faces the significant problem of correlated failures in different software versions. This paper presents a solution that attempts to reduce the risk of failure correlation by selecting for invocation services that have relatively different non-functional features. We use an ontology-driven approach to identify and store information about software features related to differences between software versions, such as the software vendor, design technology or implementation language. We present an algorithm for the selection of software versions using the designed ontology. The solution was verified in a prototypical implementation with the use of an existing OWL-S API library.
1 Introduction
N-version programming (NVP) is a resilient computing mechanism [1] that has been used for decades to increase software dependability. The technique was initially used and researched in sequential systems; however, different research groups now focus on NVP in distributed systems, as described later in this paper. It seems that NVP can be successfully applied in Web Services and Semantic Web Services. The Web Services architecture assumes that services supplying the same functionality are advertised and available for clients. A client can either choose a service that supplies the best price and dependability, or invoke different services in order to increase dependability.

The paper addresses the typical problem that NVP faces: there exists a correlation between errors in different software versions [2]. The correlation results from similar educational backgrounds, programming languages, the algorithms used and other factors. In Web Services, however, services differ in vendor and technology, which might lay the foundations for the creation of dependable N-version systems. In our solution, we attempt to design a technique for the selection of services that are unlikely to fail for similar input or in similar conditions.
2 Semantic N-Version Invocation Module
The designed solution is aimed at increasing the dependability of NVP without increasing the number of invoked versions of a service; the limited number of invoked services reduces the invocation costs. In this solution, we select those services that have relatively different non-functional features, which consequently limits the risk of repeating feature-specific errors during the execution of a selected set.

The first step is to identify service features, the dependencies between features and their impact strength. An N-version features ontology is defined to describe the features related to differences between software versions. Examples of ontology concepts are: implementation language, software vendor, design process, runtime platform and the algorithms used (see Sect. 3.1). It is assumed that already existing service registries know different services supplying the same functionality, and that the existing servers already offer services of equal functionality. Relevant information about available services is stored in the ontology.

The next step is to design an algorithm for the selection of services depending on service features. Generally, the algorithm calculates the number of common features for groups of services and selects a group in which the services are relatively different (see Sect. 3.2). An N-version invocation that uses our service selection mechanism consists of the following steps:
– A matching subsystem selects available services that match the client's request.
– Services are selected from the initial set with respect to service features:
  • The N-version features ontology is queried for service features.
  • Binary service similarities are identified.
  • Service similarities are calculated for potential groups.
  • A group with the lowest service similarity is selected for invocation.
– The selected services are invoked.
– The result is voted on and returned to the client.
Finally, the solution is implemented and validated.
2.1 Module Architecture
The architecture of the N-version invocation module is shown in Fig. 1. The system consists of the following submodules:

Search and matching module - performs the typical tasks of service discovery and matching. It is assumed that an already existing matching module is used and that the module is capable of delivering a set of services that match a client's request.

Selection module - selects services for an N-version invocation from the available services supplied by the Search module. A service features knowledge base is used to create a configuration of possibly different services.

Service knowledge base - uses the N-version features ontology to store information about known services. It can be stored in two ways:
Ontology Supported Selection of Versions for N-Version Programming
Client
invoke OWL−S return
Search appropriate matching
319
Service knowledge base query
Select for invocation
Invoke / vote
N−version features ontology
invoke third party
Fig. 1. Main parts of N-Version invocation modules with replicas selection
– locally - integrated in client’s application – remotely - accessible to different clients through Web Services Invocation module - manages OWL-S [3] definitions, invokes N-version services, votes the result and returns it to a client.
3 Selection of Services
The identified service features and the dependencies related to service differences in NVP are stored as an ontology. We use an ontology-driven approach for the following reasons: (i) it is a systematic and organized way of describing entities, (ii) there already exist ontologies and taxonomies that describe services, and (iii) there are technological similarities between ontologies (OWL) and Semantic Web Services (OWL-S).
3.1 N-Version Features Ontology
We designed the N-version features ontology focusing on concepts concerning the correlation of errors during an N-version invocation. The designed ontology is based on the following existing ontologies and taxonomies: EvoOnt - A Software Evolution Ontology [4], Ontology and Taxonomy of Services [5], Core Software Ontology [6] and the Service Ontology from Obelix [7]. Fig. 2 presents the classes and relations defined in the N-version features ontology. A SemanticService describes a service that contains ontological descriptions of the service bundle contents [7]. A service is a loosely coupled, reusable software component that semantically encapsulates discrete functionality and is distributed as well as programmatically accessible over standard Internet protocols. The concepts describing a SemanticService concern vendor, development and runtime information. A Vendor of a SemanticService is an organization or a person that supplies the service. A SemanticService is designed with the use
Fig. 2. Most important classes of the N-version features ontology
of a DesignTechnology, such as the Waterfall model or the Spiral model, and Algorithms. It is implemented in one or more ImplementationLanguages. A CommonModule and its subclasses describe third-party modules that are included in service code as Libraries or Frameworks. Finally, a SemanticService runs on a RuntimePlatform.
3.2 Service Selection Algorithm
The ontological description is used by the selection algorithm to identify groups of services in which the services have relatively different features. Services from one of the groups are selected for an N-version invocation. The algorithm selects services for the N-version invocation from the available services that match the client's request. The algorithm makes the following assumptions:
– A matching subsystem has already selected services that match the client's request.
– The N-version features ontology describes the available services.
Ensuring that the services in an N-version invocation differ reduces the risk of correlated failures specific to those features. Generally, the algorithm proceeds in the following steps: (i) the ontology is queried for service features, (ii) common features for pairs of services are calculated, (iii) common features are added up for the services of each potential group, (iv) the services from a group with a relatively small number of common features are returned. The selection is done after basic client-server matching and before the actual invocation of the different versions of a service. The input for the selection algorithm is the set of available services that supply the required functionality; the output is the set selected for an N-version invocation. Algorithm 1 presents the most important selection steps.
Algorithm 1. Selection of service versions for N-version invocation.
  input: matchingServices - all services that match the client's request
  input: groupSize - number of versions that are invoked
  output: services selected for invocation

  query the N-version features ontology for information about matchingServices
  for all service in matchingServices do
      fetch serviceFeatures
  end for
  create featureMatrix in which rows and columns correspond to services from matchingServices
  for all servicei, servicej in matchingServices do
      featureMatrix[i][j] = count common features for servicei and servicej
  end for
  create groupsList containing all subsets of size groupSize from matchingServices
  for all group in groupsList do
      calculate similarityMetric: add up values from featureMatrix for each pair of services from group
  end for
  identify bestGroups: select groups with the smallest similarityMetric
  select one finalGroup from bestGroups
  return services from finalGroup
The presented listing simplifies the analysis by treating all features as equally important. However, features from the N-version ontology may have a different impact on the correlation of failures in N-version programming. Although no research on such impact is known to us, the designed algorithm should distinguish feature importance during service selection. We arbitrarily select features of primary and secondary impact strength. Concepts considered to be of primary importance are: ImplementationLanguage, Vendor, MiddlewareServer, CommonModule and DesignTechnology. Other concepts are considered to be of secondary importance.
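The sketch below is our own Java illustration of Algorithm 1 combined with the primary/secondary weighting just discussed; it is not the prototype's code. Each service is described as a map from feature name to value, and the weights of 2 for primary and 1 for secondary features are assumptions made only for the example. With all weights set to 1, similarity() counts common features exactly as featureMatrix does in Algorithm 1.

import java.util.*;

public class VersionSelector {

    static final Set<String> PRIMARY = Set.of(
        "ImplementationLanguage", "Vendor", "MiddlewareServer",
        "CommonModule", "DesignTechnology");

    /** Weighted number of common features shared by two services. */
    static int similarity(Map<String, String> a, Map<String, String> b) {
        int score = 0;
        for (Map.Entry<String, String> e : a.entrySet()) {
            if (e.getValue().equals(b.get(e.getKey()))) {
                score += PRIMARY.contains(e.getKey()) ? 2 : 1;
            }
        }
        return score;
    }

    /** Pick the group whose pairwise similarities add up to the smallest value. */
    static List<Map<String, String>> select(List<Map<String, String>> services, int groupSize) {
        List<List<Map<String, String>>> groups = new ArrayList<>();
        subsets(services, groupSize, 0, new ArrayList<>(), groups);

        List<Map<String, String>> best = null;
        int bestMetric = Integer.MAX_VALUE;
        for (List<Map<String, String>> group : groups) {
            int metric = 0;
            for (int i = 0; i < group.size(); i++)
                for (int j = i + 1; j < group.size(); j++)
                    metric += similarity(group.get(i), group.get(j));
            if (metric < bestMetric) {
                bestMetric = metric;
                best = group;
            }
        }
        return best;
    }

    /** Enumerate all subsets of the requested size. */
    static void subsets(List<Map<String, String>> services, int size, int from,
                        List<Map<String, String>> current,
                        List<List<Map<String, String>>> out) {
        if (current.size() == size) {
            out.add(new ArrayList<>(current));
            return;
        }
        for (int i = from; i < services.size(); i++) {
            current.add(services.get(i));
            subsets(services, size, i + 1, current, out);
            current.remove(current.size() - 1);
        }
    }
}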
3.3 Gathering Data for Service Descriptions
Although the structure of the N-version features ontology is statically defined, it is still necessary to fill in information about known services. The most desirable approach is to automatically fetch data about service features from existing sources of information. In many cases this can be done, as information is available for automatic processing in different places in the Semantic Web infrastructure: WSDL, UDDI and OWL-S. Some features, however, need to be handled manually. Information about a service is available in the WSDL definition, the UDDI registry or OWL-S files at different abstraction levels. The service vendor is described in UDDI definitions in the "businessEntity" part of the service description, with optional detailed information. The service runtime can be determined by querying the service endpoint about its middleware platform.
Existing sources do not provide information about service design and implementation. In particular, it will be necessary to handle information about the development approach, the design process and the used algorithms manually. Information about the implementation language and used frameworks is not normally available in the service description. Additional information is either included directly in the N-version features ontology or in the OWL-S descriptions of individual services.
4 Prototypical Implementation
The designed solution was verified by a prototypical implementation. The implementation covers a simplified N-version features ontology, the service selection algorithm and the invocation of semantic services. The simplified ontology contained the SemanticService class and the following classes that describe service features: Vendor, RuntimePlatform and ImplementationLanguage. The implemented algorithm uses information stored in the ontology to calculate the similarityMetric for the potential groups of services. It is assumed that service features are of equal importance. Selected services are invoked using their OWL-S and WSDL definitions. Partial results from the services are gathered and the final result is voted on using simple majority voting. If consensus is achieved, it is returned to the invoker; otherwise an exception is thrown by the N-version invocation module. We used the following third-party libraries in the implementation:
– Protege-OWL - definition of the N-version features ontology.
– Mindswap OWL-S API - Java API for the invocation of Semantic Web Services and transformation from WSDL to OWL-S.
– Jena SPARQL - execution of SPARQL queries on the N-version features ontology to fetch information about services.
The implementation does not cover service matching between the client's request and the server's offering, as it is not within the scope of this paper. The matching phase is realized by a mock matcher with a fixed matching between services. Automated creation of the N-version features ontology is not yet implemented in the system. Code snippets showing the SPARQL query and service invocation are shown in Listings 1.1 and 1.2.

Listing 1.1. SPARQL query executed on the prototypical ontology

PREFIX ...
SELECT ?service ?runtimePlatform ?vendor ?implLanguage
WHERE {
  ?service rdf:type table:SemanticService .
  ?service table:hasRuntime ?runtimePlatform .
  ?service table:hasVendor ?vendor .
  ?service table:implementedIn ?implLanguage .
}
...
Listing 1.2. Code snippets for service selection and invocation

import com.hp.hpl.jena.query.*;
import org.mindswap.owls.process.*;
...
ProcessExecutionEngine exec;  // from org.mindswap.owls.io
OWLSReader reader;
...
public List selectServices(...) {
  ...
  Query query = QueryFactory.create(queryString);
  QueryExecution qe = QueryExecutionFactory.create(query, model);
  ResultSet results = qe.execSelect();
  ...
}

public String invokeNVariant(...) throws Exception {
  ...
  selectedServices = selectServices(...);
  for (int i = 0; i < selectedServices.size(); i++) {
    ...
    // invoke using Mindswap OWL-S API
    service = reader.read(owlsFile);
    process = service.getProcess();
    exec.execute(process, values);
    ...
  }
  ...
}
4.1 Selection Example
As an example of service selection, let us consider the following demo configuration. Six services supplying the same functionality differ in their implementation language, runtime platform and service vendor. NVP is configured to invoke triples of services: we select three services from the six available ones in such a way that the selected services have possibly different features. Table 1 shows the features of the demo services fetched by the SPARQL query, and Table 2 shows the number of common features for pairs of services. Let ES abbreviate ExemplaryService. For example, services ES1 and ES2 have two common features, RuntimePlatform and Vendor, while services ES1 and ES3 have no common features.

Table 1. Demo services

  Service id         ImplLanguage  RuntimePlatform  Vendor
  ExemplaryService1  CSharp        DotNet           SemanticDemoCorp
  ExemplaryService2  JSharp        DotNet           SemanticDemoCorp
  ExemplaryService3  J2EE          Axis2            FreeSemanticProducts
  ExemplaryService4  J2EE          Axis2            OntologyDemoUniv
  ExemplaryService5  J2EE          JBoss            SemanticDemoCorp
  ExemplaryService6  J2EE          JBoss            FreeSemanticProducts
Table 2. Number of common features between services

      ES1  ES2  ES3  ES4  ES5  ES6
ES1    x    2    0    0    1    0
ES2    -    x    0    0    1    0
ES3    -    -    x    2    1    2
ES4    -    -    -    x    1    1
ES5    -    -    -    -    x    2
ES6    -    -    -    -    -    x
Then the algorithm calculates the similarityMetric for groups of services. Assuming that triples are selected, there are 20 potential groups, with similarityMetric values ranging from 1 to 5. Groups number 9 {ES1, ES4, ES6} and 15 {ES2, ES4, ES6} have the lowest value of the metric (one). Group 19 {ES3, ES5, ES6}, for example, has a similarityMetric of 5. Either group number 9 or 15 is selected and passed on to the invocation and voting procedure. In this scenario, the Java programming language is a common feature of the majority of the invoked services in both group 9 and group 15. It may happen that an error specific to Java programming is repeated and manifests in both implementations for some input. The other features differ among the invoked services, which gives grounds to expect that a feature-specific fault will not corrupt the N-version invocation. For example, an error in the SemanticDemoCorp development process will probably not be repeated in other companies, therefore a fault specific to SemanticDemoCorp will be concealed by the other versions. Analogously, if a failure manifests in the Axis2 middleware for some configuration, it will be concealed by the services running on JBoss and .NET.
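To make the selection step concrete, the following small sketch (ours, not part of the paper's prototype; Python is used only for illustration) enumerates all triples of the demo services from Table 1, computes the similarityMetric as the sum of pairwise common-feature counts, and recovers groups 9 and 15 as the minimizers:

from itertools import combinations

# Features of the demo services from Table 1 (ImplLanguage, RuntimePlatform, Vendor).
features = {
    "ES1": ("CSharp", "DotNet", "SemanticDemoCorp"),
    "ES2": ("JSharp", "DotNet", "SemanticDemoCorp"),
    "ES3": ("J2EE",   "Axis2",  "FreeSemanticProducts"),
    "ES4": ("J2EE",   "Axis2",  "OntologyDemoUniv"),
    "ES5": ("J2EE",   "JBoss",  "SemanticDemoCorp"),
    "ES6": ("J2EE",   "JBoss",  "FreeSemanticProducts"),
}

def common_features(a, b):
    # Number of positions where the two services share the same feature value.
    return sum(1 for fa, fb in zip(features[a], features[b]) if fa == fb)

def similarity_metric(group):
    # Sum of pairwise common-feature counts over all pairs in the group.
    return sum(common_features(a, b) for a, b in combinations(group, 2))

triples = list(combinations(sorted(features), 3))   # the 20 potential groups
best = min(similarity_metric(g) for g in triples)
selected = [g for g in triples if similarity_metric(g) == best]
print(best, selected)
# prints 1 and the two groups {ES1, ES4, ES6} and {ES2, ES4, ES6}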
4.2 Dependability of N-Version Module Itself
The N-version invocation module may itself be a source of additional errors and a threat to computer system dependability. We propose two alternative invocation mechanisms that can be used in case the primary N-version invocation module fails: (i) a secondary, simplified N-version invocation module and (ii) a simple invocation of a single service. The secondary module performs a standard N-version invocation in which randomly chosen services are selected from the set of matching services and invoked. If both the primary and the secondary N-version modules fail, a simple invocation is performed on a service randomly selected from the matching services. A deliberately simple invocation stub that can detect failures or timeouts of the N-version modules is created. The stub invokes either the primary module, the secondary module or a single service.
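A minimal sketch of this degradation order is given below. It is only illustrative: the module interface (invoke) and the NVersionFailure exception are hypothetical names, not part of the described prototype.

import random

class NVersionFailure(Exception):
    """Raised (hypothetically) when an N-version module cannot reach consensus."""
    pass

def invocation_stub(request, primary, secondary, matching_services, timeout=30):
    # Try the primary N-version module, then the simplified secondary one,
    # and finally a plain invocation of one randomly chosen matching service.
    for module in (primary, secondary):
        try:
            return module.invoke(request, timeout=timeout)
        except (NVersionFailure, TimeoutError):
            continue  # fall through to the next, simpler mechanism
    service = random.choice(matching_services)
    return service.invoke(request, timeout=timeout)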
5 Related Work
Although dependability in distributed systems is a mature research discipline, the works related to N-version programming in service oriented architecture are
quite recent. Looker et al. [8] propose an Axis stub for N-version invocations. The solution uses service location and a majority-voting scheme. Our work differs in that we use semantic selection of different versions of a web service; additionally, we propose rules for service selection to achieve the best dependability results. Santos et al. [9] propose a fault-tolerant infrastructure that is based on the FT-CORBA architecture. Similarly to the previous work, the solution does not use semantic information and does not select services from the available ones. Cardoso [10] proposes semantic integration of Web Services with the use of WSDL-S and JXTA technology. The solution is based on creating Semantic Web Services proxies and peer groups that are used as N-version software. Our solution differs in that we consider service features for service selection. We use the N-version features ontology driven algorithm to select those service instances that should be included in N-version invocations. Additionally, in our solution, versions of Web Services are unaware of each other, as they are discovered and invoked by the invocation module. Townend et al. [11] propose a replication-based solution for grid environments. An "FT-Grid co-ordination service" is used to locate, receive, and vote upon jobs submitted by a client program. Our solution differs in that we use semantic information about services and select service versions depending on their features. There is a lot of work on the dependability of SOA systems that is loosely related to the scope of this paper. [1] describes research in software diversity and off-the-shelf components. WS-* standards were proposed for Web Services dependability, such as WS-ReliableMessaging, WS-Reliability, WS-Security, WS-AtomicTransaction and others. These standards usually concern lower layers of software systems. Backward recovery and exception handling are addressed in [12,13].
6 Conclusions and Future Work
The aim of this paper was to propose a technique for the selection of service versions for N-version invocations. The solution aims at increasing software dependability without increasing the number of invoked services, in order to reduce invocation cost. We presented an ontology of service features related to N-version programming and an algorithm for service selection. The designed ontology and algorithm show that services can be selected according to their non-functional features, which reduces the risk of repeating feature-specific failures. A prototype implementation shows that this solution can be effectively applied to Semantic Web Services. The obtained results are promising, especially considering the fact that the Web Services infrastructure supplies different infrastructures for service development and sharing. Future work concerns further effort to fully implement the designed solution. The current implementation does not integrate the matching module, the feature gathering functionality and some classes from the N-version features ontology. It also needs to be updated to OWL-S version 1.1. Additionally, experiments need to be performed to determine the impact of primary and secondary service
features on service execution. The distinction of impact strength was made heuristically and needs to be verified. Finally, the implemented system should be verified in a real-world application.
Acknowledgments. This work was supported by the Polish Ministry of Science and Higher Education under research project No. N519 022 32/2949.
References
1. ReSIST: Resilience for Survivability in IST, A European Network of Excellence: Resilience-Building Technologies: State of Knowledge (2006)
2. Knight, J.C., Leveson, N.G.: An experimental evaluation of the assumption of independence in multiversion programming. IEEE Transactions on Software Engineering (1986)
3. W3C: OWL-S: Semantic Markup for Web Services (2004)
4. Kiefer, C., Bernstein, A., Tappolet, J.: EvoOnt - a software evolution ontology. Technical report, Dynamic and Distributed Information Systems Group, University of Zürich (2007), http://www.ifi.uzh.ch/ddis/msr/
5. Cohen, S.: Ontology and taxonomy of services in a service-oriented architecture. The Architecture Journal, Microsoft Corporation (2007)
6. Gangemi, A., Mika, P., Sabou, M., Oberle, D.: An ontology of services and service descriptions. Technical report, OntoWare.org, Institute AIFB, University of Karlsruhe (2003), http://cos.ontoware.org/
7. Baida, Z., Gordijn, J., Akkermans, H.: Service ontology. Technical report, Ontology-Based ELectronic Integration of CompleX Products and Value Chains (2003)
8. Looker, N., Munro, M., Xu, J.: Increasing web service dependability through consensus voting. In: 29th IEEE Annual International Computer Software and Applications Conference (2005)
9. Santos, G., Lung, L.C., Montez, C.: FTWeb: A fault tolerant infrastructure for web services. In: Ninth IEEE International EDOC Enterprise Computing Conference (2005)
10. Cardoso, J.: Semantic integration of web services and peer-to-peer networks to achieve fault-tolerance. In: IEEE International Conference on Granular Computing (2006)
11. Townend, P., Xu, J.: Replication-based fault tolerance in a grid environment. In: U.K. e-Science 3rd All-Hands Meeting (2004)
12. Xu, J., Romanovsky, A., Randell, B.: Concurrent exception handling and resolution in distributed object systems. IEEE Transactions on Parallel and Distributed Systems (2000)
13. Kaczmarek, P.L., Krawczyk, H.: Remote exception handling for PVM processes. In: Dongarra, J., Laforenza, D., Orlando, S. (eds.) EuroPVM/MPI 2003. LNCS, vol. 2840. Springer, Heidelberg (2003)
Hybrid Index for Metric Space Databases

Mauricio Marin (1), Veronica Gil-Costa (2), and Roberto Uribe (3)

(1) Yahoo! Research, Santiago, Chile
(2) DCC, Universidad Nacional de San Luis, Argentina
(3) DCC, Universidad de Magallanes, Chile
[email protected]
Abstract. We present an index data structure for metric-space databases. The proposed method has the advantage of allowing an efficient use of secondary memory. In the case of an index entirely loaded in main memory, our strategy achieves competitive performance. Our experimental study shows that the proposed index outperforms other strategies known to be efficient in practice. A valuable feature of the proposal is that the index can be dynamically updated once constructed.
1 Introduction
Searching in metric spaces is a very active research field since it offers efficient methods for indexing and searching by similarity in non-structured domains. For example, multimedia databases manage objects without any kind of structure, like images, audio clips or fingerprints. Retrieving the most similar fingerprint to a given one is a typical example of similarity search. The problem of text retrieval is present in systems that range from a simple text editor to big search engines. In this context we can be interested in retrieving words similar to a given one to correct typing errors, or documents similar to a given query. We can find more examples in areas such as computational biology (retrieval of DNA or protein sequences) or pattern recognition (where a pattern can be classified from other previously classified patterns). Similarity search can be trivially implemented by comparing the query with all the objects of the collection. However, the high computational cost of the distance function, and the high number of times it has to be evaluated, makes similarity search very inefficient with this approach. This has motivated the development of indexing and search methods in metric spaces that make this operation more efficient by trying to reduce the number of evaluations of the distance function. This can be achieved by storing in the index information that, given a query, can be used to discard a significant amount of objects from the data collection without comparing them with the query. Although reducing the number of evaluations of the distance function is the main goal of indexing algorithms, there are other important features. Some methods can only work with discrete distance functions while others admit continuous distances too. Some methods are static, since the data collection cannot grow
once the index has been built. Dynamic methods support insertions in an initially empty collection. Another important factor is the possibility of efficiently storing these structures in secondary memory. Search methods in metric spaces can be grouped into two classes [2]: pivot-based and clustering-based search methods. A pivot-based strategy selects some objects as pivots from the collection, then computes the distances between the pivots and the objects of the database, and uses this information to group related objects. The index is built by computing and storing the distances from each pivot to the objects of the database. During the search, this information is used to discard objects from the result without comparing them with the query. Clustering techniques partition the collection of data into groups called clusters such that similar entries fall into the same group. Thus, the space is divided into zones as compact as possible, usually in a recursive fashion, and this technique stores a representative point ("center") for each zone plus a few extra data that permit quickly discarding the zone at query time. In the search, complete regions are discarded from the result based on the distance from their center to the query. In this paper we propose a combination of two existing methods (Sec. 2). The first method is used as proposed by its authors, whereas the second one has been highly optimized by us to deal with secondary memory efficiently and, very importantly, to reduce the running time by increasing the ability of the strategy to quickly discard objects that cannot be part of the solution to a given query (Sec. 3). We present a complete evaluation of the performance of the proposed strategy in Sec. 4, which shows that our strategy consistently outperforms all others in practice. Sec. 5 presents concluding remarks.
2 Metric Spaces and Indexing Strategies
A metric space (X, d) is composed of a universe of valid objects X and a distance function d : X × X → R+ defined among them. The distance function determines the similarity between two given objects. The goal is, given a set of objects and a query, to retrieve all objects close enough to the query. This function holds several properties: strict positiveness (d(x, y) > 0 and if d(x, y) = 0 then x = y), symmetry (d(x, y) = d(y, x)), and the triangle inequality (d(x, z) ≤ d(x, y) + d(y, z)). The finite subset U ⊂ X with size n = |U| is called the database and represents the collection of objects. A k-dimensional vector space is a particular case of metric space in which every object is represented by a vector of k real coordinates. The definition of the distance function depends on the type of the objects we are managing. In a vector space, d could be a distance function of the family L_s(x, y) = (Σ_{1≤i≤k} |x_i − y_i|^s)^{1/s}; for example, s = 2 yields the Euclidean distance. For strings, a typical choice is the edit distance, that is, the number of insertions, deletions or modifications needed to make two words equal. There are three main queries of interest:
– range search: retrieves all the objects u ∈ U within a radius r of the query q, that is, (q, r)_d = {u ∈ U | d(q, u) ≤ r};
– nearest neighbor search: retrieves the most similar object to the query q, that is, NN(q) = {u ∈ U | ∀v ∈ U, d(q, u) ≤ d(q, v)};
– k-nearest neighbors search: a generalization of the nearest neighbor search, retrieving the set kNN(q) ⊆ U such that |kNN(q)| = k and ∀u ∈ kNN(q), v ∈ U − kNN(q), d(q, u) ≤ d(q, v).
We focus on range queries since nearest neighbor queries can be rewritten as range queries in an optimal way [2]. In the following we describe the data structures we combine to produce our metric-space index.
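As a point of reference for the index structures described next, the following sketch (ours; Python is used only for illustration) shows the trivial, index-free implementation of a range query with the Euclidean distance. Every object is compared against the query, which is exactly the cost the indexes below try to avoid.

import math

def euclidean(x, y):
    # L_s distance with s = 2 over k-dimensional vectors.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def range_search(database, q, r, dist=euclidean):
    # Brute-force baseline: one distance evaluation per stored object.
    return [u for u in database if dist(q, u) <= r]

db = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
print(range_search(db, (0.5, 0.5), 1.0))   # [(0.0, 0.0), (1.0, 1.0)]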
2.1 List of Clusters (LC)
This strategy [1] builds the index by choosing a set of centers c ∈ U with radius rc, where each center maintains a bucket that keeps all objects within the extension of the ball (c, rc). Each bucket contains the k objects that are the closest ones to the respective center c. Thus the radius rc is the maximum distance between the center c and its k-nearest neighbor. The buckets are filled as the centers are created, and thereby a given element a located in the intersection of two or more center balls is assigned to the first center. The first center is randomly chosen from the set of objects. The next ones are selected so that they maximize the sum of the distances to all previous centers. A range query q with radius r is solved by scanning the centers in order of creation. At each center we compute d(q, c) and, in the case that d(q, c) ≤ rc + r, all objects in the bucket associated with c are compared against the query. Also, if the query ball (q, r) is totally contained in the center ball (c, rc), there is no need to consider other centers.
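The following sketch is our condensed reading of the construction and search just described (function and variable names are ours, and the sketch ignores edge cases such as an empty collection):

def build_lc(objects, k, dist):
    # List of Clusters: each center keeps its k closest remaining objects;
    # rc is the distance from the center to its k-th nearest neighbour.
    centers = []
    remaining = list(objects)
    c = remaining.pop(0)               # first center (chosen at random in practice)
    while True:
        remaining.sort(key=lambda o: dist(c, o))
        bucket, remaining = remaining[:k], remaining[k:]
        rc = dist(c, bucket[-1]) if bucket else 0.0
        centers.append((c, rc, bucket))
        if not remaining:
            return centers
        # next center: maximizes the sum of distances to all previous centers
        c = max(remaining, key=lambda o: sum(dist(o, cc) for cc, _, _ in centers))
        remaining.remove(c)

def lc_range_search(centers, q, r, dist):
    result = []
    for c, rc, bucket in centers:      # scan centers in creation order
        d = dist(q, c)
        if d <= r:
            result.append(c)
        if d <= rc + r:                # query ball intersects the center ball
            result.extend(o for o in bucket if dist(q, o) <= r)
        if d + r <= rc:                # query ball fully contained: stop early
            break
    return result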
2.2 Sparse Spatial Selection (SSS)
During construction, this pivot-based strategy selects some objects as pivots from the collection and then computes the distance between the pivots and the objects of the database [4]. The result is a table of distances where the columns are the pivots and the rows are the objects. Each cell in the table contains the distance between the object and the respective pivot. These distances are used to solve queries as follows. For a range query (q, r), the distances between the query and all pivots are computed. The objects x from the collection for which |d(pi, x) − d(pi, q)| > r for some pivot pi can be immediately discarded due to the triangle inequality. The objects that pass this test are considered potential members of the final set of objects that form part of the solution for the query, and therefore they are directly compared against the query by applying the condition d(x, q) ≤ r. The gain in performance comes from the fact that it is much cheaper to perform the calculations for discarding objects using the table than to compute the distance between the candidate objects and the query. A key issue for efficiency is the method employed to calculate the pivots, which must be effective enough to drastically reduce the total number of distance computations between the objects and the query. To select the pivot set, let
(X, d) be a metric space, U ⊂ X an object collection, and M the maximum distance between any pair of objects, M = max{d(x, y) | x, y ∈ X}. The set of pivots initially contains only the first object of the collection. Then, for each element xi ∈ U, xi is chosen as a new pivot if its distance to every pivot in the current set of pivots is equal to or greater than αM, where α is a constant parameter. Therefore, an object in the collection becomes a new pivot if it is located at more than a fraction α of the maximum distance from all the current pivots.
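A compact sketch (ours) of the pivot-selection rule and of the resulting distance table follows; note that computing M exactly is quadratic, so in practice it may be estimated from a sample.

def sss_pivots(collection, dist, alpha, M=None):
    # Sparse Spatial Selection: an object becomes a pivot if it lies at
    # distance >= alpha * M from every pivot chosen so far.
    if M is None:
        # exact maximum distance; O(n^2), usually approximated in practice
        M = max(dist(x, y) for x in collection for y in collection)
    pivots = [collection[0]]
    for x in collection[1:]:
        if all(dist(x, p) >= alpha * M for p in pivots):
            pivots.append(x)
    return pivots

def sss_table(collection, pivots, dist):
    # Table of distances: one row per object, one column per pivot.
    return [[dist(obj, p) for p in pivots] for obj in collection]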
2.3 LC-SSS Combination (Hybrid)
We propose a combination of the List of Clusters (LC) and Sparse Spatial Selection (SSS) indexing strategies. We compute the LC centers and the SSS pivots independently. We form the clusters of LC and, within each cluster, we build an SSS table using the global pivots and the organization of columns and rows described above. We emphasize global SSS pivots because intuition suggests that in each LC cluster one should calculate pivots from the objects located in the respective cluster. However, we have found that the quality of SSS pivots degrades significantly when they are restricted to a subset of the database, and also that their total number tends to be unnecessarily large. We call this strategy hybrid.
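Putting the two previous sketches together, a hybrid index could be assembled as follows (again a sketch under our own naming, reusing build_lc, sss_pivots and sss_table from above):

def build_hybrid(objects, k, dist, alpha):
    # Hybrid index: LC clusters, each holding an SSS distance table
    # computed with pivots selected globally over the whole database.
    pivots = sss_pivots(objects, dist, alpha)      # global SSS pivots
    clusters = build_lc(objects, k, dist)          # (center, radius, bucket) triples
    index = []
    for center, rc, bucket in clusters:
        table = sss_table(bucket, pivots, dist)    # per-cluster distance table
        index.append((center, rc, bucket, table))
    return pivots, index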
3 Optimizing Running Time and Secondary Memory
Our contribution to increasing the performance of the SSS index is as follows. During construction of the table of distances we compute, for each pivot, the cumulative sum of the distances between all objects and that pivot. We then sort the pivots by these values in increasing order and define the final order of pivots as follows. Assume that the sorted sequence of pivots is p1, p2, ..., pn. Our first pivot is p1, the second is pn, the third p2, the fourth pn−1, and so on. We also keep the rows of the table sorted by the values of the first pivot, so that upon reception of a range query q with radius r we can quickly (by binary search) determine between which rows the objects that can be selected as candidates are located. This is because objects oi that are part of the answer can only be located between the rows that satisfy d(p1, oi) ≥ d(q, p1) − r and d(p1, oi) ≤ d(q, p1) + r. In practice, during query processing and after the two binary searches on the first column of the table, we can take advantage of the column × row organization of the table of distances by first performing a few, say v, vertical applications of the triangular inequality on the objects located in the rows delimited by the results of the binary searches, followed by horizontal applications of the triangular inequality to discard as soon as possible all objects that are not potential candidates to be part of the query answer. See Fig. 1, which shows the case of two queries being processed concurrently. For secondary memory, the combination of these strategies has the advantage of increasing the locality of disk accesses, and the processor can keep in main memory the first v columns of the table.
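The query procedure just described can be sketched as follows (ours; it assumes the rows of one cluster table are kept sorted by the first pivot, and v controls how many columns are filtered vertically before switching to row-wise processing):

from bisect import bisect_left, bisect_right

def hybrid_range_query(pivots, rows, objs, q, r, dist, v=1):
    # rows[i] holds the distances of objs[i] to the pivots; rows are sorted
    # by the first pivot column.
    dq = [dist(q, p) for p in pivots]            # query-to-pivot distances
    first_col = [row[0] for row in rows]
    lo = bisect_left(first_col, dq[0] - r)       # two binary searches on the
    hi = bisect_right(first_col, dq[0] + r)      # sorted first column
    candidates = list(range(lo, hi))
    # Vertical phase: apply the triangular inequality column by column.
    for j in range(1, min(v + 1, len(pivots))):
        candidates = [i for i in candidates if abs(rows[i][j] - dq[j]) <= r]
    # Horizontal phase: finish each surviving row, then compare directly.
    result = []
    for i in candidates:
        if all(abs(rows[i][j] - dq[j]) <= r for j in range(len(pivots))):
            if dist(q, objs[i]) <= r:
                result.append(objs[i])
    return result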
Fig. 1. Optimization to the SSS distance table: pivots reordered as p1, pn, p2, pn−1, ..., with vertical followed by horizontal processing of the candidate objects for two concurrent queries Q1 and Q2
Fig. 2. Storing the distance table in blocks composed of a fixed number of disk pages: (a) all objects available at construction time, (b) objects inserted on-line one by one
In the experiments performed in this paper we observed that with v = n/4 we achieved competitive running times. In the following we describe two feasible physical organizations of the index on disk pages. The description is illustrated in Fig. 2, which presents two cases
for the distribution of a distance table with 23 objects and 4 pivots. The table is partitioned into 5 blocks. The first 4 columns contain the distances from the objects to the 4 pivots and the last column contains the object ID associated with each row. The cell located at the bottom-right indicates the physical address of the disk page containing the next table block. Each block is stored in contiguous disk pages. We assume that the main memory is large enough to store two blocks. Fig. 2.a represents a case in which all objects 1 ... 23 are available at construction time, and Fig. 2.b a case in which objects arrive one by one at the index and every time a block is filled up a new one is started. The first case requires an external-memory sort by the first pivot. In the latter case the first column is kept sorted every two blocks, since we are assuming that both fit into main memory; thus external sorting is not required. In the next section we show that both strategies achieve very similar performance, which indicates that the scheme efficiently supports further updates once the index has been constructed from an initial set of objects. In Fig. 2.a and 2.b, the grey cells represent the cases in which the triangular inequality gives a positive match for a range query q with d(q, pi) = {6, 8, 3, 7} for pivots pi and radius r = 3. We assume that the query is solved by performing one vertical operation followed by a horizontal operation for each row selected for the first pivot. In fact, as the first column is sorted by distance, it is only necessary to perform two binary searches to detect the first row with value d(q, p1) − r = 3 and the last row with value d(q, p1) + r = 9. Then the sequence of horizontal applications of the triangular inequality determines that objects 22, 17 and 11 are candidates, which must be directly compared against the query object. Notice that a second vertical operation would have reduced significantly the number of horizontal operations (which is a tradeoff that depends on the application).
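The on-line block organization of Fig. 2.b can be pictured with the following sketch (ours; the class and field names are illustrative). The additional step of keeping the first column sorted within each pair of in-memory blocks is omitted here.

class TableBlock:
    # One on-disk block of the distance table: a fixed number of rows, each
    # holding the distances to the pivots plus the object ID, and the address
    # of the disk page containing the next block.
    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = []            # each row: ([d(o, p1), ..., d(o, pk)], object_id)
        self.next_block = None    # pointer to the next table block

def online_insert(blocks, pivot_dists, obj_id, capacity):
    # On-line construction (Fig. 2.b): append to the last block and start a
    # new one when it fills up; no external sorting is needed.
    if not blocks or len(blocks[-1].rows) == capacity:
        block = TableBlock(capacity)
        if blocks:
            blocks[-1].next_block = block
        blocks.append(block)
    blocks[-1].rows.append((pivot_dists, obj_id))
    return blocks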
4 Experiments
The performance of the Hybrid index was tested with several collections of data. First, we used a collection of 100,000 vectors of dimension 10, synthetically generated with a Gaussian distribution. The Euclidean distance was used as the distance function when working with this collection. We also worked with a collection of 86,061 words taken from the Spanish dictionary, using the edit distance as the distance function. The algorithm was compared with other well-known clustering-based indexing methods: M-Tree [6], GNAT [8], EGNAT [7], and Spatial Approximation Trees (SAT) [3]. We also included in the comparison the LC [1] and SSS [4] strategies, and a recent version of the SSS called the SSSTree [5], which uses a tree structure in which the SSS pivots are used to recursively divide the space.
4.1 Cost of Secondary Memory Access
In the left part of Table 1 we show, for the Spanish dictionary data set, the total number of blocks and objects per block for cases in which we limit the total
number of pivots to 4, 8, 12, 16 and 20. The Writes, Seeks and Reads columns show the disk activity when constructing the index with 90% of the data set using the strategy depicted in Fig. 2.a. The last Writes column shows the case when the same data is indexed on-line using the strategy of Fig. 2.b. In this case no block reads are performed, and blocks are written to disk as soon as they become full during the insertion of objects. In the first case, reads and seeks have to be performed in order to sort by the first column and to move whole rows among blocks. However, the actual difference in running time between the two alternatives is negligible, presumably because of disk-cache effects.

Table 1. Disk activity for index construction

Pivots  Blocks  Objects  Writes  Seeks  Reads  Writes (on-line)
4       378     204      399     761    780    380
8       686     113      721     1373   1408   688
12      994     78       1030    1989   2025   996
16      1291    60       1342    2583   2634   1293
20      1614    48       1676    3229   3291   1616
The next 10% of the data set is used to perform range queries with radii 1, 2, 3 and 4. Figures 3.a and 3.b show the total number of block reads performed during the processing of queries for the two methods of index construction. The differences in disk activity are irrelevant, showing that both approaches achieve similar performance. However, for the large radius 4 the on-line creation of the index tends to generate more activity, because large radii tend to generate a large number of candidate objects, which are expected to be evenly distributed over all blocks.
4.2 Calls to the Distance Evaluation Function
Computing the distance between two complex objects is known to be very expensive in terms of running time in metric-space databases. This provides an implementation-independent basis for comparing different strategies. In the following we review previous comparisons of a number of metric-space indexes and then we compare the best performers with our proposal. Figures 4 and 5 show results for the different data structures proposed so far. The Hybrid strategy achieves the best performance in terms of this metric, though very similar to the LC strategy.
4.3 Comparing Running Times
In Fig. 6 we present results for the running times of the different strategies. The proposed Hybrid achieves the best performance in most cases. Notice that structures such as the SAT achieve better performance than ours for range queries with large radii. The results suggest that SAT performs significantly better for large r. However, for these radii almost all objects are part of the solution to the query and we do not see a practical use of such queries in actual applications.
Fig. 3. Disk seeks and their respective block reads during range queries (number of accesses to disk vs. search range, for 4, 8, 12, 16 and 20 pivots; Spanish dictionary, n = 86,061 words). Panels (a) and (b) correspond to the two index construction methods.
Fig. 4. Number of calls to the distance evaluation function per query for different metric-space index data structures (Hyb, SSS, LC, SAT, EGNAT, SSSTree, M-Tree). Results for the Spanish dictionary data set.
Fig. 5. Number of calls to the distance evaluation function per query for different metric-space index data structures (Hyb, LC, SSS, SAT, M-Tree, GNAT, EGNAT, SSSTree). Results for the Gaussian vector space.
Fig. 6. Total running times for processing 10,000 queries with the Spanish dictionary (left) and a Gauss vector data set (right), for the SSS, LC, SAT and Hybrid strategies.
Fig. 7. Running time for the three main components (triangle inequality evaluations, distance evaluations and total) in the execution of the Hybrid, for alpha between 0.3 and 0.9.
Finally, Fig. 7 shows results for the cumulative running time involved in accessing the distance table and executing the distance evaluation function, for different values of the parameter α, namely different numbers of pivots. The results show a tradeoff between both costs, with the optimum at α = 0.7.
5 Conclusions
We have presented a simple but very efficient strategy to solve queries in metric-space databases. Our strategy achieves better performance than most other strategies. However, it is not able to significantly outperform a
tree-based structure called the SSSTree, which is in fact based on a strategy quite similar to ours. However, our strategy has clear advantages with respect to secondary memory management and the total memory used by the index. Also, the organization of the index in terms of a table with columns and rows allows it to exploit in an optimal way the parallelism available in the new multi-core computer architectures devised to support multi-threading in hardware. We are currently evaluating the performance gain on these architectures by solving queries using standard OpenMP.
Acknowledgments. This work has been partially funded by FONDECYT project 1060776, UMAG PR-F1-002IC-06, and UNSL PICT 2002-11-12600.
References
1. Chávez, E., Navarro, G.: A compact space decomposition for effective metric indexing. Pattern Recognition Letters 26(9), 1363–1376 (2005)
2. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric spaces. ACM Computing Surveys 3(33), 273–321 (2001)
3. Navarro, G.: Searching in metric spaces by spatial approximation. The Very Large Databases Journal (VLDBJ) 711(1) (2002)
4. Brisaboa, N., Pedreira, O.: Spatial selection of sparse pivots for similarity search in metric spaces. In: van Leeuwen, J., Italiano, G.F., van der Hoek, W., Meinel, C., Sack, H., Plášil, F. (eds.) SOFSEM 2007. LNCS, vol. 4362, pp. 434–445. Springer, Heidelberg (2007)
5. Brisaboa, N., Pedreira, O., Seco, D., Solar, R., Uribe, R.: Clustering-based similarity search in metric spaces with sparse spatial centers. In: Geffert, V., et al. (eds.) SOFSEM 2008. LNCS, vol. 4910, pp. 186–197. Springer, Heidelberg (2008)
6. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 1997), pp. 426–435 (1997)
7. Uribe, R., Navarro, G., Barrientos, R., Marin, M.: An index data structure for searching in metric space databases. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3991, pp. 611–617. Springer, Heidelberg (2006)
8. Brin, S.: Near neighbor search in large metric spaces. In: 21st Conference on Very Large Databases (1995)
Structural Testing for Semaphore-Based Multithread Programs

Felipe S. Sarmanho, Paulo S.L. Souza, Simone R.S. Souza, and Adenilso S. Simão

Universidade de São Paulo, ICMC, São Carlos - SP, 668, Brazil
{sarmanho,pssouza,srocio,adenilso}@icmc.usp.br
Abstract. This paper presents structural testing criteria for the validation of semaphore-based multithread programs, exploring control, data, communication and synchronization information. A post-mortem method based on timestamps is defined to determine the implicit communication among threads using shared variables. The applicability of the coverage testing criteria is illustrated by a case study.
Keywords: software testing, multithread programs, testing criteria.
1 Introduction
Concurrent programming is important to reduce the execution time in several application domains, such as image processing and simulations. A concurrent program is a group of processes (or threads) that execute simultaneously and work together to perform a task. These threads access a common address space and interact through memory (using shared variables). The most common method to develop multithread programs is to use thread libraries, like PThreads (POSIX Threads). Concurrent program testing is not trivial. Features like synchronization, inter-thread communication and non-determinism make this activity complex [1]. Multiple executions of a concurrent program with the same input may present different results due to different synchronization and communication sequences. Petascale systems also add more factors to this scenario, making it even worse [2]. Structural testing is a test technique that uses source code information to guide the testing activity. Coverage criteria are defined to apply structural testing. A testing criterion is a predicate to be satisfied by a set of test cases and can be used as a guide for test data generation. It is also a good heuristic to indicate defects in programs and thus to improve their quality. This activity is composed of: (1) static analysis to obtain the necessary data about the source code, usually building a Control Flow Graph (CFG) [3]; (2) determining the required elements for the chosen coverage criterion; and (3) analyzing the coverage reached in the source code by the test cases, based on the coverage criterion.
This work is supported by CNPq.
In the literature there are some works that address testing of concurrent programs [4, 5, 6, 7, 8, 9]. Most of these works propose a test model to represent the concurrent program and to support the testing application. The Concurrency States Graph (CG) is a CFG extension proposed in [4], in which nodes represent concurrency states while edges represent the actions required for transitions between these states. That work considers concurrent languages with explicit synchronization using rendezvous-style mechanisms, such as Ada and CSP. It presents coverage criteria adapted to the CG; however, its usage is limited in practice by the state space explosion problem. The PPFG (Parallel Program Flow Graph) is a graph in which the concept of a synchronization node is added to the CFG [5, 10]. In the PPFG each process that composes the program has its own CFG. The synchronization nodes are then connected based on the possible synchronizations. This model was proposed to adapt the all-du-path criterion to concurrent programs. The PCFG (Parallel Control Flow Graph) also adapts the CFG to the context of parallel programs in message-passing environments [7]. The PCFG includes the concept of synchronization nodes, which are used to represent the send and receive primitives. The concept of variables was extended to consider the concept of communicational use (s-use). Coverage criteria were also proposed in [7], based on models of control and data flow for message-passing programs. Lei and Carver propose an approach called reachability testing. Reachability testing is a combination of deterministic and non-deterministic execution, where the information and the required elements are generated on-the-fly, without static analysis [6]. This proposal guarantees that all feasible synchronization sequences will be exercised at least once. The lack of static analysis means it cannot say how many executions are required, which leads to the state space explosion problem. In recent work, Lei et al. [9] present a combinatorial approach, called t-way, to reduce the number of synchronization sequences to be executed. These related works bring relevant improvements to concurrent program testing. However, few works investigate the application of testing coverage criteria and supporting tools in the context of multithreaded programs. For these programs, new aspects need to be considered. For instance, data flow information must consider that an association between one variable definition and its use can occur in different threads. The implicit inter-thread communication that occurs through shared memory makes the test activity complex. The investigation of these challenges is not a trivial task and presents many difficulties. To overcome these difficulties, we present a family of structural testing criteria for semaphore-based multithread programs and a new test model to support the criteria. This model includes important features such as synchronization, communication, parallelism and concurrency. These data are collected using static and dynamic analyses. Information about communication is obtained after the execution of an instrumented version of the program, using a post-mortem methodology. This methodology has been adapted from the work of Lei and Carver [6]. Testing criteria were defined to exploit the control and data flows of these programs, considering their sequential and parallel aspects. The
main contribution of the testing criteria proposed in this paper is to provide a coverage measure that can be used to evaluate the progress of the testing activity. This is important to evaluate the quality of test cases, as well as to decide when a program has been tested enough. It is important to point out that the objective of this work is not to debug concurrent programs in which an error has already been revealed.
2 Test Model for Shared Memory Programs
Let MT = {t0, t1, ..., tn−1} be a multithread program composed of n threads denoted by ti. Threads can execute different functionalities, but they all share the same memory address space. They may also use additional private memory. Each thread t has its own control flow graph CFGt, built using the same concepts as for traditional programs [3]. In short, the CFG of a thread t is composed of a set of nodes Nt and a set of edges EIt. Edges that link nodes in the same thread are called intra-thread edges. Each node n in the thread t is represented by the notation nti, a well-known terminology in the software testing context. Each node corresponds to a set of commands that are sequentially executed or can be associated with a synchronization primitive (post or wait). A multithread program MT is associated with a Parallel Control Flow Graph for Shared Memory (PCFGSM), which is composed of both the CFGt (for 0 ≤ t < n) and the representation of the synchronization among threads. N and E represent the set of nodes and edges of the PCFGSM, respectively. For the construction of the PCFGSM, it is assumed that (1) n is fixed and known at compilation time; (2) there is implicit communication by means of shared variables; (3) there is explicit synchronization using semaphores (which have two basic atomic primitives: post (or p) and wait (or w)); and (4) initialization and finalization of threads act as a synchronization over a virtual semaphore. Three subsets of N are defined: Nt (nodes in the thread t), Np (nodes with post primitives) and Nw (nodes with wait primitives). For each nti ∈ Np with a post to a semaphore sem, we associate a set Mw(nti), defined as the set of nodes nqj ∈ Nw such that there exist a thread q ∈ [0..n − 1] and a wait primitive with respect to (w.r.t.) sem in nqj. Similarly, for each nti ∈ Nw with a wait on a semaphore sem, we associate a set Mp(nti), defined as the set of nodes nqj ∈ Np such that there exist a thread q ∈ [0..n − 1] and a post primitive w.r.t. sem in nqj. In other words, Mw(nti) contains all possible wait nodes that can match with nti and Mp(nti) contains all possible post nodes that can match with nti. Using the above definitions, we also define the set ES ⊂ E that contains the edges that represent the synchronization (edge-s) between two threads, such that:

ES = {(ntj, nqk) | ntj ∈ Mp(nqk) ∧ nqk ∈ Mw(ntj)}    (1)
The concurrent program shown in Fig. 1 is used to illustrate these definitions. This program implements the producer-consumer problem with a limited buffer, using the PThreads library in ANSI C. There are three threads: (1) a master,
Fig. 1. Producer-Consumer implemented with PThreads/ANSI C
which initializes variables and creates the producer and consumer threads; (2) a producer, which populates the buffer; (3) a consumer, which removes items from the buffer for further processing. Table 1 contains the values of all sets introduced above. Figure 2 shows the PCFGSM for the program in Fig. 1. t0, t1 and t2 represent the master, producer and consumer threads, respectively. Dotted lines represent synchronization edges. Some examples of synchronization edges are: (92, 61) is a synchronization over a semaphore, (90, 12) is an initialization synchronization and (121, 100) is a finalization synchronization. Note that there may exist internal synchronization edges, such as (101, 61) and (92, 52) in Fig. 2. A path πt = (nt1, nt2, ..., ntj), where (nti, nti+1) ∈ EIt, is intra-thread if it has no synchronization edges. A path that includes at least one synchronization edge is called an inter-thread path and is denoted by Π = (PATHS, SYNC), where PATHS = {π1, π2, ..., πn} and SYNC = {(pti, wjq) | (pti, wjq) ∈ ES} [7]. Here pti is a post node i in thread t and wjq is a wait node j in thread q. The PCFGSM also captures information about data flow. Besides local variables, multithread programs have two more special types of variables: (1) shared variables, used for communication; and (2) synchronization variables, used by semaphores. V denotes all variables. VLt ⊂ V contains the local variables of thread t. VC ⊂ V contains the shared variables and VS ⊂ V contains the synchronization variables. Therefore, we define: def(nti) = {x | x is a variable defined in nti}.
Table 1. Sets of the test model introduced for the program shown in Fig. 1
Fig. 2. PCFGSM graph that represents the program shown in Fig. 1
n = 3
MT = {t0, t1, t2}
N0 = {10, 20, 30, ..., 120}    N1 = {11, 21, 31, ..., 121}    N2 = {12, 22, 32, ..., 122}
N = N0 ∪ N1 ∪ N2
Np = {30, 50, 60, 80, 90, 101, 111, 121, 92, 102, 122}
Nw = {100, 110, 11, 51, 61, 12, 42, 52}
EI0 = {(10, 20), (20, 30), ..., (100, 110), (110, 120)}
EI1 = {(11, 21), (21, 31), ..., (111, 31), (31, 121)}
EI2 = {(12, 22), (22, 32), ..., (112, 32), (32, 122)}
Es = {(30, 61), (30, 52), (50, 51), (60, 51), (80, 11), (90, 12), (101, 52), (101, 61), (111, 42), (121, 100), (121, 110), (92, 61), (92, 52), (102, 51), (122, 100), (122, 110)}
E = EI0 ∪ EI1 ∪ EI2 ∪ Es
Mw(30) = {61, 52}       Mp(100) = {121, 122}
Mw(50) = {51}           Mp(110) = {121, 122}
Mw(60) = {51}           Mp(11) = {80}
Mw(80) = {11}           Mp(51) = {50, 60, 102}
Mw(90) = {12}           Mp(61) = {30, 101, 92}
Mw(101) = {61, 52}      Mp(12) = {90}
Mw(111) = {42}          Mp(42) = {111}
Mw(121) = {100, 110}    Mp(52) = {30, 101, 92}
Mw(92) = {52, 61}       Mw(102) = {51}
Mw(122) = {100, 110}
VL0 = {prod_h, cons_h}    VL1 = {prod, item}    VL2 = {cons, my_item}
VC = {queue, avail}
VS = {mutex, full, empty}
def(10) = {avail}    def(91) = {prod}
def(21) = {prod}     def(22) = {cons}
def(41) = {item}     def(62) = {cons}
def(71) = {queue}    def(72) = {avail}
def(81) = {avail}    def(82) = {my_item}
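As an illustration of Eq. (1), the following fragment (ours, not the paper's tool) derives synchronization edges from a few of the Mp/Mw entries of Table 1; node i of thread t is written here as the string "i_t".

# Possible matching wait nodes for some post nodes (a fragment of Table 1).
Mw = {"3_0": {"6_1", "5_2"}, "5_0": {"5_1"}, "9_2": {"5_2", "6_1"}}
# Possible matching post nodes for the corresponding wait nodes.
Mp = {"6_1": {"3_0", "10_1", "9_2"}, "5_1": {"5_0", "6_0", "10_2"},
      "5_2": {"3_0", "10_1", "9_2"}}

# Eq. (1): an edge (post, wait) exists when each node appears in the
# other's match set.
Es = {(p, w) for p, waits in Mw.items()
             for w in waits
             if p in Mp.get(w, set())}
print(sorted(Es))
# five edges: (3_0, 5_2), (3_0, 6_1), (5_0, 5_1), (9_2, 5_2), (9_2, 6_1)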
A path πt = (n1, n2, ..., nj, nk) is definition-clear w.r.t. a local variable x ∈ VLt from n1 to node nk or edge (nj, nk) if x ∈ def(n1) and x ∉ def(ni), for i ∈ [2..j]. The notion of a definition-clear path is not applicable to shared variables because the communication (definition and use of shared variables) between threads is implicit. It is hard to establish a path that statically defines and uses shared variables. In Section 2.1, we present a method to determine execution-based definition-clear paths for shared variables using a post-mortem methodology. The use of variables in multithread programs can be: computational use (c-use): computational statements related to a local variable x ∈ VLt; predicative use (p-use): conditional statements that modify the control flow of the thread and are related to a local variable x ∈ VLt; synchronization use (sync-use): synchronization statements on semaphore variables x ∈ VS; communicational c-use (comm-c-use): computational statements related to a shared variable
x ∈ VC; and communicational p-use (comm-p-use): conditional statements that modify the control flow of the thread, related to a shared variable x ∈ VC. Based on these definitions, we establish associations between variable definitions and uses. Five kinds of associations are defined:
– c-use association: defined by a triple (nti, ntj, x) iff x ∈ VLt, x ∈ def(nti), ntj has a c-use of x and there is at least one definition-clear path w.r.t. x from nti to ntj.
– p-use association: defined by a triple (nti, (ntj, ntk), x) iff x ∈ VLt, x ∈ def(nti), (ntj, ntk) has a p-use of x and there is at least one definition-clear path w.r.t. x from nti to (ntj, ntk).
– sync-use association: defined by a triple (nti, (ntj, nqk), sem) iff sem ∈ VS, (ntj, nqk) has a sync-use of sem and there is at least one definition-clear path w.r.t. sem from nti to (ntj, nqk).
– comm-c-use association: defined by a triple (nti, nqj, x) iff x ∈ VC, x ∈ def(nti) and nqj has a c-use of the shared variable x.
– comm-p-use association: defined by a triple (nti, (nqj, nqk), x) iff x ∈ VC, x ∈ def(nti) and (nqj, nqk) has a p-use of the shared variable x.
2.1 Applying Timestamps to Determine Implicit Communication
In this section, we present a method to establish pairs of definition and use of shared variables. These pairs are obtained after execution of the multithread program, by identifying the order in which the concurrent events happened. Lamport [11] presented a way to order concurrent events by means of a happens-before relationship. This relationship can determine whether an event e1 occurs before an event e2, denoted by e1 ≺ e2. To obtain this happens-before relationship it is necessary to assign timestamps to concurrent events. Lei and Carver [6] presented a method to assign timestamps that uses local logical clocks. We adapt this method to assign timestamps in our testing method. The method obtains all synchronizations that happened in an execution and thus generates the communication events. The method assigns a local logical clock vector, denoted by ti.cv, to each thread ti. This vector has dimension n, where n is the total number of threads. Each position i ∈ [0..n − 1] of the clock vector is associated with thread ti. Position i of the clock vector is updated when a new event occurs in thread ti. For instance, observe the c1 event in t0 (before it, the clock vector was [0, 0, 0]). When a synchronization event occurs in tj, other positions i, for i ≠ j, of the clock vector can also be updated. For instance, consider the match (p2, w2). Before this synchronization the clock vector associated with t2 was [0, 0, 0]. Afterwards, the values were updated to [2, 4, 1]. The logical space-time diagram shown in Fig. 3 illustrates the method, using a hypothetical example. This diagram only considers synchronization (pi and wj) and communication (ck) events. Vertical lines represent the logical time of each thread. Arrows among threads represent synchronization events matching post and wait events. For instance, wait events w1 and w2 race for the same post event p1, but the match (w1, p1) has occurred. It is possible that a wait primitive has several posts to match. These posts are inserted in a queue. Our method uses the LIFO access criterion to obtain the happens-before relationship. We chose LIFO to obtain the most up-to-date timestamps.
Fig. 3. Example of logical space-time diagram
Fig. 4. Example of nondeterminism
Rules defined in [6] are used to establish whether an event e1 happens before an event e2. These rules are not shown here for the sake of space. With this method, it is possible to determine the communications that happened in a program execution.
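The vector-clock bookkeeping described above can be sketched as follows (our illustrative adaptation, not the authors' implementation):

def new_clock(n):
    return [0] * n                        # one entry per thread

def local_event(cv, i):
    # Computation or communication event in thread i: advance its own entry.
    cv[i] += 1
    return list(cv)

def synchronization(post_cv, wait_cv, poster, waiter):
    # A matched (post, wait) pair: both events tick their own entries and the
    # waiting thread merges the poster's knowledge (component-wise maximum).
    post_cv[poster] += 1
    wait_cv[waiter] += 1
    wait_cv[:] = [max(a, b) for a, b in zip(post_cv, wait_cv)]
    return list(post_cv), list(wait_cv)

def happens_before(cv_e1, cv_e2):
    # e1 precedes e2 iff e1's timestamp is component-wise <= e2's and differs.
    return all(a <= b for a, b in zip(cv_e1, cv_e2)) and cv_e1 != cv_e2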
3 Coverage Criteria
Based on the control, data and communication flow models and definitions presented in the previous section, we propose two sets of structural testing criteria for shared-memory parallel programs. These criteria allow the testing of sequential and parallel aspects of the programs.

Control Flow and Synchronization-based Criteria
– All-p-nodes criterion: the test set must execute all nodes nti ∈ Np.
– All-w-nodes criterion: the test set must execute all nodes nti ∈ Nw.
– All-nodes criterion: the test set must execute all nodes nti ∈ N.
– All-s-edges criterion: the test set must execute all synchronization edges (nti, nqj) ∈ Es.
– All-edges criterion: the test set must execute all edges (ni, nj) ∈ E.

Data Flow and Communication-based Criteria
– All-def-comm criterion: the test set must execute paths that cover an association comm-c-use or comm-p-use for every definition of x ∈ VC.
– All-def criterion: the test set must execute paths that cover an association c-use, p-use, comm-c-use or comm-p-use for every definition of x ∈ def(nti).
– All-comm-c-use criterion: the test set must execute paths that cover all comm-c-use associations.
– All-comm-p-use criterion: the test set must execute paths that cover all comm-p-use associations.
– All-c-use criterion: the test set must execute paths that cover all c-use associations.
– All-p-use criterion: the test set must execute paths that cover all p-use associations.
– All-sync-use criterion: the test set must execute paths that cover all sync-use associations.

It is necessary to know which path was exercised in order to evaluate the required elements covered by an execution. One option to obtain this information is to instrument the source code to produce an execution trace. This instrumentation can change the original program behaviour. However, this interference does not affect the structural testing proposed here, because it does not prevent the extraction and the future execution of all possible pairs of synchronization. Due to non-determinism, executions of a program with the same input can cause different event sequences to occur. Fig. 4 shows an example where the nodes 81 and 91 in t1 have non-deterministic waits and the nodes 20 (t0) and 22 (t2) have posts to t1. All these operations are on the same semaphore. This case illustrates the possible synchronizations among these threads. During the testing activity it is essential to guarantee that these synchronizations are executed. Controlled execution is a mechanism used to achieve deterministic execution, i.e. two executions of the program with the same input are guaranteed to execute the same instructions and the specified synchronization sequence. The controlled execution used in this work was adapted from Carver's method [12]. Table 2 shows some required elements for the criteria defined in this section. These required elements are obtained by the static analysis.

Table 2. Some required elements by the proposed criteria for the program of Fig. 1

Criteria          Required Elements                                                          Total
All-nodes-p       30, 50, 60, 80, 90, 101, 111, 121, 92, 102, 122                            11
All-nodes-w       100, 110, 11, 51, 61, 12, 42, 52                                           8
All-nodes         10, 20, 30, ..., 120, 11, 21, ..., 121, 12, 22, ..., 122                   36
All-edges-s       (30, 61), (30, 52), (50, 51), (60, 51), (80, 11), (90, 12), (101, 52),
                  (101, 61), (111, 42), (121, 100), (121, 110), (92, 61), ...                16
All-edges         (10, 20), (20, 30), ..., (110, 120), (11, 21), (21, 31), ..., (111, 31),
                  (31, 121), (12, 22), ..., (112, 32), (32, 122), (30, 61), (30, 52), ...,
                  (90, 12), (101, 61), ...                                                   51
All-def-comm      (10, 71, avail), (81, 71, avail), (72, 82, avail), (71, 82, queue)         4
All-def           (10, 82, avail), (21, (31, 41), prod), (41, 71, item), (71, 82, queue),
                  (81, 71, avail), (62, (32, 122), cons), ...                                10
All-comm-c-use    (10, 71, avail), (10, 72, avail), (72, 72, avail), (72, 81, avail),
                  (71, 82, queue), ...                                                       13
All-comm-p-use    ∅                                                                          0
All-c-use         (81, 72, avail), (72, 82, avail), (21, 91, prod), (91, 91, prod),
                  (41, 71, item), (71, 82, queue), (82, 112, my_item), ...                   16
All-p-use         (21, (31, 41), prod), (21, (31, 121), prod), (62, (32, 42), cons),
                  (62, (32, 122), cons), ...                                                 8
All-sync-use      (20, (30, 61), mutex), (20, (30, 51), mutex), (50, (60, 51), empty),
                  (61, (101, 61), mutex), (52, (92, 61), mutex), ...                         14

4 Case Study
In order to illustrate the proposed testing criteria, consider the program in Fig. 1. The buffer is limited to two produced/consumed items. Due to thread scheduling, two executions are possible: (1) produce, consume, produce, consume
(PCPC); (2) produce, produce, consume, consume (PPCC). Using controlled execution, it is possible to force the order of these executions. Considering the first execution (PCPC), the executed paths and their synchronizations are:
π0 = {1,2,3,4,5,6,8,9,10,11,12}
π1 = {1,2,3,4,5,6,8,9,10,11,3,4,5,6,7,8,9,10,11,12}
π2 = {1,2,3,4,5,6,8,9,10,11,3,4,5,6,7,8,9,10,11,12}
SYNC = {(80, 11), (60, 51), (30, 61), (90, 12), (111, 42), (101, 52), (50, 51), (92, 61), (121, 100), (111, 42), (101, 52), (122, 110)}
For this execution, some covered elements are the edges-s (60, 51) and (122, 110) and the comm-c-uses (10, 71, avail), (72, 81, avail), (71, 82, queue). To illustrate how the testing criteria can contribute to revealing faults, consider that the mutex semaphore is initialized with the value 0 or 2 in the main function (code line 17). This will cause a deadlock state or an inappropriate concurrent access to shared variables, respectively. An execution that covers the required elements (30, 61) and (10, 72, avail), an edge-s and a comm-c-use respectively, will reveal the fault in the deadlock case. The execution of the required element comm-c-use (81, 82, queue) will reveal the fault in the case of inappropriate concurrent access. In both cases other required elements can also reveal these faults. To illustrate a communication fault, consider that avail is initialized with 1 (node 10) and all synchronizations are correct. This fault can be revealed by the execution of the required elements comm-c-use (72, 72, avail) and edge-s (92, 52). It is necessary to execute the PPCC sequence to reveal this fault, since the paths executed with the PCPC sequence do not reveal it.
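As an illustration of how coverage can be measured, the fragment below (ours; nodes are written as "i_t" for node i of thread t) intersects the all-s-edges required elements of Table 1 with the SYNC set observed for the PCPC execution, showing that 10 of the 16 synchronization edges are covered and which ones still require further executions:

# Required synchronization edges (Es from Table 1).
required = {("3_0","6_1"), ("3_0","5_2"), ("5_0","5_1"), ("6_0","5_1"), ("8_0","1_1"),
            ("9_0","1_2"), ("10_1","5_2"), ("10_1","6_1"), ("11_1","4_2"),
            ("12_1","10_0"), ("12_1","11_0"), ("9_2","6_1"), ("9_2","5_2"),
            ("10_2","5_1"), ("12_2","10_0"), ("12_2","11_0")}
# Distinct pairs from the SYNC set of the PCPC execution above.
executed = {("8_0","1_1"), ("6_0","5_1"), ("3_0","6_1"), ("9_0","1_2"), ("11_1","4_2"),
            ("10_1","5_2"), ("5_0","5_1"), ("9_2","6_1"), ("12_1","10_0"),
            ("12_2","11_0")}

covered = required & executed
missing = required - executed
print(f"all-s-edges coverage: {len(covered)}/{len(required)}")   # 10/16
print("still required:", sorted(missing))  # includes (9_2, 5_2), needing PPCC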
5 Conclusion
Concurrent program testing is not a trivial activity. This paper contributes in this context by addressing some of these problems for semaphore-based multithreaded programs. The paper introduced both structural testing criteria to validate shared-memory parallel programs and a new test model to capture information about control, data, communication and synchronization from these programs. The paper also presents a post-mortem method based on timestamps to determine which communications (related to shared variables) happened in an execution. This information is important to establish the pairs of definition and use of shared variables. The proposed testing criteria are based on models of control and data flow and include the main features of commonly used PThreads/ANSI C programs. The model considers communication, concurrency and synchronization faults among threads and also faults related to sequential aspects of each thread. The use of the proposed criteria contributes to improving the quality of the test cases. The criteria offer a coverage measure that can be used in two testing procedures: firstly, for the generation of test cases, where the criteria can be used as a guideline for test data selection; secondly, for the evaluation of a test set. The
criteria can be used to determine when the testing activity can be ended and also to compare test sets. The evolution of our work on this subject is directed to several lines of research: 1) development of a supporting tool for the introduced testing criteria (it is now being implemented); 2) development of experiments to refine and evaluate the testing criteria; 3) implementation of mechanisms to validate multithread programs that dynamically create threads; and 4) conduction of an experiment to evaluate the efficacy of the generated test data against ad hoc test sets.
References [1] Yang, C.D., Pollock, L.L.: The challenges in automated testing of multithreaded programs. In: 14th Int. Conference on Testing Computer Software, pp. 157–166 (1997) [2] Dongarra, J.J., Walker, D.W.: The quest for petascale computing. Computing in Science and Engineering 03(3), 32–39 (2001) [3] Rapps, S., Weyuker, E.: Selecting software test data using data flow information. IEEE Transactions on Software Engineering SE-11(4), 367–375 (1985) [4] Taylor, R.N., Levine, D.L., Kelly, C.D.: Structural testing of concurrent programs. IEEE Trans. on Software Engineering 18(3), 206–215 (1992) [5] Yang, C.S.D., Souter, A.L., Pollock, L.L.: All-du-path coverage for parallel programs. In: Young, M. (ed.) ISSTA 1998: Proc. of the ACM SIGSOFT Int. Symposium on Software Testing and Analysis, pp. 153–162 (1998) [6] Lei, Y., Carver, R.: Reachability testing of concurrent programs. IEEE Trans. on Software Engineering 32(6), 382–403 (2006) [7] Vergilio, S.R., Souza, S.R.S., Souza, P.S.L.: Coverage testing criteria for messagepassing parallel programs. In: 6th LATW, Salvador, Ba, pp. 161–166 (2005) [8] Edelstein, O., Farchi, E., Goldin, E., Nir, Y., Ratsaby, G., Ur, S.: Framework for testing multi-threaded Java programs. Concurrency and Computation: Practice and Experience 15(3–5), 485–499 (2003) [9] Lei, Y., Carver, R.H., Kacker, R., Kung, D.: A combinatorial testing strategy for concurrent programs. Softw. Test., Verif. Reliab. 17(4), 207–225 (2007) [10] Yang, C.S.D., Pollock, L.L.: All-uses testing of shared memory parallel programs. Softw. Test, Verif. Reliab. 13(1), 3–24 (2003) [11] Lamport, L.: The implementation of reliable distributed multiprocess systems. Computer Networks 2, 95–114 (1978) [12] Carver, R.H., Tai, K.C.: Replay and testing for concurrent programs. IEEE Softw. 8(2), 66–74 (1991)
Algorithms of Basic Communication Operation on the Biswapped Network Wenhong Wei and Wenjun Xiao Department of Computer Science, South China University of Technology, 510641 Guangzhou, China [email protected], [email protected]
Abstract. The biswapped network (BSN) is a new topology for interconnection networks in multiprocessor systems. A BSN is built of 2n copies of an n-node basic network, for a total of 2n² nodes. Some topological properties of the BSN have been investigated, and some algorithms, such as sorting and matrix multiplication, have been developed on it. In this paper, we develop algorithms for some basic communication operations: broadcast, prefix sum and data sum.
Keywords: BSN, Broadcast, Prefix sum, Data sum.
1 Introduction
The swapped network, also called the OTIS network, has important applications in parallel processing [1,2]. In this architecture, n² processors are divided into n groups of n processors each; processors in the same group are connected by intra-group links, and the groups are connected by inter-group links. However, the swapped network is not a Cayley graph and therefore not a symmetrical network architecture, so some algorithms on it are not always convenient. To remedy this limitation of the swapped network, [3] proposed the biswapped network (BSN); the new network is a Cayley graph whenever the basic network is a Cayley graph, and it is tightly related to the swapped network. The BSN is more regular than the swapped network. It is built of 2n copies of an n-node basic network using a simple connectivity rule that ensures regularity, modularity, fault tolerance and algorithmic efficiency. Some topological properties of the BSN have been investigated [3], and some algorithms, such as sorting and matrix multiplication, have been developed on it [4]. In most parallel algorithms, processors need to exchange data with other processors, so it is most important to develop algorithms for basic communication operations; such algorithms can be used to arrive at efficient parallel algorithms for numerous applications in image processing, computational geometry, matrix algebra, graph theory, and so forth [5]. In [6], Wang and Sahni developed algorithms for basic operations on the OTIS-Mesh; their algorithms, including broadcast, prefix sum and data sum, can only be applied to the OTIS-Mesh. In this paper, we develop deterministic algorithms of basic communication
operations for parallel computation on the BSN, such as broadcast, prefix sum and data sum, and analyze the time complexity of these algorithms. According to [4], the BSN has better topological properties than OTIS, and the basic communication algorithms on the BSN proposed here are more general and better than those on the OTIS-Mesh. For example, on a BSN-Mesh with 2n² processors, our broadcast algorithm's time complexity is 4√n − 2, while on an OTIS-Mesh with n² processors the broadcast algorithm of [6] has time complexity 4√n − 3. As the number of processors in our network is larger than theirs, we can conclude that our broadcast algorithm is better than theirs. The remainder of this paper is organized as follows. In Section 2, we give the definition of the BSN. Section 3 presents the basic data communication algorithms on the BSN, including broadcast, prefix sum and data sum, and analyzes their time complexity. Finally, in Section 4, we provide some concluding remarks.
2 Introduction of BSN
Definition 1. Let Ω be a graph with the vertex set V(Ω) = {h1, h2, ..., hn} and the arc set E(Ω). Our biswapped network Σ(Ω) = Σ = (V(Σ), E(Σ)) is a graph defined as follows [3]:
V(Σ) = {〈g, p, 0〉, 〈g, p, 1〉 | g, p ∈ V(Ω)} and
E(Σ) = {(〈g, p1, 0〉, 〈g, p2, 0〉), (〈g, p1, 1〉, 〈g, p2, 1〉) | (p1, p2) ∈ E(Ω), g ∈ V(Ω)} ∪ {(〈g, p, 0〉, 〈p, g, 1〉), (〈g, p, 1〉, 〈p, g, 0〉) | g, p ∈ V(Ω)}
Intuitively, if we regard the basic network as a group, the definition postulates 2n groups, each group being an Ω digraph: n groups, with nodes numbered 〈group#, processor#, 0〉, form part 0 of the bipartite graph, and the other n groups constitute part 1, with associated node numbers 〈group#, processor#, 1〉. Each group p in either part of Σ has the same internal connectivity as Ω (intra-group edges, forming the first set in the definition of E(Σ)). In addition, node g of group p in part 0/1 is connected to node p in group g of part 1/0 (inter-group or swap edges, the second set in the definition of E(Σ)). The name "biswapped network" (BSN) arises from two defining properties of the network just introduced: when groups are viewed as super-nodes, the resulting graph of super-nodes is a complete 2n-node bipartite graph, and the inter-group links connect nodes in which the group number and the node number within the group are interchanged, or swapped. When Ω = C4 is a ring, an example of the network Σ(C4) is shown in Fig. 1.
Fig. 1. An example of the BSN with Ω=C4
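To make Definition 1 concrete, the following minimal Java sketch enumerates the vertices and edges of Σ(Ω) from an adjacency list of Ω. The representation, class and method names are ours and purely illustrative; edges are stored as ordered pairs even though the network itself is undirected.

import java.util.*;

// Builds the vertex count and edge set of the biswapped network Sigma(Omega)
// for a basic graph Omega given as an edge list over vertices 0..n-1.
public class Biswapped {
    // A node <g, p, part> encoded as a string for brevity.
    static String node(int g, int p, int part) { return "<" + g + "," + p + "," + part + ">"; }

    static Set<List<String>> edges(List<int[]> basicEdges, int n) {
        Set<List<String>> e = new HashSet<>();
        // Intra-group edges: one copy of Omega inside every group of both parts.
        for (int g = 0; g < n; g++)
            for (int[] pq : basicEdges)
                for (int part = 0; part <= 1; part++)
                    e.add(Arrays.asList(node(g, pq[0], part), node(g, pq[1], part)));
        // Inter-group (swap) edges: <g,p,0> connected to <p,g,1>.
        for (int g = 0; g < n; g++)
            for (int p = 0; p < n; p++)
                e.add(Arrays.asList(node(g, p, 0), node(p, g, 1)));
        return e;
    }

    public static void main(String[] args) {
        // Omega = C4, the ring of Fig. 1: edges 0-1, 1-2, 2-3, 3-0.
        List<int[]> c4 = Arrays.asList(new int[]{0,1}, new int[]{1,2}, new int[]{2,3}, new int[]{3,0});
        System.out.println(edges(c4, 4).size() + " edges, " + (2 * 4 * 4) + " nodes");
    }
}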
As in the swapped network (or OTIS), links between vertices of the same group are regarded as intra-group links. Links between vertices of different groups follow a swapping strategy and are regarded as inter-group links.
3 Basic Communication Operations on the BSN
3.1 Broadcast
Broadcast is perhaps the most fundamental operation in parallel computing. In this operation, data initially held by a single processor is transmitted to all other processors in the network. For example, if processor <0, 0, 0> of the BSN has value A, all 2n² processors of the BSN have value A after broadcasting. Suppose that broadcast is applied in all-ports mode; it can be accomplished using the following four-step algorithm if we suppose processor u (u=
Step 1: processor u transmits its data x to processor v (v=
Fig. 2. An example of broadcast on the BSN with Ω=C3
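A sequential sketch of the broadcast is given below. The concrete ordering of the four phases is our assumption, chosen only to match the structure used in the proof of Theorem 1 (two intra-group broadcasts plus two inter-group moves); it is not a transcription of Table 1, and the data layout is ours.

// Sequential simulation of broadcast on a BSN with 2*n*n processors.
// val[part][group][proc] holds the value of processor <group, proc, part>.
public class BsnBroadcast {
    public static void broadcast(int[][][] val, int srcGroup, int srcProc) {
        int n = val[0].length;
        int x = val[0][srcGroup][srcProc];
        // Phase 1: intra-group broadcast inside the source group (part 0).
        for (int p = 0; p < n; p++) val[0][srcGroup][p] = x;
        // Phase 2: every processor <srcGroup, p, 0> sends x over its swap link
        // to <p, srcGroup, 1>, so each group of part 1 receives one copy.
        for (int p = 0; p < n; p++) val[1][p][srcGroup] = x;
        // Phase 3: intra-group broadcast inside every group of part 1.
        for (int g = 0; g < n; g++)
            for (int p = 0; p < n; p++) val[1][g][p] = x;
        // Phase 4: every processor <g, p, 1> sends x back over its swap link
        // to <p, g, 0>, which covers all groups of part 0.
        for (int g = 0; g < n; g++)
            for (int p = 0; p < n; p++) val[0][p][g] = x;
    }
}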
Theorem 1. The broadcast algorithm on the BSN is optimal if the basic network's broadcast algorithm is optimal in all-ports mode.
Proof. If a broadcast algorithm is optimal on a network in all-ports mode, the number of move steps of the broadcast equals the diameter of that network. Let Ω denote the basic network of the BSN and D(Ω) its diameter; according to [3], the diameter of the BSN is 2D(Ω) + 2. Our broadcast algorithm performs two broadcasts in the basic network, which take 2D(Ω) move steps if the basic network's broadcast algorithm is optimal. In addition, there are two inter-group moves, so our broadcast algorithm needs 2D(Ω) + 2 moves, which equals the diameter of the BSN; hence our broadcast algorithm is optimal.
3.2 Prefix Sum
In a BSN with 2n² processors, label the groups of part 0 from 0 to n−1, the groups of part 1 from n to 2n−1, and the processors of each group from 0 to n−1. Now, let D(p) be the data in processor p, 0 ≤ p < 2n². In a prefix sum, each processor p computes

PS(p) = ∑_{i=0}^{p} D(i),  0 ≤ p < 2n².

The prefix sum algorithm results from the following equation:

PS(p) = SD(p) + LS(p)    (1)

where SD(p) is the sum of D(i) over all processors i whose group label is smaller than that of p, and LS(p) is the local prefix sum within the group of p. The algorithm for prefix sum is shown in Table 2.
Table 2. Algorithm for prefix sum
Step 1: perform a local prefix sum in each group.
Step 2: transmit the prefix sums computed in Step 1 for processor n−1 in each group of part 0 to the processors of group 2n−1, and for processor n−1 in each group (except group 2n−1) of part 1 to the processors of group n−1, by inter-group connections.
Step 3: in group n−1 and group 2n−1, perform a modified prefix sum on the data A received in Step 2. In this modification, processor P computes ∑_{i=0}^{P−1} A(i) rather than ∑_{i=0}^{P} A(i) (P ≥ 1).
Step 4: swap the prefix sums computed in Step 3 between processor n−1 in group 2n−1 of part 1 and processor n−1 in group n−1 of part 0 by the inter-group connection.
Step 5: after adding the result of Step 4 to its local prefix sum, processor n−1 in group n−1 of part 0 broadcasts the result to every processor in the group, and each processor in group n−1 of part 0 adds the result to its data A.
Step 6: transmit the prefix sum computed in Step 5 for processor n−1 of group n−1 to every group of part 1, and the prefix sum computed in Step 3 for processor n−1 of group 2n−1 to every group of part 0.
Step 7: broadcast the result from Step 6 within each group.
Step 8: in each processor, add the local prefix sum and the modified prefix sum.

Following Step 1, each group has computed its local prefix sum and the result is stored in the respective processors. Step 2 corresponds to an inter-group transmission: the results of the local prefix sums of all groups of part 0 are transmitted to the processors of group 2n−1 of part 1 and, similarly, the results of the local prefix sums of all groups except group 2n−1 of part 1 are transmitted to the processors of group n−1 of part 0. In Step 3, the processors in group n−1 and group 2n−1 perform a modified prefix sum on the data received in the previous step; the modified prefix sum of the current processor is equal to the prefix sum of the preceding processor. For example, if data A0, A1, ..., An−1 are stored in processor 0, processor 1, ..., processor n−1 respectively, the modified prefix sums of the n data are 0, A0, A0+A1, ..., A0+A1+...+An−2. In Step 4, the modified prefix sums computed in group n−1 of part 0 and group 2n−1 of part 1 are swapped, so that processor n−1 in group n−1 has the prefix sum of the previous n−1 groups of part 0 and processor n−1 in group 2n−1 has the prefix sum of the previous n−1 groups of part 1. In Step 5, the data from processor n−1 in group 2n−1, added to the local prefix sum in processor n−1 of group n−1, is broadcast to the other processors in the same group; each processor in this group then has the prefix sum of the previous n groups of part 0 and, finally, adds it to its modified prefix sum. After Step 6, processor n−1 in each group has the modified prefix sum. Following Step 7, each processor has the modified prefix sum. In the last step, each processor computes Equation (1) and thus holds the final result. Fig. 3 shows the process of prefix sum on the BSN-C3; we denote A0, A1, A2 as the prefix sums of processors <0, 0, 0>, <0, 1, 0>, <0, 2, 0> in group 1 and B0, B1, B2 as the prefix sums of processors <1, 0, 0>, <1, 1, 0>, <1, 2, 0> in group 2; the remaining groups are similar.
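The decomposition PS(p) = SD(p) + LS(p) of Equation (1), together with the role of the modified (exclusive) prefix sum of Step 3, can be checked with a short sequential sketch. This is our own illustration, not the parallel implementation; the data layout and names are assumptions.

// Sequential check of Equation (1): LS is the local prefix sum inside a group
// (Step 1) and SD is the exclusive prefix sum of the group totals (Step 3).
public class BsnPrefixSum {
    // data[g][p] holds D(.) for processor p of group g, groups ordered 0..2n-1.
    public static long[][] prefixSums(long[][] data) {
        int groups = data.length, n = data[0].length;
        long[][] ps = new long[groups][n];
        long carry = 0;                       // SD: sum of all preceding groups
        for (int g = 0; g < groups; g++) {
            long local = 0;                   // LS: running local prefix sum
            for (int p = 0; p < n; p++) {
                local += data[g][p];
                ps[g][p] = carry + local;     // PS(p) = SD(p) + LS(p)
            }
            carry += local;                   // exclusive scan over group totals
        }
        return ps;
    }
}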
(e) After Step 5, the processors in group 2 have the modified prefix sum and the local prefix sum.
Fig. 3. An example of prefix sum in the BSN with Ω=C3
We now consider the algorithm's complexity in the worst case. The algorithm performs 3 inter-group data transmissions, 2 broadcasts and 2 prefix sum operations (the time of arithmetic operations is ignored). Broadcast and prefix sum operations cost the most time on an array, so the complexity is worst when the basic network of the BSN is an array. Assuming that transmitting or broadcasting a data item costs one time unit, the broadcast time is n−1 and the prefix sum time is also n−1 on an n-processor array. So the whole algorithm's complexity on a BSN with 2n² processors is 4n−1 at worst.
3.3 Data Sum
Data sum is also known as semigroup computation: each processor is to compute the sum of the values of all processors. The algorithm is shown in Table 3.
Table 3. Algorithm for data sum
Step 1: each processor performs a data sum within its own group.
Step 2: each processor of each group transmits its data sum to the corresponding processor of the other part through the inter-group connection.
Step 3: each processor of each group performs a data sum operation on the data received from the other groups.
Step 4: each processor in each group swaps the result computed in Step 3 through the inter-group connection.
Step 5: each processor performs the final sum operation within its own group.

In Step 1, each processor performs a data sum within its group. After Step 2, each processor of part 0 has group-sum data from part 1 and, conversely, each processor of part 1 has group-sum data from part 0. In Step 3, each processor sums the data received through the inter-group connections, so the processors of part 0 (part 1) obtain the sum over all processors of part 1 (part 0). After Step 4, each processor in each group has the data needed to form the sum of all processors. In Step 5, all processors perform the sum operation and each processor of each group has the final data sum. On a BSN with 2n² processors, if the basic network is a complete graph, the complexity of our algorithm is the best. Suppose that intra-group and inter-group data transmissions cost one time unit each. One data sum operation is performed in each of Steps 1, 3 and 5, which costs 3 time steps in all at best. One data transmission operation is performed in each of Steps 2 and 4, which costs two time steps in total. So the whole algorithm complexity is 5 at best.

Table 4. Comparison between our algorithms and [6]

               OTIS-Mesh       BSN-Array        BSN-Mesh           BSN-Complete graph
Broadcast      4N^(1/4) − 3    √(2N)            2(8N)^(1/4) − 2    4
Prefix sum     8N^(1/4) − 6    2√(2N) − 1       4(8N)^(1/4) − 5    7
Data sum       8N^(1/4) − 7    3√(2N)/2 − 1     3(8N)^(1/4) − 4    5
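Returning to the data-sum algorithm of Table 3, the following sequential sketch summarizes the five-step exchange. It is our own simplification, not the parallel code; the data layout is an assumption.

// Sequential summary of data sum on a BSN with 2*n*n processors.
// val[part][g][p] is the local value; the returned value is what every
// processor ends up holding after Step 5.
public class BsnDataSum {
    public static long dataSum(long[][][] val) {
        int n = val[0].length;
        long[][] groupSum = new long[2][n];
        // Step 1: data sum inside every group (each processor learns its group's sum).
        for (int part = 0; part < 2; part++)
            for (int g = 0; g < n; g++)
                for (int p = 0; p < n; p++) groupSum[part][g] += val[part][g][p];
        // Steps 2-3: the group sums cross the swap links and each group of the
        // other part sums what it received, yielding the total of the opposite part.
        long sumPart0 = 0, sumPart1 = 0;
        for (int g = 0; g < n; g++) { sumPart0 += groupSum[0][g]; sumPart1 += groupSum[1][g]; }
        // Steps 4-5: the two partial totals are swapped back and added locally.
        return sumPart0 + sumPart1;
    }
}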
4 Conclusion
In this paper, we have developed algorithms for basic communication operations on the BSN, including broadcast, prefix sum and data sum, which are important in parallel computing, and we have also analyzed these algorithms' time complexity. Assuming that there are N processors in the OTIS-Mesh and in the BSN respectively, a comparison
between our basic communication algorithms and those of [6] in terms of time complexity is shown in Table 4. From Table 4, we see that our broadcast, prefix sum and data sum algorithms are better than the algorithms in [6] when the basic networks are the same. In our algorithms, the time complexity is constant when the basic network is a complete graph; that is, the time complexity is constant at best.
Acknowledgments. This work is supported by the Doctorate Foundation of South China University of Technology and the Open Research Foundation of the Guangdong Province Key Laboratory of Computer Network.
References 1. Parhami, B.: Swapped Interconnection Networks: Topological, Performance, and Robustness Attributes. Journal of Parallel and Distributed Computing 65, 1443–1452 (2005) 2. Day, K., Al-yyoub, A.: Topological Properties of OTIS-networks. IEEE Transactions on Parallel and Distributed Systems 13(4), 359–366 (2002) 3. Xiao, W.J., Chen, W.D., He, M.X., Wei, W.H., Parhami, B.: Biswapped Network and Their Topological Properties. In: Proceedings Eighth ACIS International Conference on Software Eng., Artific. Intelligence, Networking, and Parallel/Distributed Computing, pp. 193–198 (2007) 4. Wei, W.H., Xiao, W.J.: Matrix Multiplication on the Biswapped-Mesh Network. In: Proceedings Eighth ACIS International Conference on Software Eng., Artific. Intelligence, Networking, and Parallel/Distributed Computing, pp. 211–215 (2007) 5. Sahni, S., Wang, C.F.: BPC Permutations on the OTIS-Mesh Optoelectronic Computer. In: Proc. Fourth International Conference on Massively Parallel Processing Using Optical Interconnections, pp. 130–135 (1997) 6. Wang, C.F., Sahni, S.: Basic Operations on the OTIS-Mesh Optoelectronic Computer. IEEE Trans. Parallel and Distributed Systems 9, 1226–1236 (1998) 7. Coudert, D., Ferreira, A., et al.: Topologies for Optical Interconnection Networks Based on the Optical Transpose Interconnection System. Applied Optical IP 39(17), 2965–2974 (2000) 8. Day, K., Al-yyoub, A.: Topological Properties of OTIS-networks. IEEE Transactions on Parallel and Distributed Systems 13(4), 359–366 (2002)
Rule Engine Based Lightweight Framework for Adaptive and Autonomic Computing Jakub Adamczyk, Rafał Chojnacki, Marcin Jarząb, and Krzysztof Zieliński Institute of Computer Science, AGH - University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland {j.adamczyk, chojnacki, mj, kz}@agh.edu.pl http://www.ics.agh.edu.pl
Abstract. The paper describes a framework architecture called the Autonomic Management Toolkit (AMT). This toolkit was implemented to support dynamic deployment and management of adaptation loops. This requires automatic resource discovery, instrumentation and attachment to the Autonomic Manager (AM), and furthermore a scalable and easily changed decision-making module, which is a major part of the AM. The architecture of a system satisfying these requirements is proposed and described. This system is compared to PMAC (Policy Management Autonomic Computing) – a highly advanced software tool offered by IBM. The central element of AMT is a lightweight AM with a Rule Engine as its decision-making module. This makes the proposed solution lightweight and flexible. The AM activity is briefly specified and the process of constructing an execution loop is described. The proposed interfaces are specified; these interfaces are generally sufficient to support a wide range of policies, including standard regulators well known from control theory. Subsequently, AMT usage is illustrated by a simple example. The paper ends with an overview of related work and conclusions.
Keywords: autonomic manager, rule engine, adaptive, workload management, Drools, PMAC.
1 Introduction
Adaptive and autonomic management of computing resources is a well known problem facing computer scientists. For decades, system components and software have been evolving to deal with the increased complexity of system control, resource sharing, and operational management. The evolution of these trends addresses the increasingly complex and distributed computing environments of today [2]. The research started several years ago by IBM, Motorola, SUN and many other companies has resulted in software environments supporting policy-driven system management for autonomic computing [6]. One of those environments is PMAC (Policy Based Autonomic Computing) [5], developed by IBM. The majority of existing policy-driven systems [5, 6, 17] rely on the construction of an adaptation loop. A challenging problem is how to construct a system able to discover new resources during runtime, automatically generate an intermediate management layer and apply
selected policies with limited human intervention or a software system restart. Another related issue is the construction of a policy evaluation engine. The decision-making module, being a fundamental part of the Autonomic Manager, should be powerful enough to process many policies in a scalable way and should accept their specification expressed using existing rule [4] or policy specification languages [18]. Implementation of a system satisfying these requirements opens new possibilities for building a management or adaptation loop dynamically and with limited effort. This was the main motivation behind the construction of the Autonomic Management Toolkit (AMT), described in this paper. AMT is a lightweight framework for constructing adaptive and autonomic computing systems. The proposed solution exploits a different concept in comparison to the complex and heavy PMAC technology and promotes a lightweight approach [1]. Lightweight development is such an extensive topic that it is often difficult to specify exactly what it means. In this paper, a lightweight framework implies programming with Plain Old Java Objects (POJOs) and design patterns implemented with JMX [8] MBean components, used for coupling objects and integrating autonomic managers. The paper focuses on a framework architecture called AMT (Autonomic Management Toolkit) which uses JMX to control managed resources. The central element of this architecture is the lightweight autonomic manager. The most innovative concept applied in the AMT construction is the usage of a Rule Engine as a reasoner within the Autonomic Manager. Such an approach introduces a natural mapping of policies to production rule definitions [5]. As this mapping could introduce some constraints related to policy expression, it is not applied in most applications. The manager performs operations on managed resources registered with it. The managed resource's interface is fully compliant with PMAC. Newly discovered resources can be registered during runtime. A wrapper object implementing the managed resource interface for any resource represented as an MBean can be dynamically generated. This guarantees full flexibility and adaptability, not heretofore supported by PMAC. The wrapper objects support local and remote operation invocation and event notification, thus the managed resources can be located anywhere. The ongoing work is related to previous research performed at AGH UST concerning monitoring and management of virtualized environments [3], [12]. The structure of the paper is as follows. First, in Section 2, the AMT architecture is shortly described and the motivation behind its construction is discussed. Next, in Section 3 the functionality of the AMT Autonomic Manager is presented. In Section 4, integration techniques for managed resources are described. In Section 5, examples of AMT usage for management of Solaris containers are presented. Section 6 contains a comparison of existing solutions. The paper ends with conclusions.
2 AMT System Architecture Description The AMT architecture presented in Fig. 1 consists of several key subsystems, which correspond to components existing in PMAC, but are constructed to provide features listed in Table 1. A single entry point to AMT is the Autonomic Manager interface, which exposes operations supported by software modules of the Autonomic Manager described below:
− Policy Management Module (PMM) – policies defined by the system administrator are deployed to AMT.
− Policy Evaluators Module (PEM) – policies are obtained from storage, instantiated by a given reasoner and evaluated. The key point of this subsystem is an interface that supports interoperability with different reasoners.
− Resource Access Module (RAM) – defines resources and manages interactions compatible with the AMT specification. This part of the system has been implemented to offer full functional compatibility with PMAC.
These three modules offer the core AMT functionality, which is accessible through the AMT Console Module (ACM).
Fig. 1. Key components of architecture of Autonomic Management Toolkit
A JIMS Integration Module (JIM) is also available. This module enables integration of AMT with the JIMS [10], [11] (JMX-based Infrastructure Monitoring System) infrastructure, responsible for discovery and instrumentation of managed resources represented as MBeans. This part of the system can be substituted by any other monitoring system or instrumentation technique. Each policy available in AMT has the following properties:
− Name – must be unique within the policy scope.
− Scope – hierarchical structure of non-strings used for denotation of the given policy applicability domain.
− Type – similar to PMAC, each policy can be either solicited or unsolicited. Each solicited decision is a direct decision requested from a resource or any other external system. Solicited decisions have an input and an output; the input and the output are sets of key-value mappings. An unsolicited decision is a reaction to a system state change, evaluated periodically. An evaluation property defines the timespan between intervals.
− Evaluation in milliseconds – only for unsolicited policies, which are evaluated periodically; it defines the timespan between policy evaluations.
− Activated on startup – if set to true, policy evaluation is started immediately following AMT initialization.
− Reasoner – the name of the reasoner valid for policy evaluation.
AMT is designed to support many Rule Engines running in parallel, which may be attached to and detached from a running Policy Evaluator without a system restart. To resolve incompatibilities between different Rule Engines, AMT introduces the Reasoner Adapter (RA) concept, which implements the interface described in Table 1.
Table 1. Specification of the Reasoner Adapter interface
Method Name                 Description
attachResource              Attaches a new resource
detachResource              Detaches a resource
addPolicy                   Adds a policy for unsolicited and solicited decisions
removePolicy                Stops policy evaluation and removes it from the Rule Engine
evaluateSolicitedDecision   Returns a solicited decision or null if the decision is not available
getInfo                     Returns the Rule Engine description
shutdown                    Performs required operations upon detachment
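A plausible Java rendering of this interface is sketched below. The method names follow Table 1, while the parameter and return types, and the Policy stand-in, are assumptions, since the paper does not give signatures.

import java.util.Map;

// Minimal stand-in for AMT's policy object; only the name is modeled here.
interface Policy { String getName(); }

// Hypothetical Java form of the Reasoner Adapter interface of Table 1.
public interface ReasonerAdapter {
    void attachResource(Object resourceWrapper);            // attach a new managed resource
    void detachResource(Object resourceWrapper);            // detach a resource
    void addPolicy(Policy policy);                          // add a solicited or unsolicited policy
    void removePolicy(String policyName);                   // stop evaluation and remove the policy
    Map<String, Object> evaluateSolicitedDecision(String policyName,
                                                  Map<String, Object> input); // null if unavailable
    String getInfo();                                        // description of the underlying Rule Engine
    void shutdown();                                         // clean-up on detachment
}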
3 AMT Autonomic Manager Activities
The application of a Rule Engine as a major part of the Autonomic Manager influences its startup procedure and runtime activity. When the Rule Engine is started, it activates a rule base in the production memory for Rule Engine use. The rule base contains all rules and class definitions to be evaluated against facts. The Rule Engine is driven by the emergence or change of facts residing in the Working Memory. Such changes may be generated by external events, as results of Rule Engine activity, or by the expiration of a timer associated with previously received facts. In our case, facts represent manageable resources, which are implemented as MBean components created on demand, as described in Section 4. The activity of the Autonomic Manager is performed in the following steps, representing the execution of an adaptation loop, depicted in Fig. 2:
1. Managed Resources (MRs) are instantiated as MBeans.
2. Resource Wrappers of MRs, which play the role of facts, are constructed and inserted into the Working Memory. The Reasoner Adapter interface is used in this step.
3. Production Rules representing policies are loaded into the Production Memory. At this point, the Inference Engine is also started.
4. The Pattern Matching algorithm is performed on all rules in the Production Memory and facts present in the Working Memory.
Fig. 2. Autonomic Manager processing steps
5. All rules that are evaluated as true are added to the Agenda to be performed.
6. The action is performed on the representation of the MR in the Working Memory.
7. The action is forwarded to the MR via the Resource Wrapper and enforced with effectors.
8. MR parameter changes accessed by sensors are communicated to the Resource Wrapper, which in turn triggers execution of step 4.
Steps 4 to 8 constitute the main execution loop of the Autonomic Manager. Since rules are declarative Knowledge Representation forms, they are not analogous to functions in a procedural language. Instead, they are fired in response to changes in the facts available to the rule engine.
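The match/agenda/fire cycle of steps 4 to 7 can be illustrated with a small toy model. This is not the Drools engine used by AMT; it is only a sketch of the control flow, with a hypothetical CPU-usage wrapper as the single fact type and two made-up rules.

import java.util.*;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Toy model of the adaptation loop: facts in a working memory are matched
// against rules, matching rules are queued on an agenda, and firing a rule
// acts on the fact, which stands for forwarding the action to the resource.
public class MiniRuleLoop {
    static class Rule {
        final Predicate<CpuWrapper> when; final Consumer<CpuWrapper> then;
        Rule(Predicate<CpuWrapper> when, Consumer<CpuWrapper> then) { this.when = when; this.then = then; }
    }
    static class CpuWrapper {                 // stand-in for a generated resource wrapper
        double usage; int shares;
        CpuWrapper(double usage, int shares) { this.usage = usage; this.shares = shares; }
    }

    public static void main(String[] args) {
        List<CpuWrapper> workingMemory = new ArrayList<>();      // step 2: facts
        List<Rule> productionMemory = List.of(                   // step 3: rules
            new Rule(w -> w.usage < 0.70, w -> w.shares += 10),  // usage below target: raise shares
            new Rule(w -> w.usage > 0.90, w -> w.shares -= 10)); // usage far above target: lower shares

        workingMemory.add(new CpuWrapper(0.55, 50));             // step 1: one managed resource

        // Steps 4-7: match, build the agenda, fire; step 8 would rerun this on sensor updates.
        Deque<Runnable> agenda = new ArrayDeque<>();
        for (CpuWrapper fact : workingMemory)
            for (Rule r : productionMemory)
                if (r.when.test(fact)) agenda.add(() -> r.then.accept(fact));
        while (!agenda.isEmpty()) agenda.poll().run();

        System.out.println("shares after firing: " + workingMemory.get(0).shares); // prints 60
    }
}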
4 Managed Resource Representation
For the execution loop implementation, a crucial point is the representation of managed resources. AMT uses JMX technology for accessing resources. A key element of the JMX architecture is the MBean Server, which is a container for MBean components – Java objects which represent manageable resources and which allow operating on attributes (reading/writing values), executing actions and subscribing to notifications of events related to these resources. Each MBean registered in a server has a unique assigned name and is accessible to clients using various protocols supported by appropriate connectors (RMI, SOAP, HTML, SNMP) installed in an MBean server. Resource representation in AMT is based on the PMAC framework concept, thus it also enables the use of the PMAC Autonomic Manager. To meet this requirement we have defined an abstract class Resource, which implements the JavaManagedResource interface, for resource representation. The Resource class specification and its dependencies on other classes and interfaces are depicted in Fig. 3.
Fig. 3. Specification of the AMT resource classes
For each manageable resource, an MBean resource wrapper class is generated with resource-specific action methods. The implementation of such a wrapper class depends on the MBean's properties, methods and notifications. When an MBean attach request is initiated, AMT checks whether a suitable resource wrapper class is available. If there is no such class, a wrapper generation process is started. The wrapper is generated from a parameterized template and transformed into Java class source code using the Apache Velocity library, then compiled with the Java Compiler API and loaded into the JVM. The proposed mechanism of resource attachment is flexible and enables discovering a managed resource during runtime, with standard JMX protocols, and generating a suitable wrapper class. Therefore, the adaptation loop may be constructed on the fly, which is a unique feature of AMT. Building upon our former work [10, 11, 12] on management of virtualized resources, it creates a complete framework which satisfies the aforementioned requirements.
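The generate-compile-load path can be sketched as follows. AMT uses an Apache Velocity template; a plain string template stands in for it here, and the class and method names, as well as the omission of error handling, are our simplifications.

import java.io.*;
import java.net.*;
import javax.tools.*;

// Rough sketch: expand a source template for a wrapper class, compile it with
// the Java Compiler API and load the result into the running JVM.
// mbeanName is assumed to be a valid Java identifier.
public class WrapperGenerator {
    private static final String TEMPLATE =
        "public class %sWrapper { public String describe() { return \"wraps MBean %s\"; } }";

    public static Class<?> generateAndLoad(String mbeanName, File workDir) throws Exception {
        String className = mbeanName + "Wrapper";
        File source = new File(workDir, className + ".java");
        try (Writer w = new FileWriter(source)) {
            w.write(String.format(TEMPLATE, mbeanName, mbeanName));    // "template" expansion
        }
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();  // requires a JDK, not a bare JRE
        compiler.run(null, null, null, source.getPath());              // emits the .class next to the .java
        URLClassLoader loader = URLClassLoader.newInstance(new URL[]{ workDir.toURI().toURL() });
        return loader.loadClass(className);                            // load the wrapper into the JVM
    }
}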
5 Examples of AMT Usage
This section presents a case study of controlling workloads within the Solaris 10 environment. AMT is used for workload management of Solaris Containers [8] based on mechanisms specified by control theory [13] and structured as a closed-loop AM workload manager. The whole system is treated as a black box. The controller uses current CPU usage values and adjusts them by changing shares (control signals) to maintain the requested CPU usage. The implementation uses a control loop managed by AMT, running within a JIMS Management Agent on the machine on which the workload is running. The goal of the control loop was to adjust the project.cpu-shares resource control to a value which would assure that a given percentage of CPU time
would always be available under conditions of constant load for a specific workload – for instance, a given project should be guaranteed 70% of CPU time when other active workloads (disturbances) are also present. A sample controller algorithm could use the Proportional (P) regulator, Eq. (1), where: (i) Uw – required CPU usage of workload Ww; (ii) Uwt – usage observed at time t for workload Ww; (iii) Swt – number of shares set at time t for workload Ww; (iv) Kp – proportional coefficient.

Swt+1 = Swt + Kp * e(t), where e(t) = Uwt – Uw    (1)
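A direct transcription of Eq. (1) into Java might look as follows. Field names mirror the symbols in the text; how the current usage is sampled and how the share value is written back to project.cpu-shares is outside this sketch, and the sign of Kp must be chosen so that the update drives the observed usage toward the target.

// One-variable proportional controller for project.cpu-shares, Eq. (1).
public class ProportionalShareController {
    private final double targetUsage;   // Uw, e.g. 0.70 for 70% of CPU time
    private final double kp;            // proportional coefficient Kp (Kp = 3 in the experiment)
    private double shares;              // Sw_t, the current project.cpu-shares value

    public ProportionalShareController(double targetUsage, double kp, double initialShares) {
        this.targetUsage = targetUsage; this.kp = kp; this.shares = initialShares;
    }

    /** One control step: Sw_{t+1} = Sw_t + Kp * e(t), with e(t) = Uw_t - Uw. */
    public double step(double observedUsage) {
        double error = observedUsage - targetUsage;   // e(t) as written in Eq. (1)
        shares = shares + kp * error;
        return shares;
    }
}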
The above control algorithm for Solaris 10 is implemented with a Drools rule-based policy. The rule definition implementation is enhanced with some heuristics; this rather complex task is explained in more detail in [12]. The experiment was performed under Solaris 10 running on a Sun Blade B100 (1 GB RAM, 650 MHz SPARC CPU) board. In this scenario a constant disturbance was activated after the controlled workload reached a steady state (considering the fact that it was the only CPU-bound workload, it reached close to 100% CPU usage). Fig. 4 presents a case where only one CPU-bound workload is started in the selected project, at its beginning. After a few seconds, when CPU usage increases to almost 100%, two other CPU-bound workloads are started in other projects, which results in a drop of the CPU usage of the selected project. Subsequently, after several more seconds, controller P is turned on. It changes the share allocation for the controlled project, stabilizing CPU usage at 70%. The experiment was performed with a sample value of Kp = 3. This section presented a simple scenario of how AMT can be used. It is a proof of concept which validates our AMT design rather than a real-life problem. We have described policies which focus on system optimization and try to maintain the system
Fig. 4. AMT used for adaptive management of Solaris 10 project.cpu-shares resource
in a stable state. These policies are able to initiate valid reactions when the environment state changes.
6 Related Work
The AMT system offers an alternative solution to the PMAC framework. Limitations and drawbacks of PMAC in comparison to the Autonomic Management Toolkit (AMT) are presented in Table 2. This comparison points out the AMT features which make the proposed solution more flexible and easier to use. A list of projects in the area of autonomic computing conducted by academic institutions can be found in [14]. The presented research can be compared to the research performed under the project entitled "Models and Extensible Architecture for Adaptive Middleware Platform" [15] by National ICT Australia. Similarly to AMT, this project assumes that the ability to develop/implement autonomic services must have the following general characteristics: (i) Standards-based: programming frameworks to construct autonomic services must leverage standard services; (ii) Maintain Separation of Concerns: the adaptive behavior must be expressed in complete separation from the main application's business logic; (iii) Deployment/Development Scalability: solutions to provide adaptive behavior should scale down to small-scale applications and scale up to large multi-server deployments. Despite similar assumptions, the project facilitates the development of adaptive behavior for legacy server applications [16] with support for aspect programming. The Adaptive Server framework developed under this project [17] supports the development of adaptive behavior for server-side components running on application servers. This is in contrast to AMT, which addresses the general issue of adaptation loop construction, powered by Rule Engines.
Table 2. Comparison of PMAC and AMT frameworks
Feature                                               PMAC      AMT
Application server required to support operations    Yes       No
Resource description by WSRF needed                  Yes       No
Resources attachment only during the system restart  Yes       No
Policy syntax is verbose                              Yes       No
Policy specification expressiveness                   Limited   Depends on policy evaluator
Number of policy evaluators available                 One       Many
Activity logging and errors reporting                 Limited   Full and extendable
Declarative style of policy representation only       Yes       No
7 Conclusions The implementation of the AMT system exploits the potential of Rule Engine-based computing which is a very attractive solution for policy-driven autonomic computing systems. However, integration of the Rule Engine with managed computer resources is not easy and requires proper virtualization of such resources. This aspect is elaborated in related work [10, 11, 12]. As the Rule Engine [4] is a rather sophisticated software module which supports scalable pattern matching algorithms, it may be used for a large number of facts and rules constituting a representation of knowledge. This feature is important for AC systems typically built as a hierarchy of self-management subsystems. Each such subsystem usually provides [6] self-configuration, self-optimization, self-healing and self-protection functionality. In spite of differences between these terms, they are handled similarly – typically driven by rules, which specify a high-level goal of system activity. A more exhaustive evaluation of the AMT concept calls for an extensive performance study. The existing overheads and bottlenecks should be identified. The presented work is only a proof of concept, showing solutions to integration and interoperability problems affecting various technologies used for AMT implementation. The constructed framework is fully operational and open to further enhancements.
References 1. Tate, B.A., Gehtland, J.: Better, Faster, Lighter Java. O’Reilly Media, Sebastopol (2005) 2. Ganek, A.G., Corbi, T.A.: The dawning of the autonomic computing era, http://www.research.ibm.com/journal/sj/421/ganek.html 3. Janik, A., Zielinski, K.: Transparent Resource Management with Java RM API. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3994, pp. 1023–1030. Springer, Heidelberg (2006) 4. Morgan, T.: Business Rules and Information Systems: Aligning IT with Business Goals. Addison-Wesley Professional, Reading (2002) 5. Policy Management for Autonomic Computing, Tivoli, version 1.2.1, http://dl.alphaworks.ibm.com/technologies/pmac/PMDevGuide121. pdf 6. Strassner, J.C.: Policy-based Network Management – Solutions for the Next Generation. Morgan Kaufmann, San Francisco (2004) 7. Sullins, B.G., Whipple, M.B.: JMX in Action. Manning Publication Co. (2003) ISBN: 1930110561 8. Lageman, M.: Solaris Containers – What They Are and How to Use Them, Sun Microsystems (2005), http://www.sun.com/blueprints/0505/819-2679.pdf 9. Szydło, T., Szymacha, R., Zieliñski, K.: Policy-based Context-aware Adaptable Software Components for Mobility Computing. In: EDOC 2006: 10th IEEE International Enterprise Distributed Object Computing Conference, Washington, DC, USA, pp. 483–487 (2006) 10. Zieliński, K., Jarząb, M., Wieczorek, D., Bałos, K.: JIMS Extensions for Resource Monitoring and Management of Solaris 10. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3994, pp. 1039–1046. Springer, Heidelberg (2006)
11. Zielinski, K., Jarzab, M., Wieczorek, D., Balos, K.: Open Interface for Autonomic Management of Virtualized Resources in Complex Systems - Construction Methodology, Future Generation Computer Systems, http://www.sciencedirect.com/science/journal/0167739X 12. Jarzab, M., Zieliński, K.: Framework for Consolidated Workload Adaptive Management. In: 2nd IFIP CEE-SET 2007, Software Engineering in Progress, NAKOM, pp. 17–30 (2007) ISBN 978-83-89529-44-2 13. Hellerstein, J.L., Diao, Y., Parekh, S., Tilbury, D.M.: Feedback Control of Computing Systems. Wiley-IEEE Press, Chichester (2004) 14. Portal for Autonomic Computing resources, http://www.autonomiccomputing.org 15. Enabling Adaptation of J2EE Applications Using Components, WebServices and Aspects, National ICT Australia, http://www.cse.unsw.edu.au/~yliu/asf-demo/index.html 16. Liu, Y., Gorton, I.: Implementing Adaptive Performance Management in Server Applications. In: ICSE 2007 workshop on Software Engineering for Adaptive and Self-managing Systems (SEAMS) (2007) 17. Gorton, I., Liu, Y., Trivedi, N.: An Extensible, Lightweight Architecture for Adaptive J2EE Applications. In: International Workshop of Software Engineering and Middle-ware (SEM) (November 2006) (accepted) 18. Kephart, J.O., Walsh, W.: An artificial intelligence perspective on autonomic computing polices. In: IEEE 5th Intl. Workshop on Policies for Distributed Systems and Networks, pp. 3–12 (2004)
A Monitoring Module for a Streaming Server Transmission Architecture
Sadick Jorge Nahuz1, Mario Meireles Teixeira2, and Zair Abdelouahab1
1 Graduate Program in Electrical Engineering, Federal University of Maranhao, Campus do Bacanga, 65085-580 Sao Luis, MA, Brazil
2 Department of Informatics, Federal University of Maranhao, Campus do Bacanga, 65085-580 Sao Luis, MA, Brazil
[email protected], [email protected], [email protected]
Abstract. The Internet has experienced a considerable increase in the use of audio and video applications, which provoke a large consumption of the resources available in the network and servers. Therefore, the monitoring and analysis of those resources has become an essential task in order to enhance the service delivered to users. This work describes a monitoring module implemented in a video server architecture, which is used to track the transmission of some popular video formats. Our experiments have demonstrated that one of the formats delivers a performance considerably better than the other regarding the bandwidth allocated to each user session, which reinforces the importance of having such a monitoring module available in a server's architecture.
Keywords: monitoring, qos, streaming server, mpeg-4, mov.
1 Introduction
In recent years there has been an amazing development and wide spreading of network applications that transmit and receive audio and video data through the Internet. These applications have service requirements different from those of traditional data-oriented applications based on text/images, e-mail, FTP and DNS [1]. Today, there is a great demand for a class of Internet applications known as Multimedia Systems. This class of systems differs from others because they need to transmit multimedia data such as video and audio, generating a large data stream and provoking an accentuated consumption of Internet resources. Multimedia systems supply services oriented towards the execution of applications that need resource guarantees so as not to hinder their performance. Multimedia systems are very sensitive to end-to-end delays and to delay variation, but they can tolerate occasional data losses [1]. Some current networks cannot adapt to multimedia services because they were not designed to stand such tasks. There are many attempts to expand the current network architectures in order to enhance service quality and to provide support for multimedia applications. Currently, almost all news sites have a video format for their subscribers. In the next years, multimedia services will dominate a great part of the Internet flow, mainly the live transmission ones, but there is much to be researched and defined in that area [2].
As high-speed Internet access reaches an increasing number of households, multimedia services will become part of most users' routine [9]. Everybody will access their favorite programs, irrespective of their transmission scheduling, since there will always be a file stored in a streaming server, which uses a real-time transmission protocol. This paper details a monitoring module designed and implemented in the Darwin Streaming Server, from Apple. This module is used to evaluate this architecture serving both MOV and MP4 files. Transmission peculiarities of both file formats are discussed, and the monitoring demonstrates that one of them is able to reach considerably higher throughput than the other. Sec. 2 discusses the protocols used for streaming media transmission on the Internet. Sec. 3 details the streaming server chosen for this work, namely the Darwin Streaming Server. Sec. 4 deals with the monitoring implemented in the streaming server and also the changes performed in it. Sec. 5 discusses the experiments and their results. Sec. 6 presents this work's main conclusions and their possible unfolding.
2 Streaming Media Protocols
Real-time applications do have some peculiarities, as commented above. Thus one might expect that they demand protocols specifically designed to attend to their characteristics. This section analyzes the Internet protocols used for streaming media transmission: RTP, RTCP and RTSP.
2.1 RTP and RTCP Protocols
The real-time transport protocol, known as RTP, was designed to support the traffic of applications that broadcast real-time data. It is normally integrated inside the application (in user mode), so it is not implemented as part of the operating system kernel. Defined in RFC 1889 [3], the RTP protocol is a product of the Audio/Video Transport Working Group, and it was formally introduced in January 1996 by the Network Working Group of the IETF (Internet Engineering Task Force), aiming at standardizing functionalities for real-time data transmission applications [22]. RTP offers end-to-end transport functions for devices that transmit real-time data, such as audio and video, on unicast or multicast service networks, being characterized as a connectionless protocol. These functions include the identification of the type of data to be handled, sequence numbers, timestamps and data transmission monitoring. Although RTP offers end-to-end transfers, it does not offer all functionalities usually found in a transport protocol. Besides that, it does not reserve network resources and does not guarantee Quality of Service (QoS) [10] for the applications, nor does it promote reordering or retransmission in case of packet loss, assigning these responsibilities to the applications [11]. The RTP and RTCP protocols were designed not to depend on the underlying network and transport layers. While RTP is in charge of the transport of the streaming media (audio and video), RTCP controls the information returned by
users that received the data, informing about the quality of reception and data transfer, and supporting the synchronization of different media flows.
2.2 RTSP Protocol
The RTSP (Real-Time Streaming Protocol) addresses the "streaming stored audio and video" application class, operates in the application layer and was conceived by Real Networks, Netscape Communications and Columbia University. Its first RFC was published by the IETF under the number 2326 [4]. RTSP is a public-domain protocol that allows client-server interaction between the constant-rate media stream source (server) and the user's player [23]. That interactivity arises from the user's need for greater control over media playback in the player. The RTSP functionality can be summarized as control of the file's execution, similar to the functionality a DVD player makes available to its users; RTSP allows a player to control the media stream by means of commands.
2.3 Streaming Media Formats
MPEG-4 is the global multimedia standard that broadcasts professional-quality audio and video over various types of bandwidth, from cellular phone networks to WANs and others. MPEG-4 was defined by the Moving Picture Experts Group (MPEG) [12] [14], which actively participates in the International Organization for Standardization (ISO), which specified the well-known MPEG-1 and MPEG-2 standards. Hundreds of researchers worldwide contributed to the building of MPEG-4 [15], which was concluded in 1998 and became an international standard in 2000. MOV, another quite popular multimedia format, is used to store video sequences and was created by Apple Computer. The QuickTime Player was presented by Apple as an alternative to the Windows Media platform from Microsoft, betting in favor of the variety of formats available for content distribution. This video format became still more popular when its specifications were chosen by the MPEG consortium. In this work, the serving of both types of files, MP4 and MOV, is monitored and analyzed from the perspective of a streaming server, the Darwin Streaming Server, properly enhanced with our Monitor Module.
3 The Darwin Server
The Darwin Streaming Server from Apple is an open-source video server that can serve 3GP, QuickTime (MOV) and MPEG-4 videos to clients over the Internet using the standard RTP and RTSP protocols. It is based on the QuickTime Streaming Server code. Darwin also supports live broadcasting, since it uses a video encoder called mp4live, which is part of MPEG4IP [5]. The latter performs real-time video capture, generating a stream to be sent via unicast to the Darwin server, which distributes it using the RTP and RTSP protocols [16]. Darwin is an event-driven server working as a set of processes that execute the RTSP, RTP and RTCP standard streaming media broadcasting protocols. The server
supports various compatible streaming file formats and stands up to two thousand or more on-line users [6] [18]. To perform streaming media broadcasting via RTP, it is necessary to attach a hint track at the beginning of each file. A hint track specifies, for example, the RTP time scheduling and the packet's maximum size (Maximum Transmission Unit – MTU).
3.1 Server Architecture
The Darwin server is composed of a parent process that creates a child process, which is the core of the server. The parent waits for the child to exit; if the child process exits with an error, the parent process creates a new child process. The core server acts as an interface between the server modules and the clients in the network, which use the RTP and RTSP protocols to send requests and receive responses (Fig. 1). The server modules process requests and deliver packets to the client. The server core does its work by creating four threads:
• A Main Thread manages the server, checking whether it needs to terminate, generating a status log or printing statistics;
• An Idle Task Thread manages a queue of tasks that occur periodically. There are two kinds of task queues: the timeout tasks and the sockets;
• An Event Thread listens to the sockets to receive RTSP requests or RTP packets and delivers them to the Task Threads; and
• The Task Threads receive the RTSP and RTP requests from the Event Thread. Then, the Task Thread asks the proper server module to process and send packets to the client.
The Darwin server is complex, works asynchronously and needs an event-based communication mechanism. For example, when a socket is used in an RTP connection to acquire data, someone must be notified so that the data can be processed [19]. Each Task object has two main methods: Signal and Run. The Signal method is called by the server to deliver an event to a Task object. Then, Run is requested in order to schedule a Task to process the event. The goal of each Task object is to implement the server's functionality using small time slices that do not block. The Run method is in fact a virtual function that is requested whenever a Task object has events to be processed. Within the Run function, the Task object can call GetEvents to automatically receive all the requests that remain in the queue and previously signaled events. The Run function is never reentered if a Task object calls GetEvents within the Run function and signals before the function is completed. The Run function will only be called for a new event after it is finished. In fact, the Run function will be repeatedly invoked until all events in the Task object are cleared up by GetEvents.
3.2 Server Functioning
The Darwin Streaming Server works by broadcasting its content to all clients that requested a file or by serving it on demand. When a client requests a file, the number of packets emitted to each client will depend on the time elapsed since the beginning of the transmission (broadcast) and on the moment in which the client
Fig. 1. Darwin Server Architecture
established the connection with the server [17]. Each client's request is served with a complete broadcast of the file [8]. The server has a modular architecture so as to facilitate the building of new functionalities through independent modules. The server uses a main process, called Core Server, which truly is an interface between the clients' requests and the modules. Modules must follow a specific structure, with two mandatory methods, one requested by the server to initialize the module and the other to execute a certain task. For the server to know which modules must be requested for a certain event, each module must state its role explicitly. Thus, each module must have a list of actions called "Roles". When the server starts, it first loads the modules that are not compiled into the server (dynamic modules) and then the compiled ones (static modules). The server then invokes each module from the QTSS (Core) with the Register functionality, a role every module must support. Next, the module calls QTSS_AddRole to specify the other roles it can have. After this, the server invokes the 'Initialize' role for each module registered in that role. 'Initialize' executes any initialization task the module requires, allocating memory and global data structures. When the server is deactivated, it invokes a 'Shutdown' role for each module registered in that role. When it executes the Shutdown, the module must perform a clean-up and release global structures.
4 The Monitor Module
In this work, we developed a monitoring module which is responsible for obtaining the bandwidth use and packet loss rate of each user session. The Monitor Module directly
Fig. 2. Monitor Module Block Diagram
interacts with the Darwin Server kernel (Fig. 2). It collects data from the server core, and these data are processed on-line within the attributes of a server object. Those attributes work directly with the server's structures, constantly computing the different information needed to assess its performance [17] [21]. This information remains hidden within the Darwin server itself, and the Monitor Module was implemented to collect it and to create an interface to visualize the data. The Monitor Module implements two basic routines: the Main Routine and the Dispatch Routine. As for the Main Routine, all QTSS modules must supply one; the server requests the Main routine at initialization and uses the 'Initialize' role to access the QTSS stub library (which loads the libraries), so that the server can later invoke the module. As for the Dispatch Routine, every QTSS module must provide one; the server requests the dispatch routine when it invokes a module for a specific task, passing to the dispatch routine the task name and a task-specific parameter block. The Monitor Module implements three roles: QTSS_Register_Role, QTSS_Initialize_Role and QTSS_RTSPFilter_Role. When QTSS_Register_Role is invoked by the server, the module records the roles it wishes to be notified of, which are QTSS_Initialize_Role and QTSS_RTSPFilter_Role. When QTSS_Initialize_Role is invoked, the module performs the initialization of global objects, for example obtaining a reference to the QTSS_ServerObject (the object that represents the server). And when the server invokes the QTSS_RTSPFilter_Role, the module processes the request. When the Monitor Module is notified through the QTSS_RTSPFilter_Role, it obtains the value of the server's total transfer rate from the qtssRTPSvrCurBandwidth attribute of the QTSS_ServerObject object. Next, for each client's session, it gets
the transfer rate from the qtssCliSesCurrentBitRate attribute, the client's address from the qtssCliRTSPSessRemoteAddrStr attribute and the amount of lost packets from the qtssCliSesPacketLossPercent attribute of the object that represents each client's session. After this, it calculates the bandwidth use and stores the results in a file. We also developed a program in Java that reads the results file and presents it graphically. There are two visualization modes: one shows the bandwidth and the other shows the packet losses.
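The paper does not give the exact formula used for this calculation; a natural choice, consistent with the attributes listed above and with the graphs of Section 5, is to relate the measured rates to the nominal capacity C of the network link (C is an assumption of this sketch, e.g. the capacity of the LAN used in the experiments):

  bandwidth use [%] = 100 × qtssRTPSvrCurBandwidth / C
  session share [%]  = 100 × qtssCliSesCurrentBitRate / qtssRTPSvrCurBandwidth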
5 Experiments and Results

With the Monitor Module developed in this work, we conducted experiments with the Darwin server in a 100 Mbps LAN with more than 20 computers. All machines accessed a single Darwin video server. In this experimental stage the implemented Monitor Module was successfully tested, since it collected all the information within the core server and generated the connection graphs. The experiments were conducted in two steps. In the first step, MP4 video files of about 50 MB were used. It was noticed that the Darwin server always tries to keep the files at the same data rate, but by doing so the server ends up with a large amount of unused bandwidth that could be scheduled to speed up the broadcast [20]. In the video played by the clients, excellent performance and optimal image quality were observed when using the QuickTime Player [7] and the RTP protocol to access the server. Fig. 3 shows the monitoring graph of the Darwin server with MP4 video files. It is important to notice that the server is only using 10% of the available bandwidth, in
Fig. 3. Monitoring of MPEG-4 broadcasting
such a way that its total data output rate is on the order of 10 Mbps. The Darwin server thus leaves 90% of its bandwidth completely unused. In the second experimental step we used MOV video files of up to 60 MB. It was noticed that the Darwin server treats MOV files differently, since not all files presented the same data output rate. According to the quality of each video file, the server provides more or less bandwidth for the session. Again, the player used was QuickTime, with the RTP protocol to access the server. QuickTime itself adds hint tracks to the MOV files so that they work with the streaming server, but that operation is not automatic: it is necessary to export the MOV file in the hinted-track format. Fig. 4 shows the monitoring graph of the Darwin server with MOV video files. Unlike the former step with MP4 files, this time the Darwin server uses 90% of the available bandwidth, attaining a total data output rate of 90 Mbps, as can be seen in the graph. At some moments, some videos use almost the whole bandwidth available in the server. The image quality is equivalent to the one previously obtained with MP4 files. The Darwin server has a management feature which provides a summary of the active sessions. Analyzing the active sessions we detected that the MOV files present an output rate five times higher than that of the MP4 video files. From the experiments we can conclude that serving MP4 files via streaming takes more processing time on the Darwin server; for this reason, the server cannot deliver these files at the same output rate as the MOV files. We believe that this is not due to some 'hidden' fault in the Darwin server, but to characteristics inherent to each file format, MOV files being, by their nature and historical origins, more naturally suited to being broadcast via streaming. Hence, the importance of a monitoring module such as the one developed here and implemented within the Darwin server architecture is clearly visible.
Fig. 4. Monitoring of MOV broadcasting
Without proper tracking of user sessions, such a broadcasting discrepancy between the two video file formats could easily be overlooked by the system administrator.
6 Conclusions

This paper describes a monitoring module implemented for a streaming video server, in this case the Darwin Streaming Server from Apple, an open-source server based on the QuickTime Streaming Server code of the same manufacturer. For this goal, the peculiarities of Internet multimedia broadcasting were analyzed, as well as the RTP and RTCP protocols. Two popular video file formats, MP4 and MOV, were selected for the experiments reported here. The Darwin server architecture was analyzed in detail, as well as the architecture of our own monitor, which was implemented as a Darwin module. The experimental results, obtained by means of our Monitor Module, show a significant difference between the delivery rates of MP4 and MOV files, the latter being delivered at a rate five times higher. This demonstrates the importance of having a monitoring module such as the one presented here, since it can become an important tool for a system manager who needs to follow the server's performance. The Monitor Module presented here is still an ongoing project. In the near future we intend to extend it to capture information other than the used bandwidth and the packet loss rate of each user session, thus enhancing its applicability. We will also analyze more carefully the different behavior found while serving MOV and MP4 media files.
References 1. Kurose, J., Ross, K.: Redes de Computadores e a Internet. Addison-Wesley, Reading (2003) 2. Tanembaum, A.: Sistemas Operacionais Modernos, 2nd edn. Prentice-Hall, Englewood Cliffs (2003) 3. Schulzrinne, H.: A Transport Protocol for Real-Time Applications (RTP) (1996), http://www.ietf.org/rfc/rfc1889.txt 4. Schulzrinne, H., Rao, A., Lanphier, R.: Real Time Streaming Protocol (RTSP) (1998), http://www.ietf.org/rfc/rfc2326.txt 5. Mpeg4ip Commuty Site, http://mpeg4ip.sourceforge.net 6. Darwin Project Site, http://developer.apple.com/darwin/projects/streaming 7. QuickTime Apple Web Site, http://www.apple.com/quicktime 8. Ferguson, P., Huston, G.: Quality of Service: Delivering QoS on the Internet and in Corporate Networks. John Wiley, Chichester (1998) 9. Comer, D.E.: Internetworking with TCP/IP: Principles, Protocols and Architecture, 4th edn., vol. 1. Prentice-Hall, Englewood Cliffs (2000) 10. Busse, I., Deffner, B., Schulzrinne, H.: Dynamic QoS Control of Multimedia Applications based on RTP. Computer Communications 19, 49–58 (1996) 11. Stallings, W.: High-Speed Networks and Internets: Performance and Quality of Service, 2nd edn. Prentice-Hall, Englewood Cliffs (2002) 12. MPEG Site, Moving Picture Experts Group, http://www.mpeg.org
13. Gnustream: A P2P Media Streaming System Prototype (2003) 14. The MPEG-4 Fine-Grained Scalable Video Coding Method for Multimedia Streaming over IP (2001) 15. Wakamiya, N., Miyabayashi, M., Murata, M., Miyahara, H.: MPEG-4 Video Transfer with TCP-Friendly Rate Control. In: Al-Shaer, E.S., Pacifici, G. (eds.) MMNS 2001. LNCS, vol. 2216, pp. 29–42. Springer, Heidelberg (2001) 16. Seungjoon, S.B.: Scalable Resilient Media Streaming (2003) 17. Zhao, W., Tripathi, S.K.: Bandwidth-Efficient Continuous Media Streaming Through Optimal Multiplexing (1999) 18. Sen, S., Rexford, J., Towsley, D.: Proxy Prefix Caching for Multimedia Streams (1999) 19. Chen, S., Shen, B., Yan, Y., Basu, S., Zhang, X.: Fast Proxy Delivery of Multiple Streaming Sessions in Shared Running Buffers. IEEE Transactions on Multimedia 6(7) (2005) 20. Anastasiadis, S.V., Sevcik, K.C., Stumm, M.: Server-Based Smoothing of Variable BitRate Streams. In: ACM Multimedia, pp. 147–158 (2001) 21. Tripathi, Z.: Bandwidth-Efficient Continuous Media Streaming Through Optim (1999) 22. Schulzrinne, H., Gokus: RTP: A transport protocol for real-time applications (1996) 23. Schulzrinne, H., Rao, A., Lanphier, R.: Real time streaming protocol (RTSP), request for comments 2326 (April 1998)
BSP Functional Programming: Examples of a Cost Based Methodology

Frédéric Gava

Laboratory of Algorithms, Complexity and Logic, University of Paris-Est
[email protected]
Abstract. Bulk-Synchronous Parallel ML (BSML) is a functional dataparallel language for the implementation of Bulk-Synchronous Parallel (BSP) algorithms. It makes an estimation of the execution time (cost) possible. This paper presents some general examples of BSML programs and a comparison of their predicted costs with the measured execution time on a parallel machine. Keywords: BSP Functional Programming, Cost Prediction.
1 Introduction
Solving a problem is often a complex job, especially when a parallel machine is used: it is necessary to manage communication, synchronisation, partition of data, etc. at the same time. Algorithmic models and high-level languages are needed to simplify both the design of parallel algorithms (the ability to compare their costs1) and their programming in a safe, efficient and portable manner. BSML is an extension of ML designed for the implementation of BSP algorithms as functional programs using a small set of parallel primitives. BSP [3,17] is a parallel model which offers a high degree of abstraction and allows scalable and predictable performance on a wide variety of architectures with a realistic cost model based on a small set of machine parameters. Deadlocks and nondeterminism are avoided. BSML is implemented as a parallel library2 for the functional programming language Objective Caml (OCaml). Our methodology is as follows: first, analyse the complexity of the sequential algorithm, then design one or more parallel algorithms, analyse their BSP costs, compute the BSP parameters of the parallel machines, program these algorithms in BSML and finally test the performance of the programs on different architectures. Using safe high-level languages like ML to program BSP algorithms (that is, BSML) allows performance, scalability and expressivity. Other approaches to safe high-performance computation exist. We can cite concurrent programming [6], more or less synchronous processes [5,18], the automatic parallelization of programs [1] and algorithmic skeletons [7]. In the first two cases, the expressivity of concurrence or mobility is sought (but with
1 We speak about complexity for sequential algorithms and cost for parallel ones.
2 Web page at http://bsmllib.free.fr.
lower performance). In other cases, it’s the simplicity of parallelism (which becomes almost transparent) which is sought. As opposed to these methods, we prefer the use of a performance model3 at the cost of expressiveness. Indeed, other approaches do not allow, in their intrinsic models, to predict the run-time4 . That makes algorithmic optimisations and the choice of the best algorithm for a given architecture difficult: even if the number of processes/threads/workers or their computations can be limited, it is often difficult to analyse the communication times or the placement of computation and data on the processors. Nevertheless, it is possible to find empirical optimisations (by successive tests) or to design better schedulers [2]. But algorithmic optimisations are still hard to analyse. In this article, we illustrate our methodology with simple examples of problems that illustrate many aspects of classical algorithmic problems.
2 Functional Bulk-Synchronous Parallel Programming
2.1 The Bulk-Synchronous Parallel Model
In the BSP model, a computer is a set of uniform processor-memory pairs and a communication network allowing inter-processor delivery of messages (for the sake of conciseness, we refer to [3,17] for more details). A BSP program is executed as a sequence of super-steps, each one divided into three successive disjoint phases: each processor uses its local data (only) to perform sequential computations and to request data transfers to other nodes; the network delivers the requested data; a global synchronisation barrier occurs, making the transferred data available for the next super-step. The performance of the BSP machine is characterised by 4 parameters: the local processing speed r; the number of processors p; the time l required for a barrier; and the time g for collectively delivering a 1-relation, a communication phase where every processor receives/sends at most one word. The network can deliver an h-relation (every processor receives/sends at most h words) in time g × h. The execution time (cost) of a super-step is the sum of the maximal local processing time, the data delivery time and the global synchronisation time.
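In symbols, if w_i^(s) is the local work of processor i in super-step s and h_i^(s) the maximum number of words it sends or receives, this cost is the standard BSP formula (restated here for reference):

  cost(s) = max_{0 ≤ i < p} w_i^(s) + g × max_{0 ≤ i < p} h_i^(s) + l

and the cost of a whole program is the sum of the costs of its super-steps.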
3 Note that this observation has already been made in the context of programming C+BSP matrix computations [12,14].
4 Some tools [8,11] exist but are too complex to be used or are not implemented.
bsp_p: unit → int    bsp_l: unit → float    bsp_g: unit → float
mkpar: (int → α) → α par
apply: (α → β) par → α par → β par
put: (int → α) par → (int → α) par
proj: α par → int → α
super: (unit → α) → (unit → β) → α ∗ β

Fig. 1. The BSML primitives
2.2 Bulk-Synchronous Parallel ML
BSML is based on 8 primitives (Fig. 1), three of which are used to access the parameters of the machine. The implementation of these primitives relies either on MPI, on PUB [4] or on the TCP/IP functions provided by the Unix module of OCaml. A BSML program is built as a sequential program on a parallel data structure called a parallel vector. Its ML type is α par, which expresses that it contains a value of type α at each of the p processors. The BSP asynchronous phase is programmed using the two primitives mkpar and apply, so that (mkpar f) stores (f i) on process i (f is a sequential function):

  mkpar f = ⟨ f 0 , … , f i , … , f (p−1) ⟩

and apply applies a parallel vector of functions to a parallel vector of arguments:

  apply ⟨ … , fi , … ⟩ ⟨ … , vi , … ⟩ = ⟨ … , fi vi , … ⟩

The first communication primitive is put. It takes as argument a parallel vector of functions which should return, when applied to i, the value to be sent to processor i. put returns a parallel vector with the vector of received values: at each processor these values are stored in a function which takes as argument a processor identifier and returns the value sent by this processor. The second communication primitive, proj, is such that (proj vec) returns a function f where (f n) returns the nth value of the parallel vector vec. Without this primitive, the global control cannot take into account data computed locally. The primitive super allows the evaluation of two BSML expressions as interleaved threads of BSP computations. From the programmer's point of view, the semantics of the superposition is the same as pairing, but the evaluation of super E1 E2 is different: the phases of asynchronous computation of E1 and E2 are run; then the communication phase of E1 is merged with that of E2 and only one barrier occurs; if the evaluation of E1 needs more super-steps than that of E2 then the evaluation of E1 continues (and vice versa).
2.3 Often Used Parallel Functions
The primitives described in the previous section constitute the core of the BSML language. In this section we present some useful functions which are part of the standard BSML library. For the sake of conciseness, their full code is omitted. replicate: α → α par creates a parallel vector which contains the same value everywhere. The primitive apply can be used only for a parallel vector of functions that take one argument; to deal with functions that take two arguments we need to define the apply2: (α → β → γ) → α par → β par → γ par function. It is also common to apply the same sequential function at each processor. We use the parfun functions, where only the number of arguments to apply differs: parfun: (α → β) → α par → β par and parfun2: (α → β → γ) → α par → β par → γ par.
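For illustration, these helpers can be defined from the core primitives along the following lines (a sketch only; the actual definitions in the BSML standard library may differ):

  let replicate x = mkpar (fun _ -> x)                      (* the same value on every processor *)
  let parfun f v = apply (replicate f) v                    (* apply the same sequential f everywhere *)
  let parfun2 f v1 v2 = apply (apply (replicate f) v1) v2   (* idem for a two-argument function *)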
[Two plots, "BSP parameter g" and "BSP parameter l", give the measured values (in flops) against the number of processors (2 to 16) for MPICH, OPEN MPI and PUB.]
Fig. 2. BSP parameters of our machine
It is common to perform a total exchange. Each processor contains a value (the whole forming a parallel vector of values) and the result of rpl_total: α par → α list applied to ⟨ v0 , … , vp−1 ⟩ is [v0 , … , vp−1 ], a list of these values, on each processor.
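Such a total exchange can be programmed with put in one communication super-step, for instance as follows (a sketch; the library version of rpl_total may be written differently):

  let rpl_total v =
    let msgs  = parfun (fun x -> fun _dst -> x) v in   (* offer the local value to every destination *)
    let recvd = put msgs in                            (* one h-relation with h = p words *)
    let rec upto i n = if i >= n then [] else i :: upto (i+1) n in
    parfun (fun from -> List.map from (upto 0 (bsp_p ()))) recvd   (* [v0; ...; v(p-1)] everywhere *)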
2.4 Computation of the BSP Parameters
One of the main advantages of the BSP model is its cost model: it is quite simple and yet accurate. We used the BSML implementation [15] of a program [3] that benchmarks and determines the BSP parameters of our machine (a cluster of 16 Pentium IV 2.8 GHz, 1 GB RAM nodes interconnected with a Gigabit Ethernet network). We then computed these parameters for 3 libraries, corresponding to 2 different implementations of BSML: MPICH, OPEN-MPI5 and PUB [4]. Fig. 2 summarises the timings (where r is 330 Mflops/s for each library) for an increasing number of processors. We notice that the parameter l grows in a quasi-linear way for the PUB library. However, for the MPI libraries, two jumps are visible; no explanation has been found yet. For the parameter g, it is surprising to see that it is high for a few processors and then more stable (the real exchange parameter of the network). This is certainly due to the buffer management in the communication protocols and the OS: when the number of processors increases, the buffers are filled faster and messages are transmitted immediately.
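With g and l expressed in flops, as in Fig. 2, and r in flops per second, a BSP cost formula is turned into a predicted wall-clock time by a simple division by r. A helper of the following kind is enough for the predictions of the next section (a sketch; the benchmarking program of [3,15] is more elaborate):

  (* predicted time, in seconds, of one super-step with local work w (in flops)
     and an h-relation of h words, on a machine with parameters r, g and l *)
  let superstep_seconds ~r ~g ~l w h = (w +. float_of_int h *. g +. l) /. r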
3 Examples of BSP Problems in BSML
To illustrate our methodology, we present 2 classic problems: the sieve of Eratosthenes and N-body computation. For each problem, we give the parallel methods and BSP cost formulas, as well as comparisons between the theoretical performance (depending on the BSP parameters) and the experimental one. This comparison shows that the BSP cost analysis helps in choosing the best BSML program.
5 http://www.mcs.anl.gov/mpi/mpich1 and http://www.open-mpi.org/
3.1 Sieve of Eratosthenes
The sieve of Eratosthenes generates the list of prime numbers below a given integer n. We study 3 parallelization methods. We generate only the integers that are not multiples of the 4 first prime numbers and we classically iterate up to √n. The probability of a number a being prime is 1/log(a), so we deduce a complexity of (√n × n)/log(n). Fig. 3 gives the BSML code of the 3 methods. We used the following functions: elim: int list → int → int list, which deletes from a list all the integers that are multiples of the given parameter; final_elim: int list → int list → int list, which iterates elim; seq_generate: int → int → int list, which returns the list of integers between 2 bounds; and select: int → int list → int list, which gives the first prime numbers (up to √n) of a list.

Logarithmic reduce method. For our first method we use the classical parallel prefix computation (also called folding reduce):

  scan ⊕ ⟨ v0 , … , vp−1 ⟩ = ⟨ v0 , v0 ⊕ v1 , … , ⊕_{k=0}^{p−1} vk ⟩

We use a divide-and-conquer BSP algorithm (implemented using the super primitive) where the processors are divided into two parts and the scan is recursively applied to those parts; the value held by the last processor of the first part is broadcast to all the processors of the second part, then this value and the values held locally are combined together by the associative operator ⊕ on the second part. In our computation, the sent values are first modified by a given function (select, to send just the first primes up to √n). The parallel method is thus very simple: each processor i holds the integers between i × n/p + 1 and (i + 1) × n/p. Each processor computes a local sieve (processor 0 thus contains the first prime numbers) and then our scan is applied. We then eliminate on processor i the integers that are multiples of integers of processors i − 1, i − 2, etc. We have log(p) super-steps where each processor sends/receives at most 2 values (lists of size at most √n). The BSP cost is accordingly:
  (√m × m) / log(m) + 2 × log(p) × (√n × g + l)    where m = n/p.
Direct method. It is easy to see that our initial distribution (blocks of integers) gives a bad load balancing (processor p − 1 has the biggest integers, which have little probability of being prime). We therefore distribute the integers in a cyclic way: a is given to processor i where a mod p = i. The second method works as follows: each processor computes a local sieve; then the integers that are smaller than √n are globally exchanged; a new sieve is applied to this list of integers (thus giving prime numbers) and each processor eliminates, in its own list, the integers that are multiples of these first primes. The BSP cost is accordingly:

  2 × (√m × m) / log(m) + (√√n × √n) / log(√n) + √n × g + l
let eratosthene_scan n =
  let p = bsp_p () in
  let listes = mkpar (fun pid ->
    if pid = 0 then seq_generate (n/p) 10
    else seq_generate ((pid+1)*(n/p)) (pid*(n/p)+1)) in
  let local_eras = parfun (local_eratosthene n) listes in
  let scan_era = scan_super final_elim (select n) local_eras in
  applyat 0 (fun l -> 2::3::5::7::l) (fun l -> l) scan_era

let eratosthene_direct n =
  let listes = mkpar (fun pid -> local_generation n pid) in
  let etape1 = parfun (local_eratosthene n) listes in
  let selects = parfun (select n) etape1 in
  let echanges = replicate_total_exchange selects in
  let premiers = local_eratosthene n (List.fold_left (List.merge compare) [] echanges) in
  let etape2 = parfun (final_elim premiers) etape1 in
  applyat 0 (fun l -> 2::3::5::7::(premiers@l)) (fun l -> l) etape2

let rec eratosthene n =
  if fin_recursion n then apply (mkpar distribution) (replicate (seq_eratosthene n))
  else
    let carre_n = int_of_float (sqrt (float_of_int n)) in
    let prems_distr = eratosthene carre_n in
    let listes = mkpar (fun pid -> local_generation2 n carre_n pid) in
    let echanges = replicate_total_exchange prems_distr in
    let prems = List.fold_left (List.merge compare) [] echanges in
    parfun (final_elim prems) listes

let eratosthene_rec n =
  applyat 0 (fun l -> 2::3::5::7::l) (fun l -> l) (eratosthene n)

Fig. 3. BSML code of the parallel versions of the sieve of Eratosthenes
Recursive method. Our last method is based on the generation of the first prime numbers up to √n and the elimination of the multiples of this list of integers. We generate this list by an inductive function on n. We suppose that the inductive step gives the first primes up to √n and we perform a total exchange on them to eliminate the non-primes. The end of this induction comes from the BSP cost: we stop when n is small enough so that the sequential method is faster than the parallel one. The inductive BSP cost is accordingly:

  Cost(n) = (√m × m) / log(m) + √n × g + l + Cost(√n)
  Cost(n) = (√n × n) / log(n)                               if BSP cost > complexity
Fig. 4 gives the predicted and measured performances (using the PUB implementation). To simplify our prediction, we suppose that pattern-matching and
[Two plots show the measured acceleration and the BSP-predicted acceleration of the prefix, direct and recursive versions against the number of processors, for N = 10000000, together with the ideal acceleration.]
Fig. 4. Performances (using PUB) of the sieve of Eratosthenes
modulo are constants in time. The size of the lists of integers can be measured using the Marshal module of OCaml. Note that we obtain a super-linear acceleration for the recursive method. This is due to the fact that, using a parallel method, each processor has a smaller list of integers and thus the garbage collector of OCaml is called less often. One can notice that the predicted performances using the BSP cost model are close to the measured ones.
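For instance, the size in bytes of a value such as a list of integers can be obtained along the following lines (Marshal.to_string belongs to the OCaml standard library; using it in this way for the measurements is an assumption of this note):

  let size_in_bytes v = String.length (Marshal.to_string v [])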
3.2 The N-Body Problem
The classic N-body problem is to calculate the gravitational energy of N point masses, which is given by:

  E = − Σ_{i=1}^{N} Σ_{j=1, j≠i}^{N} (m_i × m_j) / ‖r_i − r_j‖
The complexity of this problem is thus of the order of N². To compute this sum, we show two parallel algorithms: one using a total exchange of the point masses and one using a systolic loop6. At the beginning of these two methods, each processor contains a sub-part (as a list) of the N point masses: we thus have a parallel vector of lists of N/p point masses. Fig. 5 gives the BSML code of the 2 algorithms. pair_energy computes the interaction of a list of masses with another one. The sequential method is thus a call of this function with the same list twice. Total exchange method. The method is naive: a total exchange of these lists is done and then each processor computes the interaction of its own list with the other ones; at the end, a parallel fold is applied to sum the partial interactions. The BSP cost is accordingly: N × g + 2 × N + N/p × N + l + p × g + l, that is, two
6 There exist more sophisticated algorithms that take advantage of the symmetry of the sum, but this is not the subject of this article.
super-steps: time of the total exchange and the concatenation of the received lists; time to perform the local interactions and time to finish the fold. Systolic loop. Our second algorithm is based on a systolic loop [13]. In such an algorithm, data is passed around from processor to processor in a sequence of super-steps. We can easily write a generic systolic loop in BSML: (∗ val systolic:(α →α →β ) →(γ →β par →γ ) →α par →γ →γ ∗) let systolic f op vec init = let rec calc n v res = if n=0 then res else let newv=Bsmlcomm.shift right v in calc (n−1) newv (op res (Bsmlbase.parfun2 f vec newv)) in calc (bsp p()) vec init
with shift_right: α par → α par which shifts the values from each processor to its right-hand neighbour (part of the standard BSML library). Initially, each processor receives its share of the N point masses and calculates the interactions among them. Then it sends a copy of its particles to its right-hand neighbour, while at the same time receiving the particles from its left-hand neighbour. It calculates the interactions between its own particles and those that just came in, and then it passes on the particles that came from the left-hand neighbour to the right-hand neighbour. After p − 1 super-steps, all pairs of

  type point = float * float * float
  and atom = point * float

  let minus_point (x1,y1,z1) (x2,y2,z2) = (x1-.x2, y1-.y2, z1-.z2)
  let length_point (x,y,z) = sqrt (x*.x +. y*.y +. z*.z)

  (* val pair_energy : atom list -> atom list -> float *)
  let pair_energy some_bodies other_bodies =
    List.fold_left (fun energy -> function (r1,m1) ->
        energy +. (List.fold_left (fun energy -> function (r2,m2) ->
            let r = length_point (minus_point r2 r1) in
            if r > 0. then energy +. (m1*.m2)/.r else energy)
          0. other_bodies))
      0. some_bodies

  (* Total exchange method *)
  let final_ex = parfun2 pair_energy my_bodies
                   (parfun List.concat (total_exchange my_bodies)) in
  let res_final = fold_direct (+.) 0. final_ex in ...

  (* Systolic method *)
  let energy = parfun2 pair_energy my_bodies my_bodies in
  let final_sys = systolic pair_energy (parfun2 (+.)) my_bodies energy in
  let res_final = fold_direct (+.) 0. final_sys in ...

Fig. 5. BSML code of the parallel versions of the N-body problem
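As an aside, the shift_right function used by the systolic loop can itself be programmed with put in a single super-step, for example as follows (a sketch; the version shipped with the standard BSML library may differ):

  let shift_right v =
    let p = bsp_p () in
    let msgs =                                   (* send the local value only to the right-hand neighbour *)
      apply (mkpar (fun src x dst -> if dst = (src + 1) mod p then Some x else None)) v in
    let recvd = put msgs in
    apply (mkpar (fun me from ->                 (* keep the value received from the left-hand neighbour *)
      match from ((me - 1 + p) mod p) with Some x -> x | None -> assert false)) recvd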
[Two plots show the measured acceleration and the BSP-predicted acceleration of the systolic and total-exchange methods against the number of processors, for N = 50000, together with the ideal acceleration.]
Fig. 6. Performance (using MPICH) of the N-body problem
particles have been treated and a folding of these values can be done to finish the computation. The BSP cost is accordingly:

  p × (N/p × g + l + 2 × N/p + N/p × N/p) + p × g + l  ≡  N × g + p × l + 2 × N + N/p × N + p × g + l
that is, the same as before but with more synchronization time. Fig. 6 gives the predicted and measured performance (using MPICH). The sizes of the lists of particles are measured as before. One can notice that the performance scales well. The naive method has better theoretical and practical performance than the systolic one. The asset of the systolic method appears when the number of particles is so big that the lists do not fit in the main memory of a node of the parallel machine: performance then degenerates due to the paging mechanism used to get enough virtual memory. This is a limitation of the BSP model that could be solved using a more sophisticated model for out-of-core applications [9]. One can also notice that for our two examples (sieve of Eratosthenes and N-body), the measured performances are sometimes better than the predicted ones. This is due to the fact that, in some cases, communications can perform better than predicted (g and l are averages of network parameters).
4 Conclusion
BSML is a language for programming BSP algorithms. We have attempted to show that it is possible to predict the performance of BSP algorithms from the parameters of a given machine and thus to choose the most efficient and scalable BSML program. We have illustrated this with two classical problems. Our work illustrates the value of a high-level parallel paradigm which gives more compact and therefore more readable code without sacrificing too much performance. Even if our methodology might seem lengthy, we believe it is necessary for the future of parallel programming, especially as multi-core machines become the norm. Programming them (as well as clusters) in a safe, expressive, predictable and efficient manner will surely become one of the keys to software design.
Future work will naturally include a comparison with other parallel languages and libraries such as OCamlP3L, C+BSPlib, C+MPI, Eden or Gph [10] (and with bigger programs and other kinds of architectures, such as multi-core ones) in order to validate our approach. Finally, manual cost analysis for functional programs has its limits: it is necessary to estimate (sometimes by testing) the number of flops needed to perform a pattern-matching, build a tuple, etc. We could use [16] in order to estimate them automatically.

Acknowledgements. Thanks to Louis Gesbert for his spell checking.
References 1. Akerholt, G., Hammond, K., Peyton-Jones, S., Trinder, P.: Processing transactions on GRIP, a parallel graph reducer. In: Reeve, M., Bode, A., Wolf, G. (eds.) PARLE 1993. LNCS, vol. 694. Springer, Heidelberg (1993) 2. Benoit, A., Robert, Y.: Mapping pipeline skeletons onto heterogeneous platforms. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 591–598. Springer, Heidelberg (2007) 3. Bisseling, R.H.: Parallel Scientific Computation. A structured approach using BSP and MPI. Oxford University Press, Oxford (2004) 4. Bonorden, O., Juurlink, B., Von Otte, I., Rieping, O.: The Paderborn University BSP (PUB) library. Parallel Computing 29(2), 187–207 (2003) 5. Chailloux, E., Foisy, C.: A Portable Implementation for Objective Caml Flight. Parallel Processing Letters 13(3), 425–436 (2003) 6. Conchon, S., Le Fessant, F.: Jocaml: Mobile agents for Objective-Caml. In: ASA 1999, pp. 22–29. IEEE Press, Los Alamitos (1999) 7. Di Cosmo, R., Li, Z., Pelagatti, S., Weis, P.: Skeletal Parallel Programming with OcamlP3L 2.0. Parallel Processing Letters (2008) 8. Di Cosmo, R., Pelagatti, S., Li, Z.: A calculus for parallel computations over multidimensional dense arrays. Computer Language Structures and Systems (2005) 9. Gava, F.: External Memory in Bulk Synchronous Parallel ML. Scalable Computing: Practice and Experience 6(4), 43–70 (2005) 10. Hammond, K., Trinder, P.: Comparing parallel functional languages: Programming and performance. Higher-order and Symbolic Computation 15(3) (2003) 11. Hayashi, Y., Cole, M.: Bsp-based cost analysis of skeletal programs. In: Michaelson, G., Trinder, P., Loidl, H.-W. (eds.) Trends in Functional Programming, ch. 2, pp. 20–28 (2000) 12. Hill, J.M.D., McColl, W.F.: BSPlib: The BSP Programming Library. Parallel Computing 24, 1947–1980 (1998) 13. Hinsen, K.: Parallel scripting with Python. Computing in Science & Engineering 9(6) (2007) 14. Krusche, P.: Experimental Evaluation of BSP Programming Libraries. Parallel Processing Letters (to appear, 2008) 15. Loulergue, F., Gava, F., Billiet, D.: Bulk Synchronous Parallel ML: Modular Implementation and Performance Prediction. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3515, pp. 1046–1054. Springer, Heidelberg (2005)
16. Scaife, N., Michaelson, G., Horiguchi, S.: Empirical Parallel Performance Prediction From Semantics-Based Profiling. Scalable Computing: Practice and Experience 7(3) (2006) 17. Skillicorn, D.B., Hill, J.M.D., McColl, W.F.: Questions and Answers about BSP. Scientific Programming 6(3), 249–274 (1997) 18. Verlaguet, J., Chailloux, E.: HirondML: Fair Threads Migrations for Objective Caml. Parallel Processing Letters (to appear, 2008)
On the Modeling Timing Behavior of the System with UML(VR)

Leszek Kotulski1 and Dariusz Dymek2

1 Department of Automatics, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Krakow, Poland
2 Department of Computer Science, Cracow University of Economics, 31-510 Krakow, Poland
[email protected], [email protected]
Abstract. UML notation is assumed to be independent from any software modeling methodology. The existing methodologies support the creation of the final system model, but they do not care about the formal documentation of the reasoning process; the associations between the elements belonging to different types of UML diagrams are either remembered as informal documentation outside the UML model or are forgotten. The Vertical Relations described in this paper try to fill this gap and allow us to look at the use of timing diagrams from a new, more comprehensive perspective. The usefulness of Vertical Relations in evaluating the timing properties of Data Warehouse reporting systems is presented.
1 Introduction

Unified Modeling Language (UML) [1] is an open standard controlled by the Object Management Group (OMG). UML is a family of graphical notations backed by a single meta-model. It can be used for describing and designing software systems, in particular those using the object-oriented paradigm. In the current version of the UML standard (ver. 2.0) there are 13 types of diagrams with precisely defined semantics [2]. The variety of diagram types allows us to describe different aspects of a designed system, in particular:
− the use case diagram shows interactions of users or other software systems,
− the software structure is dealt with by the class diagram, the configuration of instances of classes is shown on the object diagram, the package diagram represents the compile-time structure of classes, and the composite structure diagram deals with the runtime decomposition of a class (such as a federate),
− the activity diagram conveys the procedural and parallel behavior of classes, and the state machine diagram shows how events change the interior states of an object,
− to align interactions between objects the sequence diagram is used, the communication diagram is used for emphasizing the links used by interactions, and the timing diagram is used to cope with the timing aspects of interactions,
− the deployment diagram shows the structure of the running system.
In general, the UML standard allows us to cope with a dynamical description of the system beyond the semantic level. UML enables us to describe how a system and its components interact externally as well as internally. UML as a tool became the basis for software development methodologies like RUP (Rational Unified Process) [3] or ICONIX [4]. UML builds on such fundamental concepts as the object-oriented paradigm and distributed and parallel programming, but is independent from those methodologies. This fact gives UML some advantages; in particular, it can be treated as a universal tool for many purposes. In general, software development methodologies based on UML are sequences of informal recommendations on how to design a software system, step by step, using different kinds of UML diagrams. The final result is expressed in UML and the whole designing process is documented only informally. The possibility of creating various software development methodologies based on UML follows from the fact that, inside UML, formal dependencies among diagrams of different kinds are not defined. This leaves room for various methods of reasoning in software development methodologies. In this paper we show that the capability of establishing formal relations among different kinds of UML diagrams gives some new advantages. The way of introducing such relations, presented in Section 2, is independent from any methodology and does not affect the UML structure and properties. We named this type of relation the Vertical Relation, to distinguish it from relations between elements of a single kind of diagram, which we called Horizontal Relations. The proposed approach is an extension of UML and we named it UML(VR) to emphasize the existence of the additional relations. The consistency of the relations among different kinds of UML diagrams is maintained on the basis of graph theory. The introduction of UML(VR) allows us to suggest the application of timing diagrams to describe the timing behavior of the Actors appearing in the use case diagrams. In Section 3 an example of such a solution in the case of a Reporting System based on the Data Warehouse concept is presented. Moreover, having defined the Vertical Relation we are able to use timing diagrams associated with the use case diagrams to generate the timing diagrams associated with elements of the class, the object and the deployment diagrams (see Section 4), which can be useful in refactoring decisions.
2 UML(VR) Concept

UML itself defines relations between elements of a given kind of diagram or among diagrams of the same class. Generally, UML does not formally define relations between various kinds of diagrams. Version 2.0 introduces <<
The problem of considering both horizontal and vertical consistency of a UML model was pointed out a few years ago [5], but in practice those investigations have concentrated on horizontal consistency. Fortunately, UML diagrams can be expressed as EDG graphs using the XMI standard [7]. During the process of designing a software system we can translate each UML diagram into the form of a graph and create a Graph Repository, which gathers the information from every phase of the designing process. This gives us the possibility of taking advantage of graph grammars to trace the software system designing process, treating this process as a sequence of graph transformations. We are able to participate in the designing process and simultaneously modify the Graph Repository. In [8] it was proved that, with the help of the aedNLC graph transformation system [9], we can control the generation of such a Graph Repository with O(n²) computational complexity. This solution enables us to establish a formal linkage between elements from different kinds of UML diagrams as a Vertical Relation. To illustrate the capability of Vertical Relations we present below one of their exemplifications, called the Accomplish Relation (AR) [10], [11]. In the Graph Repository we can distinguish various layers (relevant to UML diagrams): the use case layer (UL), the sequence layer (SL), the class layer (CL) (divided into the class body layer (CBL) and the class method layer (CML)), the object layer1 (OL) (divided into the object body layer (OBL) and the object method layer (OML)), the timing layer (TML) and the hardware layer (HL). For any G representing a subgraph of the graph repository R, the notation G|XL means the graph with the nodes belonging to the XL layer (where XL stands for any UML type of diagram) and the edges induced from the connections inside R. For example, R|UL∪OL means the graph with all the nodes (n_set(R|UL∪OL)) representing user requirements and all the objects servicing these requirements, with the edges (e_set(R|UL∪OL)) representing both horizontal and vertical relations inside the graph repository. Now we can present the definition of the Accomplish Relation function:

  AR: (Node, Layer) → AR(Node, Layer) ⊂ n_set(R|Layer)

is the function where:
  Node ∈ n_set(R|XL), XL ∈ {UL, CBL, CML, OBL, OML, HL},
  Layer ∈ {UL, CBL, CML, OBL, SL, OML, TML, HL}, Layer ≠ XL,
and AR(Node, Layer) is the subset of nodes from n_set(R|Layer) which stay in a relationship of the type "support service" or "is used to" with the given Node, based on the role performed in the system structure. For better understanding, let us consider an example:
− for any user requirement r ∈ n_set(R|UL), AR(r,OBL) returns the set of objects which support this requirement's service,
− for any object o ∈ n_set(R|OBL), AR(o,UL) returns the set of requirements that are supported by any of its methods,
− for any object o ∈ n_set(R|OBL), AR(o,HL) returns a set consisting of the computing (hardware) node on which the given object is allocated,
− for any x ∈ n_set(R|UL∪CBL∪OBL∪SL∪HL), AR(x,TML) returns a set consisting of the timing diagram describing the timing properties of its behavior,
1 Packages introduce some sub-layers structure inside this layer.
− for any class c ∈ n_set(R|CBL), AR(c,UL) returns the set of requirements that are supported by any of its methods.
The above relations are maintained by the repository graph structure, so there are no complexity problems with their evaluation. Moreover, the graph repository is able to trace any software or requirement modification, so these relations change dynamically during the system's lifetime. In the next section the way of using the AR function in practice is presented.
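To fix the intuition, the repository and the AR function can be pictured as a simple layer-labelled graph; the following sketch is only an illustration of the concept (the names and the representation are assumptions of this sketch, not the authors' graph-grammar-based implementation):

  type layer = UL | SL | CBL | CML | OBL | OML | TML | HL
  type node = { id : string; layer : layer }
  type repository = { nodes : node list; edges : (node * node) list }

  (* ar repo n l: all nodes of layer l linked with n in the repository *)
  let ar repo n l =
    List.filter_map
      (fun (a, b) ->
         if a.id = n.id && b.layer = l then Some b
         else if b.id = n.id && a.layer = l then Some a
         else None)
      repo.edges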
3 Association Timing Diagrams for Use Case Actors

One of the most interesting types of diagrams introduced in the UML 2.0 standard are the timing diagrams. They are used to show interactions when the primary purpose of the diagram is to reason about time. Their properties have been exploited in many areas; one of the most interesting is the solution of the protocol compliance verification problem presented by Bunker [11]. Timing diagrams focus on conditions changing within and among lifelines along a linear time axis. They describe the behavior of both individual classifiers and interactions of classifiers, focusing attention on the time of occurrence of events causing changes in the modeled conditions of the lifelines [2]. The classifier is defined in the UML 2.0 standard as: "A collection of instances that have something in common. A classifier has the features that characterize its instances". Classifiers include interfaces, classes, data types, and components [2]. The introduction of timing diagrams is illustrated in the OMG documents only by their application to the sequence diagram. Remembering (as a kind of Vertical Relation) the influence of the existence of elements of one type of diagram (e.g. the use case diagram) on the creation of elements of another type of diagram (e.g. sequence or class diagrams) during the modeling process creates a new possibility of using timing diagrams. We suggest using the timing diagrams to describe how the Actors' activity will change during the system's operation. Let us notice that a system overload occurs if at least one of the following events happens:
− two or more processes that consume most of the computing system's resources start at the same time,
− some process is activated by a large population of users.
Information about the possible schedule of the Actors (defined at the use case level) can be remembered by associating a timing diagram with each of them. However, this information will be useful only if we are able to translate it into timing diagrams describing the structure of the other elements of the software system (i.e. class, object and deployment diagrams). The VR creates such a possibility. In [13] we consider a typical reporting system. Every business organization during its activity generates many individual reports. Some of them are created for managers and executives for internal use only; others are created for external organizations which are entitled to monitor the state and activity of the given organization. For example, in Poland commercial banks have to generate obligatory reports, inter alia, for the National Bank of Poland (WEBIS reports), the Ministry of Finance (MF reports) and the
Warsaw Stock Exchange (SAB reports)2. Altogether, these external reports comprise a few hundred individual sheets with thousands of individual data items. In general, these reports are based on almost the same kind of source data, but the external requirements on format and contents mean that different software tools (based on different algorithms) are needed. These reports have a periodical character: depending on the demands, a given report must be drawn up every day, week, decade, month, quarter, half year or year, based on data from the end of the corresponding day. Let us assume that we have a Reporting Data Mart system based on a Data Warehouse. To simplify the example we skip the organization of the Extraction, Transformation and Loading (ETL) processes and assume that all necessary information is maintained by the Data Warehouse Repository. It is easy to realize that for different Data Marts the set of used DW processes can be different. Analyzing the information content of the reports we can divide them into a few categories, based on the kind of source data and the way of processing it. Each of those categories, regardless of its periodical character, is generated by different processes. Their results are integrated at the level of the user interface depending on the period and the external organization. The schema of the data flow for the Reporting Data Mart is presented in Fig. 1.
Fig. 1. Schema of Reporting Data Mart
Each User Application represents functionality associated with a single period and a single type of obligatory report. Because of that, we can treat these applications as user requirements defining the Data Mart functionality. As we mentioned above, reports have a periodical character. This means that the processes associated with these report categories also have a periodical character: they are executed only in a given period of time. This period is strictly connected with the organizational process of drawing up the given type of report. Let us notice that the obligatory reports for the National Bank of Poland must fulfill many control rules before they can be sent out. In practice, this means that those reports are not generated in a single execution of the proper software process. Instead, we have an organizational process which can go on even for a few days, during which the software process is executed many times, after each data correction. Because of that, if we
2 The structure and information contents of those reports are based on international standards, so the same situation can be found in other countries.
analyze the time of availability of the system functionality connected with those reports, we must take into account a longer period of readiness of the hardware environment than in the case of a single process execution. For the purpose of this example we take a simple Reporting Data Mart with functionality restricted to only three report categories: weekly, decadal and monthly. We assume that the processes associated with the weekly, decadal and monthly report generation are started, respectively, 2, 3 or 4 days before the report delivery time. The second type of reports are ad hoc reports generated by consultants, or the verification of hypotheses prepared by them. The mentioned activities are represented in the use case diagram presented in Fig. 2.
Fig. 2. Schema for Reports Generation activities
To estimate the system workload we first have to estimate the external usage of each system function. Each Actor artifact represents a group of users with a similar kind of behavior. Only the Consultant Actor uses more than one system function (ad hoc report generation and hypothesis verification). Thus we have to create five timing diagrams to express the users' behavior timing characteristics [13]. The first three timing diagrams, representing the activity of the weekly, decadal and monthly report generation processes, are presented in Fig. 3, where the number of active processes of a given type is either 0 or 1. An example of the Consultant population activity with respect to ad hoc report generation and hypothesis verification is presented in Fig. 4 (the Y axis is scaled 1:10 in comparison to Fig. 3), based on the assumption that the ad hoc reports are generated during worktime and the hypothesis verification is made in the background. If we are able to estimate the overload made by a single Actor request (the way of such an estimation will be considered in the next section), we can evaluate the total system workload. For the purpose of this example the assumed estimation is presented in Table 1.

Table 1. Process overloading
  weekly report            30%
  decadal report           30%
  monthly report           30%
  ad hoc report            1.5%
  hypothesis verification  4.5%
Fig. 3. Timing diagrams for periodical reports generation
Fig. 4. Timing diagrams for Consultant activities
Fig. 5. System workload before (a) and after (b) evaluation
Thus (after a simple calculation) we can generate a timing diagram representing the final system overloading, as presented in Fig. 5a. We can observe that the user demand exceeds the computing power of the system on the 9th and 30th day and from the 57th to the 60th day of the system observation. Fortunately, data for
the monthly and decadal report generation are usually prepared by the ETL process a few days earlier, so we can start the decadal report evaluation on the 7th and 29th day and the monthly reports on the 25th and 54th days. Fig. 5b represents the overloading evaluation in such a case. The improvement of the effectiveness of these processes achieved by the distribution of some processes will be considered in the next section.
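The 'simple calculation' mentioned above can be sketched as follows, with each Actor's timing diagram modelled as a function from the day number to the number of active requests of that kind, and the per-request overloads of Table 1 as weights (an illustration only, not the authors' tooling):

  (* workload: percentage of the computing power demanded on a given day *)
  let workload diagrams weights day =
    List.fold_left2 (fun acc d w -> acc +. float_of_int (d day) *. w) 0. diagrams weights
  (* e.g. workload [weekly; decadal; monthly; ad_hoc; hypothesis]
                   [30.; 30.; 30.; 1.5; 4.5] 9 *)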
4 System Workload Estimation

The solution presented in the previous section is based on the assumption that we are able to estimate the workload of the computing system caused by an Actor request. In a system such as a Data Warehouse, where evaluations of the same requests are repeated, such an estimation can be made by observing the real system. However, it seems desirable to consider the influence of the information gathered in the timing diagrams (describing the Actors' timing behavior) on the final model of the developed software system. In all methodologies using UML the use case diagrams (and class diagrams, for the illustration of the Domain Model) are the first diagrams generated during system modeling. Here we assume that the timing diagrams associated with the Actors' activities are generated at the use case level to express the time relations among the elements of the system structure associated with the periodical character of the system functions. The vertical relation AR introduced in Section 2 allows us to designate, for each Actor's request r:
− the set of classes modeling the algorithms used during its service (AR(r,CBL)),
− the set of objects that are responsible for servicing the request r (AR(r,OBL)),
− the deployment of the objects mentioned in the previous point (AR(o,DL)).
Thus we are able to estimate the workload of the software and hardware components in the following way. Let, for each r ∈ n_set(R|UL), TM(r,t) represent the timing diagram associated with r (more formally, TM(r,t) = AR(r,TML)(t)). Having defined TM for requirements we can calculate it for methods, classes, objects and hardware nodes:

  for any m ∈ n_set(R|CML):   TM(m,t) = ⋃_{r ∈ AR(m,UL)} TM(r,t)
  for any c ∈ n_set(R|CBL):   TM(c,t) = ⋃_{r ∈ AR(c,UL)} TM(r,t)
  for any o ∈ n_set(R|OBL):   TM(o,t) = ⋃_{r ∈ AR(o,UL)} TM(r,t)
  for any h ∈ n_set(R|HL):    TM(h,t) = ⋃_{o ∈ AR(h,OBL)} TM(o,t)
where ∪ means the logical sum. The timing diagrams generated for methods and classes help us to better understand the modeled system structure and can be very useful in finding the system elements that should be refactored [14]. The timing diagram generated for the Hardware Layer gives us information about the time of activity of the hardware nodes, triggered by the execution of the processes corresponding to the objects allocated on them.
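To make the composition rule concrete, a timing diagram can be modelled as a boolean function of time and the union as a pointwise logical sum (an illustration only; in the approach described here these diagrams are kept as graph structures in the repository):

  type tm = float -> bool                                   (* is the element active at time t? *)
  let tm_union ds : tm = fun t -> List.exists (fun d -> d t) ds
  (* e.g. the diagram of an object o is tm_union applied to the diagrams
     of the requirements returned by AR(o, UL) *)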
Let us notice that the timing diagrams generated for the objects can be used to estimate the level of utilization of the hardware equipment. Let us assume that:
− we are able to estimate the (average, periodical) performance of the object components (described as per(o)); this estimation should be associated with the computational complexity of the algorithms used inside the object,
− we know the computing power of the hardware nodes (described as cp(h)).
Then the function

  EF(h,t) = ( Σ_{o ∈ AR(h,OBL)} TRA(o,t) ∗ per(o) ) / cp(h)
shows us the efficiency of the hardware node utilization in time. It can be used to indicate the periods of time in which the hardware equipment is almost unused or is very close to overloading. A brief analysis of the presented function shows that we have three ways of influencing its value: (1) we can reschedule the user requirements by changing business processes, (2) we can decrease the performance demanded by the objects' processes by rewriting software modules, or (3) we can increase the hardware computing power.
5 Conclusions

The recent release of UML 2.0 has corrected a lot of design difficulties encountered in the 1.x revisions. One of the newly introduced capabilities is the possibility of characterizing the timing behavior of some components of the modeled system (with the help of timing diagrams). Unfortunately, Engels' observation that a general consistency of the UML model is still missing [7] remains valid. In this paper the idea of formally remembering (as a kind of vertical relation) the associations between elements belonging to different kinds of UML diagrams was presented. Those associations appear during the reasoning process while modeling the system. This formal approach has a specific context: the mentioned associations are remembered as graph structures (equivalent to the UML Interchange standard [15]), so their maintenance and/or evaluation is possible with the help of graph transformations. Based on this idea, an application of timing diagrams as a tool for the description of the Actors' timing behavior was shown. The capability of automatically generating the timing diagrams associated with objects and classes points out the parts of the system that should be considered for possible refactoring. This is all the more important since refactoring techniques in general are based on the system developer's intuition (who discovers "bad smells" parts of a program [14]). The presented UML(VR) concept seems to be a very promising approach. It can be used for different purposes in the development of a software system. The use of the
AR function, which is an exemplification of the Vertical Relation, has also been studied by the authors in such areas as test data generation [18].
References 1. Rumbaugh, J., Jacobson, I., Booch, G.: The Unified Modeling Language Reference Manual. Addison-Wesley Longman Ltd, Amsterdam (1999) 2. Unified Modeling Language, OMG v 2.1.2., http://www.omg.org 3. IBM Rational Unified Process, http://www-306.ibm.com/software/rational/ 4. Rozenberg, D., Scott, K.: Applying Use Case Driven Object Modeling with UML: An Annotated e-Commerce Example. Addison-Wesley, Reading (2001) 5. Kuźniarz, L., Reggio, G., Sourrooille, J., Huzar, Z.: Workshop on Consistency in UML-based Software Development, http://www.ipd.bth.se/uml2002/RR-2002-06.pdf 6. Sourrouille, J., Caplat, G.: A Pragmatic View about Consistency of UML Models. In: Workshop on Consistency Problems in UML-based Software Development II, San Francisco (2003) 7. Engels, G., Groenewegen, L.: Object-Oriented modeling: A road map. In: Finkelstein, A. (ed.) Future of Software Engineering 2000, pp. 105–116. ACM Press, New York (2000) 8. Kotulski, L.: Nested Software Structure Maintained by aedNLC graph grammar. In: Proceedings of the 24th IASTED International Multi-Conference Software Engineering, Innsbruck, Austria, pp. 335–339 (2006) 9. Kotulski L.: Model wspomagania generacji oprogramowania w środowisku rozproszonym za pomocą gramatyk grafowych. Wydawnictwo Uniwersytetu Jagiellońskiego, Kraków (2000) ISBN 83-233-1391-1 10. Dymek, D., Kotulski, L.: On the hierarchical composition of the risk management evaluation in computer information systems. In: The Second International Conference DepCoS RELCOMEX 2007, Szklarska Poreba, Poland, pp. 35–42 (2007) 11. Dymek, D., Kotulski, L.: Evaluation of Risk Attributes Driven by Periodically Changing System Functionality. Transaction on Engineering, Computing and Technology 16, 315– 320 (2006) 12. Bunker, A., Gopalakrishnan, G., Mckee, S.A.: Formal hardware specification languages for protocol compliance verification. ACM Transactions on Design Automation of Electronic Systems 9(1), 1–32 (2004) 13. Kotulski, L., Dymek, D.: On the load balancing of Business Intelligence Reporting Systems. In: Proceedings of the AIS SIGSAND European Symposium on Systems Analysis and Design, University of Gdansk, Poland, pp. 121–125 (2007) 14. Flower., M., Beck, K., Brant, J., Opdyke, W.: Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co. Inc., Amsterdam (2000) 15. UML Diagram Interchange, OMG, version 1.0, http://www.omg.org/technology/documents/modeling_spec_catalog 16. Engels, G., Küster, J.M., Heckel, R., Groenewegen, L.: A methodology for specifying and analyzing consistency of object-oriented behavioral models. In: The 8th European Software Engineering Conference held jointly with ESEC/FSE-9, pp. 186–195. ACM, New York (2001) 17. Snook, C., Butler, M.: UML-B: Formal modeling and design aided by UML. ACM Transaction on Software Engineering Methodology 15(1), 92–122 (2006) 18. Dymek, D., Kotulski, L.: Using UML(VR) for supporting the automated test data generation. In: The Third International Conference DepCoS - RELCOMEX 2008, Szklarska Poreba, Poland (2008)
Reducing False Alarm Rate in Anomaly Detection with Layered Filtering

Rafal Pokrywka

Institute of Computer Science, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków, Poland
[email protected]
IBM SWG Laboratory, ul. Armii Krajowej 18, 30-150 Kraków, Poland
Abstract. There is a general class of methods for detecting anomalies in a computer system which are based on heuristics or artificial intelligence techniques. These methods are meant to distinguish between normal and anomalous system behaviour. Their main weakness is the false alarm rate, which is usually measured by counting false positives on a sample set representing normal behaviour. In this measurement the base rate of anomalous behaviour in a live environment is not taken into account, and that leads to the base-rate fallacy. This problem can greatly affect the real number of false alarms, which can be significantly greater than the expected value. Usually little can be done to further improve classification algorithms. In this paper a different approach to reducing the real false alarm rate, based on layered filtering, is presented and discussed. The solution explores the potential in a properly structured system of several anomaly detectors.
1 Introduction
The work presented here is part of research on an Intrusion Detection System (IDS) based on anomaly detection. This kind of security system is usually placed in key nodes of the network infrastructure and is often used to complement signature-based detection systems. The system detects anomalies by comparing normal behaviour, stored in some way as a profile, with the current behaviour of supervised processes. This approach allows an attack to be detected without a priori knowledge of the attack's technique, in contrast to signature-based detectors, which are basically blind to novel attack patterns. Anomaly detectors sometimes fail to correctly classify the current event. Of all the misclassification types, the false-positive error, also referred to as a false alarm, which occurs when normal behaviour is marked as anomalous, is the most significant. This article focuses on a very important aspect of my research – proper handling of this kind of error. The goal is to have an effective system yet with very
This paper is NOT related to any of my job responsibilities as an employee of IBM.
low real false alarm rate – even close to 0. The problem is that a real-life system usually works fine most of the time and abnormal events happen only occasionally – simply, the frequency of anomalies is fairly low. The false alarm rate of anomaly detectors is usually measured by taking the ratio between the number of alarms fired on a sample set of normal events and the size of this set – the base rate is not taken into account in this measurement. The result is that the real false alarm rate achieved during monitoring of real systems can be significantly larger than expected – even large enough to make the anomaly detector basically useless. This phenomenon is known as the base-rate fallacy and stems directly from Bayes' theorem. It also frequently escapes the attention of researchers. It is a hard task to further improve current anomaly detection algorithms, which have already achieved quite low false alarm rates, and even a large improvement may not be enough. A real improvement can be achieved by combining several anomaly detectors with different properties and performance into layers which gradually filter out abnormal events. In this article the term "event" is used to describe the smallest set of information from the system that can be classified. The term "behaviour" relates to a sequence of events.
2 Anomaly Detectors Overview
Anomaly detection algorithms are used to decide whether a current event is normal or not. To simplify the discussion, it can be assumed at this point that information about the type of anomaly, such as an error condition or a buffer overflow, can be neglected. In this case the detector can be seen as a binary classifier, as there are only two possible classes of events – normal event or anomaly. There are four possible outcomes from a detector with respect to the actual class of an event:
– true positive (TP) – when the current event is anomalous and the detector prediction is correct (signalisation of anomaly)
– false positive (FP) – when the current event is normal but the detector prediction is incorrect (signalisation of anomaly)
– true negative (TN) – when the current event is normal and the detector prediction is correct (no signalisation of anomaly)
– false negative (FN) – when the current event is anomalous but the detector prediction is incorrect (no signalisation of anomaly)
The false positive is also known in statistics as a type I error and this is the situation when a false alarm is fired. The false negative is known as a type II error and it describes the situation when a real anomaly is missed by the detector. A true positive is when a detector correctly signals an anomaly and a true negative when it correctly indicates a normal event. In scientific publications anomaly detectors are usually characterized by the following operational characteristics:
– detection rate (DR) – which is exactly the same as the true positive rate and can be calculated using the following equation: DR = TP / (TP + FN)
– false alarm rate (FAR) – which is the same as the false positive rate and is calculated using: FAR = FP / (FP + TN)
A good graphical presentation of these two characteristics (DR and FAR) is provided by the ROC curve. It allows one to conveniently compare different detection algorithms or choose the best set of algorithm parameters. A great survey of a couple of the most popular detection techniques can be found in [3].
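As a small illustration (not from the paper; the confusion-matrix counts below are made up), both rates can be computed directly from the four outcome counts:

```python
def rates(tp, fp, tn, fn):
    """Detection rate (true positive rate) and false alarm rate
    (false positive rate) from confusion-matrix counts."""
    dr = tp / (tp + fn)
    far = fp / (fp + tn)
    return dr, far

# hypothetical counts for one day of monitoring
print(rates(tp=19, fp=10, tn=999990, fn=1))  # -> (0.95, 1e-05)
```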
3 The Base-Rate Fallacy
The difficulty in improving the effectiveness of anomaly detectors due to the base-rate fallacy phenomenon was first pointed out by Stefan Axelsson in [2]. The fallacy stems directly from Bayes' theorem, which relates the prior and posterior probability of an event and is given by the following formula:

P(A|B) = P(A) ∗ P(B|A) / P(B) .     (1)

P(B) can be expressed, using the law of total probability for n mutually exclusive outcomes of A, in the following way:

P(B) = Σ_{i=1}^{n} P(A_i) ∗ P(B|A_i) .     (2)

Finally, after combining (1) and (2), the most popular form of Bayes' theorem can be derived:

P(A|B) = P(A) ∗ P(B|A) / Σ_{i=1}^{n} P(A_i) ∗ P(B|A_i) .     (3)

Following Axelsson, let us assume that I means an anomalous event in the system, ¬I that it is a normal event (no anomaly), A that there is an alarm signalisation from a detector and ¬A that there is no alarm fired. The false alarm rate can be expressed by the probability P(A|¬I) and the detection rate by P(A|I). The true negative and false negative rates can be obtained, respectively, in the following way: P(¬A|¬I) = 1 − P(A|¬I) and P(¬A|I) = 1 − P(A|I). Equation (3) can now be rewritten as:

P(I|A) = P(I) ∗ P(A|I) / (P(I) ∗ P(A|I) + P(¬I) ∗ P(A|¬I)) .     (4)
The goal in anomaly detection is to maximise P(I|A), which is called by Axelsson the Bayesian Detection Rate (BDR), and P(¬I|¬A), which is the probability that the lack of an alarm really means the lack of an anomaly and in this paper will be called the Bayesian True Negative Rate (BTNR). In a real-life environment the frequency of anomalies is fairly low. Based on [2] the assumption has been made that
the average is 2 ∗ 10^1 (i.e. 20) anomalous events per day and 10^6 events overall per day. This allows the following probabilities to be calculated: P(I) = 2 ∗ 10^-5 and P(¬I) = 1 − P(I) = 0.99998. Another assumption has been made about the characteristics of a hypothetical anomaly detector: DR = P(A|I) = 1 and FAR = P(A|¬I) = 10^-5. In fact such values would be a really great achievement – they are simply not realistic and serve only to show how significant the base-rate fallacy is. Taking all these values, the calculated BDR is 0.66667. It means that the probability that a fired alarm is not a false alarm is only 0.66667. In practice this value cannot be tolerated – it makes the anomaly detector useless for an administrator. Under the same base rate assumptions the probability BTNR = P(¬I|¬A) = P(¬I) ∗ P(¬A|¬I) / (P(¬I) ∗ P(¬A|¬I) + P(I) ∗ P(¬A|I)) is dominated by the base rate of normal events and is always close to 1, which means that an anomaly rarely escapes detection. Axelsson argues that it is crucial to keep the false alarm rate as low as possible even if the algorithm complexity and resource consumption are very high. This of course makes the detection very slow and there is a risk that an intrusion is not detected on time. The potential damage and losses may have already been done. Also, further improvement of the FAR of current detection algorithms may not be a feasible task – there is too much effort for little gain. Layered filtering may be an answer to these difficulties.
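As a quick numerical check of these figures — my own sketch, not part of the original paper — the following snippet evaluates (4) with the assumed rates and reproduces BDR ≈ 0.66667:

```python
P_I = 2e-5          # base rate of anomalous events: 20 out of 10**6 per day
P_notI = 1 - P_I    # 0.99998
DR = 1.0            # assumed detection rate P(A|I)
FAR = 1e-5          # assumed false alarm rate P(A|not I)

BDR = P_I * DR / (P_I * DR + P_notI * FAR)                         # eq. (4)
BTNR = P_notI * (1 - FAR) / (P_notI * (1 - FAR) + P_I * (1 - DR))
print(round(BDR, 5), round(BTNR, 5))  # 0.66667 1.0
```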
4 Layered Filtering
Layered filtering is well known in air or water pollution elimination. It consists of at least two filters in a sequence, and each filter is responsible for eliminating a different pollution type. The air, for example, flows through all filters and every filter is responsible for eliminating a different chemical pollutant or dust. As a result, clean air is supposed to be obtained. Returning to the computer science field, there is a method of combining binary classifiers to get a multiple classifier. Such a classifier uses more than one specialised binary classifier for each class and combines their outcomes – see for example [4]. The idea behind layered filtering takes something from both analogies: the computer science one and the non-computer science one. It gradually filters out normal system events as air filters do with pollution. It is also similar to a multiple classifier, but with the exception that there are still only two classes of events and the specialisation concerns only the method of how a check is performed. In the end, layered filtering follows the rule of thumb that the whole exceeds the sum of its parts. Let us consider a sequence of layered anomaly detectors Ld_1, ..., Ld_n where n is the number of layers. Each detector has one input stream and two output streams: an a-stream for anomalous and an n-stream for normal events. A detector Ld_i passes to Ld_{i+1} only those events which are classified as anomalous. Among them there could be a lot of false positives, but this is not important at this point. A detector can perform any processing or transformation of events, under the condition that the a-stream of Ld_i is compatible with the input stream of Ld_{i+1}. The
first and last detectors are distinguished in the sense that Ld_1 must accept the event types of the system under supervision and Ld_n outputs information about anomalies to the security officer. Figure 1 presents a schema for an anomaly detector based on layered filtering.

Fig. 1. Layered filtering schema

Let us introduce the following symbols:
– DR_i – i-th layer detection rate
– FAR_i – i-th layer false alarm rate
– P(I)_i – i-th layer base rate of anomalous events
– P(¬I)_i – i-th layer base rate of normal events
– BDR_i – i-th layer Bayesian detection rate
– BTNR_i – i-th layer Bayesian true negative rate
Because of its operational characteristics, the detector Ld_{i+1} operates on a set of events with significantly changed base rates of anomalies and non-anomalies – this is the most important part, as it reduces the influence of the base-rate fallacy on the real false alarm rate. The base rate probabilities for the next layer are expressed in the following way:

P(I)_{i+1} = BDR_i .     (5)
P(¬I)_{i+1} = 1 − BDR_i .     (6)

It is now possible to write equations for BDR_{i+1} and BTNR_{i+1} as functions of, respectively, BDR_i and BTNR_i:

BDR_{i+1} = BDR_i ∗ DR_{i+1} / (BDR_i ∗ DR_{i+1} + (1 − BDR_i) ∗ FAR_{i+1}) .     (7)
BTNR_{i+1} = (1 − BDR_i) ∗ (1 − FAR_{i+1}) / ((1 − BDR_i) ∗ (1 − FAR_{i+1}) + BDR_i ∗ (1 − DR_{i+1})) .     (8)
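To make the recurrence concrete, here is a minimal Python sketch (my own illustration, not code from the paper). Fed with the layer characteristics used in Table 1 below, it reproduces the BDR and BTNR rows of that table up to rounding:

```python
def layered_filter(p_anomaly, layers):
    """Propagate BDR and BTNR through a sequence of (DR, FAR) layers,
    following equations (4)-(8)."""
    results = []
    p_i = p_anomaly  # base rate of anomalous events seen by layer 1
    for dr, far in layers:
        bdr = p_i * dr / (p_i * dr + (1.0 - p_i) * far)
        btnr = (1.0 - p_i) * (1.0 - far) / (
            (1.0 - p_i) * (1.0 - far) + p_i * (1.0 - dr))
        results.append((p_i, bdr, btnr))
        p_i = bdr  # equation (5): the next layer sees BDR as its base rate
    return results

# layer characteristics (DR_i, FAR_i) of the first example filter
layers = [(1.0, 0.00001), (0.98, 0.0001), (0.98, 0.00001)]
for i, (p, bdr, btnr) in enumerate(layered_filter(2e-5, layers), 1):
    print(f"Ld{i}: P(I)={p:.5f} BDR={bdr:.5f} BTNR={btnr:.5f}")
```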
In this method the requirements on the operational characteristics of Ld_i can be relaxed significantly in terms of false alarm rate. However, it is still best to have the detection rate as high as possible. An additional advantage is that efficiency,
in terms of system resource usage, should be improved, because avoiding false alarms is one of the most complicated tasks for an anomaly detector. What is more, the quantity of events that reaches further layers is greatly reduced, which makes it possible to use more sophisticated algorithms there without the risk of a significant increase in the overall computational complexity.
5 Results and Example IDS
In this section results for two example layered filters are shown. In both cases the initial frequencies of normal and abnormal events are taken from the previous section. The frequencies seen by the next layers are calculated using (5) and (6). BDR and BTNR values are calculated using (7) and (8).

Table 1. Results for the first layered filter

           Ld_1      Ld_2      Ld_3
P(I)_i     0.00002   0.66667   0.99994
P(¬I)_i    0.99998   0.33333   0.00005
DR_i       1.0       0.98      0.98
FAR_i      0.00001   0.0001    0.00001
BDR_i      0.66667   0.99994   0.99999
BTNR_i     1         0.96153   0.00254

Fig. 2. BTNR and BDR changes for each layer of the first filter

Table 2. Results for the second layered filter

           Ld_1      Ld_2      Ld_3
P(I)_i     0.00002   0.000391  0.27135
P(¬I)_i    0.99998   0.99960   0.72864
DR_i       0.98      0.95      0.98
FAR_i      0.05      0.001     0.0001
BDR_i      0.000391  0.271353  0.999726
BTNR_i     0.999999  0.999980  0.99260

Fig. 3. BTNR and BDR changes for each layer of the second filter
The first filter consists of three layers. The characteristics of each layer detector were based on assumptions from [2] and are rather unrealistic. Table 1 shows the calculated probabilities of normal and abnormal events, the operational characteristics, and the BDR and BTNR values for each layer. It can be noticed that the BDR for the first filter increases significantly at the second layer and also that the BTNR value falls dramatically at the third layer. For the anomaly detector to be effective both of these values must be maximised. As it is natural that an increase of one of these values causes a decrease of the other, the right balance must be found between the number of layers and the detector characteristics. Figure 2 shows how the BDR and BTNR values change for each layer.
For the second filter more realistic assumptions have been made for the false alarm rates and detection rates of each layer, based on evaluations from [3]. This filter also consists of three layers. In this case the increase of BDR is a bit slower and BTNR stays at acceptable levels for all layers. Table 2 shows the calculated values. Figure 3 presents the relation between the layer number and the BDR (BTNR) values. A form of real implementation of a layered filter can be found in [1]. The system presented there is not directly referred to as a layered filter, but it consists of one layer based on a Variable Order Markov Chain and a second one based on neural networks and multiagent systems. The false alarm rate achieved there is actually 0, but the tests have been performed on a limited number of data sets and there was help from an additional mechanism for suppressing false alarms called "anergic agents".
6 Conclusions and Further Work
This article shows that the layered filtering approach to anomaly detection has potential for reducing the false alarm rate. Furthermore, it can help in reducing computational complexity because of the relaxed requirements, especially for the first layer. However, each layer detector must be carefully chosen in terms of performance and effectiveness. Also, the balance between BDR and BTNR must be monitored, as for a certain number of layers BTNR starts to fall very quickly. Finally, it would be a mistake to use a detector based on the same methods in more than one layer, as probably no additional classification decisions would be made. The method presented here focuses on reducing the false alarm rate, but a similar approach can be taken to reducing the false negative rate by taking the second output of the previous layer and connecting to it some sort of specialised detector which can verify whether any anomaly has escaped detection and, if so, redirect it to the next layer. It is also possible to build a two-dimensional network of detectors as a more sophisticated system for reducing both values. An additional important remark can be made about constructing a network security system based on signatures and anomaly detectors. Placing a signature-based detector on the first line of defence changes the base rates in the wrong way and makes the anomaly detector useless. The first line in the detector stack must be held by a system based on anomaly detection with proper characteristics minimising the false negative rate. The second line may consist of a signature-based system or another anomaly detector. The work presented here may be an important input to the process of building network security from more than one intrusion detection system. It is very interesting how the operational characteristics change with the number of layers. Further work includes the implementation of an IDS based on layered filtering and performing tests in real-life environments. Also, further research is needed on other methods of increasing the effectiveness of intrusion detection systems in terms of false negative and false positive rates.
References 1. Cetnarowicz, K., Rojek, G., Pokrywka, R.: Intelligent Agents as Cells of Immunological Memory. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3993, pp. 855–862. Springer, Heidelberg (2006) 2. Axelsson, S.: The Base-Rate Fallacy and the Difficulty of Intrusion Detection. ACM Trans. Inf. Syst. Secur. 3, 186–205 (2000) 3. Warrender, C., Forrest, S., Perlmutter, B.: Detecting Intrusions Using System Calls: Alternative Data Models. In: IEEE Symposium on Security and Privacy, pp. 133– 145 (1999) 4. Klautau, A., Jevtic, N., Orlitsky, A.: Combined Binary Classifiers with Applications to Speech Recognition. In: International Conference on Spoken Language Processing 2002, pp. 2469–2472 (2002)
Performance of Multicore Systems on Parallel Data Clustering with Deterministic Annealing

Xiaohong Qiu, Geoffrey C. Fox, Huapeng Yuan, Seung-Hee Bae, George Chrysanthakopoulos, and Henrik Frystyk Nielsen

Research Computing UITS, Indiana University Bloomington, [email protected]
Community Grids Lab, Indiana University Bloomington, {gcf,yuanh,sebae}@indiana.edu
Microsoft Research, Redmond WA, {georgioc, henrikn}@microsoft.com
Abstract. We present a performance analysis of a scalable parallel data clustering algorithm with deterministic annealing for multicore systems that compares MPI and a new C# messaging runtime library CCR (Concurrency and Coordination Runtime) with Windows and Linux and using both threads and processes. We investigate effects of memory bandwidth and fluctuations of run times of loosely synchronized threads. We give results on message latency and bandwidth for two processor multicore systems based on AMD and Intel architectures with a total of four and eight cores. We compare our C# results with C using MPICH2 and Nemesis and Java with both mpiJava and MPJ Express. We show initial speedup results from Geographical Information Systems and Cheminformatics clustering problems. We abstract the key features of the algorithm and multicore systems that lead to the observed scalable parallel performance. Keywords: Data mining, MPI, Multicore, Parallel Computing, Performance, Threads, Windows.
1 Introduction Multicore architectures are of increasing importance and are impacting client, server and supercomputer systems [1-6]. They make parallel computing and its integration with large systems of great importance as "all" applications need good performance rather than just the relatively specialized areas covered by traditional high performance computing. In this paper we consider data mining as a class of applications that has broad applicability and could be important on tomorrow's client systems. Such applications are likely to be written in managed code (C#, Java) and run on Windows (or equivalent client OS for Mac) and use threads. This scenario is suggested by the recent RMS (Recognition, Mining and Synthesis) analysis by Intel [5]. In our research, we are looking at some core data mining algorithms and their application to scientific areas including cheminformatics, bioinformatics and demographic studies using GIS (Geographical Information Systems). On the computer science side, we are
looking at performance implications of both multicore architectures and use of managed code. Our close ties to science applications ensure that we understand important algorithms and parameter values and can generalize our initial results on a few algorithms to a broader set. In this paper we present new results on a powerful parallel data clustering algorithm that uses deterministic annealing [20] to avoid local minima. We explore in detail the sources of the observed synchronization overhead. We present the performance analysis for C# and Java on both Windows and Linux and identify new features that have not been well studied for parallel scientific applications. This research was performed on a set of multicore commodity PC's summarized in Table 1; each has two CPU chips and a total of 4 or 8 CPU cores. The results can be extended to computer clusters as we are using similar messaging runtime but we focus in this paper on the new results seen on the multicore systems.

Table 1. Multicore PC's used in paper

AMD4: 4 core, 2 Processor HPxw9300 workstation, 2 AMD Opteron 275 CPUs at 2.19 GHz, L2 Cache 2x1 MB (for each chip), Memory 4 GB. XP 64bit & Server 2003
Intel4: 4 core, 2 Processor Dell Precision PWS670, 2 Intel Xeon CPUs at 2.80 GHz, L2 Cache 2x2 MB, Memory 4 GB. XP Pro 64bit
Intel8a: 8 core, 2 Processor Dell Precision PWS690, 2 Intel Xeon E5320 CPUs at 1.86 GHz, L2 Cache 2x4 MB, Memory 8 GB. XP Pro 64bit
Intel8b: 8 core, 2 Processor Dell Precision PWS690, 2 Intel Xeon X5355 CPUs at 2.66 GHz, L2 Cache 2x4 MB, Memory 4 GB. Vista Ultimate 64bit and Fedora 7
Intel8c: 8 core, 2 Processor Dell Precision PWS690, 2 Intel Xeon X5345 CPUs at 2.33 GHz, L2 Cache 2x4 MB, Memory 8 GB. Redhat 5
Sect. 2 discusses the CCR and SALSA runtime described in more detail in [7-9]. Sect. 3 describes our motivating clustering application and explains how it illustrates a broader class of data mining algorithms [17]. These results identify some important benchmarks covering memory effects, runtime fluctuations and synchronization costs discussed in Sections 4-6. There are interesting cache effects that will be discussed elsewhere [8]. Conclusions are in Sect. 8 while Sect. 7 briefly describes the key features of the algorithm and how they generalize to other data mining areas. All results and benchmark codes presented are available from http://www.infomall.org/salsa [16].
2 Overview of CCR and SALSA Runtime Model We do not address possible high level interfaces such as OpenMP or parallel languages but rather focus on the lower level runtime to which these could map. In other papers [7-9] we have explained our hybrid programming model SALSA (Service Aggregated Linked Sequential Activities) that builds libraries as a set of services and uses simple service composition to compose complete applications [10]. Each service then runs in parallel on any number of cores – either part of a single PC or spread out
over a cluster. The performance requirements at the service layer are less severe than at the “microscopic” thread level for which MPI is designed and where this paper concentrates. We use DSS (Decentralized System Services) which offers good performance with messaging latencies of 35 µs between services on a single PC [9]. Applications are built from services; services are built as parallel threads or processes that are synchronized with low latency by locks, MPI or a novel messaging runtime library CCR (Concurrency and Coordination Runtime) developed by Microsoft Research [11-15]. CCR provides a framework for building general collective communication where threads can write to a general set of ports and read one or more messages from one or more ports. The framework manages both ports and threads with optimized dispatchers that can efficiently iterate over multiple threads. All primitives result in a task construct being posted on one or more queues, associated with a dispatcher. The dispatcher uses OS threads to load balance tasks. The current applications and provided primitives support a dynamic threading model with some 8 core capabilities given in more detail in [9]. CCR can spawn handlers that consume messages as is natural in a dynamic search application where handlers correspond to links in a tree. However one can also have long running handlers where messages are sent and consumed at a rendezvous points (yield points in CCR) as used in traditional MPI applications. Note that “active messages” correspond to the spawning model of CCR and can be straightforwardly supported. Further CCR takes care of all the needed queuing and asynchronous operations that avoid race conditions in complex messaging. CCR is attractive as it supports such a wide variety of messaging from dynamic threading, services (via DSS described in [9]) and MPI style collective operations discussed in this paper. For our performance comparisons with MPI, we needed rendezvous semantics which are fully supported by CCR and we chose to use the Exchange pattern corresponding to the MPI_SENDRECV interface where each process (thread) sends and receives two messages equivalent to a combination of a left and right shift with its two neighbors in a ring topology. Note that posting to a port in CCR corresponds to a MPISEND and the matching MPIRECV is achieved from arguments of handler invoked to process the port.
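For comparison, the Exchange pattern built on MPI's send-receive primitive can be sketched as follows; this is my own minimal illustration using mpi4py, not the authors' C# and CCR code:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

# Each process sends one message to each ring neighbour and receives one from
# each, i.e. a combined left and right shift expressed as two send-receives.
from_right = comm.sendrecv(f"hello from {rank}", dest=left, source=right)
from_left = comm.sendrecv(f"hello from {rank}", dest=right, source=left)
print(rank, from_left, from_right)
```

Run under an MPI launcher (e.g. mpiexec -n 8 python exchange.py), each rank exchanges a message with both ring neighbours, which is the communication pattern timed in Sect. 5.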
3 Deterministic Annealing Clustering Algorithm We are building a suite of data mining services to test the runtime and two layer SALSA programming model. We start with data clustering which has many important applications including clustering of chemical properties which is an important tool [18] for finding for example a set of chemicals similar to each other and so likely candidates for a given drug. We are also looking at clustering of demographic information derived from the US Census data and other sources. Our software successfully scales to cluster the 10 million chemicals in NIH PubChem and the 6 million people in the state of Indiana. Both applications will be published elsewhere and the results given here correspond to realistic applications and subsets designed to test scaling. We use a modification of the well known K-means algorithm [19], using deterministic annealing [20], that has much better convergence properties than K-means and good parallelization properties.
For a set of data points X (labeled by x) and cluster centers Y (labeled by k), one gradually lowers the annealing temperature T and iteratively calculates:

Y(k) = Σ_x p(X(x),Y(k)) X(x) ,
p(X(x),Y(k)) = exp(−d(X(x),Y(k))/T) p(x) / Z_x ,     (1)
with Z_x = Σ_k exp(−d(X(x),Y(k))/T) .
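A serial sketch of one iteration of (1) — my own illustration, not the authors' parallel C# code — assuming squared Euclidean distance for d, a uniform point prior p(x), and the standard normalized form of the centroid update:

```python
import numpy as np

def da_step(X, Y, T):
    """One deterministic-annealing update of cluster centers Y at temperature T.
    X: (N, d) data points, Y: (K, d) current centers."""
    # squared Euclidean distances d(X(x), Y(k)), shape (N, K)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    # subtract the per-point minimum for numerical stability
    # (a per-row constant cancels in the normalization below)
    d2 -= d2.min(axis=1, keepdims=True)
    # association probabilities p(X(x), Y(k)), normalized per point by Z_x
    w = np.exp(-d2 / T)
    p = w / w.sum(axis=1, keepdims=True)
    # new centers: probability-weighted means of the points
    return (p.T @ X) / p.sum(axis=0)[:, None]

# usage: start from (nearly) coincident centers and lower T gradually
X = np.random.rand(1000, 2)
Y = np.tile(X.mean(axis=0), (3, 1)) + 1e-4 * np.random.rand(3, 2)
for T in [1.0, 0.3, 0.1, 0.03, 0.01]:
    for _ in range(20):
        Y = da_step(X, Y, T)
```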
Here d(X(x),Y(k)) is the distance defined in the space where clustering is occurring. Parallelism can be implemented by dividing the points X between the cores and there is a natural loosely synchronous barrier where the sums in each core are combined in a reduction collective to complete the calculation in (1). Rather than plot speed-up, we focus in more detail on the deviations from "perfect speed-up (of P)". Such parallel applications have a well understood performance model that can be expressed in terms of a parallel overhead f(n,P) (roughly 1 − efficiency) where different overhead effects are naturally additive. Putting T(n,P) as the execution time on P cores, or more generally processes/threads, we can define

Overhead f(n,P) = (P T(n,P) − T(Pn,1)) / T(Pn,1) ,
efficiency ε = 1/(1+f) and Speed-up = εP .     (2)
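As a small worked example of (2) — mine, with hypothetical timings — overhead, efficiency and speed-up can be computed as:

```python
def parallel_metrics(t_parallel, t_serial, p):
    """Overhead f, efficiency eps and speed-up from equation (2).
    t_parallel: T(n, P) on p cores; t_serial: T(Pn, 1) on one core."""
    f = (p * t_parallel - t_serial) / t_serial
    eps = 1.0 / (1.0 + f)
    return f, eps, eps * p

# e.g. a run where 8 cores take 105 s on a job one core finishes in 800 s
print(parallel_metrics(105.0, 800.0, 8))  # f = 0.05, eps ~ 0.952, speed-up ~ 7.6
```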
For the algorithm of eqn. (1), f(n,P) should depend on the grain size n where each core handles n data points and in fact f(n,P) should decrease proportionally to the reciprocal of the grain size with a coefficient that depends on synchronization costs [6, 21-23]. This effect is clearly seen in Fig. 1, which shows good speed-up on 8 cores of around 7.5 (f(n,P) ~ 0.05) for large problems. However we do not find f(n,P) going fully to zero as n increases. Rather it erratically wanders around a small number, 0.02 to 0.1, as parameters are varied. The overhead also decreases, as shown in Fig. 1, as the number of clusters increases. This is expected from (1) as the ratio of computation to memory access is proportional to the number of clusters. In Fig. 2 we plot the parallel overhead as a function of the number of clusters for two large real problems coming from Census data and chemical property clustering. These clearly show the rather random behavior after f(n,8) decreases to a small value corresponding to quite good parallelism – speedups of over 7 on 8 core systems. The results in Fig. 2(b) show lower asymptotic values which were determined to correspond to the binary data used in Chemistry clustering. This problem showed fluctuations similar in size to 2(a) if one used floating point representation for the Chemistry "fingerprint" data. Of course the binary choice shown in Fig. 2(b) is fastest and the appropriate approach to use. Looking at this performance in more detail we identified effects from memory bandwidth, fluctuations in thread run time and cache interference [24]. We present a summary of the first two here and present cache effects and details in [7, 8].

Fig. 1. Parallel Overhead for GIS 2D Clustering on Intel8b using C# with 8 threads (cores) and CCR synchronization. We use two values (10, 20) for the number of clusters and plot against the reciprocal of the number of data points.

Fig. 2. Parallel Overhead defined in (2) as a function of the number of clusters for (a) 2-dimensional GIS Census data for Indiana in over 200,000 blocks and (b) 40,000 PubChem compounds, each with 1052 binary chemical properties.
4 Memory Bandwidth In Fig. 3, we give typical results of a study of the impact of memory bandwidth in the different hardware and software configurations of Table 1. We isolate the kernel of the clustering algorithm of Sect. 2 and examine its performance as a function of grain size n, number of clusters and number of cores. We employ the scaled speed up strategy and measure thread dependence at three fixed values of grain size n (10,000, 50,000 and 500,000). All results are divided by the number of clusters, the grain size, and the number of cores and scaled so the 10,000 data point, one cluster, one core result becomes 1 and deviations from this value represent interesting performance effects. We display cases for 1 cluster where memory bandwidth effects could be important and also for 80 clusters where such effects are small as one performs 80 floating point steps on every variable fetched from memory. Although we studied C, C#, Windows and Linux, we only present Windows C# results in Fig. 3. C with Windows shows similar effects but of smaller magnitude while Linux shows small effects (the results for all n and cluster counts are near 1). Always we use threads not processes and C uses locks and C# uses CCR synchronization. Data is stored so as to avoid any cache line (false sharing) effects [8, 24]. The results for one cluster in Fig. 3(a) clearly show the effect of memory bandwidth with scaled run time increasing significantly as one increases the number of cores used. The performance improves in Fig. 3(b) (scaled runtime < 1) with more clusters when the memory demands are small. In this benchmark the memory demands scale directly with number of cores and inversely with number of clusters. A major concern with multicore system is the need for a memory bandwidth that increases linearly with the number of cores. In Fig. 3(a) we see a 50% increase in the run time for a grain size of 10,000 and 1 cluster. This is for
C# and Windows and the overhead is reduced to 22% for C on Windows and 13% for C on Linux. Further we note that one expects the 10,000 data point case to get excellent performance as the dataset can easily fit in cache and minimize memory bandwidth needs. However we see similar results whether or not the dataset fits into cache. This must be due to the complex memory structure leading to cache conflicts. We get excellent cache performance for the simple data structures of matrix multiplication. In all cases, we get small overheads for 80 clusters (and in fact for cluster counts greater than 4), which explains why the applications of Sect. 2 run well. There are no serious memory bandwidth issues in cases with several clusters and in this case that dominates the computation. This is usual parallel computing wisdom; real size problems run with good efficiency as long as there is plenty of computation [6, 21-23]. The data mining cases we are studying satisfy this and we expect them to run well on the multicore machines expected over the next 5 years.

Fig. 3. Scaled Run time on Intel8b using Vista and C# with CCR for synchronization on the Clustering Kernel for three dataset sizes with 10,000, 50,000 or 500,000 points per thread (core). Each measurement involved averaging over at least 1000 computations separated by synchronization whose small cost is not included in results.
5 Synchronization Performance The synchronization performance of CCR has been discussed in detail previously [9], where we showed that dynamic threading has an approximate 5 µs overhead. Here we expand the previous brief discussion of the rendezvous (MPI) style performance with Table 2, giving some comparisons between C, C# and Java for the MPI Exchange operation (defined in Sect. 2) running on the maximum number of cores (4 or 8) available on the systems of Table 1. Results for the older Intel8a are available online [16]. In these tests we use a zero size message. Note that the CCR Exchange operation timed in Table 2 has the full messaging transfer semantics of the MPI standards but avoids the complexity of some MPI capabilities like tags [25-27]. We expect that future simplified messaging systems that, like CCR, span from concurrent threads to collective rendezvous will choose such simpler implementations. Nevertheless we think that Table 2 is a fair comparison. Note that in the "Grains" column we list the number of concurrent activities and whether they are threads or processes. These measurements correspond to synchronizations occurring roughly every 30 µs and were averaged over 500,000 such synchronizations in a single run.
Table 2. MPI Exchange Latency

Machine   OS       Runtime              Grains     Latency (µs)
Intel8c   Redhat   MPJE                 8 Procs    181
          Redhat   MPICH2               8 Procs    40.0
          Redhat   MPICH2 Fast Option   8 Procs    39.3
          Redhat   Nemesis              8 Procs    4.21
Intel8c   Fedora   MPJE                 8 Procs    157
          Fedora   mpiJava              8 Procs    111
          Fedora   MPICH2               8 Procs    64.2
Intel8b   Vista    MPJE                 8 Procs    170
          Fedora   MPJE                 8 Procs    142
          Fedora   mpiJava              8 Procs    100
          Vista    CCR                  8 Thrds    20.2
AMD4      XP       MPJE                 4 Procs    185
          Redhat   MPJE                 4 Procs    152
          Redhat   mpiJava              4 Procs    99.4
          Redhat   MPICH2               4 Procs    39.3
          XP       CCR                  4 Thrds    16.3
Intel4    XP       CCR                  4 Thrds    25.8
The optimized Nemesis version of MPICH2 gives the best performance, while CCR, with for example a 20 µs latency on Intel8b, outperforms "vanilla MPICH2". We can expect CCR and C# to improve and compete in performance with systems like Nemesis that use the better optimized (older) languages. We were surprised by the uniformly poor performance of MPI with Java. Here the old mpiJava invokes MPICH2 from a Java-C binding while MPJ Express [27] is pure Java. It appears threads in Java currently are not competitive in performance with those in C#. Perhaps we need to revisit the goals of the old Java Grande activity [29]. As discussed earlier we expect managed code (Java and C#) to be of growing importance as client multicores proliferate, so good parallel multicore Java performance is important.
6 Performance Fluctuations We already noted in Sect. 3 that our performance was impacted by fluctuations in run time that were bigger than seen in most parallel computing studies, which typically look at Linux and processes whereas our results are mainly for Windows and threads. In Figs. 4 and 5 we present some results quantifying this using the same "clustering kernel" introduced in Sect. 4. We average results over 1000 synchronization points in a single run. In Figs. 4 and 5 we calculate the standard deviation of the 1000P measured thread runtimes obtained when P cores are used. Our results show much larger run time fluctuations for Windows than for Linux and we believe this effect leads to the 2-10% parallel overheads seen already in Fig. 2. These figures also show many of the same trends as earlier results. The smallest dataset (10,000), which should be contained in cache, has the largest fluctuations. C and Linux show lower fluctuations
than C# and Windows. Further, turning to Linux, Redhat outperforms Fedora (shown in [9]). C# in Fig. 4 has rather large (5% or above) fluctuations in all cases considered. Note our results with Linux are all obtained with threads and so are not directly comparable with traditional MPI Linux measurements that use processes. Processes are better isolated from each other in both cache and system effects and so it is possible that these fluctuations are quite unimportant in past scientific programming studies but significant in our case. Although these fluctuations are important in the limit of large grain size when other overheads are small, they are never a large effect and do not stop us getting excellent speedup on large problems.

Fig. 4. Ratio of Standard Deviation to mean of thread execution time averaged over 1000 instances using XP on Intel8a and C# with CCR for synchronization on the Clustering Kernel for three dataset sizes with 10,000, 50,000 or 500,000 points per thread (core).

Fig. 5. Ratio of Standard Deviation to mean of thread execution time using Redhat Linux on Intel8c and C with locks for synchronization on the Clustering Kernel for three dataset sizes with 10,000, 50,000 or 500,000 points per thread (core). Fedora shows larger effects than Redhat.
7 Generalization to Other Data Mining Algorithms The deterministic annealing clustering algorithm has exactly the same structure as other important data mining problems including dimensional scaling and Gaussian mixture models with the addition of deterministic annealing to mitigate the local minima that are a well known difficulty with these algorithms [17]. One can show [17] that one gets these different algorithms by different choices for Y(k), a(x), g(k), T and s(k) in (3). As in Sect. 2, X(x) are the data points to be modeled and F is the objective function to be minimized.
F = −T Σ_{x=1}^{N} a(x) ln [ Σ_{k=1}^{K} g(k) exp{−0.5 (X(x) − Y(k))^2 / (T s(k))} ] .     (3)
Thus we can immediately deduce that our results imply that scalable parallel performance can be achieved for all algorithms given by (3). Further it is interesting that the parallel kernels of these data mining algorithms are similar to those well studied by the high performance (scientific) computing community and need the synchronization primitives supported by MPI. The algorithms use the well established SPMD (Single Program Multiple Data) style with the same decomposition for multicore and distributed execution. However clusters and multicore systems use different implementations of collective operations at synchronization points. We expect this structure is more general than the studied algorithm set.
8 Conclusions Our results are very encouraging for both using C# and for getting good multicore performance on important applications. We have initial results that suggest a class of data mining applications run well on current multicore architectures with efficiencies on 8 cores of at least 95% for large realistic problems. We have looked in detail at overheads due to memory, run time fluctuation and synchronizations. Our results are reinforced in [8, 9] with a study of cache effects and further details of issues covered in this paper. Some overheads such as runtime fluctuations are surprisingly high in Windows/C# environments but further work is likely to address this problem by using lessons from Linux systems that show small effects. C# appears to have much better thread synchronization effects than Java and it seems interesting to investigate this.
References 1. Patterson, D.: The Landscape of Parallel Computing Research: A View from Berkeley 2.0 Presentation at Manycore Computing, Seattle, June 20 (2007) 2. Dongarra, J. (ed.): The Promise and Perils of the Coming Multicore Revolution and Its Impact, CTWatch Quarterly, February 2007, vol. 3(1) (2007), http://www.ctwatch.org/quarterly/archives/february-2007 3. Sutter, H.: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb’s Journal 30(3) (March 2005) 4. Annotated list of multicore sites: http://www.connotea.org/user/crmc/ 5. Dubey P.: Teraflops for the Masses: Killer Apps of Tomorrow Workshop on Edge Computing Using New Commodity Architectures, UNC (May 23, 2006), http://gamma.cs.unc.edu/EDGE/SLIDES/dubey.pdf 6. Fox G.: Parallel Computing 2007: Lessons for a Multicore Future from the Past Tutorial at Microsoft Research (February 26 to March 1 2007) 7. Qiu, X., Fox, G., Ho, A.: Analysis of Concurrency and Coordination Runtime CCR and DSS, Technical Report January 21 (2007) 8. Qiu, X., Fox, G., Yuan, H., Bae, S., Chrysanthakopoulos, G., Nielsen, H.: Performance Measurements of CCR and MPI on Multicore Systems Summary, September 23 (2007)
9. Qiu, X., Fox, G., Yuan, H., Bae, S., Chrysanthakopoulos, G., Nielsen, H.: High Performance Multi-Paradigm Messaging Runtime Integrating Grids and Multicore Systems. In: Proceedings of eScience 2007 Conference, Bangalore, India, December 10-13 (2007) 10. Gannon, D., Fox, G.: Workflow in Grid Systems Concurrency and Computation. Practice & Experience 18(10), 1009–1019 (2006) 11. Nielsen, H., Chrysanthakopoulos, G.: Decentralized Software Services Protocol – DSSP, http://msdn.microsoft.com/robotics/media/DSSP.pdf 12. Chrysanthakopoulos, G.: Concurrency Runtime: An Asynchronous Messaging Library for C# 2.0, Channel9 Wiki Microsoft, http://channel9.msdn.com/wiki/default.aspx/Channel9.ConcurrencyRuntime 13. Richter J.: Concurrent Affairs: Concurrent Affairs: Concurrency and Coordination Runtime, Microsoft, http://msdn.microsoft.com/msdnmag/issues/06/09/ ConcurrentAffairs/default.aspx 14. Microsoft Robotics Studio is a Windows-based environment that includes end-to-end Robotics Development Platform, lightweight service-oriented runtime, and a scalable and extensible platform, http://msdn.microsoft.com/robotics/ 15. Chrysanthakopoulos, G., Singh, S.: An Asynchronous Messaging Library for C#, Synchronization and Concurrency in Object-Oriented Languages (SCOOL) at OOPSLA Workshop, San Diego, CA (October 2005), http://urresearch.rochester.edu/handle/1802/2105 16. SALSA Multicore research Web site, http://www.infomall.org/salsa For Indiana University papers cited here, http://grids.ucs.indiana.edu/ptliupages/publications 17. Qiu X., Fox G., Yuan H., Bae S., Chrysanthakopoulos G., Nielsen H.: Parallel Clustering and Dimensional Scaling on Multicore Systems Technical Report (February 21 2008) 18. Downs, G., Barnard, J.: Clustering Methods and Their Uses in Computational Chemistry. Reviews in Computational Chemistry 18, 1–40 (2003) 19. K-means algorithm at, http://en.wikipedia.org/wiki/K-means_algorithm 20. Rose, K.: Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc IEEE 86, 2210–2239 (1998) 21. Dongarra, J., Foster, I., Fox, G., Gropp, W., Kennedy, K., Torczon, L., White, A. (eds.): The Sourcebook of Parallel Computing. Morgan Kaufmann, San Francisco (2002) 22. Fox, G., Johnson, M., Lyzenga, G., Otto, S., Salmon, J., Walker, D.: Solving Problems in Concurrent Processors, vol. 1. Prentice-Hall, Englewood Cliffs (1988) 23. Fox, G., Messina, P., Williams, R.: Parallel Computing Work! Morgan Kaufmann, San Mateo Ca (1994) 24. How to Align Data Structures on Cache Boundaries, Internet resource from Intel, http://www.intel.com/cd/ids/developer/asmona/eng/dc/threading/knowledgebase/43837.htm 25. Message passing Interface MPI Forum, http://www.mpi-forum.org/index.html 26. MPICH2 implementation of the Message-Passing Interface (MPI), http://wwwunix.mcs.anl.gov/mpi/mpich/ 27. Baker, M., Carpenter, B., Shafi, A.: MPJ Express: Towards Thread Safe Java HPC. In: IEEE International Conference on Cluster Computing (Cluster 2006), Barcelona, Spain, September 25-28 (2006), http://www.mpj-express.org/docs/papers/mpj-clust06.pdf 28. mpiJava Java interface to the standard MPI runtime including MPICH and LAM-MPI, http://www.hpjava.org/mpiJava.html 29. Java Grande, http://www.javagrande.org
Second Generation Quad-Core Intel Xeon Processors Bring 45 nm Technology and a New Level of Performance to HPC Applications

Pawel Gepner, David L. Fraser, and Michal F. Kowalik

Intel Corporation
{pawel.gepner,david.l.fraser,michal.f.kowalik}@intel.com
Abstract. The second generation of Quad-Core Intel Xeon processors was launched on November 12th 2007. In this paper we take a look at what the new 45 nm based Quad-Core Intel Xeon Processor brings to high performance computing. We compare an Intel Xeon 5300 series based system with a server utilizing its successor, the Intel Xeon 5400. We measure both CPU generations operating in dual socket platforms in a typical HPC benchmark scenario using some common HPC benchmarks. The results presented clearly show that the new Intel Xeon processor 5400 family provides a significant performance advantage on typical HPC workloads and would therefore be seen to be an appropriate choice for many HPC installations.
Keywords: HPC, multi-core processors, quad-core processors, parallel processing, benchmarks.
1 Introduction
Today multi-core processors are becoming a standard for high performance computing. The second generation Quad-Core Intel Xeon processor not only represents a shrink to 45 nm process technology but also brings a lot of new mechanisms which improve overall performance and power saving characteristics. The Intel Xeon 5400 is based on the same micro-architecture as the Intel Core Microarchitecture, including some extensions which actually raise the performance. This paper has been written by Intel employees, therefore competitive products were not taken into consideration. The Intel Xeon processor family contains the 3000, 5000 and 7000 series, where each of them is dedicated to different platforms and applications:
– Intel Xeon 3000 family is optimized for single-socket solutions;
– Intel Xeon 5000 family is optimized for dual-socket solutions;
– Intel Xeon 7000 family is optimized for multi-socket systems (4 + way).
All of them are based on the same microarchitecture principles and they are all used in HPC installations, where the Intel Xeon 5000 family is the most common, therefore the authors have focused on it.
Our strategy of building CPUs which do not drive performance via faster clock speeds but rather are based on an energy-efficient architecture has changed the landscape of high-performance computing (HPC) completely. Look at the 30th edition of the TOP500 list released (Nov. 12, 2007) at SC07, the international conference on high performance computing, networking, storage and analysis, in Reno (NV, US): we see 317 systems based on the Intel Core Microarchitecture and 102 of them are systems based on the Quad-Core Intel Xeon processor. True performance is a combination of both clock frequency and Instructions Per Clock (IPC). This shows that performance can be improved by increasing frequency and/or IPC. Frequency is a function of both the manufacturing process and the micro-architecture. Basically there are two micro-architecture approaches which somehow determine CPU design: more IPC or higher frequency. The first approach uses very few transistors, but the path from start to finish is very long; the second is based on a shorter path, but it uses many more transistors [1, 2]. From the manufacturing process perspective a key consideration is reducing the size of the transistors, which means reducing the distances between the transistors and reducing transistor switching times. These two things contribute significantly to faster processor clock frequencies. Unfortunately, as processor frequencies rise, the total heat produced by the processor scales with them. Reducing the transistor size allows the chip to produce less heat, because smaller transistors can operate on lower voltages; this does not, however, solve all of the issues and in fact generates another problem with electrons. Based on quantum mechanics principles, small elements such as electrons are able to spontaneously flow over short distances. The emitter and transistor base are now so close together that a considerable number of electrons can escape from one to the other; this effect is called leakage. Reducing operating voltages also effectively reduces the available voltage swing, and if the difference between logic 1 and logic 0 becomes too small the transistor will not operate properly. In addition we need to deal with the decreased transistor size: as the leakage current increases we require more complicated process technology. In conclusion, this complicated situation, where the number of transistors per unit area needs to increase but the operating frequency must go down, will likely increase the number of cores required and decrease (or not increase so quickly) the clock frequency [3, 4].
2 Processor Microarchitecture
The Architecture typically refers to the high level description of the Instruction Set Architecture (ISA). The Architecture generally defines an instruction set for software to write to and, in turn, the software could then be run on all processor implementations of that particular architecture.
The Microarchitecture defines a specific means of implementing compatible hardware that supports the higher level architecture. New micro-architectures typically define improvements that ultimately increase the user benefits when running software that is compatible with the high-level architecture. Microarchitecture is enhanced with each processor generation, delivering improvements in performance, energy efficiency, and capabilities while still maintaining application-level compatibility. Microarchitecture refers to the implementation of the ISA in silicon, including cache memory design, execution units, and pipelining. In fact, benefits from many microarchitecture enhancements can be achieved without any modification or recompilation of code. The 45 nm next generation Intel Quad Core Xeon processor family (Harpertown) is the next instance of Intel processors based on Intel 45 nm transistor technology, a new transistor breakthrough that allows for processors with nearly twice the transistor density and drastically reduced electrical leakage. The new Intel Quad Core Xeon includes new instructions and microarchitecture enhancements that will deliver superior performance and energy efficiency while maintaining compatibility with already existing applications. Microarchitecture enhancements in the Intel Quad Core Xeon 5400 processor family include:
– New set of instructions – Intel SSE4
– 50% larger L2 Cache
– Super Shuffle Engine and Fast Radix-16 Divider
– Enhanced Cache Line Split Load
– Deep Power Down Technology
– Enhanced Intel Dynamic Acceleration Technology
Intel SSE4 is a set of new instructions designed to improve the performance and energy efficiency of a broad range of applications. Intel SSE4 builds upon the Intel 64 Instruction Set Architecture (ISA), the most popular and broadly used computer architecture for developing 32-bit and 64-bit applications. Intel SSE4 consists of 54 instructions divided into two major categories: Vectorizing Compiler/Media Accelerators and Efficient Accelerated String/Text Processing. The new Intel Quad Core Xeon currently supports 47 of the Intel SSE4 instructions including the Vectorizing Compiler and Media Accelerator instructions. The remaining instructions will be available in future generations of Intel processors. Software will be able to use Vectorizing Compiler and Media Accelerators to provide high performance compiler primitives, such as packed (using multiple operands at the same time) integer and floating point operations, that allow for performance optimized code generation. It also includes highly optimized mediarelated operations such as sum absolute difference, floating point dot products, and memory loads. The Vectorizing Compiler and Media Accelerator instructions should improve the performance of audio, video, and image editing applications, video encoders, 3-D applications, and games. 50% larger L2 Cache (up to 12 MB in Quad Core implementation): Reduces the latencies for accessing instructions and data, improving application
performance (especially those that work on large data sets). 24 Way Set Associativity improves data access versus the previous generation Intel Xeon (16 Way Set Associativity). Improved Store Forwarding maximizes data balance cache to memory. Super Shuffle Engine and Fast Radix-16 Divider: 3X faster shuffles and 1.6X-2X faster divides. The Super Shuffle Engine will greatly improve the performance of Intel SSE4 and Supplemental Streaming SIMD Extensions 3 (SSSE3) instructions. The new super shuffle engine performs a 128 bit operation in a single cycle and does not require any software changes. Enhanced Cache Line Split Load: Greatly improved performance on unaligned loads (those that span across cache boundaries) and optimized store and load operations. A new 16 byte aligned load instruction on WC (write combining) memory improves read bandwidth from WC memory by reading cache-line size quantities. This Streaming Load routine gives 8X faster reading from WC memory and improves the performance of memory-intensive applications. Deep Power Down Technology: A new power state that dramatically reduces processor power consumption. This is an ideal solution for developing energy efficient applications. Enhanced Intel Dynamic Acceleration Technology: Improves energy efficiency by dynamically increasing the performance of active cores when not all cores are utilized. Conceptually it uses the power headroom of the idle cores to boost the performance of the non-idle core. When one core enters an idle power C-state (CC3 or deeper) and the OS requests a higher performance state on the running core, the non-idle core is boosted up to a higher voltage and higher frequency (EDAT frequency); however, the overall chip power envelope still remains within the specified Thermal Design Power (TDP). How all of these innovations and changes in the microarchitecture are reflected in overall system performance and accelerate high performance computing is described below. In the testing environment we have been testing single system performance on typical HPC workloads.
3
Processor Performance
In this section we have focused on processor performance to compare two generations of the Quad-Core Intel Xeon processors. A popular benchmark well-suited for parallel, core-limited workloads is the Linpack HPL benchmark. Linpack is a floating-point benchmark that solves a dense system of linear equations in parallel. The metric produced is Giga-FLOPS or billions of floating point operations per second. Linpack performs operations called LU factorization. They are highly parallel and store most of their working data set in processor cache [6]. The processor operations it performs are predominantly 64-bit floating-point vector operations and use SSE instructions. This benchmark is used to determine the world's fastest computers, published at the website [7].
In both cases each processor core has 3 functional units that are capable of generating 128-bit results per clock. In this case we may assume that a single processor core does two 64 bit floating-point ADD instructions and two 64 bit floating-point MUL instructions per clock. The theoretical performance, calculated as the number of ADD and MUL operations executed in each clock multiplied by the frequency, is the same in both cases. Both implementations are based on the same microarchitecture. For the Quad-Core Intel Xeon processor X5365 this gives 3 GHz x 4 operations per clock x 4 cores = 48 GFLOPS. Exactly the same theoretical performance holds for the new Quad-Core Intel Xeon processor E5472: 3 GHz x 4 operations per clock x 4 cores = 48 GFLOPS. This is theoretical performance only; it is interesting to observe how the 50% bigger cache and 20% faster Front Side Bus (FSB) of the Intel Xeon processor E5472 benefit Linpack, and to see whether this is a good benchmark for CPU performance. In all performance tests we used systems configured as follows, with one exception for the Stream benchmark. Configuration details Quad-Core Intel Xeon processor X5365 based platform details: Intel preproduction server platform with two Quad-Core Intel Xeon processors X5365 3.00 GHz, 2x4 MB L2 cache, 1333 MHz FSB, 16 GB memory (8x2 GB FBDIMM 667 MHz), RedHat Enterprise Linux Server 5 Kernel 2.6.18-8.el5 on x86-64, Intel C/Fortran Compiler. Workload: Intel Optimized SMP LINPACK Benchmark 10 for Linux / LMBENCH 3.0 / Amber, Eclipse, Fluent, Gamess, Gromacs, Gaussian, LS-DYNA, Monte Carlo, PamCrash, Star-CD. Quad-Core Intel Xeon processor E5472 based platform details: Intel preproduction server platform with two Quad-Core Intel Xeon processors E5472 3.00 GHz, 2x6 MB L2 cache, 1600 MHz FSB, 16 GB memory (8x2 GB FBDIMM 800 MHz), RedHat Enterprise Linux Server 5 Kernel 2.6.18-8.el5 on x86-64, Intel C/Fortran Compiler. Workload: Intel Optimized SMP LINPACK Benchmark 10 for Linux / LMBENCH 3.0 / Amber, Eclipse, Fluent, Gamess, Gromacs, Gaussian, LS-DYNA, Monte Carlo, PamCrash, Star-CD. Using LINPACK HPL we see (Fig. 1) a 5% performance improvement between the system based on the Quad-Core Intel Xeon processor X5365 and the Quad-Core Intel Xeon processor E5472. It indicates that, as we expected based on the theoretical performance, we will not see a big performance improvement on CPU intensive tasks. The new Quad-Core Intel Xeon processor E5472 and its bigger cache and 1600 MHz FSB do not play an important role in Linpack scenarios as it is not a memory intensive benchmark.
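The peak figure quoted above is just the product of clock frequency, floating-point operations per clock and core count. A minimal C sketch of this back-of-the-envelope calculation follows; the numbers are the illustrative X5365/E5472 values from the text, not measured results:

#include <stdio.h>

int main(void)
{
    /* theoretical peak = frequency x FP operations per clock x cores */
    const double freq_ghz      = 3.0; /* core clock in GHz               */
    const int    ops_per_clock = 4;   /* 2 x 64-bit ADD + 2 x 64-bit MUL */
    const int    cores         = 4;   /* cores per processor             */

    printf("Theoretical peak: %.1f GFLOPS per processor\n",
           freq_ghz * ops_per_clock * cores);
    return 0;
}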
4
Memory Performance
In this section we illustrate the memory performance of the two generations of Quad-Core Intel Xeon processors. Memory performance is a combination of two elements: latency and throughput. Each is appropriate to a different workload, and each tells a different story. Latency measures how long it takes to chase a chain of pointers through memory.
Fig. 1. LINPACK: Dense Floating-Point Operations
Only a single chain is tracked at a time. Each chain stores only one pointer in a cache line and each cache line is randomly selected from a pool of memory. The pool of memory simulates the working environment of an application. When the memory pool is small enough to be placed inside cache, the benchmark measures the latency required to fetch data from cache. By changing the size of the memory pool we can measure the latency of any specific level of cache, or of main memory by making the pool bigger than all levels of cache. We measured latency using a 3.0 GHz Quad-Core Intel Xeon processor X5365 and a 3.0 GHz Quad-Core Intel Xeon processor E5472. The results of this experiment are shown in Fig. 2. Based on the different FSB as well as the bigger L2 cache, the Quad-Core Intel Xeon processor E5472 should have slightly lower latencies to caches, while the memory latencies compared to its predecessor should be more noticeably reduced, and from the results we can see that this is the case. The bigger L2 cache and faster 1600 MHz FSB benefit L2 cache latency as well as reduce the access time to main memory. All those elements, in conjunction with the new Intel 5400 chipset enabling the 1600 MHz FSB, help random memory access. The second important characteristic of memory performance is the throughput for sequential memory accesses. The benchmark we have used to measure throughput is the Stream benchmark. The Stream benchmark is a synthetic benchmark program, written in standard Fortran 77. It measures both memory reads and memory writes (in contrast to the standard usage of bcopy). It measures the performance of four long vector operations. These operations are: COPY: a(i) = b(i); SCALE: a(i) = q*b(i); SUM: a(i) = b(i) + c(i); TRIAD: a(i) = b(i) + q*c(i). These operations are representative of long vector operations and the array sizes are defined in such a way that each array is larger than the cache of the processors that are going to be tested. This gives us an indication of how effective the memory subsystem is in our implementation, excluding caches.
Fig. 2. Memory latency
As Fig. 3 shows, we see a huge memory bandwidth improvement with the new Quad-Core Intel Xeon processor E5472, mainly due to the 20% faster FSB (1600 MHz) as well as the improved functionality of the Intel 5400 chipset ("Seaburg") capable of operating at a 1600 MHz FSB. This 32% throughput improvement versus the older generation Quad Core Xeon based system will be reflected in all memory intensive applications, not only in Stream but also in other HPC data intensive workloads. Configuration details Quad-Core Intel Xeon processor X5355 based platform details: Intel preproduction server platform with two Quad-Core Intel Xeon processors X5355 2.66 GHz, 2x4 MB L2 cache, 1333 MHz FSB, 16 GB memory (8x2 GB FBDIMM 667 MHz), RedHat Enterprise Linux Server 5 Kernel 2.6.18-8.el5 on x86-64, Intel C/Fortran Compiler. Workload: Stream Benchmark. Quad-Core Intel Xeon processor E5472 based platform details: Intel preproduction server platform with two Quad-Core Intel Xeon processors E5472 3.00 GHz, 2x6 MB L2 cache, 1600 MHz FSB, 16 GB memory (8x2 GB FBDIMM 800 MHz), RedHat Enterprise Linux Server 5 Kernel 2.6.18-8.el5 on x86-64, Intel C/Fortran Compiler. Workload: Stream Benchmark.
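For reference, the four Stream kernels listed above translate directly into simple loops. The sketch below shows only the kernels themselves (the original benchmark is written in Fortran 77; a C rendering is used here purely for illustration, the array size N is a placeholder that must exceed the last-level cache, and the timing and bandwidth reporting of the real benchmark are omitted):

#include <stddef.h>

#define N 20000000              /* placeholder; must exceed the L2 cache size */
static double a[N], b[N], c[N];

void stream_kernels(double q)
{
    size_t i;
    for (i = 0; i < N; i++) a[i] = b[i];            /* COPY  */
    for (i = 0; i < N; i++) a[i] = q * b[i];        /* SCALE */
    for (i = 0; i < N; i++) a[i] = b[i] + c[i];     /* SUM   */
    for (i = 0; i < N; i++) a[i] = b[i] + q * c[i]; /* TRIAD */
}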
5
Application Performance
Linpack and Stream are synthetic benchmarks: they measure the performance of specific subsystems and do not deliver the full picture of system capability. Typical HPC applications use much more than a single subsystem and their nature is much more sophisticated. So, to get a better understanding of how the new Quad-Core Intel Xeon processor E5472 based platform benefits real applications, we have selected a couple of real code examples. These applications
Fig. 3. Stream benchmark – memory throughput
and benchmarks represent a broad spectrum of HPC workloads and seem to be a typical representation of a testing suite for this class of calculation. Amber is a package of molecular simulation programs. The workload measures the number of problems solved per day (PS) using eight standard molecular dynamic simulations. See [8] for more information. Eclipse from Schlumberger – reservoir simulation software for structure, geology, fluids and development scheme. Fluent is a commercial engineering application used to model computational fluid dynamics. The benchmark consists of 9 standard workloads organized into small, medium and large models. These comparisons use all but the largest of the models which do not fit into the 8 GB of memory available on the platforms. The Rating, the default Fluent metric, was used in calculating the ratio of the platforms by taking a geometric mean of the 8 workload ratings measured. GAMESS from Iowa State University, Inc. – a quantum chemistry program widely used to calculate energies, geometries, frequencies and properties of molecular systems. Gaussian from Gaussian, Inc. – a quantum chemistry program widely used to calculate energies, geometries, frequencies and properties of molecular systems. GROMACS (Groningen Machine for Chemical Simulations) from Groningen University is molecular dynamics program widely used to calculate energies, geometries, frequencies and properties of molecular systems. LS-DYNA is a commercial engineering application used in finite element analysis such as a car collision. The workload used in these comparisons is called 3 Vehicle Collision and is publicly available from [9]. The metric for the benchmark is elapsed time in seconds. Monte Carlo from Oxford Center for Computational Finance (OCCF) – financial simulation engine using Monte Carlo technique [10].
Fig. 4. Platform comparison across HPC selected workloads
PamCrash from ESI Group – an explicit finite-element program well suited for crash simulations. Star-CD is a suite of test cases, selected to demonstrate the versatility and robustness of STAR-CD in computational fluid dynamic solutions. The metric produced is elapsed seconds converted to jobs per day. For more information go to [11]. All these selected workloads have been tested on two dual socket HPC optimized platforms. The Quad-Core Intel Xeon processor X5365 based platform has been used as the baseline to illustrate the improvement (Fig. 4) that the new platform is going to bring in different workloads in a typical HPC scenario. As we can see, the Quad-Core Intel Xeon processor E5472 based platform shows a significant performance improvement of up to 37%. The 12 MB L2 cache and new 1600 MHz FSB help especially in data intensive applications and the workloads where data movement plays an important role. If the task is more CPU intensive the difference is around 10-15%.
6
Conclusion
The new Quad-Core Intel Xeon processors bring 45 nm technology to HPC with all the benefits of superior thermal characteristics and substantial production capability. Following the path of their predecessors, these new products continue to demonstrate performance leadership. From the theoretical performance point of view both generations of Quad-Core Intel Xeon families deliver the same theoretical performance peak, but we have seen that in a real life scenario the new architecture extensions bring a lot of performance improvement; even typically CPU intensive tasks show a better performance in the range of 7-15%. In the future this can be extended when the new set of SSE4 instructions becomes even more widely used and starts providing additional headroom for performance improvement. The Vectorizing Compiler instructions should improve the performance of all those HPC applications which use multiple operands at the same time. These are the areas where the new Quad-Core Intel Xeon processors
bring the biggest improvement – the data intensive workloads. The 50% bigger L2 cache as well as the 1600 MHz FSB in conjunction with the Intel 5400 chipset make the new Quad-Core Intel Xeon processor based platform up to 37% more effective compared to the old one – whilst operating at the same processor frequency. We see a significant performance advantage, ranging from 20-37%, and in addition observe a significant improvement in performance per watt, as the platform stays in the same power envelope. All of this drives significant improvements in user experience for the HPC environment and makes the new platform a compelling choice for many of the new HPC installations.
References 1. Gepner, P., Kowalik, M.F.: Multi-Core Processors: New Way to Achieve High System Performance. In: PARELEC 2006, pp. 9–13 (2006) 2. Smith, J.E., Sohi, G.S.: The Microarchitecture of superscalar processors. Proc. IEEE 83, 1609–1624 (1995) 3. Ronen, R., Mendelson, A., Lai, K., Lu, S.-L., Pollack, F., Shen, J.P.: Coming challenges in microarchitecture and architecture. Proc. IEEE 89, 325–340 (2001) 4. Moshovos, A., Sohi, G.S.: Microarchitectural innovations: Boosting microprocessor performance beyond semiconductor technology scaling. Proc. IEEE 89, 1560–1575 (2001) 5. Ramanathan, R.M.: Intel Multi-Core Processors: Leading the Next Digital Revolution, Technology @ Intel Magazine (2005), http://www.intel.com/technology/magazine/computing/multi-core-0905.pdf 6. Dongarra, J., Luszczek, P., Petitet, A.: Linpack Benchmark: Past, Present, and Future, http://www.cs.utk.edu/∼ luszczek/pubs/hplpaper.pdf 7. http://www.top500.org/ 8. http://amber.ch.ic.ac.uk/amber8.bench1.html 9. http://www.topcrunch.org/ 10. http://www.occf.ox.ac.uk/ 11. http://www.cd-adapco.com/products/STAR-CD/
Heuristics Core Mapping in On-Chip Networks for Parallel Stream-Based Applications Piotr Dziurzanski and Tomasz Maka Szczecin University of Technology, ul. Zolnierska 49, 71-210 Szczecin, Poland {pdziurzanski,tmaka}@wi.ps.pl
Abstract. A novel approach for the implementation of multimedia streaming applications into mesh Network on Chip structures is proposed. We provide a new multi-path routing algorithm together with heuristic algorithms for core mapping so as to minimize the total transfer in the hardware implementation. The proposed approach has been tested with two popular stream-based video decompression algorithms. Experimental results confirming the proposed approach are provided. Keywords: Multimedia streaming applications, On-Chip routing algorithm, IP core mapping, Multi-path routing.
1
Introduction
Computation-intensive multimedia applications are especially well suited for parallel and distributed processing due to their data-dominated algorithms that can be split into a number of stages. These stages can be implemented in separate computational units working in a pipeline-like way and transmitting to each other streams of relatively large, but usually fixed, amounts of data. Some widely-known examples of algorithms of that type are, e.g., MPEG-4, DAB, DVB, and many others. In these applications, it is usually required to keep an assumed quality level of service and meet real-time constraints [7]. Multi Processor Systems on Chips (MPSoCs) are often considered as suitable hardware implementations of these applications [3]. As each processing unit of an MPSoC can realize a single stage of streaming application processing, it is still problematic to connect these units together. The simplest point to point (P2P) connections require too much space, whereas bus-based connections result in a large number of conflicts and, consequently, despite various arbitration techniques, decrease the overall performance of the whole system [4]. Besides, both P2P and bus-based realizations do not scale well with the constantly increasing number of independent Intellectual Property (IP) cores (i.e., computational units) required by contemporary devices dealing with a number of various algorithms in a single system [1]. In order to overcome these obstacles, the packet-based Network-on-Chip (NoC) paradigm for the design of chips realizing distributed computation has been introduced [2]. The recent popularity of this approach can be attributed to a
lower number of conflicts in a chip with a large number of cores. It is reported that NoC architectures offer high bandwidth and good concurrent communication capability, but they require additional mechanisms to overcome problems typical for packet switching communication, such as packet deadlock or starvation, and the techniques known for traditional computer networks have to be altered before being applied to on-chip networks [6]. A mesh is one of the most often used on-chip network topologies owing to its regularity and reliability due to the existence of many redundant interconnections between nodes. In NoCs, each mesh node is comprised of the IP core realizing a particular stage of the algorithm and a router which is typically connected to four neighboring nodes. A typical NoC implementation utilizes a packet switching approach called wormhole routing [5]. In this technique, each packet is split into smaller units of equal length, flits (flow control units). Usually, the first flits contain some routing information, such as the destination address. Having obtained the routing information, a wormhole router selects the next-hop router and establishes the path to that neighboring router. This path is used exclusively for transferring the current package flit by flit until the whole package has been transferred. The next-hop router typically does not store the whole package in its buffers, but tries to establish a connection with another router being selected for the transfer. If another package is to be sent through a connection already used for transferring a package, its transfer is deferred until all flits of the previous package have been sent. This situation is known as contention and may result in a significant decrease in efficiency. Contentions are especially likely to be observed in various data-intensive applications, where large streams of data are transferred every second. This may result in violating the real-time constraints and thus making an MPSoC designed in this way inappropriate for the task. The most popular routing algorithm used in NoCs, named XY, can also be viewed as inappropriate for switching large streams of information. According to this algorithm, a flit is first routed along the X axis as long as the X coordinate is not equal to the X coordinate of the destination core, and then the flit is routed vertically. Despite being deadlock-free [5], this algorithm is not adaptive and thus is not equipped with a mechanism for decreasing the contention level. Taking into consideration the above mentioned facts, it follows that in order to design a NoC-based MPSoC for multimedia streaming applications it is necessary to (i) propose a routing algorithm that is more suitable for this task than the traditional XY algorithm and (ii) propose a mapping scheme of IP cores into mesh nodes that decreases the contention level.
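As a point of reference for the discussion above, the XY routing decision itself fits in a few lines of C; this is only a sketch of the rule, with a coordinate convention and port names that are illustrative assumptions rather than any particular router implementation:

typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Deterministic XY routing: move along the X axis until the column matches
   the destination, then along the Y axis until the destination row is reached. */
port_t xy_next_hop(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (cur_x < dst_x) return PORT_EAST;
    if (cur_x > dst_x) return PORT_WEST;
    if (cur_y < dst_y) return PORT_NORTH;
    if (cur_y > dst_y) return PORT_SOUTH;
    return PORT_LOCAL;  /* the flit has arrived at its destination core */
}

Because the decision depends only on the current and destination coordinates, the rule cannot react to congestion, which is exactly the limitation addressed in the following sections.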
2
Proposed Design Flow
The first stage of the proposed approach is to construct a flow network for a given stream-based algorithm. The processing blocks of the algorithm are identified and the transfers between them are computed. We describe all transfers in a so-called transfer table.
Table 1. Transfers table for H.264 decoder

Source  Destination  Transfer [bps]
1       2            11744051
2       3            503316480
3       4            503316480
4       5            788529152
5       8            788529152
1       5            360710144
1       6            37748736
6       4            11744051
1       7            251658240
7       4            1560281088
8       7            2348810240
An example of the transfer table for the H.264 decoder is presented in Table 1. The numbers in the Source and Destination columns represent indices of the nodes in the flow network. Then, we have to determine the mapping of the cores into the mesh structure leading to improved performance of the NoC structure. The impact of the mapping on the final implementation properties is very significant in the case of the traditional wormhole XY routing approach. For the example of the MPEG-4 decoder [9], the difference in required capacity between the best and the worst mappings is about 203.04 per cent. For example, the XY algorithm applied to the H.264 video decoder [8] for core permutation 0-2-3, 7-8-4, 1-6-5 in the first, second, and third row, respectively, leads to the transfers presented in Fig. 1. In this situation, the maximal transfer between adjacent cores is relatively high, being equal to 2240 Mbit/s. This means that every second such an amount of data is to be transferred between cores 8 and 7, so the NoC infrastructure has to offer capacities large enough to cope with this transfer.
Fig. 1. Transfers between cores for the H.264 decoder, XY routing algorithm (in Mbit/s)
Assuming the most popular regular NoC mesh architecture, all the links have to have equal capacities, so all of them have to be capable of transferring 2240 Mbit/s. However, the majority of the remaining links are utilized at a small percentage of this maximal value. This may be expressed with the standard deviation value, which is equal to about 598.36 Mbit/s. Thus, we assume that the standard deviation expresses the transfer balancing level: the smaller the standard deviation, the closer the transfers are to each other. Moreover, only 13 links out of 24 are utilized, which results in unbalanced transfers and poor utilization of the available resources. This is our main motivation to introduce a routing scheme we named tapeworm routing. The efficiency of this routing algorithm depends on the structure of cores and their connections to each other, which is defined by a mapping of the flow network nodes' functionality onto NoC cores. Having selected an appropriate mapping, it is important to balance transfers between each path in the NoC structure. Our mapping algorithms need as input a complete list of data transfers in the network flow built at the previous stage. The proposed technique takes advantage of the well-known Ford-Fulkerson method for determining the maximal throughput of the network between a set of cores. An example of three successive steps of the tapeworm algorithm is presented in Fig. 2. In this figure, the numbers written above links denote a flow and the remaining available capacity. The data length to be sent between cores S and D is equal to 70 bits. At the first stage, 30 bits of the link between routers 2 and 3 have already been allocated. As the available capacity between routers 1 and 2 (i.e., the link selected by the XY rule) is only 50 bits, 50 bits are sent over this link, and the remaining 20 bits are sent by the alternative route to the 4th router. In router 2, the package is further segmented: 20 bits follow the path to router 3, whereas 30 bits are sent to router 5. Thus, the total data length sent between routers 5 and 6 is equal to 50 bits. A few algorithms for heuristic core mapping into a NoC structure are provided in the following section.
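Before moving on to the mapping heuristics, the splitting step illustrated by the 70-bit example above can be sketched in C as follows; the data types and the capacity bookkeeping are simplified assumptions made for this example, not the authors' implementation:

typedef struct {
    long capacity;   /* total link capacity (e.g., in bits)        */
    long allocated;  /* capacity already reserved by earlier flows */
} link_t;

/* Split 'amount' between the link preferred by the XY rule and an alternative
   link: the preferred link receives as much as its free capacity allows, the
   remainder is diverted, as in the 50 + 20 bit split of the example above. */
void tapeworm_split(link_t *preferred, link_t *alternative, long amount,
                    long *sent_pref, long *sent_alt)
{
    long free_pref = preferred->capacity - preferred->allocated;
    if (free_pref < 0) free_pref = 0;

    *sent_pref = (amount <= free_pref) ? amount : free_pref;
    *sent_alt  = amount - *sent_pref;

    preferred->allocated   += *sent_pref;
    alternative->allocated += *sent_alt;
}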
3
On-Chip Cores Mapping Heuristics
In order to determine the appropriate mapping of cores into NoC nodes, i.e., the mapping leading to the minimal value of transfers between cores while using the tapeworm algorithm, it is possible to use an exact algorithm. However, its application is only reasonable for NoC mesh sizes smaller than 4x4 cores due to its immense computational complexity O(n! · n²), where n is the number of cores. For larger n, however, it is possible to use heuristic algorithms that do not significantly worsen the final result, as shown in the sequel of this paper. Below, we propose three heuristics that, according to experimental results (presented in Section 4), lead to results close to the exact approach. We start our algorithm by generating a population of random core permutations to be mapped into the NoC structure. Then, we execute one of the heuristics provided below (Fig. 3-5) for every permutation a number of times.
Fig. 2. Successive steps of the tapeworm algorithm
1. src ← select randomly a core c1
2. do
3.    nDir ← random direction (Left, Right, Up, Down)
4. while (move toward nDir direction is possible)
5. dest ← select the adjacent core of c1, c2, in the nDir direction
6. if (the exchange between src and dest cores decreases the total transfer)
7.    swap(c1, c2)

Fig. 3. Pseudo-code of heuristics 1 for cores mapping
1. select randomly an item from NoC transfers table between src and dest
2. do
3.    nDir ← random direction (Left, Right, Up, Down)
4. while (move dest toward nDir direction is possible)
5. dest ← select the adjacent core of dest in the nDir direction
6. if (the exchange between src and dest cores decreases the total transfer)
7.    swap(src, dest)

Fig. 4. Pseudo-code of heuristics 2 for cores mapping
1. select randomly an item from NoC transfers table between src and dest
2. d0 ← calculate Manhattan distance between src and dest cores
4. (d1, d2, d3, d4) ← calculate Manhattan distance between src and all neighbors of dest
5. if (min(d0, d1, d2, d3, d4) ≠ d0)
6.    nmin ← neighbor with the lowest distance
7.    swap(dest, nmin)

Fig. 5. Pseudo-code of heuristics 3 for cores mapping
We decided to terminate when a maximum number of generations has been produced, or when the obtained transfers do not decrease for a specific number of steps. As we tested our approach on multimedia applications split into a relatively low number of cores (9), we could compare the obtained results with the exact solutions.
In the first heuristic (Fig. 3), we randomly select two neighboring cores, and then compute the total number of bits transmitted in a second for two cases: (i) for the existing core mapping and (ii) for the one obtained after the exchange of the two selected cores. If the latter is characterized by a lower transfer, the exchange is performed. In the second approach (Fig. 4), we randomly select a single transfer from the transfer table of the algorithm to be mapped. Next, a direction from the set {left, right, up, down} is selected, and a new mapping is formed in which the destination core is exchanged with the core situated directly on the selected side. Similarly to the previous heuristic, the total transfers before and after the exchange are computed and, if the second one is lower, the exchange is carried out. In the third heuristic (Fig. 5), we also select a single transfer from the transfer table. Then, we calculate the Manhattan distance between the source and the destination cores of the selected transfer. The Manhattan distances between the source and all the neighbors of the destination core are also computed and, if the distance is lower for any of the neighbors, that neighbor is exchanged with the destination core. In case more than one neighbor is characterized by a lower distance, the one with the lowest distance is chosen. It is worth stressing that in the case of this approach there is no need to compute the total transfer in the NoC, which is time-saving, as presented in the next section.
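As an illustration, one step of the third heuristic can be written in C as below; the mesh dimensions, the row-by-row position numbering and the swap on a position array are simplifying assumptions made for this sketch, not the authors' code:

#include <stdlib.h>

#define MESH_W 3   /* illustrative 3x3 mesh of 9 cores */
#define MESH_H 3

static int manhattan(int a, int b)       /* positions numbered row by row */
{
    return abs(a % MESH_W - b % MESH_W) + abs(a / MESH_W - b / MESH_W);
}

/* One step of heuristic 3: 'src' and 'dest' are the mesh positions currently
   holding the selected transfer's source and destination cores.  If a
   neighbour of dest lies closer to src (Manhattan distance), swap the cores
   mapped to dest and to that neighbour. */
void heuristic3_step(int mapping[MESH_W * MESH_H], int src, int dest)
{
    static const int dx[4] = { -1, 1, 0, 0 };
    static const int dy[4] = { 0, 0, -1, 1 };
    int best = dest, best_d = manhattan(src, dest);

    for (int k = 0; k < 4; k++) {
        int nx = dest % MESH_W + dx[k], ny = dest / MESH_W + dy[k];
        if (nx < 0 || nx >= MESH_W || ny < 0 || ny >= MESH_H) continue;
        int n = ny * MESH_W + nx;
        int d = manhattan(src, n);
        if (d < best_d) { best_d = d; best = n; }
    }
    if (best != dest) {                   /* exchange the two mapped cores */
        int tmp = mapping[dest];
        mapping[dest] = mapping[best];
        mapping[best] = tmp;
    }
}

Note that, consistently with the remark above, this variant never evaluates the total transfer in the NoC.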
4
Experimental Results
In order to verify the approach provided above, we have chosen the MPEG-4 and H.264 decoders and decided to execute all the heuristics described in the previous section to determine the permutation mapping that is characterized by the lowest total transfer in the NoC realization. For the selected decoders, we have built flow networks and determined the amount of data transmitted between their nodes every second. We executed an implementation of the proposed approaches 2000 times and concluded that in every case a local minimum has been found at a relatively early iteration (the solution was the same as the one obtained with the exact algorithm). For the first two approaches, the local minimum has been found no later than in the 367th iteration, whereas for the 3rd approach it was no later than the 1684th iteration. However, the vast majority of the found local minima have been found much earlier, as depicted by the histograms presented in Fig. 6 and Fig. 7 for H.264 and MPEG-4, respectively.

Table 2. Comparison of the mean transfers and standard deviations obtained with the proposed approaches (in Mbit/s)

Algorithm   Approach 1 (Mean / StdDev)   Approach 2 (Mean / StdDev)   Approach 3 (Mean / StdDev)
MPEG-4      1734.74 / 9.03               1734.63 / 9.1                1726.75 / 1.96
H.264       828.32 / 2.79                828.39 / 2.64                829.21 / 2.64
Fig. 6. Minima histogram for H.264 (number of minima vs. iterations, for approaches 1-3)
Fig. 7. Minima histogram for MPEG-4 (number of minima vs. iterations, for approaches 1-3)
Only a few instances have been found in iterations later than 300. In Tab. 2, we have presented the average transfers and standard deviations for a set of 1000 algorithm executions. All three approaches resulted in values close to the minimum.
For H.264 (MPEG-4), we obtained the global minimum in 59.55, 54.9, and 26.7 (47.15, 48.4, and 96.85) per cent of cases for the first, second, and third approach, respectively. It is important to stress that in the first two approaches the total transfer has to be determined a number of times in each iteration. We have measured that for 200 iterations the transfer has to be determined 60100 times. As our tapeworm routing runs for 8.224 seconds on average (Intel Pentium 4 CPU 3 GHz, 1 GB RAM memory), the running time for the first two approaches is less than 6 days. On the other hand, the third approach produces results in 0.5 s. Although the first time seems huge, especially in comparison with the second one, it is still much less than that of the exact algorithm, which is practically unacceptable even for 4x4 cores (a few million years of computation). The execution time of the first two approaches depends only polynomially on the size of the mesh.
5
Conclusion
We described our approach for implementing data-intensive streaming multimedia applications in a Network on Chip based on the mesh topology. We focused on two phases: mapping of algorithm stages onto the target NoC's nodes and developing a new multi-path routing algorithm. Both our proposals benefit from the fact that the static streams of data transferred between IP cores are known at the design stage. We provided three heuristic mapping algorithms and compared them with the exact solutions. The provided experimental results, based on two real-life multimedia decoder examples, showed that in the majority of cases our heuristics give results equal to the exact solution at relatively early iterations. The computational complexity of the proposed heuristics is polynomial with respect to the number of cores, whereas the complexity of the exact solution is intractable, being estimated as O(n! · n²), where n is the number of IP cores to be mapped. Consequently, the proposed technique allows us to practically solve the problem of mapping a streaming multimedia application into a NoC-based MPSoC in a reasonable time so that the data transfers between IP cores are balanced and close to the global minimum.
References 1. Bjerregaard, T., Mahadevan, S.: A Survey of Research and Practices of Networkon-Chip. ACM Computing Surveys (CSUR) 38, Article 1 (2006) 2. Dally, W.J., Towels, B.: Route Packets, Not Wires: On-Chip Interconnection Networks. In: 38th ACM IEEE Design Automation Conference (DAC), pp. 684–689 (2001) 3. Kavaldjiev, N., et al.: Routing of guaranteed throughput traffic in a network-on-chip. Technical Report TR-CTIT-05-42 Centre for Telematics and Information Technology, University of Twente, Enschede (2005) 4. Lee, H.G., Chang, N., Ogras, U.Y., Marculescu, R.: On-chip communication architecture exploration: A quantitative evaluation of point-to-point, bus, and networkon-chip approaches. ACM Transactions on Design Automation of Electronic Systems 12(3), article no. 23 (2007)
5. Li, M., Zeng, Q.A., Jone, W.B.: DyXY: a proximity congestion-aware deadlockfree dynamic routing method for network on chip. In: 43rd ACM IEEE Design Automation Conference (DAC), pp. 849–852 (2006) 6. Ogras, U.Y., Marculescu, R.: Prediction-based Flow Control for Network-on-Chip Traffic. In: 43rd ACM IEEE Design Automation Conference (DAC), pp. 839–844 (2006) 7. Smit, G.J.M., et al.: Efficient Architectures for Streaming DSP Applications, Dynamically Reconfigurable Architectures. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany (2006) 8. van der Tol, E.B., Jaspers, E.G.T., Gelderblom, R.H.: Mapping of H.264 decoding on a multiprocessor architecture. In: Image and Video Communications and Processing, Santa Clara, CA, USA, vol. 5022, pp. 707–718 (January 2003) 9. van der Tol, E.B., Jaspers, E.G.T.: Mapping of MPEG-4 Decoding on a Flexible Architecture Platform. In: Media Processors 2002, San Jose, CA, USA, vol. 4674, pp. 362–363 (January 2002)
Max-Min-Fair Best Effort Flow Control in Network-on-Chip Architectures Fahimeh Jafari1,2, Mohammad H. Yaghmaee1, Mohammad S. Talebi2, and Ahmad Khonsari3,2 1
Ferdowsi University of Mashhad, Mashhad, Iran 2 IPM, School of Computer, Tehran, Iran 3 ECE Department, University of Tehran, Tehran, Iran {jafari,ak}@ipm.ir, [email protected], [email protected]
Abstract. Network-on-Chip (NoC) has been proposed as an attractive alternative to traditional dedicated busses in order to achieve modularity and high performance in future System-on-Chip (SoC) designs. Recently, end to end flow control has gained popularity in the design process of network-on-chip based SoCs. Where flow control is employed, fairness issues need to be considered as well. In fact, one of the most difficult aspects of flow control is that of treating all sources fairly when it is necessary to turn traffic away from the network. In this paper, we propose a flow control scheme which admits the Max-Min fairness criterion for all sources. In fact, we formulate the Max-Min fairness criterion for the NoC architecture and present an implementation to be used as a flow control mechanism. Keywords: Network-on-Chip, flow control, Max-Min fairness.
1 Introduction Network-on-Chip (NoC) is a new paradigm for designing future System-on-Chips (SoC) [1]. A typical NoC architecture provides a scalable communication infrastructure for interconnecting cores. Since the communication infrastructure as well as the cores from one design can be easily reused for a new product, NoC provides maximum possibility for reusability. NoCs with their flexible and scalable interconnect provide high computational power to support computationally extensive multimedia applications, i.e. those that combine audio, video and data. In contrast to simple data applications, which can work without guarantees of timing of data delivery, multimedia applications require a guaranteed degree of service in terms of required bandwidth and timeliness. According to the networking terminology, we refer to the traffic of simple data as elastic or Best Effort (BE) traffic and to multimedia traffic as inelastic or Guaranteed Service (GS) traffic. Due to the rapid growth of the number of processing elements (PEs) in NoCs [2], employing an efficient policy for flow control is inevitable in the design of NoCs to
provide the required Quality of Service (QoS). A NoC should support network level flow control in order to avoid congestion in the bottleneck links, i.e. link through which several sources pass [3]. The design and control of NoCs raises several issues well suited to study using techniques of operational research such as optimization and stochastic modeling. Recently, some novel researches have been embarked in studying congestion control in NoCs [4-5]. Congestion control schemes in NoCs mainly focus on utilizing NoC’s resources, with the aim of minimizing network cost or maximizing network utility while maintaining the required QoS for Guaranteed Service traffics. Many strategies for flow control have been proposed for off-chip networks, e.g. data networks, etc. [6-9]. On-chip networks pose different challenges. For instance, in off-chip environments, to overcome congestion in links, packet dropping is allowed. On the contrary, reliability of on-chip wires makes NoCs a loss-less environment. So far, several works have addressed this problem for NoC systems. In [4], a prediction-based flow-control strategy for on-off traffic in on-chip networks is proposed where the prediction is used in router to be aware of buffer fillings. In [5] a flowcontrol scheme for Best Effort traffic based on Model Predictive Control is presented, in which link utilization is used as congestion measure. Dyad [10] controls the congestion by switching from deterministic to adaptive routing when system is going to be congested. [11] proposes a flow control scheme as the solution to rate-sum maximization problem for choosing the BE source rates. The solution to the rate-sum optimization problem is presented as a flow control algorithm. Where flow control is employed, fairness issues need to be considered as well [3]. In fact, one of most difficult aspects of flow control is to choose a policy to accommodate a fair rate allocation. All of the abovementioned studies only regarded the flow control by taking into account the constraints of the system and to the best of our knowledge no policy to maintain fairness among sources was chosen. The fairness of TCP-based flow control algorithms was first analyzed in [12]. The analysis in [12] was based on a single bottleneck link. Different flow control approaches can be classified with respect to the fairness criteria, in favor of which rate allocation is done. One of the famous forms of fairness criterion is Max-Min fairness, which has been discussed in earlier literature and described clearly in [13]. Our main contribution in this paper is to present a flow control scheme for Best Effort traffic in NoC which satisfies Max-Min fairness criterion. Our framework is mainly adopted from the seminal work [13] which presents a basic Max-Min fairness optimization problem. In this paper, we reformulate such a problem for the NoC architecture. The organization of the paper is as follows. In Section 2 we present the system model, the concept of Max-Min fairness and formulation of the flow control as an optimization problem. In section 3 we present an iterative algorithm as the solution to the flow control optimization problem. Section 4 presents the simulation results and discussion about them. Finally, the section 5 concludes the paper and states some future work directions.
2 System Model We consider a NoC with two dimensional mesh topology, a set S of sources and a set L of bidirectional links. Let c_l be the finite capacity of link l ∈ L. The NoC is assumed to use wormhole routing. In wormhole-routed networks, each packet is divided into a sequence of flits which are transmitted over physical links one by one in a pipeline fashion. The NoC architecture is also assumed to be lossless, and packets traverse the network on a shortest path using a deadlock free XY routing. A source consists of Processing Elements (PEs), routers and Input/Output ports. Each link is a set of wires, busses and channels that are responsible for connecting different parts of the NoC. We denote the set of sources that share link l by S(l). Similarly, the set of links that source s passes through is denoted by L(s). By definition, s ∈ S(l) if and only if l ∈ L(s). We assume that there are two types of traffic in the NoC: GS and BE traffic. For notational convenience, we divide S into two parts, each one representing sources with the same kind of traffic. In this respect, we denote the set of sources with BE and GS traffic by S_BE and S_GS, respectively. Each link l is shared between the two aforementioned traffic types. GS sources will obtain the required amount of the capacity of links and the remainder should be allocated to BE sources. 2.1 Max-Min Fairness Concept Any discussion of the performance of a rate allocation scheme must address the issue of fairness, since there exist situations where a given scheme might maximize network throughput, for example, while denying access for some users or sources. Max-Min fairness is one of the significant fairness criteria. Crudely speaking, a set of rates is max-min fair if no rate can be increased without simultaneously decreasing another rate which is already smaller. In a network with a single bottleneck link, max-min fairness simply means that flows passing through the bottleneck link would have equal rates. The following definition states the formal definition of Max-Min fairness. Definition 1. A feasible rate allocation x = (x_s, s ∈ S) is said to be "max-min fair" if and only if an increase of any rate within the domain of feasible allocations must be at the cost of a decrease of some already smaller rate. Formally, for any other feasible allocation y, if y_s > x_s then there must exist some s′ such that x_s′ ≤ x_s and y_s′ < x_s′ [13]. Depending on the network topology, a max-min fair allocation may or may not exist. However, if it exists, it is unique (see [14] for proof). In what follows, the condition under which the Max-Min rate allocation exists will be stated. Before we proceed to this condition, we define the concept of bottleneck link.
Definition 2. With our system model above, we say that link l is a bottleneck for source s if and only if
1. link l is saturated: ∑_{s∈S_BE(l)} x_s + ∑_{s∈S_GS(l)} x_s = c_l
2. source s on link l has the maximum rate among all sources using link l.
Intuitively, a bottleneck link for source s is a link which limits x_s. Theorem 1. A max-min fair rate allocation exists if and only if every source has a bottleneck link (see [14] for proof). 2.2 Flow Control Model Our focus will be on two objectives. First, choosing the source rates (IP loads) of BE traffic so as to accomplish flow control in response to demands at a reasonable level. Second, maintaining Max-Min fairness for all sources. We model the flow control problem in NoC as the solution to an optimization problem. For more convenience, we turn the aforementioned NoC architecture into a mathematical model as in [5]. In this respect, the Max-Min fairness flow control problem can be formulated as:
max_x min_{s∈S} x_s                                              (1)
subject to:
∑_{s∈S_BE(l)} x_s + ∑_{s∈S_GS(l)} x_s ≤ c_l,   ∀l ∈ L            (2)
x_s > 0,   ∀s ∈ S_BE                                             (3)
where the source rates, i.e. x_s, s ∈ S, are the optimization variables. Constraint (2) says that the aggregate BE source rates passing through link l cannot exceed its free capacity, i.e. the portion of the link capacity which has not been allocated to GS sources. For notational convenience, we define
u = min_{s∈S} x_s,   ĉ_l = c_l − ∑_{s∈S_GS(l)} x_s               (4)
therefore the above mentioned problem can be rewritten as:
max u                                                            (5)
subject to:
∑_{s∈S_BE(l)} x_s ≤ ĉ_l,   ∀l ∈ L                                (6)
x_s > 0,   ∀s ∈ S_BE                                             (7)
To solve the above problem, it should be converted so as to be in the form of disciplined optimization problems [15] as follows:
max u                                                            (8)
subject to:
u ≤ x_s,   ∀s ∈ S                                                (9)
∑_{s∈S_BE(l)} x_s ≤ ĉ_l,   ∀l ∈ L                                (10)
x_s > 0,   ∀s ∈ S_BE                                             (11)
The above optimization problem can be solved using several methods. In the next section, we introduce a simple and well-known algorithm, known as "progressive filling", to solve (8) iteratively. In order to compare the results of the progressive filling algorithm with the exact values, we solve problem (8) using CVX [16], which is a MATLAB-based software package for disciplined convex optimization problems, whose results will be given in Section 4.
3 Max-Min Fairness Algorithm Theorem 1 is particularly useful in deriving a practical method for obtaining a maxmin fair allocation, called “progressive filling”. The idea is as follows: rates of all flows are increased at the same pace, until one or more links are saturated. The rates of flows passing through saturated links are then frozen, and the other flows continue to increase rates. All the sources that are frozen have a bottleneck link. This is because they use a saturated link, and all other sources using the saturated link are frozen at the same time, or were frozen before, thus have a smaller or equal rate. The process is repeated until all rates are frozen. Lastly, when the process terminates, all sources have been frozen at some time and thus have a bottleneck link. Using Theorem 1, the allocation is max-min fair. Theorem 2. For the system model defined above, with fixed routing policy, there exists a unique max-min fair allocation. It can be obtained by the progressive filling algorithm. (see [14] for proof) In the sequel, we derive the max-min rate allocation as the solution to problem (8) and based on this algorithmic solution, we present a flow control scheme for BE traffic in NoC systems. Thus, the aforementioned algorithm can be employed to control the flow of BE traffic in the NoC. The iterative algorithm can be addressed in distributed scenario. However, due to well-formed structure of the NoC, we focus on a centralized scheme; we use a controller like [5] to be mounted in the NoC to implement the above algorithm. The necessary requirement of such a controller is the ability to accommodate simple mathematical operations and the allocation of few wires to communicate flow control information to nodes with a light GS load.
Algorithm 1. Max-Min Fair (MMF) Flow Control Algorithm for BE in NoC.
Initialization:
1. Initialize ĉ_l of all links.
2. Define:
   a. T as the set of sources not passing through any saturated link.
   b. B as the set of saturated links.
   c. B̄ = L − B and T̄ = S_BE − T.
3. Set the source rate vector to zero.
4. Initialize T = S_BE and B = ∅.
Loop: Do until (T = ∅)
1. Δs = min_{l∈B̄} [ (ĉ_l − ∑_{s∈S_BE} R_ls x_s(t)) / ∑_{s∈T} R_ls ]
2. x_s(t+1) = x_s(t) + Δs,   ∀s ∈ T
3. Calculate new bottleneck links and update B and B̄.
4. ∀s ∈ T: if s passes through any saturated link then T ⇐ T − {s}
Output: Communicate BE source rates to the corresponding nodes.
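A compact C sketch of the progressive-filling loop behind Algorithm 1 is given below. The routing matrix R (R[l][s] = 1 if BE source s crosses link l), the free capacities c_hat and the problem sizes are assumed to be given, and saturation is detected with a crude floating-point tolerance; this is only an illustration of the idea, not the controller implementation described in the text:

#include <math.h>

#define NL 24   /* number of links (illustrative)      */
#define NS 16   /* number of BE sources (illustrative) */

/* Raise all unfrozen rates by the largest common increment that keeps every
   link feasible, then freeze the sources crossing a newly saturated link.
   Repeat until every source is frozen; the result is max-min fair. */
void progressive_filling(const int R[NL][NS], const double c_hat[NL],
                         double x[NS])
{
    int frozen[NS] = { 0 };
    int active = NS;

    for (int s = 0; s < NS; s++) x[s] = 0.0;

    while (active > 0) {
        double delta = INFINITY;
        for (int l = 0; l < NL; l++) {        /* smallest per-source slack */
            double load = 0.0; int unfrozen = 0;
            for (int s = 0; s < NS; s++) {
                if (!R[l][s]) continue;
                load += x[s];
                if (!frozen[s]) unfrozen++;
            }
            if (unfrozen > 0) {
                double d = (c_hat[l] - load) / unfrozen;
                if (d < delta) delta = d;
            }
        }
        for (int s = 0; s < NS; s++)
            if (!frozen[s]) x[s] += delta;

        for (int l = 0; l < NL; l++) {        /* freeze on saturated links */
            double load = 0.0;
            for (int s = 0; s < NS; s++) if (R[l][s]) load += x[s];
            if (c_hat[l] - load < 1e-9)
                for (int s = 0; s < NS; s++)
                    if (R[l][s] && !frozen[s]) { frozen[s] = 1; active--; }
        }
    }
}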
4 Simulation Results In this section we examine the proposed flow control algorithm, listed above as Algorithm 1, for a typical NoC architecture. We have simulated a NoC with 4 × 4 Mesh topology which consists of 16 nodes communicating using 24 shared bidirectional links, each of which has a fixed capacity of 1 Gbps. In our scenario, packets traverse the network on a shortest path using a deadlock free XY routing. We also assume that each packet consists of 500 flits and each flit is 16 bits long. In order to simulate our scheme, some nodes are considered to have GS data, such as multimedia, to be sent to a destination, while other nodes, which may also be in the set of nodes with GS traffic, have BE traffic to be sent. As stated in Section 2, GS sources will obtain the required amount of the capacity of links and the remainder should be allocated to BE traffic. We are mainly interested in investigating the fairness properties among source rates. In order to investigate the rate allocation in the optimal sense, we solved problem (8) using CVX [16], which is a MATLAB-based software package for disciplined convex optimization problems.
Fig. 1. Network topology (4 × 4 mesh of 16 nodes, numbered 0-15)
Optimal source rates, obtained by CVX, are shown in Fig. 2. Source rates obtained from Algorithm 1 are depicted in Fig. 3. The main feature regarding Fig. 2 and Fig. 3 is that both yield equal values for the minimum source rate, i.e. 0.03 Gbps. The main difference is in the aggregate source rate, which is greater for the result of Algorithm 1. In order to compare the results of the proposed Max-Min fair flow control with other fairness criteria, we have accomplished rate allocation based on maximizing the sum of source rates, i.e. the so-called Rate-Sum Maximization, whose results are depicted in Fig. 4. Comparing Fig. 3 with Fig. 4, it is apparent that although the Rate-Sum criterion aims at maximizing the sum of source rates, there is no guarantee for the rates of weak sources, i.e. sources which achieve very small rates. Indeed, in many scenarios with the Rate-Sum criterion, such sources will earn as little as zero. To compare the results of the three above mentioned schemes in more detail, we have considered five parameters featuring the merit of the different schemes, as follows:
1. least source rate
2. sum of source rates
3. variance of source rates with respect to the mean value
4. Jain's fairness index [17]
5. min-max ratio [17]
These parameters are presented in Table 1. Jain's fairness index and the min-max ratio are defined by (12) and (13), respectively.
Jain's Fairness Index = (∑_{s=1}^{S} x_s)² / (S · ∑_{s=1}^{S} x_s²)        (12)

Min-Max Ratio = min_{s∈S} x_s / max_{s∈S} x_s                               (13)
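Both indices are straightforward to evaluate for a given rate vector; a small C helper, written only for illustration (it assumes n > 0 and at least one strictly positive rate), might look as follows:

/* Jain's fairness index (12) and min-max ratio (13) for n source rates. */
void fairness_metrics(const double x[], int n, double *jain, double *min_max)
{
    double sum = 0.0, sum_sq = 0.0, min = x[0], max = x[0];

    for (int s = 0; s < n; s++) {
        sum    += x[s];
        sum_sq += x[s] * x[s];
        if (x[s] < min) min = x[s];
        if (x[s] > max) max = x[s];
    }
    *jain    = (sum * sum) / (n * sum_sq);  /* 1 for a perfectly uniform allocation */
    *min_max = min / max;                   /* 1 for a perfectly uniform allocation */
}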
From Table 1 we see that rate allocation with the Maximum Rate-Sum criterion yields a slightly greater rate-sum than the Max-Min Fair criterion, i.e. Algorithm 1. However, as discussed above, Algorithm 1 guarantees that the rate allocation is max-min fair, and hence the minimum source rate would not be greater with any other feasible rate allocation; rate allocation is thus carried out in favor of weak sources. On the contrary, Maximum Rate-Sum gives no guarantee for such sources and, as a result, the weakest source achieves a rate as low as zero. Another point worth mentioning is that the rate allocation is closer to a uniform allocation in the Max-Min scheme. To be more precise, we have calculated the variance of source rates with respect to the mean value of source rates in equilibrium. Table 1 shows that the variance of the Max-Min rate allocation, obtained from Algorithm 1, is evidently less than that of the Maximum Rate-Sum scheme, which in turn implies the inherent fairness of the Max-Min rate allocation.
Fig. 2. Rate allocation using CVX results (source rate in ×10^8 bps for sources 1-16)
Fig. 3. Rate allocation using Algorithm 1 (source rate in ×10^8 bps for sources 1-16)
Fig. 4. Rate allocation using Rate-Sum Maximization (source rate in ×10^8 bps for sources 1-16)

Table 1. Quantitative comparison between different rate allocation schemes
Scheme                              Least Rate (×10^8 bps)   Sum of Source Rates (×10^8 bps)   Variance   Fairness Index   Min-Max Ratio
Max-Min Fair (Mathematical Model)   0.310                    10.079                            0.1558     0.7181           0.1856
Max-Min Fair (Algorithm 1)          0.310                    13.545                            0.5004     0.5888           0.1148
Maximum Rate-Sum                    0                        15.349                            1.1974     0.4346           0
5 Conclusion In this paper we addressed the flow control problem for BE traffic in NoC systems. We considered two objectives. First, choosing source rates (IP loads) of BE traffic so as to accomplish flow control in response to demands at a reasonable level. Second, maintaining Max-Min fairness for all sources. Flow control was modeled as a simple algorithmic solution to an optimization problem. The algorithm can be implemented by a controller with light computation and communication overhead. Finally, we compared the results of the proposed Max-Min fair flow control with the Rate-Sum Maximization scheme based on several criteria such as Jain's fairness index and the min-max ratio. The comparison shows that, using the proposed flow control scheme, the rate allocation has a larger fairness index, which indicates that the proposed scheme allocates NoC resources in a fair manner.
References 1. Benini, L., DeMicheli, G.: Networks on Chips: A New SoC Paradigm. Computer Magazine of the IEEE Computer Society 35(1), 70–78 (2002) 2. Dally, W.J., Towles, B.: Route Packets, Not Wires: On-Chip Interconnection Networks. In: Design Automation Conference, pp. 684–689 (2001)
3. Cidon, I., Keidar, I.: Zooming in on Network on Chip Architectures. Technion Department of Electrical Engineering (2005) 4. Ogras, U.Y., Marculescu, R.: Prediction-based flow control for network-on-chip traffic. In: Proceedings of the Design Automation Conference (2006) 5. van den Brand, J.W., Ciordas, C., Goossens, K., Basten, T.: Congestion- Controlled BestEffort Communication for Networks-on-Chip. In: Design, Automation and Test in Europe Conference and Exhibition, pp. 948–953 (2007) 6. Kelly, F.P., Maulloo, A., Tan, D.: Rate control for communication networks: Shadow prices, proportional fairness, and stability. J. Oper. Res. Soc. 49(3), 237–252 (1998) 7. Mascolo, S.: Classical control theory for congestion avoidance in high-speed internet. In: Decision and Control IEEE Conference, vol. 3, pp. 2709–2714 (1999) 8. Gu, Y., Wang, H.O., Hong, Y., Bushnell, L.G.: A predictive congestion control algorithm for high speed communication networks. In: American Control Conference, vol. 5, pp. 3779–3780 (2001) 9. Yang, C., Reddy, A.V.S.: A taxonomy for congestion control algorithms in packet switching networks. J. IEEE Network 9(4), 34–45 (1995) 10. Hu, J., Marculescu, R.: DyAD - smart routing for networks-on-chip. In: Design Automation Conference, pp. 260–263 (2004) 11. Talebi, M.S., Jafari, F., Khonsari, A., Yaghmae, M.H.: A Novel Congestion Control Scheme for Elastic Flows in Network-on-Chip Based on Sum-Rate Optimization. In: International Conference on Computational Science and its Applications, pp. 398–409 (2007) 12. Chiu, D.M., Jain, R.: Analysis of the increase and decrease algorithms for congestion avoidance in computer networks. J. Computer Networks and ISDN Systems 17(1), 1–14 (1989) 13. Bertsekas, D.P., Gallager, R.: Data Networks. Prentice-Hall, Englewood Cliffs (1992) 14. Le Boudec, J.Y.: Rate adaptation, Congestion Control and Fairness: A Tutorial. Ecole Polytechnique Fédérale de Lausanne (EPFL) (2001) 15. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific (1999) 16. Grant, M., Boyd, S., Ye, Y.: CVX (Ver. 1.0RC3): Matlab Software for Disciplined Convex Programming, \url{http://www.stanford.edu/boyd/cvx} 17. Jain, R., Chiu, D., Hawe, W.: A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems. DEC Research Report TR-301 (1984)
Fast Quadruple Precision Arithmetic Library on Parallel Computer SR11000/J2 Takahiro Nagai1 , Hitoshi Yoshida1 , Hisayasu Kuroda1,2 , and Yasumasa Kanada1,2 1
Dept. of Frontier Informatics, The University of Tokyo, 2-11-16 Yayoi Bunkyo-ku Tokyo, Japan {takahiro.nagai,hitoshi.yoshida}@klab.cc.u-tokyo.ac.jp 2 The Information Technology Center, The University of Tokyo, 2-11-16 Yayoi Bunkyo-ku Tokyo, Japan {kuroda,kanada}@pi.cc.u-tokyo.ac.jp
Abstract. In this paper, fast quadruple precision arithmetic for the four kinds of basic operations and multiply-add operations is introduced. The proposed methods provide a maximum speed-up factor of 5 times compared to gcc 4.1.1 on the POWER 5+ processor used in the parallel computer SR11000/J2. We also developed a fast quadruple precision vector library optimized for the POWER 5 architecture. Quadruple precision numbers, which are a 128 bit long double data type, are emulated with a pair of 64 bit double data type values on the POWER 5+ processor used in the SR11000/J2 with the Hitachi Optimizing Compiler and gcc 4.1.1. To avoid rounding errors in computing quadruple precision arithmetic operations, emulation needs high computational cost. The proposed methods focus on optimizing the number of registers and instruction latency.
1
Introduction
Some numerical methods require much more computational work due to rounding errors as the scale of a problem increases. For example, the CG method, one of the solutions for the linear equation Ax=b using Krylov subspaces, is affected by computation errors on large scale problems. Floating point arithmetic operations generate rounding errors because a real number is approximated with a finite number of significant figures. To reduce errors in floating point arithmetic, quadruple precision arithmetic, i.e. higher precision arithmetic, is required. A quadruple precision number, which is a 128 bit long double data type, can be emulated with a pair of 64 bit double precision numbers on the POWER 5 architecture by the run-time routine. The cost of the quadruple precision operations is much higher than that of the double precision operations. In this paper, we present fast quadruple precision arithmetic for the four basic arithmetic operations, i.e. {+, −, ×, ÷} and the multiply-add operation, and a vector library for POWER 5 architecture based machines such as the parallel computer SR11000/J2. We implemented
Table 1. IEEE 754 data type of 64 bit double and 128 bit long double on SR11000/J2

  Data type    Total bit length  Exponent bit length  Exponent range  Significand bit length  Significant figures (decimal)
  IEEE 754     64                11                   −1022 ∼ 1023    52                      about 16.0
  SR11000/J2   128               11 × 2               −1022 ∼ 1023    52 × 2                  about 31.9
We implemented fast quadruple precision arithmetic and built a quadruple precision vector library including the four basic operations and the multiply-add operation. We achieved a maximum speed-up factor of about 5 times over gcc 4.1.1.
2 128-Bit Long Double Floating Point Data Type
POWER 5+ processors, the CPUs of the SR11000/J2, have 64 bit floating point registers. On this 64 bit architecture, a pair of 64 bit registers is used by software to store a quadruple precision floating point value. Quadruple precision can handle numbers with about 31 decimal digits of precision, compared to the (1 + 52) × log10 2 ≈ 16 digits handled by double precision. The point to notice is that the exponent range is the same as that of double precision: although the precision is greater, the magnitude of representable numbers is the same as for 64 bit double precision numbers. That is, while the 128 bit data type can store numbers with more precision than the 64 bit data type, it does not store numbers of greater magnitude. The details are as follows. Each quadruple precision value consists of two 64 bit floating point numbers, each with sign, exponent and significand. The data format, following the IEEE 754 standard, is summarized in Table 1 [5]. Typically, the low-order part has a magnitude that is less than 0.5 units in the last place of the high-order part, so the values of the two parts never overlap and the entire significand of the low-order number adds precision beyond the high-order number.
3 Quadruple Precision Arithmetic
All of the algorithms that follow operate on double or quadruple precision data types and assume the round-to-nearest rounding mode. We denote the floating point operations {+, −, ×, ÷} by {⊕, ⊖, ⊗, ⊘} respectively. For example, floating point addition satisfies a + b = fl(a + b) + err(a + b) exactly; we use a ⊕ b = fl(a + b) to denote the rounded result of the addition, and err() is the error caused by the operation. We now explain the quadruple precision arithmetic operations, which consist of two basic algorithms, Quick-TwoSum() and TwoSum(). These 64 bit double precision algorithms are already used and implemented in gcc 4.1.1 for the 128 bit long double data type [3] and are explained in papers [1,2,4,8]. This 128 bit long double data type does not support the IEEE special numbers NaN and INF. The quadruple precision algorithms introduced in this paper do not satisfy full IEEE compliance.
3.1 About Precision
There are two kinds of quadruple precision addition algorithms with respect to accuracy:

– accuracy of a full 106 bit significand
– accuracy of about 106 bits, permitting a few bits of rounding error in the last part

The latter method needs only about half the number of instructions of the former, because it omits the extra instructions for error compensation. In this paper, we select the latter algorithm, focusing on speeding up quadruple precision arithmetic. We have already quantitatively analyzed quadruple precision addition and multiplication [11]. Here we introduce the quadruple precision algorithms as optimized and implemented in the vector library.

3.2 Addition
The quadruple precision addition algorithm Quad-TwoSum(a, b), which is built on the floating point addition TwoSum(), computes (sH, sL) = fl(a + b). Here, (sH, sL) represents s with s = sH + sL; sH and sL are 64 bit values holding the high-order and low-order parts respectively, while a, b and s are 128 bit values. We do not need to split the quadruple precision numbers explicitly because each number is stored in memory as a pair of 64 bit values automatically. The TwoSum() algorithm [2] computes s = fl(c + d) and e = err(c + d).

Quad-TwoSum(a, b) {
  (t, r) ← TwoSum(aH, bH)
  e ← r ⊕ aL ⊕ bL
  sH ← t ⊕ e
  sL ← (t ⊖ sH) ⊕ e
  return (sH, sL)
}
TwoSum(c, d) {
  s ← c ⊕ d
  v ← s ⊖ c
  e ← (c ⊖ (s ⊖ v)) ⊕ (d ⊖ v)
  return (s, e)
}
We have to pay attention that both c and d are not 128 bit values but 64 bit doubles. Quad-TwoSum() is the addition routine for quadruple precision numbers with error compensation, built on the TwoSum() algorithm. The number of operation steps is 11 Flop (FLoating point OPerations): 6 Flop from TwoSum() plus 5 Flop from the remaining additions and subtractions. We see that this quadruple precision addition requires 11 times more operations than the 1 Flop of a double precision addition. Throughout this paper Flop denotes the number of floating point operations.
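As an illustration, the two algorithms above translate almost line by line into portable C. The following is a minimal sketch, assuming IEEE 754 round-to-nearest double arithmetic and a compiler that does not re-associate floating point expressions (e.g. no -ffast-math); the quad_t pair type and the function names are ours for illustration and are not part of the library interface described in Section 6.

/* (sH, sL) pair representing s = hi + lo */
typedef struct { double hi, lo; } quad_t;

/* TwoSum(): s = fl(c + d), e = err(c + d) */
static void two_sum(double c, double d, double *s, double *e)
{
    *s = c + d;
    double v = *s - c;
    *e = (c - (*s - v)) + (d - v);
}

/* Quad-TwoSum(): quadruple precision addition, as in the pseudocode above */
static quad_t quad_two_sum(quad_t a, quad_t b)
{
    double t, r;
    two_sum(a.hi, b.hi, &t, &r);
    double e = r + a.lo + b.lo;
    quad_t s;
    s.hi = t + e;
    s.lo = (t - s.hi) + e;
    return s;
}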
3.3 Multiplication
The quadruple precision multiplication algorithm Quad-TwoProd(a, b) computes (pH, pL) = fl(a × b), where (pH, pL) represents p with p = pH + pL; again a, b and p are 128 bit values.

Quad-TwoProd(a, b) {
  m1 ← aH ⊗ bL
  t ← aL ⊗ bH ⊕ m1
  pH ← aH ⊗ bH ⊕ t
  e ← aH ⊗ bH ⊖ pH
  pL ← e ⊕ t
  return (pH, pL)
}

Some processors have an FMA (Fused Multiply-Add) instruction that can compute expressions such as a × b ± c with a single rounding. The merit of this instruction is that there is no double rounding for an addition following a multiplication. The FMA instruction is comparatively fast because it is implemented in hardware, just like the addition and multiplication instructions. The POWER processor series supports FMA, so we built the multiplication algorithm on the FMA instruction. A quadruple precision multiplication costs 8 Flop.
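For processors with an FMA instruction, the same algorithm can be sketched in C with the C99 fma() function from <math.h>, which rounds a × b + c only once. As before, this is an illustrative sketch rather than the library implementation; the quad_t type is repeated here so that the fragment stands alone.

#include <math.h>

typedef struct { double hi, lo; } quad_t;    /* (pH, pL) pair */

/* Quad-TwoProd(): quadruple precision multiplication using fused multiply-add */
static quad_t quad_two_prod(quad_t a, quad_t b)
{
    quad_t p;
    double m1 = a.hi * b.lo;
    double t  = fma(a.lo, b.hi, m1);     /* aL*bH + m1 with one rounding */
    p.hi      = fma(a.hi, b.hi, t);      /* aH*bH + t  with one rounding */
    double e  = fma(a.hi, b.hi, -p.hi);  /* aH*bH - pH with one rounding */
    p.lo      = e + t;
    return p;
}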
3.4 Division
The quadruple precision division algorithm Quad-TwoDiv(a, b) computes (qH, qL) = fl(a ÷ b), where (qH, qL) represents q with q = qH + qL; again a, b and q are 128 bit values.

Quad-TwoDiv(a, b) {
  d1 ← 1.0 ⊘ bH
  m1 ← aH ⊗ d1
  e1 ← −(bH ⊗ m1 ⊖ aH)
  m1 ← d1 ⊗ e1 ⊕ m1
  m2 ← −(bH ⊗ m1 ⊖ aH)
  m2 ← aL ⊕ m2
  m2 ← −(bL ⊗ m1 ⊖ m2)
  m3 ← d1 ⊗ m2
  m2 ← −(bH ⊗ m3 ⊖ m2)
  m2 ← d1 ⊗ m2 ⊕ m3
  qH ← m1 ⊕ m2
  e2 ← m1 ⊖ qH
  qL ← m2 ⊕ e2
  return (qH, qL)
}

This algorithm is based on the Newton-Raphson method. The number of operation steps is 18 Flop plus 1 double precision division. The definition of Flop
does not include the double precision division because it is costly compared to double precision addition and multiplication. This algorithm is applicable in the usual case where special numbers such as NaN or INF are not generated by the initial double precision division (1.0 ⊘ bH).
4 Speeding-Up the Quadruple Precision Arithmetic
We quantitatively evaluated each of the addition and multiplication algorithms on the parallel computer SR11000/J2 at the Information Technology Center, the University of Tokyo. In terms of the number of operations, addition takes 11 Flop and multiplication takes 8 Flop; division takes 18 Flop plus 1 double precision division. From this analysis in terms of addition and multiplication counts it follows that, provided the latencies of floating point instructions such as fadd, fmul and fmadd are the same, performance can be improved by reducing data dependencies between instructions. We also constructed a quadruple precision multiply-add operation by combining the multiplication and addition algorithms.
5 Optimizing Quadruple Precision Arithmetic
First, the theoretical peak performance of one processor of the SR11000/J2 is 9.2 GFlops. Quadruple precision arithmetic operations are rarely limited by the delay of data transfer from main memory to registers because the computation time of one quadruple precision operation is large. To obtain high performance, it is most important to increase throughput and hide instruction latency by pipelining the operations over vector data. To realize pipelined processing we rely on loop unrolling. The latency of floating point instructions such as fadd, fmul, fsub, fabs and fmadd on POWER 5+ is 6 clocks; the throughput is 2 clocks for fmadd and 1 clock for the others. Fig. 1 shows the pipelined processing in the case of a 6 clock instruction latency.
Fig. 1. Pipelining for 6 clock instruction latency
5.1 Hiding Instruction Latency
With loop unrolling we can optimize performance by hiding instruction latency. Data dependencies in the quadruple precision arithmetic operations are resolved by loop unrolling, which lines up identical instructions. An example is shown below, where fr denotes a 64 bit floating point register of the POWER architecture. In Problem() there is a data dependency chain across the three instructions {+, ×, ÷}. The latency can be hidden by loop unrolling as in Solution(), whose unrolling size is 2.

Problem() {
  fr1 ← fr2 + fr3
  fr5 ← fr1 × fr4
  fr7 ← fr5 ÷ fr6
}

Solution() {
  fr1 ← fr2 + fr3
  fr9 ← fr7 + fr8
  fr5 ← fr1 × fr4
  fr11 ← fr9 × fr10
  fr7 ← fr5 ÷ fr6
  fr13 ← fr11 ÷ fr12
}
5.2 Number of Registers
Loop unrolling prevents stalls of CPU resources between instructions. As the POWER 5+ processor has 32 logical floating point registers, we use all of them. (In fact there are 120 physical registers, which are utilized through register renaming.) If m is the number of registers needed for one quadruple precision operation, then

    maximum unrolling size = 32 / m                      (1)

Quadruple precision addition needs 4 registers for one operation ci = ai + bi, that is, m = 4, so we can realize a maximum unrolling size of 8:

    maximum unrolling size = 32 / 4 = 8                  (2)
In a similar way, quadruple precision multiplication ci = ai × bi also needs 4 registers, so m = 4 and the maximum unrolling size is 8. Quadruple precision division ci = ai / bi needs 5 registers per operation, so m = 5 and the maximum unrolling size would be 32/5 = 6. To attain an unrolling size of 8, as for addition and multiplication, we store the contents of one register to memory and reload it when needed. This method achieves an unrolling size of 32/4 = 8.
6 How to Use the Quadruple Precision Arithmetic Operations Library
We have discussed the algorithms and how to optimize quadruple precision arithmetic for vector data in Sections 3, 4 and 5. The interfaces of the quadruple precision arithmetic operations are shown in this section. The library is especially effective for vector data and is implemented in C with optimized assembler code. Users include the header file "quad_vec.h" in C and call the arithmetic functions of the library. We note that adapting it to FORTRAN is straightforward.
Table 2. Compile options

  Compiler                        Compile option
  Optimizing C Compiler 01-03/C   cc -Os +Op -64 -noparallel (not parallelized)
                                  -roughquad (quadruple precision add, multiply, div)
  gcc 4.1.1                       gcc -maix64 -mlong-double-128 -O3
– Addition ci = ai + bi
  void qadd_vec(long double a[], long double b[], long double c[], int n)
– Subtraction ci = ai − bi
  void qsub_vec(long double a[], long double b[], long double c[], int n)
– Multiplication ci = ai × bi
  void qmul_vec(long double a[], long double b[], long double c[], int n)
– Division ci = ai / bi
  void qdiv_vec(long double a[], long double b[], long double c[], int n)
– Multiply-Add ci = s × bi + ci (s : constant)
  void qmuladd_vec(long double *s, long double b[], long double c[], int n)
Here is a sample routine computing a matrix multiplication of size N using qmuladd_vec() described above:

long double a[N][N], b[N][N], c[N][N];
· · ·
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    qmuladd_vec(&a[i][j], b[j], c[i], N);   /* c[i][*] += a[i][j] * b[j][*] */
7 Numerical Experiment
We implemented and evaluated the four kinds of arithmetic operations: addition, multiplication, division and multiply-add. Subtraction is the same operation as addition except for the sign. Our proposed methods were optimized with assembler code and compared with the Hitachi Optimizing Compiler of the SR11000/J2 [10] and gcc 4.1.1. The OS is IBM AIX version 5.3 with the large page setting [9]. Compile options are shown in Table 2. Six data sizes were tested: the size of the L1 cache, half of L2, L2, half of L3, L3, and beyond L3. We measured MQFlops values (1 quadruple precision operation per second is defined as 1 QFlops). Figures 2 to 9 show the quadruple precision arithmetic operation performance. For addition, the effective clocks, i.e. the clocks per loop iteration for each loop unrolling size in our proposed method, are shown in Fig. 2 and the computational performance is shown in Fig. 3. Our proposed methods show high performance over the whole data range for all quadruple precision arithmetic operations. The performances of our proposal and of the Hitachi optimizing compiler for quadruple precision addition are
Fig. 2. Effective Clocks in our proposed addition
Fig. 3. MQFlops in addition
Fig. 4. Effective Clocks in our proposed multiplication
Fig. 5. MQFlops in multiplication
Fig. 6. Effective Clocks in our proposed division
Fig. 7. MQFlops in division
almost the same. Operations with gcc 4.1.1 are much slower because its generated code calls a library function at each step, incurring a large function call overhead. For quadruple precision addition, our proposed method attained a 73.70/38.34 ≈ 1.9 times speed-up over the Hitachi optimizing compiler and a 73.70/13.96 ≈ 5.3 times speed-up over gcc 4.1.1 when the data size just fits in the
Fig. 8. Effective Clocks in multiply-add
Fig. 9. MQFlops in multiply-add
Fig. 10. Matrix multiplication using multiply-add arithmetic operation
L1 cache. Finally, the matrix multiplication result using the optimized multiply-add operation is shown in Fig. 10.
8 Concluding Remarks
In this paper, fast quadruple precision arithmetic for the four basic operations and the multiply-add operation has been developed and evaluated. The proposed methods provide a maximum speed-up of about 5 times over gcc 4.1.1 for vector data on the POWER 5+ processor of the parallel computer SR11000/J2. Although our proposed quadruple precision addition is roughly on a par with the Hitachi optimizing compiler, the other quadruple precision arithmetic operations show high performance over the whole data range. We developed a fast quadruple precision library for vector data optimized for the POWER 5 architecture. Quadruple precision arithmetic operations are costly compared with double precision operations because they must compensate for rounding errors. We therefore applied latency hiding by loop unrolling, tuned to the number of available registers, to the quadruple precision arithmetic operations.
As future work, we plan to develop a quadruple precision library that is available on various architectures such as Intel and AMD. The POWER architecture, as well as PowerPC, has FMA instructions which execute in the same number of clocks as an add or multiply. In environments without an FMA instruction in particular, we will have to develop fast quadruple precision algorithms that do not rely on it.
References

1. Dekker, T.J.: A Floating-Point Technique for Extending the Available Precision. Numer. Math. 18, 224–242 (1971)
2. Knuth, D.E.: The Art of Computer Programming, 2nd edn. Addison-Wesley Series in Computer Science and Information. Addison-Wesley Longman Publishing Co., Inc., Amsterdam (1978)
3. The GNU Compiler Collection, http://www.gnu.org/software/gcc/index.html
4. A fortran-90 double-double library, http://www.nersc.gov/~dhbailey/mpdist/mpdist.html
5. ANSI/IEEE 754-1985 Standard for Binary Floating-Point Arithmetic (1985)
6. Akkas, A., Schulte, M.J.: A Quadruple Precision and Dual Double Precision Floating-Point Multiplier. In: DSD 2003: Proceedings of the Euromicro Symposium on Digital Systems Design, pp. 76–81 (2003)
7. Hida, Y., Li, X.S., Bailey, D.H.: Algorithms for quad-double precision floating point arithmetic. In: Proceedings of the 15th Symposium on Computer Arithmetic, pp. 155–162 (2001)
8. Bailey, D.H.: High-Precision Floating-Point Arithmetic in Scientific Computation. Computing in Science and Engineering 7, 54–61. IEEE Computer Society, Los Alamitos (2005)
9. AIX 5L Differences Guide Version 5.3 (IBM Redbooks). IBM Press (2004)
10. Optimizing C User's Guide For SR11000. Hitachi, Ltd. (2005)
11. Nagai, T., Yoshida, H., Kuroda, H., Kanada, Y.: Quadruple Precision Arithmetic for Multiply/Add Operations on SR11000/J2. In: Proceedings of the 2007 International Conference on Scientific Computing (CSC), Worldcomp 2007, Las Vegas, pp. 151–157 (2007)
Characterizing the Basic Synchronization and Communication Operations in Dual Cell-Based Blades

José L. Abellán, Juan Fernández, and Manuel E. Acacio

Dept. de Ingeniería y Tecnología de Computadores, University of Murcia, Spain
{jl.abellan,juanf,meacacio}@ditec.um.es
Abstract. The Cell Broadband Engine (Cell BE) is a heterogeneous chip-multiprocessor (CMP) architecture designed to offer very high performance, especially on game and multimedia applications. The singularity of its architecture, nine cores of two different types, along with the variety of synchronization and communication primitives offered to programmers, makes the task of developing efficient applications very challenging. This situation gets even worse when we consider Dual Cell-Based Blade architectures, where two separate Cells can be linked together through a dedicated high-speed interface. In this work, we present a characterization of the main synchronization and communication primitives provided by dual Cell-based blades under varying workloads. In particular, we focus on the DMA transfer mechanism, the mailboxes, the signals, the read-modify-write atomic operations, and the time taken by thread creation. Our performance results expose the bottlenecks and asymmetries of these platforms, which must be taken into account by programmers to improve the efficiency of their applications.
1 Introduction
Nowadays, among all contemporary CMP (chip-multiprocessor) architectures, there is one that is currently attracting enormous attention due to its architectural particularities and tremendous potential in terms of sustained performance: the Cell Broadband Engine (Cell BE from now on). From the architectural point of view, the Cell BE can be classified as a heterogeneous CMP. In particular, the first generation of the chip integrates up to nine cores of two distinct types [1]. One of the cores, known as the Power Processor Element or PPE, is a 64-bit multithreaded Power-Architecture-compliant processor with two levels of on-chip cache that includes the vector multimedia extension (VMX) instructions. The main role of the PPE is to coordinate and supervise the tasks performed by the rest of the cores. The remaining cores (a maximum of 8) are called Synergistic Processing Elements or SPEs and provide the main computing power of the Cell BE.
This work has been jointly supported by the Spanish MEC and European Commission FEDER funds under grants "Consolider Ingenio-2010 CSD2006-00046" and "TIN2006-15516-C04-03".
The Cell BE provides programmers with a broad variety of communication and synchronization primitives between the threads that comprise parallel applications, which were evaluated in [2]. Ultimately, the performance achieved by applications running on the Cell BE will depend to a great extent on the ability of the programmer to select the most adequate primitives as well as their corresponding configuration values. The main purpose of this work is to expose the performance bottlenecks and asymmetries of those primitives under varying workloads on a dual Cell-based blade. The rest of the paper is organized as follows. In Section 2 we provide a short revision of the architecture of the Cell BE and a dual Cell-based blade, and a description of some of the communication and synchronization primitives provided to programmers. Next, in Section 3 we introduce our tool, called CellStats, for characterizing these primitives. The results obtained after executing CellStats on a dual Cell-based blade are presented in Section 4. Finally, Section 5 gives the main conclusions of the paper and some of the lessons learned that can help programmers to identify the most appropriate primitive in different situations.
2 Dual Cell-Based Blade

2.1 Architecture
The Cell BE architecture [1] is a heterogeneous multi-core chip composed of one general-purpose processor, called PowerPC Processor Element (PPE), eight specialized co-processors, called Synergistic Processing Elements (SPEs), a high-speed memory interface controller, and an I/O interface, all integrated in a single chip. All these elements communicate through an internal high-speed Element Interconnect Bus (EIB) (see Figure 1(a)). Each SPE is a 128-bit RISC processor designed for high performance on streaming and data-intensive applications [3]. Each SPE consists of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). The SPUs are in-order processors with two pipelines and 128 128-bit registers. All SPU instructions are inherently SIMD operations that the appropriate pipeline can run at four different granularities. As opposed to the PPE, the SPEs do not have a private cache memory. Instead, each SPU includes a 256 KB Local Store (LS) memory to hold both the instructions and data of SPU programs; that is, the SPUs cannot access main memory directly. The MFC contains a DMA Controller and a set of memory-mapped registers called MMIO Registers. Each SPU can write its MMIO registers through several Channel Commands. The DMA controller supports DMA transfers among the LSs and main memory. These operations can be issued either by the owner SPE, which accesses the MFC through the channel commands, or by the other SPEs (or even the PPE), which access the MFC through the MMIO registers.
(a) Block Diagram of Cell BE.
(b) Block Diagram of a Dual Cell-Based Blade.
Fig. 1. Cell BE Architecture
scientific, game and multimedia applications. The main components of a dual Cell-based blade are shown in Figure 1(b). In this architecture the two Cell BEs operate in SMP mode with full cache and memory coherency. Main memory is split into two different modules, namely XDRAM0 and XDRAM1, that are attached to Cell0 and Cell1 respectively. In turn, the EIB is extended transparently across a high-speed coherent interface running at 20 GBytes/second in each direction.

2.2 Programming
The SPEs use DMA transfers to read from (Get) or write to (Put) main memory. DMA transfer size must be 1, 2, 4, 8 or a multiple of 16 Bytes up to a maximum of 16 KB. DMA transfers can be either blocking or non-blocking. The latter allow overlapping computation and communication: there might be up to 128 simultaneous transfers between the eight SPE LSs and main memory. In addition, an SPE can issue a single command to perform a list of up to 2048 DMA transfers, each one up to 16 KB in size. In all cases, peak performance can be achieved when both the source and destination addresses are 128-Byte aligned and the size of the transfer is an even multiple of 128 Bytes [4]. Mailboxes are FIFO queues that support exchange of 32-bit messages among the SPEs and the PPE. Each SPE includes two outbound mailboxes, called SPU Write Outbound Mailbox and SPU Write Outbound Interrupt Mailbox, to send messages from the SPE; and a 4-entry inbound mailbox, called SPU Read Inbound Mailbox, to receive messages. Every mailbox is assigned a channel command and a MMIO register. The former allows the owner SPE to access the outbound mailboxes. The latter enables remote SPEs and the PPE to access the inbound mailbox. In contrast, signals were designed with the only purpose of sending notifications to the SPEs. Each SPE has two 32-bit signal registers to collect incoming notifications. A signal register is assigned a MMIO register to enable remote SPEs and the PPE to send individual signals (overwrite mode) or combined
signals (OR mode) to the owner SPE. Read-modify-write atomic operations enable simple transactions on single words residing in main memory. For example, the atomic_add_return operation adds a 32-bit integer to a word in main memory and returns its value before the addition. Programming of a dual Cell-based blade is equivalent to that of an independent Cell from a functional point of view. However, there are two important differences. First, dual Cell-based blades have 16 SPEs at the programmer's disposal rather than 8 SPEs. This feature doubles the maximum theoretical performance but also makes it much more difficult to extract thread-level parallelism from applications. Second, from an architectural point of view, any operation crossing the Cell-to-Cell interface results in significantly lower performance than those that stay on-chip (see Section 4). These facts must be taken into account by programmers to avoid unexpected and undesirable surprises when parallelizing applications for a dual Cell-based blade platform.
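As a concrete illustration of the SPE-side DMA interface described above, the fragment below sketches a blocking Get of one buffer from main memory into the local store using the MFC intrinsics from spu_mfcio.h. The intrinsic names are those of the Cell SDK used in this work, but the exact calling details should be treated as assumptions; the buffer name, size and tag value are only illustrative.

#include <spu_mfcio.h>

/* 16 KB destination buffer in the SPE's local store, 128-Byte aligned
   for peak DMA performance                                            */
static char ls_buffer[16384] __attribute__((aligned(128)));

void get_block(unsigned long long ea)   /* ea: effective address in main memory */
{
    const unsigned int tag = 1;                           /* DMA tag group       */

    mfc_get(ls_buffer, ea, sizeof(ls_buffer), tag, 0, 0); /* issue the Get       */
    mfc_write_tag_mask(1 << tag);                         /* select the tag group */
    mfc_read_tag_status_all();                            /* block until complete */
}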
3 CellStats
3.1 Architecture
CellStats is a command-line tool which accepts a number of parameters such as the operation to evaluate, the number of SPEs, the specific Cell or Cells to use, the number of iterations and other operation-specific parameters. However, the process to launch, instruct and synchronize the threads is the same in all cases. First, the PPE marshals a structure called the control block. The control block contains all the information needed by each SPE to complete the operation demanded by the user. Next, the PPE creates as many threads as specified by the user and synchronizes them using mailboxes. In turn, SPEs transfer the control block from main memory to their private LSs, report control block transfer completion to the PPE, and wait for the PPE's approval to resume execution. Then, each SPE performs the task entrusted by the user in a loop. In order to measure the time to complete the loop, the SPE utilizes a register called the SPU Decrementer, which decrements at regular intervals or ticks1. Upon completion of the loop, the SPE sends to the PPE the number of elapsed ticks through its outgoing mailbox. In this way, the PPE can compute not only the elapsed time from the go-ahead indication given to the SPEs, but also the time taken by each individual SPE to complete the task. For further details refer to [2].

3.2 Functionality
CellStats performs a different experiment depending on the parameters specified by the user: thread creation; PPE-to-SPE or SPE-to-SPE synchronization using mailboxes or signals; data transfers between main memory and the local LS, or between a remote LS and the local LS, through DMA operations or lists of DMA operations; and atomic operations such as fetch&add,
1 Duration of every tick for the dual Cell-based blade is 70 ns.
fetch&sub, fetch&inc, fetch&dec, and fetch&set on main memory locations. Besides, it is possible to specify the XDRAM memory module (0 or 1) in which memory buffers are allocated. CellStats manages the placement of memory buffers by using the numactl command.

Thread creation. This operation measures the time to launch the threads that are executed by the SPEs. To do that, an empty task that returns immediately is used. Consequently, this operation takes into account not only the time to create the threads but also the time needed to detect their finalization.

Mailboxes. This operation performs a PPE-to-SPE or an SPE-to-SPE synchronization using mailboxes. The PPE/SPE writes a message in the incoming mailbox (SPU Read Inbound Mailbox) of the receiver SPE. Next, the receiver SPE reads the message and replies with another message written to its outgoing mailbox (SPU Write Outbound Mailbox). When the initiator SPE/PPE reads the message, the synchronization process is complete. In the former case, the PPE uses the runtime management library function spe_write_in_mbox [5], which involves a system call and explains the increased latency. Nevertheless, the PPE can also write directly into the corresponding SPE's MMIO register using a regular assignment.

Signals. Unlike mailboxes, this operation performs a PPE-to-SPE or an SPE-to-SPE synchronization using signals. The initiator SPE/PPE signals the destination SPE by writing to the corresponding MMIO register (SPU Signal Notification). If the initiator is an SPE, the destination SPE signals in turn the source SPE, thus finishing the synchronization cycle. Otherwise, the destination SPE sends the reply to the PPE using its outgoing mailbox (SPU Write Outbound Mailbox). As with mailboxes, it is possible to write directly into the SPE's MMIO register instead of using the runtime management library function call spe_write_signal [5].

Atomic operations. These operations enable sequences of read-modify-write instructions on main memory locations, performed in an atomic fashion by as many SPEs as indicated by the user. The memory location accessed by the SPEs can be shared or private. In the latter case, the user can also specify the distance, measured in Bytes, between two consecutive private variables.

DMA operations. Data transfers between main memory and the local LS, or between a remote LS and the local LS, are achieved through DMA operations. The user can specify not only the DMA size but also whether the source buffer (Gets) or the destination buffer (Puts) is shared or private, and whether the memory location is in main memory or in an SPE's LS. Just like atomic operations, the user can specify the distance, measured in Bytes, between two consecutive private buffers.
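To make the mailbox experiment concrete, the sketch below shows the PPE side of the ping-pong just described: the PPE deposits a token in the SPE's SPU Read Inbound Mailbox and then polls the SPE's outbound mailbox for the reply. Function names follow the SPE Runtime Management Library [5] used by CellStats, but the precise signatures and the token value are assumptions made for illustration; the SPE side would read the token with spu_read_in_mbox() and answer with spu_write_out_mbox().

#include <libspe.h>

unsigned int ppe_mailbox_pingpong(speid_t spe)
{
    spe_write_in_mbox(spe, 0x1234);        /* message into SPU Read Inbound Mailbox */

    while (spe_stat_out_mbox(spe) == 0)    /* busy-wait until the SPE has replied   */
        ;

    return spe_read_out_mbox(spe);         /* reply from SPU Write Outbound Mailbox */
}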
4 Evaluation

4.1 Testbed
To develop CellStats we used the IBM SDK v2.1 for the Cell BE architecture installed atop Fedora Core 6 on a regular PC [6]. This development kit includes
a simulator, named Mambo, that allows programmers to execute binary files compiled for the Cell BE architecture. To obtain the experimental results, we installed the same development kit atop Fedora Core 6 on a dual Cell-based IBM BladeCenter QS20 blade which incorporates two 3.2 GHz Cell BEs v5.1, namely Cell0 and Cell1, with 1 GByte of main memory and a 40 GB hard disk.

4.2 Results
Thread creation. The average latency for launching each new thread, as described in Section 3.2, is considerably high, around 1.68 ms. In order to reduce the cost introduced by thread management, programmers can create SPE threads at startup and keep them alive until the application finishes. In this way, the PPE can submit tasks to the SPE threads by means of communication primitives such as mailboxes or signals, thus minimizing overhead.

Mailboxes and Signals. Table 1 shows the average latencies, measured in nanoseconds, for PPE-to-SPE synchronization using mailboxes or signals. In both cases, the PPE can either invoke a system call (Mailbox-sc or Signal-sc) or write directly into the corresponding SPE's MMIO register (Mailbox or Signal). Besides, we consider that the selected SPE can be placed on either Cell for comparison (PPE-SPEc0 for Cell0 and PPE-SPEc1 for Cell1). As we can see, the latency is shorter when writing directly into the SPE's MMIO registers, as defined in the file cbe_mfc.h, instead of using the runtime management library function calls spe_write_signal or spe_write_in_mbox [5]. In the former case, it is worth noting that the synchronization latency doubles when the destination SPE resides on Cell1, for both mailboxes and signals. In addition, Table 1 summarizes the average SPE-to-SPE synchronization latency, measured in nanoseconds, using mailboxes or signals when both SPEs are located on the same Cell (SPEc0-SPEc0) or on different Cells (SPEc0-SPEc1), respectively. In the former case, the latency is almost four times shorter because the synchronization messages stay on-chip and do not need to cross the Cell-to-Cell interface.

Table 1. Average latency (ns) for PPE-to-SPE and SPE-to-SPE synchronization

  Primitive    PPE-SPEc0   PPE-SPEc1   SPEc0-SPEc0   SPEc0-SPEc1
  Mailbox-sc   10,000.0    10,000.0    N/A           N/A
  Mailbox      779.7       1678.2      158.1         589.9
  Signal-sc    18,000.0    18,000.0    N/A           N/A
  Signal       503.8       1182.3      160.1         619.4
Atomic Operations. The average latency of the fetch&add atomic operation for a single variable is shown in Figure 2. By using numactl, we have selected the variable’s memory location (XDRAM0 or XDRAM1). As we can see, latency remains constant, at approximately 111 ns, when the variable is privately accessed by the SPEs. However latency grows linearly, up to 7.5 μs for 16 SPEs, when the
Fig. 2. Latency of fetch&add on shared and separate variables (128-Bytes stride)
variable is shared by all intervening SPEs. This is due to the fact that shared variables serialize the execution of atomic operations. Results for the rest of the atomic operations are similar and, therefore, have been omitted for the sake of brevity. Notice that the XDRAM memory module employed has a negligible effect on the performance results, because of the small size of the variable (4 Bytes).

DMA Operations. There are three different scenarios for data movement: data transfers between main memory and an SPE's LS (Gets), data transfers between an SPE's LS and main memory (Puts) and data transfers between SPEs' LSs (Movs). Results for Puts do not show significant differences to those of Gets and, therefore, have been omitted for the sake of brevity. In Figure 3 latency and bandwidth figures for Gets using Cell0 and Cell1 are shown. In particular, to generate Figures 3(a), 3(c), 3(e) and 3(g) (left side) all SPEs from Cell0 were used before any SPEs from Cell1, while to generate Figures 3(b), 3(d), 3(f) and 3(h) (right side) SPEs were used in the opposite order. As we can see, two general trends can be identified. First, latency is constant for message sizes smaller than or equal to the cache line, that is, 128 Bytes. Second, latency grows proportionally to the message size for messages larger than the cache line until the available bandwidth is exhausted. In addition, a more in-depth analysis provides other interesting conclusions. Latency is constant, but proportional to the number of SPEs, for message sizes up to 128 Bytes regardless of the originating Cell when shared buffers are used (see Figures 3(a) and 3(b)). Latency is constant, around 300 ns, for message sizes up to 128 Bytes regardless of the originating Cell when private buffers are used (see Figures 3(c) and 3(d)).2 For the bandwidth figures, there are three important trends to be considered. Firstly, when 8 SPEs are involved, Gets initiated in Cell0 obtain an aggregate bandwidth of 24.6 GB/s (close to the peak memory bandwidth), while Gets initiated in Cell1 reach an aggregate bandwidth of 13.6 GB/s. This is due to the fact that buffers are always placed in the XDRAM0 memory module. Therefore, Gets from SPEs in Cell1 must cross the Cell-to-Cell interface, limiting the
2 Stride is larger than or equal to the cache line size in all cases.
(a) Gets from Cell0 (shared memory)
(b) Gets from Cell1 (shared memory)
(c) Gets from Cell0 (private memory)
(d) Gets from Cell1 (private memory)
(e) Gets from Cell0 (shared memory)
(f) Gets from Cell1 (shared memory)
(g) Gets from Cell0 (private memory)
(h) Gets from Cell1 (private memory)
Fig. 3. Latency and bandwidth of DMA Gets on shared and private main memory buffers for a variable number of SPEs and packet sizes using Cell0 and Cell1
maximum achievable aggregate bandwidth. With the numactl command, we have verified that allocating all buffers in XDRAM1 memory module reports just the opposite results. Secondly, when 16 SPEs are considered both Cells are
(a) Intra-Cell Movs (shared LS buffer)
(b) Intra-Cell Movs (shared LS buffer)
(c) Inter-Cell Movs (shared LS buffer)
(d) Inter-Cell Movs (shared LS buffer)
Fig. 4. Latency and bandwidth of Movs on shared LS buffers for a variable number of SPEs and packet sizes using a single Cell and both Cells
involved, thus the figures report the benefits of transferring data from the closest XDRAM memory module (for SPEs within Cell0), and also report the drawback of going through the Cell-to-Cell interface (for SPEs within Cell1). Finally, for private buffers the aggregate bandwidth grows faster for message sizes up to 1 KB thanks to simultaneous transfers to different buffers. After that, the aggregate bandwidth figures converge to the same values as before. In turn, latency and bandwidth figures for Movs using Cell0 and Cell1 are shown in Figure 4. In particular, Figures 4(a) and 4(b) correspond to DMA Movs in Cell0, while Figures 4(c) and 4(d) correspond to DMA Movs between Cell0 and Cell1. In the former case, SPEs approach the maximum available bandwidth of the EIB-to-SPE interface. In the latter case, the Cell-to-Cell interface bandwidth is the limiting factor. Nevertheless, the latency is much longer than expected, resulting in an aggregate bandwidth lower than that of Gets originating in Cell1.
5 Conclusions
In this work, we have evaluated the synchronization and communication mechanisms of the Cell BE on a dual Cell-based blade platform. In this way, we can give some recommendations for dual Cell-based blade programmers such as: programmers should avoid frequent creation of threads, since thread creation introduces a significant overhead; they should use direct writes to the SPEs’
MMIO registers, since using runtime management library calls is very slow; for atomic operations, whenever possible, they should use private buffers residing on different cache memory lines, because latency of shared buffers grows linearly with the number of involved SPEs; in case of DMA transfers, they should use private buffers up to 1KB. For messages larger than 1KB, the latency is identical in both cases; finally programmers should be aware of the Cell-to-Cell interface, which determines the maximum achievable bandwidth, and also the asymmetries that arise when memory locations are in the furthest XDRAM memory module. This can be controlled by using the numactl command.
References

1. Kahle, J., Day, M., Hofstee, H., Johns, C., Maeurer, T., Shippy, D.: Introduction to the Cell Multiprocessor. IBM Journal of Research and Development 49(4/5), 589–604 (2005)
2. Abellán, J.L., Fernández, J., Acacio, M.E.: CellStats: a Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE. In: Proceedings of the 16th Euromicro International Conference on Parallel, Distributed and Network-based Processing, pp. 261–268 (2008)
3. Gschwind, M., Hofstee, H.P., Flachs, B., Hopkins, M., Watanabe, Y., Yamazaki, T.: Synergistic Processing in Cell's Multicore Architecture. IEEE Micro 26(2), 10–24 (2006)
4. Kistler, M., Perrone, M., Petrini, F.: Cell Processor Interconnection Network: Built for Speed. IEEE Micro 25(3), 2–15 (2006)
5. IBM Systems and Technology Group: SPE Runtime Management Library Version 2.1 (2007)
6. IBM Systems and Technology Group: Cell Broadband Engine Software Development Toolkit (SDK) Installation Guide Version 2.1 (2007)
Performance Evaluation of the NVIDIA GeForce 8800 GTX GPU for Machine Learning

Ahmed El Zein1, Eric McCreath1, Alistair Rendell1, and Alex Smola2

1 Dept. of Computer Science, Australian National University, Canberra, Australia
2 Statistical Machine Learning Program, NICTA, Canberra, Australia
{Ahmed.ElZein,Eric.McCreath,Alistair.Rendell,Alex.Smola}@anu.edu.au
Abstract. NVIDIA have released a new platform (CUDA) for general purpose computing on their graphical processing units (GPU). This paper evaluates use of this platform for statistical machine learning applications. The transfer rates to and from the GPU are measured, as is the performance of matrix vector operations on the GPU. An implementation of a sparse matrix vector product on the GPU is outlined and evaluated. Performance comparisons are made with the host processor.
1 Introduction
The GeForce 8800 GPU is the first GPU from NVIDIA to implement a unified architecture where pixels and vertices are processed by the same hardware. This provides a higher degree of programmability than previous GPUs and is much better suited to general purpose computing. In recognition of this, NVIDIA have released a general purpose programming interface called CUDA (see Section 2.3 for details) and have packaged the same basic hardware as a dedicated co-processor for use by high performance computing applications (the Tesla product range). Moreover, NVIDIA have also announced [1] that future generations of their hardware will provide support for IEEE double precision arithmetic; a move that will arguably remove the one remaining major bottleneck to the widespread use of GPUs in scientific computations. While the CUDA programming interface significantly eases use of the NVIDIA GPUs for general purpose programming, the programming model provided by CUDA is very different to that available on a traditional CPU. For instance, CUDA has the concepts of shared, constant, texture, and global memories that all have slightly different properties, and determining how best to use each memory type for a given application is non-trivial. Also, it must be remembered that when using any coprocessor the observed performance will depend heavily on what fraction of the application can be run on the coprocessor, and whether the overheads introduced in order to move data to and from the coprocessor are small compared to the computational times involved. In this paper we outline our initial efforts to migrate a Statistical Machine Learning (ML) application to the GeForce 8800 GPU. The kernel of this application involves an iterative solver that performs repeated matrix vector products.
For a number of reasons this application would appear to be well suited to use of a GPU. First, matrix operations are generally well suited to vector or stream processors such as the GeForce 8800. Second, matrix vector products scale as O(N · d), where N is the number of data points and d is the inherent dimensionality of the problem, whereas other steps of a ML application typically scale as O(N ). Consequently for high-dimensional problems, migrating this part of the application to the GPU is potentially beneficial. Third, the matrix does not change between iterations, so it can be copied once to memory on the GPU and reused during each iteration. Finally, for many ML problems single precision arithmetic is sufficient, so the porting effort required is of immediate benefit even before double precision GPUs become available. On closer inspection the situation is not quite as simple. In particular, although matrix operations can be easily vectorized, the amount of data may exceed what a single GPU card can hold (there is 768 MB on the GPU used). Consequently the resulting performance depends heavily on the bandwidth of the bus connecting the CPU and the GPU. Secondly, for matrix-vector multiplications, the limiting factor is the memory bandwidth rather than the raw floating point performance (the latter exceeds the former on both CPU and GPU). This ratio is generally less favourable for GPUs than the ratio between GPU and CPU peak floating point ratios. Finally, many ML problems involve sparse matrices, so the use of a sparse matrix vector product may be preferable to use of the dense equivalent. Sparse matrix algorithms are, however, considerably harder to adapt to stream processors. With the goal of migrating the complete ML application to CUDA this paper addresses three issues: i) What transfer rates can be achieved between host and GPU memory and vice versa, ii) What performance is achieved when using the CUDA supplied BLAS library to perform a variety of dense matrix vector products of sizes similar to those required by ML applications, iii) How does the performance of the latter compare with what we can obtain by hand coding sparse matrix vector products in CUDA. The following section gives background information about the ML application, the NVIDIA 8800 GPU hardware, its CUDA programming model, and methods for sparse matrix vector products. Section 3 details our experimental setup, while Section 4 contains detailed performance results. Section 5 uses the performance data gather here to discusses how a full ML application is likely to perform on the GeForce 8800 and outlines plans for our future work.
2 Background

2.1 The ML Application
One of the key objectives in ML is, given some patterns xi , such as pictures of apples and oranges, and corresponding labels yi , such as the information whether xi is an apple or an orange, to find some function f which allows us to estimate y from x automatically. See e.g. [2] for an introduction. In this quest, convex optimization is a key enabling technology for many problems. For
Fig. 1. Iterative solver algorithm (initial guess w → compute Xw → calculate loss l and gradient g → compute X'g → iterative solver updates w → test for convergence → return w). The black boxes refer to matrix-vector operations which could be accelerated by a GPU.
Table 1. Statistics for some typical ML datasets [3]

  Domain               Dataset          Rows       Columns  Nonzero Elements  Density
  Intrusion Detection  KDDCup99         3,398,431  127      55,503,855        12.86%
  Ranking              NetFlix          480,189    17,770   100,480,507       1.17%
  Text Categorization  Reuters C11      804,414    47,236   60,795,680        0.16%
  Text Categorization  Arxiv astro-ph   62,369     99,757   4,977,395         0.08%
instance, Teo et al. [3] proposed a scalable convex solver for such problems. It is an iterative algorithm that involves guessing a solution vector w, using this to evaluate a loss function l(x, y, w) and its derivative g = ∂w l(x, y, w), and then updating w accordingly. This process is repeated until a desired level of convergence is achieved (see Fig. 1). As mentioned above, the majority of the time is spent evaluating the matrix vector products, and the elements of the matrix (X) do not change between iterations. Many ML datasets are very sparse, as shown in Table 1. Exploiting the sparsity decreases the memory footprint of the matrix as well as the number of floating point operations required for the matrix vector product. Unfortunately it also introduces random memory access patterns and indirect addressing, which is likely to result in less efficient utilization of a GPU's hardware.
2.2 NVIDIA 8800 GTX Hardware
Figure 2 illustrates the architecture of the GeForce 8800 GTX used in this work. At the heart of the device lies the Streaming Processor Array (SPA) consisting of 8 Texture Processor Cluster (TPC) units. Each TPC contains 2 Streaming Multiprocessor (SM) units and a texture unit. The SM in turn consists of 8 Stream Processors (SP) clocked at a default of 1.35 GHz. When running CUDA applications each SP is able to issue one multiply-add (MAD) instruction per cycle. This gives each SM a peak performance of 21.6 GFLOPS, and the GeForce 8800 GTX with 16 SMs an aggregate performance of 345.6 GFLOPS. The SPA is connected to 768 MB of GDDR3 memory through a 384-bit (48 byte) wide interface. Clocked at 900 MHz (1800 MHz effective double data rate) by default, the frame buffer memory has a peak bandwidth of 84.375 GB/s. More details of the NVIDIA hardware can be found in [4].
Fig. 2. GeForce 8800 GTX architecture and CUDA memory model (left: hardware view; right: software view)
2.3 CUDA
The Compute Unified Device Architecture (CUDA) is a hardware and software architecture that enables the issuing and managing of computations on the GPU as a data-parallel device without the need to map the computations to a graphics API. CUDA transforms the hardware's personality from a graphics card to a multi-threaded coprocessor. Provided with CUDA are Basic Linear Algebra Subprograms (BLAS) and Fast Fourier Transform (FFT) implementations; however, NVIDIA only provides a C/C++ API for these. CUDA executes the part of the application that runs on the GPU using hundreds or thousands of threads. These threads are organized into a grid of blocks. The grid can be either one or two dimensional, while each block can be a one, two or three dimensional group of threads. The grid and block dimensions can be set at runtime, with each thread able to retrieve its own thread and block id. Each block of threads is executed on one physical SM, with NVIDIA hardware only allowing synchronization and access to fast shared memory for threads in the same block. An illustration of the CUDA memory model is given in Fig. 2. A programming guide [1] providing additional information is available from NVIDIA (http://www.nvidia.com).
2.4 Sparse Matrices on GPUs
A popular representation for sparse matrices is compressed sparse row (CSR) [5] storage. Non-zero elements are arranged into a dense vector val. For each value in val, its column index from the original matrix is stored in a dense vector of the same size ind at the same offset. A third pointer array (ptr) carries the offset of the first element in every row. CSR storage and associated pseudo code for a sparse matrix vector product are shown in Fig. 3.
for each row i do
  for l = ptr[i] to ptr[i+1]−1 do
    y[i] = y[i] + val[l] · x[ind[l]]
Fig. 3. CSR format and pseudo-code for matrix vector product
Sparse matrix vector products (SpMV) have been implemented on older GPU hardware [6,7,8], but were limited by the graphics API and hardware constraints; Bolz et al. [6] and Krüger et al. [8] achieved 9 and 110 MFLOPS respectively in 2003. Ujaldon et al. [9] achieved 222 MFLOPS in 2005, and recently Sengupta et al. [10] achieved 215 MFLOPS with CUDA on a GeForce 8800 GTX in 2007.
3 Experimental Setup
The GeForce 8800 GTX was hosted in a 2 GHz dual core AMD Athlon64 3800+ system with 2 GB of PC3200 DDR memory. The processor has 128 KB of L1 cache and 1 MB of L2 cache, and it has a theoretical peak performance of 8 GFLOPS. For all benchmarks and experiments, the code was run for 100 iterations and the average time was used to calculate bandwidth and FLOPS. Results for the GPU include the time required to transfer the vector to the GPU and the resulting product vector back to the host. For the sparse matrix vector product, FLOPS were calculated as (2 × nonzero elements ÷ time). For dense matrix vector products the CUDA BLAS library (CUBLAS) was used on the GPU, while ATLAS1 was used on the host. Both specialist matrix vector routines (SGEMV) and general matrix matrix routines (SGEMM) were considered, although for ATLAS SGEMV always outperformed SGEMM, since SGEMM is optimized for matrix-matrix multiplications; thus results for ATLAS SGEMM will not be given. ATLAS permits the matrix to be given in either row or column major format, while CUDA only supports matrices in column major format. Results are given for both normal (N) and transpose (T) ordering of the matrix, as both are required (see Fig. 1). As yet CUBLAS does not support sparse matrices, so our own sparse matrix vector implementation was written (described later). Sparse test matrices were generated using the following code, with the condition that each row contains at least one non-zero element.

s: chosen sparsity (0% to 99%)
for each row i do
  for each column j do
    if s <= random(0,99) do
      matrix[i][j] = 1.0
1 Automatically Tuned Linear Algebra Software, http://math-atlas.sourceforge.net
Table 2. Host initiated memory transfer rates (GB/s)

                                 Latency (μs)   1KB    1MB    100MB
  Main Memory to GPU                 22         0.03   0.80    1.10
  Main Memory (pinned) to GPU        18         0.04   2.70    3.10
  GPU to Main Memory                 18         0.04   0.40    0.50
  GPU to Main Memory (pinned)        15         0.05   2.80    3.00
  GPU Memory to GPU Memory           12         0.14  50.59   71.17

4 Results

4.1 Memory Transfer Rates
Rates for various memory transfer operations are given in Table 2. All transfers are initiated by a CUDA call on the host, with the time recorded from before this call until after the transfer was complete. Hence all benchmarks involve communication over the PCIe bus, which has a maximum bandwidth of 4 GB/s. For host to GPU transfers with large data sizes only ∼25% of the PCIe bandwidth is achieved. CUDA, however, allows for the allocation of non-pageable pinned memory on the host, and when this is used approximately 75% of the peak PCIe rate is achieved. When using unpinned memory, transfer rates from the GPU to main memory are significantly less than from main memory to GPU, but these become roughly equivalent when using pinned memory. All transfers were found to have a latency of ∼20 μs, probably reflecting the latency of the PCIe bus. The bandwidth for transferring data from GPU memory to GPU memory was also measured and found to have an asymptotic value close to the 84.4 GB/s peak, with nearly 60% of this achieved for a 1 MB transfer.
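The pinned-memory path that gives the higher rates in Table 2 is obtained with the standard CUDA runtime calls sketched below; the function and buffer names shown are illustrative, error checking is omitted, and the fragment is a sketch rather than the benchmark code itself.

#include <cuda_runtime.h>

/* Allocate a pinned (page-locked) host buffer, copy it to GPU global
   memory and return the device pointer.                              */
float *upload_pinned(size_t bytes)
{
    float *host_buf = NULL, *dev_buf = NULL;

    cudaMallocHost((void **)&host_buf, bytes);   /* pinned host memory */
    cudaMalloc((void **)&dev_buf, bytes);        /* GPU global memory  */

    /* ... fill host_buf with the matrix or vector data ... */

    cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
    cudaFreeHost(host_buf);                      /* host copy no longer needed */
    return dev_buf;
}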
4.2 Dense Matrix Vector Performance
Using CUBLAS, matrix dimensions that were not a multiple of 16 were found to have significantly lower performance; since for ML applications padding can be done once, only results for matrices that are a multiple of 16 will be reported here. Performance data for square matrices of ascending sizes are given in Fig. 4. These show that on the host system performance is roughly constant for all matrix sizes and that normal ordering significantly outperforms transpose ordering. On the GPU, performance is much more varied. For normal ordering, SGEMV performance increases dramatically as the dimension increases, but for transpose ordering it is roughly constant. Thus, while transpose SGEMV ordering is over twice as fast as normal ordering when N=1024, by the time N=5120 it is 30% slower. In almost all cases use of SGEMM instead of SGEMV is found to be slower. At best, use of the GPU is ∼4.5× faster than use of the host processor. To observe the effect of matrix shape on performance, the total size of the matrix was set to ∼100 MB (5120 × 5120 or 26,214,400 elements) while the number of rows and columns was varied. The results, given in Fig. 5, show a degradation in performance when the number of columns exceeds the number
  Dimension   Host RowMajor SGEMV   GPU ColMajor SGEMV   GPU ColMajor SGEMM
              N      T              N      T             N      T
  1024        2.7    1.2            3.6    7.8           6.9    6.5
  2048        2.9    1.2            7.2    9.2           6.8    7.1
  2816        2.9    1.1            9.5    9.0           6.4    7.0
  3200        2.9    1.1            10.6   10.0          6.2    6.9
  4480        3.0    1.1            13.0   8.7           6.8    7.6
  5120        3.0    1.2            13.6   9.9           7.0    7.8

  (The accompanying plot "Effect of Size on Performance" shows GFLOPS versus dimension for host-SGEMV-rowMajor-N, gpu-SGEMV-colMajor-N and gpu-sgemm-colMajor-N.)

Fig. 4. Performance (GFLOPS) for square matrix vector products
  Columns  Rows     Host RowMajor SGEMV   GPU ColMajor SGEMV   GPU ColMajor SGEMM
                    N      T              N      T             N      T
  204800   128      1.7    0.5            12.5   9.3           6.0    6.8
  51200    512      2.5    1.1            13.4   9.9           6.7    7.6
  25600    1024     2.8    1.1            12.8   9.7           6.9    7.8
  10240    2560     2.9    1.2            12.0   10.2          7.0    7.8
  5120     5120     3.0    1.2            13.6   9.9           7.0    7.8
  2560     10240    2.7    1.1            8.9    8.3           7.0    7.9
  1024     25600    2.7    1.2            3.9    10.1          7.1    7.7
  512      51200    2.7    1.2            2.0    5.3           7.1    7.9
  128      204800   2.7    0.9            0.5    1.3           1.8    2.0

  (The accompanying plot "Effect of Shape on Performance" shows GFLOPS versus the rows/columns ratio for the same three curves.)

Fig. 5. Performance (GFLOPS) as function of shape for matrix vector product
of rows, particularly when using transpose ordering. Although a similar effect occurs when using SGEMM, it only happens when the difference is two orders of magnitude. In summary, the results given here suggest that there is further scope to optimise the performance of the SGEMV routine in CUBLAS.
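For reference, the dense GPU runs above use the CUBLAS interface along the following lines. This sketch is written against the original (CUDA 1.x era) CUBLAS API shipped with the SDK used here; the exact signatures and the wrapper itself should be treated as assumptions for illustration.

#include <cublas.h>

/* y = A*x ('n') or y = A'*x ('t') for an m x n column-major matrix
   already resident on the GPU.                                      */
void gpu_sgemv(char trans, int m, int n,
               const float *dA, const float *dx, float *dy)
{
    cublasSgemv(trans, m, n,
                1.0f,        /* alpha           */
                dA, m,       /* A, leading dim  */
                dx, 1,       /* x, stride       */
                0.0f,        /* beta            */
                dy, 1);      /* y, stride       */
}

/* Typical setup (once per application): cublasInit(), then cublasAlloc()
   and cublasSetMatrix()/cublasSetVector() to place A and x on the device. */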
4.3 Initial Sparse Matrix Vector Implementation
The approach taken here stores the matrix in CSR and is parallelized by assigning rows to threads, such that each thread multiplies all the elements in a given row by the corresponding elements in the vector before writing the sum to the relevant element in the result vector. The following issues were considered:

Memory Loads: While CUDA only supports scalar operations, it also supports up to 128-bit wide vector data types. Loading a float or a float4 from memory costs the same. For a 3000×3000 matrix at 95% sparsity and 32 threads/block, an SpMV took 827 and 442 μs when using float and float4 data types respectively.

Use of Different Memory Types: The GeForce 8800 GPU offers different types of memory (Fig. 2). The key to optimizing the SpMV code on the GPU is determining the most efficient use of each memory type.
Global Memory. On our card there was 768 MB of global memory. This memory is not cached, but can be read and written by any thread.

Shared Memory. Each SM has 16 KB of shared memory that threads in the same block can use to share data. We do not currently use this.

Constant Memory. There is 64 KB of cacheable read-only memory that is initialized each time a GPU kernel is started. Storing the vector in constant memory limits vector sizes to ∼16000, but further reduces the SpMV duration to 242 μs for a 3000×3000 matrix at 95% sparsity and 32 threads/block.

Texture References. CUDA allows binding of global memory to a texture reference. Our initial results suggest this may be useful for the matrix, but only with large row dimensions. Results given here do not use texture references.

The performance of the SpMV implementation for square matrix vector products with a variety of different numbers of threads per block is given in Fig. 6 for matrices with 75% and 95% sparsity. The results show significant variation in performance as a function of the number of threads per block, particularly at 95% sparsity. First, the optimal number of threads per block changes with the size of the matrix, and although there is a general trend suggesting more threads per block for large matrices, this is not always true. Second, beyond some key dimension the performance drops markedly and becomes roughly the same regardless of thread count. At best a performance of 3-4 GFLOPS is observed. By comparison, recent work of Gahvari et al. [11] using a range of sparse matrices on a 1.4 GHz Opteron gave a maximum performance of ∼400 MFLOPS for an unblocked CSR SpMV and a median performance of ∼180 MFLOPS (on current hardware these values would probably increase by a factor of 3). To determine when it is preferable to use SpMV over dense matrix vector products, we plot in Fig. 7 the speedup of SpMV at 75% and 95% sparsity over the equivalent SGEMV runs. This shows that for 75% sparsity using SpMV can be advantageous for dimensions up to around 4000, while for 95% sparsity using SpMV is always an advantage.
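To make the row-per-thread scheme of Section 4.3 concrete, the following serial C sketch spells out the arithmetic each thread performs; the float4 loads, the constant-memory placement of the vector and the thread/block mapping are omitted, and the function and variable names are illustrative rather than taken from the actual implementation.

/* CSR sparse matrix-vector product, one row per unit of work. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const float *val, const float *x, float *y)
{
    for (int row = 0; row < n_rows; ++row) {      /* on the GPU: row == thread */
        float sum = 0.0f;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[row] = sum;
    }
}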
5 Discussion and Further Directions
The results from Section 4.1 show that it is possible to transfer a large ML dataset to global memory on the GPU card at around 3 GB/s. On the GPU card used in this work there was 768 MB of memory, so if 600 MB of this were used to store the ML dataset it would take ∼0.2 s to transfer the data from host memory to GPU memory. This is the minimum amount of time that must be saved when using the GPU instead of the host to perform the computational work. Achieving this is most likely to be possible if the dataset can be copied to the GPU once, left there and re-used in each iteration of the convex solver (Fig. 1). From Table 1 and using CSR storage this should be possible for the intrusion detection and the two text categorization datasets. For larger datasets an alternative strategy would be to divide the problem/dataset over multiple GPU cards, or to use double buffering to overlap movement of data to the GPU with computation on the GPU.
Fig. 6. SpMV performance at 75% and 95% sparsity
Fig. 7. SpMV speedup over SGEMV at 75% and 95% sparsity
Results for dense matrix vector products show a potential speedup of 3−5× from use of the GPU for matrices of dimension 2500×2500 and above. To obtain this performance from the current version of CUBLAS requires, however, that the number of rows in the (column major) matrix be a multiple of 16. While 3−5× is a useful performance gain, the cost of moving the dataset to the GPU could easily erode this advantage. As a consequence it is not clear which would be the better option if, for example, it were a choice between buying a dual core host with an NVIDIA GeForce 8800 GTX as a coprocessor or a quad core host. Performance for our initial sparse matrix vector product is significantly better than we originally expected, achieving similar 3−5× speedups over the host CPU, even for small 1000×1000 matrices if the sparsity is over 90%. Since many ML datasets are sparse, this suggests that it would be advantageous to place further effort into optimising the SpMV routine for CUDA, and in particular, trying to eliminate the performance drop observed after certain dimensions and trying to determine automatically the optimal number of threads per block to use for a given problem size. While other approaches exist that may offer performance gains in specific areas (such as the use of coalesced memory reads), they also introduce complexities. We are in the process of evaluating such approaches. Finally, for easier integration with existing software it would be useful to implement an OSKI (Optimized Sparse Kernel Interface, http://bebop.cs.berkeley.edu/oski/) front-end.
References
1. NVIDIA: NVIDIA CUDA Programming Guide. 1.0 edn. (2007)
2. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, Inc., Chichester (1998)
3. Teo, C.H., Smola, A., Vishwanathan, S.V., Le, Q.V.: A scalable modular convex solver for regularized risk minimization. In: KDD 2007: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 727–736. ACM, New York (2007)
4. NVIDIA: NVIDIA GeForce 8800 GPU Architecture Overview (2006)
5. Duff, I.S., Erisman, A.M., Reid, J.K.: Direct Methods for Sparse Matrices. Oxford University Press, Oxford (1986)
6. Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph. 22(3), 917–924 (2003)
7. Buck, I.: Data parallel computation on graphics hardware (2003)
8. Krüger, J., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. In: SIGGRAPH 2003: ACM SIGGRAPH 2003 Papers, pp. 908–916. ACM, New York (2003)
9. Ujaldon, M., Saltz, J.: The GPU on irregular computing: Performance issues and contributions. In: CAD-CG 2005: Proceedings of the Ninth International Conference on Computer Aided Design and Computer Graphics (CAD-CG 2005), pp. 442–450. IEEE Computer Society, Washington DC (2005)
10. Sengupta, S., Harris, M., Zhang, Y., Owens, J.D.: Scan primitives for GPU computing. In: Graphics Hardware, pp. 97–106. ACM, New York (2007)
11. Gahvari, H., Hoemmen, M., Demmel, J., Yelick, K.: Benchmarking sparse matrix-vector multiply in five minutes. In: SPEC Benchmark Workshop (2007)
Hardware Implementation Aspects of New Low Complexity Image Coding Algorithm for Wireless Capsule Endoscopy

Paweł Turcza 1,4, Tomasz Zieliński 2,4, and Mariusz Duplaga 3,4

1 Department of Instrumentation and Measurement, 2 Department of Telecommunications, AGH University of Science and Technology, Kraków, Poland
3 Collegium Medicum, Jagiellonian University, Kraków, Poland
4 Center of Innovation, Technology Transfer and University Development, Jagiellonian University, Kraków, Poland
{turcza,tzielin}@agh.edu.pl, [email protected]
Abstract. The paper presents hardware implementation aspects of a new efficient image compression algorithm designed for wireless capsule endoscopy with a Bayer color filter array (CFA). Since power limitations, small size conditions and the specific image data format (CFA) exclude the application of traditional image compression techniques, dedicated ones are necessary. The discussed algorithm is based on an integer version of the discrete cosine transform (DCT); therefore it has low complexity and power consumption. It is demonstrated that the performance of the proposed algorithm is comparable to the performance of JPEG2000, a very complex, sophisticated wavelet-based coder. In the paper a VLSI coder architecture is proposed and power requirements are discussed. Keywords: Image coding, integer transformation, DCT, Bayer, JPEG2000.
1 Introduction

An endoscopic medical procedure, often accompanied by a biopsy of pathological changes, plays a fundamental role in the diagnosis of many gastrointestinal (GI) tract diseases. Until recently it has been the most important method of investigation of the upper (gastroscopy) and lower (colonoscopy) parts of the GI tract. However, examination with a flexible endoscope is also a very unpleasant experience for a patient. Due to enormous progress in microelectronics, a wireless endoscopic capsule has been invented recently that makes possible a non-invasive evaluation of the whole GI tract together with the small intestine. The first such capsule was built by Given Imaging Ltd. [1] at the end of the 20th century. It is equipped with a CMOS sensor, lighting, a data processing module and a transmission unit (Fig. 1). After being swallowed by a patient, the capsule passes passively through the GI tract due to peristaltic intestine movements, taking photos (images) that are wirelessly transmitted to a recorder carried by the patient. Unfortunately, this technology is not ideal yet. Due to the lack of an autonomous locomotion and navigation system, detailed investigation of the high-volume stomach is impossible, whereas detailed investigation of the large intestine is
possible only after its inflation, since normally it is shrunk. The quality of the recorded video (256×256 pixels, 8 bits per color, only 2 frames per second) is also very low. It has already been reported in the literature [2], [3] that the endoscopic capsule can reach only approximately one megabit per second transmission bitrate due to the limitation of power consumption and the severe attenuation of radio waves in a human body. Because a CMOS sensor having VGA resolution (640×480 pixels) delivers 2.45×10^6 bits per image, it is obvious that without data compression the transmission of one image would last more than 2 seconds, spending quite a lot of energy. Since existing standard image compression algorithms are not appropriate for capsule endoscopy (CE) due to their high computational complexity, simple dedicated algorithms are under development [3], [4]. In this paper memory requirements optimization and hardware implementation aspects of a recently proposed algorithm [5], [6], better than [3], [4], are presented. The influence of codebook reduction in the entropy coder on algorithm performance is carefully evaluated and justified. It is shown that the performance of such a modified algorithm is comparable to the performance of JPEG2000, a very complex wavelet-based coder. It should be mentioned that the problems associated with the color filter array (CFA) Bayer pattern are efficiently solved by our algorithm, in contrast to the approach presented in [3]. Our method also makes it possible to reconstruct the G1 and G2 CFA image color components more precisely. In contrast to [4], our algorithm offers a higher compression range, so a higher image frame rate is possible. In the paper a VLSI coder architecture is proposed and power requirements are discussed.
2 New Image Coding Algorithm

A simplified block diagram of data processing in the wireless endoscopy capsule is presented in Fig. 1. The output image data from a CMOS image sensor, after image and channel coding (required for bandwidth reduction and error protection), are transmitted by a wireless transceiver (TX) to the outside of the body, where they are received, stored and eventually decompressed for subsequent diagnosis. A control unit in the capsule controls the operation of the compression module according to commands received from an external controller. The proposed image coder performs four operations sequentially: color transformation, image transformation, coefficient quantization and entropy coding.
Fig. 1. Block diagram of data processing in wireless endoscopy capsule
2.1 Color and Structure Transformation

In CE, as well as in inexpensive digital cameras, color filter arrays (CFA) are placed on a monochrome CMOS image sensor to produce color images. The most popular CFA pattern has been proposed by Bayer [7]. The Bayer CFA, shown in Fig. 2, uses a 2×2 repeating pattern having two green pixels, one red and one blue. It is clear that application of the CFA results in the collection of incomplete image information (only one color component for each pixel), so color interpolation is necessary to reconstruct the full color image from the sensor data. From the image compression point of view, the interpolation step introduces redundancy that is difficult to remove. Therefore, for best performance the image compression should precede the color data interpolation step [8]. Additional redundancy reduction can be achieved by a color space transformation. For normal RGB images this is achieved by the following one:
\[
\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix} =
\begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.146 & -0.288 & 0.434 \\ 0.617 & -0.517 & -0.100 \end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\qquad (1)
\]
where Y is the luma component while Cb and Cr denote the blue and red chroma components, respectively. However, due to CE power limits and the quincunx sampling scheme of the green component in the Bayer CFA, we propose the application of a modified RGB to YCgCo space conversion known from the Fidelity Range Extensions (FRExt) of the H.264 video coding standard [9]:

\[
\begin{bmatrix} Y_1 \\ Y_2 \\ C_g \\ C_o \end{bmatrix} =
\begin{bmatrix} 1/2 & 0 & 1/4 & 1/4 \\ 0 & 1/2 & 1/4 & 1/4 \\ 1/4 & 1/4 & -1/4 & -1/4 \\ 0 & 0 & -1/2 & 1/2 \end{bmatrix}
\begin{bmatrix} G_1 \\ G_2 \\ B \\ R \end{bmatrix}
\qquad (2)
\]
where Cg stands for the green chroma and Co stands for the orange chroma. The quincunx array (luma component) has high-frequency content that makes the compression task difficult. Therefore, a structure conversion [8] (see Fig. 2) with optional reversible filtering (deinterlacing, see Fig. 3) is applied in our coder.
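As an illustration, the forward transformation (2) for a single 2×2 Bayer cell can be written in C as below; the function name is an assumption of this sketch, and a hardware realization would use the equivalent shift-and-add form rather than floating point.

typedef struct { float y1, y2, cg, co; } ycgco_t;

/* Forward color transformation (2) applied to one G1/G2/B/R Bayer cell. */
static ycgco_t rgb_bayer_to_ycgco(int g1, int g2, int b, int r)
{
    ycgco_t c;
    c.y1 = 0.5f * g1 + 0.25f * b + 0.25f * r;
    c.y2 = 0.5f * g2 + 0.25f * b + 0.25f * r;
    c.cg = 0.25f * g1 + 0.25f * g2 - 0.25f * b - 0.25f * r;
    c.co = -0.5f * b + 0.5f * r;
    return c;
}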
2.2 Image Data Transformation and Entropy Coding

A block diagram of a general transform-based image coder (for one image component only) is presented in Fig. 4. The 2D image transformation is the first operation performed. Its goal is to concentrate the image energy in the smallest possible number of transform coefficients, which are then scalar quantized and efficiently coded by the entropy encoder. The discrete cosine (DCT) or wavelet (DWT) transforms are usually chosen as the image transformation. In the proposed algorithm an integer approximation of the DCT transformation is used.
Fig. 2. Bayer CFA and color/structure conversion
Fig. 3. Reversible image filtering (deinterlacing) [8]
[Image component → Integer 2D-DCT → Scalar quantizer → Huffman entropy coder with code book]
Fig. 4. Block diagram of transform based image coder
The two-dimensional (2D) DCT is a separable transformation and is usually implemented as a 1D row-wise transformation followed by a 1D column-wise one. In the proposed algorithm the 2D-DCT of each non-overlapping 4×4 input data block X (separate RGB or YCgCo components) has been computed as
\[
Y = T_f X T_f^T \qquad (3)
\]

where \(T_f\) denotes an integer approximation of the 1D-DCT matrix [10]:
\[
T_f = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
\qquad (4)
\]
and superscript T denotes transposition. Since the transformation (3)-(4) represents only an approximation of the original DCT, an additional scaling by a matrix \(S_f\) is required in (3):

\[
Y = \left( T_f X T_f^T \right) \otimes S_f, \qquad
S_f = \begin{bmatrix} a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \\ a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \end{bmatrix}
\qquad (5)
\]

where \(a = 1/2\), \(b = \sqrt{2/5}\), and \(\otimes\) denotes element-by-element multiplication. This element-by-element multiplication can be incorporated into the quantization step. What is important, apart from its good data decorrelation property, the 1D-DCT (4) also has a very efficient computational implementation, presented in Fig. 5.
Fig. 5. Low complexity 4-point DCT butterfly algorithm
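A minimal C sketch of the butterfly from Fig. 5, i.e. one application of T_f to a row (or column) of a 4×4 block, using only additions and shifts (the function name is illustrative):

/* 4-point forward integer DCT butterfly: X = Tf * x */
static void dct4_forward(const int x[4], int X[4])
{
    int s0 = x[0] + x[3], s1 = x[1] + x[2];   /* sums */
    int d0 = x[0] - x[3], d1 = x[1] - x[2];   /* differences */
    X[0] = s0 + s1;
    X[2] = s0 - s1;
    X[1] = (d0 << 1) + d1;                    /* 2*d0 + d1 */
    X[3] = d0 - (d1 << 1);                    /* d0 - 2*d1 */
}

Applying this routine to the four rows and then to the four columns of a block realizes the separable 2D transform (3).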
The inverse DCT of a 4×4 input data block Y can be computed using the following formula:

\[
\hat{X} = T_i^T \left( Y \otimes S_i \right) T_i \qquad (6)
\]
where

\[
T_i = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1/2 & -1/2 & -1 \\ 1 & -1 & -1 & 1 \\ 1/2 & -1 & 1 & -1/2 \end{bmatrix},
\qquad
S_i = \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix}
\qquad (7)
\]
After the DCT transformation the coefficients of Y are scalar quantized, which results in a high compression ratio and a loss of information.
In the discussed algorithm the coefficients of the DCT transform are first quantized and then entropy coded. A technique well known from the JPEG standard [11] was applied, to 4×4 data blocks instead of 8×8. Since the DC coefficients of adjacent 4×4 blocks are strongly correlated, they are coded differentially. The remaining AC coefficients are entropy coded in a 2-step process. In the first step the sequence of quantized coefficients is converted into an intermediate sequence of symbols (RL, v), where RL is the number of consecutive zero-valued AC coefficients preceding the nonzero AC coefficient v. If all the remaining AC coefficients in the block are zero, they are represented by the symbol (0,0). In the second step variable-length Huffman codes (VLC) are assigned to the symbols z = 16·RL + |v|_2, where |v|_2 denotes the length of the binary representation (without leading zeros) of v (if v > 0) or of −v (if v < 0). The value of v is encoded using a variable length integer (VLI) code whose length in bits, |v|_2, was already encoded using the VLC. The VLI of v equals the binary representation of v if v > 0, or of −v if v < 0. Separate Huffman tables are designed for the DC and AC coefficients.
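A small C sketch of the symbol formation described above (the names are illustrative; the Huffman code assignment itself is table driven and not shown):

/* Builds the Huffman symbol index z = 16*RL + |v|_2 for a nonzero AC coefficient. */
static int ac_symbol(int run_length, int v)
{
    int a = v < 0 ? -v : v;
    int bits = 0;                 /* |v|_2: bits of |v| without leading zeros */
    while (a) { bits++; a >>= 1; }
    return 16 * run_length + bits;
}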
2.3 Hardware Implementation

Finally, the design of a future hardware implementation of the proposed image encoder has been addressed. The developed VLSI architecture is presented in Fig. 6. As expected, memory is the most heavily used resource. It is required by the 2D DCT transformation, the color converter and the entropy encoder. Although the 1D DCT over rows can be implemented very efficiently (during pixel acquisition), the vertical one requires 4 rows to be stored in memory. Since the color converter operates on two lines, one of them has to be acquired first and stored in memory. In order to reduce the memory required by the entropy coder, Huffman codes for symbols z ≥ 64 are not stored in the code book. Instead they are encoded as an escape sequence followed by a fixed-length code. Since such symbols occur very infrequently, this technique allows a four-fold reduction of the codebook size while causing only a negligible decrease in compression ratio. The proposed approach significantly reduces the Huffman code length as well as the width of the codebook memory word. The estimated power budget of the design is given in Tab. 1. For VGA (640×480) image resolution and 10 frames per second (fps) the pixel clock equals 3 MHz. The circuit supply current is then 3 × 0.984 = 2.95 mA. We can conclude that the expected power requirements are similar to the much simpler design proposed in [3], but our algorithm offers a significantly higher compression ratio.
3 Results

The proposed algorithm has been implemented as a computer program and its efficiency has been measured. Two exemplary full RGB color images, one from colonoscopy (Fig. 7) and one from gastroscopy (Fig. 8), have been used in the initial experimental tests.
Fig. 6. Simplified VLSI architecture of the proposed image coder
Table 1. Estimated power budget

Cell type (model current, µA/MHz): DF1 0.345, ADD31 0.342, MUX21 0.148, BUFE2 0.254, SRAM 4 kbits 130, SRAM 8 kbits 160, SRAM 64 kbits 251

Module | DF1 | ADD31 | MUX21 | BUFE2 | SRAM 4 kbits | SRAM 8 kbits | SRAM 64 kbits | Supply current [µA/MHz]
Color trans. | 32 | 64 | 8 | - | - | 1 | - | 193
Line buffer | - | - | - | 16 | - | - | 2 | 503
DCT | 240 | 96 | 192 | - | - | - | - | 144
Quantizer | - | 36 | - | - | - | - | - | 12
Entropy coder | 4 | 4 | - | - | 1 | - | - | 132
TOTAL | | | | | | | | 984
Since the discussed algorithm operates on the Bayer CFA format, the test images have been converted into this format by low-pass filtering (with a short 3-tap half-band filter) and appropriate resampling. As an objective quality measure the peak signal-to-noise ratio (PSNR) has been used:

\[
\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{mse}}, \qquad
\mathrm{mse} = \frac{1}{N \cdot M} \sum_{n=1}^{N} \sum_{m=1}^{M} \left( \hat{x}_{n,m} - x_{n,m} \right)^2
\qquad (8)
\]

where \(\hat{x}_{n,m}\) denotes the value of the pixel having coordinates (n, m) in the reconstructed image (in Bayer CFA format).
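A direct C sketch of the measure (8) for two N×M images stored row by row (illustrative only; the evaluation program used for the experiments is not reproduced here):

#include <math.h>

/* PSNR in dB between a reference and a reconstructed 8-bit image. */
double psnr(const unsigned char *ref, const unsigned char *rec, int n, int m)
{
    double mse = 0.0;
    for (int i = 0; i < n * m; ++i) {
        double d = (double)ref[i] - (double)rec[i];
        mse += d * d;
    }
    mse /= (double)(n * m);
    if (mse == 0.0) return INFINITY;   /* identical images */
    return 10.0 * log10(255.0 * 255.0 / mse);
}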
The obtained compression ratio (CR) with the corresponding PSNR for different versions of the DCT-based algorithm is presented in Fig. 9 and compared with the standard JPEG2000 coder. The following denotations are used:

- G1-G2-B-R – independent coding of the 4 images from the Bayer CFA sensor having colors G1, G2, B, R (the simplest approach);
- G12-B-R – coding without the color transformation (2) but with the structure transformation (Fig. 2);
- G12-B-R-di – coding without the color transformation (2) but with the structure transformation (Fig. 2) and additional deinterlacing/filtering (Fig. 3);
- Y12-U-V – coding with the color transformation (2) and the structure transformation (Fig. 2);
- Y12-U-V-di – coding with the color transformation (2), the structure transformation (Fig. 2) and additional deinterlacing/filtering (Fig. 3);
- Y12-U-V-DCT8F – algorithm based on the standard 8-point, floating-point DCT transform with the color transformation (2) and the structure transformation (Fig. 2).

We can conclude from Fig. 9 that: 1) due to aliasing and the high-frequency image content created by the Bayer CFA sampling, the 4×4 DCT algorithm slightly outperforms JPEG2000 for lower compression ratios (e.g. 15) but is worse than JPEG2000 for higher ratios (e.g. 30); both algorithms were defeated by the floating-point 8×8 DCT; 2) usage of the color space conversion (2) is beneficial; 3) the deinterlacing operation, described in Fig. 3, should be neglected since it makes the results worse. In order to obtain results with higher statistical significance, a more extensive test on a longer data set has been performed. 100 video frames were chosen in a random manner from gastroscopy/colonoscopy recordings and coded using Huffman tables with 64 entries (precomputed using a different set of frames).
Fig. 7. Exemplary colonoscopy image
Fig. 8. Exemplary gastroscopy image
Fig. 9. Results for the DCT based image coder. (test images: left - colonoscopy, right – gastroscopy).
Fig. 10. Performance comparison
Fig. 11. Histograms of results from Fig. 10
From the results presented in Figs. 10 and 11 we can conclude that for images with many diagnostic details the proposed compression scheme has an efficiency similar to the JPEG2000 standard. The JPEG2000 algorithm is significantly better only for simple, detail-free images which do not have much diagnostic value.
4 Conclusions

A low-complexity, low-power image compression algorithm suitable for wireless capsule endoscopy has been proposed and tested in the paper. It makes use of an integer version of the discrete cosine transformation. Transform coefficients are encoded using an optimized, low complexity Huffman coder. Assuming that low implementation complexity and a PSNR of at least 36 dB are required, the best PSNR-CR performance has been obtained for the DCT algorithm with the color space conversion (2) and the structure conversion (presented in Fig. 2) but without the image filtering/deinterlacing (shown in Fig. 3). For it we can obtain a compression ratio of CR = 15 for colonoscopy images and CR = 25 for gastroscopy ones.
Acknowledgments. The research activities presented in this paper were conducted under the European Commission R&D Project No 3E061105 (VECTOR).
References
1. Iddan, G., Meron, G., Glukhovsky, A., Swain, P.: Wireless capsule endoscopy. Nature 6785, 417–418 (2000)
2. Mylonaki, M., Fritscher-Ravens, A., Swain, P.: Wireless capsule endoscopy: a comparison with push enteroscopy in patient with gastroscopy and colonoscopy negative gastrointestinal bleeding. Gut J., 1122–1126 (2003)
3. Turgis, D., Puers, R.: Image compression in video radio transmission for capsule endoscopy. Sensors and Actuators A, 129–136 (2005)
4. Xie, X., Li, G.L., Wang, Z.H.: A Near-Lossless Image Compression Algorithm Suitable for Hardware Design in Wireless Endoscopy System. EURASIP Journal on Advances in Signal Processing, Article ID 82160, 1–13 (2007)
5. Turcza, P., Duplaga, M.: Low-Power Image Compression for Wireless Capsule Endoscopy. In: IEEE Int. Workshop on Imaging Systems and Tech. – IST 2007, Krakow, Poland (2007)
6. Turcza, P., Zielinski, T., Duplaga, M.: Low complexity image coding algorithm for capsule endoscopy with Bayer color filter array. Signal Processing, Poznan, Poland, 27–32 (2007)
7. Bayer, B.E.: Color Imaging Array. U.S. Patent 3,971,065 (1976)
8. Koh, C.C., Mukherjee, J., Mitra, S.K.: New efficient methods of image compression in digital cameras with color filter array. IEEE Trans. on Consumer Electronics 49(4), 1448–1456 (2003)
9. ITU-T Rec. H.264 / ISO/IEC 11496-10: Advanced Video Coding, Final Committee Draft, Document JVTF100 (December 2002)
10. Malvar, H.S., Hallapuro, A., Karczewicz, M., Kerofsky, L.: Low-Complexity Transform and Quantization in H.264/AVC. IEEE Trans. on Circuits and Systems for Video Tech. 7 (2003)
11. Wallace, G.K.: The JPEG still picture compression standard. Communications of the ACM 34, 30–44 (1991)
Database Prebuffering as a Way to Create a Mobile Control and Information System with Better Response Time

Ondrej Krejcar and Jindrich Cernohorsky

VSB Technical University of Ostrava, Center of Applied Cybernetics, Department of Measurement and Control, 17. Listopadu 15, 70833 Ostrava Poruba, Czech Republic
{Ondrej.Krejcar,Jindrich.Cernohorsky}@vsb.cz
Abstract. Location-aware services can benefit from indoor location tracking. The widespread adoption of Wi-Fi as the network infrastructure creates the opportunity of deploying WiFi-based location services with no additional hardware costs. Additionally, the ability to let a mobile device determine its location in an indoor environment supports the creation of a new range of mobile control system applications. The main area of interest is a model of a radio-frequency based system enhancement for locating and tracking users of our control system inside buildings. The developed framework, as described here, joins the concepts of location and user tracking as an extension of a new control system. The experimental framework prototype uses a WiFi network infrastructure to let a mobile device determine its indoor position. The user location is used for data pre-buffering and pushing information from the server to the user's PDA. All server data are saved as artifacts together with their position information in the building. Keywords: prebuffering; localization; framework; Wi-Fi; 802.11x; PDA.
1 Introduction

The usage of various wireless technologies has increased dramatically and will keep growing in the following years. This will lead to the rise of new application domains, each with their own specific features and needs. These new domains will also undoubtedly apply and reuse existing (software) paradigms, components and applications. Today this is easily recognized in the miniaturized applications in network-connected PDAs that provide more or less the same functionality as their desktop application equivalents. It is very likely that these new mobile application domains will adopt new paradigms that specifically target the mobile environment. We believe that an important paradigm is context-awareness. Context is relevant to the mobile user, because in a mobile environment the context is often very dynamic and the user interacts differently with the applications on his mobile device when the context is different. While a desktop machine is usually in a fixed context, a mobile device may be used at work, on the road, during a meeting, or at home. Context is not limited to the physical world around the user, but also incorporates the user's
behavior, terminal and network characteristics. Context-awareness concepts can be found as basic principles in the long-term strategic research for mobile and wireless systems such as formulated in [5]. The majority of context-aware computing to date has been restricted to location-aware computing for mobile applications (location-based services). However, position or location information is a relatively simple form of contextual information. To name a few other indicators of context awareness that make up the parametric context space: identity, spatial information (location, speed), environmental information (temperature), resources that are nearby (accessible devices, hosts), availability of resources (battery, display, network, bandwidth), physiological measurements (blood pressure, heart rate), activity (walking, running), schedules and agenda settings. Context-awareness means that anybody is able to use context information. We consider location as the prime form of context information, and we focus on position determination in an indoor environment. Location information is used to determine the actual user position and his future position. We have performed a number of experiments with the control system focusing on position determination, and we are encouraged by the results. The remainder of this paper describes the conceptual and technical details.
2 Basic Concepts and Technologies of User Localization

The proliferation of mobile computing devices and local-area wireless networks has fostered a growing interest in location-aware systems and services. A key distinguishing feature of such systems is that the application information and/or interface presented to the user is, in general, a function of his/her physical location. The granularity of the needed location information may vary from one application to another. For example, locating a nearby printer requires fairly coarse-grained location information, whereas locating a book in a library would require fine-grained information. While much research has been focused on the development of services architectures for location-aware systems, less attention has been paid to the fundamental and challenging problem of locating and tracking mobile users, especially in in-building environments. We focus mainly on RF wireless networks in our research. Our goal is to complement the data networking capabilities of RF wireless LANs with accurate user location and tracking capabilities for pre-buffering the data the user needs. We use this property as an information basis for the extension of the control system.

2.1 Data Collection

A key step of the proposed research methodology is the data collection phase. We record information about the radio signal as a function of the user's location. The signal information is used to construct and validate models for signal propagation. Among other information, the WaveLAN NIC makes the signal strength (SS) and the signal-to-noise ratio (SNR) available. SS is reported in units of dBm and SNR is expressed in dB. A signal strength of s Watts is equivalent to 10·log10(s/0.001) dBm; for example, a signal strength of 1 Watt is equivalent to 30 dBm. Each time a broadcast
Fig. 1. Localization principle – triangulation
packet is received, the WaveLAN driver extracts the SS information from the WaveLAN firmware. It then makes the information available to user-level applications via system calls. It uses the wlconfig utility, which provides a wrapper around the calls to extract the signal information.

2.2 Localization Methodology

The general principle states that if a WiFi-enabled mobile device is close to such a stationary device – an Access Point (AP) – it may "ask" the provider for its location by setting up a WiFi connection. If the mobile device knows the position of the stationary device, it also knows that its own position is within a 100-meter range of this location provider. The granularity of the location can be improved by triangulation of two or several visible WiFi APs. The PDA client supports the application in automatically retrieving location information from nearby location providers, and in interacting with the server. Naturally, this principle can be applied to other wireless technologies. The application (locator) is implemented in C# using MS Visual Studio .NET 2005 with the .NET Compact Framework and a special OpenNETCF library enhancement [6]. The schema in Fig. 1 describes the runtime localization process. Each star indicates a supposed user location that was measured and computed. The real track in the figure presents the real movement of the user over time, while the computed track is derived from the measured WiFi intensity levels.
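One simple way to combine the readings from several visible APs into a position estimate is a signal-strength weighted centroid, sketched below in C. This is only an illustration of the triangulation idea from Fig. 1 under assumed structures and weighting; it is not the actual Locator or PDPT server implementation (which is written in C#).

typedef struct { double x, y; int ss_dbm; } ap_reading_t;  /* known AP position + measured RSSI */

/* Weighted-centroid position estimate; returns 0 on success, -1 if no usable reading. */
int estimate_position(const ap_reading_t *ap, int n, double *px, double *py)
{
    double wsum = 0.0, x = 0.0, y = 0.0;
    for (int i = 0; i < n; ++i) {
        double w = (double)(ap[i].ss_dbm + 100);   /* map e.g. -90..-30 dBm to a positive weight */
        if (w <= 0.0) continue;
        x += w * ap[i].x;
        y += w * ap[i].y;
        wsum += w;
    }
    if (wsum == 0.0) return -1;
    *px = x / wsum;
    *py = y / wsum;
    return 0;
}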
2.3 WiFi Middleware

The WiFi middleware implements the client's side of the location determination mechanism on the Windows Mobile 2005 PocketPC operating system and is part of the PDA client application. The libraries used to manage the WiFi middleware are: AccessPoint, AccessPointCollection, Adapter, AdapterCollection, AdapterType, ConnectionStatus, Networking, NetworkType, and SignalStrength.

2.4 Predictive Data Push Technology

This part of the project is based on a model of location-aware enhancement which we have used in the created control system. This technique is useful in the framework to increase the real dataflow from the wireless access point (server side) to the PDA (client side). The primary dataflow is enlarged by data pre-buffering. These techniques form the basis of predictive data push technology (PDPT). PDPT copies data from the information server to the client's PDA so that they are available when the user arrives at the desired location. The benefit of PDPT consists in the reduction of the time needed to display information requested by a user command on the PDA. This time delay may vary from a few seconds to a number of minutes. It depends on two aspects. The first is the quality of the wireless Wi-Fi connection used by the client PDA. The theoretical speed of a Wi-Fi connection is at most 687 kB/s because of the protocol overhead on the physical layer (approx. 30-40%). However, the tests of the transfer rate from the server to the client's PDA, which we carried out within our Wi-Fi infrastructure, yielded speeds of only 80-160 kB/s (depending on file size and PDA device). The second aspect is the size of the copied data. The current application records just one set of signal strength measurements at a time (by the Locator unit in the PDPT Client). From this set of values the actual user position is determined by the PDPT server side. The PDPT core responds to a location change by selecting the artifacts to load into the PDPT client buffer. The data transfer speed is strongly influenced by the size of these artifacts; for larger artifacts the speed goes down. Theoretical background and tests were needed to determine an average artifact size. First of all, the maximum acceptable response time of the application (PDPT Client) had to be specified. The book "Usability Engineering" [8] specifies the maximum response time for an application as 10 seconds. During this time the user stays focused on the application and is willing to wait for an answer. We used this time period (10 seconds) to calculate the maximum possible size of a file transferred from the server to the client during this period. With a transfer speed of 80 to 160 kB/s the resulting file size is 800 to 1600 kB. The next step was the definition of the average artifact size. We used a sample database of network architecture building plans (AutoCAD file type), which contained 100 files with an average size of 470 kB. The client application can download 2 to 3 such artifacts during the 10-second period. The problem is the time needed for displaying them: in the case of the AutoCAD file type we measured this time to be 45 seconds on average. This time consumption is certainly not acceptable, and for this reason we looked for a better solution. We need to use a basic data format which can be displayed by the PDA natively (BMP, JPG, GIF) without any additional striking time consumption. The solution is a format conversion from any format to one native for PDA devices. In the case of sound and video formats we also recommend using basic data formats (wav, mp3, wmv, mpg).
The final result of our real tests and consequent calculations is the definition of the artifact size to an average value of 500 kB. The buffer size may range from 50 to 100 MB in the case of 100 to 200 artifacts.

2.5 PDPT Framework Data Artifact Management

The PDPT Server SQL database manages the information (for example data about Ethernet hardware such as Ethernet switches, UTP sockets, CAT5 cable leads, etc.) in the context of its location in the building environment. This context information is of the same kind as the location information about the user track. The PDPT core controls the data, which are copied from the server to the PDA client according to the context information (position info). Each database artifact must be saved in the database along with the position information to which it belongs.
Fig. 2. PDPT Framework data artifact management
During the creation of the PDPT Framework a new software application called "Data Artifacts Manager" was developed. This application manages the artifacts in the WLA database (localization oriented database). The user can set the priority, location, and other metadata of an artifact. This manager substitutes the online conversion mechanism, which can transform the real online control system data into WLA database data artifacts, during the test phase of the project. The manager can also be used in the case of an offline version of the PDPT Framework. The artifacts manager in this offline case is shown in Fig. 2.
The Manager allows the administrator to create a new artifact from a multimedia file (image, video, sound, etc.), and to edit or delete an existing artifact. The left side of the screen contains the text fields of the artifact metadata, such as a position in 3D space. This position is determined by the artifact size (in the case of a building plan) or by binding the artifact to some part of a building in 3D space. The 3D coordinates can be taken from a building plan by GIS software like Quantum GIS or by our own implementation [7]. The central part represents the multimedia file and the right side contains the buttons to create, edit, or delete the artifact. The lower part of the application screen shows the actual artifacts in the WLA database located on the SQL Server.

2.6 Framework Design

The PDPT framework design is based on the most commonly used server-client architecture. To process data, the server has an online connection to the control system. Technology data are continually saved to an SQL Server database [3] and [1].
Fig. 3. System architecture – UML design
The part of this database desired by the user's location or his demand is replicated online to the client's PDA, where it is visualized on the screen. The user's PDA has a location sensor component, which continuously sends information about the intensity of nearby APs to the framework kernel. The kernel processes this information and decides whether and which part of the SQL Server database will be replicated to the client's SQL Server CE database. The kernel decisions constitute the most important part of the whole framework, because the kernel must continually compute the position of the user and his track, and make a prediction of his future movement. After making this prediction, the appropriate data (a part of the SQL Server database) are pre-buffered to the client's database for possible future requirements. The PDPT framework server is created as a Microsoft web service to act as a bridge between the SQL Server and the PDPT PDA clients.
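The kernel's prebuffering decision can be sketched as follows: order the artifacts by their distance from the predicted user position and push them to the client buffer until the buffer limit is reached. The structures, the distance measure and the buffer policy below are illustrative assumptions, not the actual PDPT server code.

#include <math.h>
#include <stdlib.h>

typedef struct { int id; double x, y, z; long size_bytes; } artifact_t;

static double pred_x, pred_y, pred_z;            /* predicted user position */

static int by_distance(const void *a, const void *b)
{
    const artifact_t *p = a, *q = b;
    double dp = hypot(hypot(p->x - pred_x, p->y - pred_y), p->z - pred_z);
    double dq = hypot(hypot(q->x - pred_x, q->y - pred_y), q->z - pred_z);
    return (dp > dq) - (dp < dq);
}

/* Fills 'selected' with ids of artifacts to push; returns their count. */
int select_artifacts(artifact_t *all, int n, long buffer_limit, int *selected)
{
    long used = 0;
    int count = 0;
    qsort(all, n, sizeof(artifact_t), by_distance);   /* nearest artifacts first */
    for (int i = 0; i < n && used + all[i].size_bytes <= buffer_limit; ++i) {
        selected[count++] = all[i].id;
        used += all[i].size_bytes;
    }
    return count;
}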
Fig. 4. PDPT Client – Windows Mobile application
2.7 PDPT Client

For testing and tuning of the PDPT Core the PDPT Client application was created. This client realizes a classical client to the server side, extended by the PDPT and Locator modules. Fig. 4 shows the classical view of data presentation from the MS SQL CE database to the user (in this case an image of the Ethernet network in the company area plan). Each process running in the PDPT client is measured with millisecond resolution to provide feedback from the real situation. The PDPT and Locator tabs provide a way to tune the PDPT settings.
3 Experiments

We have executed a number of indoor experiments with the PDPT framework using the PDPT PDA application. WiFi access points are placed at different locations in the building, where the access point cells partly overlap. We have used the triangulation principle on the AP intensities to obtain a better granularity. It has been found that the location determination mechanism selects the access point that is closest to the mobile user as the best location provider. Also, after a loss of IP connectivity,
switching from one access point to another (a new best location provider) takes place within a second in the majority of cases, resulting in only a temporary loss of IP connectivity. This technique partially uses a special Radius server [4] to realize the "roaming" known from cellular networks. A user who has lost the existing AP signal is required to ask the new AP for an IP address. This is known as "renew" in Ethernet networks. At the end of this process, the user has his identical old IP and a connection to the new AP. Another technique to realize roaming is the use of WDS (Wireless Distribution System). Currently, the usability of the PDPT PDA application is somewhat limited due to the fact that the device has to be continuously powered. If it is not, the WiFi interface and the application cannot execute the location determination algorithm and the PDPT server does not receive location updates from the PDA.

3.1 Data Transfer Increase Tests Using PDPT Framework

The main result of utilizing the PDPT framework is a reduction of the data transfer time, i.e., an increase of the effective data transfer speed. The test is focused on the real usage of the developed PDPT Framework and its main benefit of increased data transfer.
Table 1. Data transfer tests description

Test | Type | Mode | Data | Time | Speed
1 | HTC Blueangel | SQL CE | 2949 | 5 | 643
2 | HTC Blueangel | SQL CE | 4782 | 9 | 2228
3 | HTC Blueangel | SQL | 2949 | 34 | 80
4 | HTC Blueangel | SQL | 4782 | 57 | 69
5 | HTC Blueangel | PDPT | 2949 | 12 | 234
6 | HTC Blueangel | PDPT | 4782 | 20 | 278
7 | HTC Universal | SQL CE | 2949 | 3 | 514
8 | HTC Universal | SQL CE | 4782 | 6 | 1782
9 | HTC Universal | SQL | 2949 | 21 | 51
10 | HTC Universal | SQL | 4782 | 38 | 64
11 | HTC Universal | PDPT | 2949 | 9 | 214
12 | HTC Universal | PDPT | 4782 | 16 | 2228
Table 1 summarizes a series of tests with different types of PDA and three types of data transfer mode. Each of these tests was repeated five times for better accuracy, and the data in the table are average values over the iterations. The mode column may contain three different data transfer modes. The SQL CE mode represents data saved in the mobile device memory (SQL Server CE), for which the data transfer time is very low. The second mode, SQL, represents data stored at the server (SQL Server 2005): primary data are loaded over Ethernet/Internet to the SQL Server CE of the mobile device, and only then are the data shown to the user. The time consumption of this method is generally very high, which results in a long waiting time for the user. The third mode, PDPT, is a combination of the previous two methods. The PDPT mode provides very good results in the form of a data transfer acceleration. The realization of this test consists of the user's movement from a sample location A to B along three different directions. Location B was a destination with requested data,
which was not contained in the SQL CE buffer of the mobile device before the test. The resulting time of this mode consists of the average time to view a collection of requested data artifacts on the client PDA. If the requested artifact is in the SQL CE buffer before the request, the time is very short. If the artifact is not present in the buffer, the PDPT client must download it from the server. The final time for the third method therefore represents the real usage result.

Acknowledgment. This work was supported by the Ministry of Education of the Czech Republic under Project 1M0587.
4 Conclusions

The main objective of this paper is the enhancement of a control system for locating and tracking users inside a building. It is possible to locate and track the users with a high degree of accuracy. In this paper we have presented a control system framework that uses and handles location information and control system functionality. The indoor location of a mobile user is obtained through an infrastructure of WiFi access points. This mechanism measures the quality of the link to nearby location provider access points to determine the actual user position. The user location is used in the core of the server application of the PDPT framework for data pre-buffering and pushing information from the server to the user's PDA. Data pre-buffering is the most important technique to reduce the time from a user request to the system response. The experiments show that the location determination mechanism provides a good indication of the actual location of the user in most cases. The median resolution of the system is approximately five meters. Some inaccuracy does not influence the way the localization is derived from the WiFi infrastructure, and, as discussed in the Experiments section, this was not found to be a big limitation for the PDPT framework application. The experiments also show that the current state of the basic technology which was used for the framework (mobile device hardware, PDA operating system, wireless network technology) is now at a level of high usability for the PDPT application [13].
References
1. Reynolds, J.: Going Wi-Fi: A Practical Guide to Planning and Building an 802.11 Network. CMP Books (2003)
2. Wigley, A., Roxburgh, P.: ASP.NET Applications for Mobile Devices. Microsoft Press, Redmond (2003)
3. Tiffany, R.: SQL Server CE Database Development with the .NET Compact Framework. Apress (2003)
4. The Internet Engineering Task Force RADIUS Working Group, http://www.ietf.org/
5. The Wireless World Research Forum (WWRF), http://www.wireless-world-research.org/
6. OpenNETCF - Smart Device Framework, http://www.opennetcf.org
7. Horak, J., Unucka, J., Stromsky, J., Marsik, V., Orlik, A.: TRANSCAT DSS architecture and modelling services. Control and Cybernetics 35, 47–71 (2006)
8. Nielsen, J.: Usability Engineering. Morgan Kaufmann, San Francisco (1994)
9. Krejcar, O.: Prebuffering as a way to exceed the data transfer speed limits in mobile control systems. In: ICINCO 2008, 5th International Conference on Informatics in Control, Automation and Robotics. INSTICC Press, Funchal (2008)
10. Evennou, F., Marx, F.: Advanced integration of WiFi and inertial navigation systems for indoor mobile positioning. EURASIP Journal on Applied Signal Processing, Hindawi Publishing Corp., New York (2006)
11. Olivera, V., Plaza, J., Serrano, O.: WiFi localization methods for autonomous robots. Robotica 24, 455–461 (2006)
12. Salazar, A.: Positioning Bluetooth (R) and Wi-Fi (TM) systems. IEEE Transactions on Consumer Electronics 50, 151–157 (2004)
13. Janckulik, D., Krejcar, O., Martinovic, J.: Personal Telemetric System – Guardian. In: Biodevices 2008, pp. 170–173. INSTICC, Funchal (2008)
Network Traffic Classification by Common Subsequence Finding

Krzysztof Fabjański and Tomasz Kruk

NASK, The Research Division, Wąwozowa 18, 02-796 Warszawa, Poland
{krzysztof.fabjanski,tomasz.kruk}@nask.pl
http://www.nask.pl
Abstract. The paper describes issues related to network traffic analysis. The scope of this article includes a discussion of the problem of network traffic identification and classification. Furthermore, the paper presents two bioinformatics methods: Clustal and Center Star. Both methods were carefully adapted to the network security purpose. In both methods, the concept of extraction of a common subsequence, based on multiple sequence alignment of more than two network attack signatures, was used. This concept was inspired by bioinformatics solutions for problems related to finding similarities in a set of DNA, RNA or amino acid sequences. Additionally, the scope of the paper includes a detailed description of the test procedures and their results. At the end some relevant evaluations and conclusions regarding both methods are presented. Keywords: network traffic analysis, anomaly detection, network intrusion detection systems, common subsequence finding, bioinformatics algorithms, Clustal algorithm, Center Star method, automated generation of network attack signatures
1 Introduction

The Internet became one of the most popular tools, used by almost everyone. It is important to mention that the Internet and the World Wide Web (WWW) are not synonymous: the World Wide Web is one of the many services available in the Internet. The Internet consists of an enormous number of computer networks. Therefore, the issue of network security is very important. The network security issue is not only a set of security methods required for ensuring safety; it also consists of elements related to the network security policy which should be obeyed. Different institutions and companies introduce their private security policies. Often, security policies are defined according to some known standards. Unfortunately, this approach does not guarantee that precious resources will remain unaffected. Other, more sophisticated methods should be introduced. One of the most recognized families of systems are network intrusion detection systems. This group of systems allows alerting about unwanted and malicious activity registered in the network flow. The process of identifying a malicious network flow involves comparing the network flow content with a predefined set of rules. The set of rules,
also known as a set of network attack signatures, describes different Internet threats by mapping their content into a specific format. Despite these methods of malicious flow identification, there are many new Internet threats which have not been discovered yet. Fortunately, there are methods and heuristic approaches which allow the identification of new Internet threats by following different network trends and statistics. Although those methods are very promising, there is still a huge requirement for new algorithms. Those algorithms should be capable of analysing a huge portion of attack signatures for network intrusion detection systems, produced in an automatic manner. In order to support this process, some new approaches were proposed. One of the ways of analysing attack signature collections is the bioinformatics approach. Multiple sequence alignment is a fundamental tool in bioinformatics analysis. This tool allows finding similarities embedded in a set of DNA, RNA or amino acid sequences. The bioinformatics approach can be adapted to the network traffic identification and classification problem. The second section of this article presents different systems for network traffic analysis. Section three briefly develops two bioinformatics methods: Center Star and Clustal. The fourth section includes various test results. The last section discusses algorithm complexity and their suitability for network traffic analysis.

2 Network Traffic Classification and Identification Problem

Computer threats are often the cause of unwanted incidents, which might cause irreversible damage to the system. From the scientific point of view, computer threats use certain vulnerabilities; therefore threats, vulnerabilities and exposures should be considered as disjoint issues. One of the most popular and widely present groups of Internet threats is the Internet worm [1]. Intrusion detection systems (IDS) [2] detect malicious network flows mainly by analyzing their content. An example of IDS are network intrusion detection systems (NIDS). NIDS are able to detect many types of malicious network traffic, including worms. One of the most popular NIDS is Snort. It is an open source program available for various architectures. It is equipped with a regular expression engine which enhances the network traffic analysis. It analyses the network flow by comparing its content (not only the payload) with a specific list of rules. During this process, Snort utilises the regular expression engine. As a result of this analysis, Snort makes a decision regarding a particular network flow, whether it is malicious or regular. An example of a simple Snort rule is shown in Table 1.
2 Network Traffic Classification and Identification Problem Computer threats are often a reason of unwanted incidents, which might cause irreversible damage to the system. From the scientific point of view, computer threats use certain vulnerabilities, therefore threats, vulnerabilities and exposures should be considered as a disjointed issue. One of the most popular and widely present group of Internet threats is Internet Worm [1]. Intrusion detection systems (IDS) [2] detect mainly malicious network flow by analyzing its content. An example of IDS are network intrusion detection systems (NIDS). NIDS are able to detect many types of malicious network traffic including worms. One of the most popular NIDS is Snort. It is an open source program available for various architectures. It is equipped with the regular expression engine which enhance the network traffic analysis. It analyses the network flow by comparing its content (not only a payload) with a specific list of rules. During this process, Snort utilises the regular expression engine. As a result of this analysis, Snort makes a decision regarding a particular network flow, whether it is malicious or regular. An example of simple Snort rule is shown in the table (Table 1). Table 1. An exemplary Snort rule alert udp $EXTERNAL_NET 2002 -> $HTTP_SERVERS 2002 ( msg:"MISC slapper worm admin traffic"; content:"|00 00|E|00 00|E|00 00|@|00|"; depth:10; reference:url,isc.incidents.org/analysis.html?id=167; reference:url,www.cert.org/advisories/CA-2002-27.html; classtype:trojan-activity; sid:1889; rev:5;)
Snort works as a single-threaded application. Its action is to receive, decode and analyse the incoming packets. Snort allows us to identify unwanted malicious network flows by generating appropriate alerts. The main problem is that if the set of rules for Snort has a poor quality, we can expect many false positive or false negative alerts. Therefore, the classification of network attack signatures as well as improving their quality is a matter of great importance. Very often NIDS are combined with systems for automated generation of network attack signatures, such as Honeycomb [3,4]. Tools which join the functions of a NIDS and an automated signature creation system are known as network early warning systems (NEWS). An exemplary network early warning system is Arakis [5]. NEWS develop very sophisticated methods for network traffic classification in order to speed up the process of identification of potential new Internet threats. The main problem concerning classification and identification of network flows is related to the extraction of common regions from sets of network attack signatures [6]. Many techniques were developed. One of the techniques allowing network security specialists to distinguish the regular network flow from the suspicious one is the usage of honeypots [7]. A honeypot is a specially designed system which simulates some network resources in order to capture the malicious flow. Generally, it consists of a part of an isolated, unprotected and monitored network with some virtually simulated computers which seem to have valuable resources or information. Therefore, flow which occurs inside the honeypot is assumed to be malicious by definition. Protocol analysis and pattern-detection techniques performed on flows collected by honeypots result in network attack signature generation. Generation of network attack signatures is mainly based on longest common substring extraction [4,8]. One of the tools allowing generation of network attack signatures is Honeycomb.
3 Sequence Alignment

Sequence alignment is a tool that can be used for extracting common regions from a set of network attack signatures [6]. Extraction of common regions is illustrated in Fig. 1. The task is somewhat similar to a biologist's: a biologist identifies newly discovered genes by comparing them to families of genes whose function is already known, assigning the new genes to known families by common subsequence finding. The problem of extracting common regions from a set of network attack signatures is in fact the multiple sequence alignment (MSA) [9] problem. MSA is a generalization of pairwise alignment [10]: gaps are inserted into each string so that the resulting strings have equal length.

Fig. 1. Problem of the longest common subsequence finding (alignment of the HTTP request strings "GET / HTTP", "GET /a/a.HTM HTTP" and "GET / HTTP/1.1")

Although the problem of multiple sequence
alignment is an NP-complete task, there are many heuristic, probabilistic and other approaches that cope with the issue; a classification of those methods was proposed in [10]. Among the many algorithms, two classical approaches were chosen. The first, and probably the most basic, is the Center Star method. It was chosen for the network traffic identification purpose; the main goal of this adaptation was to check whether the method can be used for extracting common regions from network attack signatures. The second algorithm, required for the classification of network attack signatures, is Clustal. It is worth mentioning that in both algorithms a global alignment was used, computed with the Needleman-Wunsch algorithm [20].

3.1 Center Star Method

The Center Star method [11] belongs to the group of algorithms with some elements of approximation. As mentioned before, multiple sequence alignment is an NP-complete problem; the Center Star method approximates the multiple sequence alignment, so the results it produces may, but do not have to, be optimal. The method consists of three main steps. A detailed description of the Center Star method can be found in [11].

3.2 Clustal Algorithm

Clustering is a method which classifies objects into appropriate groups (clusters). Classification is performed based on a defined distance measure, and every object in a single cluster should share a common trait. Data clustering is widely used in many fields of science, among them data mining, pattern recognition and bioinformatics. Data clustering algorithms can be divided into two main categories:
– hierarchical methods – assign objects to particular clusters by measuring the distance between them,
– partitioning algorithms – start with an initial partition and then optimize an objective function by an iterative control strategy; every cluster is represented by its gravity center or by its center object. In the partitioning approach, new clusters are generated and then the new cluster centers are recomputed.

Among hierarchical methods, in turn, we can distinguish two types:
– agglomerative – the clustering procedure begins with each element as a separate cluster; by merging them into larger clusters, we reach the point where all elements are classified into one big cluster,
– divisive – starts the process from one big set and then divides it into successively smaller subsets.

Clustal is an example of an agglomerative algorithm, also known as a "bottom-up" approach. During the implementation of the Clustal algorithm some modifications were introduced in order to adapt the method for the classification of network attack signatures. Instead of a profile representation of the internal nodes of the dendrogram, a consensus sequence was used. This was caused mainly by the fact that so far the
scoring scheme used for network traffic classification has a very basic structure. Assuming that a network flow can be represented as a sequence of extended ASCII characters, we score 1 for each match and 0 otherwise. Although this standard objective function is the only reasonable solution at the moment, there is ongoing research [12] which may result in a new scoring scheme proposition.
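As a hedged illustration of the global alignment used here, the following self-contained C++ sketch implements the Needleman-Wunsch dynamic program with exactly this basic scoring scheme (match = 1, mismatch = 0, no gap penalty). It is not the implementation evaluated in this paper; the function and type names are our own.

// Needleman-Wunsch global alignment over extended-ASCII strings with the
// basic scoring scheme described above (1 per match, 0 otherwise, gap = 0).
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Alignment { std::string a, b; int score; };

Alignment needlemanWunsch(const std::string& x, const std::string& y,
                          int match = 1, int mismatch = 0, int gap = 0) {
    size_t n = x.size(), m = y.size();
    std::vector<std::vector<int>> S(n + 1, std::vector<int>(m + 1, 0));
    for (size_t i = 0; i <= n; ++i) S[i][0] = static_cast<int>(i) * gap;
    for (size_t j = 0; j <= m; ++j) S[0][j] = static_cast<int>(j) * gap;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            int diag = S[i-1][j-1] + (x[i-1] == y[j-1] ? match : mismatch);
            S[i][j] = std::max({diag, S[i-1][j] + gap, S[i][j-1] + gap});
        }
    // Traceback to recover one optimal alignment (gaps shown as '-').
    Alignment al; al.score = S[n][m];
    size_t i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            S[i][j] == S[i-1][j-1] + (x[i-1] == y[j-1] ? match : mismatch)) {
            al.a += x[--i]; al.b += y[--j];
        } else if (i > 0 && S[i][j] == S[i-1][j] + gap) {
            al.a += x[--i]; al.b += '-';
        } else {
            al.a += '-'; al.b += y[--j];
        }
    }
    std::reverse(al.a.begin(), al.a.end());
    std::reverse(al.b.begin(), al.b.end());
    return al;
}

int main() {
    Alignment al = needlemanWunsch("GET / HTTP", "GET /a/a.HTM HTTP/1.1");
    std::cout << al.a << "\n" << al.b << "\nscore = " << al.score << "\n";
}

With this scoring scheme the alignment score equals the length of the longest common subsequence of the two signatures, which is why the precision measure used later compares MSA and LCS lengths.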
4 Tests and Results

This section provides a detailed description of the efficiency tests of the Center Star method and the Clustal algorithm. Tests were executed on an Intel(R) Xeon(TM) 3.00 GHz computer equipped with 2075808 kB of total memory; the compiler was g++ (v4.1). In the test procedure, external data sets extracted from the Arakis database were used. Since the data came from the Arakis database, some test results could be confronted with the results of the Arakis algorithms. The data used in the tests consisted of real network signatures suspected to be malicious. The Arakis algorithms are mainly based on the DBSCAN [19] clustering mechanism and an edit distance measure.

4.1 Center Star Method Tests

Figure 2(a) presents the experimental data set. The horizontal axis represents the total number of characters (counted as the total sum of network attack signature lengths); the vertical axis represents the actual number of processed signatures. This data set was used in the Center Star method tests. Figure 2(b) shows the execution time of the Center Star method, measured in seconds. Figure 2(c) reflects the relation between the length of the multiple sequence alignment (MSA) and the common subsequence extracted from the MSA; this relation gives information about the total length of the extracted subsequence. In the next test, we measured the average length of a single division in one signature. Assuming that a single network attack signature may consist of many parts, this test provided approximate information about the quality of the extracted common subsequence: the greater the average length of a single division in a network attack signature, the lower the probability of false positive or false negative alerts. The Center Star method was also compared with the Arakis algorithms; the results are shown in Fig. 2(d). In some cases the Arakis algorithm seems to obtain better results than the Center Star algorithm. Those situations were investigated precisely, and it turned out that the reason lay in a different interpretation of the cluster representatives: in some cases the Arakis algorithm does not update the cluster representatives even if some very long network attack signatures have expired, which results in overestimating the average single division length of the common subsequence.

4.2 Clustal Algorithm Tests

Most of the Clustal tests were performed in order to compare the results with those of the Arakis algorithm. The data used in the tests are shown in Fig. 3(a).
Fig. 2. Center Star method tests: (a) the number of characters vs the number of signatures; (b) the Center Star method execution time; (c) the MSA and LCS length relation; (d) the average single division length of the common subsequence, Arakis algorithm vs Center Star method
The next test (Fig. 3(b)) investigates the Clustal algorithm execution time with respect to the total number of processed characters; time was measured in seconds. Figure 4 presents the comparison of the Arakis clustering algorithm with the Clustal method. The comparison was made in order to show the main advantage of the Clustal algorithm, which is the possibility of adjustment. Two subfigures (a,c) present the number of clusters produced by the Arakis and Clustal algorithms. In subfigure (a) we can see that the Clustal algorithm produces a smaller number of clusters than the Arakis solution; however, the precision in that test was rather poor (b). The smaller number of clusters was achieved using EPS1 = 0.01. EPS1 is an epsilon which determines whether the investigated signature should be classified to a particular cluster: the condition is considered satisfied when the distance (the Levenshtein distance [18]) between two signatures is greater than EPS1. Precision is expressed as the ratio of the MSA length to the LCS length; the closer the ratio is to 1, the better the precision. Better precision is obtained at the cost of a greater number of clusters. In the next subfigures (c,d), EPS1 was set to 0.9. For this value, the Clustal algorithm produces many more clusters than the Arakis algorithm.
Fig. 3. Clustal algorithm tests: (a) the number of characters vs the number of signatures; (b) the Clustal algorithm execution time

On the other hand, the precision
gained in those two tests was very high. All four subfigures (a,b,c,d) were generated by running the Clustal algorithm with the standard scoring scheme, i.e., 1 for each match and 0 otherwise. The parameters MATCH, MISMATCH and GAP_PENALTY were set according to this standard scoring scheme. The reason why the gap penalty had the same value as the mismatch score is straightforward: so far there is no scoring scheme for the ASCII alphabet, so only the trivial approach is presented, in which gap penalties are not considered. In an extended test procedure, different values for the gap penalties were assigned; those results are preliminary and thus are not published in this paper.
5 Evaluation and Conclusions

In this section, a detailed estimation of the main methods is given. The estimation is based on theoretical assumptions and confronted with the empirical implementation of both methods.

5.1 Center Star Method Complexity

The Center Star method consists of three main phases, after which the multiple sequence alignment is found. In the first phase, all pairwise alignments are formed (distance matrix calculation). The complexity of this phase, in the worst case, is O((N² + 3N) · K/2), where K is the number of input signatures. The second phase is related to finding the signature which is "the closest" to the others; this step requires O(K). In the last phase, the multiple sequence alignment is formed; the computational complexity of this last phase of the Center Star method is O(2N · K/2). The Center Star method provides essential functionality in the common motif finding process: it allows us to extract the common subsequence from the multiple sequence alignment. This procedure requires O(KN).
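To make these phases concrete, the following self-contained C++ sketch shows the skeleton of the method: pairwise scoring (here a simple LCS-length dynamic program stands in for the full pairwise alignment), center selection, and the final common-subsequence extraction over the columns of an already built MSA. The merge step that actually constructs the MSA is omitted, and this is an illustration, not the implementation evaluated above.

// Skeleton of the Center Star phases; names and the stand-in scoring are our own.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Phase-1 stand-in: pairwise similarity as LCS length (match = 1, else 0).
static int pairScore(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1, 0));
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = (a[i-1] == b[j-1]) ? d[i-1][j-1] + 1
                                         : std::max(d[i-1][j], d[i][j-1]);
    return d[a.size()][b.size()];
}

// Phase 2: the center is the signature with the highest summed pair score.
static size_t pickCenter(const std::vector<std::string>& sigs) {
    size_t best = 0; long bestSum = -1;
    for (size_t i = 0; i < sigs.size(); ++i) {
        long sum = 0;
        for (size_t j = 0; j < sigs.size(); ++j)
            if (i != j) sum += pairScore(sigs[i], sigs[j]);
        if (sum > bestSum) { bestSum = sum; best = i; }
    }
    return best;
}

// Common-subsequence extraction: keep the MSA columns (equal-length rows,
// gap character '-') in which every row agrees; O(KN) as stated above.
static std::string commonSubsequence(const std::vector<std::string>& msa) {
    std::string out;
    if (msa.empty()) return out;
    for (size_t col = 0; col < msa[0].size(); ++col) {
        char c = msa[0][col];
        bool same = (c != '-');
        for (size_t row = 1; same && row < msa.size(); ++row)
            same = (msa[row][col] == c);
        if (same) out += c;
    }
    return out;
}

int main() {
    std::vector<std::string> sigs = {"GET / HTTP", "GET /a/a.HTM HTTP", "GET / HTTP/1.1"};
    std::cout << "center signature: " << sigs[pickCenter(sigs)] << "\n";
    std::vector<std::string> msa = {"GET /------- HTTP----",
                                    "GET /a/a.HTM HTTP----",
                                    "GET /------- HTTP/1.1"};
    std::cout << "common subsequence: " << commonSubsequence(msa) << "\n";
}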
Fig. 4. Number of clusters vs precision: comparison of the Arakis algorithm with the Clustal algorithm. (a) number of clusters and (b) pattern length (MSA and LCS), with dist = 1, MATCH = 1, MISMATCH = 0, GAP_PENALTY = 0, EPS1 = 0.01; (c) number of clusters and (d) pattern length, with the same scoring parameters and EPS1 = 0.9
5.2 Clustal Algorithm Complexity

The Clustal algorithm contains complicated and time-consuming procedures: distance matrix calculation, dendrogram creation and the clustering mechanism. These three phases have the following computational complexities:
1. Distance matrix calculation – O((N² + 3N) · K/2)
2. Dendrogram creation – O(K/2 + 2(K − 1) · [K/2 + N + 4(N + K)])
3. Clustering (reading the dendrogram and writing the clusters to a file) – O(K)

All calculations regarding computational complexity were based on theoretical assumptions and source code analysis. The run-time dependencies shown in Fig. 3(b) seem to confirm the results, and the presented computational complexities do not contradict the theoretical assumptions regarding the complexities of the presented methods. To sum up, the Clustal and Center Star algorithms have their advantages and disadvantages. One of the biggest drawbacks of both algorithms is their high run-time complexity; on the other hand, the whole task is an NP-complete problem, so we cannot expect a better run-time complexity. Both Clustal and the Center Star method can be modified in order to decrease this complexity. In the Center Star method, instead of finding
all pairwise alignments, we could take a randomly selected sequence from the set of input signatures and form the multiple sequence alignment by computing the pairwise alignments of the chosen sequence with the remaining sequences. As a result, we would omit the process of choosing the center sequence, which involves computing all pairwise alignments in the set of network attack signatures. In the Clustal algorithm, on the other hand, instead of using the Neighbor-Joining algorithm [13,15,16,17] for dendrogram creation, we could use the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [14]. UPGMA is faster than the Neighbor-Joining algorithm at the expense of precision. These improvements lead to a better time complexity but result in worse common subsequence extraction and worse network traffic classification. In our case, a better time complexity might turn out to be more important than a somewhat worse common subsequence extraction: extraction of the common subsequence during the preprocessing phase should be performed in online mode, whereas clustering of already created signatures must be performed in offline mode. This paper covers the classification and identification of network attack signatures only; no tests checking the influence on the number of false positive or false negative alerts have been performed. Such experiments will be carried out once we have finally shown that bioinformatics methods are suitable for suspicious network traffic analysis. In further work, the adapted methods will be developed continuously. Although the results of the tests performed on real network traffic data are very promising, there is still the open issue of proposing a new scoring function. Further work will therefore focus on statistics of network traffic. Such statistics will make it possible to represent particular families of Internet threats as profile structures. Profile structures will allow us to create scoring matrices similar to those known from bioinformatics and to deal with Internet threats such as polymorphic worms; furthermore, profiles will allow us to identify those regions in network traffic patterns which remain unchanged even in the case of polymorphic Internet threats.
References
1. Nazario, J.: Defense and Detection Strategies against Internet Worms. Artech House, Boston & London (2004)
2. Kreibich, C., Crowcroft, J.: Automated NIDS Signature Creation using Honeypots. University of Cambridge Computer Laboratory (2003)
3. Kreibich, C., Crowcroft, J.: Honeycomb – Creating Intrusion Detection Signatures Using Honeypots. In: Proceedings of the Second Workshop on Hot Topics in Networks (HotNets-II), ACM SIGCOMM, Cambridge, Massachusetts (2003)
4. Rzewuski, C.: Bachelor's Thesis: SigSearch – automated signature generation system (in Polish). Warsaw University of Technology, The Faculty of Electronics and Information Technology (2005)
5. Kijewski, P., Kruk, T.: Arakis – a network early warning system (in Polish) (2006)
6. Kreibich, C., Crowcroft, J.: Efficient sequence alignment of network traffic. In: IMC 2006: Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pp. 307–312. ACM Press, Brazil (2006)
7. Bakos, G., Beale, J.: Honeypot Advantages & Disadvantages, Las Vegas, pp. 7–8 (November 2002)
8. Kreibich, C.: libstree – A generic suffix tree library, http://www.icir.org/christian/libstree/
9. Gusfield, D.: Efficient method for multiple sequence alignment with guaranteed error bound. Report CSE-91-4, Computer Science Division, University of California, Davis (1991)
10. Reinert, K.: Introduction to Multiple Sequence Alignment. Algorithmische Bioinformatik WS 03, 1–30 (2005)
11. Bioinformatics Multiple sequence alignment, http://homepages.inf.ed.ac.uk/fgeerts/course/msa.pdf
12. Kharrazi, M., Shanmugasundaram, K., Memon, N.: Network Abuse Detection via Flow Content Characterization. In: IEEE Workshop on Information Assurance and Security, United States Military Academy (2004)
13. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 2, 406–425 (1987)
14. Tajima, F.: A Simple Graphic Method for Reconstructing Phylogenetic Trees from Molecular Data. In: Reconstruction of Phylogenetic Trees, Department of Population Genetics, National Institute of Genetics, Japan, pp. 578–589 (1990)
15. The Neighbor-Joining Method, http://www.icp.ucl.ac.be/~opperd/private/neighbor.html
16. Weng, Z.: Protein and DNA Sequence Analysis BE561. Boston University (2005)
17. Multiple alignment: heuristics, http://www.bscbioinformatics.com/Stu/Dbq/clustalW.pdf
18. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 707–710 (1966)
19. Ester, M., Kriegel, H., Sander, J., Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 1996), Institute for Computer Science, University of Munich (1996)
20. Fabjański, K.: Master's Thesis: Network Traffic Classification by Common Subsequence Finding. Warsaw University of Technology, The Faculty of Electronics and Information Technology, Warsaw (2007)
A Hierarchical Leader Election Protocol for Mobile Ad Hoc Networks

Orhan Dagdeviren¹ and Kayhan Erciyes²

¹ Izmir Institute of Technology, Computer Eng. Dept., Urla, Izmir TR-35340, Turkey
[email protected]
² Ege University, International Computer Institute, Bornova, Izmir TR-35100, Turkey
[email protected]
Abstract. Leader election is an important problem in mobile ad hoc networks and in distributed computing systems. In this study, we propose a hierarchical, cluster based protocol to elect a leader in a mobile ad hoc network. The initial phase of the protocol employs a clustering algorithm to group the nodes of the network, after which a leader for each cluster (clusterhead) is elected. The second phase forms a connected ring of these leaders using the Ring Formation Algorithm. Finally, the Chang-Roberts Leader Election Algorithm for rings is employed in the final phase to elect the super-leader among the clusterheads. We provide performance results of this protocol for various mobility parameters and analyze its time and message complexities.

Keywords: leader election, Chang-Roberts algorithm, mobile ad hoc networks.
1 Introduction
Leader election is a fundamental problem addressed by many researchers in distributed systems. The problem was first introduced by LeLann, who also proposed a solution for it on a unidirectional ring [1]. Chang and Roberts improved this solution and reduced the average message complexity [2]. Various solutions have also been proposed for bidirectional rings and arbitrary networks [3,4,5,6]. Mobile Ad hoc NETworks (MANETs) are a class of distributed networks which do not have a fixed topology and in which the nodes communicate using temporary connections with their neighbors. Leader election in MANETs is a relatively new research area. Malpani et al. [7] propose two leader election protocols based on TORA. The algorithms select a leader for each connected component; the first algorithm is designed to tolerate a single topology change, whereas the second tolerates concurrent topology changes. The nodes only exchange messages with their neighbors, which makes the protocol suitable for MANETs. The authors show the proof of correctness but they do not
give any simulation results. Vasudevan et al. [8] propose a weakly self-stabilizing and terminating leader election protocol for MANETs. Their algorithm uses the concept of diffusing computations; they show the proof of correctness using temporal logic but also do not give any simulation results. Pradeep et al. [9] propose a leader election algorithm similar to that of Malpani et al.; they use the Zone Routing Protocol and show the proof of correctness of their algorithm. Masum et al. [10] propose a consensus based asynchronous leader election algorithm for MANETs and claim that their algorithm is adaptive to link failures. All of these algorithms elect leader(s) from ordinary nodes. Our algorithm, on the other hand, elects one super-leader from previously selected leaders. Cokuslu and Erciyes [11] proposed a two level leader election hierarchy for MANETs. Their protocol is based on constructing dominating sets for clustering and electing the super clusterhead from the subset of connected clusterheads. Since the clusterheads must be connected, the super clusterhead selection protocol is restricted by the underlying protocol; moreover, dominating set construction is an expensive operation under high mobility. Our Mobile CR protocol can use any clustering and routing protocol under BFA and remains stable under high mobility and density. In this study, we propose a Leader Election Protocol (LEP) that has three layers (phases) for MANETs. At the lowest layer, a clustering algorithm divides the MANET into balanced clusters, using the previously designed Merging Clustering Algorithm (MCA) [12]. The second layer employs the Backbone Formation Algorithm (BFA), which provides a virtual ring architecture of the leaders of the clusters formed by MCA [12]. Finally, using the Mobile Chang-Roberts Leader Election Algorithm (Mobile CR), the super-leader among the leaders elected in the second phase is elected. We show experimentally and theoretically that the protocol is scalable and has favorable performance with respect to time and message complexities. The rest of the paper is organized as follows. Section 2 provides the background and the proposed architecture is outlined in Section 3. Section 4 describes the extended Chang-Roberts algorithm on the proposed model, called Mobile CR. The implementation results are explained in Section 5 and the discussions and conclusions are outlined in Section 6.
2 Background

In this section we explain clustering using MCA and backbone formation using BFA to show the underlying mechanism of LEP.

2.1 Clustering Using the Merging Clustering Algorithm
An undirected graph is defined as G = (V, E), where V is a finite nonempty set and E ⊆ V × V. V is the set of nodes v and E is the set of edges e. A graph G_S = (V_S, E_S) is a spanning subgraph of G = (V, E) if V_S = V. A spanning tree of a graph is an undirected connected acyclic spanning subgraph. Intuitively, a minimum spanning tree (MST) for a graph is a subgraph that has the minimum
number of edges for maintaining connectivity. The Merging Clustering Algorithm (MCA) [12] finds clusters in a MANET by merging clusters to form higher level clusters, as in the Gallagher, Humblet and Spira algorithm [6]. However, we focus on the clustering operation and discard the minimum spanning tree, which reduces the message complexity as explained in [12]. The second contribution is the use of upper and lower bound parameters for the clustering operation, which results in a balanced number of nodes in the clusters formed. The protocol was simulated in ns2 and shown to give stable results under varying density and mobility.
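The published MCA is a distributed, message-driven protocol; the following centralized C++ sketch only illustrates the underlying idea of merging neighbouring clusters with a union-find structure while respecting an upper bound on cluster size. The topology, the merge order and the bound value are assumptions made for the example.

// Centralized illustration (not the distributed MCA itself) of bounded cluster merging.
#include <iostream>
#include <numeric>
#include <utility>
#include <vector>

struct Clusters {
    std::vector<int> parent, size;
    explicit Clusters(int n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);   // each node starts alone
    }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    // Merge the clusters of u and v only if the result stays within `upper`.
    bool merge(int u, int v, int upper) {
        int a = find(u), b = find(v);
        if (a == b || size[a] + size[b] > upper) return false;
        parent[b] = a;
        size[a] += size[b];
        return true;
    }
};

int main() {
    // A small example topology given as edges; upper bound of 3 nodes per cluster.
    std::vector<std::pair<int, int>> edges = {{0,1},{1,2},{2,3},{3,4},{4,5},{5,0}};
    Clusters c(6);
    for (auto& e : edges) c.merge(e.first, e.second, 3);
    for (int v = 0; v < 6; ++v)
        std::cout << "node " << v << " -> cluster " << c.find(v) << "\n";
}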
2.2 Backbone Formation Algorithm
The Backbone Formation Algorithm constructs a backbone architecture on a clustered MANET [13]. Unlike other algorithms, the backbone is constructed as a directed ring architecture to exploit the advantages of this topology and to give better services to other middleware protocols. The second contribution is to connect the clusterheads of a balanced clustering scheme, which satisfies two essential needs of clustering: balanced clusters and minimized routing delay. As a third contribution, the backbone formation algorithm is fault tolerant. The main idea is to maintain a directed ring architecture by periodically constructing a minimum spanning tree between clusterheads and classifying clusterheads as BACKBONE or LEAF nodes. To maintain these structures, each clusterhead broadcasts a Leader Info message by flooding; in this phase, cluster member nodes act as routers to transmit Leader Info messages. The algorithm has two modes of operation: a hop-based and a position-based backbone formation scheme. In the hop-based scheme, the minimum number of hops between clusterheads is used in the minimum spanning tree construction; minimum hop counts can be obtained during the flooding scheme. For highly mobile scenarios, an agreement between clusterheads must be maintained to guarantee consistent hop information. In the position-based scheme, the positions of the clusterheads are used to construct the minimum spanning tree. If each node knows its velocity and the direction of its velocity, this information can be appended with a timestamp to the Leader Info message to construct a better minimum spanning tree, but in this mode nodes must be equipped with a position tracker such as a GPS receiver. The backbone formation algorithm has been implemented on top of MCA using the ns2 simulator, and the results under varying MANET conditions are shown to be stable [13].
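The position-based mode can be illustrated with a small centralized sketch: Prim's algorithm over clusterhead coordinates with Euclidean distance as the edge weight. This is only the MST part of the idea; the real BFA is distributed, also classifies nodes as BACKBONE or LEAF, and the coordinates below are made up for the example.

// Prim's MST over clusterhead positions (illustration of the position-based scheme).
#include <cmath>
#include <iostream>
#include <limits>
#include <vector>

struct Point { double x, y; };

static double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Returns parent[i] for every clusterhead i (parent of the root is -1).
std::vector<int> primMST(const std::vector<Point>& heads) {
    const double INF = std::numeric_limits<double>::infinity();
    size_t n = heads.size();
    std::vector<int> parent(n, -1);
    std::vector<double> key(n, INF);
    std::vector<bool> inTree(n, false);
    key[0] = 0.0;
    for (size_t it = 0; it < n; ++it) {
        size_t u = n;
        for (size_t v = 0; v < n; ++v)      // pick the cheapest node not yet in the tree
            if (!inTree[v] && (u == n || key[v] < key[u])) u = v;
        inTree[u] = true;
        for (size_t v = 0; v < n; ++v)      // relax edges out of u
            if (!inTree[v] && dist(heads[u], heads[v]) < key[v]) {
                key[v] = dist(heads[u], heads[v]);
                parent[v] = static_cast<int>(u);
            }
    }
    return parent;
}

int main() {
    std::vector<Point> heads = {{39, 47}, {318, 20}, {267, 283}, {92, 468}, {343, 512}};
    std::vector<int> parent = primMST(heads);
    for (size_t i = 1; i < heads.size(); ++i)
        std::cout << "clusterhead " << i << " attaches to " << parent[i] << "\n";
}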
3 The Proposed Architecture
We propose a four layer architecture for MANETs as shown in Fig. 1; implementations of other higher level functions on top of the lower three layers are possible. The lowest layer is the routing layer, in which AODV [14] is used; other routing protocols could also be used. AODV was chosen since it is a widely used routing protocol which also has a stable ns2 release. The second layer is where
the clustering takes place, at the end of which balanced clusters are formed. Our clustering layer ensures that nodes in the vicinity of each other join the same cluster, reducing routing overhead. The third layer takes these clusters as input and forms a virtual ring of the leaders of these clusters. Finally, the fourth layer is the implementation of the Mobile CR Algorithm on top of these three layers.
Fig. 1. The Proposed Architecture: the Mobile Chang-Roberts Algorithm on top of the Backbone Formation Algorithm, the Merging Clustering Algorithm and Ad hoc On Demand Distance Vector routing
4 Mobile Chang-Roberts Algorithm
The Chang-Roberts Algorithm is an asynchronous leader election algorithm for unidirectional ring networks. Assuming each process can be either red, meaning a potential candidate for becoming the leader, or black, meaning a resigned state, an informal description of the algorithm is as follows. Any red process can initiate the algorithm; however, if a red process receives a token before initiating the algorithm, it resigns by turning black [15]. Non-initiators remain black and act as routers. A process that receives a token with a higher id than its own removes the token. A token with a lower id is forwarded to the next node, and if a token reaches its originator, it has a lower id than all others and the originator can then declare itself the leader.

1. Initially all initiator processes are red.
2. For each initiator i, token <i> is sent to its neighbor;
3. do (for every process i)
4.   token <j> ∧ j > i → skip;
5.   token <j> ∧ j < i → send token <j>; color := black (i resigns)
6.   token <j> ∧ j = i → L(i) := i (i becomes the leader)
7. od
8. (for a non-initiator process)
9. do token <j> received → color := black; send <j> od

This algorithm ensures that the lowest id among the initiators wins and becomes the leader; its complexity is O(n²). We provide the following improvement to the classical algorithm in Mobile CR: every node keeps a list of the identities of the nodes it has seen so far in the tokens it has received, and it only passes on tokens that have a smaller sender id than the ones in the list, rather than comparing the id in the incoming token with its own id only.
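The following minimal, single-process C++ simulation of token passing on a unidirectional ring illustrates this rule together with the improvement (each node remembers the smallest initiator id seen and only forwards smaller ones). Message passing, acknowledgements and mobility are deliberately left out; the ring order and the two initiators follow the example operation described later in Sect. 4.2.

// Single-process simulation of the ring election sketched above.
#include <deque>
#include <iostream>
#include <limits>
#include <vector>

struct Token { int origin; size_t at; };    // token id and current ring position

int electOnRing(const std::vector<int>& ids, const std::vector<size_t>& initiators) {
    size_t n = ids.size();
    std::vector<int> minSeen(n, std::numeric_limits<int>::max());
    std::deque<Token> inFlight;
    for (size_t i : initiators) {
        minSeen[i] = ids[i];
        inFlight.push_back({ids[i], (i + 1) % n});   // send token to the next node
    }
    while (!inFlight.empty()) {
        Token t = inFlight.front();
        inFlight.pop_front();
        if (ids[t.at] == t.origin) return t.origin;  // token came home: leader found
        if (t.origin < minSeen[t.at]) {              // smaller than anything seen so far
            minSeen[t.at] = t.origin;
            inFlight.push_back({t.origin, (t.at + 1) % n});
        }                                            // otherwise the token is removed
    }
    return -1;                                       // no initiator (should not happen)
}

int main() {
    std::vector<int> clusterheadIds = {31, 29, 37, 39, 36};  // ring order as in Fig. 3
    int leader = electOnRing(clusterheadIds, {0, 3});        // nodes 31 and 39 initiate
    std::cout << "leader of leaders: " << leader << "\n";    // prints 31
}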
4.1 Finite State Machine Diagram of the Mobile CR Algorithm
The Mobile CR Algorithm is described using a finite state machine diagram in order to capture all of the possible asynchronous activities. Firstly, the list of messages used in the Mobile CR Algorithm is specified as follows:
– LEADER_DEAD: triggered by an internal event which detects the super-leader's crash.
– TOKEN: sent or forwarded by a leader to its next leader for the election.
– LEADER: sent by the leader which is the winner of the election.
Fig. 2. FSM of the Mobile CR Leader
The following is the list of node states:
– SLEEP: I am in the idle state.
– LOST: I am not a CANDIDATE, as there is a candidate with a lower id than mine.
– CANDIDATE: I am a CANDIDATE to become the LEADER and I am in the election.
– LEADER_FOUND: the LEADER is determined and I know who it is.
– LEADER: I am the LEADER.

A minimal code rendering of these states and messages is sketched below.
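The sketch only covers the transitions named in the text and in Fig. 2; timers, acknowledgements and the radio layer are omitted, and the handler signature is an assumption.

// Skeletal, event-driven rendering of the Mobile CR node state machine.
#include <iostream>

enum class State { SLEEP, LOST, CANDIDATE, LEADER_FOUND, LEADER };
enum class Msg   { LEADER_DEAD, TOKEN, LEADER };

struct Node {
    int id;
    State state;

    // React to one message carrying the id of the token/leader it refers to.
    void handle(Msg m, int otherId) {
        switch (state) {
        case State::SLEEP:
            if (m == Msg::LEADER_DEAD) { state = State::CANDIDATE; send(Msg::TOKEN, id); }
            else if (m == Msg::TOKEN)  { state = State::LOST; send(Msg::TOKEN, otherId); }
            else if (m == Msg::LEADER) { state = State::LEADER_FOUND; send(Msg::LEADER, otherId); }
            break;
        case State::CANDIDATE:
            if (m == Msg::TOKEN && otherId > id)  { /* skip: my (smaller) token survives */ }
            else if (m == Msg::TOKEN && otherId < id) { state = State::LOST; send(Msg::TOKEN, otherId); }
            else if (m == Msg::TOKEN && otherId == id) { state = State::LEADER; send(Msg::LEADER, id); }
            else if (m == Msg::LEADER) { state = State::LEADER_FOUND; send(Msg::LEADER, otherId); }
            break;
        case State::LOST:
            if (m == Msg::TOKEN)       send(Msg::TOKEN, otherId);   // act as a router
            else if (m == Msg::LEADER) { state = State::LEADER_FOUND; send(Msg::LEADER, otherId); }
            break;
        default:
            break;   // LEADER / LEADER_FOUND transitions are not detailed in the text
        }
    }

    // Stand-in for handing a message to the next leader on the ring.
    void send(Msg, int otherId) {
        std::cout << "node " << id << " forwards a message about " << otherId << "\n";
    }
};

int main() {
    Node n{39, State::SLEEP};
    n.handle(Msg::LEADER_DEAD, 0);   // node 39 initiates and becomes CANDIDATE
    n.handle(Msg::TOKEN, 31);        // token 31 arrives: node 39 resigns and forwards it
}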
4.2 An Example Operation
Fig. 3 shows an example scenario of the Mobile CR Algorithm for a network with 40 nodes located on a 600 m × 600 m surface area. The x,y coordinates of each node are given next to it, and node 37's coverage area is shown by the dotted circle. As shown by the bold lines, the network is partitioned into 5 clusters with MCA. After partitioning the network, the backbone is constructed with BFA as a directed ring architecture, depicted with the arcs and arrows in Fig. 3. Initially all leader nodes are in the SLEEP state. Within a small amount of
Fig. 3. An Example Operation for Mobile CR: 40 nodes on a 600 m × 600 m surface area, partitioned into Clusters 1–5; the dotted circle marks the coverage area of node 37
time, node 31 becomes an initiator and changes its state to CANDIDATE by sending a TOKEN message to its next leader on the ring, node 29. Node 29, which is in the SLEEP state, receives the TOKEN message, changes its state to LOST and sends an acknowledgement message to node 31 (each protocol message is acknowledged to maintain reliable transmission). Node 29 then forwards the TOKEN message to node 37. At the same time, node 39 becomes an initiator and also makes a state transition from SLEEP to CANDIDATE by sending a token message to its next leader on the ring, node 36. Node 37 forwards the TOKEN of node 31 to node 39. Node 39 loses the election, since the id of the received token is smaller; however, node 39's token is still forwarded towards node 31, and node 31 blocks the TOKEN message of node 39. In the end, the TOKEN message of node 31 circulates the ring and node 31 becomes the LEADER of leaders. Node 31 sends a LEADER message to the next leader on the ring, which then circulates the ring to announce the new leader of leaders.
4.3 Analysis
The proposed protocol consists of three layers as shown in Fig. 1:
1. Merging Clustering Algorithm (MCA)
2. Backbone Ring Formation Algorithm (BFA)
3. Mobile Chang-Roberts Algorithm (Mobile CR)

Theorem 1. The message and time complexity of the protocol is O(kn), where k is the number of clusters.

Proof. The message complexity of the protocol is the sum of the message complexities of the three algorithms above plus the messages required for termination detection of the first two algorithms. Assuming termination detection requires a negligible number of messages, the message complexity of the Leader Election Protocol (LEP) is:

O(LEP) = O(MCA) + O(BFA) + O(Mobile CR)   (1)
O(LEP) = O(n) + O(kn) + O(k²)   (2)
O(LEP) = O(kn)   (3)

By using the same method, the time complexity can be found as:

O(LEP) = O(n) + O(kn) + O(n)   (4)
O(LEP) = O(kn)   (5)

5 Results
We implemented the protocol stack in the ns2 simulator. Flat surfaces of different sizes are chosen for each simulation to create medium, dense and highly dense connected networks. Medium, small and very small surfaces vary from 140 m × 700 m to 700 m × 700 m, from 130 m × 650 m to 650 m × 700 m, and from 120 m × 600 m to 600 m × 600 m, respectively. The average degree of the network is approximately N/4 for the medium connected, N/3.5 for the dense connected and N/3 for the highly dense connected networks, where N denotes the total number of nodes in the network. Although each packet is acknowledged by the destination to maintain reliable transmission and is retransmitted if dropped, sparse networks are not studied because of the lack of connectivity. N varies from 10 to 50 in our experiments. Random movements are generated for each simulation and the random waypoint model is chosen as the mobility pattern. Low, medium and high mobility scenarios are generated and the respective node speeds are limited
between 1.0 and 5.0 m/s, between 5.0 and 10.0 m/s, and between 10.0 and 20.0 m/s, respectively. We use the codes of MCA and BFA, previously simulated by us, under Mobile CR to obtain end-to-end measurements. Each measurement is taken as the average of 20 measurements with the same mobility and density pattern but different randomly generated node locations and speeds. Our previous studies show that MCA and BFA are stable under different density and mobility conditions. Fig. 4 and Fig. 5 show that the election time increases linearly with the total number of nodes and that Mobile CR is stable under density and mobility changes. The run-times decrease as mobility increases, as shown in Fig. 5, since the number of clusterheads forming the ring is smaller in high mobility scenarios, resulting in less network traffic.
Fig. 4. Election Time against Density for Mobile CR
Fig. 5. Election Time against Mobility for Mobile CR
The number of nodes on the ring formed by BFA is an important parameter for the election time in Mobile CR. Upper and lower bound cluster parameters are defined in MCA to adjust cluster sizes; we use these parameters to divide the network into a given number of clusters. The network is divided into 3, 4, 5 and 7 clusters, and the effect of the number of clusterheads on the ring is measured. Fig. 6 shows that the election time changes only slightly with the total number of clusters in the network. One might expect a linear increase of the election time with the total number of clusters, but the actively routing nodes are selected by AODV, and not only clusterheads are used for routing.
Fig. 6. Election Time against Number of Clusters for Mobile CR
Fig. 7. Election Time against Initiator number for Mobile CR
Lastly, we investigate the behaviour of the election time when nodes start concurrently. In our algorithm, each node stores a list of the received initiator ids and blocks the tokens of candidates having a greater id than the prospective leader (the smallest id seen so far); 1 to 5 initiators are selected for the simulations. The results in Fig. 7 show that the algorithm is stable against a varying number of concurrent initiators.
6 Conclusions
We provided a three layer architecture for the dynamic leader election problem in MANETs, where the clustering phase provides the leaders of the clusters in the first phase, the backbone formation algorithm provides a ring network among the local leaders in the second phase, and a super-leader among the local leaders is elected using the Mobile CR algorithm in the final phase. We showed experimentally and theoretically that this approach is scalable and has an overall favorable performance. We think this approach may find various implementation environments in MANETs where sub-activities within the MANET are handled by groups/clusters of nodes, each having a leader for local decisions, while overall control is achieved by a single super-leader node. The protocol can be invoked periodically, which ensures correct handling of failing leaders.
References
1. LeLann, G.: Distributed Systems: Towards a Formal Approach. IEEE Information Processing 77, 155–169 (1977)
2. Chang, E.J., Roberts, R.: An Improved Algorithm for Decentralized Extrema Finding in Circular Arrangements of Processes. ACM Com., 281–283 (1979)
3. Franklin, W.R.: On an Improved Algorithm for Decentralized Extrema Finding in Circular Configurations of Processors. ACM Com., 281–283 (1982)
4. Peterson, G.L.: An O(nlogn) Unidirectional Algorithm for the Circular Extrema Problem. ACM Trans. Prog. Lang. 4, 758–763 (1982)
5. Dolev, D., Klawe, M., Rodeh, M.: An O(nlogn) Unidirectional Distributed Algorithm for Extrema-Finding in a Circle. J. Algorithms 3, 245–260 (1982)
6. Gallagher, R.G., Humblet, P.A., Spira, P.M.: A Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Trans. Prog. Lang., 66–77 (1983)
7. Malpani, N., Welch, J., Vaidya, N.: Leader Election Algorithms for Mobile Ad Hoc Networks. In: Proc. Int. Works. on Disc. Alg. and Meth., pp. 96–103 (2000)
8. Vasudevan, S., Immerman, N., Kurose, J., Towsley, D.: A Leader Election Algorithm for Mobile Ad Hoc Networks. UMass Comp. Sci. Tech. Rep. (2003)
9. Pradeep, P., Kumar, V., Yang, G.-C., Ghosh, R.K., Mohanty, H.: An Efficient Leader Election Algorithm for Mobile Ad Hoc Networks. In: Ghosh, R.K., Mohanty, H. (eds.) ICDCIT 2004. LNCS, vol. 3347, pp. 32–41. Springer, Heidelberg (2004)
10. Masum, S.M., Ali, A.A., Bhuiyan, M.T.I.: Asynchronous Leader Election in Mobile Ad Hoc Networks. In: AINA 2006, pp. 827–831 (2006)
11. Cokuslu, D., Erciyes, K.: A Hierarchical Connected Dominating Set Based Clustering Algorithm for Mobile Ad Hoc Networks. In: IEEE MASCOTS 2007, pp. 60–66 (2007)
12. Dagdeviren, O., Erciyes, K., Cokuslu, D.: Merging Clustering Algorithm for Mobile Ad Hoc Networks. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganà, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3981, pp. 681–690. Springer, Heidelberg (2006)
13. Dagdeviren, O., Erciyes, K.: A Distributed Backbone Formation Algorithm for Mobile Ad Hoc Networks. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds.) ISPA 2006. LNCS, vol. 4330, pp. 219–230. Springer, Heidelberg (2006)
14. Perkins, C.E., Belding-Royer, E.M., Das, S.: Ad Hoc On Demand Distance Vector (AODV) Routing. RFC 3561 (2003)
15. Ghosh, S.: Distributed Systems, An Algorithmic Approach, ch. 11, pp. 175–176. Chapman and Hall/CRC (2007)
Distributed Algorithms to Form Cluster Based Spanning Trees in Wireless Sensor Networks

Kayhan Erciyes¹, Deniz Ozsoyeller², and Orhan Dagdeviren³

¹ Ege University, International Computer Institute, Bornova, Izmir TR-35100, Turkey
[email protected]
² Izmir University of Economics, Computer Eng. Dept., Balcova, Izmir TR-35350, Turkey
[email protected]
³ Izmir Institute of Technology, Computer Eng. Dept., Urla, Izmir TR-35340, Turkey
[email protected]
Abstract. We propose two algorithms to form spanning trees in sensor networks. The first algorithm forms hierarchical clusters of spanning trees with a given root, the sink. All of the nodes in the sensor network are then classified iteratively as subroot, intermediate or leaf nodes. At the end of this phase, the local spanning trees are formed, each having a unique subroot (clusterhead) node. Communication and data aggregation towards the sink by an ordinary node are then accomplished by sending data to the local subroot, which routes the data towards the sink. A modified version of the first algorithm is also provided, which ensures that the obtained tree is a breadth-first search tree in which a node can change its parent to obtain a shorter distance to the root. Once the sub-spanning trees in the clusters are formed, a communication architecture such as a ring can be formed among the subroots. This hybrid architecture, which provides co-existing spanning trees within clusters, yields the necessary foundation for a two-level communication protocol in a sensor network, as well as a structure for a higher level abstraction such as the γ synchronizer, where communication between the clusters is performed using the ring, similar to an α synchronizer, and the intra-cluster communication is accomplished using the sub-spanning trees, as in the β synchronizer. We discuss the model along with the algorithms, compare them and comment on their performances.

Keywords: spanning tree, clustering, synchronizers, wireless sensor networks.
1 Introduction
Wireless Sensor Networks (WSNs) have important scientific, environmental, medical and military applications. Example WSN applications include habitat monitoring, remote patient monitoring and military defense systems [1]. WSNs may consist of hundreds or even thousands of nodes that operate independently.
A survey of WSNs can be found in [2]. WSN nodes are small, inexpensive, embedded devices that require low power and are distributed regularly or irregularly over a significantly large area. They are usually deployed in highly dynamic and sometimes hostile environments; it is therefore very important that these networks are capable of unattended, distributed but coordinated operation with the other nodes and provide self-healing in the case of faults. Communication in WSNs can be performed using two fundamental approaches: tree based and cluster based. Cluster based communication requires grouping closely coupled elements of the sensor network into clusters and electing one of these nodes as the clusterhead (cluster leader) [3]. The cluster leader coordinates the communication among the cluster members and with other clusters. Energy is an important and crucial resource in sensor networks due to the limited lifetime of sensor batteries and the difficulty of recharging the batteries of thousands of sensors in remote or hostile environments. Communication dominates the energy consumption of sensor nodes even when they are in the idle-listening state [4]. In this study, we propose a distributed algorithm that forms hierarchical spanning trees in a WSN, where each sub-spanning tree has a root node that acts as the leader for that subtree. Our algorithm produces the topology of a spanning tree but also has a cluster structure with a clusterhead, and is therefore an integration of the tree based and cluster based approaches. To our knowledge, the algorithm in this study is the first attempt to provide a hybrid approach for communication in WSNs. The rest of the paper is organized as follows. Section 2 provides the related work; the algorithms designed are detailed in Sections 3 and 4 along with the analysis and the results obtained. Finally, conclusions are presented in Section 5.
2 Background

2.1 Clustering in WSNs
A WSN can be modelled by a graph G(V, E), where V is the set of vertices (the nodes of the WSN) and E is the set of edges (the communication links among the nodes). Clustering the nodes of a graph, or graph partitioning, is NP-hard; for this reason, clustering in WSNs is usually performed using heuristics. Some of the benefits to be gained from clustering in mobile ad hoc networks and WSNs are the reduction in energy for message transfers and the formation of a virtual backbone for routing purposes [5]. HEED (Distributed Clustering in Ad-hoc Sensor Networks: A Hybrid, Energy-Efficient Approach) [6] proposes a distributed clustering algorithm for sensor networks. Clusterheads in HEED are elected using a probabilistic heuristic that considers the residual energy of a node and the number of its neighbors (its degree). HEED assumes a homogeneous network, assumes that neighbor connectivity is known, and provides balanced clusters. LEACH (Low-Energy Adaptive Clustering Hierarchy) [7] provides rotating clusterheads chosen randomly and
assumes clusterheads consume uniform energy. Both HEED and LEACH find clusters in a finite number of steps. In PEAS [8], a node goes to sleep (turns off its radio) when it detects a routing node in its transmission range. In GAF [9], the sensor network is divided into fixed square grids, each with a routing node; communication to the sink is propagated by the routers, and the ordinary nodes in each grid can turn off their radio components when they have nothing to transmit. GEAR (Geographical and Energy Aware Routing: a recursive data dissemination protocol for wireless sensor networks) [10] and TTDD (Two-tier Data Dissemination Model for Large Scale Wireless Sensor Networks) [11] are examples of other protocols for cluster formation in WSNs.
2.2 Spanning Tree Formation in WSNs
Building spanning trees rooted at a sink node for data collection is a fundamental method for data aggregation in sensor networks. However, due to the nature of sensor networks, the spanning tree should be formed in a decentralized way. Gallagher, Humblet and Spira [12], Awerbuch [13], and Banerjee and Khuller [14] have all proposed distributed spanning tree algorithms. The distributed algorithm of Gallagher, Humblet and Spira determines a minimum weight spanning tree for an undirected graph by combining small fragments into larger fragments; a fragment of a spanning tree is one of its subtrees. The time complexity of this algorithm is O(N log N). The ENCAST (ENergy Critical node Aware Spanning Tree) algorithm [15] finds a shortest path tree (SPT) by breadth-first traversal from the sink, so that each node can reach the sink via the minimum number of hops using this SPT. However, there may be more than one SPT in a dense sensor network, since nodes have many neighbors and some of these neighbors have the same minimum-hop distance from the sink; ENCAST therefore uses the energy of a node as a second selection criterion and attempts to label nodes with less energy as leaf nodes.
3 The Distributed Spanning Tree Algorithm
The first algorithm is a modification of the distributed spanning tree formation algorithm for general networks. We modify this general algorithm below so that clusters, which are subtrees, are also formed with the energy considerations of the WSN in mind. We assume that the sensor nodes are distributed randomly and densely over the area to be monitored and that the sensor field can be mapped onto a two dimensional space. Furthermore, all sensor nodes have identical and fixed transmission ranges and hardware configurations, and each sensor node can monitor its power level E_P.
3.1 Description of the Algorithm
The algorithm we propose is described informally as follows. The sink periodically starts the algorithm by sending a PARENT message to its neighbors. Any
node i that has not received a PARENT message before sets the sender as its parent, sends an ACK(i) message to its parent and sends a PARENT(i) message to all of its neighbors. We provide a depth of subtree parameter d as a modification to the classical spanning tree algorithm above: every node that is designated a parent computes n_hops = (n_hops + 1) MOD d and appends it to its outgoing message. Recipients of a message with n_hops = 0 are SUBROOTs, and nodes with n_hops <= d are INTERMEDIATE or LEAF nodes, depending on their level within a subtree. The state diagram in Fig. 1 depicts the operation of the Distributed Spanning Tree Algorithm (DSTA). The algorithm is initiated by the sink at regular intervals. Any ordinary node that has not been labeled before, on receiving a PARENT message from an upper node, labels itself according to the number of hops the message has traveled, which is given by the parameter of the PARENT message.
Fig. 1. The Finite State Machine Diagram of DSTA
Any further changes of state between subroot, intermediate and leaf nodes are not shown for simplicity. The following is the list of messages used in DSTA:
– PARENT: sent by a parent to its neighbors soliciting children.
– CHILD: sent by a child to its parent, acknowledging that it is a successor.
– TIMEOUT: internal message informing that a timeout has occurred; it prevents a subroot from waiting indefinitely for acknowledgements from potential children.

Each message contains the following fields:
– Sender: SINK, SUBROOT, SUBROOT0, INTERMED, LEAF;
– type: PARENT, CHILD;
– n_hops: an integer giving the number of hops the message has travelled.
If the number of hops in the message is equal to zero, the node labels itself as a SUBROOT. Else, if the number of hops is smaller than the allowed depth d of the subtree, the node is an intermediate (INTERM) node. Once the number of hops equals the depth, the node is classified as a LEAF (a small sketch of this labeling rule is given below, after the list of node states). Each labeled node acknowledges its parent with a CHILD message. The following is the list of sensor node states:
– SUBRT: a node labeled as a subroot, because the message it received from its parent has n_hops = 0.
– SUBCH: a subroot node with at least one confirmed child in the local tree.
– INTERM: an intermediate node, that is, neither a subroot nor a leaf node.
– INTCH: an intermediate node with at least one child.
– LEAF: a node that is a leaf of a local spanning tree.
– LEAFCH: a leaf node with at least one child.
– SUBRT0: a subroot node that has received a SINK message.
– SUBCH0: a subroot 0 node with at least one child.

Remark 1 (Energy Considerations). A sensor node rejects being labeled as a subroot if its energy level is below a threshold, for example two thirds of E_P. This is required because a subroot has more message transfers than an ordinary node. A branch of the spanning tree formed constitutes a cluster in which the subroot node is the clusterhead. Subroots may take on other roles in application specific settings; for our purpose, each subroot has the capability to manipulate or filter any incoming message during convergecast.
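The following centralized C++ sketch shows one consistent reading of the labeling rule: the sink floods PARENT messages, each node adopts the first sender as its parent, and the carried hop counter ((n_hops + 1) MOD d at every forwarding step) decides whether a node becomes a SUBROOT (counter 0), an INTERMEDIATE node, or a LEAF (counter d − 1). The real algorithm is distributed and also applies the energy threshold of Remark 1, which is omitted here.

// Centralized illustration of the DSTA labeling rule (names and example topology assumed).
#include <iostream>
#include <queue>
#include <vector>

enum class Label { SINK, SUBROOT, INTERMEDIATE, LEAF };

struct Result { std::vector<int> parent; std::vector<Label> label; };

Result dstaLabel(const std::vector<std::vector<int>>& adj, int sink, int d) {
    int n = static_cast<int>(adj.size());
    Result r{std::vector<int>(n, -1), std::vector<Label>(n, Label::SINK)};
    std::vector<int> hops(n, -1);
    std::queue<int> q;
    q.push(sink);                      // the sink sends n_hops = 0 to its children
    while (!q.empty()) {
        int u = q.front(); q.pop();
        int forwarded = (hops[u] + 1) % d;          // value appended to the PARENT message
        for (int v : adj[u]) {
            if (v == sink || hops[v] != -1) continue;   // node already has a parent
            hops[v] = forwarded;
            r.parent[v] = u;
            r.label[v] = (forwarded == 0) ? Label::SUBROOT
                        : (forwarded == d - 1) ? Label::LEAF : Label::INTERMEDIATE;
            q.push(v);
        }
    }
    return r;
}

int main() {
    // A small chain 0(sink)-1-2-3-4-5 with depth d = 3.
    std::vector<std::vector<int>> adj = {{1}, {0,2}, {1,3}, {2,4}, {3,5}, {4}};
    Result r = dstaLabel(adj, 0, 3);
    const char* names[] = {"SINK", "SUBROOT", "INTERMEDIATE", "LEAF"};
    for (int v = 1; v < 6; ++v)
        std::cout << "node " << v << ": parent " << r.parent[v]
                  << ", " << names[static_cast<int>(r.label[v])] << "\n";
}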
3.2 Analysis of DSTA
In this section, we analyze the number of communication steps (message count) needed to form the spanning trees using DSTA and comment on its performance. Based on the state machine of Fig. 1, labeling a sensor node as SUBROOT, INTERMED or LEAF requires two messages, PARENT and CHILD: the first is sent by the parent soliciting children and the second is the acknowledgement of the child to its parent.

Theorem 1. The time complexity of DSTA is O(D), where D is the diameter of the network from the sink to the furthest leaf, and its message complexity is O(n).

Proof. The time required by the algorithm is clearly the diameter D of the network. Once a node is labeled and has a designated parent, it sends a message to its neighbors only once. If Δ is the maximum degree of the network graph, the total number of messages is Δ·n, and for small Δ the message complexity is O(n).
3.3 Results
The distributed spanning tree algorithm is implemented with the ns2 simulator. The IEEE 802.11g standard is chosen for the lower layer protocols. The total number
of nodes varies from 100 to 500. Flat surfaces of different sizes are chosen for each simulation to create highly dense, dense and medium connected topologies in order to measure the effect of node degree. Surface areas vary from 2700 m × 1200 m to 17920 m × 1920 m. The depth parameter is changed to obtain different numbers of clusters, as well as of SUBROOT, INTERMEDIATE and LEAF nodes.
Fig. 2. DSTA Run-times against the Number of Nodes
Fig. 2 displays the run-time results of the distributed spanning tree algorithm for 100 to 500 nodes. The run-time values increase almost linearly, except for the case of 300 nodes, which may be due to their random distribution, indicating scalability as the total number of nodes is increased from 100 to 500; about 4.5 s is needed for the formation of the distributed spanning tree with clusters. For a network with 100 nodes, different topologies are created to measure the effect of the average node degree parameter, because each node must be informed by its neighbors to complete reliable flooding and any corrupted message must be retransmitted. Fig. 3 shows that the algorithm performs well up to highly dense topologies with 8 nodes connected on average. The depth parameter of DSTA changes the number of clusters and the node states in the WSN. The numbers of SUBROOT, INTERMEDIATE and LEAF nodes are measured for 300 nodes, as shown in Fig. 4. As the depth parameter is increased from 2 to 6, the SUBROOT node count decreases and the INTERMEDIATE count increases
Fig. 3. DSTA Run-times against the Average Node Degree
Fig. 4. DSTA Number of Node States against the Depth
as expected. The number of LEAF nodes, which are mostly gateway nodes, decreases with the depth parameter in the same way as the SUBROOTs. Our results conform with the analysis in that the run-time values and message counts grow linearly; the algorithm is also stable under different node degrees. The depth parameter changes the number of clusters and node states, and its selection is very important. The worst-case delivery times are scalable and show that nodes can route their packets on top of this spanning tree with reasonable delays.
4 Breadth-First Search Based DSTA
The second algorithm we propose for spanning tree formation in WSNs is a modification of the Breadth-First Search (BFS) spanning tree algorithm for general networks shown below:

1. Initially, the root sets L(root) = 0 and all other vertices set L(v) = ∞.
2. The root sends out the message Layer(0) to all its neighbors.
3. A vertex v which gets a Layer(d) message from a neighbor w checks whether d + 1 < L(v). If so, it does the following:
   – parent(v) = w;
   – L(v) = d + 1;
   – send Layer(d + 1) to all neighbors except w.

We apply the algorithm above; however, based on their designated distances, nodes are labeled as ROOT, SUBROOT and LEAF as in DSTA.

Theorem 2. The time complexity of BFS-DSTA is O(n) and its message complexity is O(n|E|).

Proof. As the longest path in a network has n−1 nodes, the time complexity of the general asynchronous BFS spanning tree algorithm is O(n). Since at every step there will be a maximum of |E| messages, the message complexity is O(n|E|). For BFS-DSTA, the general rules apply and the complexities are the same as those of the asynchronous BFS algorithm.
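A sequential C++ rendering of the layering rule quoted above is given below: every vertex keeps L(v), initially infinite, and adopts a neighbour w as its parent whenever a Layer(d) message from w satisfies d + 1 < L(v). Run to completion this yields the breadth-first spanning tree; asynchrony, message loss and the SUBROOT/LEAF labelling of BFS-DSTA are not modelled here, and the example topology is made up.

// Sequential sketch of the BFS layering rule.
#include <iostream>
#include <queue>
#include <vector>

struct BfsTree { std::vector<int> parent, layer; };

BfsTree bfsSpanningTree(const std::vector<std::vector<int>>& adj, int root) {
    const int INF = 1 << 30;
    int n = static_cast<int>(adj.size());
    BfsTree t{std::vector<int>(n, -1), std::vector<int>(n, INF)};
    t.layer[root] = 0;
    std::queue<int> q;                 // pending Layer(d) announcements, one per sender
    q.push(root);
    while (!q.empty()) {
        int w = q.front(); q.pop();
        for (int v : adj[w]) {
            if (t.layer[w] + 1 < t.layer[v]) {   // the rule: d + 1 < L(v)
                t.layer[v] = t.layer[w] + 1;
                t.parent[v] = w;
                q.push(v);                        // v re-announces its new layer
            }
        }
    }
    return t;
}

int main() {
    // Sink is node 0; the edges form a small mesh.
    std::vector<std::vector<int>> adj = {{1,2}, {0,2,3}, {0,1,4}, {1,4}, {2,3}};
    BfsTree t = bfsSpanningTree(adj, 0);
    for (int v = 1; v < 5; ++v)
        std::cout << "node " << v << ": parent " << t.parent[v]
                  << ", layer " << t.layer[v] << "\n";
}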
4.1 Results for BFS Based DSTA
We implemented BFS-MDSTA (DSTA with multiple sinks using BFS) in an ns2 setting similar to that of DSTA. Fig. 5 shows the running times of BFS-MDSTA for 1, 3 and 5 sinks. There is a linear increase as the number of nodes is increased, and the running times also depend linearly on the number of concurrent sinks.
Fig. 5. The Running Times for Multi-sink Formation with BFS-MDSTA
Fig. 6 shows the number of clusters formed when BFS-MDSTA is applied to a WSN for 1, 3 and 5 sinks with a constant depth of 3. The curves are almost identical, showing an even distribution of clusters independent of the count and location of the sinks.
Fig. 6. The Average Number of Clusters for BFS-MDSTA for d = 3
Fig. 7 shows the effect of the subtree depth d on the cluster count when BFS-DSTA is applied to up to 250 WSN nodes, which was the upper limit the simulator could tolerate due to the data complexity of maintaining 5 concurrent sinks. The count of clusters decreases linearly as d increases, which is expected.
Fig. 7. The Average Number of Clusters for BFS-MDSTA against Depth
5 Discussions and Conclusions
We proposed two distributed algorithms for spanning tree formation to provide a communication infrastructure in sensor networks. The first algorithm (DSTA) has a lower message complexity of O(n) but does not necessarily find the shortest route to the sink. The second algorithm (BFS-DSTA) uses the BFS property and finds the shortest route, at the cost of an elevated message complexity of O(n|E|). These algorithms may be activated at regular intervals by the sink, and the resulting dynamic spanning tree configuration, consisting of healthy nodes only, discards the sensor nodes that have ceased functioning due to energy loss or other hostile environmental conditions. We showed that these algorithms are scalable and provide balanced clusters which consist of trees within the clusters. This architecture may be used for a γ synchronizer, which requires the same structure we propose. One future direction of this work would therefore be another communication structure, such as a ring between the clusterheads, so that an α synchronizer can be constructed among the clusters. The local spanning trees produced by DSTA and BFS-DSTA naturally comprise clusters of the sensor network and can therefore be used for resource management tasks other than the communication infrastructure or the synchronizer function described in this study. The subroot nodes are the leaders of the clusters and can act as representatives of their cluster members for various tasks in the sensor network. These leaders can be connected in various configurations, such as a ring, in order to perform tasks such as mutual exclusion in sensor networks. The advantage of this hybrid approach would be simple and fast data aggregation using the spanning tree within each cluster and a more general framework, such as ring-based communication, among the clusters. Our work is ongoing, and we are looking into labeling some nodes of the WSN as privileged nodes with improved transmission capabilities, so that these nodes may form an upper spanning tree and hence an upper communication backbone of the WSN.

Acknowledgements. This work was partially supported by Turkish Science and Research Council Career Project 104E064.
The Effect of Network Topology and Channel Labels on the Performance of Label-Based Routing Algorithms

Reza Moraveji1,2, Hamid Sarbazi-Azad1,3, and Arash Tavakkol1

1 IPM School of Computer Science, Tehran, Iran
2 Dept. of ECE, Shahid Beheshti Univ., Tehran, Iran
3 Dept. of Computer Engineering, Sharif Univ., Tehran, Iran
{moraveji_r,azad,arasht}@ipm.ir, [email protected]
Abstract. Designing an efficient deadlock-free routing is a point of concern for irregular topologies. In this paper, we take a step toward this goal by developing three novel deadlock-free routing algorithms in the context of a new family of algorithms, called label-based routing algorithms, for irregular topologies. In addition, the newly proposed family covers three previously reported routing algorithms [2, 3]. Moreover, by simulating and comparing the newly proposed and previously reported routing methods, it is shown that the performance of this family highly depends on the network topology and the channel labeling process.

Keywords: Label-based routing algorithm, irregular network, network of workstations, performance evaluation.
1 Introduction

In recent years, cluster-based irregular networks (INs), such as networks of workstations (NOWs) and irregular networks-on-chip, have emerged as one of the cost-effective alternatives to traditional regular parallel computers. In such systems, an irregular high-speed network is often required in order to provide the wiring flexibility needed in the network and to allow the design of scalable systems with incremental expansion capability [4, 7]. Without a careful design of the routing scheme of INs, deadlock may happen in these networks [5, 6]. Since the topology of INs is not predefined, designing and applying deadlock-free routing algorithms is usually done without any pre-assumption about the network topology. Therefore, the major problem of these networks is the complexity of designing a general deadlock-free routing algorithm. The main purpose of this work, presented in Section 2, is to take a step in this direction by developing some deadlock-free routing schemes within a new family of routing algorithms, called label-based routing algorithms, for irregular topologies. Moreover, evaluating the performance of label-based routings in irregular networks under realistic conditions is another major concern. To this end, extensive simulation experiments have been conducted in Section 3. Section 4 concludes the paper and outlines some directions for future work in this line of research.
2 Label-Based Routing Algorithms

In order for a routing algorithm to be deadlock-free, cyclic buffer dependencies between messages and the physical channels they allocate must not occur. When the labeling approach is used for generating deadlock-free routing algorithms, the given topology first has to be prepared for implementing the label-based routing algorithm. Let us briefly describe the way in which the topology is labeled and the method by which the related routing schemes are generated. The main idea of label-based routing algorithms is to classify network channels by assigning predefined labels and then to group the labeled channels in such a way that there is no cyclic dependency within each group. These groups are referred to as zones in [1]. Afterwards, the generated zones are ordered in a sequence such that, when a message passes through the needed channels in the zones (respecting the sequence of the zones), the sequence guarantees that the message reaches its destination.

2.1 Fundamental Concepts of Graph Labeling, Deadlock-Free Zones and Routing Algorithms
The first step in generating a label-based routing algorithm is graph labeling. Since we plan to compare the previously reported routing algorithms with the newly proposed ones, in this paper we use the graph labeling reported in [1-3]. As a starting point, a spanning tree (based on breadth-first search (BFS) graph traversal) is formed on the given irregular network as the basis of the labeling process. Nodes and channels are labeled in two stages as follows.

First stage: Nodes are labeled in ascending order with respect to the spanning tree formation and according to their distances from the root of the spanning tree. A channel that faces toward the lower node label is labeled '1' and a channel that goes away from the lower node label is labeled '0' (Fig. 1).

Second stage: Subsequently, the second stage of labeling is applied to the graph: an increasing number is assigned to each node in the order in which nodes are visited by a pre-order tree traversal. Channels are labeled using the policy of the first stage (Fig. 1). Therefore, each channel is assigned two different labels, and it is possible to think of a channel label as a compound label containing two distinct labels. It is obvious that there may be at most four possible channel labels for a given irregular topology.
These channel labels are: (11), (10), (01), (00). As a result, a single '0' transition (channel) from node A to B means that the corresponding label of node A is lower than that of B, and a single '1' transition from node A to B means that node A has a higher corresponding label than B. Therefore, when both labels are taken into account as a compound label, we have the following outcomes:

A(a0a1) --(11)--> B(b0b1) ⇒ (a0 > b0, a1 > b1)
A(a0a1) --(10)--> B(b0b1) ⇒ (a0 > b0, a1 < b1)
A(a0a1) --(01)--> B(b0b1) ⇒ (a0 < b0, a1 > b1)
A(a0a1) --(00)--> B(b0b1) ⇒ (a0 < b0, a1 < b1)
where (a0a1) and (b0b1) are node labels. The second step in generating a label-based routing algorithm is to group the channels such that there is no cyclic dependency between the channels of the same group. Since there are several ways to group the channels, it is possible to generate various deadlock-free groups (zones) and, in turn, different deadlock-free routing algorithms.
Fig. 1. Node and channel labeling
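The two-stage labeling of Section 2.1 can be reproduced mechanically once a spanning tree root is fixed. The following minimal sketch is an illustration only (the example graph, the 0-based numbering, and the sorted tie-break for visiting children are assumptions, not taken from the paper): it computes the BFS-order and pre-order node labels and derives the compound label of every directed channel, with a bit set to '1' when the channel heads toward the lower node label.

```python
from collections import deque

def label_graph(adj, root):
    """Return the compound label of every directed channel (u, v)."""
    # First stage: BFS spanning tree and ascending BFS-order node labels.
    parent, bfs_label, queue = {root: None}, {}, deque([root])
    while queue:
        u = queue.popleft()
        bfs_label[u] = len(bfs_label)
        for v in sorted(adj[u]):                 # arbitrary tie-break
            if v not in parent:
                parent[v] = u
                queue.append(v)
    children = {u: [v for v in sorted(adj[u]) if parent.get(v) == u] for u in adj}
    # Second stage: pre-order traversal of the same spanning tree.
    pre_label, stack = {}, [root]
    while stack:
        u = stack.pop()
        pre_label[u] = len(pre_label)
        stack.extend(reversed(children[u]))      # visit children left to right
    # Channel bit is '1' when the channel faces toward the lower node label.
    def bit(lu, lv):
        return '1' if lu > lv else '0'
    return {(u, v): bit(bfs_label[u], bfs_label[v]) + bit(pre_label[u], pre_label[v])
            for u in adj for v in adj[u]}

# Small irregular example (an assumption for illustration).
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 4], 3: [0, 4], 4: [2, 3]}
for edge, label in sorted(label_graph(adj, 0).items()):
    print(edge, label)
```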
As mentioned in [1] for label-based routing, there is a predefined ordering for traveling between channel groups, and a message cannot use channel labels belonging to a previously traversed group, while it can use channel labels of the same group adaptively. Consequently, there is no cyclic dependency between channels of different groups. Therefore, it is sufficient to group channel labels with no cyclic dependency. As indicated in Fig. 2, message 1 holds channel (A, B), labeled (10), and requests the use of channel (C, D), labeled (11), while message 2 holds (C, D) and requests the use of (A, B). Let us assume that these two channel labels {(10), (11)} are in the same group, and consider the situation in which messages 1 and 2 use only channel labels of this group, {(10), (11)}. For message 1 we have

A(a0a1) --(10)--> B(b0b1) ⇒ (a0 > b0, a1 < b1),   C(c0c1) --(11)--> D(d0d1) ⇒ (c0 > d0, c1 > d1),

and, since every channel of this group has '1' as its first bit, the first node label decreases along the whole path of message 1:

a0 > b0 ≥ ... ≥ c0 > d0 ⇒ a0 > d0.

Therefore, if message 2 wants to request (A, B) while holding (C, D), it has to cross other channels, such as (00) or (01), which contradicts the mentioned group-ordering traversal [1]. Thus, it is possible to put (10) and (11) in one group.
Fig. 2. Cyclic dependency between (A, B) and (C, D)
The following corollary defines a general rule for creating deadlock-free zones, i.e., groups of channels without cyclic dependency.

Corollary: There is no cyclic dependency between channels X(x0x1) and Y(y0y1) if and only if they satisfy the condition (x0 XNOR y0) OR (x1 XNOR y1) = 1, where XNOR and OR are bitwise operators. It should be noted that X(x0x1) and Y(y0y1) are channel labels, not node labels, and the four possible channel labels were introduced above. The possible deadlock-free zones are therefore (11,10), (11,01), (10,00), and (01,00).
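As a quick sanity check, the corollary's condition can be evaluated exhaustively over the four compound labels. The tiny sketch below is an illustration only, not part of the paper; it recovers exactly the four deadlock-free zones listed above.

```python
from itertools import combinations

LABELS = ['11', '10', '01', '00']

def compatible(x, y):
    """(x0 XNOR y0) OR (x1 XNOR y1) = 1, i.e. the labels agree in at least one bit."""
    return any(xb == yb for xb, yb in zip(x, y))

zones = [(x, y) for x, y in combinations(LABELS, 2) if compatible(x, y)]
print(zones)   # [('11', '10'), ('11', '01'), ('10', '00'), ('01', '00')]
```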
The third and last step in generating a label-based routing algorithm is to order the deadlock-free zones in a sequence that guarantees connectivity for the routing algorithm.

Theorem: The path between any pair of nodes is guaranteed by selecting a sequence of channel labels whose first (second) bit is '1', followed by a sequence of channel labels whose first (second) bit is '0' [1].

Proof: When a message chooses a channel with a compound label that contains at least one '1', it gets closer to the root node of the spanning tree. In the worst case, the message reaches the root node, and it is obvious that from the root node there is at least one path to every other node in the spanning tree (and hence in the whole network). Therefore, when the channel labels are ordered in a sequence of '1's followed by '0's, whether in terms of the first or the second bit of the channel label, there is at least one path between each pair of source and destination.

Now, considering the generated deadlock-free zones and the possible sequences, the following label-based routing algorithms can be defined:

1. R1: (11,10) → (01,00) (up/down routing)
2. R2: (11,01) → (10,00) (left/right routing)
3. R3: (11) → (01,00) → (10) (new)
4. R4: (11) → (10,00) → (01) (L-turn routing)
5. R5: (10) → (11,01) → (00) (new)
6. R6: (01) → (11,10) → (00) (new)
3 Empirical Performance Evaluation

The main performance metric of INs is the average message latency (the average amount of time it takes a message to completely reach its destination). In a thorough analysis, this performance metric of the label-based routings is analyzed under different working conditions, considering different irregular topologies and different spanning trees. As will be seen, some interesting points are derived from the results of the analyses that were not reported or referred to in previous works on the performance evaluation of routing algorithms in irregular networks. Analysis of this kind can be conducted through results obtained from a real implementation of the network, but a cost-effective alternative is to use a simulation of the system.

3.1 Simulator

To evaluate the functionality of irregular networks under different conditions, a discrete-event simulator has been developed that mimics the behavior of the described label-based routing algorithms at flit level. The input data (irregular topology) to the simulator is specified in the form of an adjacency matrix. Also, the spanning tree assigned to the network can be determined either by the user or automatically by the well-known BFS or DFS (depth-first search, with a predetermined heuristic) algorithms.

3.2 The Effect of Network Topology

When comparing the performances of two or more routing algorithms under the same working conditions, such as the number of virtual channels, message lengths, and traffic patterns, it is always expected that one (or more) of the compared routing algorithms shows better performance than the others. Generally, changing the conditions for all routing algorithms does not usually change the order of routing performances. For example, the performance of XY routing [7] (ignoring its simplicity of implementation) is worse than that of west-first routing [7] under the same working conditions. By changing the topology to which the respective routing algorithms are applied from Mesh3×3 to Mesh5×5, the performance of west-first routing still remains better, since the latter routing algorithm always provides more adaptivity than the former one. As a result, in most cases a fair comparison provides a definite order of performances of the compared instances. As we will see in this section, this is not true for label-based routing algorithms (the compared instances). The performance of a label-based routing algorithm highly depends on the topology to which the routing is applied. Therefore, it is not possible to establish a general ranking of the performance of the six aforementioned label-based routing algorithms. Another design parameter that has a strong influence on the performance of a label-based routing algorithm is the degree of irregularity of the network topology; that is, the performance of this family depends on the variance of the node degrees of the topology.
In order to show the above characteristics in irregular networks, a comparative performance evaluation is presented in the results of Fig. 3, where the average message latency is plotted against the traffic generation rate. The analyzed irregular networks are as follows: G1(16, 48), G2(36, 124), G3(64, 240), G4(64, 240), G5: Mesh8×8, and G6(100, 364).
Network topology is the first parameter that should be considered when choosing the best label-based routing algorithm. Let us look at the simulation results obtained from different network sizes and network topologies. As can be seen, the sequences of routing performances are totally different from one topology to another. The following list presents the sequences of the routing performances for the different topologies:

• G1: R6, R4, R3, R5, R2, R1
• G2: R1, R3, R6, R2, R4, R5
• G3: R2, R1, R6, R3, R4, R5
• G4: R6, R3, R1, R2, R4, R5
• G5: R3, R4, R1, R2, R5, R6
• G6: R3, R6, R2, R1, R4, R5
To see how different the sequences of routing performances are, consider the sequences in G3 and G4. Excluding R4 and R5, the sequence in G3 is the reverse of that in G4, although these networks contain the same number of nodes and even the same number of channels. The only difference between these two networks is the way the nodes are connected, i.e., the network topology. Another interesting example that exhibits the effect of network topology on the performance of label-based routing is R1, which shows totally inconsistent behavior in G1 and G2: R1 is the best routing algorithm in G2, while it is the worst one in G1. We have the same scenario for R6 in G4 and G5 (Mesh8×8). As a consequence, it is wise first to specify the network topology and then choose the routing algorithm which shows the best performance on the chosen topology. Another interesting effect is that of the degree of irregularity (variance of node degrees) of the topology. It is evident from Fig. 3(e) (Mesh8×8) that the message latencies and the generation rates at which saturation occurs are nearly the same for all six routing algorithms. The reason is that, although the mesh topology is not completely regular (a network is regular when all nodes have the same degree), all of the internal nodes have the same degree of four, so the variance of node degrees goes down. The same result can be seen for G1. As a result, when the irregularity of the topology decreases, i.e., the variance of node degrees diminishes, the performances of the label-based routing algorithms become nearly the same (Figs. 3(a) and 3(e)). It should be noted that when the network size decreases (as in G1), the probability that the variance of node degrees becomes smaller increases (although this may not be true in all cases).

3.3 The Effect of Spanning Tree Construction

In the previous section, the effect of network topology (network size) on the performance of the six label-based routing algorithms was discussed. According to the numerous presented results, it was shown that the performance of this family of routing algorithms highly depends on the network topology. Going further into the structural details of the six label-based routing algorithms leads us to analyze the effect of forming different spanning trees created from different root nodes. The structure of a label-based routing algorithm is determined by two parameters: the number and the order of zones (channel labels).
Fig. 3. The average message latency of label-based routing algorithms on G1 – G6 with a message length of 64 flits
Now assume that in an arbitrary network there are two minimal paths between two nodes, each described by the sequence of channel labels along it:

Path 1: 11 → 10 → 11 → 00 → 01 → 10 → 00
Path 2: 11 → 10 → 00 → 00 → 01 → 00 → 01
Among the six routing algorithms, only R1 can direct a message through both of the existing paths. As a result, if the sequence of channel labels in a routing algorithm, such as R1, admits more possible minimal paths, the average distance of the network will decrease. Moreover, the sequence of channel labels along a path is determined by the graph labeling. Therefore, the performance of a label-based routing algorithm depends on the way the graph is labeled.
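Whether a given routing algorithm admits a particular path can be checked mechanically by walking along the path's channel labels and verifying that the zone index never moves backwards. The sketch below is an illustration only; the six zone sequences are those of R1-R6 from Section 2.1, while the example label sequence is an assumption chosen for demonstration rather than one of the two paths above.

```python
ROUTINGS = {
    'R1': [{'11', '10'}, {'01', '00'}],
    'R2': [{'11', '01'}, {'10', '00'}],
    'R3': [{'11'}, {'01', '00'}, {'10'}],
    'R4': [{'11'}, {'10', '00'}, {'01'}],
    'R5': [{'10'}, {'11', '01'}, {'00'}],
    'R6': [{'01'}, {'11', '10'}, {'00'}],
}

def admits(zones, path_labels):
    """A path is admissible if its labels visit the zones in non-decreasing order."""
    current = 0
    for label in path_labels:
        while current < len(zones) and label not in zones[current]:
            current += 1                   # only forward moves between zones
        if current == len(zones):
            return False                   # the label belongs to an earlier zone
    return True

path = ['11', '10', '10', '00', '01', '00']     # illustrative sequence (assumed)
print({name: admits(zones, path) for name, zones in ROUTINGS.items()})
# Only R1 admits this particular sequence.
```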
Fig. 4. The average message latency of label-based routing algorithms on G1 – G6 with different spanning tree roots and a message length of 64 flits: (a) R6 on G1, (b) R3 on G2, (c) R1 on G3, (d) R6 on G4, (e) R3 on G5, (f) R4 on G6
The labeling of a network is determined by the spanning tree, and the spanning tree is formed based on a root node. Thus, in an irregular topology, the number of different ways in which a graph can be labeled is equal to the number of different spanning trees that can be formed on the graph. Consequently, the performance of a label-based routing algorithm depends on the channel labels and on all the topological parameters that have a direct or indirect effect on the channel labels. Fig. 4 shows the effect of the spanning tree root on the average message latency in G1-G6 using R1, R3, R4, and R6. It can be observed that changing the root of the spanning tree, and in turn the channel labels, causes a substantial difference in the network latencies and in the generation rates at which saturation occurs.
4 Conclusion

First, in addition to covering three previously reported routing algorithms for irregular networks, we proposed three novel deadlock-free routing algorithms in a family of routing algorithms called label-based routing algorithms. Second, this work has confronted the task of evaluating the performance of the mentioned family in irregular networks under realistic conditions. Third, by analyzing the experimental results, we revealed that the network topology, the channel labels, and other topological parameters related to the channel labels have a great influence on the performance of label-based routing algorithms. Therefore, it is not possible to establish a general ranking of the performance of the six aforementioned label-based routing algorithms. With regard to previous work, which models the behavior of routing algorithms in regular networks analytically [8], further research in this line may consider developing such models for irregular networks. Moreover, investigating a general routing methodology for irregular networks and proposing heuristics to compute the best spanning tree on a given topology can be considered for future work.
References

1. Moraveji, R., Sarbazi-Azad, H.: A General Methodology of Routing in Irregular Networks. Technical Report, IPM School of Computer Science, Tehran, Iran (2007)
2. Schroeder, M.D., et al.: Autonet: A High-speed, Self-configuring Local Area Network Using Point-to-point Links. J. Selected Areas in Communication 9, 1318–1335 (1991)
3. Koibuchi, M., Funahashi, A., Jouraku, A., Amano, H.: L-Turn Routing: An Adaptive Routing in Irregular Networks. In: International Parallel Processing Conference, pp. 383–392 (2001)
4. Sancho, J.C., Robles, A., Duato, J.: An Effective Methodology to Improve the Performance of the Up*/Down* Routing Algorithm. IEEE Transactions on Parallel and Distributed Systems 15, 740–745 (2004)
5. Lysne, O., Skeie, T., Reinemo, S., Theiss, I.: Layered Routing in Irregular Networks. IEEE Transactions on Parallel and Distributed Systems 17, 51–65 (2006)
6. Puente, V., Gregorio, J.A., Vallejo, F., Beivide, R.: High-performance Adaptive Routing for Networks with Arbitrary Topology. J. System Architecture 52, 345–358 (2006)
7. Duato, J., Yalamanchili, S., Ni, L.M.: Interconnection Networks: An Engineering Approach. IEEE Computer Society Press, Los Alamitos (2003)
8. Moraveji, R., Sarbazi-Azad, H., Nayebi, A., Navi, K.: Performance Modeling of Wormhole Hypermeshes under Hot-spot Traffic. In: Diekert, V., Volkov, M.V., Voronkov, A. (eds.) CSR 2007. LNCS, vol. 4649, pp. 290–302. Springer, Heidelberg (2007)
On the Probability of Facing Fault Patterns: A Performance and Comparison Measure of Network Fault-Tolerance

Farshad Safaei1,2, Ahmad Khonsari3,2, and Reza Moraveji1,2

1 Dept. of ECE, Shahid Beheshti Univ., Tehran, Iran
2 IPM School of Computer Science, Tehran, Iran
3 Dept. of ECE, Univ. of Tehran, Tehran, Iran
[email protected], {ak,moraveji_r}@ipm.ir
Abstract. An important issue in the design and deployment of interconnection networks is network fault-tolerance for various types of failures. In designing parallel processors that use the torus as the underlying interconnection topology, as well as in designing real applications on such machines, estimates of the network reliability and fault-tolerance are important in choosing the routing algorithms and predicting their performance in the presence of faulty nodes. Under the node-failure model, the faulty nodes may coalesce into fault patterns, which are classified into two major categories, i.e., convex (|-shaped, □-shaped) and concave (L-shaped, T-shaped, +-shaped, H-shaped, U-shaped) regions. In this correspondence, we propose the first solution for computing the probability of a message facing the fault patterns in tori, both for convex and concave regions, and verify it using simulation experiments. Our approach works for any number of faults as long as the network remains connected. We use these models to measure the network fault-tolerance that can be achieved by adaptive routing, and to assess the impact of various fault patterns on the performance of such networks.
1 Introduction

Communication in faulty networks is a classical field in network theory. In practice, one cannot expect nodes or communication links to work without complications. Software or hardware faults may cause nodes or links to go down. To be able to cope with faults without serious degradation of the service, networks and routing protocols have to be set up so that they are fault-tolerant. Several recent studies address fault-tolerance in a diverse range of systems and applications [1-12]. Almost all of the performance evaluation studies for the functionality of these systems, however, have relied solely on simulation experiments. The limitations of simulation-based studies are that they are highly time-consuming and expensive. Effective analytical models are necessary for predicting the behavior of large networks to help weigh the cost-performance trade-offs of various adaptive routing algorithms. In this correspondence, we focus specifically on the impact of the fault patterns, which permits an analytical model to predict the probability of facing fault patterns experienced by a
message when an adaptive routing scheme is used. To the best of our knowledge, no study has so far been reported in the literature on calculating the probability of a message facing fault patterns in order to examine the relative performance merits of adaptive fault-tolerant routing algorithms. In this paper, we investigate the characteristics of fault patterns which are suitable for modeling faults in interconnection networks, particularly in the torus topology. Our approach employs theoretical results from algebra and combinatorics to calculate the probability of facing fault patterns in a 2-D torus. Deriving expressions for characterizing fault patterns plays a critical role in studying the performance of faulty networks by means of mathematical analysis. The rest of the correspondence is organized as follows. In Section 2, we describe the basic properties of the torus topology as well as fault-tolerance in networks. In Section 3, we derive an analytical model for calculating the probability of a message facing the fault patterns. In Section 4, the analytical results and a comparison with simulation experiments are presented. Finally, Section 5 draws conclusions.
2 Terminologies

This section starts with a discussion of the torus structure and then provides a short summary of fault-tolerance in interconnection networks. Some of these definitions are reiterated from previous works [6, 7, 10-12] for the sake of completeness.

2.1 The Torus Topology

The torus has been a popular interconnection network topology in contemporary systems [6] due to its desirable properties, such as ease of implementation and the ability to exploit communication locality to reduce message latency [12]. In addition, the torus is a regular (i.e., all nodes have the same degree) and edge-symmetric network, which improves load balancing across the channels [13].

Definition 1 [13]: An R × C 2-D torus network is denoted by TR×C. Each node
(x1, y1) is connected to its four neighbors (x1 ± 1 mod R, y1) and (x1, y1 ± 1 mod C). Therefore, the total number of channels in the torus TR×C is E = 2 × R × C.

2.2 Network Fault-Tolerance

The growth of parallel applications on multiprocessor systems-on-chip (MP-SoCs), multicomputers, cluster computers, and peer-to-peer systems motivates interest in parallel computer networks. The construction of such networks, connecting a large population of processing units and components (such as routers, channels and connectors), poses several challenges. First, the selected routing algorithm should reflect the full potential of the underlying network topology. Second, connectivity among active nodes of the interconnection network should be maintained, even in the presence of high failure rates or when a large portion of nodes is not active. To address these issues, adaptive fault-tolerant routing algorithms have been frequently suggested as a means of providing continuous operation in the presence of one or more
failures by allowing graceful system degradation. In designing a fault-tolerant routing algorithm, a suitable fault model is one of the most important issues [7-10]. The fault model should reflect fault situations in a real system. Rectangular fault models (also known as block faults) are the most common approach to model faulty nodes and to facilitate routing in 2-D tori [9]. However, rectangular fault regions sacrifice many non-faulty nodes, and hence their resources are wasted. In order to reduce the number of non-faulty nodes included in rectangular fault regions, many studies have addressed the concept of fault patterns with different shapes [7-10, 12], which may form convex or concave regions. A convex region is defined as a region ϕ in which a line segment connecting any two points in ϕ lies entirely within ϕ. If we change the "line segment" in the standard convex region definition to "horizontal or vertical line segment", the resulting region is called a rectilinear convex region [7, 10]. Any region that is not convex is a concave region. Examples of convex regions are the |-shape and □-shape, and examples of concave regions are the L-shape, U-shape, T-shape, H-shape, and +-shape. Detailed mathematical expressions for the characterization of the most common concave and convex fault regions in torus and mesh networks have been reported in [14]. For a comprehensive survey of the important issues of fault-tolerant systems and networks, the reader is referred to the articles in [12-15].
3 Mathematical Analysis

This section starts with a description of the assumptions used in the construction of the analytical models. The derivation and implementation procedure of the mathematical models are then presented. After that, the proposed models are validated through simulation experiments.

3.1 Assumptions

The analytical models are based on common assumptions that have been widely accepted in the literature [6, 7, 9, 11-15]:
i. Messages are uniformly directed to other network nodes.
ii. Messages are routed adaptively through the network. Further, a message is assumed to always follow one of the available shortest paths in the absence of faults.
iii. The probabilities of node failure in the network are equiprobable and independent of each other. Moreover, fault patterns are static [6, 7, 9-15] and do not disconnect the network.
iv. Nodes are more complex than links and thus have higher failure rates [7, 12, 14]. So, we assume only node failures.

3.2 Calculating the Probability of Message Facing Faulty Patterns

Consider an R × C torus network in which some faulty nodes have formed one of the fault patterns, such that the faulty nodes do not disconnect the network. We call such a network a connected R × C torus with the X-shape fault pattern.
In this section, our goal is to calculate the probability of a message facing the existing fault pattern in the connected R × C torus network in the presence of the X-shape fault pattern.

Remark: A path facing the fault pattern means that one or more points of the fault-pattern point set reside on the given path.

In the torus network, the position of the fault pattern does not play an important role, since, by changing coordinates, we can translate the pattern to any other location in the network without changing the relative positions of the nodes. Therefore, we can obtain the exact shape of the pattern by knowing its type and some characteristics. We denote the set of these characteristics of the X-shape fault pattern by SX. For the rectangular fault pattern, the determining characteristics are its width and height, indicated by l and h, respectively; thus S: (l, h). For instance, for the |-shape fault pattern, the determining characteristic of this line segment is its vertical height or its horizontal width:

S|: (1, h)   (vertical line segment)
S|: (l, 1)   (horizontal line segment)
Fig. 1 depicts some of the most common fault patterns together with their precise determining characteristics. For fault patterns whose horizontal and vertical determining characteristics differ, it is possible to transform the horizontal case into the vertical case by interchanging the roles of R and C in the torus network. Therefore, for all the proposed fault patterns, the determining characteristics of the vertical case are sufficient. In case the set of fault points does not match any of the above-mentioned fault patterns, we should know the coordinates of the points of the new fault pattern or define new characteristics according to its shape. Here, we investigate those fault patterns for which we know the determining characteristics of their exact shape in addition to having information about their general shape. The set of X-shape fault pattern points with characteristics SX is denoted by F(X; SX), and the probability of a path confronting it is denoted by P(X; SX). In order to calculate P(X; SX), we should enumerate all existing paths facing the X-shape fault pattern and divide their number by the number of all existing paths in the connected R × C torus network. This probability is expressed formally as
Phit = (number of minimal paths crossing the fault region) / (number of all minimal paths existing in the network)    (1)
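Equation (1) can be checked by brute force on small instances. The sketch below is an illustration only, not the authors' analytical model or simulator, and its counting conventions (shortest paths measured in the full torus, 0-based coordinates, equal-length wrap directions both counted) are assumptions; it enumerates, for every pair of non-faulty nodes, the minimal paths and those that avoid a given fault set, and reports Phit.

```python
from collections import deque
from itertools import product

def torus_adj(R, C):
    """Adjacency of an R x C 2-D torus with 0-based coordinates."""
    return {(x, y): [((x + 1) % R, y), ((x - 1) % R, y),
                     (x, (y + 1) % C), (x, (y - 1) % C)]
            for x, y in product(range(R), range(C))}

def minimal_path_counts(adj, src, faulty):
    """BFS shortest-path DAG counting: total minimal paths from src to every node,
    and minimal paths whose intermediate nodes avoid the fault set."""
    dist, total, avoid = {src: 0}, {src: 1}, {src: 1}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], total[v], avoid[v] = dist[u] + 1, 0, 0
                queue.append(v)
            if dist[v] == dist[u] + 1:       # (u, v) lies on a shortest path
                total[v] += total[u]
                if u not in faulty:          # internal nodes must be fault-free
                    avoid[v] += avoid[u]
    return total, avoid

def probability_of_facing(R, C, faulty):
    adj, faulty = torus_adj(R, C), set(faulty)
    healthy = [v for v in adj if v not in faulty]
    hit = miss = 0
    for s in healthy:
        total, avoid = minimal_path_counts(adj, s, faulty)
        for t in healthy:
            if t != s:
                miss += avoid[t]
                hit += total[t] - avoid[t]
    return hit / (hit + miss)                # equation (1)

# Assumed example: a 6 x 5 torus with a |-shape fault of height 4 in one column.
print(round(probability_of_facing(6, 5, [(1, 2), (2, 2), (3, 2), (4, 2)]), 3))
```

Because the exact conventions (coordinate origin, handling of equal-length wrap directions) may differ from the paper's, the numbers produced by such a sketch need not coincide exactly with those reported in Table 1.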
The following theorem provides the total number of minimal paths in the network.

Theorem 1: In a connected R × C torus network with an X-shape fault region, the number of all existing paths between the pairs of non-faulty nodes is given by
Fig. 1. Examples of fault patterns in a 2-D torus network
∑_{a,b ∈ v(TR×C) \ F(X;SX)} LT(a, b)    (2)
where F(X; SX) is the set of X-shape faulty points with determining characteristics SX.

Proof: Consider a connected R × C torus with the X-shape fault pattern, and consider two non-faulty points a and b in this network. The number of minimal paths from a to b is given by LT(a, b).
Thus, the number of all existing paths in the above network can be calculated as the aggregate of the numbers of minimal paths between all pairs of non-faulty points in the network,

∑_{a,b ∈ v(TR×C) \ F(X;SX)} LT(a, b),

which completes the proof. ■
Example 1: Consider an 8 × 7 torus network in which the embedded T-shaped fault region has the three determining characteristics l = 7, h = 5, and h1 = 3 (see Fig. 2). We wish to route messages from point a to point b. In this network, there is a minimal path from a to b as follows:

a = (6, 3) → (6, 2) → (6, 1) → (6, 7) → (5, 7) → (4, 7) = b.

Therefore, the set of first components of the nodes along this path is {6, 5, 4} and the set of second components of the nodes along this path is {3, 2, 1, 7}. So we get

M(a, b) = {6, 5, 4} × {3, 2, 1, 7} = {(6, 3), (6, 2), (6, 1), (6, 7), (5, 3), (5, 2), (5, 1), (5, 7), (4, 3), (4, 2), (4, 1), (4, 7)}.
Fig. 2. (a) A torus network with two arbitrary points a and b in the presence of T-shaped fault pattern; (b) Demonstration of M (a, b) mesh subnetwork.
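The subnetwork M(a, b) can be generated directly from the two coordinate ranges that a minimal path between a and b traverses. The sketch below is a small illustration under stated assumptions (1-based coordinates as in Example 1 and a unique shorter wrap direction in each dimension); it is not part of the paper's formal development, but it reproduces the set computed in Example 1.

```python
from itertools import product

def coord_range(src, dst, size):
    """Labels visited in one torus dimension when moving from src to dst along
    the shorter wrap direction (1-based labels 1..size)."""
    forward, backward = (dst - src) % size, (src - dst) % size
    step = 1 if forward <= backward else -1
    return [((src - 1 + step * i) % size) + 1 for i in range(min(forward, backward) + 1)]

def M(a, b, R, C):
    return set(product(coord_range(a[0], b[0], R), coord_range(a[1], b[1], C)))

# Example 1: an 8 x 7 torus with a = (6, 3) and b = (4, 7).
print(sorted(M((6, 3), (4, 7), 8, 7)))    # the 12 points listed in Example 1
```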
Before we proceed to calculate Equation (1), we pause to give a few definitions; then we present and prove a theorem.

Definition 2: Let a and b be two non-faulty points of v(TR×C). For any two arbitrary points Ci = (xCi, yCi) and Cj = (xCj, yCj) of R(M(a, b)), the number of possible paths from Cj to Ci such that the direction of each path in dimension X (Y) is collinear with the direction of a path from a to b in dimension X (Y) is indicated by LMa,b(Cj, Ci), and is given by

LMa,b(Cj, Ci) = ( (Θx^{a,b}(Cj, Ci) + Θy^{a,b}(Cj, Ci)) choose Θx^{a,b}(Cj, Ci) )    (3)
in which Θx^{a,b}(Cj, Ci) is a function indicating the number of orientations along a path from Cj to Ci in dimension X which are collinear with the orientations along a path from a to b, and its criterion is expressed as

Θx^{a,b}(Cj, Ci) =
  Δx(Cj, Ci)    if (0 ≥ xb − xa ≥ −⌊R/2⌋ or xb − xa > ⌊R/2⌋) and (0 ≥ xCi − xCj ≥ −⌊R/2⌋ or xCi − xCj > ⌊R/2⌋),
  Δx(Cj, Ci)    if (0 ≤ xb − xa ≤ ⌊R/2⌋ or xb − xa < −⌊R/2⌋) and (0 ≤ xCi − xCj ≤ ⌊R/2⌋ or xCi − xCj < −⌊R/2⌋),
  −Δx(Cj, Ci)   otherwise.    (4)
Similarly, we can obtain the criterion of the function Θy^{a,b}(Cj, Ci) by interchanging the roles of X and Y (and of R and C):

Θy^{a,b}(Cj, Ci) =
  Δy(Cj, Ci)    if (0 ≥ yb − ya ≥ −⌊C/2⌋ or yb − ya > ⌊C/2⌋) and (0 ≥ yCi − yCj ≥ −⌊C/2⌋ or yCi − yCj > ⌊C/2⌋),
  Δy(Cj, Ci)    if (0 ≤ yb − ya ≤ ⌊C/2⌋ or yb − ya < −⌊C/2⌋) and (0 ≤ yCi − yCj ≤ ⌊C/2⌋ or yCi − yCj < −⌊C/2⌋),
  −Δy(Cj, Ci)   otherwise.    (5)
Theorem 2: Given that a and b are two non-faulty points of a torus network and R(M(a, b)) = {C1, C2, …, Ck}, the number of paths from a to b that do not traverse the points C1, C2, …, Ck can be calculated as

det_{0 ≤ i, j ≤ k} dij(a, b)    (6)

where

d0j(a, b) = LMa,b(Cj, Ck+1),  j = 0, 1, …, k,
dij(a, b) = LMa,b(Cj, Ci),    i = 1, 2, …, k,  j = 0, 1, …, k.    (7)
Proof: The proof is quite involved and we omit it due to lack of space; the interested reader is referred to [16].

Theorem 3: Let TR×C be a connected R × C torus network with an X-shape fault region having characteristics SX. The number of paths in TR×C not facing the fault pattern is expressed as
∑_{a,b ∈ v(TR×C) \ F(X;SX)}  det_{0 ≤ i, j ≤ Ca,b} dij(a, b)    (8)
in which Ca,b is the number of elements of R(M(a, b)); that is, |R(M(a, b))| = Ca,b.

Proof: Consider two arbitrary points a and b from the set v(TR×C) \ F(X; SX). According to Theorem 2, the number of minimal paths from a to b not traversing the points of R(M(a, b)) equals det_{0 ≤ i, j ≤ Ca,b} dij(a, b). Therefore, the number of minimal paths from a to b not crossing the F(X; SX) points is also equal to det_{0 ≤ i, j ≤ Ca,b} dij(a, b). So, the number of all existing paths in TR×C not traversing the F(X; SX) points is equal to the aggregate of the numbers of such paths between all pairs of non-faulty points in TR×C, that is,

∑_{a,b ∈ v(TR×C) \ F(X;SX)}  det_{0 ≤ i, j ≤ Ca,b} dij(a, b).  ■
It follows from the preceding theorem that the probability that a path in TR×C does not face the fault pattern, Pmiss, is given by

Pmiss = ( ∑_{a,b ∈ v(TR×C) \ F(X;SX)} det_{0 ≤ i, j ≤ Ca,b} dij(a, b) ) / ( ∑_{a,b ∈ v(TR×C) \ F(X;SX)} LT(a, b) )    (9)
Therefore, it is trivial that

P(X; SX) = Phit = 1 − Pmiss = 1 − ( ∑_{a,b ∈ v(TR×C) \ F(X;SX)} det_{0 ≤ i, j ≤ Ca,b} dij(a, b) ) / ( ∑_{a,b ∈ v(TR×C) \ F(X;SX)} LT(a, b) ).    (10)
4 Experimental Results

In the previous sections, we derived mathematical expressions to calculate the probability of facing the fault patterns. These analytical expressions form the core of computations for other fault patterns and other topologies and can be extensively generalized. An experimental approach is necessary to verify the analytical evaluation to which the mathematical analysis led. A program has been developed which simulates the failure of nodes and the subsequent construction of the corresponding fault patterns. The simulator generates faults in the network so that the resulting fault regions are convex or concave. It also checks that all nodes in the network are still connected using adaptive routing. The objective of the simulation is to measure the values of the probability of facing the fault patterns for different numbers of faulty nodes in the torus topology. For every run, the simulator creates the fault pattern and keeps statistics of the following data:
• The number of minimal paths crossing the network.
• The number of minimal paths confronting the fault pattern.
• For each source-destination pair, the probability of facing the fault pattern is computed.

Table 1 reveals the results obtained from the simulation experiments and the mathematical models in the torus for different sizes of the network and various shapes of fault patterns (a minimal sketch of the fault-generation step of such an experiment is given after the table).

Table 1. Experimental results of the probability of facing fault patterns in the torus with different fault patterns and various sizes of the network, which agree with the analytical expressions

                                             Torus network (TR×C)
Fault pattern characteristics              | 9×13  | 10×10 | 11×9  | 6×7   | 6×5
|-shape, h=4                               | 0.173 | 0.165 | 0.147 | 0.214 | 0.194
|-shape, l=3                               | 0.109 | 0.131 | 0.134 | 0.156 | 0.191
||-shape, h=h1=3, l=3, h'=2 (case 1)       | 0.208 | 0.219 | 0.201 | 0.310 | 0.377
||-shape, h=4, h1=3, l=2, h'=2 (case 2)    | 0.212 | 0.201 | 0.183 | 0.257 | 0.244
L-shape, h=3, l=3                          | 0.165 | 0.182 | 0.170 | 0.229 | 0.263
L-shape, h=4, l=5                          | 0.247 | 0.286 | 0.273 | 0.395 | 0.429
T-shape, l=4, h=3, l1=2                    | 0.178 | 0.201 | 0.194 | 0.261 | 0.305
T-shape, l=5, h=4, l1=4                    | 0.237 | 0.261 | 0.257 | 0.375 | 0.397
U-shape, l=3, h=4, h1=2                    | 0.207 | 0.219 | 0.202 | 0.296 | 0.353
□-shape, l=3, h=2                          | 0.133 | 0.150 | 0.147 | 0.170 | 0.195
H-shape, l=3, h=4, h1=3, h'=3, h'1=2       | 0.213 | 0.226 | 0.209 | 0.317 | 0.392
H-shape, l=5, h=4, l1=4, h1=2              | 0.216 | 0.243 | 0.239 | 0.339 | 0.346
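A minimal sketch of the fault-generation step of such an experiment is given below; it is an illustration only, not the authors' simulator, and the exact geometric convention used to place the L-shape is an assumption. It builds an L-shaped fault set from its characteristics and verifies that the surviving nodes of the torus remain connected, which is the precondition required before Phit is measured.

```python
from collections import deque
from itertools import product

def l_shape(anchor, l, h):
    """Faulty nodes of an L-shape of height h and length l anchored at `anchor`
    (a vertical bar plus a horizontal bar; placement convention assumed)."""
    x0, y0 = anchor
    return {(x0 + i, y0) for i in range(h)} | {(x0 + h - 1, y0 + j) for j in range(l)}

def still_connected(R, C, faulty):
    """BFS over the non-faulty nodes of the R x C torus."""
    healthy = {v for v in product(range(R), range(C)) if v not in faulty}
    start = next(iter(healthy))
    seen, queue = {start}, deque([start])
    while queue:
        x, y = queue.popleft()
        for v in (((x + 1) % R, y), ((x - 1) % R, y), (x, (y + 1) % C), (x, (y - 1) % C)):
            if v in healthy and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == healthy

faults = l_shape((1, 1), l=3, h=3)     # L-shape with l = 3, h = 3 as in Table 1
print(sorted(faults), still_connected(6, 7, faults))
```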
5 Conclusions

In recent years, efforts have been made to integrate the performance and reliability of adaptive routing algorithms in order to overcome the drawbacks of the traditional evaluation methods for interconnection networks. For this purpose, a new performance metric of network reliability, the probability of facing fault patterns, has been introduced. It is used to assess the performance-related reliability of such routing schemes in the presence of fault patterns, which can be categorized into two major classes of convex (|-shape, □-shape) and concave (L-shape, U-shape, +-shape, T-shape, and H-shape) regions. In this paper, we have derived mathematical expressions for calculating the probability of a message facing fault patterns in adaptively-routed torus networks. Predicting network measures, such as message latency and channel waiting times, throughout a faulty network is an application of the results derived in this paper. Since
the mesh topology has become a popular interconnection architecture for constructing massively parallel computers, a more challenging extension of our work would be to propose mathematical expressions for fault patterns in the well-known mesh topologies.
References

1. Chakravorty, S., Kalé, L.V.: A Fault Tolerant Protocol for Massively Parallel Systems. In: Proceedings of the 16th International Symposium on Parallel and Distributed Processing (2004)
2. Al-Karaki, J.N.: Performance Analysis of Repairable Cluster of Workstations. In: Proceedings of the 16th International Symposium on Parallel and Distributed Processing (2004)
3. Karimou, D., Myoupo, J.F.: A Fault-Tolerant Permutation Routing Algorithm in Mobile Ad-Hoc Networks. In: Lorenz, P., Dini, P. (eds.) ICN 2005. LNCS, vol. 3421, pp. 107–115. Springer, Heidelberg (2005)
4. Gupta, G., Younis, M.: Fault-tolerant clustering of wireless sensor networks. In: IEEE Conf. on Wireless Communications and Networking, pp. 1579–1584 (2003)
5. Pande, P.P., et al.: Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect Architectures. IEEE Trans. Computers 54(8), 1025–1040 (2005)
6. Dao, B.V., Duato, J., Yalamanchili, S.: Dynamically configurable message flow control for fault-tolerant routing. IEEE Transactions on Parallel and Distributed Systems 10(1), 7–22 (1999)
7. Suh, Y.J., et al.: Software-based rerouting for fault-tolerant pipelined communication. IEEE Trans. on Parallel and Distributed Systems 11(3), 193–211 (2000)
8. Chen, C.L., Chiu, G.M.: A fault-tolerant routing scheme for meshes with nonconvex faults. IEEE Trans. on Parallel and Distributed Systems 12(5), 467–475 (2001)
9. Shih, J.-D.: Fault-tolerant wormhole routing in torus networks with overlapped block faults. IEE Proc.-Comput. Digit. Tech. 150(1), 29–37 (2003)
10. Wu, J., Jiang, Z.: On Constructing the Minimum Orthogonal Convex Polygon in 2-D Faulty Meshes. IPDPS (2004)
11. Theiss, I.: Modularity, Routing and Fault Tolerance in Interconnection Networks. PhD thesis, Faculty of Mathematics and Natural Sciences, University of Oslo (2004)
12. Gómez, M.E., et al.: A Routing Methodology for Achieving Fault Tolerance in Direct Networks. IEEE Transactions on Computers 55(4), 400–415 (2006)
13. Duato, J., Yalamanchili, S., Ni, L.M.: Interconnection networks: An engineering approach. Morgan Kaufmann Publishers, San Francisco (2003)
14. Hoseiny Farahabady, M., Safaei, F., Khonsari, A., Fathy, M.: Characterization of Spatial Fault Patterns in Interconnection Networks. Journal of Parallel Computing 32(11-12), 886–901 (2006)
15. Xu, J.: Topological structure and analysis of interconnection networks. Kluwer Academic Publishers, Dordrecht (2001)
16. Safaei, F., Fathy, M., Khonsari, A., Gilak, M., Ould-Khaoua, M.: A New Performance Measure for Characterizing Fault-Rings in Interconnection Networks. Journal of Information Sciences (submitted, 2007)
Cost-Minimizing Algorithm for Replica Allocation and Topology Assignment Problem in WAN

Marcin Markowski and Andrzej Kasprzak

Wroclaw University of Technology, Chair of Systems and Computer Networks, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
{marcin.markowski,andrzej.kasprzak}@pwr.wroc.pl
Abstract. The paper deals with the problem of simultaneous assignment of network topology and server replica placement in a wide area network in order to minimize a criterion composed of the leasing capacity cost and the building cost of the network. An exact algorithm, based on the branch and bound method, is proposed to solve the problem. The algorithm takes into account the problem of ensuring the reliability of the network. Some interesting features observed during the computational experiments are reported.

Keywords: resource replication, topology assignment, WAN.
1 Introduction

Designing or modernizing a wide area network (WAN) consists in the assignment of resource allocation (i.e., servers and replicas of servers), topology and flow routes. The optimal arrangement of resources and the optimal allocation of network channels let us obtain the most efficient and economical solution. Designing wide area computer networks is always a compromise between the quality of service and the reliability of the network on the one hand, and the costs needed to build and to support the network on the other. Quality of service and the network costs are criteria often used during the designing process. In [1] we considered a problem based on designing the WAN under the assumption that the maximal support cost of the network is bounded. In [2] we proposed an algorithm for the CFA problem with a cost criterion. In those papers the reliability requirements were not considered. In many cases it is useful to formulate the optimization problem as follows: minimize the cost of the network when the acceptable quality level and the required reliability level are given. In this paper, therefore, the problem of server replication and topology assignment with a cost criterion, a delay constraint and a reliability constraint is considered. In our opinion it is well-founded to consider two kinds of cost: the building cost of the network, borne once, and the supporting cost (i.e., connected with the capacity leasing), borne regularly. The criterion function is then composed of two ingredients: the regular channel capacity leasing cost and the one-time server cost. We assume that the maximal acceptable total average delay in the network is given as a constraint.
Designing the wide area network topology consists in the assignment of channel locations and the choice of channel capacities. A properly designed network topology should ensure communication between all pairs of nodes in the WAN: there must be at least one path between each pair of nodes in the network. In case of a channel or node failure, all paths leading through this channel or node become unserviceable, and when only one path exists between a pair of nodes, the failure of any of the path elements makes communication between those nodes impossible. To ensure the reliability and survivability of the network, several different paths should exist between each pair of nodes. Paths are different when they do not have any common channel or node, except the source and destination nodes. Usually the minimal condition imposed on the network is to ensure two paths between each node pair. Some problems and solutions related to network reliability are presented in [3, 4]. We assume that the minimal number of paths between each pair of nodes is given as a constraint; this allows us to design a network topology with the required reliability level. The problem considered here may be formulated as follows:

given: user allocation at nodes; the set of nodes to which replicas may be connected, for each server; the maximal value of the total average delay in the network; traffic requirements (user-user and user-server); the set of potential channels; capacities and their costs (i.e., the cost-capacity function) for each potential channel;
minimize: a linear combination of the capacity leasing cost and the server cost;
over: server allocation, channel capacities and the multicommodity flow;
subject to: multicommodity flow constraints, channel capacity constraints, server allocation constraints, the total average delay constraint, and the network reliability constraint (a minimal number of different paths between nodes).

We consider a discrete cost-capacity function because it is the most important from the practical point of view: the channel capacities can be chosen from the sequence defined by international ITU-T recommendations. The problem formulated in this way is NP-complete, as it is more general than the capacity and flow assignment (CFA) problem with a discrete cost-capacity function, which is NP-complete [5]. The literature focusing on the simultaneous server replication and topology assignment problem is very limited. Some algorithms for this problem with a different delay criterion may be found in [1, 6]. The problem formulated here is more general: it uses a cost criterion and takes into account the maximal acceptable average delay in the WAN as a constraint. Moreover, it takes into account some aspects of the reliability of the designed network. Such a formulation of the problem has not been considered in the literature yet.
2 Problem Formulation
Let $n$ be the number of nodes of the wide area network and $b$ the number of potential channels which may be used to build the network. For each potential channel $i$ there is a set $C^i = \{c^i_1, \ldots, c^i_{s(i)-1}\}$ of alternative capacity values from which exactly one must be chosen if the $i$-th channel is used to build the WAN. Let $d^i_j$ be the cost of leasing the capacity $c^i_j$ [€/month]. Let $c^i_{s(i)} = 0$ for $i = 1, \ldots, b$, and let $\overline{C}^i = C^i \cup \{c^i_{s(i)}\}$ be the set of alternative capacities from among which exactly one must be used for channel $i$. If the capacity $c^i_{s(i)}$ is chosen, then the $i$-th channel is not used to build the wide area network. Let $x^i_j$ be the decision variable, equal to one if the capacity $c^i_j$ is assigned to channel $i$ and equal to zero otherwise. Since exactly one capacity from the set $\overline{C}^i$ must be chosen for channel $i$, the following condition must be satisfied:

$$\sum_{j=1}^{s(i)} x^i_j = 1 \quad \text{for } i = 1, \ldots, b. \tag{1}$$
Let $W^i = \{x^i_1, \ldots, x^i_{s(i)}\}$ be the set of variables $x^i_j$ which correspond to the $i$-th channel. Let $X'_r$ be a permutation of values of all variables $x^i_j$ for which condition (1) is satisfied, and let $X_r$ be the set of variables which are equal to one in $X'_r$. When designing the wide area network topology (channel allocation), the reliability of the network should be ensured; in particular, in case of a failure of a link (channel) or a node, some routes must be redirected to another path. Let $PN$ be the least number of different paths between each pair of nodes in the network. Paths between two nodes are different only when they do not have any common nodes or channels. Let $MPN$ be the minimal number of paths which have to exist between all pairs of nodes of the WAN; ensuring $MPN$ paths is very important for the reliability of the network. Let $K$ denote the total number of servers which must be allocated in the WAN and let $LK_k$ denote the number of replicas of the $k$-th server. Let $M_k$ be the set of nodes to which the $k$-th server (or a replica of the $k$-th server) may be connected, and let $e(k)$ be the number of all possible allocations of the $k$-th server. Since only one replica of a server may be allocated at a node, the following condition must be satisfied:

$$LK_k \le e(k) \quad \text{for } k = 1, \ldots, K. \tag{2}$$

Let $y_{kh}$ be the binary decision variable for the $k$-th server allocation; $y_{kh}$ is equal to one if a replica of the $k$-th server is connected to node $h$, and equal to zero otherwise. Since $LK_k$ replicas of the $k$-th server must be allocated in the network, the following condition must be satisfied:

$$\sum_{h \in M_k} y_{kh} = LK_k \quad \text{for } k = 1, \ldots, K. \tag{3}$$
Let $Y_r$ be the set of all variables $y_{kh}$ which are equal to one. The pair of sets $(X_r, Y_r)$ is called a selection. Let $\Re$ be the family of all selections. $X_r$ determines the network topology and the capacities of channels, and $Y_r$ determines the replica allocation at the nodes of the WAN. Let $T(X_r, Y_r)$ be the minimal average delay per packet in a WAN in which the values of channel capacities are given by $X_r$ and the traffic requirements are given by $Y_r$ (depending on the server replica allocation). $T(X_r, Y_r)$ can be obtained by solving a multicommodity flow problem in the network [7]. Let $U(Y_r)$ be the server cost and let
$D(X_r)$ be the capacity cost. Let $Q(X_r, Y_r)$ be a linear combination of the capacity cost and the server cost:

$$Q(X_r, Y_r) = \alpha D(X_r) + \beta U(Y_r) \tag{4}$$
where $\alpha$ and $\beta$ are positive coefficients, $\alpha, \beta \in [0,1]$, $\alpha + \beta = 1$. Let $T^{max}$ be the maximal acceptable average delay per packet in the WAN. Then the considered server allocation, capacity and flow assignment problem in the WAN with a total average delay constraint is formulated as follows:

$$\min_{(X_r, Y_r)} Q(X_r, Y_r) \tag{5}$$

subject to

$$PN_{X_r} \ge MPN \tag{6}$$
$$(X_r, Y_r) \in \Re \tag{7}$$
$$T(X_r, Y_r) \le T^{max} \tag{8}$$

where $PN_{X_r}$ is the least number of different paths between each pair of nodes in the network in which the values of channel capacities are given by $X_r$.
3 Calculation Scheme of the Branch and Bound Algorithm
Assuming that $LK_k = 1$ for $k = 1, \ldots, K$ and $\overline{C}^i = C^i$ for $i = 1, \ldots, b$, and omitting constraint (6), the problem (5)–(8) reduces to the "host allocation, capacity and flow assignment problem". Since the host allocation, capacity and flow assignment problem is NP-complete [6, 7], the problem (5)–(8) is also NP-complete as it is more general. The branch and bound method can therefore be used to construct an exact algorithm. Starting with a selection $(X_1, Y_1) \in \Re$, we generate a sequence of selections $(X_s, Y_s)$. Each selection $(X_s, Y_s)$ is obtained from a certain selection $(X_r, Y_r)$ of the sequence by complementing one variable $x^i_j$ (or $y_{kh}$) with another variable from $W^i$ (or from $\{y_{km} : m \in M_k \text{ and } m \ne h\}$). For each selection $(X_r, Y_r)$ we constantly fix a subset $F_r \subseteq X_r \cup Y_r$ and momentarily fix a set $F^t_r$. The variables in $F_r$ are constantly fixed and represent the path
from the initial selection $(X_1, Y_1)$ to the selection $(X_r, Y_r)$. Each momentarily fixed variable in $F^t_r$ is a variable abandoned during the backtracking process. Variables which belong neither to $F_r$ nor to $F^t_r$ are called free in $(X_r, Y_r)$. There are two important elements of the branch and bound method: the testing operation (a lower bound of the criterion function) and the branching rules. In the next sections of the paper the testing operation and the choice operation are therefore proposed. The lower bound $LB_r$ and the branching rules are calculated for each selection $(X_r, Y_r)$. The lower bound is calculated to check whether a "better" solution (a selection $(X_s, Y_s)$) may be found. If the test is negative, we abandon the considered selection $(X_r, Y_r)$ and backtrack to the selection $(X_p, Y_p)$ from which the selection $(X_r, Y_r)$
was generated. The basic task of the branching rules is to find the variables whose complementing generates a new selection with the least possible value of the criterion function. A detailed description of the calculation scheme of the branch and bound method may be found in [8].
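The scheme can be pictured with the depth-first skeleton below. It is a generic branch-and-bound sketch in which the selection type and the four operations are placeholders for the routines described in Sections 4 and 5; it is not a reproduction of the exact scheme of [8].

```cpp
#include <algorithm>
#include <functional>
#include <limits>
#include <vector>

// Generic depth-first branch-and-bound skeleton.  "Node" stands for a selection
// (X_r, Y_r) together with its fixed sets F_r and F_r^t.
template <typename Node>
double branchAndBound(const Node& root,
                      std::function<double(const Node&)> lowerBound,   // LB_r of the relaxation
                      std::function<double(const Node&)> criterion,    // Q(X_r, Y_r), eq. (4)
                      std::function<bool(const Node&)> feasible,       // constraints (6)-(8)
                      std::function<std::vector<Node>(const Node&)> successors)
{
    double best = std::numeric_limits<double>::infinity();
    std::vector<Node> stack{root};                     // depth-first search, explicit backtracking
    while (!stack.empty()) {
        Node current = stack.back();
        stack.pop_back();
        if (lowerBound(current) >= best) continue;     // testing operation: prune and backtrack
        if (feasible(current))
            best = std::min(best, criterion(current)); // record the incumbent solution
        for (const Node& s : successors(current))      // branch by complementing x_j^i or y_kh
            stack.push_back(s);
    }
    return best;                                       // value of (5) when the search completes
}
```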
4 The Lower Bound
Since the traffic requirements in the network depend on the server allocation, obtaining a lower bound for the problem (5)–(8) is difficult. We propose that the lower bound be obtained by relaxing some constraints and by approximating the discrete cost-capacity curves with their lower linear envelope [5, 7]. To find the lower bound $LB_r$ of the criterion function (4) we reformulate the problem (5)–(8) in the following way:
- we assume that the variables $x^i_j \in X_r - F_r$, such that $X^i \cap F_r = \emptyset$, are continuous variables. Then we can approximate the discrete cost-capacity curves (given by the set $C^i$) with their lower linear envelope. Let $Z'$ be the set of channels $i$ for which the variables $x^i_j$ are continuous. The criterion function (4) then turns into:
$$Q(X_r, Y_r) = \alpha \left( \sum_{i \in Z'} c^i d^i + \sum_{i:\, x^i_j \in F_r} d^i_j \right) + \beta \sum_{y_{kh} \in Y_r} y_{kh} u_{kh},$$

where $d^i = \min_{x^i_j \in X^i} \left( d^i_j / c^i_j \right)$ and $c^i$ is the (continuous) capacity of channel $i$.
- we assume that the variables $y_{kh} \in Y^k - F_r$, for $k = 1, \ldots, K$, such that $Y^k \cap F_r = \emptyset$, are continuous variables. We create the model of the WAN in the following way. We add to the considered network $2K$ new artificial nodes, numbered from $n+1$ to $n+2K$. The artificial nodes $n+k$ and $n+K+k$ correspond to the $k$-th server. Moreover, we add to the network the directed artificial channels $\langle n+k, m \rangle$, $\langle m, n+K+k \rangle$, $\langle n+K+k, n+k \rangle$, for all $m \in M_k$ and $k = 1, \ldots, K$ such that $Y^k \cap F_r = \emptyset$. The capacities of the new artificial channels are: $c(n+k, m) = \infty$, $c(m, n+K+k) = \infty$, $c(n+K+k, n+k) = \sum_{h=1}^{n} u_{kh}$. The leasing costs of all artificial channels are equal to zero. Then the lower bound $LB_r$ of the minimal value of the criterion function $Q(X_s, Y_s)$
for every possible successor ( X s , Ys ) generated from ( X r , Yr ) may be obtained by solving the following optimization problem: 2 ⎞ ⎛ ⎛ ⎞ ⎛ ⎟ ⎜ ⎜ ⎟ d i ⎞⎟ ⎜ ⎟ ⎜ ⎜ f ⎟ ∑ i ⎜⎜ ⎟ ⎜ ⎜ c i ⎟⎟ ⎟ i∈Z ' ⎝ ⎠ LBr = min ⎜ α ⎜ ∑ d i f i + + ∑ d i ⎟ + β ∑ ykhukh ⎟ f f ⎜ ⎜ i∈Z ' ⎟ y kh ∈Yr i: x ij ∈Fr ⎟ γT max − ∑ i i i ⎟ ⎜ ⎜ ⎟ i x c f − i x j ∈Fr j j ⎟⎟ ⎜⎜ ⎜ ⎟ ⎠ ⎠ ⎝ ⎝ subject to
f i ≤ c i for i ∈ Z' fi ≤ fi ≤
x ij c ij
ir c max
for
x ij
∈ Fr
for each i ∈ Z '
(9)
(10) (11) (12)
ir where c max is maximal capacity connected with variables x ij ∈ X i − Frt , and Frt is the subset of momentarily fixed variables. The solution of problem (9−12) gives the lower bound LBr. To solve the problem (9-12) we can use an efficient Flow Deviation method [5, 7].
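As a small illustration of the relaxation, the sketch below computes the lower-linear-envelope slope $d^i = \min_j d^i_j / c^i_j$ used above for every channel whose capacity variables are free; the container layout is an assumption for the sake of the example, not taken from the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Slope of the lower linear envelope of the discrete cost-capacity points of each
// channel: d^i = min_j d_j^i / c_j^i, taken over the capacity options with c_j^i > 0.
std::vector<double> envelopeSlopes(
        const std::vector<std::vector<double>>& capacity,   // c_j^i
        const std::vector<std::vector<double>>& leaseCost)  // d_j^i [cost/month]
{
    std::vector<double> slope(capacity.size(),
                              std::numeric_limits<double>::infinity());
    for (std::size_t i = 0; i < capacity.size(); ++i)
        for (std::size_t j = 0; j < capacity[i].size(); ++j)
            if (capacity[i][j] > 0.0)
                slope[i] = std::min(slope[i], leaseCost[i][j] / capacity[i][j]);
    return slope;   // slope[i] is the cost per unit of the relaxed, continuous capacity c^i
}
```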
5 Branching Rules
The purpose of the branching rules is to find the normal variable of $(X_r, Y_r)$ whose complementing generates a successor of the selection $(X_r, Y_r)$ with the least possible value of the criterion function (4). The choice criteria should be constructed in such a way that complementing reduces the value of (4) while the increase of the total average delay in the network is as small as possible. Complementing a variable $x^i_j$ changes the capacity of channel $i$; then the average delay in the network and the capacity cost change, while the server cost does not change. Moreover, complementing a variable $x^i_j$, $j < s(i)$, with
$x^i_{s(i)} = 0$ changes the network topology, and constraint (6) must not be violated. We propose the following choice criterion for complementing the variable $x^i_j \in X_r$ with the variable $x^i_l \in X_s$:
$$\Delta^i_{jl} = \begin{cases} \dfrac{\dfrac{f_i}{c^i_l - f_i} - \dfrac{f_i}{c^i_j - f_i}}{\alpha \left( d^i_j - d^i_l \right)} & \text{if } f_i < c^i_l \text{ and } PN_{X_s} \ge MPN \\[2ex] \infty & \text{otherwise} \end{cases}$$
where $f_i$ is the flow in the $i$-th channel obtained by solving the multicommodity flow problem for the network topology and channel capacities given by the selection $X_r$. The choice criterion for complementing the variable $y_{kh} \in Y_r$ with the variable $y_{km} \in Y_s$ may be formulated as follows:
$$\delta^k_{hm} = \begin{cases} \dfrac{\dfrac{1}{\gamma} \displaystyle\sum_{x^i_j \in X_r} \dfrac{\tilde{f}_i}{x^i_j c^i_j - \tilde{f}_i} \; - \; T(X_r, Y_r)}{\beta \left( u_{kh} - u_{km} \right)} & \text{if } \tilde{f}_i < x^i_j c^i_j \text{ for } x^i_j \in X_r \\[2ex] \infty & \text{otherwise} \end{cases}$$
where the flow $\tilde{f} = [\tilde{f}_1, \ldots, \tilde{f}_b]$ is constructed as follows: the flow from all users to the $k$-th server is moved from the routes leading from the users to node $h$ to the routes leading from the users to node $m$. The calculation scheme for obtaining the flow $\tilde{f}$ may be found in [1]. Let $E_r = (X_r \cup Y_r) - F_r$, and let $G_r$ be the set of all reverse variables of the normal variables which belong to the set $E_r$. We want to choose a normal variable whose complementing generates a successor with the least possible value of the criterion (4). We should therefore choose the pair $\{(x^i_j, x^i_l) : x^i_j \in E_r, x^i_l \in G_r\}$ or $\{(y_{kh}, y_{km}) : y_{kh} \in E_r, y_{km} \in G_r\}$ for which the value of the criterion $\Delta^i_{jl}$ or $\delta^k_{hm}$ is minimal.
6 Computational Results
The presented algorithm was implemented in C++. Extensive numerical experiments have been performed with this algorithm for many different networks. The experiments were conducted with two main purposes in mind: first, to examine the impact of various problem parameters on the solution (i.e. on the value of the criterion $Q$), and second, to test the computational efficiency of the algorithm. The typical dependence of the optimal value of $Q$ on the maximal acceptable total average delay per packet $T^{max}$, for different values of the parameters $\alpha$ and $\beta$, is presented in Fig. 1. It follows from the computational experiments that $Q$ is a decreasing function of $T^{max}$. The following conclusion follows from the computer experiments (Fig. 1).
Conclusion 1. There exists an acceptable total average delay per packet $T^{max}_*$ such that the problem (5)–(8) has the same solution for each $T^{max}$ greater than or equal to $T^{max}_*$.
The typical dependence of the optimal value of $D$ on the value of the parameter $\alpha$ ($\beta = 1 - \alpha$) is presented in Fig. 2. It follows from the computational experiments and from Fig. 2 that $D$ is an increasing function of $\alpha$. We have examined the impact of the reliability parameter $MPN$ on the solution. Typical dependences of the optimal value of the criterion $Q$ and of the total average delay in the network on the minimal number of different paths are presented in Fig. 3 and Fig. 4. Experiments were conducted for $MPN = 1$, 2 and 3. Obtaining more than three different paths between each pair of nodes is difficult or even impossible for small and medium wide area networks. To ensure $MPN = 4$ there must exist at least four channels adjacent to each node of the network; this makes the network very expensive, because the leasing costs of the channels increase quickly and the cost of the nodes (WAN switches) increases as well. As follows from Fig. 3, the value of the combined cost criterion $Q$ is quite similar for $MPN = 1$ and $MPN = 2$ and increases rapidly for $MPN > 2$. Similar dependences were observed for all examined networks. The typical dependence of the optimal value of the total average delay in the network, obtained by solving the problem (5)–(8), on the minimal number of different paths is a decreasing function (Fig. 4). In most cases the dependence of $T$ on $MPN$ can be approximated by a linear function. Based on the results, partially presented in Fig. 3 and Fig. 4, we can formulate the following conclusion. Conclusion 2. For small and medium wide area networks the optimal value of the minimal number of paths between each pair of nodes is equal to two. The computational properties of the presented algorithm were tested during the experiments. Let $NT = \frac{T^{max} - T_{min}}{T^{max}_* - T_{min}} \cdot 100\%$ be the normalized maximal acceptable total average delay per packet in the network; the problem (5)–(8) has no
Fig. 1. Typical dependence of the optimal value of criterion Q on maximal acceptable total average delay per packet
Fig. 2. Typical dependence of the optimal value of D on the coefficient α
solution for $T^{max} < T_{min}$. This normalized value lets us compare the results obtained for different wide area network topologies and for different numbers and locations of servers. Let $P(u, v)$, in percent, be the arithmetic mean of the relative number of iterations for $NT \in [u, v]$, calculated over all considered network topologies and different server locations. Fig. 5 shows the dependence of $P$ on the divisions $[0\%, 10\%), [10\%, 20\%), \ldots, [90\%, 100\%]$ of $NT$. It follows from Fig. 5 that the exact algorithm is especially effective from the computational point of view for $NT \in [60\%, 100\%]$.
Fig. 3. Typical dependence of the optimal value of criterion Q on the minimal number of different paths. Fig. 4. Typical dependence of the optimal value of the total average delay in the network on the minimal number of different paths.
7 Conclusion
In this paper an exact algorithm for solving the server replication and topology assignment problem with a cost criterion, a delay constraint and a reliability constraint is proposed. Such a formulation of the problem has not been considered in the literature before. It follows from the computational experiments that the presented algorithm is computationally effective for larger values of the acceptable average delay in
Fig. 5. The dependence of P on normalized maximal average delay per packet NT
the WAN (Fig. 5). In our opinion the WAN property formulated as Conclusion 2 is very important from the practical point of view: it gives the optimal value of the minimal number of paths between each pair of nodes in small and medium networks. Moreover, the properties presented in Sections 4 and 5 may be very useful for constructing an effective approximate algorithm for solving the problem (5)–(8). This work was supported by a research project of the Polish State Committee for Scientific Research in 2005–2007.
References
1. Markowski, M., Kasprzak, A.: The Three-Criteria Servers Replication and Topology Assignment Problem in Wide Area Networks. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganá, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3982, pp. 1119–1128. Springer, Heidelberg (2006)
2. Markowski, M., Kasprzak, A.: An Exact Algorithm for the Servers Allocation, Capacity and Flow Assignment Problem with Cost Criterion and Delay Constraint in Wide Area Networks. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 442–445. Springer, Heidelberg (2007)
3. Koide, T., Shinmori, S., Ishii, H.: Topological Optimization with a Network Reliability Constraint. Discrete Applied Mathematics 115, 135–149 (2001)
4. Yi-Kuei, L.: Reliability of a Flow Network Subject to Budget Constraints. IEEE Transactions on Reliability 56(1), 10–16 (2007)
5. Pioro, M., Medhi, D.: Routing, Flow, and Capacity Design in Communication and Computer Networks. Elsevier, Morgan Kaufmann Publishers, San Francisco (2004)
6. Chari, K.: Resource Allocation and Capacity Assignment in Distributed Systems. Computers Ops Res. 23(11), 1025–1041 (1996)
7. Fratta, L., Gerla, M., Kleinrock, L.: The Flow Deviation Method: An Approach to Store-and-Forward Communication Network Design. Networks 3, 97–133 (1973)
8. Wolsey, L.A.: Integer Programming. Wiley-Interscience, New York (1998)
Bluetooth ACL Packet Selection Via Maximizing the Expected Throughput Efficiency of ARQ Protocol
Xiang Li1,2,*, Man-Tian Li1, Zhen-Guo Gao2, and Li-Ning Sun1
1 Robot Research Institute, Harbin Institute of Technology, Harbin, 150001, China
2 College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, China
{leexiang, gag}@hrbeu.edu.cn, {limt, lnsun}@hit.edu.cn
Abstract. Bluetooth provides several data packet types with different sizes and error correction mechanisms, so the adapter layer can choose the most suitable packet to transmit according to the error rate on the link and the application requirements. Based on the acknowledgement history of the most recently transmitted packets, an adaptive algorithm is proposed to choose the most suitable Bluetooth data packet for transmission by maximizing the expected throughput efficiency of the ARQ protocol on the Bluetooth ACL data communication link. Simulation results indicate that this method works very well with a short observation history and also show the distinct performance of DM and DH data packet transmission. Keywords: Bluetooth, Piconet, ARQ, ACL, Throughput Efficiency.
1 Introduction
Bluetooth (BT) [1,2] is a short-range radio link intended to be a cable replacement between portable and/or fixed electronic devices. Two types of transmission links are used: SCO and ACL links. An SCO link is a symmetric point-to-point link supporting time-bounded voice traffic; SCO packets are transmitted over reserved intervals without being polled. An ACL link is a point-to-multipoint link between the master and all slaves in the piconet and can use all the remaining slots of the channel not used for SCO links. Bluetooth is a frequency hopping system which can support multiple communication channels in a common area (each channel is defined by a unique frequency hopping sequence). Frequency hopping is used in such a way that the radio is tuned to the same frequency for the entire duration of a packet, but changes to a different frequency each time it transmits a new packet or retransmits an erroneous packet. Since the fading and interference in the new frequency channel will be significantly different from those of the previous one, the use of frequency hopping with ARQ provides an effective method of diversity. Automatic Repeat Request (ARQ) protocols are designed to remove transmission errors from data communication systems. When used over relatively high bit-error rate (BER) links (e.g., $10^{-5}$ or higher) such as wireless or satellite links, their performance is
Supported by the Harbin Engineering University Foundation (HEUFT06015).
sensitive to the packet size used in the transmission. When too large a packet size is employed, there is an increased need for retransmissions, while too small a packet size is inefficient because of the fixed overhead required per packet. When an ARQ scheme is to be used at the link layer over a relatively high error-rate link, the packet size should be chosen based on the error rate [3]. The problem of optimal communication in Bluetooth has been investigated in several papers. In [4], a solution is proposed to enhance the Bluetooth link layer so that it makes use of channel state information and adopts the most suitable Bluetooth packet type to enhance TCP throughput. The throughput of the six Bluetooth ACL packet types that use ARQ is derived as a function of the channel symbol SNR in [5]; the optimal packet type can then be selected at different SNRs. Reference [6] provides algorithms to maximize the throughput under lossy transmission conditions in a piconet with one or more slaves by selecting the packet lengths optimally in accordance with the channel conditions at different frequencies. All these works concentrate on throughput under different channel conditions, such as the BER or SNR, which are not easy to know in advance. In this paper, we are concerned with choosing the optimal packet payload length on Bluetooth ACL data communication links, in terms of maximizing the throughput efficiency of the ARQ protocol based on the acknowledgement history of the most recently transmitted packets. That is, given the number of packets that required retransmission, an estimate of the channel BER is made, based on which a packet size is chosen that maximizes the expected throughput efficiency of the data link protocol.
2 Bluetooth Data Packets In Bluetooth, the data on the piconet channel is conveyed in packets. The general packet format is shown in Fig.1. Each packet consists of 3 entities: the access code, the header, and the payload. In Fig. 1, the number of bits per entity is indicated [1].
Fig. 1. Standard Packet Format
The access code and header are of fixed size: 72 bits and 54 bits respectively. The payload length can range from zero to a maximum of 2745 bits. Different packet types have been defined. Packets may consist of the (shortened) access code only, of the access code and header, or of the access code, header and payload. Data in Bluetooth can be transmitted asynchronously using ACL packets. In this paper, we mainly focus on the ACL data packets used in asynchronous connections. Seven ACL packet types are defined in Bluetooth. DM stands for Data-Medium rate, DH for Data-High rate. DM packets are all 2/3-FEC encoded to tolerate possible transmission errors. Not being FEC encoded, DH packets are more error-vulnerable, but they can carry more information.
3 Adaptive Packet Selection Algorithm
3.1 Throughput Efficiency of ARQ Protocol
A protocol's performance is usually characterized by many parameters which are defined by the communication system requirements. The most important parameters are the probability of receiving a message without errors and the protocol throughput efficiency. There are several definitions of the protocol throughput efficiency. Most frequently it is defined as the ratio of the mean number of information bits successfully accepted by the receiver to the number of bits that could have been transmitted during the same time interval [7]. We must therefore first derive an expression for the throughput efficiency of the ARQ protocol. The expressions derived in this section assume the use of an "optimal" ARQ protocol in which only packets containing errors are retransmitted. The throughput efficiency of an ARQ scheme that uses packets having $n$ bits, of which $k$ are information bits, is determined by [8]:

$$\eta = \left(\frac{k}{n}\right) / \bar{R} \tag{1}$$

where the first term $k/n$ represents the ratio of information bits to total bits in a packet, and $\bar{R}$ represents the average number of transmission attempts per packet. Assuming that the ARQ scheme retransmits a packet until the acknowledgement of a successful reception, the average number of attempts $\bar{R}$ needed to successfully transmit one packet is given by [4]:

$$\bar{R} = 1 \times (1-p) + 2 \times p(1-p) + 3 \times p^2(1-p) + \dots = \frac{1}{1-p} \tag{2}$$

where $p$ is the packet error rate. So, for a given $p$, the throughput efficiency of an ARQ scheme that uses packets having $n$ bits, of which $k$ are information bits, is given by:

$$\eta = \left(\frac{k}{n}\right)(1-p) \tag{3}$$
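Equations (1)–(3) translate directly into code; the helper below is only an illustrative transcription (splitting the packet into $k$ information bits out of $n$ total is taken as given), not part of any Bluetooth stack.

```cpp
// Average number of transmission attempts per packet, equation (2).
double meanAttempts(double p) {            // p = packet error rate, 0 <= p < 1
    return 1.0 / (1.0 - p);
}

// Throughput efficiency of the ARQ scheme, equations (1) and (3):
// k information bits out of n total bits, packet error rate p.
double throughputEfficiency(double k, double n, double p) {
    return (k / n) / meanAttempts(p);      // identical to (k / n) * (1 - p)
}
```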
3.2 Choosing Packet Size Via Maximizing the Expected Throughput Efficiency of ARQ Protocol
When a perfect retransmission algorithm is employed (a perfect retransmission algorithm is one that only retransmits packets that are in error and can continuously transmit new packets as long as no errors occur; the selective repeat protocol is an example of a perfect retransmission algorithm), the optimal packet size to be used by the data link protocol is given by [3]:

$$k_{opt} = \frac{-h \ln(1-b) - \sqrt{-4h \ln(1-b) + h^2 \ln^2(1-b)}}{2 \ln(1-b)} \tag{4}$$

where $b$ is the known channel BER and $h$ is the number of overhead bits per packet (these bits are used for control, error detection, and framing).
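Formula (4) can be evaluated directly. The function below is a small sketch of such an evaluation (it is not the code of [3]); the numbers in the comment follow from the formula itself.

```cpp
#include <cmath>

// Optimal information block size of equation (4) for a perfect retransmission
// protocol, given the channel BER b (0 < b < 1) and h overhead bits per packet.
double optimalPacketSize(double b, double h) {
    const double L = std::log(1.0 - b);                              // ln(1 - b) < 0
    return (-h * L - std::sqrt(h * h * L * L - 4.0 * h * L)) / (2.0 * L);
}

// Example: optimalPacketSize(1e-4, 126.0) is about 1060 bits, and the result
// shrinks as b grows, which is the trend shown in Fig. 2.
```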
When $h$ equals 126, the optimal packet size under different channel BERs given by formula (4) is displayed in Fig. 2. Fig. 2 shows that the optimal packet size decreases as the channel BER increases. This trend meets the real application requirement: if a much larger packet size were used while the channel BER is high, the efficiency of the protocol would drop dramatically. Therefore a much smaller packet size is efficient under a much higher channel BER, because a small packet has a low packet error rate. Conversely, a much larger packet size makes efficient use of the channel when the channel BER is low.
Fig. 2. Optimal packet size under different channel bit error rate
In Bluetooth, an ARQ mechanism is adopted in order to guarantee reliable transmission. That is, the receiving side sends back a special control frame as an acknowledgement or negative acknowledgement (ACK/NACK) to the sender. If a frame or an acknowledgement message is lost, a timeout signal is generated when the timer expires, reminding the other side that a problem has occurred and that the frame must be retransmitted. At the same time, the receiver must be capable of distinguishing between retransmitted and new frames. With the ARQ scheme in the Bluetooth specification, DM, DH and the data field of DV packets are transmitted and retransmitted until an acknowledgement of a successful reception is returned by the destination (or a timeout is exceeded). The acknowledgement information is included in the header of the return packet, so-called piggy-backing. To determine whether the payload is correct or not, a CRC code is added to the packet. The ARQ scheme only works on the payload of the packet (only payloads which have a CRC); the packet header and the voice payload are not protected by the ARQ. Depending on the packet retransmission record on the current link, an adaptive method to select the best packet for data transmission is proposed to improve the performance of the Bluetooth system. The basic idea behind this scheme is that a large packet has low overhead and is advantageous when the BER is relatively low, while a small packet has a low packet error rate and thus is advantageous when the BER is high. So, depending on the channel BER, every type of Bluetooth ACL data packet performs differently. Without any bit errors, the DH5 packet would give the best performance, since it carries the most information bits per unit time. However, as the BER increases, the packet error rate of DH5 increases faster than that of smaller packets. Thus, the problem is how to select a suitable packet to adapt to the current channel conditions. It is difficult to estimate the channel BER in a short time when the link is operating under error detection with retransmission [9], but we can acquire the packet retransmission history easily.
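The only channel feedback the adaptive scheme needs is $R$, the number of retransmissions among the last $M$ payload transmissions. A minimal way to maintain this record is a circular buffer of acknowledgement outcomes, as sketched below; this bookkeeping class is an illustration only and is not part of the Bluetooth baseband specification.

```cpp
#include <cstddef>
#include <vector>

// Sliding window over the outcomes of the last M transmitted payloads:
// 'true' means the packet had to be retransmitted (NACK or timeout).
class RetransmissionHistory {
public:
    explicit RetransmissionHistory(int M) : window_(M, false) {}

    void record(bool retransmitted) {
        count_ += (retransmitted ? 1 : 0) - (window_[next_] ? 1 : 0);
        window_[next_] = retransmitted;
        next_ = (next_ + 1) % window_.size();
    }

    int R() const { return count_; }                        // retransmissions in last M packets
    int M() const { return static_cast<int>(window_.size()); }

private:
    std::vector<bool> window_;
    std::size_t next_ = 0;
    int count_ = 0;
};
```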
This paper proposes a simple algorithm to choose the packet size such that the conditional efficiency of the protocol is maximized under different channel BERs, based on the packet transmission record. Suppose the BER is $b$ and $R$ is the number of retransmission requests out of the last $M$ packet transmissions; we average the above expression over all possible values of $b$ using the conditional distribution of $b$ given $R$ (assuming that $b$ is constant over the period of interest). The resulting expression is given by [3]:

$$\eta_R(k) = \int_0^1 \eta \, P(b \mid R) \, db \tag{5}$$

where $P(b \mid R)$ is the conditional probability of $b$ given that $R$ out of the last $M$ packets required retransmission. We now wish to choose the value of $k$ that maximizes $\eta_R$. To do so we must first express the conditional probability of $b$ given $R$, which can be written as:

$$P[b \mid R] = \frac{P[b, R]}{P[R]} = \frac{P[R \mid b] \, P[b]}{P[R]} \tag{6}$$

Solving for the above conditional probability requires knowledge of a prior distribution of $b$. In the absence of a prior, we assume a uniform prior, that is $P[b] = 1$. This approach, in essence, is the same as a maximum likelihood approach where a uniform prior is assumed, except that here we associate a cost function with the estimates of $b$. With this approach we get

$$P[R] = \int_0^1 P[R \mid b] \, P[b] \, db = \int_0^1 P[R \mid b] \, db \tag{7}$$

and so

$$P[b \mid R] = \frac{P[R \mid b]}{\int_0^1 P[R \mid b] \, db} \tag{8}$$

In wireless communication, we generally assume that the channel errors are independent and identically distributed; that is, bit errors on the channel are independent of each other and the error rate remains constant. Given the packet error rate $p$, the probability that $R$ packets contain errors and therefore require retransmission is the probability that $R$ out of $M$ packets are in error. Since packet errors are independent from packet to packet, this probability can be expressed according to the binomial distribution with parameter $p$; therefore $P[R \mid b]$ can be expressed as:

$$P[R \mid b] = \binom{M}{R} p^R (1-p)^{M-R} \tag{9}$$

The packet error rate $p$ for DH packets is:

$$p = 1 - (1-b)^k \tag{10}$$

Recall that DM packets are protected by a (15,10) Hamming code (encoded with a 2/3 block FEC), i.e., in every block 15 bits are used to encode 10 bits of data, which is capable of correcting one bit error per 15-bit code block; the payload is correctly decoded provided that all code blocks contain one or fewer errors. The packet error rate $p$ for DM packets can therefore be approximated as:
X. Li et al.
p = 1 − ((1 − b)15 + 15b(1 − b)14 ) k / 15
(11)
Hence, for DH packets, P[R|b] can be expressed as: k' R k '( M − R ) P[ R | b ] = ( M R )(1 − (1 − b ) ) (1 − b )
(12)
For DM packets, P[R|b] can be expressed as: 15 14 k '/ 15 R P[R | b] = (M ) ((1− b)15 +15b(1− b)14)k '(M −R) /15 R )(1 − ((1 − b) +15b(1 − b) )
where
k ' is the payload size used in the previous M transmissions.
Combining equations (5)–(13), we can get tively as: 1
η R (k ) = ∫ [ 0
k (1 − b ) k × n
ηR(k) for DH and DM packets respec-
( MR )(1 − (1 − b ) k ' ) R (1 − b ) 1
∫( 0
M R
k'
(M −R)
)(1 − (1 − b ) k ' ) R (1 − b ) k '( M − R ) db
]db
( RM )(1 − ((1 − b)15 + 15b(1 − b)14 )k' /15 ) R ((1 − b)15 + 15b(1 − b)14 )k' (M − R)/15 1
∫( 0
M R
(13)
)(1 − ((1 − b)15 + 15b(1 − b)14 )k' /15 )R ((1 − b)15 + 15b(1 − b)14 )k' (M − R)/15 db
η R (k) =
∫
1 0
[
(14)
]db
(15)
k((1 − b) 15 + 15b(1 − b) 14 ) k/15 × n
where n=k+126. It is now possible to choose the value of k, the payload length to be used in future transmissions, so that the throughput efficiency of the ARQ protocol is maximized. This can be done by choosing the value of k that maximizes equation (14) or (15) for a value of R that is equal to the number of retransmission requests that occurred during the previous M transmissions using the payload size k ' .
4 Simulation Results Usually, the solution way of the maximization problem for ηR(k) in equation (7) or (8) is difficult; However, for specific values of M, R and k ' equation (7) or (8) can be solved
numerically. An optimal value for k can now be found using numerical search algorithms. Since the numerical evaluation of this integral is very intensive, a comprehensive search for the optimal value of k is not practical. Instead, a restricted search using select values for k can be performed. Such a search, for example, can consider values of k that are a multiple of 100; thereby significantly reducing the complexity of the search. Such a restricted search has little impact on the performance of the protocol since values of k that are within 100 bits of the optimal block size should result in near-optimal performance. Here, analysis of equation (14) or (15) is taken using Matlab7.0, where k is always valued as the multiple of 100 and the pace of b is 0.000001. In fig. 3, we plot the optimal payload size when a history of 50 previously transmitted 1500 bit packets payload is considered. As can be seen from the fig. 3 (a), for DH
In Fig. 3, we plot the optimal payload size when a history of 50 previously transmitted packets with a payload length of 1500 bits is considered. As can be seen from Fig. 3 (a), for DH packet transmission, when the previous fifty transmissions resulted in no errors the payload length can be increased to the maximum of 2744 bits (the maximum throughput efficiency would be obtained at a payload length of 3200 bits). When one or two errors occurred, the payload length can be increased to 2100 and 1700 bits respectively. When three errors occurred, the payload length can be kept at 1500 bits, and when more than three errors occur the payload length is reduced. As depicted in Fig. 3 (b), for DM packet transmission, when the previous fifty transmissions resulted in no errors the payload length can be increased to the maximum of 2745 bits (the maximum throughput efficiency would be obtained at a payload length of 4400 bits). When one, two or three errors occurred, the payload length can be increased to 2500, 1900 and 1600 bits respectively, and when more than three errors occur the payload length is reduced.
Fig. 3. Optimal packet size selection based on retransmission history: (a) DH packets, (b) DM packets
Let $k$ be the optimal packet size chosen for a given value of $R$ out of the last $M$ packet transmissions. The efficiency of the ARQ protocol with that value of $k$ can be computed according to equation (3) combined with equation (10) or (11). It can then be averaged over the distribution of $R$ given $b$ to yield the performance of the ARQ scheme for a given value of $b$. Fig. 4 shows the mean throughput efficiency of ARQ for various values of $M$ and $b$, and a previous packet payload length of 1500 bits. As can be seen from the figure, whether for DH or DM packet transmission, good performance is obtained with a history of just 75000 payload bits (50 packets with a payload size of 1500 bits). When $b$ is higher than $10^{-5}$, more history is required to obtain a reasonable estimate of the throughput efficiency for DM packet transmission; for DH packet transmission the situation is quite different, and only a short history of packet transfers is required to obtain a high throughput efficiency. The throughput efficiency can also be computed according to equation (3) using the optimal packet size obtained from formula (4) under different channel BERs. For the previous fifty packet transmissions with a payload length of 1500 bits, the mean throughput efficiency of DH packets is calculated based on the selected optimal packet size for different numbers of retransmitted packets. Fig. 5 compares the mean throughput efficiency of DH packet transmission based on the retransmission history with the optimal packet transmission (opt) according to formula (4) for various $b$. As can be seen from
Fig. 4. Mean throughput efficiency of algorithm for various b
Fig. 5, both mean throughput efficiencies increase as $b$ decreases. The mean throughput efficiency of DH packets based on the retransmission history always lies below that of the optimal packet transmission (opt), but the difference is not large. Similarly, the mean throughput efficiency of DM packets is calculated based on the selected optimal packet size for different numbers of retransmitted packets. Fig. 6 compares the mean throughput efficiency of DM packet transmission based on the retransmission history with the optimal packet transmission (opt) according to formula (4) for various $b$. As can be seen from the figure, both mean throughput efficiencies increase as $b$ decreases. The mean throughput efficiency of Bluetooth DM packets based on the retransmission history levels off when $b$ is less than $10^{-3}$, approaching 0.7773. When $b$ is larger than $10^{-3}$, the mean throughput efficiency of DM packets based on the retransmission history is larger than that of the optimal packet transmission (opt). Hence, a larger throughput efficiency can be gained by using DM packet transmission with the ARQ protocol when the channel BER is high. For the previous fifty packet transmissions with a payload length of 1500 bits and the optimal packet size selected for different numbers of retransmitted packets, Fig. 7 compares the mean throughput efficiency of ARQ for DH and DM packet transmission for various $b$.
Fig. 5. Comparing mean throughput efficiency of DH and optimal packets transmission (opt)
Fig. 6. Comparing mean throughput efficiency of DM and optimal packets transmission (opt)
It is important to note that the performance of ARQ with DH packets is much more vulnerable when $b$ is high. That is, when $b$ is high, the choice between the DH and DM packet types can have a dramatic effect on the throughput efficiency, and DM packet transfer can produce a higher throughput efficiency than DH packets. When $b$ is low, the throughput efficiency varies only slightly with the bit error rate $b$ and the packet type (DH/DM). So in a high error rate environment it is better to use DM packets for data transmission, which accords with the capability of DM packets to tolerate a high transmission error rate. Conversely, in a low error rate environment it is better to use DH packets for data transmission, because, not being 2/3-FEC encoded, DH packets have a relatively higher data transfer rate than DM packets.
Fig. 7. Comparing mean throughput efficiency of DH and DM packets transmission
5 Conclusion
This paper introduces a method to select the optimal packet payload length used by the Bluetooth ACL data link layer. The throughput efficiency of the ARQ protocol is derived based on the retransmission history. Thus, given the packet transmission record, we can choose the packet size such that the expected throughput efficiency of the ARQ protocol is maximized under different channel BERs. Simulation results show that the method works very well even with a short observation history (50 packets with a payload size of 1500 bits, i.e. 75000 payload bits in total). In a high error rate environment it is
better to use DM packets for data transmission, while in a low error rate environment it is better to use DH packets for data transmission to improve the data transfer rate.
References
1. Bluetooth V1.1 Core Specifications, http://www.bluetooth.org
2. Haartsen, J.: The Bluetooth radio system. IEEE Personal Communications 7(1), 28–36 (2000)
3. Modiano, E.: An adaptive algorithm for optimizing the packet size used in wireless ARQ protocols. Wireless Networks 5, 279–286 (1999)
4. Chen, L.J., Kapoor, R., Sanadidi, M.Y., Gerla, M.: Enhancing Bluetooth TCP throughput via link layer packet adaptation. In: Proc. of the 2004 IEEE International Conference on Communications (ICC 2004), pp. 4012–4016. IEEE Press, Paris (2004)
5. Valenti, M.C., Robert, M., Reed, J.H.: On the throughput of Bluetooth data transmissions. In: Proc. of the IEEE Wireless Commun. and Networking Conf., pp. 119–123. IEEE Press, Orlando (2002)
6. Sarkar, S.: Optimal Communication in Bluetooth Piconets. IEEE Transactions on Vehicular Technology 54(2), 709–721 (2005)
7. Turin, W.: Throughput analysis of the Go-Back-N protocol in fading radio channels. IEEE Journal on Selected Areas in Communications 17(5), 881–887 (1999)
8. Pribylov, V.P., Chernetsky, G.A.: Throughput efficiency of automatic repeat request algorithm with selective reject in communication links with great signal propagation delay. In: Proc. of the 3rd IEEE-Russia Conference Microwave Electronics: Measurements, Identification (MEMIA 2001), pp. 202–205. IEEE Press, Novosibirsk (2001)
9. Jesung, J., Yujin, L., Yongsuk, K., Joong, S.M.: An adaptive segmentation scheme for the Bluetooth-based wireless channel. In: Proc. of the 10th International Conference on Computer Communications and Networks, pp. 440–445. IEEE Press, Scottsdale (2001)
High Performance Computer Simulations of Cardiac Electrical Function Based on High Resolution MRI Datasets
Michal Plotkowiak1, Blanca Rodriguez2, Gernot Plank3, Jürgen E. Schneider4, David Gavaghan1,2, Peter Kohl5, and Vicente Grau6
1 LSI Doctoral Training Centre, University of Oxford, UK [email protected]
2 Computing Laboratory, University of Oxford, UK
3 University of Graz, Austria
4 Department of Cardiovascular Medicine, University of Oxford, UK
5 Department of Physiology, Anatomy and Genetics, University of Oxford, UK
6 Department of Engineering Science and Oxford e-Research Centre, University of Oxford, UK [email protected]
Abstract. In this paper, we present a set of applications that allow performance of electrophysiological simulations on individualized models generated using high-resolution MRI data of rabbit hearts. For this purpose, we propose a pipeline consisting of: extraction of significant structures from the images, generation of meshes, and application of an electrophysiological solver. In order to make it as useful as possible, we impose several requirements on the development of the pipeline. It has to be fast, aiming towards real time in the future. As much as possible, it must use non-commercial, freely available software (mostly open source). In order to verify the methodology, a set of high resolution MRI images of a rabbit heart is investigated and tested; results are presented in this work.
1 Introduction
The heart is an electromechanical pump, whose function and efficiency are known to be intimately related to cardiac histoanatomy. A large number of anatomical and structural factors affect cardiac electromechanical activity, but their detailed influence is poorly understood. Computer simulations have demonstrated the ability to provide insight into the role of cardiac anatomy and structure in cardiac electromechanical function in health and disease [1],[2]. The most advanced cardiac models to date incorporate realistic geometry and fibre orientation [3]. However, for each animal species, only one example of cardiac anatomy is generally used, which obscures the effect of natural variability. In addition, cardiac tissue is generally represented as structurally homogeneous, and cardiac geometry is often overly simplified, as the endocardial structures for example are not represented in detail. This limits the utility of the computational models to
understand the role of heart structure and anatomy in cardiac electromechanical function. Recent advances in medical imaging techniques allow generation of high resolution images containing a wealth of information on the 3D cardiac anatomy and structure. Among all possible techniques, magnetic resonance imaging (MRI) is the most suitable for our purposes. MRI allows acquisition of images in vivo as well as in vitro, and can provide high-quality, high resolution datasets. Thus, MRI images of whole hearts can be used to obtain highly detailed, high resolution models of cardiac anatomy and structure. Figure 1 shows two anatomical MRI sections through two different rabbit hearts, obtained using an 11.7 T MRI system (500 MHz). The information that can be obtained from these high resolution MRI datasets can be used to build the next generation of cardiac computational models with a realistic and accurate representation of cardiac anatomy and structure. The goal of the present study is to develop and identify a set of methodologies to run computer simulations of cardiac electrical activity. The proposed heart models incorporate a detailed description of cardiac anatomy and structure based on high resolution MRI datasets.
Fig. 1. MRI slices of two rabbit hearts acquired at different resolutions and different levels of contraction, the left image at 26.4 x 26.4 x 24.4 μm, and the right image at 32 x 32 x 44 μm
2 Methodology
This section describes the techniques to go from the high-resolution MRI images to computer simulations of cardiac electrophysiological function. This involves the use of software developed specifically for this project, based on open-source libraries (for segmentation and surface generation), as well as available software packages (for mesh generation and the cardiac electrophysiology solver).
2.1 Data Acquisition
The MRI data were acquired using an 11.7 T (500 MHz) MRI system, consisting of a vertical magnet (bore size 123 mm; Magnex Scientific, Oxon, UK), a Bruker
Avance console (Bruker Medical, Ettlingen, Germany), and a shielded gradient system (548 mT/m, rise time 160 μs; Magnex Scientific, Oxon, UK). For MRI signal transmission/reception, dedicated quadrature-driven birdcage RF coils were used. Scanning was performed using a fast gradient echo technique for high-resolution gap-free 3D MRI. Images of coronary perfusion-fixed rabbit hearts, embedded in agarose, were acquired with an in-plane resolution of 26.4 μm x 26.4 μm, and an out-of-plane resolution of 24.4 μm. All the methodological details were described previously in [5].
2.2 Segmentation
Segmentation of the MRI datasets is the first step towards extracting anatomical information for incorporation into computational models of cardiac electrophysiology. Segmentation may be described as the process of labelling each voxel in a medical image to indicate its tissue type or to which anatomical structure it belongs. As a first step, the aim of this study was to segment high resolution MRI of rabbit hearts in order to generate an accurate description of the epicardial and endocardial structures. For this purpose an application was developed based on the Insight Toolkit libraries (ITK, www.itk.org). ITK is an open-source software system originally developed to support the Visible Human Project (www.nlm.nih.gov/research/visible/visible_human.html). In recent years it has become a standard tool for biomedical image segmentation and registration. Segmentation in the present study is performed using the fast marching method, one of the family of level set methods. Level Set Method. Level set methods [6] are a family of numerical techniques for tracking the evolution of contours and surfaces. The main idea is to embed the evolving contour/surface in a higher-dimensional function ψ. The contour C is represented as the zero level set of this function. In the context of image segmentation, level sets are generally used to evolve a contour/surface using the evolution of the higher-dimensional function. A large number of variations have been presented in the literature; a comprehensive review is beyond the scope of this paper. In the formulation introduced in [6], the level set function is generally initialized with a signed distance map to an initial surface, and then evolves guided by a speed function combining internal (generally related to contour regularity) and external (generally linked to image features) influences. We use a simplified level set method called the fast marching method [7]. In this method it is assumed that the speed function F > 0, which means that the front always moves forward. This has the advantage of having to consider each voxel only once, thus making the algorithm significantly faster. We chose this method both for its reduced computational requirements (necessary when dealing with large 3D datasets as in this case) and for convenience (an implementation is available in ITK). The position of the moving front can be characterized by calculating its arrival time T(x, y) as it crosses each point (x, y) on the plane or in space. Thresholding the arrival time provides a segmentation of the image.
In the current application, we initialize the contour using a set of automatically generated seeds. These are located using a set of heuristic rules involving the intensity of a block centered at each image voxel. A highly conservative threshold was used here to ensure that all the seeds belong to the object. The images were pre-processed using an anisotropic filter before application of the fast marching algorithm.
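For the fast marching step, a minimal ITK pipeline along the lines described above can be assembled as in the following sketch (edge-preserving smoothing, a gradient-based speed image, fast marching from seed points, and thresholding of the arrival time). It is an illustrative sketch only: the filter parameters, the sigmoid-based speed image and the whole-volume processing are assumptions, not the authors' implementation, which generates roughly 4000 seeds per slice with heuristic intensity rules.

```cpp
#include <cstddef>
#include <vector>
#include "itkImage.h"
#include "itkCurvatureAnisotropicDiffusionImageFilter.h"
#include "itkGradientMagnitudeRecursiveGaussianImageFilter.h"
#include "itkSigmoidImageFilter.h"
#include "itkFastMarchingImageFilter.h"
#include "itkBinaryThresholdImageFilter.h"

using ImageType = itk::Image<float, 3>;

// Fast-marching segmentation of one volume from a list of seed indices.
ImageType::Pointer segmentHeart(ImageType::Pointer input,
                                const std::vector<ImageType::IndexType>& seedIndices,
                                double arrivalTimeThreshold)
{
    // 1. Edge-preserving smoothing (the "anisotropic filter" mentioned above).
    auto smoother = itk::CurvatureAnisotropicDiffusionImageFilter<ImageType, ImageType>::New();
    smoother->SetInput(input);
    smoother->SetNumberOfIterations(5);
    smoother->SetTimeStep(0.04);          // below the 3D stability limit of 0.0625
    smoother->SetConductanceParameter(3.0);

    // 2. Speed image: low speed at edges, high speed in homogeneous tissue.
    auto gradient = itk::GradientMagnitudeRecursiveGaussianImageFilter<ImageType, ImageType>::New();
    gradient->SetInput(smoother->GetOutput());
    gradient->SetSigma(0.5);

    auto sigmoid = itk::SigmoidImageFilter<ImageType, ImageType>::New();
    sigmoid->SetInput(gradient->GetOutput());
    sigmoid->SetOutputMinimum(0.0);
    sigmoid->SetOutputMaximum(1.0);
    sigmoid->SetAlpha(-0.5);              // negative alpha slows the front at strong edges
    sigmoid->SetBeta(3.0);

    // 3. Fast marching from the seed points; the output is the arrival-time map T(x).
    using FastMarchingType = itk::FastMarchingImageFilter<ImageType, ImageType>;
    auto fastMarching = FastMarchingType::New();
    auto seeds = FastMarchingType::NodeContainer::New();
    seeds->Initialize();
    for (std::size_t i = 0; i < seedIndices.size(); ++i) {
        FastMarchingType::NodeType node;
        node.SetValue(0.0);               // seeds start at arrival time zero
        node.SetIndex(seedIndices[i]);
        seeds->InsertElement(static_cast<unsigned int>(i), node);
    }
    fastMarching->SetInput(sigmoid->GetOutput());
    fastMarching->SetTrialPoints(seeds);
    fastMarching->SetOutputSize(input->GetBufferedRegion().GetSize());
    fastMarching->SetStoppingValue(2.0 * arrivalTimeThreshold);

    // 4. Thresholding the arrival time yields the binary segmentation.
    auto threshold = itk::BinaryThresholdImageFilter<ImageType, ImageType>::New();
    threshold->SetInput(fastMarching->GetOutput());
    threshold->SetLowerThreshold(0.0);
    threshold->SetUpperThreshold(arrivalTimeThreshold);
    threshold->SetInsideValue(1);
    threshold->SetOutsideValue(0);
    threshold->Update();
    return threshold->GetOutput();
}
```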
2.3 Surface Generation
In order to generate surface data for further volume meshing and visualization, special algorithms were applied, based on the Visualization Toolkit (VTK) libraries. For a more detailed description refer to [8]. The proposed surface generation application contains three main elements: marching cubes, decimation and smoothing. Its role is not only to generate a spatial object from binary 2D data but also to prepare the structures of interest for finite element meshing. Marching Cubes. The marching cubes algorithm produces a triangular mesh by computing isosurfaces from volumetric data [9]. A cube is defined by the values at its eight vertices, corresponding to eight voxels in the original 3D image. When one or more vertices of a cube have values less than the specified isovalue, and one or more have values greater than this value, the cube contributes some component to the isosurface. By determining which edges of the cube are intersected by the isosurface, a triangular patch can be created. The final surface representation is obtained by connecting the patches from all cubes. Decimation. Marching cubes usually generates a large number of polygons, and so this number has to be reduced before generating a finite element mesh. We used the decimation algorithm from [10], available in VTK. Decimation is designed to reduce the total number of triangles in a mesh, while preserving the original topology [8]. The proposed decimation is an iterative process in which each point of a triangle mesh is visited. Three basic steps are carried out for each point. In the first one, the local geometry and topology in the neighbourhood of the point are classified. Next, the vertex is assigned to one of five possible categories: simple, boundary, complex, edge, or corner point. Finally, using a decimation criterion based on a local error measure, it is determined whether the point can be removed. If the criterion is satisfied, the point and the associated triangles are deleted and the resulting hole is re-triangulated. Smoothing. Mesh smoothing is a method of shifting the points of a mesh that can significantly improve its quality (i.e. appearance and shape) without modifying the mesh topology. Smoothing improves isosurfaces by removing surface noise. We have used Laplacian smoothing, a general smoothing technique that has been used successfully in other applications [8].
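A corresponding VTK sketch of the surface-generation chain (marching cubes, decimation, Laplacian smoothing, STL export) is given below; the reduction factor, iteration counts and the choice of vtkDecimatePro and vtkSmoothPolyDataFilter are illustrative assumptions rather than the exact classes and parameters used in the paper.

```cpp
#include <vtkSmartPointer.h>
#include <vtkImageData.h>
#include <vtkMarchingCubes.h>
#include <vtkDecimatePro.h>
#include <vtkSmoothPolyDataFilter.h>
#include <vtkSTLWriter.h>

// Iso-surface from the binary segmentation, roughly 50% triangle reduction,
// then Laplacian smoothing, written out as an STL surface for meshing.
void extractSurface(vtkImageData* segmentation, const char* stlFile)
{
    auto mc = vtkSmartPointer<vtkMarchingCubes>::New();
    mc->SetInputData(segmentation);
    mc->SetValue(0, 0.5);                      // iso-value between background (0) and tissue (1)

    auto decimate = vtkSmartPointer<vtkDecimatePro>::New();
    decimate->SetInputConnection(mc->GetOutputPort());
    decimate->SetTargetReduction(0.5);         // remove about half of the triangles
    decimate->PreserveTopologyOn();            // decimation must not change the topology

    auto smooth = vtkSmartPointer<vtkSmoothPolyDataFilter>::New();
    smooth->SetInputConnection(decimate->GetOutputPort());
    smooth->SetNumberOfIterations(30);         // Laplacian smoothing passes
    smooth->SetRelaxationFactor(0.1);

    auto writer = vtkSmartPointer<vtkSTLWriter>::New();
    writer->SetInputConnection(smooth->GetOutputPort());
    writer->SetFileName(stlFile);              // the STL output feeds the TetGen meshing step
    writer->Write();
}
```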
2.4 Finite Element Mesh
The majority of the leading cardiac electrophysiology simulators (such as the one used in this study and described in the next section) use tetrahedral finite element meshes as input. Therefore, from the model cardiac surfaces generated as described in the previous section, a tetrahedral finite element mesh was generated. First, a surface mesh composed of triangles or rectangles was generated, and then a volume mesh composed of tetrahedral elements was fitted into the cardiac volume. The most common unstructured meshing algorithms are Delaunay triangulation and advancing front methods. Many different automatic mesh generation tools are available; however, meshes generated in this way may contain poorly shaped or distorted elements that cause numerical problems. For instance, the size of the dihedral angles is very important: if they are too small, the condition number of the corresponding elemental matrices increases; if they are too large, the discretization error in the finite element solution increases. The meshing program used in this project, TetGen (http://tetgen.berlios.de/), performs Delaunay tetrahedralization using the algorithms presented in [11]. It also includes algorithms for quality control, e.g. Shewchuk's Delaunay refinement algorithm. This algorithm ensures that no tetrahedron in the generated mesh has a radius-edge ratio greater than 2.0. The reason for choosing TetGen is that it is freely available, contains state-of-the-art tetrahedralization algorithms, and offers good mesh quality control.
2.5 Electrophysiological Simulations
The finite element meshes were generated specifically to conduct simulations of ventricular electrophysiological activity, using one of the most advanced bidomain cardiac simulators available to date, namely the Cardiac Arrhythmias Research Package (CARP, http://carp.meduni-graz.at/). CARP uses computational techniques to solve the cardiac bidomain equations [12], defined as:

$$\nabla \cdot (\sigma_i \nabla V_m) + \nabla \cdot \left[ (\sigma_i + \sigma_e) \nabla \phi_e \right] = -I_{stim} \tag{1}$$
$$\nabla \cdot (\sigma_e \nabla \phi_e) = -I_{ion} - I_{stim} \tag{2}$$
$$V_m = \phi_i - \phi_e \tag{3}$$
where $\phi_i$ and $\phi_e$ are the intra- and extracellular potentials, $\sigma_i$ and $\sigma_e$ are the intra- and extracellular conductivity tensors, and $I_{ion}$ and $I_{stim}$ are the volume densities of the transmembrane and stimulus currents. In the bidomain model, membrane kinetics are represented by a system of ordinary differential equations that are used to compute the total transmembrane ionic current $I_{ion}$. CARP uses an expanded library of ionic models and plugins (augmentations) called LIMPET. As the animal species in this study is the rabbit, the Puglisi-Bers rabbit ionic model described in [13] was used, which consists of
a system of 17 ordinary differential equations (ODEs) that describe the electrophysiological behaviour of ion channel currents, pumps and exchangers in rabbit ventricular cells. Bidomain models are often used for studies of defibrillation, simulating the application of strong shocks to the heart, which requires representation of the extracellular space. However, simulations of cardiac propagation often use monodomain models, which can be obtained from the bidomain model by assuming $\phi_e = 0$. The detailed representation of cardiac structure incorporated in the meshes results in a large number of mesh nodes and therefore in computationally very demanding simulations, despite CARP's efficiency. Thus, the simulations required the use of high performance computing such as offered by the UK National Grid Service (NGS) (www.grid-support.ac.uk).
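To give a flavour of what a monodomain solver computes, the self-contained sketch below integrates a toy 1D monodomain cable with FitzHugh-Nagumo-type kinetics in place of the 17-variable Puglisi-Bers model, using explicit finite differences. It is a didactic stand-in only: the grid, time step and parameter values are illustrative assumptions and the code bears no relation to CARP's numerics.

```cpp
#include <cstdio>
#include <vector>

// Toy 1D monodomain cable, dv/dt = D * d2v/dx2 + Iion(v, w), integrated with
// explicit Euler; FitzHugh-Nagumo-type kinetics stand in for the ionic model.
int main() {
    const int    N  = 200;                                  // grid points
    const double dx = 0.025, dt = 0.01, D = 0.001;          // cm, ms, cm^2/ms (illustrative)
    const double a = 0.13, b = 0.013, c1 = 0.26, c2 = 0.1, d = 1.0;

    std::vector<double> v(N, 0.0), w(N, 0.0), vNew(N, 0.0);
    for (int i = 0; i < 10; ++i) v[i] = 1.0;                // stimulate one end ("apex")

    for (int step = 0; step <= 150000; ++step) {
        for (int i = 0; i < N; ++i) {
            const int il = (i == 0) ? 1 : i - 1;            // no-flux boundaries
            const int ir = (i == N - 1) ? N - 2 : i + 1;
            const double lap  = (v[il] - 2.0 * v[i] + v[ir]) / (dx * dx);
            const double iion = c1 * v[i] * (v[i] - a) * (1.0 - v[i]) - c2 * v[i] * w[i];
            vNew[i] = v[i] + dt * (D * lap + iion);
            w[i]   += dt * b * (v[i] - d * w[i]);           // recovery variable
        }
        v.swap(vNew);
        if (step % 25000 == 0)                              // trace the wave at the cable midpoint
            std::printf("t = %7.1f ms   v[mid] = %.3f\n", step * dt, v[N / 2]);
    }
    return 0;
}
```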
3 Results and Discussion
Here we present results for each step of the model development: segmentation, surface generation, finite element mesh, and electrophysiological simulations, using a high resolution MRI dataset of the rabbit heart.
3.1 Segmentation Results
MRI data were reconstructed and stored in the form of 1440 2D TIFF images with a resolution of 1024 x 1024 pixels. For our purposes, the TIFF images were down-sampled by a factor of 4 along each axis in order to speed up the segmentation process. The same segmentation method can be applied to the full resolution data; however, this would require the use of high performance computing capabilities such as the NGS, for which our segmentation software is not ready at this stage. As a first step in the model development process, we focused on segmentation of the endocardial and epicardial surfaces. Even though, in theory, the fast marching process could work from a single seed, in practice a reduced number of seeds was prone to produce leakage in areas where the gradient values were small. In our segmentation program, we generate about 4000 seed points for each MRI slice. The results of the segmentation of the rabbit heart are shown in Fig. 2.
3.2 Surface Generation Results
The marching cubes algorithm was used to generate 3D isosurfaces from the segmented 2D slices. Due to the large size of the data (about 200 MB), a decimation algorithm was applied. Parallel algorithms are being developed to allow handling of the full resolution dataset. Decimation allowed a reduction of the data size by about 50% while still maintaining a very detailed structure. Finally, smoothing was applied to improve the quality of the decimated isosurfaces. The final isosurfaces are presented in Fig. 3.
Fig. 2. Segmentation of MRI slices of rabbit heart. The black outline on the MRI slices shows the segmented boundaries of ventricular structure.
Fig. 3. Rabbit heart isosurfaces generated using marching cubes algorithm. Surfaces were decimated and smoothed using VTK functions.
3.3 Finite Element Mesh
As a result of the steps described above, we obtain an isosurface in STL format, fully compatible with the mesh generator TetGen (http://tetgen.berlios.de/) that we use here. In order to ensure a good mesh quality, i.e. a mesh where the radius-edge ratio is smaller than 2.0 for all tetrahedra and the maximum element volume is constrained, a set of TetGen switches was used. The final mesh, consisting of 828,476 nodes and 3,706,400 tetrahedral elements, is shown in Fig. 4.
3.4 Electrophysiological Simulations Results
The output tetrahedral mesh from TetGen had to be converted into a format compatible with the CARP solver. All simulations for the generated models were
Fig. 4. Different cuts through the finite element mesh for rabbit heart containing about 4 million tetrahedral elements
Fig. 5. Different stages of electrical propagation in the developed model. Transmembrane potential values at each epicardial node are shown using a grey scale, where black is resting potential and white is depolarized.
Fig. 6. Proposed model development pipeline illustrating the main applications, methods, file formats, and visualization programs used
carried out using a monodomain model. There is no special preparation needed for the bidomain calculation, however for the sake of obtaining a simple electrical propagation the monodomain mode is sufficient and computationally more efficient. Figure 5 shows transmembrane potential distribution on the epicardium at several time points during electrical propagation from apex to base following the application of the electrical stimulus.
4 Conclusions
The main contribution of this paper is the design and implementation of an application pipeline that uses high resolution MR images to create individualized, anatomically detailed heart models that are compatible with advanced cardiac electrophysiology simulators (such as CARP). Techniques such as the ones presented here, by generating models with unprecedented realism and level of detail, and introducing natural variability between individuals, have the potential to strengthen and broaden electrophysiological models as a fundamental tool in cardiac research. The main techniques applied in the presented model development are: segmentation, which uses the fast marching algorithm to extract ventricular geometry; surface generation, which uses the marching cubes algorithm and some surface processing (decimation and smoothing); and finite element mesh generation using Delaunay tetrahedralization. The main parts of the developed pipeline are presented in Fig. 6. The main objectives of this work were to demonstrate the feasibility of the process, and to propose a working prototype. Most of the methods used are amenable to improvement; for instance, more sophisticated segmentation algorithms will be needed in order to extract different anatomical structures, such as papillary muscles and blood vessels, and adaptive meshing techniques may be applied for creating more efficient finite element models. The electrophysiological model used here has some limitations, and in order to be useful in particular applications additional information, such as fibre orientation or cell type distribution, will have to be included. This can be obtained using additional segmentation techniques, or by including information from different image modalities such as histology. In addition, cardiac electro-mechanically coupled solvers are currently being developed that will allow simulation of cardiac electromechanical activity using the meshes developed through the proposed pipeline.
Acknowledgements This work was supported by a LSI DTC scholarship (to M.P.), an MRC Career Development Award (to B.R.), a Marie Curie Fellowship (to G.P.), and a BBSRC grant (BB E003443 to P.K.).
References 1. Kerckhoffs, R.C.P., et al.: Computational methods for cardiac electromechanics. Proc. IEEE 1994, 769–783 (2006) 2. Rodriguez, B., et al.: Differences between left and right ventricular geometry affect cardiac vulnerability to electric shocks. Circ. Res. 97, 168–175 (2005) 3. Vetter, F.J., McCulloch, A.D.: Three-dimensional analysis of regional cardiac function: A model of rabbit ventricular anatomy. Prog. Biophys. Mol. Biol. 69, 157–183 (1998) 4. Nielsen, P.M.F., et al.: Mathematical-model of geometry and fibrous structure of the heart. Amer. J. Physiol. 260, H1365–H1378 (1991)
5. Burton, R., et al.: Three-Dimensional Models of Individual Cardiac Histoanatomy: Tools and Challenges. Ann. NY Acad. Sci. 1080, 301–319 (2006) 6. Osher, S., Sethian, J.: Fronts Propagating with Curvature-Dependent Speed: Algorithms Based on Hamilton-Jacobi Formulations. Journal of Computational Physics 79, 12–49 (1988) 7. Sethian, J.: Level Set Methods and Fast Marching Methods. Cambridge University Press, Cambridge (2002) 8. Schroeder, W., et al.: The Visualization Toolkit, 3rd edn. Kitware Inc (2004) 9. Lorensen, W., Cline, H.: Marching Cubes: A High Resolution 3D Surface Construction Algorithm. Computer Graphics 21(3), 163–169 (1987) 10. Schroeder, W., et al.: Decimation of Triangle Meshes. Computer Graphics 26(2), 65–70 (1992) 11. Si, H., Gaertner, K.: Meshing Piecewise Linear Complexes by Constrained Delaunau Tetrahedralizations. In: Proceeding of the 14th International Meshing Roundtable (2005) 12. Vigmond, E.J., et al.: Solvers for the cardiac bidomain equations. Prog. Biophys. Mol. Biol. 33, 10 (2007) 13. Puglisi, J., Bers, D.: LabHEART: an interactive computer model of rabbit ventricular myocyte ion channels and Ca transport. Am. J Physiol. Cell. Physiol. 281(6), 2049–2060 (2001)
Statistical Modeling of Plume Exhausted from Herschel Small Nozzle with Baffle Gennady Markelov1 and Juergen Kroeker2 1
AOES Group BV, Pustbus 342, 2300 AH Leiden, The Netherlands [email protected] 2 EADS Astrium GmbH, Friedrichshafen, Germany
Abstract. Helium, constantly released on board the Herschel spacecraft, is used to cool three scientific instruments down to 0.3 K. The Helium is released by small nozzles, creating a counter-torque. This compensates the torque caused by the solar pressure acting on the spacecraft surfaces. An optimization of the nozzle shape alone could not avoid severe plume impingement on the spacecraft surfaces, and consequently the application of baffles has been considered to reduce plume impingement effects. Two baffle shapes, cylindrical and conical, with different radii and lengths have been analyzed numerically. The analysis has been performed with a kinetic approach, namely the direct simulation Monte Carlo (DSMC) method, to cope with the flow regime changing from continuum in the subsonic part of the nozzle through transitional to free-molecular flow inside the baffle. A direct application of DSMC-based software would require large computer resources to model the nozzle and plume flows simultaneously. Therefore, the computation was split into two successive computations for the nozzle and nozzle/plume flow. Computations of plume flow with and without baffles were performed to study the effects of baffle size and shape on the plume divergence and the plume impingement on the Herschel spacecraft. It has been shown that small baffles even widen the plume. An increase of the radius/length of the baffle decreases the plume divergence; however, even the largest baffle could not meet the requirements.
1 Introduction
The 'Herschel Space Observatory' is part of the fourth cornerstone mission in the 'Horizons 2000' program of the European Space Agency (ESA), with the objectives to study the formation of galaxies in the early universe and the creation of stars. In a dual launch together with Planck, Herschel will be placed in an operational Lissajous orbit around the Earth-Sun L2 point by an Ariane 5 in 2008 to perform photometer and spectrometer measurements, covering the full far infrared to sub-millimetre wavelength range from 60 to 670 micrometers during its operational lifetime of 3.5 years. The prime contractor for Herschel/Planck is ThalesAlenia Space in Cannes, France, while the Herschel Payload Module is developed, built and tested under responsibility of EADS Astrium Spacecrafts in Friedrichshafen, Germany.
The released Helium creates at the nozzles a counter-torque, which is used to compensate the torque caused by the solar pressure acting on the spacecraft. This counter-torque is partly neutralized by the Helium plume impingement on the spacecraft surfaces. To reduce the effect of the plume impingement, 95% of the total thrust has to be within a half-cone of 30 deg in a distance of 0.5 m from the nozzle. To achieve the goal the following investigations have been performed: – Optimization of the nozzle geometry to decrease a plume divergence, – Application of a baffle for further decrease of the divergence (the baffle shall be small and have a simple shape), – Definition of proper cant angle and nozzle locations if the design goal could not be achieved by the above activities. Plume exhausted from a nozzle in a hard vacuum is characterized by the presence of all flow regimes, from continuum in the nozzle and even in the plume near field through transitional to free-molecular flow at a large distance from the nozzle. Modelling of such flows requires a special approach, for example, a successive application of continuum and kinetic methods (see [1] and refs. herein). The given problem is complicated by the facts that 1) a transitional flow regime occurs inside the nozzle due to a small mass flow rate and 2) a baffle can affect the flow at the nozzle exit plane and, probably, even inside the supersonic part of the nozzle. This complicates the splitting of the computational domain into regions and the application of proper numerical methods modelling the flow inside the domains. This paper applies the direct simulation Monte Carlo (DSMC) method [2] to model plume flow with and without baffles, study effects of baffle size and shape on the plume divergence, and plume impingement on the Herschel spacecraft. This method is a computer simulation of movement and collisions of particles and it applies a statistical approach to perform the collisions. This is the most widely used numerical method in the area of rarefied gas dynamics and it was successfully applied to model plume flows and near continuum flows.
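For readers unfamiliar with DSMC, the core of the method is the stochastic collision step performed cell by cell. The sketch below shows a no-time-counter style collision routine for a hard-sphere gas of equal-mass particles (such as helium); it is a generic textbook-style illustration in the spirit of [2], not the SMILE implementation, and the function name, arguments and example numbers are ours.

```python
import numpy as np

def dsmc_collide_cell(vel, dt, cell_volume, f_num, d_ref, rng):
    """One DSMC collision step in a single cell.  `vel` is an (N, 3) array of
    particle velocities, `f_num` the number of real molecules represented by
    one simulated particle, `d_ref` the hard-sphere diameter."""
    n = len(vel)
    if n < 2:
        return 0
    sigma = np.pi * d_ref**2                        # hard-sphere cross-section
    cr_max = 2.0 * np.linalg.norm(vel - vel.mean(axis=0), axis=1).max() + 1e-12
    # no-time-counter estimate of the number of candidate pairs
    n_cand = int(0.5 * n * (n - 1) * f_num * sigma * cr_max * dt / cell_volume)
    n_coll = 0
    for _ in range(n_cand):
        i, j = rng.choice(n, size=2, replace=False)
        cr = np.linalg.norm(vel[i] - vel[j])
        if rng.random() < cr / cr_max:              # accept ~ relative speed
            # isotropic scattering of the relative velocity (equal masses)
            cm = 0.5 * (vel[i] + vel[j])
            cos_t = 2.0 * rng.random() - 1.0
            sin_t = np.sqrt(1.0 - cos_t**2)
            phi = 2.0 * np.pi * rng.random()
            rel = cr * np.array([cos_t, sin_t * np.cos(phi), sin_t * np.sin(phi)])
            vel[i], vel[j] = cm + 0.5 * rel, cm - 0.5 * rel
            n_coll += 1
    return n_coll

# usage with helium-like thermal speeds (all values illustrative)
rng = np.random.default_rng(0)
v = rng.normal(0.0, 300.0, size=(500, 3))
dsmc_collide_cell(v, dt=1e-6, cell_volume=1e-9, f_num=1e10, d_ref=2.3e-10, rng=rng)
```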
2 Nozzle and Baffle Geometries
Initially, the nozzle had a supersonic part with a half-angle of 15 deg and an exit diameter of 5.45 mm. The temperature of the helium at the nozzle inlet is 69 K and the mass flow rate is 1.1 mg/sec. The nozzle creates a rather wide plume with about 66% of the total thrust within a half-cone of 30 deg. An optimization of the nozzle shape increased the plume collimation and achieved a plume shape with 74% of the total thrust inside a half-cone of 30 deg. The optimal nozzle has a larger supersonic part: a half-cone angle of 32 deg and an exit diameter of 15 mm (Fig. 1 left). However, this nozzle does not meet the design goal, 95%, and the plume still impinges on the Herschel spacecraft surface. Figure 1 right shows the surface distribution of the torque created by the plumes. The plumes impinge mainly the SVM shield, the spacecraft body and the radiators, creating a torque acting on the spacecraft body. To decrease further the plume divergence, the
Fig. 1. Local Knudsen number flow-field for the optimal nozzle (left) and My torque distribution over the spacecraft surface (nozzles without baffles, values in N/m)
Fig. 2. Cylindrical (left) and conical (right) baffles
application of baffles has been considered. The baffles shall have a simple shape, either cylindrical or conical. Figure 2 shows the geometrical parameters of the baffle, where Lb is the axial length, Rb is the radius, and β is the baffle angle.
3 Numerical Approach
Modeling of plume flow is quite complex from a numerical viewpoint because it includes different flow regimes: a continuum regime in the nozzle and transitional and free-molecular regimes in the far plume field. For the Herschel small nozzle the transitional regime occurs already inside the nozzle due to the small mass flow rate. For example, values of the local Knudsen number are larger than the threshold value of 0.1, which defines the border between the continuum and transitional regimes (see Fig. 1 left). Therefore, the kinetic approach has to be applied already inside the nozzle. Computations were performed with the DSMC-based software SMILE [3]. The software has a 20-year history of development and has been thoroughly validated. A variable hard sphere model [2] was applied to model intermolecular
collisions, and diffuse reflection with complete energy accommodation was used as the gas/surface interaction model. To perform collisions between model particles, SMILE uses a Cartesian uniform grid. Each cell of the grid can be subdivided into smaller Cartesian cells to meet the method's requirements on the linear size of the collisional cell. This allows implementation of an efficient algorithm to trace the model particles. The uniform cells are used as a base for other algorithms: radial weights and parallel algorithms. A radial weight is assigned to each strip of cells along the X-axis to control the number of model particles and make their distribution more uniform in the radial direction. The parallel algorithm applies a static distribution of these cells between processors, and cells are distributed to the processors on a statistical basis. In this case all processors communicate with each other. However, this algorithm allows a good load balance for a small number of processors and, as a result, an efficient use of the parallel computer [4]. It is desirable to have the cells small enough to make these algorithms efficient. The plume has to be computed up to a distance of 0.5 meters from the nozzle, so collisional cells have to be small inside the nozzle and large in the plume far field. By adaptation of the uniform cells (subdivision into smaller collisional cells) the required flow resolution can be achieved in any place of the computational domain. However, even for a grid of 1000x1000 cells in the axial and radial directions, respectively, the nozzle occupies only two or three cells. This leads to very inefficient use of the software due to a large load imbalance over processors and requires large computer resources. To reduce the requirements on computer resources, the computation of the nozzle and plume flows has been split into the following two successive computations:
1. modeling of the flow inside the nozzle and in the vicinity of the nozzle exit,
2. modeling of the nozzle flow from the nozzle throat and the plume flow.
The two-step approach allows us to reduce the requirements significantly. An additional benefit is that the first computation has been done only once for all the baffle geometries, and it requires more computer resources than the second computation. Numerical solutions inside the supersonic part of the nozzle are exactly the same for both computations. In principle, the second computation could be started using an inflow boundary located downstream from the nozzle throat. In this case, a velocity distribution function has to be sampled along this boundary; otherwise, any simplification of the function, for example the application of an ellipsoidal Maxwellian, decreases the solution accuracy. For the first computation the efficiency of the computational cluster used is not less than 85% using up to 28 processors. However, the second computation is very inefficient, as the nozzle flow occupies only a few cells, which yields poor load balancing. For example, an increase in the number of processors from 2 to 8 yields a growth of the wall clock time required to perform the computation. An improvement has been achieved by redistributing the cells over processors. For the redistribution, the computation has been stopped at certain moments and the work load has been estimated for each cell by comparing the time spent
by each processor and the number of model particles in that cell. Then cells were grouped along the Y-axis to obtain an approximately equal work load over the groups. This redistribution also reduces the communication between processors, as each processor now has a closed sub-domain rather than loose cells scattered over the entire computational domain. After this cell redistribution the efficiency has been increased up to 80% using four processors.
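The redistribution described above — estimate the work load per cell, then group cell strips into contiguous blocks of roughly equal load — can be sketched as a simple greedy partitioning. The code below illustrates that idea only; SMILE's actual redistribution procedure is not published in this text.

```python
def group_cells_by_load(cell_loads, n_procs):
    """Greedily split an ordered list of per-cell work estimates (e.g. time
    spent or particle counts per cell strip) into `n_procs` contiguous groups
    of roughly equal total load."""
    total = sum(cell_loads)
    target = total / n_procs
    groups, current, acc = [], [], 0.0
    for idx, load in enumerate(cell_loads):
        current.append(idx)
        acc += load
        remaining_cells = len(cell_loads) - idx - 1
        remaining_groups = n_procs - len(groups) - 1
        # close the group once its load reaches the target, but keep enough
        # cells for the processors that still need a group
        if acc >= target and remaining_groups > 0 and remaining_cells >= remaining_groups:
            groups.append(current)
            current, acc = [], 0.0
    groups.append(current)          # the last processor takes the rest
    return groups

# usage: heavy load near the nozzle, light load in the far plume field
loads = [120, 90, 60, 30, 15, 8, 4, 2, 1, 1, 1, 1]
print(group_cells_by_load(loads, 4))
```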
4 Numerical Results
Computations of plume flow have been performed for the optimal nozzle geometry. This section uses the following parameters to describe the plume properties at a distance of 0.5 m from the nozzle: t30 is the fraction of the total thrust t within a half-angle of 30 deg; α95 is the half-angle which includes 95% of the total thrust; ṁ and ta are the mass flow rate and the thrust along the plume centerline.
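Once the angular distribution of thrust at 0.5 m has been sampled, the two divergence measures are straightforward to evaluate. The sketch below shows the computation on hypothetical data; the function and variable names are ours.

```python
import numpy as np

def plume_divergence_measures(angles_deg, thrust_per_bin):
    """Compute t30 (fraction of total thrust within a 30 deg half-angle) and
    alpha95 (half-angle containing 95% of total thrust) from a sampled
    angular thrust distribution."""
    order = np.argsort(angles_deg)
    ang = np.asarray(angles_deg, dtype=float)[order]
    cum = np.cumsum(np.asarray(thrust_per_bin, dtype=float)[order])
    total = cum[-1]
    t30 = cum[np.searchsorted(ang, 30.0, side="right") - 1] / total
    alpha95 = ang[np.searchsorted(cum, 0.95 * total)]
    return t30, alpha95

# illustrative distribution: most thrust close to the axis, long tail
ang = np.arange(1.0, 91.0, 1.0)
thr = np.exp(-ang / 18.0)
print(plume_divergence_measures(ang, thr))
```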
4.1 Cylindrical Tube
A geometry of the cylindrical tube is defined by two parameters, length and radius. Computations have been performed to study effects of both parameters on the plume divergence and the obtained results are shown in this section. Effect of the tube length. The cylindrical tubes with a length of Lb = 63, 100, and 250 mm have been considered. The tube diameter is calculated assuming that the tube trailing edge is defined by a half-cone angle of φ = 30 deg. A tip of the cone is near the beginning of the throat. Figures 3 and 4 show Mach number flow fields for all these tubes. Helium atoms reflect on the tube surface in accordance with diffuse reflection. Some of Helium atoms go back to the nozzle and disturb the plume near field flow. An application of the tube makes the plume more collimated in terms of t30 (Table 1). However, tubes with 63-100 mm length create a more divergent plume in terms of α95 parameter. Only an application of 250 mm tube leads to very collimated plume with t30 = 0.901, which is close to the design goal of the Herschel small nozzles. The influence of the tube on the plume structure is clearly seen using a density distribution at a distance of 0.5 m from the nozzle (Fig. 4 right). The tube creates a sudden drop in the density distribution at 30 deg half-cone angle and this drop is larger and sharper for larger tube lengths. Tubes with a length of 63-100 mm create lower density at angles less than 15 deg. The tube with the length of 250 mm provides higher density for angles up to 30 deg and lower density at larger angles comparing with the plume created by the bare nozzle. Due to an open left hand end of the tube, 0.1 mg/sec Helium is released in the opposite direction for the 250 mm tube. When the left hand end is closed, Helium flows only along X direction. However, the closed end causes a slight plume divergence, for example, 100 mm tube creates a wider plume as the open tube with 63 mm length (cf. Tables 1 and 2).
Fig. 3. Mach number flow field for tubes with a length of 63 mm (left) and 100 mm (right)
Fig. 4. Mach number flow field for a tube with a length of 250 mm (left) and density distribution at a distance of 0.5 m (right)
Table 1. The plume properties for tubes

length, mm   t30     α95, deg   ṁ, mg/sec   ta, mN   t, mN
0            0.741   46.64      1.10        0.833    0.925
63           0.763   51.98      1.02        0.760    0.851
100          0.799   50.31      1.01        0.751    0.834
250          0.901   37.23      0.99        0.748    0.808

Table 2. The plume properties for tubes with closed left end

length, mm   t30     α95, deg   ṁ, mg/sec   ta, mN   t, mN
100          0.763   51.76      1.09        0.801    0.898
250          0.877   38.86      1.09        0.807    0.877

Table 3. Effect of tube radius on plume properties

Lb, mm   Rb, mm   φ, deg   t30     α95, deg   ṁ, mg/sec   ta, mN   t, mN
100      58.34    30       0.799   50.31      1.01        0.751    0.834
100      40.00    20       0.637   56.43      0.92        0.623    0.726
250      144.94   30       0.901   37.23      0.99        0.748    0.808
250      91.59    20       0.832   41.30      0.86        0.612    0.674

Effect of the tube radius. The effect has been investigated for tubes with a length of 100 mm and 250 mm. The radius of the 100 mm tube was decreased from 58.34 mm down to 40 mm. As a result, the trailing edge of the baffle lies on the half-cone angle of φ = 20 deg. From an intuitive viewpoint, this should increase the plume collimation. However, a small baffle leads to a large disturbance of the flow in the near plume field, where intermolecular collisions occur. As a result,
the tube with a radius of 40 mm creates a plume wider than the bare nozzle does (Table 3). Figure 5 shows that the density distribution for this plume does not have the significant drop at 20 deg, and the density in the core flow is lower than the corresponding values for the larger tube radius. A decrease of the tube radius for the 250 mm length also yields a wider plume. In this case the baffle surface is still far from the nozzle (cf. Figs. 4 left and 5 right) and the plume is more collimated than without the baffle. The decrease of the tube radius does not affect the density distribution for small angles. However, the density drop is not as big as it is for the larger tube radius and, as a result, the density is higher at angles larger than 30 deg.
4.2 Conical Baffle
Various conical baffles have been considered with half-cone angles of 5, 10, 15, and 20 deg. The length of the baffles is set to Lb = 100 mm. The trailing edge
Fig. 5. Density distribution at a distance of 0.5 m for tubes (left) and Mach number flow field for a tube with a length of 250 mm and smaller radius (right)
Fig. 6. Mach number flow field for a conical baffle with a length of 100 mm (left, 5 deg; right, 10 deg)
Fig. 7. Mach number flow field for a conical baffle with a length of 100 mm (left, 15 deg; right, 20 deg)
position is defined by a half-cone angle of 30 deg. Figures 6 and 7 show Mach number flow-fields for these baffles. An increase of the baffle angle decreases the plume collimation very slightly in terms of t30 (Table 4). The plausible reason is that a larger portion of the Helium flux is emitted along the X direction. At the same time the increase of β leads to a more collimated plume in terms of α95.
4.3 Conclusions
The Herschel spacecraft uses Helium to cool scientific instruments and to compensate the torque caused by the solar pressure acting on the spacecraft surface. The Helium is emitted through small nozzles and their design has to provide a minimum plume impingement. The flow in the nozzles and in the plume passes from continuum regime through transitional to free-molecular regime and only kinetic approach could handle such flows. The kinetic approach, namely, the direct simulation Monte Carlo method has been applied to perform a numerical analysis.
Table 4. Cone angle effect on plume properties

β, deg   t30     α95, deg   ṁ, mg/sec   ta, mN   t, mN
0        0.799   50.31      1.01        0.751    0.834
5        0.791   49.97      1.03        0.766    0.851
10       0.786   49.83      1.05        0.778    0.865
15       0.783   49.04      1.07        0.788    0.876
20       0.779   47.88      1.08        0.800    0.888
A straightforward application of DSMC-based software, SMILE, would have required enormous computer resources to model the nozzle and plume flows with the required accuracy. Consequently the nozzle and plume flow analyses have been split into two successive computations. In the first analysis the flow inside the entire nozzle has been computed, and subsequently a computation of the flow in the supersonic part of the nozzle and the plume flow has been performed. The second computation has used as boundary conditions the results of the first analysis. This allowed us to significantly reduce the requirements on computer resources while achieving the required accuracy. To decrease the plume divergence and, hence, the plume impingement on the spacecraft surface, the application of various baffle shapes, cylinders and cones, was investigated. It was shown that small baffles created an even wider plume than the bare nozzle. An increase of the radius/length of the cylindrical baffle decreases the plume divergence. The baffle with the largest length and radius showed the best performance, close to the requirement. The application of the conical baffle increases the plume collimation, but no significant effect of the half-cone angle was observed.
References 1. Markelov, G.N.: Plume Impingement Analysis for Aeolus Spacecraft and Gas/Surface Interaction Models. J. Spacecraft Rockets 3, 607–618 (2007) 2. Bird, G.A.: Molecular Gas Dynamics and the Direct Simulation of Gas Flows. Pergamon Press, Oxford (1994) 3. Ivanov, M.S., Markelov, G.N., Gimelshein, S.F.: Statistical Simulation of Reactive Rarefied Flows: Numerical Approach and Applications. AIAA Paper 98-2669 (1998) 4. Ivanov, M., Markelov, G., Taylor, S., Watts, J.: Parallel DSMC strategies for 3D computations. In: Schiano, P., Ecer, A., Periaux, J., Satofuka, N. (eds.) Parallel CFD 1996, pp. 485–492. North Holland, Amsterdam (1997)
An Individual-Based Model of Influenza in Nosocomial Environments Boon Som Ong1, Mark Chen2, Vernon Lee2, and Joc Cing Tay1,* 1 ROSS Scientific Pte Ltd Innovation Centre, Units 211-212, 16 Nanyang Drive Singapore 637722 * [email protected] 2 Department of Clinical Epidemiology, Tan Tock Seng Hospital Moulmein Road, Singapore 30843
Abstract. Traditional approaches in epidemiological modeling assume a fully mixed population with uniform contact rates. These assumptions are inaccurate in a real epidemic. We propose an agent-based and spatially explicit epidemiological model to simulate the spread of influenza for nosocomial environments with high heterogeneity in interactions and susceptibilities. A field survey was conducted to obtain the activity patterns of individuals in a ward of Tan Tock Seng Hospital in Singapore. The data collected supports modeling of social behaviors constrained by roles and physical locations so as to achieve a highly precise simulation of the ward's activity. Our results validate the long-standing belief that within the ward, influenza is typically transmitted through staff and less directly between patients, thereby emphasizing the importance of staff-oriented prophylaxis. The model predicts that outbreak size (and attack rate) will increase exponentially with increasing disease infectiousness beyond a certain threshold but eventually tapers due to a target-limited finite population. The latter constraint also gives rise to a peak in epidemic duration (at the threshold level of infectiousness) that decreases to a steady value for increasing infectiousness. Finally, the results show that the rate of increase in distinct cumulated contacts will be highest within the first 24 hours and gives the highest yield for contact tracing among patients that had prolonged periods of non-isolation. We conclude that agent-based models are a necessary and viable tool for validating epidemiological beliefs and for prediction of disease dynamics when local environmental and host factors are sufficiently heterogeneous. Keywords: Agent-based modeling, Spatially-explicit model, Epidemiology, Influenza, Contact patterns.
1 Introduction

During the SARS crisis, hospitals were found to be especially vulnerable to outbreaks [1-3]. Hospitals are also susceptible to nosocomial influenza, and rapid cross-infection between healthcare workers and patients can occur [4-6]. In spite of this, there has been little work in simulating the potential spread of infections in the hospital setting, with a granularity that allows policy makers and infection control
practitioners to explore the utility and potential impact of various hospital outbreak and infection control measures. We have therefore chosen 1) to base the geographic and spatial context of our epidemiological model, in which the disease outbreak takes place, on a hospital environment, and 2) to use agents whose behaviors are based on surveyed activity data of patients and healthcare workers. We have designed and developed a spatially explicit agent-based epidemiological simulation model called ASINE (which stands for Agent-based Simulator for Infections in the Nosocomial Environment). This model can simulate the dynamics of disease spread through person-to-person contact among the staff and patient population for a particular ward at Tan Tock Seng Hospital (TTSH) in Singapore. While hospital infections have traditionally been modeled using compartmental models, our use of a spatially explicit agent-based simulation is driven by the fact that, in a hospital environment: 1) individuals interact with each other locally, 2) individuals are mobile but may be restricted to certain areas, and 3) the individual environment is heterogeneous [7, 8]. Although individual-based models have been used to model the spread of community influenza [7-10], such an approach, to our knowledge, has not been applied to nosocomial influenza.
2 The Spread of Influenza within a Hospital Ward

As alluded to in our introduction, the primary motivation for our work arose out of the experience of nosocomial SARS outbreaks in 2003, and the threat which pandemic influenza may pose to the hospital environment. From this section onwards, we will refer primarily to influenza, as an example of an infectious disease which can potentially be spread in the hospital through staff and patient interactions. In a typical hospital environment, the main venues for human traffic are within the clinical wards; these were also the key locations where outbreaks were observed during the SARS epidemic in Singapore [1]. Influenza is predominantly spread from person to person, by droplet spray or by direct or indirect (e.g. via fomites) contact with nasopharyngeal secretions [6]. In our model, the geographic context is the spatial environment of CDC (Communicable Disease Centre, TTSH) ward 71.

2.1 Modeling the Environment

We implemented a Geographic Information System (GIS) as a data model for a two-dimensional schematic map that represents the environment of interest, in which individuals perform their social activities [11, 12]. The spatial environment only consists of location objects in specific positions with no explicit path information. Therefore, a graph is used to provide the navigational structure for agents to move within the ward [13, 14]. Each location can be thought of as a node in the graph, and an edge can then be added between two nodes to denote that a path exists between them. For each location object, we specify a Cartesian coordinate for its position and a rectangle of a certain width and height for its shape (say, for a bed) in the two-dimensional environment. The topology for the CDC ward was thereby approximated in this manner in accordance with our onsite inspection of the ward.
2.2 Modeling the Human Population at CDC Ward 71

The healthcare staff can be categorized as: doctors, nurses and health attendants. There is only one clerk and one cleaner. There are many types of patients, but we have categorized them into two types: ambulant and non-ambulant. Ambulant patients are allowed to move around the ward area, but a non-ambulant patient would be bedridden for the whole duration of his/her stay at the ward. The last type of agent is a visitor, who may visit the patients. In summary, the population at CDC ward 71 (which we modeled) comprises:
• 18 nurses, 6 for each shift.
• 3 health attendants, 1 for each shift.
• 4 doctors, who do their rounds at the ward daily.
• 1 ward clerk. There is only 1 shift for the ward clerk.
• 1 ward cleaner. There is only 1 shift for the ward cleaner.
• A number of patients (ambulant and non-ambulant) and visitors that can be parameterized during initialization.
2.3 Modeling Agent Behaviors There are two types of routines - standard and miscellaneous. Healthcare workers have standard routines to follow during a work shift. Nurses need to carry out tasks like taking parameters for patients and bed turnings for non-ambulant patients. Health attendants need to serve meals during meal timings and doctors usually make their rounds in the morning. These standard routines occur during certain times of the day and they must be carried out. Apart from these standard routines, different types of individuals each have a set of activities that may be performed. These activities are categorized as miscellaneous routines an agent performs. For instance, a ward clerk may only visit administrative areas like the doctor’s office or the nursing station. A visitor may only visit the patient’s room and nursing station, but is out of bounds to the staff room. By definition, the visitor may also choose not to visit a patient. Hence each patient can have 0 or more visitors during the visitation hour. Each individual agent performs such activities or actions probabilistically. The algorithm for the selection of an action is based on roulette wheel selection where each action is associated with a probability value that corresponds to its fitness. The fitter the action, the greater the chance the agent will perform this action. The social interactions of each individual are simulated on a daily basis with activity patterns obtained from a field survey. This field survey helps to derive the sets of routines mentioned previously that an agent has to carry out. The survey method was sample-based and purely observational. Movements of representative healthcare staff, patients and visitors was observed during an average work day, so as to establish the frequency, duration and intensity of contacts between healthcare staff, patients and their visitors. The ethics review committee of the National Healthcare Group, Singapore, was consulted, with approval obtained, to ensure that the conduct of the field survey respected privacy and ethical consults. For each location x visited by an individual y (upon observation of y during the survey), the probability of visiting x is calculated based on the observed frequencies of visit to x given by Nx divided by the total frequencies of visit to all locations by y,
given by N. For example, a total of 50 activities for a particular nurse were observed (during the survey). Out of these 50 activities, 10 of them are activities performed at the nursing station. So the probability of a nurse going to a nursing station is 1/5. The actual type of activities performed by each individual is not relevant. The duration an agent ai spends at location x is drawn from a normal distribution based on the frequencies of visit and the amount of time associated with each visit. 2.4 Modeling Disease Transmission Epidemics are usually described using a set of states; namely, susceptible (S), infected (E), infectious (I) and recovered/removed (R) [15]. Depending on the disease’s natural history, an epidemiological model can be described using the SEIR, SEIS, SIR, or SIS pattern as shown in Fig. 1. Fig. 1 illustrates a finite state machine (FSM) which describes the possible state transitions of an infectious disease. The possible set of state transitions that can be obtained from the FSM is SEIR, SEIS, SIR and SIS. The FSM allows us to model diseases of different natural history through alternative state transition routes.
Fig. 1. Possible state transitions of an infectious disease
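A minimal sketch of such a disease finite state machine is given below. The state names follow the text; the class itself, the uniform draws for the state durations and the `reinfectable` switch selecting the SEIS/SIS routes are illustrative assumptions, not ASINE's implementation.

```python
import random

class DiseaseFSM:
    """Per-host disease state machine: S -> E -> I -> R (or back to S)."""
    def __init__(self, infected_days, infectious_days, reinfectable=False):
        self.state = "S"
        self.timer = 0.0
        self.infected_days = infected_days      # (min, max) latent period
        self.infectious_days = infectious_days  # (min, max) infectious period
        self.reinfectable = reinfectable        # True selects SEIS/SIS routes

    def expose(self):
        if self.state == "S":
            self.state = "E"
            self.timer = random.uniform(*self.infected_days)

    def step(self, dt_days):
        if self.state in ("E", "I"):
            self.timer -= dt_days
            if self.timer <= 0.0:
                if self.state == "E":
                    self.state = "I"
                    self.timer = random.uniform(*self.infectious_days)
                else:
                    self.state = "S" if self.reinfectable else "R"

# influenza-like settings (infected 1-3 days, infectious 3-6 days)
host = DiseaseFSM(infected_days=(1, 3), infectious_days=(3, 6))
host.expose()
```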
To simulate the pathogenesis of a disease within a host and transmission between hosts, each individual is associated with a disease model that describes the health state of that individual (using the discrete states susceptible, infected, infectious and recovered). A disease is modeled as an agent and is responsible for transiting between epidemic states and performs computations for infecting other susceptible human agents. The joint preconditions for successful influenza transmission between two agents are 1) both agents must first collocate at a location in order for influenza to spread, and 2) the distance between infector agent x and infectee agent y must be within a certain radius. In a spacious room, therefore, the disease may not be transmitted so easily. In our model, the infection radius is defined as twice the size of the human agent. Transmission probability, β, is defined as the transmissibility of the infectious agent multiplied by the susceptibility of the susceptible agent, which is based on unit-time per contact with the following formula:

β1 = 1 − (1 − β)^(T/T1)     (1)
where T = Newtonian time and T1 = simulation time. Both transmissibility and susceptibility depend on the instantaneous health state of the individual. The latter is a random variable drawn from a normal distribution with a mean of 0.5 and standard deviation of 0.5. The range is between 0 (indicates severely
immune-suppressed) and 1 [8]. Therefore, a healthy person may be less susceptible to infection and less likely to transmit the disease. All human agents are initialized to be at a susceptible state. The infected and infectious states are both associated with a non-zero time period. When an infectious agent infects a susceptible agent, the susceptible agent will move to the infected state. After the infected period has expired, the agent will move to the infectious state. This is the state where the disease becomes contagious and transmission to other agents can occur at this state. After the infectious period has expired, the agent recovers. Mortality is currently not modeled.
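The unit-time conversion of formula (1) can be written as a one-line helper. The extracted text leaves the orientation of the exponent somewhat ambiguous; the sketch below uses the ratio (simulation time step)/(contact duration), the choice under which repeatedly applying the per-step probability over a whole contact recovers the per-contact probability β. All names and the numbers in the example are ours.

```python
def unit_time_probability(beta, contact_time, time_step):
    """Per-time-step transmission probability derived from the per-contact
    probability `beta`, in the spirit of formula (1)."""
    return 1.0 - (1.0 - beta) ** (time_step / contact_time)

# example: beta = 0.1 over a 60-minute contact, simulation step of 1 minute
p_step = unit_time_probability(0.1, contact_time=60.0, time_step=1.0)
# applying p_step sixty times reproduces the original per-contact probability
p_total = 1.0 - (1.0 - p_step) ** 60          # == 0.1 (up to rounding)
```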
3 Model Validation and Results

We designed several experiments with two aims in mind. The first experiment (A) aimed to validate our agent-based epidemiological model in terms of its ability to simulate the ward environment. The second and third experiments (B and C) are samples of the type of results we can obtain from this model, which may be useful in guiding control measures.

A) Contact patterns between individuals within the ward environment

Table 1. Contact patterns between individuals within the ward environment
Distinct contacts of index for 1 day, by individual type:

Index individual       Cleaner  Clerk  Doctor  Health Attendant  Nurse  Ambulant Patient  Non-ambulant Patient  Visitor  Total Contacts
Cleaner                0        0.94   0.04    1.76              10.12  0.4               0.15                  0.72     14.13
Clerk                  0.89     0      1.72    1.85              14.28  1.21              0                     3.96     23.91
Doctor                 0        0.37   2.83    0.2               5.62   7.85              14.65                 4.35     35.87
Health Attendant       1.27     0.64   0.67    0.21              8.21   0                 0.3                   2.06     13.36
Nurse                  0.6      0.85   1.19    1.36              12.89  4.06              5.65                  3.54     30.14
Ambulant Patient       0.05     0.16   3.45    0.4               8.18   0                 1.91                  2.36     16.51
Non-ambulant Patient   0.01     0      3.58    0.03              7.03   0                 0                     1.29     11.94
Visitor                0.04     0.09   0.4     0.19              1.86   0.57              0.57                  2.46     6.18
Table 1 shows the average number of distinct contacts (from 100 realizations) encountered by a putative index individual within the course of a 24-hour period. We see that the contact patterns in the ward which are generated by the model resemble what we expect of a ward environment. For example, the clerk has a high number of contacts mainly with the nursing staff, due to her central location at the nursing station; she has, however, minimal contact with patients. The doctors, on the other hand, have a high number of contacts with nurses, patients and visitors; they also have the highest number of contacts amongst all staff types. When we look at patients, we see that patients have fewer contacts overall, and that patients have few contacts with other patients; in particular, we see that non-ambulant patients are unlikely to have contact with other patients. The model thus affirms the current opinion that, in many nosocomial diseases, transmission is not occurring from direct patient-to-patient contact but through healthcare workers as vectors.
B) Outbreak size and path-length for an infectious disease with different transmission probabilities

We can also use our model to simulate other diseases which may, in the future, cause outbreaks spread by direct person-to-person contact. The set of parameters which have been used to describe the biology and natural history of influenza is shown in Table 2. In this set of experiments, we simulated the propagation of the entire outbreak (i.e. all generations of cases) and then calculated the simulated outbreak size, the total attack rate (including staff, patients and visitors), the epidemic duration, and the number of generations of cases.

Table 2. Disease parameters for influenza (adopted from [10])

Duration (in days)   Minimum   Maximum   Mean
Infected state       1         3         1.9
Infectious state     3         6         4.1
Again, the parameter describing the infectiousness of the disease is unknown, so we simulated a range of values for infectiousness (β), as shown in Fig. 2a. We see that, below a certain threshold value of infectiousness, the average outbreak size is very small; this is because, at these values, the index case on the average produces less than one other infectious case (R0 < 1), therefore no propagated transmission is possible. With higher levels of infectiousness, the outbreak size has a near linear association (on a logarithmic scale) with infectiousness. Further increase in infectiousness, however, only has a marginal effect on outbreak size since the ward environment is a finite population and almost all individuals who can be infected would have been infected; this is illustrated by Fig. 2b, which shows the attack rates approaching 100% at values of infectiousness exceeding 0.1. When looking at epidemic duration and the maximum number of generations within an outbreak, an interesting pattern emerges. At lower levels of infectiousness, epidemic duration increases with increasing infectiousness (Fig. 2c). This is because of the likelihood that the epidemic will generate successive generations of cases (Fig. 2d). However, with higher levels of infectiousness, epidemic durations decrease; this is because, when the average number of cases infected by an infectious patient increases, the finite number of individuals within the ward environment can be infected in fewer generations than at lower levels of infectiousness. C) Outcome and yield of contact tracing for simulated outbreaks with different infection parameters Fig. 3a and Fig. 3b simulate the dynamics of a commonly used intervention, that of contact tracing; the situation simulated is one where the ward environment is exposed to an infectious index case for a number of days before the case is identified and isolated. For Fig. 3a, we explore the number of distinct contacts that would be generated over the infectious period if the index were a patient, or any of the staff types shown in the picture. We see that staff have far more contacts than patients, and nonambulant patients have the least number of contacts. The number of distinct contacts
Fig. 2. Outbreak size and path length of influenza
Fig. 3. Outcome and yield of contact tracing
accumulated over time is interesting to note; the sharpest increase is between time zero and day 1; this is because, within the first 24 hours, an index case (in particular in the case of patients), would have met most of the individuals which he/she will ever meet within the ward environment. The result of this cumulative contact pattern translates into the patterns observed in Fig. 3b when we look at the yield of contact tracing, when the index case is a patient. If uniform infectiousness is assumed over time, then the cumulative contacts infected by an index case increases with the days that the patient is left without isolation at a faster rate than the increase in the number of contacts, with the result being that the yield of contact tracing is higher for cases who have not been isolated for a longer period, regardless of the level of infectiousness assumed.
4 Conclusions

We have designed and developed an individual-based epidemiological simulation model that can accurately simulate the spread of influenza within a ward at the Communicable Disease Centre of Tan Tock Seng Hospital in Singapore. As influenza is typically spread by droplet or direct person-to-person contact, the basic interactivity and social structure of the domain would be of paramount importance. As such, we undertook field surveys of the movement patterns of staff, patients and their visitors. These movements result from visitation patterns, bed and nurse-patient allocation
methods, and from miscellaneous activities such as visitations to various staff rooms, pantries, washrooms, or taking work breaks and mandatory activities like bed turning, taking of parameters, administration of medication and shift change. We also employed a two dimensional topology for the physical structure of the ward to constrain the navigational movement of individuals. Disease transmission was based on an individualized SEIR model without morbidity and mortality. Heterogeneity in individual health statuses and interaction patterns determine local transmissibility and form the measurable dynamics of disease spread for parameters such as epidemic duration and attack rate. The resulting individual-based, stochastic model of influenza spread within a constrained environment with finite population was validated with epidemiologists through a number of experiments. We established the long-standing belief that within the wards, influenza is typically transmitted through staff and less directly between patients, thereby emphasizing the importance of staff-oriented prophylaxis. Also, results show that outbreak size (and attack rate) increases exponentially with increasing disease infectiousness beyond a certain threshold but tapers eventually due to a target-limited finite population constraint. The latter constraint also gave rise to a peak in epidemic duration (at the threshold level of infectiousness) that decreases to a steady value for increasing infectiousness. Finally, we showed that the rate of increase in distinct cumulated contacts was highest within the first 24 hours and gave the highest yield for contact tracing among patients that had longer periods of non-isolation. Through this project, we showed that agent-based models are a necessary tool for validating epidemiological beliefs and for prediction of disease dynamics when local environmental and host factors are sufficiently heterogeneous. Acknowledgments. We would like to thank CDC, Tan Tock Seng Hospital of Singapore for providing needed data clearance and access to the ward and its staff. In particular, A/Prof. Leo Yee Sin, Dr. Angela Chow, Staff Nurse Quek Lee Kheng and student helpers Ms. Guo Zaiyi and Ms. Christine Ong.
References 1. Heng, B.H., Lim, S.W.: Epidemiology and control of SARS in Singapore. Epidemiological News Bulletin, Ministry of Health Singapore 29, 42–47 (2003) 2. Skowronski, D.M., Astell, C., Brunham, R.C., Low, D.E., Petric, M., Roper, R.L., Talbot, P.J., Tam, T., Babiuk, L.: Severe acute respiratory syndrome (SARS): a year in review. Annual Review of Medicine 56, 357–381 (2005) 3. Yu, I.T.S., Sung, J.J.Y.: The epidemiology of the outbreak of severe acute respiratory syndrome (SARS) in Hong Kong–what we do know and what we don’t. Epidemiology and Infection 132, 781–786 (2004) 4. Salgado, C.D., Farr, B.M., Hall, K.K., Hayden, F.G.: Influenza in the acute hospital setting. Lancet Infectious Diseases 2, 145–155 (2002) 5. Sartor, C., Zandotti, C., Romain, F., Jacomo, V., Simon, S., Atlan-Gepner, C., Sambuc, R., Vialettes, B., Drancourt, M.: Disruption of services in an internal medicine unit due to a nosocomial influenza outbreak. Infection control and hospital epidemiology 23, 615–619 (2002) 6. Stott, D.J., Kerr, G., Carman, W.F.: Nosocomial transmission of influenza. Occupational Medicine 52, 249–253 (2002)
7. Bian, L.: A conceptual framework for an individual-based spatially explicit epidemiological model. Environment and Planning B: Planning and Design 31, 381–395 (2004) 8. Dunham, J.B.: An Agent-Based Spatially Explicit Epidemiological Model in MASON. Journal of Artificial Societies and Social Simulation 9 (2005) 9. Ferguson, N.M., Cummings, D.A., Fraser, C., Cajka, J.C., Cooley, P.C., Burke, D.S.: Strategies for mitigating an influenza pandemic. Nature 442, 448–452 (2006) 10. Longini Jr., I.M., Halloran, M.E., Nizam, A., Yang, Y.: Containing Pandemic Influenza with Antiviral Agents. American Journal of Epidemiology 159, 623–633 (2004) 11. Crooks, A.T.: Exploring Cities using Agent-based Models and GIS. In: Proceedings of the Agent 2006 Conference on Social Agents: Results and Prospects, University of Chicago and Argonne National Laboratory, Chicago, IL (2006), http://www.agent2006.anl.gov/2006procpdf/Crooks_Agent_2006. pdf 12. Gonçavels, A.S., Rodrigues, A., Correia, L.: Multi-Agent Simulation Within Geographic Information Systems. In: Coelho, H., Espinasse, B. (eds.) Proceedings of 5th Workshop on Agent-Based Simulation (2004) 13. Buckland, M.: Programming game AI by example. Wordware Pub. (2005) 14. Smed, J., Hakonen, H.: Algorithms and Networking for Computer Games. Wiley, Chichester (2006) 15. Hethcote, H.W.: The Mathematics of Infectious Diseases. SIAM Review 42, 599–653 (2000)
Modeling Incompressible Fluids by Means of the SPH Method: Surface Tension and Viscosity Pawel Wróblewski1, Krzysztof Boryczko1, and Mariusz Kopeć2 1
Department of Computer Science, AGH University of Science and Technology, Kraków {pawel.wroblewski,boryczko}@agh.edu.pl 2 Faculty of Physics and Applied Computer Science, AGH University of Science and Technology, Kraków [email protected]
Abstract. Adaptations of the SPH method for simulating incompressible fluids, focusing on two features, surface tension and artificial viscosity, are presented in this article. The background and principles of the SPH method are explained and its application to incompressible fluid simulations is discussed. The methodology and implementation of artificial viscosity in the SPH method are presented. A modification for surface tension simulation, which relies on incorporating additional forces into the model, is suggested together with the corresponding methodology. New equations for artificial viscosity, able to simulate the flow of non-newtonian fluids, are also presented. The results obtained with the method are presented and discussed.
1 Introduction
A number of existing computer methods can be used for simulating phenomena from the real world. Many of these phenomena, very important in contemporary science and engineering, are related to fluid mechanics. These phenomena refer to different spatio-temporal scales, starting from the micro scale, through the meso scale, to the macro scale. Very interesting processes, e.g. turbulence, wall-fluid interactions and free surface behavior, take place in the domain between these scales. However, neither correct nor efficient methods have been available for this area yet. The SPH method is a very popular particle method for simulating processes in the macro scale [12]; it also seems to be possible to simulate phenomena from the domain between the macro and meso scales. However, despite many advantages, the proposed method also has several drawbacks. From the authors' point of view, one of the most awkward is the lack of the possibility to simulate surface tension. There are a few papers presenting modifications of the SPH method which remove this disadvantage [16][13]. However, the analysis of these modifications reveals that they are either deprived of a strong physical background or are too complicated for computer implementation. Another drawback of the SPH method is the problem with modeling a flow of non-newtonian fluids. This topic is almost not present in papers concerning SPH simulations [15] and there
is no straightforward scheme for obtaining a suitable model of non-newtonian viscosity, which opens this research area to new investigations. The background and principles of the SPH method are shown along with its variants, depending on the target application. Adaptations of this method to incompressible fluid simulations are also presented. The proposed modification of the SPH method enables modeling surface tension in several physical phenomena. The changes rely on additional forces acting between SPH particles as well as between SPH particles and the walls of a vessel. The methodology of adding the new forces is discussed. The implemented modified algorithm has been employed for simulating two phenomena: the behavior of a fluid drop in vacuum without gravity, and the capillary rise between two vertical, parallel plates inserted into a fluid. The second application of the SPH method presented in this paper is the simulation of a flow of a non-newtonian fluid. In order to achieve it we propose new equations for artificial viscosity, which are in fact a modification of Monaghan's artificial viscosity model. This modification consists of a change in the character of the viscosity's dependence on the interparticle velocity. In the proposed model this dependence is non-linear. The new model of viscosity was validated in the simulation of the flow of a viscous fluid in a long, cylindrical vessel. The non-newtonian character of the fluid manifested itself in the modified velocity profile for the modeled flow [4].
2 Smoothed Particle Hydrodynamics
The SPH method was created in order to simulate astrophysical phenomena [10][5]. The main idea of the method is a field approximation on a set of points in space. The hydrodynamical forces, corresponding to the Navier-Stokes equations, are calculated at these points (the SPH particles), and with such a background the equations of motion are solved. The approximation procedure uses a kernel function which vanishes at infinity and whose integral is equal to unity. From the theoretical point of view one could choose a Gaussian bell-shaped function; however, in practice it is common to use a spline function with compact support. The authors present results obtained by means of the kernel function proposed in [11]. The approximation procedure applies not only to the hydrodynamical forces, but also to other quantities referred to by the modeled phenomenon. The approximation equation for the density is presented below [11]:

ρi = Σj mj Wij ,     (1)
where mj is the mass of particle j, Wij = W(rij, hij) is the kernel function evaluated for particles i and j, rij is the distance between particles i and j, and hij is a smoothing length. The sum in the above equation runs over all particles in the system. In practice, however, if the support of the kernel function is compact, it is enough to count only those SPH particles for which the kernel function is non-zero. The character of the interparticle interactions is then short-range: there exists a cut-off radius rcut such that, for every pair of particles whose
distance is larger than r_{cut}, the force acting between them is equal to zero. For simulations with short-range interactions it is possible to use a structure of cubic cells, which greatly accelerates the calculations [2]. Every SPH particle undergoes the acceleration given by the formula:

\frac{d\mathbf{v}_i}{dt} = -\sum_j m_j \left( \frac{P_i}{\rho_i^2} + \frac{P_j}{\rho_j^2} + \Pi_{ij} \right) \nabla_i W_{ij} ,   (2)

where P_i is the pressure at point i, \rho_i is the density at point i and \Pi_{ij} is the viscosity part of the force. The full derivation of equation (2) can be found in [8]. Besides the force acting between fluid particles, it is also necessary to incorporate into the model forces acting between fluid particles and the walls. In this paper we use a wall consisting of particles, and the corresponding force is very similar to the one given in [12]. It is given by the formula:

\frac{d\mathbf{v}_i}{dt} = \sum_j \frac{c_0^2}{10} \, \frac{\Gamma(r_{ij}/r_{cut})}{r_{ij}} \, \frac{m_j}{m_i + m_j} ,   (3)

where

\Gamma(q) = \begin{cases} \frac{2}{3}, & \text{if } q < \frac{1}{3}, \\ 2(2q - 3q^2), & \text{if } \frac{1}{3} < q < \frac{1}{2}, \\ 2(1-q)^2, & \text{if } \frac{1}{2} < q < 1, \\ 0, & \text{otherwise.} \end{cases}   (4)
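A minimal sketch of the wall term (3)-(4) is given below. The particle record, its field names and the way neighbours are gathered are assumptions made for illustration, as is the choice of applying the force along the line from the wall particle to the fluid particle (the scalar form of Eq. (3) leaves the direction implicit). Only pairs with r_ij < r_cut contribute, which reflects the compact support discussed above.

```cpp
#include <cmath>
#include <vector>

// Hypothetical particle record; field names are illustrative only.
struct Particle {
    double x[3];   // position
    double a[3];   // accumulated acceleration
    double m;      // mass
    bool   isWall; // true for boundary particles
};

// Repulsion factor Gamma(q) from Eq. (4).
double gammaWall(double q) {
    if (q < 1.0 / 3.0) return 2.0 / 3.0;
    if (q < 0.5)       return 2.0 * (2.0 * q - 3.0 * q * q);
    if (q < 1.0)       return 2.0 * (1.0 - q) * (1.0 - q);
    return 0.0;
}

// Adds the fluid-wall acceleration of Eq. (3) to fluid particle pi,
// summing over wall particles within the cut-off radius.
void addWallForce(Particle& pi, const std::vector<Particle>& all,
                  double c0, double rcut) {
    for (const Particle& pj : all) {
        if (!pj.isWall) continue;
        double d[3], r2 = 0.0;
        for (int k = 0; k < 3; ++k) { d[k] = pi.x[k] - pj.x[k]; r2 += d[k] * d[k]; }
        double r = std::sqrt(r2);
        if (r <= 0.0 || r >= rcut) continue;            // outside compact support
        double mag = (c0 * c0 / 10.0) * gammaWall(r / rcut) / r
                     * pj.m / (pi.m + pj.m);            // Eq. (3) magnitude
        for (int k = 0; k < 3; ++k) pi.a[k] += mag * d[k] / r;  // push away from wall
    }
}
```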
3 Incompressible SPH
The standard SPH formula for the density (1) is useless when modeling fluids with a free surface. If this equation is applied, the density in the vicinity of the surface changes continuously from the value assumed for all particles down to zero over a distance of 2h, which clearly disagrees with experiments. In the case of such fluids another formula derived from the SPH approximation is used:

\frac{d\rho_i}{dt} = \sum_j m_j (\mathbf{v}_i - \mathbf{v}_j) \cdot \nabla_i W_{ij} ,   (5)
which evaluates only the rate of change of the density. Application of this equation requires the initialization of the density values at the beginning of the simulation. The incompressible character of the fluid is modeled by an appropriate equation of state, which is used for evaluating the pressure values in equation (2). The authors use the equation of state given by [12]:

P = \frac{\rho_0 c_0^2}{7} \left[ \left( \frac{\rho}{\rho_0} \right)^{7} - 1 \right] .   (6)
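The weakly compressible closure (6) translates into a one-line pressure update; the sketch below combines it with a single pairwise contribution to the density rate of Eq. (5). The argument names and the externally supplied kernel gradient are assumptions of this illustration.

```cpp
#include <cmath>

// Equation of state, Eq. (6): pressure from density.
double pressure(double rho, double rho0, double c0) {
    return rho0 * c0 * c0 / 7.0 * (std::pow(rho / rho0, 7.0) - 1.0);
}

// One pairwise contribution to d(rho_i)/dt from Eq. (5).
// gradW[] is the gradient of the kernel with respect to particle i,
// evaluated for the pair (i, j) by some kernel routine (not shown here).
double densityRateContribution(const double vi[3], const double vj[3],
                               double mj, const double gradW[3]) {
    double sum = 0.0;
    for (int k = 0; k < 3; ++k) sum += (vi[k] - vj[k]) * gradW[k];
    return mj * sum;
}
```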
When modeling incompressible fluids it is very problematic to choose a proper timestep. If the real value of the speed of sound c is applied, the timestep is too small for any practical application. Therefore it is convenient to use a value of c several orders of magnitude smaller than the real one. This approach accelerates the calculation significantly and does not influence the results [3].
4 Modeling Viscosity with SPH
In many simulations of fluid flow it is necessary to account for the transition of the kinetic energy of the fluid into its thermal energy. In the SPH method presented here, where the thermal energy of the fluid is not considered, it is necessary to incorporate the dissipation of the fluid energy by means of viscosity. Also, the everyday experience of viscosity in almost all real fluids demands incorporating the viscosity term \Pi_{ij} into equation (2). The most often used [8] model of artificial viscosity in SPH simulations is the one proposed by Monaghan [11]:
\Pi_{ij} = \begin{cases} \dfrac{-\alpha c_{ij} \mu_{ij} + \beta \mu_{ij}^2}{\rho_{ij}}, & \text{if } \mathbf{v}_{ij} \cdot \mathbf{r}_{ij} < 0, \\ 0, & \text{if } \mathbf{v}_{ij} \cdot \mathbf{r}_{ij} \ge 0, \end{cases}   (7)

where

\mu_{ij} = \frac{h \, \mathbf{v}_{ij} \cdot \mathbf{r}_{ij}}{r_{ij}^2 + \eta^2} .   (8)
Monaghan proposes to set \eta^2 = 0.01 h^2. In this model viscosity vanishes when \mathbf{v}_{ij} \cdot \mathbf{r}_{ij} \ge 0, which has an equivalent on the SPH interpolation level: \nabla \cdot \mathbf{v} \ge 0 [11]. Accordingly, viscosity is present only when two particles approach each other; in the opposite case the viscosity force is equal to 0. There are also other models of artificial viscosity used in SPH simulations, such as those proposed by Hernquist and Katz [6] or by Balsara [1]. A more detailed discussion of the appropriate choice of the artificial viscosity model is presented in [9]. In the simulations presented in this article the authors use the model proposed by Monaghan. Its advantages are simplicity and relatively low computational cost.
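A sketch of the pairwise term (7)-(8) is given below; the pair-averaged sound speed and density passed in for c_ij and rho_ij, as well as all argument names, are assumptions of this illustration rather than part of the original formulation.

```cpp
#include <cmath>

// Monaghan's artificial viscosity for a particle pair, Eqs. (7)-(8).
// vij[] = v_i - v_j, rij[] = r_i - r_j; cij and rhoij are pair-averaged
// sound speed and density (the averaging convention is assumed here).
double artificialViscosity(const double vij[3], const double rij[3],
                           double h, double alpha, double beta,
                           double cij, double rhoij) {
    double vdotr = 0.0, r2 = 0.0;
    for (int k = 0; k < 3; ++k) { vdotr += vij[k] * rij[k]; r2 += rij[k] * rij[k]; }
    if (vdotr >= 0.0) return 0.0;                 // particles receding: no viscosity
    double eta2 = 0.01 * h * h;                   // Monaghan's choice for eta^2
    double mu = h * vdotr / (r2 + eta2);          // Eq. (8)
    return (-alpha * cij * mu + beta * mu * mu) / rhoij;  // Eq. (7)
}
```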
5 The Proposed Modifications of the SPH Method for Simulating Surface Tension and the Non-Newtonian Character of a Fluid

5.1 Additional Forces for the Surface Tension Model
When modeling phenomena in which surface tension effects arise, one needs to incorporate two additional parts into the model. The first represents interactions acting between fluid particles and is responsible for modeling the surface tension of the fluid. The second is a modification of the fluid-wall interactions, which is responsible for properly modeling the wetting character of the fluid.
Fig. 1. The function of the new additional force
Fluid-fluid interactions. The surface tension is an effect of the mutual attraction of fluid molecules. It is impossible to simulate the exact inter-molecular interaction in the SPH method, because the scale of the method is much larger than the scale at which intermolecular forces are present. However, the main idea of the model is still the same and it is realized by incorporating additional attractive forces. When trying to find the form of this force, the authors found that it was very difficult to do so when the range of the new, additional attractive force was the same as the range of the SPH interactions. In this case the additional force modified the nature of the SPH force, and together they led to numerical artifacts; for example, the SPH particles tended to bind in pairs. This is why, following the advice given in [14], the authors move the range of action of the additional force beyond the SPH range; in fact it reaches twice as far as the SPH range. In this case the artifacts are not observed anymore, and the results are in good agreement with expectations. The form of the new, proposed force is given by the equation:

F_{ij} = -A \cdot W\!\left( \frac{3}{2} r_{cut} - r_{ij}, \; \frac{1}{4} r_{cut} \right) ,   (9)

where A is a positive constant and W is the kernel function. The plot of this new function is depicted in Fig. 1. The form of the proposed additional force is one of many possible. During tests with many different forms of this force the authors concluded that, in general, the form of the force does not influence the results as long as the values of the force are negative in the range [r_{cut}, 2 r_{cut}]. Therefore, the authors proposed the force given by (9), which is convenient to implement.

Fluid-wall interactions. Similarly, in order to model phenomena which concern hydrophilic or hydrophobic fluids, it is also necessary to incorporate additional attractive forces acting between fluid particles and wall particles. In the original SPH model the force acting between the walls and the fluid was purely repulsive. The authors incorporated additional attractive forces into the simulation. Their form is given by formula (9), i.e. it is exactly the same as in the case of fluid-fluid interactions, but with a different value of the constant A. The reasons why this particular form of the force was used are the same as in the case mentioned above.
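The attractive term (9) only needs the kernel evaluated with a shifted argument and a reduced smoothing length. In the sketch below the kernel W(r, h) is passed in as a callable so that whatever kernel is used elsewhere in the simulation can be reused; the function name and the explicit band check are assumptions of this illustration.

```cpp
#include <functional>

// Additional attractive force of Eq. (9), intended for the range [rcut, 2*rcut].
// The SPH kernel W(r, h) is supplied by the caller (e.g. the spline of [11]).
double surfaceTensionForce(double rij, double rcut, double A,
                           const std::function<double(double, double)>& W) {
    if (rij < rcut || rij > 2.0 * rcut) return 0.0;   // outside the attraction band
    return -A * W(1.5 * rcut - rij, 0.25 * rcut);      // negative value = attraction
}
```

The same routine can serve the fluid-wall interactions, simply called with a different value of A, as described above.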
5.2 The Modification of the Artificial Viscosity Model
Additionally, in order to propose a model capable of simulating flows of non-Newtonian fluids, we propose a modification of Monaghan's artificial viscosity. The modification relies on a change of equation (8) to:

\mu_{ij} = \frac{h \, \mathbf{v}_{ij} \cdot \mathbf{r}_{ij}}{r_{ij}^2 + \eta^2} \cdot \frac{\exp(v_{ij}) - 1}{v_{ij}} .   (10)
According to this change, the artificial viscosity acting between two particles depends nonlinearly on their mutual velocity, and in this way it is possible to obtain the non-Newtonian character of a fluid flow. This manifests itself in a change of the velocity profile of the flow, which now corresponds to a viscosity coefficient nonlinearly dependent on the shear rate [4]. By using the modification given by (10) the authors introduced a non-linear dependence of \mu_{ij} on v_{ij}. Equation (10) is only an example of such a modification and is intended to show the possibilities for further research in this area. The authors also tested several other equations and obtained similar results.
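For comparison with the sketch of Eqs. (7)-(8) above, the modified pair term of Eq. (10) differs only by an extra velocity-dependent factor; taking v_ij as the magnitude of the relative velocity is an assumption of this illustration.

```cpp
#include <cmath>

// Modified mu_ij of Eq. (10): the linear term of Eq. (8) multiplied by
// (exp(v) - 1) / v, where v is assumed to be |v_i - v_j|.
double muModified(const double vij[3], const double rij[3], double h, double eta2) {
    double vdotr = 0.0, r2 = 0.0, v2 = 0.0;
    for (int k = 0; k < 3; ++k) {
        vdotr += vij[k] * rij[k];
        r2    += rij[k] * rij[k];
        v2    += vij[k] * vij[k];
    }
    double muLinear = h * vdotr / (r2 + eta2);                     // Eq. (8)
    double v = std::sqrt(v2);
    double factor = (v > 1e-12) ? (std::exp(v) - 1.0) / v : 1.0;   // -> 1 as v -> 0
    return muLinear * factor;                                      // Eq. (10)
}
```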
6 Results
The modifications presented above have been tested in simulations of three different fluid phenomena.

6.1 Fluid Drop Oscillations without Gravity
The first phenomenon used for validating the form of the additional attractive forces was the behavior of a fluid drop in vacuum. We took a well-equilibrated circular drop and transformed it into an ellipsoid with the transformation [14]:

\begin{pmatrix} x' \\ y' \end{pmatrix} = \frac{2r}{\sin\varphi} \begin{pmatrix} \sin(\varphi/2) \sin u \\ \cos(\varphi/2) \cos u \end{pmatrix} \operatorname{sgn}(y) ,   (11)

where r = \sqrt{x^2 + y^2}, \varphi = 0.63\pi and u = \arctan(x/y). The z coordinate of every particle remained unchanged. Then we examined the relaxation time of the drop, which should depend on the surface tension of the fluid. If there is no artificial viscosity, the drop deformed with the above formula oscillates about the equilibrium state, which is an ideal sphere. However, when the artificial viscosity is present, it is more convenient to examine the relaxation time. A sample relaxation scheme is depicted in Fig. 2. The authors ran four different simulations, for four different values of the parameter corresponding to the surface tension of the fluid. The obtained results are presented in Fig. 3. The relaxation time should depend on the surface tension coefficient \gamma as \sim \gamma^{-1/2} [14]. The plot presented in Fig. 3 shows that the results from the simulation are, at least qualitatively, in good agreement with this dependence.
Fig. 2. A relaxation-oscillation of a fluid drop in vacuum
Fig. 3. Relaxation time versus surface tension
6.2 Capillary Rise of the Liquid
The second phenomenon used for validating the surface tension model within the SPH method is the capillary rise of the modeled liquid. The effect in this phenomenon depends on the difference between the attractive forces of the fluid-fluid interactions and of the fluid-wall interactions. If the fluid-fluid attractive forces are stronger than the fluid-wall interactions, one should expect to see a convex meniscus (negative capillary rise). In the opposite case, a concave meniscus should be visible (positive capillary rise). This is what the authors obtained from the simulations. For fluid-fluid attractive forces stronger than the fluid-wall interactions, the results are as presented in Fig. 4a. On the other hand, when the fluid-wall interactions are stronger than the fluid-fluid ones, the obtained results are as in Fig. 4b. Both simulated phenomena show that the proposed modification of the SPH method properly models the surface tension effects.

6.3 The Fluid Flow in an Elongated Vessel
The next phenomenon modeled by means of the SPH method is the fluid flow in an elongated vessel. At the beginning all fluid particles filling the vessel were at rest. Then we applied an initial velocity to all of them and continued the simulation until the flow stopped. The flow was decelerated by the forces acting between the
Fig. 4. Two menisci obtained from the simulations: a) convex meniscus, b) concave meniscus
Fig. 5. a) Velocity distribution. b) Velocity profiles for two different viscosity models.
wall and fluid particles. The wall-fluid interactions were modeled by means of the DPD method [7], with the Brownian part of the force omitted; it was the dissipative part of the DPD model that decelerated the flow of the fluid. A sample velocity distribution, with velocities marked by color, is presented in Fig. 5a. The authors ran several such simulations, each with a different artificial viscosity model. When comparing Monaghan's model with the modified artificial viscosity, it seems that our modification can be treated as a starting point for further research in the area of non-Newtonian fluid flows. This can be inferred from the analysis of the velocity profiles from the executed simulations. The velocity profiles for the two models, Monaghan's model and the one given by equation (10), are presented in Fig. 5b. The obtained profile is more flattened than the one for Monaghan's model, which meets the expectations [4].

6.4 The Parallel Implementation of the Model
The SPH method, along with the presented modifications, was implemented in parallel by means of the OpenMP environment. The simulation of a fluid flow was
Fig. 6. The relative efficiency of the OpenMP implementation
run for several values of the number of processors. The execution time of a single simulation step was measured, and the relative efficiency was evaluated. The results are depicted in Fig. 6. Similar values of the relative efficiency were obtained for the simulation of the SPH method with the modifications for surface tension. In this case, however, the execution time of a single simulation step was larger, since the interparticle interaction range was doubled.
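A minimal sketch of how such an OpenMP parallelization of the per-particle force loop and the timing of a single step might look is given below; the particle container, the pairwise force routine and the loop schedule are assumptions of this illustration, not the authors' actual code.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

struct Particle { double x[3], v[3], a[3], m; };  // illustrative layout

// Placeholder for the pairwise SPH interaction (pressure, viscosity, ...).
void accumulatePairForce(Particle& pi, const Particle& pj) { /* ... */ }

double simulationStep(std::vector<Particle>& p) {
    double t0 = omp_get_wtime();
    // Each thread owns a disjoint range of particles i; writes go only to p[i].
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(p.size()); ++i) {
        for (std::size_t j = 0; j < p.size(); ++j)
            if (static_cast<std::size_t>(i) != j) accumulatePairForce(p[i], p[j]);
    }
    return omp_get_wtime() - t0;   // wall-clock time of one step
}

int main() {
    std::vector<Particle> particles(1000);
    double t = simulationStep(particles);
    // Relative efficiency = (t_serial / t_parallel) / n_threads, measured over runs.
    std::printf("step time: %f s on %d threads\n", t, omp_get_max_threads());
}
```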
7 Conclusions and Future Work
The results presented in the paper show that the proposed form of the additional forces in the SPH method properly models the surface tension of the simulated fluid. These forces were validated in two simulated phenomena, and satisfactory results were obtained. Also, the simulation of the fluid flow with the modified artificial viscosity indicates that the proposed improvements allow for reasonable modeling of non-Newtonian fluids. Such simulations will find various applications in science and engineering, for example in modeling the properties of blood flows [4]. However, this method still needs additional work and a more detailed derivation of its equations. The presented results qualitatively confirm the correctness of the modified method. However, it is still a big challenge to validate the new forces quantitatively, as no analytical relation between physical quantities such as viscosity or surface tension and the simulation parameters is yet known.

Acknowledgments. This research is financed by the Polish Ministry of Science and Higher Education, Project No. 3 T11F 010 30.
References
1. Balsara, D.: Von Neumann stability analysis of smoothed particle hydrodynamics - Suggestions for optimal algorithms. J. Comput. Phys. 121, 357 (1995)
2. Boryczko, K., Dzwinel, W., Yuen, D.: Parallel implementation of the fluid particle model for simulating complex fluids in the mesoscale. Concurrency and Computation: Practice and Experience 14, 137–161 (2002)
3. Colagrossi, A., Landrini, M.: Numerical simulation of interfacial flows by smoothed particle hydrodynamics. J. Comput. Phys. 191, 448–475 (2003)
4. Gijsen, F., Vosse, F., Janssen, J.: The influence of the non-newtonian properties of blood on the flow in large arteries: steady flow in a carotid bifurcation model. Journal of Biomechanics 32, 601–608 (1999)
5. Gingold, R.A., Monaghan, J.J.: Smoothed particle hydrodynamics - Theory and application to non-spherical stars. Mon. Not. R. Astr. Soc. 181 (1977)
6. Hernquist, L., Katz, N.: TREESPH: A unification of SPH with the hierarchical tree method. The Astrophysical Journal Supplement Series 70, 419–446 (1989)
7. Hoogerbrugge, P.J., Koelman, J.: Simulating microscopic hydrodynamic phenomena with dissipative particle dynamics. Europhys. Lett. 19, 155–160 (1992)
8. Liu, G.R., Liu, M.B.: Smoothed particle hydrodynamics: a meshfree particle method. World Scientific, Singapore (2003)
9. Lombardi, J., Alison, S., Rasio, F., Shapiro, S.: Tests of Spurious Transport in Smoothed Particle Hydrodynamics. Journal of Computational Physics 152, 687–735 (1999)
10. Lucy, L.B.: A numerical approach to the testing of the fission hypothesis. Astron. J. 82, 1013–1024 (1977)
11. Monaghan, J.J.: Smoothed Particle Hydrodynamics. Annu. Rev. Astron. Astrophys. 30, 543–574 (1992)
12. Monaghan, J.J.: Smoothed Particle Hydrodynamics. Rep. Prog. Phys. 68, 1703–1759 (2005)
13. Morris, J.P.: Simulating surface tension with smoothed particle hydrodynamics. Int. J. Numer. Meth. Fluids 33, 333–353 (2000)
14. Nugent, S., Posch, H.A.: Liquid drops and surface tension with smoothed particle applied mechanics. Phys. Rev. E 62, 4968–4975 (2000)
15. Shao, S., Lo, E.Y.M.: Incompressible SPH method for simulating Newtonian and non-Newtonian flows with a free surface. Advances in Water Resources 26, 787–800 (2003)
16. Tartakovsky, A., Meakin, P.: Modeling of surface tension and contact angles with smoothed particle hydrodynamics. Phys. Rev. E 72, 02630 (2005)
17. Wróblewski, P., Boryczko, K., Kopeć, M.: SPH - a comparison of neighbor search methods based on constant number of neighbors and constant cut-off radius. TASK Quart. 11, 275–285 (2007)
Optimal Experimental Design in the Modelling of Pattern Formation

Adrián López García de Lomana, Àlex Gómez-Garrido, David Sportouch, and Jordi Villà-Freixa

Grup de Recerca en Informàtica Biomèdica, IMIM-Universitat Pompeu Fabra, C/Doctor Aiguader, 88, 08003 Barcelona, Catalunya, Spain
{adrianlopezgarciadelomana,david.sportouch}@gmail.com, {agomez,jvilla}@imim.es
http://cbbl.imim.es
Abstract. Gene regulation plays a major role in the control of developmental processes. Pattern formation, for example, is thought to be regulated by a limited number of genes translated into transcription factors that control the differential expression of other genes in different cells of a given tissue. We focused on the Notch pathway during the formation of chess-like patterns in development. Simplified models exist of the patterning by lateral inhibition due to the Notch-Delta signalling cascade. We show here how parameters from the literature are able to explain the steady-state behavior of model tissues of several sizes, although they are not able to reproduce time series of experiments. In order to refine the parameter set for data from real experiments we propose a practical implementation of an optimal experimental design protocol that combines parameter estimation tools with sensitivity analysis, in order to minimize the number of additional experiments to perform. Keywords: lateral inhibition, GRN, optimal experimental design, multicellular system.
1 Introduction
One of the most breathtaking processes in biology is the development of a complex creature. In a matter of just a day (a fly maggot), a few weeks (a mouse) or several months (ourselves), an egg grows into millions, billions, or, in the case of humans, 10 trillion cells formed into organs, tissues and parts of the body. So, the main question in developmental biology is to understand how cells arising from the division of a single cell become different from each other. The complexity of the process of pattern formation in developmental biology has been dealt with by a number of researchers in the last decades (for reviews see [1]), both topologically, studying the different genes involved in the process and their relationships, and dynamically, measuring and modeling the temporal behavior of those genes and their products. Different simulation methods have been applied to dynamical models of patterning, involving both ordinary (ODE) [2]
and partial (PDE) differential equations, as well as discrete representations of the cells as cellular automata, among others [3]. Initial models of pattern formation were based on simple assumptions that were able to capture most of the relevant information for a given general question. Thus, it is worth noting the efforts of Meinhardt [4] and others [5] to unravel the general rules governing the formation of complex patterns during embryo development by using simple yet sound mathematical models. At times, high-throughput studies can also be performed in order to obtain time-dependent qualitative information on the topology of the GRN. This type of information can be processed by probability and statistical inference tools that complement the verbal models defined by the experimentalists and provide a first formal model of the network. However, if one is able to quantify the dynamical information about the expression levels of different genes, even at the level of a few key genes, by, for example, real-time polymerase chain reaction (RT-PCR) experiments, global optimization protocols can be used to refine the parameters that describe the dynamics of the model. In typical situations the modeller asks for experimental data that is scarce, of low quality and, more importantly in most cases, difficult to obtain. How to maximize the outcome from limited resources is the aim of this paper. Here we present the implementation of a practical optimal experimental design pipeline in a theory/experiment integrated fashion. We demonstrate the utility of the protocol in the parameter estimation for one of the simplest models of pattern formation in biology, namely the Notch-Delta pathway for lateral inhibition (LI). To demonstrate the implementation of the method, we work with fictitious RT-PCR experimental data obtained from known models of the Notch-Delta interaction, as the LI model operates between partner cells in a tissue, which poses an extra challenge for experimental manipulation. However, the proposed protocol is completely general for an RT-PCR experimental setting in any biological system that suits this technique.
2 Methods

2.1 Problem Statement
As outlined in the introduction, dynamical biological systems can be described by a large variety of mathematical models. Here we will restrict ourselves to models defined in terms of ODEs. Following [6], the time evolution of a system state of K species x(t, \theta) \in \mathbb{R}^K is the solution of this set of ODEs:

\frac{\partial}{\partial t} x(t, \theta) = f(x(t, \theta), \theta, u(t)), \qquad x(0) = x_0 .   (1)

Here \theta \in \mathbb{R}^P denotes the parameters of the system, and u(t) is a vector containing the input of the system. The L properties of the system y^M(t, \theta) \in \mathbb{R}^{L_i} that can be measured are described by an observation function g at time t_i, i = 1, ..., N (N is here the number of design points):
y^M(t_i, \theta, u) = g(x(t_i, \theta, u)), \qquad i = 1, ..., N.   (2)

The observations Y^D(t_i) \in \mathbb{R}^{L_i}, i = 1, ..., N are considered as random variables and are given by

Y^D(t_i) = y^M(t_i, \theta_0, u) + \epsilon_i, \qquad i = 1, ..., N,   (3)

where \theta_0 is the true parameter vector and \epsilon_i \in \mathbb{R}^{L_i}, i = 1, ..., N describes the distribution error at time t_i. We assume that the distribution of the noise (observation error) follows a normal law (where the variances \sigma_{ij}^2 can be estimated from repetitions of the experiments):

\epsilon_{ij} \sim \mathcal{N}(0, \sigma_{ij}^2), \qquad j = 1, ..., L_i, \quad i = 1, ..., N.   (4)

In fact, y^M(t, \theta) refers to theoretical values (given by the model) and y^D(t_i) (realizations of the random variables Y^D(t_i)) refers to practical values (it corresponds to the L_i measurements made experimentally at each time t_i, i = 1, ..., N).

Maximum Likelihood Method. This method gives us estimates of the parameters of the system. We need to maximize the likelihood function J_{ml}(\theta) to get estimates of the parameter vector \theta. This function is defined as:

J_{ml}(\theta; y^D(t_1), \ldots, y^D(t_N)) = f_{(Y^D(t_1), \ldots, Y^D(t_N))}(y^D(t_1), ..., y^D(t_N)).   (5)
As defined, the random variables Y^D(t_i) follow a multivariate normal law

Y^D(t_i) \sim \mathcal{N}_L(y^M(t_i, \theta, u), C(t_i)), \qquad i = 1, ..., N,   (6)

where C(t_i) \in M_{L,L}(\mathbb{R}) is the covariance matrix defined as C_{ll}(t_i) = \sigma_{il}^2 and C_{kl}(t_i) = 0 if k \ne l. Hence

J_{ml}(\theta; y^D(t_1), \ldots, y^D(t_N)) = \frac{1}{(2\pi)^{\frac{NL}{2}} \prod_{i=1}^{N} \prod_{j=1}^{L_i} \sigma_{ij}} \; e^{-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{L_i} \left( \frac{y_j^D(t_i) - y_j^M(t_i, \theta, u)}{\sigma_{ij}} \right)^2} .   (7)
Maximizing the likelihood function (with respect to \theta) is in fact the same as maximizing the logarithm of the likelihood function, which in turn is the same as minimizing the opposite of the logarithm of the likelihood function. So, this leads to minimizing the following function:

-\ln(J_{ml}(\theta; y^D(t_1), \ldots, y^D(t_N))) = \frac{NL}{2} \ln(2\pi) + \sum_{i=1}^{N} \sum_{j=1}^{L_i} \ln(\sigma_{ij}) + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{L_i} \left( \frac{y_j^D(t_i) - y_j^M(t_i, \theta, u)}{\sigma_{ij}} \right)^2 .   (8)

This amounts to minimizing the following function:

\chi^2(\theta) = \sum_{i=1}^{N} \sum_{j=1}^{L_i} \left( \frac{y_j^D(t_i) - y_j^M(t_i, \theta, u)}{\sigma_{ij}} \right)^2 .   (9)
This corresponds to the minimization of a weighted residual sum of squares (with weights w_{ij} = 1/\sigma_{ij}^2) to get the estimated parameters. At this point, we can compute analytically asymptotic estimates of the parameters \hat{\theta} and asymptotic confidence intervals. In this scope, we assume that we are in the case where we have so many observations that the deviation \Delta\theta between the real parameters \theta_0 and the estimated parameters \hat{\theta} is small. Thus, we can expand the observation function in a Taylor series:

y_j^M(t_i, \theta, u) = y_j^M(t_i, \theta_0, u) + \nabla_\theta y_j |_{t_i, \theta_0} (\theta - \theta_0).   (10)

We insert this result in the function to minimize, and we get:

\chi^2(\theta) = \sum_{i=1}^{N} \sum_{j=1}^{L_i} \left[ \frac{\epsilon_{ij}^2}{\sigma_{ij}^2} - 2 \frac{\epsilon_{ij}}{\sigma_{ij}^2} \nabla_\theta y_j |_{t_i, \theta_0} \Delta\theta + \frac{1}{\sigma_{ij}^2} \Delta\theta^T \left( [\nabla_\theta y_j]^T [\nabla_\theta y_j] \right) |_{t_i, \theta_0} \Delta\theta \right] .   (11)

To minimize \chi^2(\theta), we need to solve the equation \frac{\partial}{\partial\theta} \chi^2(\theta) = 0, so we get the estimated parameters:

\Delta\theta = F^{-1} \sum_{i=1}^{N} \sum_{j=1}^{L_i} \frac{\epsilon_{ij}}{\sigma_{ij}^2} \left( [\nabla_\theta y_j]^T \right) |_{t_i, \theta_0} ,   (12)
where F is the Fisher information matrix.

Parameter Estimation and Covariance Matrix. From the knowledge of F we can easily get the exact values of the (asymptotic) estimated parameters:

\hat{\theta} = \theta_0 + F^{-1} \sum_{i=1}^{N} \sum_{j=1}^{L_i} \frac{\epsilon_{ij}}{\sigma_{ij}^2} \left( [\nabla_\theta y_j]^T \right) |_{t_i, \theta_0} .   (13)

As we assumed that the residuals are independently distributed, the covariance matrix of the estimated parameter vector is computed by (where the average is over the repetition of experiments):

\Sigma = \langle \Delta\theta \, \Delta\theta^T \rangle = F^{-1} .   (14)

Thanks to this covariance matrix, we can see the correlation between the parameters. The correlation matrix is defined by:

R_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}} \;\text{ if } i \ne j, \qquad R_{ij} = 1 \;\text{ if } i = j.   (15)
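As a rough illustration of Eqs. (9) and (12)-(15), the sketch below accumulates the chi-square value and the Fisher information matrix from precomputed sensitivities of the observations with respect to the parameters, and turns a covariance matrix into the correlation matrix. How the sensitivities are obtained and how Sigma = F^{-1} is inverted (e.g. with a linear algebra library) is left outside this fragment, and all names are illustrative.

```cpp
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// chi^2 of Eq. (9) from flattened residuals (yD - yM) and their sigmas.
double chiSquare(const std::vector<double>& residual,
                 const std::vector<double>& sigma) {
    double chi2 = 0.0;
    for (std::size_t k = 0; k < residual.size(); ++k) {
        double r = residual[k] / sigma[k];
        chi2 += r * r;
    }
    return chi2;
}

// Fisher information matrix: F = sum_k (1/sigma_k^2) * grad_k * grad_k^T,
// where grad[k][p] is the sensitivity of observation k to parameter p.
Matrix fisherInformation(const Matrix& grad, const std::vector<double>& sigma) {
    std::size_t P = grad.empty() ? 0 : grad[0].size();
    Matrix F(P, std::vector<double>(P, 0.0));
    for (std::size_t k = 0; k < grad.size(); ++k) {
        double w = 1.0 / (sigma[k] * sigma[k]);
        for (std::size_t a = 0; a < P; ++a)
            for (std::size_t b = 0; b < P; ++b)
                F[a][b] += w * grad[k][a] * grad[k][b];
    }
    return F;
}

// Correlation matrix of Eq. (15) from the covariance Sigma = F^{-1}.
Matrix correlation(const Matrix& Sigma) {
    std::size_t P = Sigma.size();
    Matrix R(P, std::vector<double>(P, 1.0));
    for (std::size_t a = 0; a < P; ++a)
        for (std::size_t b = 0; b < P; ++b)
            if (a != b) R[a][b] = Sigma[a][b] / std::sqrt(Sigma[a][a] * Sigma[b][b]);
    return R;
}
```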
2.2 Parameter Correlation and Identifiability Criteria
Equipped with Eqn. (15), we can measure the interrelationship between the parameters and get an idea of the compensation effects of changes in the parameter
values on the model output. For instance, if two parameters are highly correlated, a change in the model output caused by a change in one model parameter can be compensated by an appropriate change in the other parameter value. This prevents such parameters from being uniquely identifiable even if the model output is very sensitive to changes in the individual parameters. We can then try to improve the information contained in the data by optimizing one of the criteria derived from \Sigma. We used the modified E-optimal design criterion, \min(\lambda_{max}(\Sigma)/\lambda_{min}(\Sigma)). As it minimizes the ratio of the largest to the smallest eigenvalue, it optimizes the functional shape of the confidence intervals. All calculations have been performed with ByoDyn (http://cbbl.imim.es/ByoDyn), most of them in the QosCosGrid [7] environment.
3 Results

3.1 The ODE Model for the Notch-Delta System
In the model, two adjacent cells, i and j, initially expressing the same amount of the genes notch and delta, generate an asymmetric final expression of the genes by the lateral inhibition mechanism. The interaction of the protein NOTCH with its ligand Delta activates the cleavage of the NOTCH intracellular domain (NICD) by a γ-secretase. NICD activates the expression of hes5 and ultimately downregulates delta. If one assumes, in a very rough approximation, that the quantities of the different species in the system are large enough to work with concentrations, we can formalize the verbal model represented in Figure 1; after adimensionalization by t = T_0 \tau and [x]_i = [x]_0 x_i(\tau) it reads:

\frac{d\,notch_i(\tau)}{d\tau} = T_0 K_{deg}^{notch} (r_{notch} - notch_i(\tau)),

\frac{d\,NOTCH_i(\tau)}{d\tau} = T_0 \left[ K_{deg}^{NOTCH} (notch_i(\tau) - NOTCH_i(\tau)) - \frac{k}{n^2} K_{bind}^{ND} [DELTA]_0 \, NOTCH_i(\tau) \sum_{j=1}^{k} DELTA_j(\tau) \right],

\frac{d\,delta_i(\tau)}{d\tau} = T_0 K_{deg}^{delta} \left[ \left( 1 - \frac{HES5_i^s(\tau)}{\kappa_{HES5} + HES5_i^s(\tau)} \right) - delta_i(\tau) \right],

\frac{d\,DELTA_i(\tau)}{d\tau} = T_0 \left[ K_{deg}^{DELTA} (delta_i(\tau) - DELTA_i(\tau)) - \frac{k}{n^2} K_{bind}^{ND} [NOTCH]_0 \, DELTA_i(\tau) \sum_{j=1}^{k} NOTCH_j(\tau) \right],

\frac{d\,ND_i(\tau)}{d\tau} = T_0 \left( \frac{k}{n^2} K_{bind}^{ND} [DELTA]_0 \, NOTCH_i(\tau) \sum_{j=1}^{k} DELTA_j(\tau) - K_{deg}^{ND} ND_i(\tau) \right),

\frac{d\,hes5_i(\tau)}{d\tau} = T_0 K_{deg}^{hes5} \left( \frac{ND_i^m(\tau)}{\kappa_{ND} + ND_i^m(\tau)} - hes5_i(\tau) \right),

\frac{d\,HES5_i(\tau)}{d\tau} = T_0 K_{deg}^{HES5} (hes5_i(\tau) - HES5_i(\tau)).   (16)
where we have assumed that the NOTCH cleavage after the formation of the ND complex, the NICD transport to the nucleus and the transcription factor activation can be simply approximated by the amount of ND complex that is formed
Fig. 1. Simplified model of the Notch/Delta pathway for two adjacent cells i and j. NOTCH* and DELTA* refer to the activated forms.
on the membrane surface. In (16), notch is constitutively activated, while sigmoidal activation and inhibition curves are used for hes5 and delta, respectively. By using the parameters K_{deg}^{hes5} = K_{deg}^{HES5} = K_{deg}^{NOTCH} = K_{deg}^{DELTA} = K_{deg}^{ND} = K_{deg}^{delta} = 0.01; s = m = 2.0; \kappa_{ND} = \kappa_{HES5} = 0.1; NOTCH_0 = 5.0; K_{deg}^{notch} = 0.0016649; K_{bind}^{ND} = 0.25; DELTA_0 = 3.0; r_{notch} = 0.620926, the steady-state concentrations of the three genes in our model acquire the characteristic chess-like pattern represented in Figure 2, in which different cell types are clearly defined. In addition, the figure shows the correlation matrices for the diverse systems. It appears that the boundary effect vanishes with bigger tissue sizes and that the 5×5 cells model can be considered converged for the purposes of this paper, as seen from the invariant correlation matrix when comparing the 5×5 and 7×7 systems. Thus, in the following paragraphs we will present our protocol for experimental design based on the 5×5 tissue model. Next, we consider a typical experimental setting in which RT-PCR experiments are carried out and provide time-dependent data for each of the genes involved in our model. We will generate hypothetical data from real experiments of inner ear early development in chick [8]. In a typical scenario of the model, 4 tissue samples may be extracted at different stages of development. For each of them RT-PCR experiments may be performed, using three replicas for security, showing a behavior that in the best case will be just close to the simulated concentration profiles from Eq. (16). The parameter set \theta may then be globally optimized with several methods. We use here a simple approach consisting of local optimizations from 10 or 100 starting random values of \theta, with varying values of \sigma^2 for the generated data points. Once the fitting parameters are obtained to some approximation, by using the above detailed simple approach or by more sophisticated methods [9], we are interested in improving their confidence intervals. This can be achieved, of course,
Fig. 2. (a.) Delta steady state concentration distribution in model tissues of different dimensions. (b.) Correlation matrices of the adimensional parameters in each case in (a). The steady state is achieved at a biologically plausible time scale.
by choosing a better optimization algorithm or, complementarily, by using information theory in order to estimate which data will provide more information to improve the practical identifiability of the parameters. This is extremely relevant, as new experiments can consume an important amount of resources, and one may even decide that they are not worth trying because of intrinsic identifiability problems of the model. In order to learn about the information content of new data, we generate in silico data at 200 time points through the total simulation time t_{total} using the parameters optimized in the previous step; we call this set θ. In a leave-one-out fashion, each value is deleted at a time and the modified E-criteria is evaluated for the remaining data, in order to discover the computer-generated point that, according to the current model (topology plus parameters), contains the most information. Figure 3 shows the result of this approach for the two genes of the system. In the first iteration of the protocol, the modified E-criteria suggests that new values for the concentration of hes5 at time t = 1260 would be the most informative. At this stage we measure new data for that gene at such a time step and we proceed to the next iteration of the approach. Such a measurement in a real experimental setup is simulated here by a new in silico value obtained with or without noise with respect to the known model. Finally, Figure 4 shows the evolution of the modified E-criteria over a number of iterations of the protocol. It can be seen that the higher information content of the new experimental data set (increased after each OED iteration) does not necessarily involve a better (lower) value of the modified E-criteria. This problem has multiple origins, such as the noise of the newly measured data or the fact that the optimization method does not find the same minimum in each parameter estimation step.
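A schematic of the leave-one-out scoring step described above might look as follows; the candidate list, the criterion callback and the convention that a larger criterion value means a worse design are assumptions of this sketch rather than the ByoDyn implementation.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

struct Candidate { double time; int geneIndex; double value; };  // one in-silico point

// Returns the index of the candidate whose removal degrades the design most,
// i.e. the point carrying the most information under the supplied criterion
// (e.g. the modified E-criteria computed from the covariance of the design).
std::size_t mostInformativePoint(
        const std::vector<Candidate>& candidates,
        const std::function<double(const std::vector<Candidate>&)>& criterion) {
    std::size_t best = 0;
    double worstScore = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        std::vector<Candidate> reduced = candidates;                 // leave-one-out copy
        reduced.erase(reduced.begin() + static_cast<std::ptrdiff_t>(i));
        double score = criterion(reduced);                           // larger = worse design
        if (score > worstScore) { worstScore = score; best = i; }    // most missed point
    }
    return best;  // measure this point experimentally in the next OED iteration
}
```

In each OED iteration the selected point would be measured (or, as here, generated in silico with optional noise), added to the data set, and the parameters re-estimated before the next scoring pass.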
Fig. 3. Modified E-criteria after adding one time point per gene at each time, from a set of previously targeted behavior (curves: delta, hes5)
Fig. 4. Evolution of the modified E-criteria for 5 iterations of the OED procedure (curves: 100 gradient searches, error = 0.0; 10 gradient searches, error = 0.1; 100 gradient searches, error = 0.1; 100 gradient searches, error = 0.25)
4 Conclusions
Optimal experimental design has been demonstrated in a realistic example of an experiment/theory iterative protocol. In this paper, the experimental data is
indeed estimated from new calculations in order to show the general applicability of the protocol, although its migration to real experimental setups is straightforward. The benefits of using the proposed approach are clear, as the new experiments to be carried out are decided from the predicted behavior of the modified E-criteria for a set of in silico data generated from the model with the optimal parameters from the previous step in the iteration. The proposed protocol provides an easy and neat method to incorporate experimental data, which may be difficult or expensive to obtain, in an informed way. At the same time it provides clues about the identifiability of the parameters of the proposed model, according to the evolution of the modified E-criteria with the iterations of the OED approach. Thus, one expects the modified E-criteria to approach the limit of 1 for a perfectly identifiable model if a large number of experiments is performed, while reaching a different limiting value is indicative of the unidentifiability of the model. The protocol has been exemplified on a hypothetical situation in which a simple gene regulatory network includes three genes interacting in a multicellular system. However, the data proposed, its distribution and the error one makes in the experimental evaluations are realistic and match a typical experimental setting. The next step is to apply this protocol to real data on a more complex model, like the regionalization of cellular systems during vertebrate development [10]. Finally, the practical implementation of the protocol makes it suitable for parallelization at several points, like the multiple optimizations in each step or the evaluation of the modified E-criteria itself for several time/species trial values.

Acknowledgments. ALGL thanks the Generalitat de Catalunya for a PhD fellowship. Partially funded by grant BQU2003-04448 (MCYT: Spanish Ministry of Science and Technology), and EC-STREP projects QosCosGrid (FP6-IST-2005-033883) and BioBridge (FP6-LIFESCIHEALTH-2005-037909). The authors thankfully acknowledge the computer resources and assistance provided by the Barcelona Supercomputing Center.
References
1. Tomlin, C.J., Axelrod, J.D.: Biology by numbers: mathematical modelling in developmental biology. Nature Reviews 8, 331–340 (2007)
2. Jaeger, J., Surkova, S., Blagov, M., Janssens, H., Kosman, D., Kozlov, K.N., Manu, M.E., Vanario-Alonso, C., Samsonova, M., Sharp, D.H., Reinitz, J.: Dynamic control of positional information in the early Drosophila embryo. Nature 430, 368–371 (2004)
3. de Jong, H.: Modeling and simulation of genetic regulatory systems: a literature review. J. Comput. Biol. 9(1), 67–103 (2002)
4. Meinhardt, H.: Computational modelling of epithelial patterning. Curr. Opin. Genet. Dev. 17(4), 272–280 (2007)
5. von Dassow, G., Meir, E., Munro, E.M., Odell, G.M.: The segment polarity network is a robust developmental module. Nature 406, 188–192 (2000)
6. Faller, D., Klingmüller, U., Timmer, J.: Simulation methods for optimal experimental design in systems biology. Simulation 79, 717–725 (2003)
7. Coti, C., Herault, T., Peyronnet, S., Rezmerita, A., Cappello, F.: Grid services for MPI. In: ACM/IEEE (ed.) Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), Lyon, France (May 2008)
8. Alsina, B., Abello, G., Ulloa, E., Henrique, D., Pujades, C., Giraldez, F.: FGF signaling is required for determination of otic neuroblasts in the chick embryo. Dev. Biol. 267(1), 119–134 (2004)
9. Rodriguez-Fernandez, M., Egea, J.A., Banga, J.R.: Novel metaheuristic for parameter estimation in nonlinear dynamic biological systems. BMC Bioinformatics 7, 483 (2006)
10. Alsina, B., García de Lomana, A., Villà-Freixa, J., Giraldez, F.: (submitted, 2008)
Self-Organised Criticality as a Function of Connections' Number in the Model of the Rat Somatosensory Cortex

Grzegorz M. Wojcik and Wieslaw A. Kaminski

Institute of Computer Science, Maria Curie-Sklodowska University, pl. Marii Curie-Sklodowskiej 5, 20-031 Lublin, Poland
[email protected]
Abstract. A model of a part of the rat somatosensory cortex was examined. A large network of Hodgkin-Huxley neurons was simulated, and the modular architecture of this structure, divided into layers and subregions, was implemented. The high degree of complexity required effective parallelisation of the simulation. In this article the results of the parallel neural computations are presented. The occurrence of self-organised criticality was observed and its characteristics as a function of the number of connections were investigated. It was shown that the frequency of the so-called spike-potential avalanches depends on the density of inter-neuron connections. In addition, some benchmarking runs were conducted and the parallelisation effectiveness is presented to some extent.
1 Introduction
The critical point is a point at which a system radically changes its behaviour or structure. Self-organised critical phenomena are defined by a complex system which reaches a critical state by its intrinsic dynamics, independently of the value of any control parameter. A typical example of a system exhibiting self-organised criticality (SOC) is the sand pile model. Sand is slowly dropped onto a surface, forming a pile. As the pile grows, avalanches occur which carry the sand from the top to the bottom of the pile. At least in the model, the slope of the pile becomes independent of the rate at which the system is driven by dropping sand. This exemplifies the so-called (self-organised) critical slope [1]. The oldest numerical models describing the sand-pile problem are presented, e.g., in [1], [3], [6]. In this model, a one-dimensional pile of sand is considered. Grains of sand are stored in columns. The dynamics of the system is defined by a set of equations describing the effect of adding one grain. After a certain number of grains have been added to the appropriate columns, a critical inclination of the sand pile occurs and this causes a disorder leading to the relaxation of the whole system. This disorder is referred to as an avalanche. Critical states of a system are signalled by a power-law distribution in some observable. In the case of sand piles, the size and the distribution of the avalanches
can be measured. The frequency of avalanche occurrence in the system is a function of its size and can be expressed by the power law [1]:

D(S) \sim S^{-k} ,   (1)
where k is a characteristic number for a given system. Complex systems exhibiting the behaviour predicted by SOC have been widely investigated [4], [5], [8], [10], [11], [14]. Earthquakes, forest fires and biological evolution are just three examples of the wide range of phenomena that have been successfully modelled this way [1]. There are experiments that confirm the existence of frequency tuning and adaptation to stimulus statistics in neurons of the rat somatosensory cortex [7]. SOC was found in a model of large biological neural networks [13]; however, the aim of the research discussed in this contribution was to investigate whether and how the occurrence of SOC depends on the number of connections in the simulated brain tissue. A good understanding of the SOC mechanism in the model will allow us to design new series of experiments with a large number of interacting neurons, leading to the discovery of a new class of neurodynamical phenomena taking place in real brains. Simulations of microcircuits consisting of numerically complicated Hodgkin-Huxley (HH) neurons [9] are computationally demanding. The simulation time can be shortened by using cluster-based parallel computing. All the simulations discussed in this paper were conducted with the parallel version of GENESIS compiled for the MPI environment [15]. The choice of the GENESIS simulator allowed us to use many processors and to design an effective way of parallelisation. Remarkably, in this article we demonstrate that the critical relaxation phenomena depend on the density of inter-neuron connections existing in the network. Consequently, the effectiveness of the model's parallelisation, the simulation time and its speedup as functions of the number of connections will be presented in the last section.
2 Model and Method of Parallelisation
The somatosensory pathways bring sensory information from the periphery into the brain, e.g., from a rat's whisker to the somatosensory cortex. Information from the snout passes along the trigeminal nerve, projecting to the trigeminal complex in the brainstem, which sends projections to the medial ventral posterior nucleus of the thalamus (VPm). Each whisker has a representative physical structure in the brain, forming 2-D maps of the whisker pad throughout the pathway. In the cortex, these structures are known as barrels. They are formed from clusters of stellate cortical neurons, with the cell bodies arranged in a ring and the dendrites filling the middle "hole". The dendrites form synapses with multiple axons rising from the VPm [16]. The neurons chosen for the simulations were implemented according to the HH model [9]. The cells are relatively simple (for details, see Appendix A). The only modification of the neuron model was made in order to avoid rapid synchronisation of the whole network: an additional parameter responsible
for the probability of exocytosis was added for each synaptic connection in the post-synaptic neuron. Such a change required a simple modification of the original GENESIS code. The changed version of GENESIS, compiled for Linux and MPI, can be downloaded from [15]. The simulated net consisted of 2025 of the above-mentioned neurons. All the cells were placed on a square-shaped grid with 45 rows and 45 columns. Each neuron was identified by a pair of numbers ranging from 0 to 44. The network cells were divided into 22 groups, called layers, numbered from 1 to 22. Communication between the neurons was based on the following principle: the input signal from each cell of the m-th layer was transported to all the neurons of layers m + 1, m + 2, m + 3, ..., m + Ns, where Ns was an integer number not greater than the number of layers (see Fig. 1). Note that such a structure (2-D with dense "neural rings") imitates the structure of the rat's cortical barrels. The system can be easily parallelised, so we decided to simulate the problem on 15 processors. The network was divided into 15 zones. In each zone the same number of neurons was simulated. The zones were numbered from 1 to 15 and the way in which they were arranged is presented in Fig. 1. Such a choice allowed us to run the simulations in an optimal way, without the barriers timing out.
Fig. 1. Scheme of the simulated network. Layers are highlighted by thick lines. The stimulating neuron is marked with the black square and all other neurons are placed at the intersections of grid lines. Neuron coordinates are marked on the top and the left side of the scheme. In each zone there are 3 columns of neurons, as marked at the bottom. The choice of columns belonging to particular zones is arbitrary.
The complexity of the system increases rapidly with Ns, and so does the simulation time. A good parallelisation of the model not only shortens its simulation time, but often makes it executable at all. That is why parallelisation techniques are so important for HH systems with a large number of synapses. Synaptic connections were characterised by three parameters: the weight w, the time delay τ, and the probability p of transporting the signal, which corresponded to the mentioned probability of the occurrence of exocytosis. The probability p was set to a constant, the same for all of the synapses (p = 0.5). The values of the two other parameters depend on the positions of both the pre-synaptic and the post-synaptic neuron. For each pair of neurons from the m-th layer and the n-th layer, respectively, the parameters w and τ were chosen according to the following rules:

w = \frac{w_0}{|m - n|} ,   (2)

\tau = 10^{-4} |m - n| \; [\mathrm{s}] ,   (3)
where w_0 was a positive constant (in our simulations w_0 = 2). The system was stimulated by the neuron N[23, 23], which was the main receptor of activity from outside the net (i.e., a glass capillary stimulating the whisker [7] or an electrode transmitting some random stimulus directly into the cortex). As a result, the receptor produced a periodic spike potential with a frequency of about 80 Hz. The net was characterised by the parameter T that corresponded to the biological system's real working time (in these simulations T = 15 s).
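The wiring rule and the connection parameters (2)-(3) can be expressed compactly; the sketch below enumerates the layer pairs connected for a given Ns and assigns w and τ, with the container and field names being assumptions of this illustration.

```cpp
#include <vector>

struct Connection { int fromLayer, toLayer; double weight, delay; };

// Builds layer-to-layer connections: layer m projects to layers m+1 ... m+Ns,
// with weight w = w0 / |m - n| (Eq. 2) and delay tau = 1e-4 * |m - n| s (Eq. 3).
std::vector<Connection> buildLayerConnections(int numLayers, int Ns, double w0) {
    std::vector<Connection> conns;
    for (int m = 1; m <= numLayers; ++m) {
        for (int n = m + 1; n <= m + Ns && n <= numLayers; ++n) {
            double dist = static_cast<double>(n - m);   // |m - n|, since n > m here
            conns.push_back({m, n, w0 / dist, 1e-4 * dist});
        }
    }
    return conns;
}
// Example: buildLayerConnections(22, 6, 2.0) wires the 22-layer, Ns = 6 setup.
```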
3 Simulations and Results
The stimulus was transported from the central unit to all other cells through the arranged connections. During the simulation, the time of each spike-potential occurrence was collected for each neuron. An avalanche occurs when a group of neurons spikes in the same, small interval of time (here t_i = 1 ms). The algorithm used to compute the number of avalanches was implemented in C++ (a simple analysis of text files containing the times and values of the membrane potential, searching for neurons with high spiking activity in the same time interval). It was shown that for a system with a small number of neighbourhoods (Ns < 6), and thus a small number of connections, the power law cannot precisely describe the number of spike-potential avalanche occurrences as a function of their size (Fig. 2). When Ns = 6, a kind of phase transition leading the system to SOC behaviour can be observed (Fig. 3) [13]. A systematic analysis of SOC was performed, e.g., by Peter Sloot [12], and the aim of the research described in this contribution was to investigate whether the occurrence of SOC depends only on the number of neighbourhoods, or whether it can appear or disappear for a given Ns as a function of the number of intra-network connections. As the most sensitive value, Ns = 6 was chosen in all series of the aforementioned experiments [13].
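A minimal version of such an avalanche count, binning spike times into 1 ms windows and recording how many spikes fall in each window, could look as follows; the input format and the definition of the avalanche size as the per-window spike count are simplifying assumptions of this sketch.

```cpp
#include <cmath>
#include <map>
#include <vector>

// spikeTimes[n] holds the spike times (in seconds) of neuron n.
// Returns D(s): how many time windows of width binWidth contained exactly
// s spikes, i.e. the frequency of avalanches of size s.
std::map<int, int> avalancheSizeDistribution(
        const std::vector<std::vector<double>>& spikeTimes,
        double binWidth = 1e-3) {
    std::map<long, int> spikesPerBin;           // bin index -> spikes in that window
    for (const auto& neuron : spikeTimes)
        for (double t : neuron)
            ++spikesPerBin[static_cast<long>(std::floor(t / binWidth))];

    std::map<int, int> distribution;            // avalanche size s -> count D(s)
    for (const auto& bin : spikesPerBin)
        ++distribution[bin.second];
    return distribution;
}
```

Plotting the resulting D(s) against s on a log-log scale is then enough to check for the power-law behaviour of Eq. (1).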
Fig. 2. Frequency D(s) as a function of avalanche size for Ns = 1, px = 1, T = 15 s
Fig. 3. Frequency D(s) as a function of avalanche size for Ns = 6, px = 1, T = 15 s
Then another parameter, px, defining the probability of synapse creation between two neurons of the simulated network, was introduced into the model. Surprisingly, it was noted that the self-organisation depends not only on the number of connections (as could be concluded from previous research) but also on the network architecture. On the basis of Fig. 2 and Fig. 3 one could hypothesise that when the number of connections falls, the self-organisation disappears. However, for Ns = 6 the SOC manifests itself even better when we decrease the strength of the inter-neuron communication by setting px below 0.07 (Fig. 4). What is more, for 0.07 < px < 0.4 the SOC behaviour tends to disappear (see two of the curves in Fig. 5), only to come back for px > 0.4 (Fig. 6). Because of the relatively high system complexity and the tendency to rapid synchronisation, the number of spikes decreases with the number of connections in the network (Fig. 7). That is why the number of avalanches and the inclination of the SOC
Fig. 4. Frequency D(s) as a function of avalanche size for Ns = 6, px < 0.07 (px = 0.01, 0.04, 0.06), T = 15 s
Fig. 4. Frequency D(s) as a function of avalanche size for Ns = 6, px < 0.07, T = 15 s 10000
px = 0.08 px = 0.20 px = 0.40
D(s)
1000
100
10
1 10
100 s - Size of Avalanche
1000
Fig. 5. Frequency D(s) as a function avalanche size for Ns = 6, 0.07 < px < 0.4 10000
Fig. 6. Frequency D(s) as a function of avalanche size for Ns = 6, px > 0.4 (px = 0.10, 0.40, 0.80), T = 15 s
Fig. 7. Scale of the SOC for Ns = 6, T = 15 s and different px (px = 0.20, 0.70, 0.80)
Fig. 8. SOC inclination for Ns = 6, T = 15 s and different px (px = 0.01, 0.80)
curve are different and depend on px (Fig. 7-8). The number of spikes in the network falls with the growth of connections’ density (Fig. 9).
4 Parallelisation Effectiveness
The local cluster used for all the simulations was built of 13 machines, including one special machine, the so-called "access node". Each SMP machine had two 64-bit 1.4 GHz Itanium2 IA64 processors with 4 GB of RAM. The cluster works under the control of Debian Linux Sarge (v. 3.1) with kernel version 2.6.8-1. The model was simulated in the GEneral NEural SImulation System GENESIS v.2.2.1 with its MPI extension. The gcc compiler was used for the system compilation, and in the case of MPI and the Linux OS the compilation required some tuning of the GENESIS code. The changed version can be found in [15].
Fig. 9. Density of connections and number of spikes as a function of px 1000 speedup 1 node
Simulation Time [h] / Speedup
15 nodes
100
10
1 0
0.1
0.2
0.3
0.4
0.5 px
0.6
0.7
0.8
0.9
1
Fig. 10. Time of simulation and speedup as a function of px
The length of a typical run for Ns = 6 and T = 15 s was about 10 hours when the problem was parallelised over 15 nodes. However, on one node the simulation time ranged from 12 h to 230 h, depending on the value of px. In the best case this gave us a speedup of 23 (Fig. 10). At first sight this is a very optimistic result, especially for structures with a large number of synapses. One should remember that 5 years ago such networks with Ns > 6, modelled on one 400 MHz SPARC node, had a simulation time of about 3 weeks.
5 Conclusions
A systematic analysis of the dynamics of the simulated part of the rat's somatosensory cortex was conducted. Effective parallelisation was applied. SOC manifests itself in large biological neural networks. The "quality" of SOC depends both on the number of connections and on the architecture of the system. The role of SOC
phenomena in mammalian brains is still unrecognised. However, good modelling will make it possible for us to design new series of neuroscientific experiments, leading in the end to a better understanding of brain functionality.
Appendix A: Properties of HH Neurons

Our model consisted of multicompartmental neurons with two dendrite compartments, a soma and an axon. The dendrites contained a synaptically activated channel and the soma had voltage-activated HH sodium and potassium channels. The behaviour of each compartment was equivalent to the behaviour of a certain electrical circuit [2]. Thus, each circuit was characterised by a group of parameters typical for GENESIS, set as follows: resistances Ra = 0.3 Ω and Rm = 0.33 Ω, capacity Cm = 0.01 F, and potential Em = 0.07 V. For the soma compartment Ek = 0.0594 V and for the dendrites Ek = 0.07 V. The conductance for each type of ionic channel was chosen to be GK = 360 Ω^-1 and GNa = 1200 Ω^-1. These parameters originate from neurophysiological experiments [2] and were chosen to make the model biologically more realistic. The soma had a circular shape with a diameter of 30 μm; the dendrites and the axon were cable-like, with a length of 100 μm. All the other parameters were chosen as suggested by the GENESIS authors to simulate the behaviour of biologically plausible neurons [2]. More details concerning the HH model can be found elsewhere [2], [9].

Acknowledgements. This work has been supported by the Maria Curie-Sklodowska University, Lublin, Poland (under the grant of the UMCS Vice President, 2007) and the Polish State Committee for Scientific Research under grant number N519 017 32/2120. Special thanks to Peter Sloot for inspiration during the meeting in Russia.
References
1. Bak, P.: How nature works: The Science of Self-Organised Criticality. Copernicus Press, New York (1996)
2. Bower, J.M., Beeman, D.: The Book of GENESIS – Exploring Realistic Neural Models with the GEneral NEural SImulation System. Telos, New York (1995)
3. Jensen, H.J.: Self Organizing Criticality. Cambridge University Press, Cambridge (1998)
4. Aegerter, C.M., Günther, R., Wijngaarden, R.J.: Avalanche dynamics, surface roughening, and self-organized criticality: Experiments on a three-dimensional pile of rice. Phys. Rev. E 67, 051306 (2003)
5. Bak, P., Christensen, K., Danon, L., Scanlon, T.: Unified Scaling Law for Earthquakes. Phys. Rev. Lett. 88, 178501 (2002)
6. Bak, P., Tang, C., Wiesenfeld, K.: Self-organized criticality: An explanation of the 1/f noise. Phys. Rev. Lett. 59, 381–384 (1987)
7. Garcia-Lazaro, J.A., Ho, S.S.M., Nair, A., Schnupp, J.W.H.: Adaptation to Stimulus in Rat Somatosensory Cortex. FENS Abstr. 3, A109.4 (2006)
8. Lubeck, S.: Crossover phenomenon in self-organized critical sandpile models. Phys. Rev. E 62, 6149–6154 (2000)
9. Hodgkin, A.L., Huxley, A.F.: A Quantitative Description of Membrane Current and its Application to Conduction and Excitation in Nerve. J. Physiol. 117, 500–544 (1952)
10. Paczuski, M., Bassler, K.E.: Theoretical results for sandpile models of self-organized criticality with multiple topplings. Phys. Rev. E 62, 5347–5352 (2000)
11. Pastor-Satorras, R., Vespignani, A.: Corrections to scaling in the forest-fire model. Phys. Rev. E 61, 4854–4859 (2000)
12. Sloot, P.M.A., Overeinder, B.J., Schoneveld, A.: Self-organized criticality in simulated correlated systems. Comp. Phys. Comm. 142, 7–81 (2001)
13. Wojcik, G.M., Kaminski, W.A., Matejanka, P.: Self-organised Criticality in a Model of the Rat Somatosensory Cortex. In: Malyshkin, V.E. (ed.) PaCT 2007. LNCS, vol. 4671, pp. 468–476. Springer, Heidelberg (2007)
14. Yang, X., Du, S., Ma, J.: Do Earthquakes Exhibit Self-Organized Criticality? Phys. Rev. Lett. 92, 228501 (2004)
15. The GENESIS compiled for Linux MPI: http://www.luna.umcs.lublin.pl/download/modgenesis4mpi.tgz
16. The Rat Somatosensory Pathway: http://www.bris.ac.uk/Depts/Synaptic/info/pathway/somatosensory.htm
Approximate Clustering of Noisy Biomedical Data

Krzysztof Boryczko and Marcin Kurdziel

Institute of Computer Science, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków, Poland
{boryczko,kurdziel}@agh.edu.pl
Abstract. Classical clustering algorithms often perform poorly on data harboring background noise, i.e. a large number of observations distributed uniformly in the feature space. Here, we present a new density-based algorithm for approximate clustering of such noisy data. The algorithm employs Shared Nearest Neighbor Graphs for estimating local data density and identifying core points, which are assumed to indicate the locations of clusters. Partitioning of the core points into clusters is performed by means of the Mutual Nearest Neighbor distance measure. This similarity measure is sensitive to changes in local data density and is thus useful for discovering clusters that differ in this respect. The performance of the presented algorithm was demonstrated on three data sets, two synthetic and one real world. In all cases, meaningful clustering structures were discovered.
Keywords: Cluster analysis, Noisy data, Multidimensional data, Shared Nearest Neighbor Graph, Mutual Nearest Neighborhood.
1 Introduction
Formerly, research in cluster analysis focused on data sets where almost all observations are believed to be members of some clusters. Even if outlier observations were accounted for, they were thought of as exceptions rather than a significant fraction of the data set. In recent years, however, efforts have been made to develop clustering techniques suitable for data sets where outlier observations are so frequent that they in fact become a noisy background in which clusters are submerged. A classical example is the DBSCAN algorithm [1], which employs a density-based definition of clusters. A density-based notion of clusters was also adopted in [2]. Unlike DBSCAN, which relies on simple counting of points within spheres of some given radius, this method employs Shared Nearest Neighbor (SNN) graphs for density estimation. Some approaches to noisy data clustering employ data sampling instead of explicit density estimation. This is the case in the CURE [3] algorithm, for example. Yet another approach to this task focuses on graph-based cluster connectivity measures. Typical representatives of this approach are the Chameleon [4] and ROCK [5] algorithms.
We present a new algorithm for clustering of high-dimensional, noisy data. The algorithm, named Clustering With Nearest Neighborhood (CWNN), is inspired by ideas presented in [2], [6] and [7]. CWNN employs the SNN graph to detect the so-called core data points. This allows for explicit handling of background noise as well as automatic assessment of the number and shapes of clusters. The strength of our approach lies in the method for partitioning core points into origins of clusters. We propose to partition the set of core points by employing the Mutual Nearest Neighbor (MNN) distance measure computed over the proximity measure derived from the SNN graph. Experimental results illustrating the performance of this method for data harboring background noise, including multidimensional cases, are presented. It is important to note here that perfect discrimination between background noise and data clusters is often unattainable, especially if they are of comparable densities. Therefore, CWNN should be seen as an approximate clustering method.
1.1 Mutual Nearest Neighbor Distance Measure
Consider a set of points X = {x_1, x_2, ..., x_n} and a distance metric d(·, ·). For example, this can be a finite subset of an m-dimensional cube, X ⊂ [−γ, γ]^m ⊂ R^m, and the Euclidean distance. Let NGH(x_i) be a list of neighbors of the point x_i, sorted in an ascending order according to d(·, ·). Further, let G_k = (X, E) be the k-Nearest Neighbor (k-NN) graph of X, i.e.:

(x_i, x_j) ∈ E ⇔ x_j ∈ {NGH_l(x_i) : l = 1 ... k}     (1)
where NGH_l(x_i) is the l-th element of the list NGH(x_i). The Mutual Nearest Neighbor distance measure, originally proposed in [7], estimates proximity between a pair of points on the basis of their rankings in mutual k-NN lists. In particular, for a pair of points x_i, x_j ∈ X, such that

x_i = NGH_k(x_j) ∧ x_j = NGH_l(x_i),     (2)

the value of the MNN distance measure is equal to

MNN(X, x_i, x_j) = k + l.     (3)
For clustering purposes, the MNN distance measure has a strong advantage over classical distance metrics, e.g. the Euclidean distance, of being more sensitive to changes in local data density [7]. This is illustrated by the example depicted in Fig. 1. The points form two clusters of different densities (marked C1 and C2). Suppose that we would like to identify the clusters by simply comparing the distances between the points. A straightforward approach would be to identify the connected components within the data set, assuming that any two points are connected when the distance between them is smaller than some threshold value ε. Using the Euclidean distance, it is impossible to choose a proper threshold value ε. For ε ≤ 2 each point from the cluster C1 will be assigned to a separate, artificial cluster. On the other hand, for every
ε > 2 the whole data set will be assigned to a single cluster. Now, consider the MNN distance measure. For every two points from the cluster C1 that are adjacent to each other and placed in the same column (e.g., points a and b in Fig. 1) or the same row (e.g., points b and c in Fig. 1), the value of the MNN distance measure is equal to 2. The same situation occurs in the cluster C2. However, the value of the MNN distance measure between the points x and y is equal to MNN(C1 ∪ C2, x, y) = 4. We can clearly see this from the lists of the nearest neighbors of those two points: NGH(x) = [{x_n1, x_n2, y}, y_n2, ...] and NGH(y) = [{y_n1, y_n2}, y_n3, {x, y_n4, y_n5}, ...]. Consequently, this data set can be properly clustered with the threshold value for the MNN distance measure ε = 3.
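The computation behind eq. (3) is small enough to sketch in code. The fragment below is our own illustration (not part of the original work); the function and variable names are hypothetical, and the neighbour lists are assumed to be pre-sorted by distance.

```cpp
#include <vector>

// Illustrative sketch of the MNN distance of eq. (3): ngh[i] is the list of
// neighbours of point i, sorted by increasing distance d(.,.).  The rank of a
// neighbour is its 1-based position in that list.  Returns k + l, or -1 if the
// two points do not appear in each other's lists.
int mnnDistance(const std::vector<std::vector<int>>& ngh, int i, int j) {
    auto rankOf = [&](int of, int target) -> int {
        const std::vector<int>& list = ngh[of];
        for (std::size_t r = 0; r < list.size(); ++r)
            if (list[r] == target) return static_cast<int>(r) + 1;
        return -1;
    };
    const int l = rankOf(i, j);   // x_j is the l-th neighbour of x_i
    const int k = rankOf(j, i);   // x_i is the k-th neighbour of x_j
    return (k < 0 || l < 0) ? -1 : k + l;
}
```

With the configuration of Fig. 1, adjacent points within one cluster yield the value 2, while the pair x, y yields 4, so a threshold of 3 separates the clusters.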
Fig. 1. An example data set, made of two clusters that cannot be discovered using only the Euclidean metric but can be found with the MNN distance measure
1.2 Estimating Proximity in Multidimensional Spaces with Sparse Shared Nearest Neighbor Graphs
The Euclidean metric is not well suited for estimating proximity of points in high-dimensional spaces (see e.g. [8]). A proximity measure that is better suited for multidimensional data was proposed in [6]. In that work, the proximity between a pair of points was defined to be the number of neighbors they share. We employ this idea in a slightly modified manner. Consider a k-NN graph G_k = (X, E) of the input data set X. A graph S_k = (X, E, W) in which the weights given by

w_ij = #{x_s ∈ X \ {x_i, x_j} : (x_i, x_s) ∈ E ∧ (x_j, x_s) ∈ E}     (4)
are assigned to the edges (x_i, x_j) ∈ E, is called the Shared Nearest Neighbor graph of X. Provided that the number of shared neighbors depends on how close the points are (which is a reasonable assumption), their proximity can be defined in the following way:

d_Sk(x_i, x_j) = k − w_ij.     (5)

The measure d_Sk(x_i, x_j) is well defined only if both the edges (x_i, x_j) and (x_j, x_i) belong to S_k. If this is not the case, we consider d_Sk(x_i, x_j) to be infinite.
The SNN graph can be used to establish a strong neighborhood relationship in X [2]. This is done by removing from S_k all edges (x_i, x_j) ∈ E for which w_ij < t, where t is a threshold value. In the resultant graph, denoted by S_k^t, edges connect strong neighbors. The relation defined in this way has the advantage of being relatively immune to background noise. In particular, if a sufficiently high threshold value t is chosen, the noise points that lie outside of high-density regions (i.e. clusters) will not have any strong neighbors.
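As an illustration of eq. (4) and of the sparsification step, the sketch below (our own, with hypothetical names, not the authors' implementation) computes the shared-neighbour weight for every k-NN edge and keeps only the strong edges with w_ij ≥ t.

```cpp
#include <vector>
#include <unordered_set>

struct SnnEdge { int i, j, weight; };

// Build the sparsified SNN graph S_k^t from k-NN lists: for each k-NN edge
// (i, j), the weight is the number of neighbours shared by i and j (eq. (4));
// edges with weight below the threshold t are dropped.
std::vector<SnnEdge> sparseSnnGraph(const std::vector<std::vector<int>>& knn, int t) {
    const std::size_t n = knn.size();
    std::vector<std::unordered_set<int>> nbr(n);
    for (std::size_t i = 0; i < n; ++i)
        nbr[i].insert(knn[i].begin(), knn[i].end());

    std::vector<SnnEdge> edges;
    for (std::size_t i = 0; i < n; ++i) {
        for (int j : knn[i]) {
            int w = 0;                                   // shared neighbours of i and j
            for (int s : knn[i])
                if (s != j && s != static_cast<int>(i) && nbr[j].count(s)) ++w;
            if (w >= t)                                  // keep only strong edges
                edges.push_back({static_cast<int>(i), j, w});
        }
    }
    return edges;
}
```

The SNN proximity of eq. (5) is then simply k minus the stored weight for the surviving edges, and a point with more than t_d strong neighbours inside a sphere of radius ε_n becomes a core point, as described in the next section.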
2 Clustering Method Based on the SNN Graph and the MNN Distance Measure
The pseudo-code of CWNN is presented as Algorithm 1. First, the SNN graph S_k of the input data set is built and sparsified using a threshold value t. The resultant sparse SNN graph S_k^t is used to construct the set of core points X_c, i.e. the set of data points that have more than t_d strong neighbors within a sphere of radius ε_n. It is assumed that the core points span regions of the data space with a relatively dense distribution of data points. In the next step, a graph G_c of core points is created, in which points u, v ∈ X_c are connected if and only if the SNN distance measure d_Sk(u, v) < ε and the MNN distance measure MNN_Sk(X_c, u, v) < t_m. Here, MNN_Sk denotes the MNN distance measure calculated over the proximity measure derived from the SNN graph S_k (eq. 5). To locate the origins of clusters, CWNN identifies connected components in G_c. Finally, CWNN evaluates each of the non-core data points x_i, and if d(x_i, c) < ε_n, where c ∈ X_c is the core point that is closest to x_i, then the point x_i is assigned to the cluster represented by c. Otherwise, x_i is assigned to the noise cluster C_N. Our approach employs the MNN distance measure for constructing connected components in the graph of core points. To strengthen this measure against background noise, it is calculated over the proximity measure derived from the SNN graph. As the MNN distance measure is effective in identifying local changes in data density (see Section 1.1), gradients in the data density should split the graph of core points into a number of connected components, each one with a more uniform density distribution. Consequently, the clustering structure should reveal more information about the analyzed data set. The number of clusters constructed by CWNN is equal to the number of connected components in the graph of core points. This is controlled by two threshold values: t_m for the MNN distance measure and ε for the SNN distance measure. CWNN does not assume any particular geometry of the clusters. In principle, the shapes of clusters depend only on the shapes of the connected components. The number of points assigned to the noise cluster depends on two factors: the number of core points identified by the algorithm and the threshold value ε_n. The threshold ε_n specifies the maximum distance d(·, ·) between a given point x_i and its nearest core point c which still allows for assigning x_i to the cluster represented by c. The number of core points depends mainly on the distribution of density within the data set. However, setting a broader initial neighborhood
Algorithm 1. The CWNN algorithm
INPUT: set of points X = {x_1, x_2, ..., x_n}; distance metric d(·, ·)
PARAMETERS: k, t, t_d, t_m, ε ∈ N+; ε_n ∈ R+
OUTPUT: set of data clusters Ω = {C_1, C_2, ..., C_k}; noise cluster C_N

G_k = build the k-NN graph of X
S_k = build the SNN graph from the graph G_k
S_k^t = sparsify the SNN graph S_k with the threshold t
X_c = ∅, V_c = ∅
for all x_i ∈ X do
    ρ = number of strong neighbors of the point x_i that lie in a sphere of radius ε_n
    if ρ > t_d then
        X_c = X_c ∪ {x_i}
    end if
end for
for all (x_i, x_j) ∈ X_c × X_c, i < j do
    if [d_Sk(x_i, x_j) < ε] and [MNN_Sk(X_c, x_i, x_j) < t_m] then
        V_c = V_c ∪ {(x_i, x_j), (x_j, x_i)}
    end if
end for
Ω = find the connected components of the graph G_c = (X_c, V_c)
for all x_i ∈ X \ X_c do
    c = find the point x ∈ X_c such that ∀x′ ∈ X_c \ {x} : d(x, x_i) < d(x′, x_i)
    if d(x_i, c) < ε_n then
        C_j = find the cluster C ∈ Ω such that c ∈ C
        C_j = C_j ∪ {x_i}
    else
        C_N = C_N ∪ {x_i}
    end if
end for
(i.e., a higher number of neighbors in the k-NN graph) or decreasing the number of required strong neighbors, t_d, will increase the number of core points. The computational complexity of the first part of CWNN, i.e. the identification of core points, is O(n² log n + nk log k), due to the construction of the k-NN graph and the counting of shared neighbors. The computational complexity of the remaining part of the algorithm is O(m² log m + m·n), where m is the number of core points. However, the number of core points is smaller than the total number of points: m ≤ n. Consequently, the computational complexity of the whole CWNN algorithm is approximately O(n² log n). The memory complexity is O(n²).
3 Experimental Results
Three data sets were used to demonstrate the effectiveness of CWNN. The first one, further called the Chameleon data set, was taken from [4]. The second one is a synthetic three-dimensional test set, further called the tube data set. The third test set, i.e. the microcalcification data set, consists of feature vectors constructed by the
Table 1. Parameters of CWNN used for clustering of the test data sets

Data set                       k     t     td   tm    ε     εn
Chameleon data set             100   75    4    20    25    10.0
Tube data set                  250   180   13   15    115   59.0
Microcalcification data set    1600  1100  275  1900  1050  2.19
authors during work on an algorithm for detecting suspicious lesions in digital mammograms [9]. The parameters used for clustering the test sets are given in Table 1. To set the values of these parameters we used the following heuristic. First, we set the value for the number of neighbors in the k-NN graph. This graph should reveal local properties of the data set. Therefore, we use a small fraction, i.e. between 1% and 2%, of the total number of points for this parameter. The upper value is used for multidimensional data. Next, we construct the histogram of the number of shared neighbors in the k-NN graph and locate its maximum. The first minimum following the maximum corresponds to the number of shared nearest neighbors above the most frequent one. We set the parameter t to a value near this minimum, ensuring that the strong neighborhood relationship connects only truly close points. In the next step, we construct the histogram of the number of strong neighbors. This histogram will usually have a peak corresponding to the background noise, followed by peaks corresponding to clusters. We set the parameter t_d to a value after the peak from the noisy background, ensuring correct noise identification. The value for ε_n is set by evaluating all non-noise points. For each such point we calculate the distance to its furthest strong neighbor. The histogram of these distances is used to identify the most frequent values, and ε_n is set close to them. The parameters ε and t_m are set by inspecting minimal spanning trees of the core points, constructed using d_Sk(·,·) and MNN_Sk(X_c, ·, ·), respectively. Again, we construct histograms of edge lengths in these trees. The minima in these histograms correspond to edges connecting clusters. The first such minimum is usually a good choice for the value of the underlying parameter. Subsequent minima can be used if a more coarse-grained clustering is desirable.
Comparative tests. In [4], four two-dimensional data sets were used for the evaluation of the Chameleon algorithm, three of which contain background noise. We applied CWNN to these three noisy data sets and in each case were able to obtain the correct clustering. For lack of space we report only the result for the hardest case (called DS4 in [4]). The Chameleon data set is pictured in Fig. 2a. The clustering given by CWNN is presented in Fig. 2b. As we can see, CWNN was able to remove the background noise while preserving the bona fide clusters. We should note here that these clusters differ in densities. Furthermore, the density of the rectangular cluster is near the density of the noise. In addition, the clusters lie close to each other. In particular, the triangle-like clusters are nearly adjacent. Nevertheless, CWNN did succeed in clustering this data set. In comparison, according to [4], DBSCAN, a well-known density clustering method, is unable to identify the correct clustering
Fig. 2. (a) Two-dimensional Chameleon data set. (b) Clustering of the Chameleon data set obtained with CWNN algorithm.
in this data set. Another density clustering method, CURE, is also reported to fail on this data. Additional results provided on a web page referenced in [4] show that CLARANS [10], ROCK, and group average hierarchical clustering are all unable to correctly cluster the Chameleon data set. The Chameleon algorithm itself managed to identify the genuine clusters in this data. However, this algorithm has no explicit noise removal technique. Therefore, in addition to the bona fide clusters, Chameleon constructed additional, spurious clusters out of the background noise, which is evident in Fig. 6 in [4].
Tube data set. The tube data set is pictured in Fig. 3a. It consists of two cubical clusters, namely cluster A and cluster B, surrounded by variable-density background noise. The density of data points in the clusters is five times greater than the density of the surrounding noise. The density of the noise itself increases linearly from the right to the left end of the tube, resulting in a 4-fold difference between the ends. The result of clustering the tube data set is pictured in Fig. 3b. The two biggest clusters found by CWNN are depicted in Fig. 3c. CWNN managed to identify 78.8% of the noise points (i.e. 16,207 out of 20,560). From the 577 data points in cluster A, 516 were found (89.4%). In cluster B, out of the 1078 data points, 1075 points were found (99.7%). CWNN was therefore successful in discovering both clusters. Some artificial clusters were created from the background noise near cluster B. This is a consequence of the noise density near the left end of the tube being comparable with the density of cluster A. Therefore, assignment of the whole noisy background to the noise cluster would result in the loss of cluster A. Yet, cluster B was not merged with any of the artificial clusters (see Fig. 3c). The gradient of the data density at the border of cluster B, which is well preserved in the set of core points, increases the value of the MNN distance measure between core points in the cluster and core points in the background. This prevents the merging of artificial clusters with cluster B.
Microcalcification feature vectors data set. The microcalcification data set contains feature vectors describing suspicious regions of interest (ROIs) found in 200 high-resolution digital mammograms from the DDSM database [11]. Analysis of such data sets is an important step in the design and implementation of
Fig. 3. (a) Three-dimensional tube data set containing two clusters enclosed by noisy background. (b) Clustering of the tube data set obtained with CWNN algorithm. (c) Two biggest clusters found by CWNN.
Fig. 4. Example mammogram regions of interest (ROIs), corresponding to feature vectors that belong to different clusters discovered by CWNN
computer aided detection (CAD) systems. Moreover, CAD systems for screening mammography are among the most heavily researched computerized detection techniques, owing to the difficulties in recognizing early cancer symptoms in mammogram images. Thus, the microcalcification data set is an example of a rather important class of biomedical data. Each mammogram ROI is described by 27 pixel-intensity features, such as entropy, contrast, moments of the brightness histogram, and others. As the algorithm used for the initial selection of ROIs was tuned for high sensitivity, a large number of false-positive detections was made. Feature vectors of false-positive ROIs constitute the noise in the data set. Additional details are given in [9]. Out of approximately 93,300 data points in the microcalcification data set, approximately 53,000 (i.e. 57%) were assigned to noise by CWNN. Six clusters were constructed from the remaining data points. Example ROIs corresponding to feature vectors selected randomly from the six discovered data clusters are presented in Fig. 4. As can be seen, clusters 4, 5 and 6 contain flat ROIs characterized by low image contrast. These ROIs differ in average brightness. Cluster no. 1 contains ROIs with linear structures, usually on a dark background. The regions of interest from cluster no. 3 contain similar structures but on a brighter background. Finally, cluster no. 2 contains round, punctate occlusions resembling small microcalcifications.
4 Implementation Notes
We have implemented parallel versions of the most costly routines in CWNN, namely the construction of the k-NN and sparse SNN graphs and the calculation of the MNN distance measure. The parallel versions of these routines were designed for shared memory machines and implemented using the OpenMP standard¹. Parallelization of the k-NN graph construction is straightforward. In particular, the first outer loop of this routine runs over all data points, for each one calculating the distances to the points with greater indices. There are no data dependencies between the iterations of this loop and thus they can be directly split between the threads. In the next step, for each point the distances to the remaining points are sorted in ascending order. The loop performing this sorting can also be parallelized by direct splitting between the threads. The routine for constructing the sparse SNN graph contains two nested loops. The outer loop runs over all data points. In the i-th iteration of the outer loop, the inner loop runs over the nearest neighbors y ∈ NGH(x_i) of the point x_i, assessing for each of them whether x_i ∈ NGH(y) and counting shared neighbors. These operations require read-only access to the k-NN graph and therefore do not impose data dependencies. However, in rare cases threads can compete for write access to the weights matrix of the SNN graph. We efficiently solved this issue by employing a small hash table of locks that protects elements of the weights matrix. This enables splitting of the outer loop between the threads.
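A minimal sketch of the two ingredients described above, written with OpenMP, is given below. This is our own illustration under stated assumptions (dense coordinate and distance arrays, a lock table of 256 entries), not the authors' implementation.

```cpp
#include <cmath>
#include <vector>
#include <omp.h>

// Distance computation for the k-NN graph: the outer loop is split between
// threads; iteration i writes only to the upper-triangular entries dist[i][j],
// j > i, so no synchronisation is needed here.
void pairwiseDistances(const std::vector<std::vector<double>>& x,
                       std::vector<std::vector<double>>& dist) {
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {
            double d2 = 0.0;
            for (std::size_t c = 0; c < x[i].size(); ++c) {
                const double diff = x[i][c] - x[j][c];
                d2 += diff * diff;
            }
            dist[i][j] = std::sqrt(d2);   // upper triangle only
        }
}

// A small hash table of locks, as mentioned above, protecting concurrent
// updates of the SNN weight matrix (the table size is an assumption).
constexpr int kNumLocks = 256;
omp_lock_t gLocks[kNumLocks];

void initLocks() { for (auto& l : gLocks) omp_init_lock(&l); }

omp_lock_t* lockFor(int i, int j) {
    return &gLocks[(static_cast<unsigned>(i) * 31u + static_cast<unsigned>(j)) % kNumLocks];
}
// Usage inside the parallel SNN loop:
//   omp_set_lock(lockFor(i, j));  weight[i][j] = w;  omp_unset_lock(lockFor(i, j));
```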
Fig. 5. The execution time (a) and speedup (b) of the parallel implementation of CWNN algorithm. The results were obtained on the microcalcification data set.
Calculation of the MNN distance measure requires read-only access to the weighted graph of core points. The edge weights are the distance measures derived from the SNN graph (see Sections 1.2 and 2). Write access is needed only for the array storing the results. Therefore, we can employ a parallelization strategy similar to the one used in distance calculation during construction of the k-NN graph. To illustrate the efficiency of the parallelization scheme, benchmark runs on the microcalcification data set were made. The tests were carried out on the SGI 1
¹ www.openmp.org
Altix platform, equipped with 1.5 GHz Intel Itanium 2 processors and running the Linux operating system. The results are presented in Fig. 5. As we can see, the algorithm scales almost linearly for numbers of processors between 2 and 32.
5 Future Work
In the current setup, CWNN can be applied to various types of data, provided that a distance metric d(·, ·) is available for them. However, the performance of the algorithm in such cases needs further evaluation, which will be the focus of our future research. Another issue to be studied thoroughly is the choice of methods for estimating the initial values of the CWNN parameters. Although our experience shows that, with the help of a simple heuristic, reasonable values for these parameters can be established in a few trial runs, an automatic method would make the algorithm more user-friendly.
Acknowledgements. The authors are grateful to Professor Witold Dzwinel for his valuable comments. This work was partly funded by the Polish Committee for Scientific Research (KBN) grants no. 3T11F01030 and 3T11F01930.
References 1. Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231. AAAI Press, USA (1996) 2. Ert¨ oz, L., Steinbach, M., Kumar, V.: Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, vol. 47 (2003) 3. Guha, S., Rastogi, R., Shim, K.: Cure: an efficient clustering algorithm for large databases. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp. 73–84. ACM Press, New York (1998) 4. Karypis, G., Han, E., Kumar, V.: Chameleon: hierarchical clustering using dynamic modeling. IEEE Computer 32(8), 68–75 (1999) 5. Guha, S., Rastogi, R., Shim, K.: Rock: A robust clustering algorithm for categorical attributes. Information Systems 25(5), 345–366 (2000) 6. Jarvis, R., Patrick, E.: Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers 22(11), 1025–1034 (1973) 7. Gowda, K., Krishna, G.: Agglomerative clustering using the concept of mutual nearest neighborhood. Pattern Recognition 10, 105–112 (1978) 8. Aggarwal, C., Hinneburg, A., Keim, D.: On the surprising behaviour of distance metrics in high dimensional space. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, vol. 1973, pp. 420–434. Springer, Heidelberg (2000) 9. Boryczko, K., Kurdziel, M.: Recognition of subtle microcalcifications in highresolution mammograms. In: Proceedings of 4th International Conference on Computer Recognition Systems, Advances in Soft Computing, pp. 485–492 (2005)
10. Ng, R., Han, J.: Clarans: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering 14(5), 1003–1016 (2002) 11. Heath, M., Bowyer, K., Kopans, D., Kegelmeyer, W., Moore, R., Chang, K., Munishkumaran, S.: Current status of the digital database for screening mammography. In: Digital Mammography, pp. 457–460. Kluwer Academic Publishers, Dordrecht (1998)
Domain Decomposition Techniques for Parallel Generation of Tetrahedral Meshes
Barbara Glut and Tomasz Jurczyk
AGH University of Science and Technology, Kraków, Poland
{glut,jurczyk}@agh.edu.pl
Abstract. We present solutions for dealing with the problem of parallel generation of unstructured meshes for 3D domains. The selected approach is based on a geometric decomposition of the domain, where the input data is given in the form of a surface mesh. The difference between the two presented decomposition techniques lies in the step where the actual partitioning takes place. In the first method the partitioning is obtained using solely the surface mesh, while in the latter a coarse tetrahedral mesh is used. After the decomposition and the creation of an interface mesh (which is a surface mesh in both approaches), the final volume meshes are generated independently for each subdomain. The interface mesh and the subdomain meshes are constructed using a Riemannian metric combined with a control space structure, which makes it possible to generate meshes with varied density and anisotropy [1].
Keywords: Mesh Generation, Geometric Decomposition, Tetrahedral Mesh, Anisotropic Metric.
1 Introduction
In modern simulations of processes with the finite element method, the increasingly complicated models require a very large number of elements in order to achieve sufficiently precise computations. Consequently, such computational tasks are often solved using a parallel approach. However, the parallelization of the solver does not solve the problem completely. The sequential construction of meshes with a large number of elements poses some problems as well, mainly with respect to the memory requirements. It should be noted that the problem of parallel mesh generation is considered much more difficult than the parallelization of the subsequent computation step [2]. An efficient parallel algorithm requires adequate load balancing for the computational nodes while minimizing the communication overhead between the processors. The task of decomposing the domain for the subsequent mesh generation is complicated due to the limited information available at the beginning of this process (which usually includes only the geometric model description). At this point it is usually difficult to properly assess the time required to discretize the subdomains of the created partitioning, which is necessary to achieve the proper
load balancing. An additional complication is often introduced by an irregular density and anisotropy of the mesh in some areas of the discretized model.¹ In recent years much attention has been devoted to the problem of parallel mesh generation and a number of solutions have been proposed [3]. It seems that the key to the successful parallelization of the mesh generation problem is the proper partitioning of the domain into subdomains and a possibly independent discretization of these subdomains. Depending on the method and the order of the interface mesh generation, three classes of methods can be identified [4]:
1. The a priori class includes methods which first create the meshes of the interfaces and then, in parallel, generate the meshes for each subdomain.
2. The a posteriori methods generate the meshes of the subdomains in parallel first. These meshes are then adjusted in a way which assures the consistency of the mesh within the whole domain.
3. The third class contains methods where the interface meshes and the subdomain meshes are generated concurrently.
So far, no methods have been developed which would solve this problem in a satisfactory way for a wide class of 3D models. This is partly due to the fact that even the sequential problem of volume mesh generation for arbitrary domains is difficult enough. A number of different mesh construction techniques are utilized in various generators, which makes it difficult to choose the most advantageous class of parallelization methods for a given task. Additionally, the geometric complexity of the considered models is constantly increasing, which reduces the chances of finding a definite solution for this problem. However, due to the importance of this problem, heuristic solutions applicable to a possibly wide family of models have to be sought.
2 Main Concept of the Proposed Techniques
Two approaches to decomposing the discretized domain into subdomains are presented in this article. For both methods the input data are the boundary surface meshes. The difference between these methods lies in the selection of the moment when the parallelization procedure is executed. Both methods can be categorized as belonging to the a priori class, where the interface meshes are constructed first and then the mesh of each subdomain can be independently generated in parallel. Such an approach has a number of advantages. During the parallel generation of the mesh the communication between the computational nodes is limited. The volume mesh does not need to be stored in the memory of a single computational node. The only data interchanged during the simulation phase is the information about the interface meshes, which have to be compatible. Moreover, the sequential mesh generator can be utilized without any modifications. This technique also assures keeping the initial surface mesh intact, which can be beneficial for the computational process.
¹ Such requirements concerning the shape and density of elements may result, for example, from the computational aspects or the geometric features of the domain.
The studies presented in this article are founded on the mesh generator developed by the authors [1]. This generator constructs meshes using the information gathered in the control space [5]. The concept of a Riemannian metric stored within the adaptive control space structure has a substantial influence on the proposed techniques of mesh decomposition for parallel generation. As a consequence, the developed methods can be successfully used in problems with a high local variation of the density and anisotropy of the mesh, which are often found in contemporary simulations with adaptive solvers.
3 Method I: Decomposition of Surface Mesh (DSM)
The DSM technique (Fig. 1) [8] is based on the geometric decomposition of the domain using the surface meshes only. The surface mesh is partitioned by cutting it with separators, which at this development stage are implemented as planes. Then, the subdomains are closed by the generation of a proper surface mesh on the separators. Finally, the volume mesh can be constructed independently for each of these closed subdomains. The main steps for each closed subdomain are:
1. Selection of the separator.
2. Localization of the intersection of the separator with the surface mesh and determination of the cutting contours.
3. Generation of a surface mesh on the separator (Fig. 1(b)), which in the case of a planar separator requires:
   – construction of a 2D control space taking into account various metric sources,
   – generation of a 2D mesh on the cutting plane,
   – projection of the planar mesh to the 3D space.
4. Closing of the subdomains (Fig. 1(c)).
5. Generation of volume meshes in the subdomains (Fig. 1(d)).
The selection of a separator (with respect to both its shape and placement) is crucial for the final effect of the domain decomposition as well as for the course of the subsequent phases of the method. The selection of the separator should assure a low cut size, proper load balancing and a minimal number of multiply connected interfaces. In the literature, two main techniques are usually proposed for the selection of the cutting plane: along the inertial axis [6] and perpendicular to the longest edge of the bounding cuboid [7]. However, none of these methods guarantees sufficiently good results in the general case, and this problem is considered a subject of further studies. In the presented examples the authors applied cutting using the information about the bounding box. The construction of an interface mesh requires first the localization of the cutting contours of the surface mesh and the separator. These contours are then projected onto the separator plane. In order to generate the mesh of the cutting plane, a special 2D control space is created. The metric in this case is associated with the lengths of edges in the cutting contour, calculated both in the three- and two-dimensional space. Any other available metric sources are also included.
Fig. 1. Subsequent steps of DSM: (a) surface mesh, (b) split, (c) closing (cross-section), (d) final mesh (cross-section). (The cross-section visualization of a mesh is created by removing a set of elements.)
Using the created metric field, the 2D mesh is generated and projected to the 3D space. This technique was described in more detail in [8,9], where different problems regarding the placement of a separator were also considered.
4 Method II: Decomposition of Coarse Volume Mesh (DCVM)
In the second method (Fig. 2) a coarse volume mesh is utilized to partition the discretization domain. This coarse mesh is created as a result of a discretization based on the boundary nodes only. In this technique the separators are purely virtual and their purpose is to guide the mesh refinement in the selected subdomain. The partitioning of the domain is achieved by separating the mesh along the refined fragments of the mesh, which also defines the boundaries of the closed subdomains. The meshes for the closed subdomains are generated independently, as in the first approach. In this method the cost of the sequential part of the procedure increases, but more detailed information about the discretized domain becomes available, which might help to achieve a better decomposition. The subsequent steps of this method are:
1. Generation of a coarse 3D mesh (Fig. 2(a)).
2. Determination of the separator placement.
3. Refinement of the mesh in the vicinity of the separator (Fig. 2(b)).
4. Separation of the subdomains and recognition of the interface surface (Fig. 2(c)).
5. Refinement of the volume meshes in the subdomains (Fig. 2(d)).
The coarse mesh based on the boundary nodes only is created with the utilization of a three-dimensional control space. The contents of this space are determined
Fig. 2. Subsequent steps of DCVM: (a) coarse mesh (cross-section), (b) refinement (cross-section), (c) split (cross-section), (d) final mesh (cross-section).
using the geometry of the model and any additional criteria which may be introduced by the user. As in the DSM, the first problem which has to be solved is determining the placement of the virtual separator. The solutions proposed in the literature for methods starting the decomposition from a coarse mesh usually apply partitioning libraries (like METIS², CHACO³, etc.) [10], neural networks or genetic algorithms [11]. However, in order to better compare both presented methods, the selection of a separator is determined in this work similarly as in the DSM. In the vicinity of the separator the control space (and the metric stored therein) is modified in order to obtain a selective refinement of the mesh in the subsequent step of the procedure. This special control space CS3d_p, used to prepare the coarse mesh for partitioning, is calculated using the main control space CS3d_m. The metric near the separator is copied directly from CS3d_m, and in the other areas of the discretized domain the maximum metric is applied (with an additional smoothing of the metric field). The mesh created with respect to CS3d_p is partitioned along the virtual separator, which at this point is already properly refined. The procedure of the actual mesh partitioning starts with the identification of all faces incident to two tetrahedral blocks belonging to different partitions. In order to reduce the cut size (i.e. the number of faces between the partitions), an additional operation of moving some mesh elements between the partitions is performed. Finally, the mesh elements from different partitions are divided into separate meshes, which requires the duplication of all mesh entities (vertices, edges and faces) forming the interface mesh and some updating of the mesh interconnections (all these operations are local). The volume meshes in the subdomains can then be further refined independently, and the discretization of each subdomain is guided by the main control space CS3d_m.
² http://glaros.dtc.umn.edu/gkhome/views/metis/
³ http://www.cs.sandia.gov/~bahendr/chaco.html
5 Examples
Both proposed methods were inspected for various geometric models and discretizations of their surfaces. The test meshes are shown in Fig. 3. The results of the mesh generation via domain decomposition (with one separator) using both described methods are shown in Figs. 4, 5, 6 and 7. Since the article concentrates on the partitioning method itself, all tests were computed on a single machine. Table 1 presents the numbers of elements in the volume meshes created using the different approaches. For both presented methods the summary number of tetrahedra is similar to that of the sequential generation. The only significant difference between the methods is visible in the number of faces on the interface between the mesh partitions. Table 2 gathers the running costs (for a single 3.2 GHz Intel P4 computer with 1 GB memory) of the subsequent steps of the mesh generation process. Both tested methods allow the expected parallel meshing time to be decreased.⁴ The running times for the DCVM method are somewhat higher than for the DSM method. In this case the increased time is mostly due to the cost of the initial sequential step. The times of the volume mesh construction in the partitioned subdomains using the second method (DCVM) are lower, since the coarse volume meshes for the subdomains are already available. Also, the boundary recovery cost is absent (since this procedure had already been run in the earlier sequential part), which is most visible in the case of the mesh M3 (a non-convex domain). The quality of the meshes (Table 3) obtained using both the first and the second method is very similar and also close to the quality of the meshes generated sequentially.
Fig. 3. Example meshes: (a) M1, (b) M2, (c) M3, (d) M4.
⁴ The given summary time does not include the cost of transferring the partition data between the computation nodes, which may depend on the specific parallel architecture.
Fig. 4. Decomposition of the mesh M1: (a) DSM, (b) DCVM.
Fig. 5. Decomposition of the mesh M2: (a) DSM, (b) DCVM.
Fig. 6. Decomposition of the mesh M3: (a) DSM, (b) DCVM.

Table 1. Number of elements (NFB [10³] – number of boundary faces, NT [10³] – number of tetrahedra, NTi [10³] – number of tetrahedra in the ith partition, NFI – number of faces on the interface (cut size))

       Sequential       DSM                                DCVM
Mesh   NFB     NT       NT1     NT2     NT(1+2)   NFI      NT1     NT2     NT(1+2)   NFI
M1     4.1     427.2    209.1   206.5   415.6     1566     214.2   227.4   441.6     3657
M2     6.5     794.9    380.9   377.0   757.9     2693     396.5   432.0   828.6     7152
M3     7.9     83.9     41.7    44.1    85.8      391      43.1    43.2    86.3      654
M4     31.1    254.6    127.2   119.5   246.7     432      133.1   127.7   260.8     1075
Fig. 7. Decomposition of the mesh M4 (only one of the created subdomains is shown): (a) DSM, (b) DCVM.

Table 2. Generation time (ts [s] – sequential generation time, ti [s] – mesh generation time for the ith subdomain; the summary parallel generation time tsum is estimated as ts + max(ti))

       Sequential   DSM                          DCVM
Mesh   ts           ts     t1      t2      tsum  ts      t1      t2      tsum
M1     42.1         0.4    17.1    16.7    17.5  14.0    15.4    16.7    30.7
M2     80.0         0.5    32.4    33.0    33.6  27.1    29.1    32.5    59.6
M3     9.6          0.2    4.2     6.0     6.2   6.0     2.6     2.7     8.7
M4     25.1         1.0    12.0    11.3    13.0  10.4    9.1     8.7     19.5
Table 3. Quality of the generated meshes (η̄M – average mean ratio of mesh elements calculated in metric space, ηM^min – minimum mean ratio of mesh elements calculated in metric space, μM – average length of mesh edges calculated in metric space [12])

       Sequential                 DSM                        DCVM
Mesh   η̄M      ηM^min   μM       η̄M      ηM^min   μM       η̄M      ηM^min   μM
M1     0.882    0.167    1.037    0.881    0.034    1.018    0.882    0.069    1.037
M2     0.874    0.050    1.031    0.874    0.009    1.008    0.874    0.025    1.031
M3     0.858    0.035    1.034    0.858    0.092    1.025    0.846    0.001    1.046
M4     0.875    0.109    1.040    0.874    0.153    1.027    0.875    0.065    1.040

6 Conclusions
The DSM method, based on the generation of a mesh on a separator surface, has the benefit of a low cost of the sequential part and a small cut size (for a given selection of a separator). However, this technique is sensitive to the proper placement of the separator [8]. If the angle between the separator and the surface mesh is too small (which can be difficult to avoid for complex models), problems may arise with the correct projection of contour nodes onto the separator surface. Moreover, the quality of the volume elements generated in such areas may be unacceptably low.
The second method (DCVM) overcomes this problem, and the quality of the generated mesh elements is unaffected by the selection of the (virtual) separator placement. The separator type for this method can also be easily extended to a non-planar surface. Moreover, due to the availability of the coarse volume mesh during the partitioning phase, the predicted number of elements in the final mesh for various areas of the discretized domain can be assessed with higher accuracy. As a result, better balancing of the decomposition may be achieved. Unfortunately, these benefits are combined with an increased cost of the sequential part of the algorithm and a higher cut size.
7 Further Research Directions
The computational and communication costs are different for various computer architectures. Because of this, it is difficult to select the proper parallelization strategy applicable to different architectures. From this point of view, the approach where each subdomain becomes an individual object to discretize appears to be advantageous. However, this thesis has to be tested and verified for a number of different architecture configurations and tools. Further studies are also required with respect to the localization of the optimal placement and shape of the separator. This task is correlated with the problem of assessing the predicted number of volume elements in the final mesh based only on the number of boundary elements. The authors have investigated a similar problem for the two-dimensional case [13]. Further studies are, however, necessary for three-dimensional meshes, where the prediction will additionally utilize the information from the control space.
Acknowledgments. The partial support of the AGH Grant No. 11.11.120.777 is gratefully acknowledged.
References 1. Glut, B., Jurczyk, T., Kitowski, J.: Anisotropic Volume Mesh Generation Controlled by Adaptive Metric Space. In: AIP Conf. Proc. NUMIFORM 2007, Materials Processing and Design: Modeling, Simulation and Applications, Porto, Portugal, June 17-21, vol. 908, pp. 233–238 (2007) 2. Tu, T., Yu, H., Ramirez-Guzman, L., Bielak, J., Ghattas, O., Ma, K.-L., O’Hallaron, D.R.: From Mesh Generation to Scientific Visualization: An End-toEnd Approach to Parallel Supercomputing. In: Proc. of SC 2006, Tampa, FL (2006) 3. Chrisochoides, N.: A survey of parallel mesh generation methods, http://www.cs.wm.edu/∼ nikos/pmesh survey.pdf 4. Cougny, H.L., Shepard, M.S.: Parallel volume meshing face removals and hierarchical repartitioning. Comput. Methods Appl. Mech. Engrg. 174, 275–298 (1999) 5. Jurczyk, T., Glut, B.: Adaptive Control Space Structure for Anisotropic Mesh Generation. In: Proc. of ECCOMAS CFD 2006 European Conference on Computational Fluid Dynamics, Egmond aan Zee, The Netherlands (2006)
6. Ivanov, E., Andr¨ a, H., Kudryavtsev, A.N.: Domain decomposition approach for automatic parallel generation of 3D unstructured grids. In: Proc. of ECCOMAS CFD 2006 European Conference on Computational Fluid Dynamics, Egmond aan Zee, The Netherlands (2006) 7. Larwood, B.G., Weatherill, N.P., Hassan, O., Morgan, K.: Domain decomposition approach for parallel unstructured mesh generation. Int. J.Numer. Meth. Engng. 58, 177–188 (2003) 8. Glut, B., Jurczyk, T., Breitkopf, P., Rassineux, A., Villon, P.: Geometry Decomposition Strategies for Parallel 3D Mesh Generation. In: Proc. of Int. Conf. on Computer Methods and Systems CMS 2005, Krak´ ow, Poland, vol. 1, pp. 443–450 (2005) 9. Jurczyk, T., Glut, B., Breitkopf, P.: Parallel 3D Mesh Generation using Geometry Decomposition. In: AIP Conf. Proc. NUMIFORM 2007, Materials Processing and Design: Modeling, Simulation and Applications, Porto, Portugal, June 17-21, vol. 908, pp. 1579–1584 (2007) 10. Ito, Y., Shih, A.M., Erukala, A.K., Soni, B.K., Chernikov, A., Chrisochoides, N.P., Nakahashi, K.: Parallel unstructured mesh generation by an advancing front method. Mathematics and Computers in Simulation 75, 200–209 (2007) 11. Sziveri, J., Seale, C.F., Topping, B.H.V.: An enhanced parallel sub-domain generation method for mesh partitioning in parallel finite element analysis. Int. J. Numer. Meth. Engng. 47, 1773–1800 (2000) 12. Jurczyk, T.: Efficient Algorithms of Automatic Discretization of Non-Trivial Three-Dimensional Geometries and its Object-Oriented Implementation. PhD thesis, AGH University of Science and Technology, Krak´ ow, Poland (2007), http://home.agh.edu.pl/jurczyk/papers/phd-jurczyk.pdf 13. Jurczyk, T., Glut, B.: Organization of the Mesh Structure. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3037, pp. 646–649. Springer, Heidelberg (2004)
The Complete Flux Scheme for Spherically Symmetric Conservation Laws
J.H.M. ten Thije Boonkkamp and M.J.H. Anthonissen
Eindhoven University of Technology, Department of Mathematics and Computer Science, PO Box 513, 5600 MB Eindhoven, The Netherlands
{j.h.m.tenthijeboonkkamp,m.j.h.anthonissen}@tue.nl
Abstract. We apply the finite volume method to a spherically symmetric conservation law of advection-diffusion-reaction type. For the numerical flux we use the so-called complete flux scheme. In this scheme the flux is computed from a local boundary value problem for the complete equation, including the source term. As a result, the numerical flux is the superposition of a homogeneous flux and an inhomogeneous flux. The resulting scheme is second order accurate, uniformly in the Peclet numbers. Keywords: finite volumes, advection diffusion equation, complete flux scheme.
1 Introduction
Many problems in physics and engineering can be modelled using conservation laws. These laws lead in general to a system of partial differential equations that cannot be solved analytically. Finite volume methods are a popular choice to discretise these equations, because they feature a discrete conservation property: the computational domain is divided into control volumes and on each volume a discrete conservation law holds. In this paper we study a finite volume method for three-dimensional spherically symmetric steady conservation laws. This type of equation arises, e.g., in combustion theory, where the study of laminar spherical flames is useful for finding parameters such as burning velocity or flame curvature in premixed combustion [1]. Our model problem includes advection, diffusion and reaction terms and we shall develop a numerical scheme that is second order accurate for all flow conditions. This means that the scheme should always retain its high accuracy, unlike, e.g., standard exponentially fitted schemes, which are second order accurate for diffusion dominated flows but reduce to the first order upwind scheme when the advection term becomes large. Additionally the proposed scheme does not produce spurious oscillations for advection dominated flows, which is a well-known flaw of standard central discretisations. High accuracy and the absence of wiggles are favourable properties that may also be achieved by using high resolution schemes such as flux limiting or
(weighted) essentially nonoscillatory (ENO) methods [4]. These techniques lead to larger discretisation stencils however which is disadvantageous. The method we present uses direct neighbours only. Our algorithm is an extension of the finite volume methods for Cartesian grids introduced in [2,5,6] to spherically symmetric conservation laws. We use an exponential scheme for computing the numerical fluxes. The approximation for the flux is based on the complete differential equation. This implies that we also include the source term in the numerical fluxes. Manzini and Russo [3] also present a finite volume method for advectiondominated problems that is second-order accurate away from boundary and internal layers. They pay special attention to the construction of the numerical advective fluxes in order to prevent numerical oscillations. This goal is achieved by a sophisticated reconstruction algorithm for cell gradients and a velocitybiased mixing of upwind and downwind contributions. Their scheme contains a nonlinear term for shock capturing. This paper is organized as follows. In Section 2, we formulate a stationary advection-diffusion-reaction equation, introduce control volumes and formulate a second order discrete conservation law. In Section 3, we derive an expression for the numerical flux that is second order accurate for all flow conditions. In Section 4, we combine the discrete conservation law with the numerical flux and apply the resulting scheme to a spherically symmetric boundary value problem. We show numerical results using both the homogeneous and the complete flux scheme. By means of Richardson extrapolation we verify the order of accuracy of the finite volume scheme for different flow conditions.
2 Finite Volume Discretization
In this section we outline the finite volume method (FVM) for three-dimensional, spherically symmetric conservation laws. Consider the following steady conservation law of advection-diffusion-reaction type, i.e., ∇·(mϕ − Γ ∇ϕ) = s, (1) where m is the mass flux, Γ ≥ Γmin > 0 a diffusion/conduction coefficient and s a (chemical) source term. The unknown ϕ can be, e.g., the temperature or the concentration of a species in a reacting mixture. The parameters Γ and s are usually (complicated) functions of the unknown ϕ, however, for the sake of discretisation we will consider these as given functions of the spatial coordinate x. Equation (1) has to be coupled with the flow equations, i.e., the continuity equation and the momentum equations. The former reads ∇·m = 0.
(2)
Associated with (1), we introduce the flux vector f , defined by f := mϕ − Γ ∇ϕ.
(3)
Equation (1) then simply reduces to ∇·f = s. In a FVM we cover the domain with a finite number of control volumes Ω_j (j = 1, 2, ..., N) and impose the integral form of the conservation law for each control volume, i.e.,

∫_{∂Ω_j} f·n dS = ∫_{Ω_j} s dV,     (4)
where n is the outward unit normal on the boundary ∂Ω_j. Next, we need numerical approximations for the integrals in (4). In the following, we assume the problem to be spherically symmetric, i.e., ϕ = ϕ(r), and likewise for all other variables, and moreover, f = f(r) e_r with e_r the first basis vector in spherical coordinates. We introduce a spatial grid {r_j} of (uniform) grid size Δr. As control volumes we choose the spherical shells Ω_j := (r_{j−1/2}, r_{j+1/2}) with r_{j+1/2} := ½(r_j + r_{j+1}). Then, the surface integral in (4) reduces to

∫_{∂Ω_j} f·n dS = ∫_{r=r_{j+1/2}} f·e_r dS − ∫_{r=r_{j−1/2}} f·e_r dS = 4π ( r²_{j+1/2} f(r_{j+1/2}) − r²_{j−1/2} f(r_{j−1/2}) ).     (5)

For the approximation of the volume integral in (4) we apply the midpoint rule, to find

∫_{Ω_j} s dV ≐ (4/3) π ( r³_{j+1/2} − r³_{j−1/2} ) s_j,     (6)
with s_j := s(r_j). Combining (4), (5) and (6) and using the relation x³ − y³ = (x − y)(x² + xy + y²), we obtain the second order discrete conservation law

r²_{j+1/2} F_{j+1/2} − r²_{j−1/2} F_{j−1/2} = Δr ( r_j² + (1/12) Δr² ) s_j,     (7)

where F_{j+1/2} is the numerical flux at the cell interface approximating f(r_{j+1/2}). Finally, the FVM has to be completed with the derivation of an expression for the numerical flux.
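To make the role of (7) concrete, the sketch below (our own illustration; the flux routine is a placeholder to be supplied by the scheme derived in the next section, and the boundary fluxes are assumed to be provided by the same routine) evaluates the residual of the discrete conservation law on a uniform grid.

```cpp
#include <functional>
#include <vector>

// Residual of the discrete conservation law (7) for cells j = 0..N-1 on the
// uniform grid r_j = r0 + j*dr.  numFlux(j) must return the numerical flux
// F_{j+1/2} at interface r_{j+1/2}; source(r) returns s(r).
std::vector<double> fvResidual(double r0, double dr, int N,
                               const std::function<double(int)>& numFlux,
                               const std::function<double(double)>& source) {
    std::vector<double> res(N);
    for (int j = 0; j < N; ++j) {
        const double rj = r0 + j * dr;
        const double rw = rj - 0.5 * dr;   // r_{j-1/2}
        const double re = rj + 0.5 * dr;   // r_{j+1/2}
        res[j] = re * re * numFlux(j) - rw * rw * numFlux(j - 1)
               - dr * (rj * rj + dr * dr / 12.0) * source(rj);
    }
    return res;
}
```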
3 Derivation of the Numerical Flux
Our objective in this section is to derive an expression for the numerical flux that is uniformly second order accurate in the grid size, i.e., the discretisation error should always be second order for all flow regimes in combination with a source term of arbitrary strength. We adopt the following notation: variables defined in the grid points r_j and r_{j+1} are indicated with the subscripts C and E, respectively, and variables at the interface r_{j+1/2} by the subscript e. The derivation of the expression for the numerical flux F_e at the eastern cell interface r_e located between the grid points r_C and r_E is based on the following model boundary value problem (BVP) for the unknown ϕ:

(1/r²) ( r² ( m ϕ − Γ ϕ′ ) )′ = s,   r_C < r < r_E,     (8a)
ϕ(r_C) = ϕ_C,   ϕ(r_E) = ϕ_E,     (8b)
where the prime (′) denotes differentiation with respect to r. In the derivation that follows, we assume that

M := r² m = Const > 0   for r ∈ (r_C, r_E).     (9)
Note that the condition M = Const is a direct consequence of the continuity equation (2). The diffusion coefficient Γ and the source term s are arbitrary sufficiently smooth functions of r. The (scalar) flux corresponding to (8) reads

f := m ϕ − Γ ϕ′.     (10)
To derive the expression for F_e we carry out the following procedure:
1. Derive the integral expression for ϕ(r) from the inhomogeneous BVP (8).
2. Derive the integral representation for f(r_e) from (10).
3. Approximate all integrals involved.
In the following, we need the variables D, λ, Λ and S, defined by

D(r) := Γ(r) r²,   λ(r) := M / D(r),   Λ(r) := ∫_{r_e}^{r} λ(η) dη,   S(r) := ∫_{r_e}^{r} η² s(η) dη.     (11)

The variable Λ(r) is called the Peclet integral. Substituting (10) in (8a) and integrating the resulting equation we obtain the following integral balance

r² f(r) − r² f(r_e) = S(r).     (12)

Using the definitions of D and Λ in (11), it is clear that the expression for the flux can be rewritten as

r² f(r) = −D ( ϕ e^{−Λ} )′ e^{Λ}.     (13)

Inserting this expression in (12) and once more integrating we obtain the following expression for the flux f(r_e):

r² f(r_e) = (r² f^{(h)})(r_e) + (r² f^{(i)})(r_e),     (14a)

(r² f^{(h)})(r_e) = − ( e^{−Λ(r_E)} ϕ_E − e^{−Λ(r_C)} ϕ_C ) / ∫_{r_C}^{r_E} D^{−1} e^{−Λ} dr,     (14b)

(r² f^{(i)})(r_e) = − ∫_{r_C}^{r_E} D^{−1} S e^{−Λ} dr / ∫_{r_C}^{r_E} D^{−1} e^{−Λ} dr,     (14c)

where (r² f^{(h)})(r_e) and (r² f^{(i)})(r_e) are the homogeneous and inhomogeneous part, corresponding to the homogeneous and particular solution of (8), respectively. We introduce some notation. ⟨a, b⟩ denotes the usual inner product of two functions a(r) and b(r) defined on (r_C, r_E), i.e.,

⟨a, b⟩ := ∫_{r_C}^{r_E} a(r) b(r) dr.     (15)
Fig. 1. The Bernoulli function B(z)
For a generic variable v(r) defined on (r_C, r_E) we indicate the geometric average (of v_C and v_E) and the harmonic average by ṽ_e and v̂_e, respectively, i.e.,

\tilde{v}_e := \sqrt{v_C v_E}, \qquad \frac{1}{\hat{v}_e} := \frac{1}{\Delta r}\,\langle v^{-1}, 1\rangle.    (16)

Consider the expression for the homogeneous flux. Assume first that Γ(r) = Const on (r_C, r_E). In this case expression (14b) reduces to

r^2 f^{(h)}(r_e) = \frac{\tilde{D}_e}{\Delta r}\bigl(B(-P_e)\varphi_C - B(P_e)\varphi_E\bigr),    (17)

where P_e is the Peclet number defined by

P_e := \frac{M\,\Delta r}{\tilde{D}_e}.    (18)

Furthermore, B(z) is the Bernoulli function, defined by

B(z) := \frac{z}{e^z - 1},    (19)

see Figure 1. For the constant coefficient homogeneous flux, i.e., Γ(r) and M constant on (r_C, r_E), we introduce the notation

\bigl(r^2 f^{(h)}\bigr)(r_e) = F^{h}\bigl(\tilde{D}_e/\Delta r,\, P_e;\, \varphi_C, \varphi_E\bigr),    (20)

to denote the dependence of (r^2 f^{(h)})(r_e) on the parameters D̃_e/Δr and P_e and on the function values ϕ_C and ϕ_E. In the general case, when Γ(r) is an arbitrary function of r, we can rewrite the homogeneous flux in (14b) as

r^2 f^{(h)}(r_e) = F^{h}\bigl(\hat{D}_e/\Delta r,\, \langle\lambda, 1\rangle;\, \varphi_C, \varphi_E\bigr).    (21)

Thus, the flux can be written as the constant coefficient flux with D̃_e and P_e replaced by D̂_e and ⟨λ, 1⟩, respectively.
Next, consider the inhomogeneous flux. Assume first that λ(r) = Const on (r_C, r_E) and define P := λΔr. Substituting the expression for S(r) in (14c) and changing the order of integration, we find the following alternative representation for the inhomogeneous flux

r^2 f^{(i)}(r_e) = \Delta r \int_{r_C}^{r_E} G(\sigma(r); P)\,r^2 s(r)\,\mathrm{d}r, \qquad \sigma(r) := \frac{r - r_C}{\Delta r},    (22)

where σ(r) is the normalized coordinate on (r_C, r_E) and where G(σ; P) is the Green's function for the flux. It is given by

G(\sigma; P) = \begin{cases} \dfrac{1 - e^{-P\sigma}}{1 - e^{-P}} & \text{for } 0 \le \sigma \le \tfrac{1}{2}, \\[1ex] -\dfrac{1 - e^{P(1-\sigma)}}{1 - e^{P}} & \text{for } \tfrac{1}{2} < \sigma \le 1; \end{cases}    (23)

see Figure 2. Note that G(σ; P) relates the flux to the source term and is different from the usual Green's function, which relates the solution to the source term. If we furthermore assume that s(r) = Const on (r_C, r_E), relation (22) reduces to

r^2 f^{(i)}(r_e) = \Delta r\,\bigl(\tfrac{1}{2} - W(P)\bigr)\,r_C^2\,s + O(\Delta r^2),    (24)

where W(z) is a weighting function, defined by

W(z) := \frac{e^z - 1 - z}{z\,(e^z - 1)};    (25)

see Figure 3. From both figures, it is clear that the inhomogeneous flux is only of importance for advection dominated flow, i.e., |P| ≫ 1, in combination with a large source term, and in this case the upwind value s_C for s(r) should be taken.
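The functions B(z), W(z) and G(σ; P) defined in (19), (25) and (23) are straightforward to evaluate, but B and W have removable singularities at z = 0 (B(0) = 1, W(0) = 1/2) and G degenerates for P → 0, so a practical implementation should switch to the limit values for small arguments. The following Python sketch is only an illustration of one numerically safe way to code them; it is not taken from the paper.

import math

def bernoulli_B(z: float) -> float:
    """Bernoulli function B(z) = z / (e^z - 1), Eq. (19); B(0) = 1."""
    if abs(z) < 1.0e-8:
        return 1.0 - 0.5 * z                    # leading terms of the Taylor expansion
    return z / math.expm1(z)                    # expm1 avoids cancellation for small z

def weight_W(z: float) -> float:
    """Weighting function W(z) = (e^z - 1 - z) / (z (e^z - 1)), Eq. (25); W(0) = 1/2."""
    if abs(z) < 1.0e-8:
        return 0.5 - z / 12.0                   # W(z) ~ 1/2 - z/12 near z = 0
    em1 = math.expm1(z)
    return (em1 - z) / (z * em1)

def green_G(sigma: float, P: float) -> float:
    """Green's function for the flux, Eq. (23), for 0 <= sigma <= 1."""
    if abs(P) < 1.0e-12:                        # diffusive limit: G -> sigma or sigma - 1
        return sigma if sigma <= 0.5 else sigma - 1.0
    if sigma <= 0.5:
        return (-math.expm1(-P * sigma)) / (-math.expm1(-P))
    return math.expm1(P * (1.0 - sigma)) / (-math.expm1(P))

Evaluating bernoulli_B and weight_W on a grid of z values reproduces the curves of Figures 1 and 3.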
Fig. 2. Green's function G(σ; P) for several values P > 0 (curves shown for P = 0.01, 1, 5, 10; horizontal axis σ)
Fig. 3. The weighting function W (z)
For arbitrary functions Γ(r) and s(r) we have a similar representation for the inhomogeneous flux, i.e.,

r^2 f^{(i)}(r_e) = \int_{r_C}^{r_E} G\bigl(\sigma(r); \langle\lambda, 1\rangle\bigr)\,r^2 s(r)\,\mathrm{d}r,    (26a)

with G(σ; P) defined in (23) and where the normalized coordinate σ(r) is defined by

\sigma(r) := \int_{r_C}^{r} \lambda(\eta)\,\mathrm{d}\eta \Big/ \langle\lambda, 1\rangle.    (26b)

Note that λ(r) > 0, implying that σ(r) is a monotonically increasing function on (r_C, r_E). Expanding s(r) in a Taylor series, we can also evaluate the integral in (26a), to find

r^2 f^{(i)}(r_e) = \Delta r\,\bigl(\tfrac{1}{2} - W(\langle\lambda, 1\rangle)\bigr)\,r_C^2\,s_C + O(\Delta r^2).    (27)

Summarizing, we have the exact representation (21) for the homogeneous flux and the second order approximation (27) for the inhomogeneous flux. Both expressions hold for arbitrary Γ(r) and s(r). Since D̂_e/Δr = M/⟨λ, 1⟩, the inner product ⟨λ, 1⟩ is the only integral that remains to be approximated. Straightforward integration and applying the mean value theorem of integration gives

\langle\lambda, 1\rangle = \frac{M\,\Delta r}{\Gamma(r^*)\,\tilde{r}_e^2}, \qquad r^* \in (r_C, r_E).    (28)

Using the approximation Γ(r^*) = Γ̃_e + O(Δr) we obtain ⟨λ, 1⟩ = P̃_e + O(Δr^2). Inserting these approximations in the expressions (21) and (27) and omitting O(Δr^2)-terms we obtain the following result for the numerical flux:

(r^2 F)_e = (r^2 F^{(h)})_e + (r^2 F^{(i)})_e,    (29a)
(r^2 F^{(h)})_e = F^{h}\bigl(\tilde{D}_e/\Delta r,\, \tilde{P}_e;\, \varphi_C, \varphi_E\bigr),    (29b)
(r^2 F^{(i)})_e = \Delta r\,\bigl(\tfrac{1}{2} - W(\tilde{P}_e)\bigr)\,r_C^2\,s_C,    (29c)

which is a second order approximation of (14).
4 Numerical Schemes and Example
Combining the expressions (29) for the numerical flux with the discrete conservation law (7) we can derive two numerical schemes, i.e., the complete flux (CF) scheme and the homogeneous flux (HF) scheme, for which we only take into account the homogeneous part of the flux. We apply these schemes to a model BVP to investigate their performance for both diffusion dominated and advection dominated flow. Substituting (29) in (7) we obtain the numerical scheme

-a_{W,j}\varphi_{j-1} + a_{C,j}\varphi_j - a_{E,j}\varphi_{j+1} = b_{W,j}\,s_{j-1} + \bigl(b_{C,j} + \Delta r\,(r_j^2 + \tfrac{1}{12}\Delta r^2)\bigr)\,s_j,    (30a)

with coefficients a_{C,j} etc., given by

a_{W,j} = \frac{\tilde{D}_{j-1/2}}{\Delta r}\,B(-P_{j-1/2}), \qquad a_{E,j} = \frac{\tilde{D}_{j+1/2}}{\Delta r}\,B(P_{j+1/2}), \qquad a_{C,j} = a_{W,j} + a_{E,j},    (30b)
b_{W,j} = \Delta r\,\bigl(\tfrac{1}{2} - W(P_{j-1/2})\bigr)\,r_{j-1}^2, \qquad b_{C,j} = \Delta r\,\bigl(-\tfrac{1}{2} + W(P_{j+1/2})\bigr)\,r_j^2.    (30c)
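To make the structure of (30) concrete, the sketch below assembles the tridiagonal system and solves it on a uniform grid. It is an illustration written for this text and not the authors' code: it assumes Dirichlet values at both ends of the interval (so it does not reproduce the Neumann condition of the example below), and the functions Gamma and s as well as all parameter values in the usage example are placeholders.

import numpy as np
from scipy.linalg import solve_banded

def B(z):                                        # Bernoulli function, Eq. (19)
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < 1.0e-8
    safe = np.where(small, 1.0, z)
    return np.where(small, 1.0 - 0.5 * z, safe / np.expm1(safe))

def W(z):                                        # weighting function, Eq. (25)
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < 1.0e-8
    safe = np.where(small, 1.0, z)
    return np.where(small, 0.5 - z / 12.0, (np.expm1(safe) - safe) / (safe * np.expm1(safe)))

def cf_scheme(r, M, Gamma, s, phi_left, phi_right, homogeneous_only=False):
    """Assemble and solve the CF scheme (30a)-(30c); HF scheme if homogeneous_only=True."""
    dr = r[1] - r[0]
    D = np.sqrt(Gamma(r[:-1]) * Gamma(r[1:])) * r[:-1] * r[1:]   # D~ at the interfaces
    P = M * dr / D                                               # interface Peclet numbers
    sj = s(r)
    aW = D[:-1] / dr * B(-P[:-1])                # Eq. (30b)
    aE = D[1:] / dr * B(P[1:])
    aC = aW + aE
    bW = dr * (0.5 - W(P[:-1])) * r[:-2] ** 2    # Eq. (30c)
    bC = dr * (-0.5 + W(P[1:])) * r[1:-1] ** 2
    if homogeneous_only:                         # HF scheme: b_W = b_C = 0
        bW = np.zeros_like(bW)
        bC = np.zeros_like(bC)
    rhs = bW * sj[:-2] + (bC + dr * (r[1:-1] ** 2 + dr ** 2 / 12.0)) * sj[1:-1]
    rhs[0] += aW[0] * phi_left                   # fold the boundary values into the RHS
    rhs[-1] += aE[-1] * phi_right
    n = len(r) - 2
    ab = np.zeros((3, n))                        # banded storage of the tridiagonal matrix
    ab[0, 1:] = -aE[:-1]
    ab[1, :] = aC
    ab[2, :-1] = -aW[1:]
    phi = solve_banded((1, 1), ab, rhs)
    return np.concatenate(([phi_left], phi, [phi_right]))

r = np.linspace(0.1, 1.0, 41)
phi = cf_scheme(r, M=1.0,
                Gamma=lambda x: 1.0e-3 * (1.0 + np.sqrt(x)),
                s=lambda x: 1.0 / (1.0 + (2.0 * x - 1.0) ** 2),
                phi_left=5.0, phi_right=0.0)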
For the HF scheme we have to take b_{W,j} = b_{C,j} = 0. Note that both schemes give rise to a tridiagonal system Aϕ = Bs, which can be very efficiently solved using LU-decomposition. Consider the following BVP

\frac{1}{r^2}\,\frac{\mathrm{d}}{\mathrm{d}r}\Bigl(M\varphi - \Gamma r^2\,\frac{\mathrm{d}\varphi}{\mathrm{d}r}\Bigr) = s, \quad 0 < r < 1,    (31a)
\varphi(0) = 5, \qquad \frac{\mathrm{d}\varphi}{\mathrm{d}r}(1) = 0,    (31b)

with Γ(r) and s(r) given by

\Gamma(r) = \Gamma_{\min}\bigl(1 + \sqrt{r}\bigr), \qquad s(r) = \frac{s_{\max}}{1 + s_{\max}(2r - 1)^2}.    (31c)
The diffusion coefficient Γ(r) is a smoothly varying function whereas the source term has a sharp peak, introducing a steep interior layer near r = 1/2; see Figure 4. To assess the order of accuracy of both schemes, we compute numerical approximations of ϕ(1/2) with increasingly smaller grid sizes and apply Richardson extrapolation to these results. More precisely, let

\varphi(\tfrac{1}{2}) = \varphi^h + \varepsilon^h = \varphi^{h/2} + \varepsilon^{h/2} = \varphi^{h/4} + \varepsilon^{h/4}, \qquad h = \Delta r,    (32)

where ϕ^h denotes the numerical approximation of ϕ(1/2) computed with grid size h and ε^h the corresponding (global) discretisation error, etc. Assuming the expansion

\varepsilon^h = C h^p + O(h^q), \quad q > p,    (33)

we can derive the following expression for the order of accuracy p:

2^p \doteq \frac{\varphi^{h/2} - \varphi^h}{\varphi^{h/4} - \varphi^{h/2}} =: q^h.    (34)
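Given the three approximations of ϕ(1/2) on grids h, h/2 and h/4, the observed order follows directly from (34); the small Python sketch below (illustrative only, with a manufactured second order error) computes q^h and the corresponding order p = log2 q^h.

import math

def observed_order(phi_h: float, phi_h2: float, phi_h4: float):
    """Richardson estimate (34): returns q^h and the order p = log2(q^h)."""
    q_h = (phi_h2 - phi_h) / (phi_h4 - phi_h2)
    return q_h, math.log2(q_h)

# manufactured example with error ~ C*h^2, so q^h = 4 and p = 2:
exact, C, h = 1.0, 0.3, 0.1
approximations = [exact + C * (h / 2 ** i) ** 2 for i in range(3)]
print(observed_order(*approximations))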
Fig. 4. Solution of the model BVP (31). Parameter values are: M = 1, Γmin = 10^{-7} and smax = 10^3.

Table 1. The q^h-values for the complete flux scheme and the homogeneous flux scheme as a function of N = 1/h. Parameter values are: M = 1 and smax = 10^3.

            Γmin = 10^{-1}        Γmin = 10^{-7}
    N        HF       CF          HF       CF
    10       2.82     2.57        2.37     2.99
    20       5.56     5.65        2.70     6.59
    40       10.03    12.67       2.31     18.08
    80       4.92     5.24        2.03     6.07
    160      4.07     3.97        2.01     4.07
    320      4.02     3.96        2.00     4.02
    640      4.01     3.98        2.00     4.00
    1280     4.02     4.01        2.00     4.00
Table 1 lists the values of q^h for both schemes. Clearly, when diffusion is dominant, i.e., for Γmin = 10^{-1}, both schemes are second order accurate (q^h → 4). Thus both schemes perform equally well, in agreement with the previous observation that the inhomogeneous flux is only of importance for advection dominated flow. On the other hand, for dominant advection, i.e., Γmin = 10^{-7}, the homogeneous flux scheme reduces to first order (q^h → 2), whereas the complete flux scheme is still second order.
5 Conclusions and Future Research
In this paper we have derived the complete flux scheme for spherically symmetric conservation laws of advection-diffusion-reaction type. The numerical flux is computed from a local BVP for the entire equation, including the source term. All parameters are assumed to be arbitrary, sufficiently smooth functions of the radial coordinate r. As a result, the numerical flux is the superposition of a homogeneous flux, corresponding to the homogeneous solution of the BVP, and an inhomogeneous flux, corresponding to the particular solution. The resulting scheme is second order accurate, uniformly in the Peclet number, does not
introduce numerical oscillations near a steep layer and has a simple three-point stencil. Directions for further research are the following. A first obvious extension is to apply the scheme to time dependent conservation laws. An option would be to include the time derivative in the inhomogeneous flux and subsequently apply a suitable time integration method. A second extension is to apply the scheme to the conservation laws of a spherically symmetric flame. Since our model BVP has an interior layer reminiscent of a flame front, it is expected that the CF scheme will give accurate results for laminar flames. The major problem in this case is to construct fast and robust iterative methods to solve the nonlinear, discrete system. A final extension the authors have in mind is to simulate time dependent, i.e., expanding or imploding, spherical flames. All these issues will be the subject of future research.
Computer Simulation of the Anisotropy of Fluorescence in Ring Molecular Systems: Tangential vs. Radial Dipole Arrangement
Pavel Heřman¹, Ivan Barvík², and David Zapletal¹,³
¹ Department of Physics, Faculty of Education, University of Hradec Králové, Rokitanského 62, CZ-500 03 Hradec Králové, Czech Republic, [email protected]
² Institute of Physics of Charles University, Faculty of Mathematics and Physics, CZ-12116 Prague, Czech Republic
³ Department of Mathematics, University of Pardubice, Studentská 95, CZ-53210 Pardubice, Czech Republic
Abstract. Time dependence of the anisotropy of fluorescence in recently discovered cyclic antenna units of the BChl photosystem is modeled. Interaction with a bath and a static disorder, here modeled as uncorrelated Gaussian disorder in the transfer integrals, is taken into account. A parallel computer environment is used because one is forced to recalculate every physical quantity for several thousands of different realizations of disorder. Results for the ring LH4 with radial optical transition dipole arrangement are compared with those for the ring LH2 with the tangential one. The difference between LH2 and LH4 results for the static disorder in transfer integrals has an opposite sign in comparison with that for the static disorder in local energies. Equivalent differences are shifted to smaller times for the stronger interaction with a bath.
1 Introduction
The most common antenna complexes in purple bacteria are the light-harvesting complexes LH1 and LH2. Some bacteria express also other types of LH complexes such as the B800-820 LH3 complex in Rhodopseudomonas acidophila strain 7050 or the B800 LH4 complex in Rhodopseudomonas palustris [1]. The general organization of the LH2 and LH4 complexes is the same: a ring-shaped structure is formed from cyclically repeated identical subunits. However, the symmetries of these rings are different: LH2 and LH3 are usually nonameric but LH4 is octameric. The other difference is the presence of four bacteriochlorophyll (BChl) molecules per repeating unit in LH4 rather than three ones found in LH2 and LH3. The most striking difference is the occurrence of an additional Bchl-a ring in the LH4 complex, the B800-2 ring, at a position approximately halfway between the densely packed B-a/B-b ring and the B800-1 ring that are both also present in LH2 [1]. In LH2, the B850 ring has nearly tangentially oriented Bchl-a pigments, whereas in LH4 the equivalent B-a/B-b pigments are organized in a more radial fashion. M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 661–670, 2008. c Springer-Verlag Berlin Heidelberg 2008
We are therefore dealing with ring-shaped units with nonameric and octameric symmetry resembling those rings from antenna complexes LH2 and LH4 with a strong interaction J (in the range 150 − 450 cm− 1) between BChl molecules. Our theoretical approach therefore considers an extended Frenkel exciton states model. Despite intensive study, the precise role of the protein moiety in governing the dynamics of the excited states is still under debate [2]. At room temperature the solvent and protein environment fluctuate with characteristic time scales ranging from femtoseconds to nanoseconds. The dynamical aspects of the system are reflected in the line shapes of electronic transitions. To fully characterize them and thereby the dynamics of the system, one needs to know not only the fluctuation amplitude (coupling strength) but also the time scale of each process involved. The observed linewidth reflects the combined influence of static disorder and exciton coupling to intermolecular, intramolecular, and solvent nuclear motions. The simplest approach is to decompose the line profile into homogeneous and inhomogeneous contributions of the dynamic and static disorder. Yet, a satisfactory understanding of the nature of static disorder in light-harvesting systems has not been reached. In the site excitation basis, static disorder can be present as in diagonal hamiltonian matrix elements as in off-diagonal ones. Time-dependent experiments [3,4] led for the B850 ring in LH2 complex to conclusion that the elementary dynamics occurs on a time scale of about 100 fs [5,6,7]. For example, depolarization of fluorescence was studied already quite some time ago for a model of electronically coupled molecules [8,9]. Rahman et al. [8] were the first who recognize the importance of the off-diagonal density matrix elements (coherences) [10] which can lead to an initial anisotropy larger than the incoherent theoretical limit of 0.4. Already some time ago substantial relaxation on the time scale of 10-100 fs and an anomalously large initial anisotropy of 0.7 was observed by Nagarjan et al. [5]. The high initial anisotropy was ascribed to a coherent excitation of a degenerate pair of states with allowed optical transitions and then relaxation to states at lower energies which have forbidden transitions. Nagarjan et al. [6] concluded, that the main features of the spectral relaxation and the decay of anisotropy are reproduced well by a model considering decay processes of electronic coherences within the manifold of the excitonic states and thermal equilibration among the excitonic states. In that contribution the exciton dynamics was not calculated explicitly. In several steps [11,12,13,14,15,16,17] we have extended the former investigations by Kumble and Hochstrasser [18] and Nagarjan et al. [6] for LH2 rings. We added the effect of dynamic disorder by using a quantum master equation in the Markovian [11] and non-Markovian limits [13,14]. We also investigated influence of four types of the uncorrelated static disorder (Gaussian disorder in local energies, transfer integrals, radial positions of BChls and angular positions of BChls) [12,15,16,17]. Influence of correlated static disorder, namely an elliptical deformation of the ring, has been also investigated [11]. Recently we have investigated the time dependence of the anisotropy of fluorescence for newly discovered type of the molecular ring, the LH4 ring with
the uncorrelated static disorder in local energies [19]. Main goal of our present investigation is the comparison of the time dependence of the anisotropy of fluorescence after an impulsive excitation for two molecular rings: for molecular ring with tangentially arranged optical transition dipoles rt (t), like in LH2, as well as for the radially arranged one rr (t) like in LH4 [1]. We concentrate on the uncorrelated static disorder - Gaussian disorder in transfer integrals.
2 Model
In the following we assume that only one excitation is present on the ring after an impulsive excitation [18]. The Hamiltonian of an exciton in the ideal ring coupled to a bath of harmonic oscillators reads

H^0 = \sum_{m,n\,(m\neq n)} J_{mn}\,a_m^{\dagger} a_n + \sum_q \hbar\omega_q\,b_q^{\dagger} b_q + \frac{1}{\sqrt{N}}\sum_m \sum_q G_q^m \hbar\omega_q\,a_m^{\dagger} a_m\,(b_q^{\dagger} + b_{-q}) = H_{ex}^0 + H_{ph} + H_{ex-ph}.    (1)

H_{ex}^0 represents the single exciton, i.e. the system. The operator a_m^{\dagger} (a_m) creates (annihilates) an exciton at site m. J_{mn} (for m ≠ n) is the so-called transfer integral between sites m and n. H_{ph} describes the bath of phonons in the harmonic approximation. The phonon creation and annihilation operators are denoted by b_q^{\dagger} and b_q, respectively. The last term in Eq. (1), H_{ex-ph}, represents the exciton–bath interaction which is assumed to be site–diagonal and linear in the bath coordinates. The term G_q^m denotes the exciton–phonon coupling constant. Inside one ring the pure exciton Hamiltonian H_{ex}^0 (Eq. (1)) can be diagonalized using the wave vector representation with corresponding delocalized "Bloch" states and energies. Considering the homogeneous case with only nearest neighbour transfer matrix elements J_{mn} = J_{12}(\delta_{m,n+1} + \delta_{m,n-1}) and using Fourier transformed excitonic operators (Bloch representation)

a_k = \sum_n a_n e^{ikn}, \qquad k = \frac{2\pi}{N}\,l, \quad l = 0, \pm 1, \ldots, \pm N/2,    (2)

the simplest exciton Hamiltonian in k representation reads

H_{ex}^0 = \sum_k E_k\,a_k^{\dagger} a_k, \qquad \text{with } E_k = -2 J_{12} \cos k.    (3)
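A direct way to see the band structure (3) is to diagonalize the nearest-neighbour Hamiltonian of the ideal ring numerically. The Python sketch below is an illustration added to this text (with N = 8 and J12 = 1 chosen as example values); it builds the exciton Hamiltonian in the site basis and compares its spectrum with the cosine band.

import numpy as np

N, J12 = 8, 1.0                                  # octameric ring, energies in units of J12
H0 = np.zeros((N, N))
for m in range(N):
    H0[m, (m + 1) % N] = J12                     # nearest-neighbour transfer on a closed ring
    H0[m, (m - 1) % N] = J12

eig_site = np.sort(np.linalg.eigvalsh(H0))
k = 2.0 * np.pi / N * np.arange(-N // 2, N // 2)
eig_band = np.sort(-2.0 * J12 * np.cos(k))       # Eq. (3); for an even-membered ring the
print(np.allclose(eig_site, eig_band))           # sign convention of J12 does not change the set of levels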
In the local site basis, the influence of static disorder is modeled by a Gaussian distribution for the uncorrelated transfer integral fluctuations δJ_{mn} with a standard deviation Δ_J,

H_s = \sum_{m,n\,(m\neq n)} \delta J_{mn}\,a_m^{\dagger} a_n.

We use the nearest neighbour approximation J = J_{12}. The Hamiltonian of the static disorder adds to the Hamiltonian of the ideal ring,

H = H^0 + H_s.    (4)
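One realization of the uncorrelated static disorder in the transfer integrals can then be generated as in the following sketch (illustrative only; N, J12 and Δ_J are example values). It draws a Gaussian fluctuation δJ for every nearest-neighbour coupling of the ring and adds it symmetrically to the ideal Hamiltonian, which is the object that has to be re-diagonalized for thousands of realizations.

import numpy as np

def disordered_hamiltonian(N=8, J12=1.0, delta_J=0.2, rng=None):
    """One realization of H = H0 + Hs with Gaussian disorder in the transfer integrals."""
    rng = np.random.default_rng() if rng is None else rng
    H = np.zeros((N, N))
    for m in range(N):
        n = (m + 1) % N                          # nearest neighbour on the ring
        dJ = rng.normal(0.0, delta_J * J12)      # delta J_{mn}, standard deviation Delta_J
        H[m, n] = H[n, m] = J12 + dJ             # kept Hermitian
    return H

rng = np.random.default_rng(0)
levels = np.array([np.linalg.eigvalsh(disordered_hamiltonian(rng=rng)) for _ in range(1000)])
print(levels.mean(axis=0))                       # disorder-averaged exciton levels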
All of the Qy transition dipole moments of the chromophores (BChls B850) in a ring without static and dynamic disorder lie approximately in the plane of the ring and the entire dipole strength of the B850 band comes from a degenerate pair of orthogonally polarized transitions (at an energy slightly higher than the transition energy of the lowest exciton state (LH2), slightly lower than the one of the highest exciton state (LH4)). The dipole strength μ_a of eigenstate |a⟩ of the ring with static disorder and the dipole strength μ_α of eigenstate |α⟩ of the ring without static disorder read

\vec{\mu}_a = \sum_{n=1}^{N} c_n^a\,\vec{\mu}_n, \qquad \vec{\mu}_\alpha = \sum_{n=1}^{N} c_n^\alpha\,\vec{\mu}_n,    (5)
where c_n^\alpha and c_n^a are the expansion coefficients of the eigenstates of the unperturbed ring and the disordered one in site representation, respectively. In the case of impulsive excitation the dipole strength is simply redistributed among the exciton levels due to disorder [18]. Thus the impulsive excitation with a pulse of sufficiently wide spectral range will always prepare the same initial state, irrespective of the actual eigenstates of the real ring. After impulsive excitation with polarization e_x the excitonic density matrix ρ [12] is given by [6]

\rho_{\alpha\beta}(t = 0; \vec{e}_x) = \frac{1}{A}\,(\vec{e}_x \cdot \vec{\mu}_\alpha)(\vec{\mu}_\beta \cdot \vec{e}_x), \qquad A = \sum_\alpha (\vec{e}_x \cdot \vec{\mu}_\alpha)(\vec{\mu}_\alpha \cdot \vec{e}_x).    (6)
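In the eigenstate basis, the initial condition (6) is simply an outer product of the projected dipole moments normalized by A. The sketch below illustrates the construction for an idealized tangential (LH2-like) dipole geometry; the geometry, the ring size and the Hamiltonian used to obtain the expansion coefficients are example choices made for this text only.

import numpy as np

def initial_density_matrix(mu_alpha: np.ndarray, e_x: np.ndarray) -> np.ndarray:
    """Eq. (6): rho_{alpha,beta}(0) = (e_x . mu_alpha)(mu_beta . e_x) / A."""
    proj = mu_alpha @ e_x                        # (e_x . mu_alpha) for every eigenstate
    A = np.sum(proj * proj)
    return np.outer(proj, proj) / A

N = 8
angles = 2.0 * np.pi * np.arange(N) / N
mu_site = np.stack([-np.sin(angles), np.cos(angles), np.zeros(N)], axis=1)  # in-plane tangents
H0 = np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)         # ideal ring
_, c = np.linalg.eigh(H0)                        # columns are the coefficients c^alpha_n
mu_alpha = c.T @ mu_site                         # Eq. (5)
rho0 = initial_density_matrix(mu_alpha, np.array([1.0, 0.0, 0.0]))
print(np.trace(rho0))                            # the initial density matrix has unit trace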
The usual time-dependent anisotropy of fluorescence

r(t) = \frac{\langle S_{xx}(t)\rangle - \langle S_{xy}(t)\rangle}{\langle S_{xx}(t)\rangle + 2\langle S_{xy}(t)\rangle}, \qquad S_{xy}(t) = \int P_{xy}(\omega, t)\,\mathrm{d}\omega    (7)

is determined from

P_{xy}(\omega, t) = A \sum_{a} \sum_{a'} \rho_{a a'}(t)\,(\vec{\mu}_a \cdot \vec{e}_y)(\vec{e}_y \cdot \vec{\mu}_{a'})\,[\delta(\omega - \omega_{a'0}) + \delta(\omega - \omega_{a0})].    (8)
The brackets in Eq. (7) denote the ensemble average and the orientational average over the sample. The crucial quantity entering r(t) in Eq. (7) is the exciton density matrix ρ. The dynamical equations for ρ obtained by Čápek [20] read

\frac{\mathrm{d}}{\mathrm{d}t}\,\rho_{mn}(t) = \mathrm{i} \sum_{pq} \bigl(\Omega_{mn,pq} + \delta\Omega_{mn,pq}(t)\bigr)\,\rho_{pq}(t).    (9)

In the long time approximation the coefficient δΩ(t → ∞) becomes time independent. All details of the calculations leading to the time convolution-less dynamical equations for ρ(t) are given elsewhere [14] and we shall not repeat them here. Obtaining the full time dependence of δΩ(t) is not a simple task. We have succeeded in calculating microscopically the full time dependence of δΩ(t) only for the simplest molecular model, namely a dimer [21]. In the case of a molecular ring we have to resort
to some simplification [14]. In what follows we use the Markovian version of Eq. (9) with a simple model for the correlation functions C_{mn} of the bath, assuming that each site (i.e. each chromophore) has its own bath completely uncoupled from the baths of the other sites. Furthermore it is assumed that these baths have identical properties [3,22]. Then only one correlation function C(ω) of the bath is needed,

C_{mn}(\omega) = \delta_{mn}\,C(\omega) = \delta_{mn}\,2\pi\,[1 + n_B(\omega)]\,[J(\omega) - J(-\omega)].    (10)

Here J(ω) is the spectral density of the bath [22] and n_B(ω) the Bose-Einstein distribution of phonons. The model of J(ω) often used in the literature is

J(\omega) = \Theta(\omega)\,j_0\,\frac{\omega^2}{2\omega_c^3}\,e^{-\omega/\omega_c}    (11)

and has its maximum at 2ω_c.
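For reference, the spectral density (11) and the bath correlation function (10) can be evaluated as in the sketch below (illustrative only; the temperature and the parameters j0 and ω_c are placeholders, and units with ħ = k_B = 1 are assumed).

import numpy as np

def spectral_density(omega, j0=0.4, omega_c=0.3):
    """Eq. (11): J(w) = Theta(w) * j0 * w^2 / (2 w_c^3) * exp(-w / w_c)."""
    omega = np.asarray(omega, dtype=float)
    w = np.where(omega > 0.0, omega, 0.0)        # Theta(omega) cuts off negative frequencies
    return j0 * w ** 2 / (2.0 * omega_c ** 3) * np.exp(-w / omega_c)

def bath_correlation(omega, T=0.5, **kwargs):
    """Eq. (10): C(w) = 2*pi*(1 + n_B(w)) * (J(w) - J(-w)) for w != 0."""
    omega = np.asarray(omega, dtype=float)
    n_B = 1.0 / np.expm1(omega / T)              # Bose-Einstein distribution of phonons
    return 2.0 * np.pi * (1.0 + n_B) * (spectral_density(omega, **kwargs) - spectral_density(-omega, **kwargs))

w = np.linspace(-3.0, 3.0, 121)
print(bath_correlation(w[w != 0.0])[:5])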
3 Numerical Solution
For the time propagation of the density matrix ρ (Eq. 9) the short iterative Arnoldi method [23] as well as the standard Runge-Kutta scheme have been used. An advantage of the first one with respect to the second one is the low computational effort for moderate accuracy [24]. Furthermore, the expansion coefficients are adapted at each time to a fixed time step with a prespecified tolerance in contrast to the Runge-Kutta scheme in which the time step is adapted. An uniform time grid is important for averaging of various realizations at the same time points without interpolation. The realization averaging and the orientational averaging can easily be parallelized by means of Message passing interface (MPI). Some computations were performed on a PC cluster. So instead of running about 10 000 realizations on one node, 312 realizations can be calculated on each of the 32 nodes (or 52 realizations on each of 192 nodes). Results of our simulations are presented graphically in the next section. We use dimensionless energies normalized to the transfer integral J12 = J and the renormalized time τ . To convert τ into seconds one has to divide τ by 2πcJ with c being the speed of light in cm s−1 and J in cm−1 . Estimation of J varies between 250 cm−1 and 400 cm−1 . Our time unit (τ = 1) corresponds for these extreme values to 21.2 fs or 13.3 fs.
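The conversion between the dimensionless time τ and physical time quoted above follows directly from t = τ/(2πcJ); the following few lines (an illustration, not part of the original text) reproduce the 21.2 fs and 13.3 fs values.

import math

def tau_to_fs(tau: float, J_cm: float, c_cm_per_s: float = 2.99792458e10) -> float:
    """Convert the dimensionless time tau to femtoseconds for a transfer integral J in cm^-1."""
    return tau / (2.0 * math.pi * c_cm_per_s * J_cm) * 1.0e15

print(tau_to_fs(1.0, 250.0), tau_to_fs(1.0, 400.0))   # approx. 21.2 fs and 13.3 fs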
4 Results
The molecular ring is a common shape of many antenna units in bacterial photosynthetic systems. They differ in the number of BChl molecules, the orientation of their optical transition dipoles, the interchromophore distance, etc. We present graphically the results of our modeling of the time dependence of the fluorescence anisotropy in the recently discovered cyclic antenna unit of the BChl photosystem, namely in LH4
Fig. 1. The time and ΔJ dependence of the anisotropy of fluorescence rr of the LH4 (octameric) ring (without dynamic disorder)
with radial optical transition dipole arrangement. The transition dipole arrangement has a pronounced effect on the strength of the interaction J between BChl molecules. The width of the exciton energy band of the ideal ring is two times larger for the tangential arrangement as in LH2 in comparison with radial arrangement in LH4. Also signs of J are opposite in both configurations. Optically accessible exciton states in ideal rings are near the bottom (upper) edge of the exciton band in LH2 (LH4) respectively. The time dependence of fluorescence anisotropy (Eq. (7)) has been calculated using dynamical equations for the exciton density matrix ρ to express the time dependence of the optical properties of the ring units in the femtosecond time range. Details are the same as in Ref. [14,15,16]. Substantial relaxation on the
Fig. 2. The time and ωc dependence of the anisotropy of fluorescence rr of the LH4 (octameric) ring. The dynamic disorder is included at T = 0.5 J (j0 = 0.2 J - left, j0 = 0.4 J - right).
Fig. 3. The time and ΔJ dependence of the anisotropy of fluorescence rr of the LH4 (octameric) ring. The dynamic disorder is also included at T = 0.5 J (j0 = 0.2 J upper row, j0 = 0.4 J - lower row, ωc = 0.1 J - left column, ωc = 0.3 J - right column).
Fig. 4. The time and ΔJ dependence of the difference between anisotropy of fluorescence rt − rr of the LH2 (nonameric) ring and the LH4 (octameric) one (without dynamic disorder)
time scale of 10-100 fs and an anomalously large initial anisotropy of 0.7 has been observed. Nagarjan et al. [6] suggested a model considering decay processes of electronic coherences within the manifold of the exciton states and thermal equilibration among the excitonic states. He supposed (without explicit calculation of the exciton dynamics) that this model reproduces well main features of the spectral relaxation and the decay of anisotropy in cyclic molecular units.
Fig. 5. The time and ωc dependence of the difference between anisotropy of fluorescence rt − rr of the LH2 (nonameric) ring and the LH4 (octameric) one for two strengths j0 of dynamic disorder at temperature T = 0.5 J (j0 = 0.2 J - left, j0 = 0.4 J - right)
Fig. 6. The time and ΔJ dependence of the difference between the anisotropy of fluorescence rt − rr of the LH2 (nonameric) ring and LH4 (octameric) one. The dynamic disorder is also included at T = 0.5 J (j0 = 0.2 J - upper row, j0 = 0.4 J - lower row, ωc = 0.1 J - left column, ωc = 0.3 J - right column).
Let us look at time decay of the anisotropy of fluorescence in the molecular ring with radial arrangement rr (t) - LH4, octameric one. Recently we discussed [19] results in the case of one type of the static disorder - uncorrelated Gaussian disorder in the local energies.
In the present paper we concentrate on other type of the static disorder, the Gaussian uncorrelated disorder in the transfer integrals J, characterized by ΔJ . In Fig. 1. there is the time dependence of the anisotropy of fluorescence in case of the pure static disorder in transfer integrals for different ΔJ . It is seen that the pure static disorder ΔJ = 0.2 leads to decay of the anisotropy of fluorescence from 0.7 to 0.4 within τ = 20. Influence of the pure dynamic disorder, for two its strengths j0 and different maxima 2ωc of the spectral density function J(ω) is shown in Fig. 2. Dynamic disorder has a pronounced effect mainly in case of lower ωc . Consequences of the combined static and dynamic disorder are presented in Fig.3. The time dependence of the anisotropy of fluorescence is displayed on four 3D graphs for two strengths of interaction with the bath and two maxima 2ωc of the spectral density function. While for ωc = 0.1J the influence of the static disorder is secondary due to dominance of the dynamic disorder, for larger ωc = 0.3J the influence of the static disorder is more pronounced. Comparison of the time dependence of fluorescence anisotropy for the octameric ring with radial arrangement of optical transition dipoles (like in LH4) and for the nonameric one with tangential arrangement of optical transition dipoles is presented graphically as differences rt (τ ) − rr (τ ) in Figs 4-6.
5 Conclusions
The difference of the anisotropy of fluorescence between the nonameric tangentially arranged ring rt (τ ) and octameric radially arranged one rr (τ ) for the static disorder in transfer integrals (Fig. 4) has an opposite sign in comparison with the result for the static disorder in local energies as shown in Fig. 2 in [19]. For the influence of the dynamic disorder (interaction with the bath), given by Eqs(9-11), we can conclude (Fig. 5) that the same differences are shifted to a smaller times for the stronger interaction j0 . Similar conclusion can be drawn for the case of simultaneously acting static and dynamic disorder (shown in Fig. 3 and 6). We can also see more rapid decay of the anisotropy of fluorescence due to dynamic disorder in nonameric ring for smaller values of ωc (ωc = 0.1 J) (negative difference) and in octameric ring for larger values of ωc (ωc = 0.3 J) in presence of static disorder in transfer integrals (Fig. 6).
Acknowledgement Support from the Ministry of Education, Youth and Sports of the Czech Republic (projects MSM0021620835 - I.B. and LC06002 - P.H.) is gratefully acknowledged.
References 1. de Ruijter, P.F., et al.: Biophysical J. 87, 3413 (2004) 2. Jang, S., Dempster, S.F., Silbey, R.J.: J. Phys. Chem. B 105, 6655 (2001)
3. Sundstr¨ om, V., Pullerits, T., van Grondelle, R.: J. Phys. Chem. B 103, 2327 (1999) 4. Novoderezhkin, V., van Grondelle, R.: J. Phys. Chem. B 106, 6025 (2002) 5. Nagarjan, V., Alden, R.G., Williams, J.C., Parson, W.W.: Proc. Natl. Acad. Sci. USA. 93, 13774 (1996) 6. Nagarjan, V., Johnson, E.T., Williams, J.C., Parson, W.W.: J. Phys. Chem. B 103, 2297 (1999) 7. Nagarjan, V., Parson, W.W.: J. Phys. Chem. B 104, 4010 (2000) 8. Rahman, T.S., Knox, R.S., Kenkre, V.M.: Chem. Phys. 44, 197 (1979) 9. Wynne, K., Hochstrasser, R.M.: Chem. Phys. 171, 179 (1993) 10. K¨ uhn, O., Sundstr¨ om, V., Pullerits, T.: Chem. Phys. 275, 15 (2002) 11. Heˇrman, P., Kleinekath¨ ofer, U., Barv´ık, I., Schreiber, M.: J. Lumin. 447, 94–95 (2001) 12. Heˇrman, P., Kleinekath¨ ofer, U., Barv´ık, I., Schreiber, M.: Chem. Phys. 275, 1 (2002) 13. Barv´ık, I., Kondov, I., Heˇrman, P., Schreiber, M., Kleinekath¨ ofer, U.: Nonlin. Opt. 29, 167 (2002) 14. Heˇrman, P., Barv´ık, I.: Czech. J. Phys. 53, 579 (2003) 15. Reiter, M., Heˇrman, P., Barv´ık, I.: J. Lumin. 110, 258 (2004) 16. Heˇrman, P., Barv´ık, I., Reiter, M.: J. Lumin. 112, 469 (2005) 17. Heˇrman, P., Barv´ık, I.: J. Lumin. 558, 122–123 (2007) 18. Kumble, R., Hochstrasser, R.: J. Chem. Phys. 109, 855 (1998) 19. Heˇrman, P., Barv´ık, I., Zapletal, D.: J. Lumin. 128, 768 (2008) ˇ apek, V.: Z. Phys. B 99, 261 (1996) 20. C´ 21. Barv´ık, I., Macek, J.: J. Chin. Chem. Soc. 47, 647 (2000) 22. May, V., K¨ uhn, O.: Charge and Energy Transfer in Molecular Systems. WileyWCH, Berlin (2000) 23. Pollard, W.T., Friesner, R.A.: J. Chem. Phys. 100, 5054 (1994) 24. Kondov, I., Kleinekath¨ ofer, U., Schreiber, M.: J. Chem. Phys. 114, 1497 (2001)
Functional Availability Analysis of Discrete Transport System Realized by SSF Simulator Tomasz Walkowiak and Jacek Mazurkiewicz Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, ul. Janiszewskiego 11/17, 50-372 Wroclaw, Poland {Tomasz.Walkowiak,Jacek.Mazurkiewicz}@pwr.wroc.pl
Abstract. The paper describes a novel approach to functional availability analysis of discrete transport systems realized using Scalable Simulation Framework (SSF). The proposed method is based on modeling and simulating of the system behavior by Monte Carlo simulation. No restriction on the system structure and on a kind of distribution is the main advantage of the method. The paper presents some exemplar system modeling. The authors stress the problem of influence of the functional parameters on final system availability. The problem described in the paper is practically essential for defining an organization of vehicle maintenance and transport system logistics.
1 Introduction
Decisions related to transport systems ought to be taken based on different and sometimes contradictory conditions. The transport systems are characterized by a very complex structure. The performance of the network can be impaired by various types of faults related to the transport vehicles, communication infrastructure or even by traffic congestion [8]. The analysis of transport system functionality can only be done if there is a formal model of the transport logistics. The classical models used for reliability analysis are mainly based on Markov or Semi-Markov processes [1] which are idealized and it is hard to reconcile them with practice. We suggest the Monte Carlo simulation [4] for proper functional parameters calculation. No restriction on the system structure and on a kind of distribution is the main advantage of the method [9]. We propose to use the SSF (Scalable Simulation Framework) [2] instead of dedicated system elaboration. Our previous works [5], [7], [9], [10] show that it is very hard to prepare the simulator which includes all aspects of discrete transport. The SSF is a simulation core. It was developed for a usage in the SSFNet [3] a popular simulator of computer networks. We developed an extension to SSF allowing to simulate transport systems. We propose a formal model of discrete transport system to analyze functional aspects of complex systems. The presented in the next chapter discrete transport system model is based on the Polish Post regional centre of mail distribution. M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 671–678, 2008. c Springer-Verlag Berlin Heidelberg 2008
2 Discrete Transport System with Central Node and Time-Table (DTSCNTT)
The model can be described as follows:

DTSCNTT = ⟨CN, N, R, V, I, M, TT⟩,    (1)
where: CN - central node, N - set of ordinary nodes, R - set of routes, V - set of vehicles, I - input model, M - set of maintenance crews and TT - vehicles’ time-table. Commodities: We can discuss several kinds of a commodity transported in the system. Single kind commodity is placed in a unified container, and containers are transported by vehicles. The commodities are addressed and there are no other parameters describing them. Nodes: We have single central node in the system. The central node is the destination of all commodities taken from other - ordinary nodes. Moreover the length between each two nodes is given. Input Model: The aim of the system is to transport containers from the central node to ordinary nodes and in the opposite way. The containers are generated in each node. The central node is the global generator of commodities driven to each ordinary nodes of the system. The generation of containers is described by Poisson process. In case of central node there are separate processes for each ordinary node. Whereas, for ordinary nodes there is one process. The input model includes intensities of container generation in each ordinary node (routed to central node) and a table of intensities of containers for each ordinary node in the central node. Vehicles: We assumed that all vehicles are of the same type and are described by following functional and reliability parameters: mean speed of a journey, capacity - number of containers which can be loaded, reliability function and time of vehicle maintenance. The central node is the start and destination of vehicle travels. The temporary state of each vehicle is characterized by following data: vehicle state, distance traveled from the begin of the route, capacity of the commodities. The vehicle running to the end of the route is able to take different kinds of commodity (located in unified containers, each container includes singlekind commodity). The vehicle hauling a commodity is always fully loaded or taking the last part of the commodity if it is less than its capacity. Routes: Each route describes possible trip of vehicles. The set of routes we can describe as series of nodes: R = c, v1 , ..., vn , c and vi ∈ N and c = CN.
(2)
Maintenance Crews: Maintenance crews are identical and unrecognized. The crews are not assigned to any node, are not combined to any route, they operate in the whole system and are described only by the number of them. The temporary state of maintenance crews is characterized by: number of crews which are not involved into maintenance procedures and queue of vehicle waiting for the maintenance.
Time-Table: Vehicles operate according to the time-table exactly as city buses or intercity coaches. The time-table consists of a set of routes (sequence of nodes staring and ending in the central node, times of approaching each node in the route and the recommended size of a vehicle. The number of used vehicle, or the capacity of vehicles does not depend on temporary situation described by number of transportation tasks or by the task amount for example. It means that it is possible to realize the journey by completely empty vehicle or the vehicle cannot load the available amount of commodity (the vehicle is to small). Time-table is a fixed element of the system in investigated time horizon, but it is possible to use different time-tables for different seasons or months of the year. Each day a given time-table is realised, it means that at a time given by the time table a vehicle, selected randomly from vehicles available in the central node, departures from central node and loaded with containers addressed to each ordinary nodes included in a given route. This is done in a proportional way. Next, after arriving at given node (it takes some time according to vehicle speed - random process and road length) the vehicle waits in an input queue if there is any other vehicle being loaded/unload at the same time. There is only one handling point in each node. The time of loading/unloading vehicle is described by a random distribution. The containers addressed to given node are unloaded and empty space in the vehicle is filled by containers addressed to a central node. The operation is repeated in each node on the route and finally the vehicle is approaching the central node when is fully unloaded and after it is available for the next route. The process of vehicle operation could be stopped at any moment due to a failure (described by a random process). After the failure, the vehicle waits for a maintenance crew (if there are no available due to repairing other vehicles), is being repaired (random time) and after that it continues its journey.
3 Simulation Methodology
Discrete transport system described in the previous section is very hard to analyze by a formal model. It does not fit the Markov process framework. A common way of analyzing that kind of systems is a computer simulation. To analyze the system we must at first build a model and then operate the model. The system model needed for simulation has to encompass the system elements behavior and interaction between elements. In case of dependability we have to include system element reliability model. Except the system functionality model we have to model the traffic in the system. The data for simulation of a given real exemplar system consists of system element model (described in the system functionality meta-model formalism) and a given traffic configuration. Once a model has been developed, it is executed on a computer by an eventsimulation, which is based on a idea of event. The event is described by time of event occurring, type of event (in case of DTSCNTT it could be vehicle failure) and element or set of elements of the system on which event has its influence. The simulation is done by analyzing a queue of event (sorted by time of event occurring) while updating the states of system elements according to rules related
to a proper type of event. The event-simulation program could be written in general purpose programming language (like C++), in fast prototyping environment (like Matlab) or special purpose discrete-event simulation kernels. One of such kernels, is the Scalable Simulation Framework (SSF) [2] which is a used for SSFNet [3] computer network simulator. SSF is an object-oriented API - a collection of class interfaces with prototype implementations. It is available in C++ and Java. SSFAPI defines just five base classes: Entity, inChannel, outChannel, Process, and Event. The communication between entities and delivery of events is done by channels (channel mappings connects entities) [3]. For the purpose of simulating DTSCNTT we have used Parallel Real-time Immersive Modeling Environment (PRIME) [6] implementation of SSF due to much better documentation then that available for original SSF. We have developed a generic class (named DTSObject) derived from SSF Entity which is a base of classes modeling DTSCNTT objects like: scheduler, node, truck and crew which model the behavior of presented in section 2 discrete transport system. The effectiveness of simulation done in PRIME environment is very promising. The tests done on one batch of simulation of DTSCNTT exemplar described in the next section needed from 3.9 to 9 seconds on Pentium 2 GHz computer. The time needed to perform one simulation depends on the number of events presented in the system, which is a result of DTSCNTT configuration. Due to a presence of randomness in the DTSCNTT model the analysis of it has to be done based on Monte-Carlo approach. It requires a large number of repeated simulation. The SSF is not a Monte-Carlo framework but by simple re-execution of the same code (of course we have to start from different values of random number seed) the statistical analysis of system behavior could be realized [12].
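The event-driven principle sketched above - a time-ordered queue of events, each of which updates part of the system state and may schedule further events - can be illustrated in a few lines of Python with a priority queue. This is only a didactic sketch of the idea with made-up parameters, not the SSF/PRIME API used by the authors.

import heapq
import random

def simulate(horizon_h=2400.0, mttf_h=1000.0, repair_mean_h=2.0, seed=0):
    """Toy event-driven run: one truck alternating between failure and repair events."""
    rng = random.Random(seed)
    events = [(rng.expovariate(1.0 / mttf_h), "failure")]      # (time, type) priority queue
    failures = 0
    while events:
        t, kind = heapq.heappop(events)                        # next event in time order
        if t > horizon_h:
            break
        if kind == "failure":
            failures += 1
            repair = max(0.0, rng.gauss(repair_mean_h, 0.5))
            heapq.heappush(events, (t + repair, "repaired"))
        else:                                                  # after repair, schedule the next failure
            heapq.heappush(events, (t + rng.expovariate(1.0 / mttf_h), "failure"))
    return failures

print(simulate())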
4 Functional Availability of DTSCNTT
The analysis of a given system requires a metric. We propose to use the availability of the system. We define it as an ability of realising the transportation task in required time. The availability is a probability measure. Introducing the following notation:
– T - time measured from the moment when the container was introduced to the system to the moment when the container was transferred to the destination (random value),
– Tg - guaranteed time of delivery, if exceeded the container is delayed,
– N(t) - stochastic process describing the number of delayed containers at time t,
– k - the level of acceptable delay,
we can define the functional availability Ak(t) as a probability that the number of delayed containers at time t does not exceed k, i.e.:

A_k(t) = \Pr\{N(t) \le k\}.    (3)
The calculation of stochastic process N(t) is based on analysing a state of each
Fig. 1. The delivery in guaranteed time (a) and delayed delivery (b)
not yet delivered container. As illustrated in Fig. 1. we can observe two possible situations: (a) - delivery was realised before guaranteed time Tg - there is no delay, (b) - delivery was delayed - time of delay: T - Tg .
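Given per-run trajectories of the number of delayed containers N(t), sampled on a common time grid, the functional availability (3) is estimated as the fraction of simulation runs in which N(t) ≤ k at each time point. A minimal sketch follows; the synthetic Poisson trajectories only stand in for the simulator output and are not related to the real system.

import numpy as np

def availability(delayed_counts: np.ndarray, k: int = 20) -> np.ndarray:
    """Eq. (3): A_k(t) estimated as the fraction of runs with N(t) <= k at every time point."""
    return (delayed_counts <= k).mean(axis=0)

rng = np.random.default_rng(1)
N_t = rng.poisson(lam=12.0, size=(10000, 1000))   # 10 000 runs x 1000 time samples (synthetic)
A20 = availability(N_t, k=20)
print(A20[:5])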
5 DTSCNTT Case Study
For testing purposes of the presented DTSCNTT system (chapter 2) and the developed extension of SSF (chapter 3) we have developed an exemplar transport system. It consists of one central node (the city of Wroclaw, Poland) and three ordinary nodes (cities near Wroclaw: Rawicz, Olesnica and Nysa). The distances between nodes have been set approximating the real distances between the used cities and they equal: 85, 60 and 30 km. We assumed a usage of 5 trucks (two with capacity set to 10 and three with capacity 15) with mean speed 50 km/h. The vehicles realized 19 trips a day: from the central node to an ordinary node and the return trip. Failures of trucks were modeled by an exponential distribution with mean time
NET [
  Vertex [ID Nys MTTB 0.6]
  Vertex [ID Raw MTTB 0.4]
  Vertex [ID Ole MTTB 0.3]
  CeVertex [ID Wro MTTB [Nys 0.5 Raw 0.4 Ole 0.3] ]
  Truck [No 2 Speed 50 Size 10 MTTF 1000]
  Truck [No 3 Speed 50 Size 15 MTTF 1000]
  Trip [Size 10 Start 8.00 Dest[ID Ole Time 8.40]]]
  Trip [Size 10 Start 9.30 Dest[ID Ole Time 10.10]]
  Trip [Size 10 Start 11.00 Dest[ID Ole Time 11.40]]
  Trip [Size 10 Start 12.30 Dest[ID Ole Time 13.10]]
  Trip [Size 10 Start 14.00 Dest[ID Ole Time 14.40]]
  Trip [Size 10 Start 15.30 Dest[ID Ole Time 16.10]]
  Trip [Size 10 Start 17.00 Dest[ID Ole Time 17.40]]
  …
Fig. 2. Exemplar DTSCNTT description in DML file
Fig. 3. Functional availability A20(t) of the DTSCNTT as a function of time t, 5 trucks operate
Fig. 4. Functional availability A20(t) of the DTSCNTT as a function of time t, 4 trucks operate
to failure equal to 1000h. The repair time was modeled by normal distribution with mean value equal to 2h and variance of 0.5h. The containers addressed to ordinary nodes were available in the central node at every 0.5, 0.4 and 0.3 of an hour respectively. Containers addressed to the central node were generated at every 0.6, 0.4, 0.3 of hour in following ordinary nodes. There was a single maintenance crew. The availability of the system Ak (t) was calculated with guaranteed time Tg =24h and parameter k =20. Time-table as well as other functional parameters were described in a DML file (see example in Fig. 2.). The Domain Modeling Language (DML) [6] is a SSF specific text-based language which
includes a hierarchical list of attributes used to describe the topology of the model and model attribute values. Based on 10 000 time simulations (each covering 100 days) the availability of the system was calculated. Results presented in Fig. 3 show periodic changes. This situation is an effect of the used time-tables and the method of containers' generation. The containers are generated throughout the day (by a Poisson process) but, according to the time-table, trucks do not operate in the night. The probability of delay increases at night, but the selected number of trucks (5) is satisfactory for the given system. We have also analyzed a system with a reduced number of vehicles (with 4). The resulting value of the availability function is presented in Fig. 4. It could be noticed that the availability of the system decreases due to the lack of a sufficient number of trucks. It should be noticed here that, looking at the used time-table and not taking into consideration the randomness of the transport system (failures and traffic jams), only three vehicles should be enough to transport all the generated containers.
6 Conclusion
We have presented a simulation approach to functional analysis of Discrete Transport System with Central Node and Time-Table (DTSCNTT). The DTSCNTT models behavior of the Polish Post regional centre of mail distribution. Developed simulation software allows to analyze availability of the system in a function of all model parameters, like for example changes in a time-table or in a number of used trucks. Also, some economic analysis could be done following the idea presented in [5], [11], [12]. It could be used for example for selection of the optimum value for SLA (service level agreement). The presented results, i.e. changes of availability in a function of a number of used trucks shows that presented approach allows to answer a non trivial question what should be a number of vehicles to fulfill some requirements given to the transport system. The implementation of DTSCNTT simulator done based on SSF allows to apply in a simple and fast way changes in the transport system model. Also the time performance of SSF kernel results in a very effective simulator of discrete transport system. Therefore, in our opinion introduced exemplar analysis shows that the described method of transport system modeling can serve for practical solving of essential decision problems related to an organization and parameters of a real transport system. The proposed analysis seems to be very useful for mail distribution centre organization. Work reported in this paper was sponsored by a grant No. 4 T12C 058 30, (years: 2006-2009) from the Polish Committee for Scientific Research (KBN).
References 1. Barlow, R., Proschan, F.: Mathematical Theory of Reliability, Society for Industrial and Applied Mathematics, Philadelphia (1996) 2. Cowie, J.H.: Scalable Simulation Framework API reference manual (1999), http://www.ssfnet.org/SSFdocs/ssfapiManual.pdf
3. Cowie, J.H., Nicol, D.M., Ogielski, A.T.: Modeling the Global Internet. Computing in Science and Engineering 1(1), 42–50 (1999) 4. Fishman: Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag, New York (1996) 5. Kaplon, K., Mazurkiewicz, J., Walkowiak, T.: Economic Analysis of Discrete Transport Systems. Risk Decision and Policy 8(3), 179–190 (2003) 6. Liu, J.: Parallel Real-time Immersive Modeling Environment (PRIME), Scalable Simulation Framework (SSF), User’s manual, Colorado School of Mines Dep. of Mathematical and Computer Sciences (2006), http://prime.mines.edu 7. Mazurkiewicz, J., Walkowiak, T.: Fuzzy Economic Analysis of Simulated Discrete Transport System. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 1161–1167. Springer, Heidelberg (2004) 8. Sanso, B., Milot, L.: Performability of a Congested Urban-Transportation Network when Accident Information is Available. Transportation Science 33, 1 (1999) 9. Walkowiak, T., Mazurkiewicz, J.: Hybrid Approach to Reliability and Functional Analysis of Discrete Transport System. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3037, pp. 236–243. Springer, Heidelberg (2004) 10. Walkowiak, T., Mazurkiewicz, J.: Reliability and Functional Analysis of Discrete Transport System with Dispatcher. In: Advances in Safety and Reliability, European Safety and Reliability Conference - ESREL 2005, pp. 2017–2023. Taylor & Francis Group, London (2005) 11. Walkowiak, T., Mazurkiewicz, J.: Simulation Based Management and Risk Analysis of Discrete Transport Systems. In: IEEE TEHOSS 2005 Conference, pp. 431–436 (2005) 12. Walkowiak, T., Mazurkiewicz, J.: Discrete transport system simulated by SSF for reliability and functional analysis. In: International Conference on Dependability of Computer Systems. DepCoS - RELCOMEX 2007, pp. 352–359. IEEE Computer Society Press, Los Alamitos (2007)
Parallel Implementation of Vascular Network Modeling
Krzysztof Jurczuk and Marek Krętowski
Faculty of Computer Science, Bialystok Technical University, Wiejska 45a, 15-351 Bialystok, Poland
{kjurczuk,mkret}@wi.pb.edu.pl
Abstract. The paper presents modeling of the vascular system in a parallel environment. The aim of this approach is to accelerate the simulation of vascular network growth and make it closer to analogous real life processes. We concentrated on the perfusion process and made an attempt to parallelize the process of connecting ischemic macroscopic functional units to existing vascular systems. The proposed method was implemented on a computing cluster with the use of the MPI standard. The results show that it is possible to gain a significant speedup that allows us to make simulations for a greater number of macroscopic functional units and vessels in a reasonable time, which increases the possibility to create more complex and more precise virtual organs.
1 Introduction
The human body is characterized by high complexity. It can be observed on each level, starting from molecules, cells and ending on organs and the whole organism [3]. Moreover, a lot of internal mechanisms are parallel or even distributed. These factors are the main reasons why the modeling of living systems is becoming more and more important. The modeling provides new ways to better understand complex interactions between elementary mechanisms and behaviors of the whole organs. One of the main difficulties in model designing is the necessity to capture the most essential properties of the system and disregard the elements whose role is insignificant. It is not easy to choose appropriate simplifications, which applies especially to living organisms. Too simple models can be useless but, on the other hand, too elaborate models can be ineffective in practical cases, which means that the computations cannot be done in a reasonable time. Therefore, it appears natural to attempt to use parallel computing in the modeling of living organisms, especially vascular networks. Implementations in a parallel environment can accelerate the simulation process and allow us to introduce more sophisticated details. In this paper, we focus on the modeling of vascular systems. They are very important in the detection processes of various pathological anomalies, because changes in vascular structures can be directly caused by diseases. Most of these modifications appear in medical images, especially when the contrast product M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 679–688, 2008. c Springer-Verlag Berlin Heidelberg 2008
is administrated. Vessels play a key role in a contrast material propagation and they are one of the most visible structures in dynamic images. Therefore, the modeling of vascular systems can support the development of methods to detect early indicators of diseases and help to understand the mechanisms of image formation. Many vascular models have been proposed, e.g. Constrained Constructive Optimization (COO) method for an arterial tree generation [11], an algorithm of arterial tree growth inside a defined and gradually expanding shape [1], improved CCO method to simulate the coronary tree [6] or a fractal model [13]. According to our knowledge, all of them use the sequential algorithm to develop vascular systems. In our previous studies [2] and [7], we used the physiological modeling as a way to better understand medical images (both CT and MRI) and to find some image markers of pathologies. In the research we also made use of the sequential algorithm to generate a virtual organ of liver (represented by three vascular trees and parenchyma) [8], CT simulator and MRI virtual scanner implemented in a parallel environment. In this paper, however, we propose a parallel algorithm of the vascular network growth, based on the previously used sequential algorithm. We concentrated on the perfusion process and parallelized the process of connecting new cells to existing vascular trees. The aim of this research is to accelerate the simulation of the vascular network growth and bring it as close to the real, analogous process as possible. The rest of the paper is organized as follows. In the next section the organ model with sequential algorithm of vascular system development is briefly recalled. Whereas, in Sect. 3 the parallel algorithm of the same vascular growth process is presented. An experimental validation of the presented approaches is performed in Sect. 4. The conclusion and some plans for future research are sketched in the last section.
2 Model Description
In its generic form [8], the discussed model was constructed for the modeling of internal organs which develop by a division of their structural elements. But it should be emphasized that it is oriented towards an image generation. Therefore, the model concentrates on elements which are directly visible in images or have a significant influence on image analysis. The main components of the model are: the tissue and the vascular network that perfused it. Most of features are not linked with any internal organ. However, it is very hard to model particular organs without some kind of specialization. Therefore, the model expresses the specificity of liver, as it is one of the most important organs. It plays a major role in the metabolism and has a number of functions in the body, including glycogen storage, decomposition of red blood cells, plasma protein synthesis, and detoxification [12]. Moreover, it possesses an unique organization of the vascular network with three types of trees: hepatic artery, portal vein and hepatic vein.
2.1 Tissue Modeling
The tissue is represented by a set of Macroscopic Functional Units (MFU) that are distributed regularly inside the specified shape. An MFU is a small, fixed size part of tissue that constitutes the functional unit of the model. It is described by its class, which specifies most of its properties, both functional and structural (e.g. probability of mitosis and necrosis, blood flow rate, blood pressure, size and density). Moreover, certain parameters are described by defined distributions. This mechanism enables modeling the blood flow with more natural variability. Several classes of MFU can be defined in the organ, which allows simulating pathological changes.
2.2 Vascular Network Modeling
In the model, each vessel is represented by an ideal, rigid tube with a fixed radius, wall thickness, length and position. The wall thickness depends on the vessel diameter and its function. The model distinguishes vessels larger than capillaries, whereas the capillaries themselves are hidden in the MFUs. Based on this simplification, the vascular tree model assumes the form of a binary tree (see Fig. 1a). It means that anastomoses, which occur sporadically, especially in pathological situations, are not taken into account. The binary trees, representing vascular trees, are built of nodes characterized by their spatial position, blood flow rate and pressure.
Fig. 1. Binary vascular trees: a) successive bifurcations, b) perfusion process of a new MFU - searching the closest vessels in the three vascular trees
In the model, the blood is treated as a Newtonian fluid with constant viscosity (μ), which makes it possible to calculate the pressure difference (ΔP) between the two extremities of a vessel by Poiseuille's law:

\Delta P = \frac{8 \mu l}{\pi r^{4}}\, Q, \qquad (1)
where l is the length, r is the radius and Q is the blood flow of the vessel. Moreover, at each bifurcation the law of matter conservation has to be observed:

Q = Q_r + Q_l, \qquad (2)
where Q is the blood flow in the parent vessel and Q_r, Q_l are the blood flows in the descendant vessels (right and left daughter branches). It means that the quantity of blood entering and leaving a bifurcation has to be equal. Another equation is connected with the decreasing radius of the vessels as we move from proximal to distal segments of the vascular trees:

r^{\gamma} = r_r^{\gamma} + r_l^{\gamma}, \qquad (3)

where r is the radius of the parent vessel, r_r, r_l are the radii of the descendant vessels (right and left daughter branches) and γ varies between 2 and 3 [5]. This morphological law describes the dependency between the mother vessel radius and the radii of its two daughters.
2.3 Sequential Vascular System Growth Algorithm
The organ growth is modeled as an analogy to the hyperplasia process (the increasing number of cells). It starts with an organ whose size is a fraction of the adult one. As shown in Fig. 2, after parameter initialization, the organ enlarges its size at discrete time moments (called cycles). Therefore, new empty space appears between MFUs. Additionally, each cycle consists of subcycles. In each subcycle, an MFU has a certain probability to give birth to a new MFU of the same class or to die. Consequently, changes appear in the tissue and in the corresponding vascular network. The processes of birth/perfusion and death/retraction are repeated in each subcycle until the empty space, which can appear between cycles, is occupied by new MFU elements. The whole process ends when the organ reaches its full, adult size. At the beginning of each subcycle, for every MFU, a few randomly chosen spatial positions for a new MFU in its neighborhood are tested. If all conditions connected with free space and tissue density are fulfilled, a new MFU is created. This new, small functional unit is not yet perfused by the existing vascular system and is initially ischemic. The next step is to find an optimal bifurcation point which can be used to perfuse the new MFU. First, the distances between all vessels and the new element are calculated and a fixed number of the closest vessels is chosen (see Fig. 1b). Later, temporary bifurcations are created. When there is more than one tree, the algorithm considers all possible combinations of candidate vessels (a single combination consists of one vessel from each tree). The spatial position of the bifurcation is optimized by the Downhill Simplex procedure [10] (minimization of the additional blood volume needed for the new MFU perfusion [8]). Only one combination of vessels can be used to perfuse the new MFU, so the most appropriate one has to be chosen from among all candidates. Additionally, possible collisions between vessels must be checked: only non-crossing configurations are taken into account, so the algorithm detects intersections between perfusing vessels both from the same and from different trees. Finally, for each remaining configuration the volume of the whole tree is computed. The combination with the lowest sum of volumes permanently perfuses the MFU.
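The selection of the permanent bifurcation can be summarized by the following sketch; the function and type names (candidate_combinations, optimize_bifurcation, ...) are illustrative assumptions rather than the original interface.

#include <limits>
#include <vector>

// Hypothetical interfaces; the original implementation may differ.
struct MFU;
struct Vessel;
struct Bifurcation { double volume; /* position, perfusing vessels, ... */ };

std::vector<std::vector<Vessel*>> candidate_combinations(const MFU& m, int per_tree);
Bifurcation optimize_bifurcation(const std::vector<Vessel*>& combo, const MFU& m); // Downhill Simplex
bool crosses_existing_vessels(const Bifurcation& b);

// Returns the non-crossing candidate with the lowest total tree volume.
Bifurcation best_perfusion(const MFU& new_mfu, int closest_per_tree) {
    Bifurcation best;
    best.volume = std::numeric_limits<double>::infinity();
    for (const auto& combo : candidate_combinations(new_mfu, closest_per_tree)) {
        Bifurcation b = optimize_bifurcation(combo, new_mfu);  // temporary bifurcation
        if (crosses_existing_vessels(b)) continue;             // reject crossing configurations
        if (b.volume < best.volume) best = b;                  // keep the cheapest candidate
    }
    return best;  // caller checks whether volume is still infinity (no valid candidate)
}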
Fig. 2. Flow chart representing two loops of events which are distinguished in the presented modeling of the organ
The MFUs are perfused by the vascular system in a sequential manner, one by one. This process is time consuming because for each new MFU a large number of temporary bifurcations is created, and many calculations are required to assure the consistency of the characteristics (i.e. blood flow and pressure) describing the individual vessels. After the reproduction process comes the degeneration phase, in which some of the MFUs can die. The retraction algorithm is much less time consuming than the perfusion process: the vessels supplying the MFU simply retract and disappear, which requires only a single recalculation of the constraints for the vascular system.
3 Parallel Vascular System Development
The most time consuming operation in the presented algorithm of vascular network growth is the perfusion process. This results from the large number of MFUs, the complicated structure of the vascular trees and, especially, the necessity to find the optimal bifurcation. Therefore, a decision was made to spread the computations concerning the perfusion process over the computational nodes. Moreover, our intention was to bring the solution closer to reality, where the analogous perfusion processes also run in parallel. The general scheme of the proposed algorithm in a parallel environment is presented in Fig. 3. The algorithm has two parts. The first (see Fig. 4) is
Fig. 3. Two parts of the parallel algorithm. The first is performed at the beginning of each subcycle and is connected with the tree and tissue migration. The second is performed between subcycles and is responsible for the distribution of the perfusion process over the nodes.
Fig. 4. The first part of the algorithm connected with the trees and tissue migration
performed at the beginning of each subcycle, while the second (see Fig. 5) does its calculations between subcycles. Each node must have the vascular system, i.e. the trees and MFUs, as recent as possible. Therefore, at the beginning of each subcycle (the first part of the presented algorithm) the managing node sends the latest vascular trees and the tissue represented by MFUs to the computational nodes. The vascular system can be large and complex, so its migration between processors within the framework of the message passing interface is composed of three steps: packing the nodes and MFUs into a flat message, sending the message, and unpacking the corresponding nodes and MFUs. In order to minimize the message size we choose only the parameters of the nodes that cannot be reconstructed: position in space, possession of children, individual node number and MFU class. When a computational node receives the message with the vascular trees, the remaining characteristics are restored. Almost all indispensable information about the MFUs is sent with the trees; additionally, as the blood flow is unique in each MFU, we also have to transfer the value of the flow. Moreover, many structural parameters are read from input files at each node, which enables us to send quite a small package in comparison to the real size of the vascular system. We also assigned individual numbers to tree nodes and MFUs, which facilitated the process of migration, rebuilding and permanent perfusion in the managing node. To sum up, after the completion of the first part of the algorithm each computational node possesses identical vascular trees and MFUs.
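The pack-send-unpack step could, for instance, flatten the listed per-node fields into a plain struct array and broadcast it; the sketch below is only an assumption about the payload and transfer, not the authors' actual code.

#include <mpi.h>
#include <vector>

// Minimal per-node payload, as described in the text (illustrative names).
struct PackedNode {
    int   id;            // individual node number
    int   mfu_class;     // class of the attached MFU (or -1 for internal nodes)
    char  has_children;  // possession of children
    float pos[3];        // position in space
};

// Pack and broadcast the tree skeleton from the managing node (rank 0).
void broadcast_tree(std::vector<PackedNode>& nodes, MPI_Comm comm) {
    int rank, count = static_cast<int>(nodes.size());
    MPI_Comm_rank(comm, &rank);
    MPI_Bcast(&count, 1, MPI_INT, 0, comm);       // first tell everyone the size
    if (rank != 0) nodes.resize(count);
    // Sending the structs as raw bytes is the simplest flattening; a portable
    // implementation would rather register an MPI derived datatype.
    MPI_Bcast(nodes.data(), count * static_cast<int>(sizeof(PackedNode)),
              MPI_BYTE, 0, comm);
    // Receivers now rebuild the binary trees and recompute the remaining
    // characteristics (radii, pressures, flows) locally.
}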
Fig. 5. The second part of the algorithm responsible for MFUs perfusion
The second part of the algorithm is responsible for the calculations occurring between individual subcycles. First, the managing node creates the list of new MFUs which can be added to the vascular network. This step could also be modeled in a parallel environment, but the time it needs is negligible in comparison to the perfusion time. Next, the managing node makes no attempt to find vessels for the new MFUs itself; they are sent to the computational nodes. The MFU migration is simpler and less time consuming than migrating the entire trees, but to minimize the message within the message passing framework we again choose only the essential information, namely: position in space, blood flow and MFU class. When a computational node receives the message, it tries to find the closest vessels and the optimal point to connect the received MFU to the vascular network, thus simulating the perfusion process. If the search ends with success, the computational node does not permanently perfuse the new MFU but sends the parameters of the bifurcation to the managing node. The message contains only the position of the bifurcation point in space and the numbers of the perfusing vessels. Next, when the managing node receives the parameters of the bifurcation, it checks whether there have been any other changes in its trees since the last contact with the sender of the message. If there have been, it inspects the changed vessels. If at least one changed vessel is on the list of vessels chosen to perfuse the new element, the MFU is rejected. Otherwise, the managing node permanently joins the new MFU and broadcasts all new changes that occurred between the previous and the present contact with the sender of the message. The migration of changes is less time consuming than sending the entire trees; the full vascular trees are only sent at the beginning of each subcycle. If there are more MFUs to perfuse, the managing node sends the next one. The whole vascular system has to be sent at the beginning of each subcycle because there are several other processes (e.g. degeneration and growth of the organ shape) between cycles and subcycles which have an influence on the entire trees. The algorithm of creating new MFUs ensures that the number of rejected MFUs is small. Moreover, the rejected MFUs leave empty space between vessels and other MFUs, which increases the probability that the vascular network
growth algorithm will choose more macroscopic functional units in the next subcycle.
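The accept/reject protocol at the managing node might look like the following sketch; the tags, message layouts and helper names are assumptions made for illustration only.

#include <mpi.h>

// Illustrative message tags and helpers; not taken from the original code.
enum { TAG_PROPOSAL = 1, TAG_UPDATE = 2, TAG_NEXT_MFU = 3 };

struct Proposal { float position[3]; int vessel_ids[3]; };  // one perfusing vessel per tree

bool conflicts_with_recent_changes(const Proposal& p, int worker);
void apply_permanent_perfusion(const Proposal& p);
void send_changes_since_last_contact(int worker, MPI_Comm comm);   // uses TAG_UPDATE
bool send_next_mfu_if_any(int worker, MPI_Comm comm);              // uses TAG_NEXT_MFU

// Managing-node loop for one subcycle: collect proposals until no MFUs remain.
void manage_perfusion(int pending_mfus, MPI_Comm comm) {
    while (pending_mfus > 0) {
        Proposal p;
        MPI_Status st;
        MPI_Recv(&p, sizeof(Proposal), MPI_BYTE, MPI_ANY_SOURCE,
                 TAG_PROPOSAL, comm, &st);
        --pending_mfus;
        if (!conflicts_with_recent_changes(p, st.MPI_SOURCE))
            apply_permanent_perfusion(p);          // accept: join the MFU permanently
        // Rejected MFUs are simply dropped; their space is reused in later subcycles.
        send_changes_since_last_contact(st.MPI_SOURCE, comm);
        if (send_next_mfu_if_any(st.MPI_SOURCE, comm)) ++pending_mfus;
    }
}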
4 Experimental Results
This section contains a preliminary experimental verification of the proposed algorithm in a parallel environment. The presented results were obtained from many experiments. We used the default settings of the sequential version (about 12000 MFUs). Moreover, we checked the behavior of the proposed solution for large configurations with about 50000 MFUs and consequently about 300000 vessels (Fig. 6 shows a visualization of one of the obtained vascular systems).
Fig. 6. Visualization of the adult liver (about 50000 MFUs and 300000 vessels): a) portal veins, b) main hepatic arteries, portal veins and hepatic veins with liver shape
In the experiments a cluster of sixteen SMP servers running Linux 2.6 and connected by an InfiniBand network was used. Each server was equipped with two 64-bit Xeon 3.2 GHz CPUs with 2 MB L2 cache, 2 GB of RAM and an InfiniBand 10 Gb/s HCA connected to a PCI-Express port. We used MVAPICH version 0.9.5 [9] as the MPI implementation [4]. Figure 7a presents the obtained mean speedup. It is far from linear, but from a practical point of view it is very satisfactory. Usually the process of obtaining an organ with about 50000 MFUs on a single-processor machine (3.2 GHz CPU and 2 GB of RAM) takes about 24 hours, whereas the parallel version can simulate it approximately 8 times faster (with 16 processors). Moreover, it is worth noting that the speedup does not decrease significantly with the increasing number of MFUs; for 16 nodes it still varies around 8. In order to explain the presented results in more depth, a detailed time-sharing figure is presented for the case of 50000 MFUs (see Fig. 7b). It is clearly visible that the most time consuming operation is the perfusion process. The degeneration phase takes only a small part of the time necessary to develop the adult organ. The time connected with the MPI operations (e.g. sending and receiving messages) is insignificant in comparison to the time of the other operations; in our case it was always less than 1% of the whole simulation time. Moreover,
Fig. 7. Efficiency of the parallel implementation: a) mean speedup for many configurations of MFUs (absolute speedup vs. number of processors, compared with the theoretical linear speedup), b) detailed time-sharing figure for the case of about 50000 MFUs (organ growth time in hours split into perfusion, degeneration, tree uniformity and MPI time)
we can observe that the algorithm also spends a short period of time on the processes connected with maintaining the uniformity of the vascular system (e.g. selecting and gathering changes at the managing node and applying changes at the computational nodes). Furthermore, it should be mentioned that we changed the memory organization at the computational nodes. To optimize the time connected with tree rebuilding we introduced a continuous memory representation, which decreased the time needed to allocate and deallocate memory. Prior to this mechanism the mean speedup for 16 nodes was approximately 6.7.
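One way to obtain such a continuous representation is to keep all tree nodes in a single preallocated array and to address them by index instead of by pointer; the sketch below is an assumption about this idea, not the authors' data structure.

#include <vector>

// Contiguous pool of vessel-tree nodes addressed by index instead of pointers.
// Rebuilding a received tree then only appends into one preallocated buffer,
// avoiding per-node allocation and deallocation.
struct VesselNode {
    int   left  = -1;      // index of left daughter, -1 if none
    int   right = -1;      // index of right daughter, -1 if none
    float pos[3] = {0.0f, 0.0f, 0.0f};
    float radius = 0.0f, flow = 0.0f, pressure = 0.0f;
};

class VesselPool {
public:
    explicit VesselPool(std::size_t expected) { nodes_.reserve(expected); }
    int add(const VesselNode& n) {            // returns the index of the new node
        nodes_.push_back(n);
        return static_cast<int>(nodes_.size()) - 1;
    }
    VesselNode&       operator[](int i)       { return nodes_[i]; }
    const VesselNode& operator[](int i) const { return nodes_[i]; }
    void clear() { nodes_.clear(); }          // reuse the same storage next subcycle
private:
    std::vector<VesselNode> nodes_;
};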
5 Conclusion
In this paper, a parallel algorithm to model vascular network growth was investigated. It was shown that the presented solution significantly reduces the computation time, which increases the possibility to create more elaborate and precise virtual organs. This can be very useful when they are used in CT or MRI simulators. Moreover, this solution can be treated as a first step to bring the presented model closer to reality, in which the analogous processes of vascular network growth occur in parallel. The presented algorithm is still under development. We see many possible directions for future improvements and at least a few different approaches. First, we would like to introduce a more decentralized solution. We also plan to implement the perfusion process in the framework of shared-memory parallel programming (OpenMP), which would make it possible to reduce the time connected with tree rebuilding, waiting for new MFUs and tree uniformity.
Acknowledgments. This work was supported by the grant W/WI/5/08 from Bialystok Technical University.
References
1. Bézy-Wendling, J., Bruno, A.: A 3-D dynamic model of vascular trees. Journal of Biological Systems 7(1), 11–31 (1999)
2. Bézy-Wendling, J., Krętowski, M., Mescam, M., Jurczuk, K., Eliat, P.-A.: Simulation of hepatocellular carcinoma in MRI by combined macrovascular and pharmacokinetic models. In: 4th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 1272–1275. IEEE Press, Washington (2007)
3. Demongeot, J., Bézy-Wendling, J., Mattes, J., Haigron, P., Glade, N., Coatrieux, J.-L.: Multiscale modeling and imaging: the challenges of biocomplexity. Proceedings of the IEEE 91, 1723–1737 (2003)
4. Juhasz, Z., Kacsuk, P., Kranzlmuller, D.: Distributed and Parallel Systems: Cluster and Grid Computing. Springer, Heidelberg (2005)
5. Kamiya, A., Togawa, T.: Optimal branching structure of the vascular trees. Bulletin of Mathematical Biophysics 34, 431–438 (1972)
6. Karch, R., Neumann, F., Neumann, M., Schreiner, W.: Staged growth of optimized arterial model trees. Annals of Biomedical Engineering 28, 495–511 (2000)
7. Krętowski, M., Bézy-Wendling, J., Coupe, P.: Simulation of biphasic CT findings in hepatic cellular carcinoma by a two-level physiological model. IEEE Trans. on Biomedical Engineering 54(3), 538–542 (2007)
8. Krętowski, M., Rolland, Y., Bézy-Wendling, J., Coatrieux, J.-L.: Physiologically based modeling for medical image analysis: application to 3D vascular networks and CT scan angiography. IEEE Trans. on Medical Imaging 22(2), 248–257 (2003)
9. Liu, J., Wu, J., Panda, D.K.: High performance RDMA-based MPI implementation over InfiniBand. Int. Journal of Parallel Programming 32(3), 167–198 (2004)
10. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge (1992)
11. Schreiner, W., Buxbaum, P.F.: Computer-optimization of vascular trees. IEEE Trans. on Biomedical Engineering 40(5), 482–491 (1993)
12. Sherlock, S., Dooley, J.: Diseases of the Liver and Biliary System. Blackwell Science, Malden (2002)
13. Zamir, M.: Arterial branching within the confines of fractal L-system formalism. Journal of General Physiology 118, 267–275 (2001)
Some Remarks about Modelling of Annular Three-Layered Plate Structure Dorota Pawlus Faculty of Mechanical Engineering and Computer Science, University of Bielsko-Biała, Willowa 2, 43-309 Bielsko-Biała, Poland
Abstract. The influence of the examined mesh structure on the computational results is evaluated in this paper. Annular plates with a three-layered cross-section and a soft core are analysed in the range of their critical behaviour after the loss of static and dynamic stability. Several meshes of plate models which can be applied in numerical calculations are presented. The results obtained with the finite element method are compared with the results for plates solved using the finite difference method. The analysis covers a wide range of the examined problems, taking into account not only the global forms of critical plate deformations but also local ones, and analysing different plate buckling forms with several transverse waves in the circumferential direction. In the discussion, the sensitivity of the presented plate models depending on the problem they are applied to is pointed out. Keywords: mesh model, sandwich plate, static, dynamic stability, FEM, FDM.
1 Introduction
The different critical and overcritical behaviours of sandwich plate structures under lateral loads require building a proper computational model. The geometrical and material parameters of the component layers essentially determine the structure's behaviour, especially when there are significant differences among them. The widely examined structure of a three-layered plate with a soft, thick foam core is exactly such an object, whose computational model shows significant sensitivity to the parameters describing it. The way of building such a model for a plate with annular shape, which enables the solution of the static and dynamic plate problems, together with an indication of its computational sensitivity, is considered in this paper. Among recent works on the axisymmetric dynamic stability problems of sandwich plates, [1] and [2] can be mentioned.
2 Problem Formulation
The problem undertaken in this paper consists in the evaluation of the influence of the model structure of the three-layered plate on the computational results. The annular
plate with the soft core, compressed in the facings plane by loads acting on their inner and/or outer perimeters, is the subject of the analysis. The scheme of plate loading is presented in Fig. 1. Such loads cause the loss of plate stability, characterized by critical parameters such as the critical static or dynamic load, the form of buckling and the critical deflection. These quantities are analysed in this work for plate examples differing in the model built by means of the finite element method. The examined, exemplary plate has slidably clamped edges, a symmetrical cross-sectional structure and the following material and geometrical parameters:
- the inner radius ri=0.2 m;
- the outer radius ro=0.5 m;
- the facing thickness (equal for each facing) h'=0.0005 m or h'=0.001 m;
- the core thickness h2=0.005 m, 0.02 m, 0.06 m;
- the steel facing material with Young's modulus E=2.1·10^5 MPa, Poisson's ratio ν=0.3 and mass density μ=7.85·10^3 kg/m^3;
- two polyurethane foam core materials: one with Kirchhoff's modulus G2=5 MPa and mass density μ2=64 kg/m^3 [3], the other with G2=15.82 MPa and μ2=93.6 kg/m^3 [4]; the Poisson's ratio is ν=0.3 and the values of Young's modulus, E2=13 MPa and E2=41.13 MPa respectively, are calculated treating the foam material as isotropic.
Fig. 1. The scheme of the plate loaded: a) on the inner perimeter, b) on the outer perimeter
The results obtained with the finite element method are compared with the computational results for plates solved using the finite difference method. In building the plate models solved by both methods, finite element (FEM) and finite difference (FDM), the basic distribution of stresses between the plate layers is used: the facings carry the normal stresses and the core carries the shear. Such a load distribution between the layers of a plate with a soft core is the assumption of the classical theory of sandwich plates. In work [5] the proposal of applying mixed shell and solid elements in the mesh structure was presented; it has been used in the modelling of the structure of the analysed plate. The application of shell elements with the COMPOSITE option to specify a shell cross-section does not assure proper results in the stability problem of plates with a soft core. Calculations of plate models built of shell elements, for elastic core characteristics corresponding to the facings material parameters, were presented in work [6].
2.1 Plate Models Built in the Finite Element Method
The calculations were carried out in the ABAQUS system at the Academic Computer Center CYFRONET-CRACOW (KBN/SGL_ORIGIN_2000/PŁódzka/030/1999) [7]. The model in the form of the full annulus of the plate is the basic model accepted in the problem analysis. This model is composed of 9-node 3D shell elements and 27-node 3D solid elements building the facings and core meshes, respectively. The mesh of the model is presented in Fig. 2a.
Fig. 2. The forms of plate models: a) full annulus, b) annular sector with single or double core layer, c) built of axisymmetric elements with single, double and quaternary core layer (the facings are meshed with shell elements and the core with solid elements)
The examinations of the selected forms of critical plate deformations have been carried out for models in the form of an annular sector being a 1/8 or 1/6 part of the full plate perimeter. The facings mesh is again built of 9-node 3D shell elements and the core mesh of 27-node 3D solid elements. The solid elements can be arranged in a single or double layer in the core mesh. These models are presented in Fig. 2b. For some plate examples, whose minimal value of the critical load pcr (important in the stability problem) corresponds to the regular, axially-symmetrical form of buckling, the mesh can be simplified to a form where only axisymmetric elements are used. The cross-sectional structure presented in Fig. 2c is composed of 3-node shell and 8-node solid elements arranged in a single, double or quaternary core mesh layer. The regular, axially-symmetrical form of plate buckling corresponds to the minimal value of the critical load for plates slidably clamped and compressed on the inner perimeter [8]. Each of the analysed plate models uses a surface contact interaction (the TIE option) to connect the elements of the facings mesh with the elements of the core mesh. The proper symmetry conditions on the partitioned edges have been formulated for the annular sector models. The boundary conditions with the limitation of relative radial displacements in the slidably clamped plate edges are imposed on the outer
and inner plate edges. The introduction of an additional condition for the plate layers, connecting them by the equal deflection, increases the number of examined plate models.
2.2 Solution to the Plate Stability Problem Using the Finite Difference Method
The solution uses the classical theory of sandwich plates with the broken line hypothesis [8]. Equal deflections of the plate layers have been assumed. The basic elements in the solution to the static stability problem are as follows:
- formulation of the equilibrium equations for each plate layer,
- determination of the equations of radial and circumferential core deformation,
- formulation of the physical relations of the material of the plate layers,
- determination, on the strength of the equations of the sectional forces and moments and the suitable equilibrium equations, of the formulas for the resultant radial and circumferential forces and the resultant membrane radial, circumferential and shear forces expressed by means of the introduced stress function,
- formulation of the basic differential equation describing the deflections of the analysed plate by using the equilibrium equations of the projections, in the 3-direction, of the forces loading the plate layers,
- determination of the additional equilibrium equations of projections in the radial and circumferential directions of the forces loading the undeformed outer plate layers,
- determination of the boundary conditions and dimensionless quantities,
- assumption that the stress function is a solution to the disk state,
- application of the finite difference method for the approximation of the derivatives with respect to the radius and the solution of the eigenvalue problem with the calculation of the minimal value of p* as the critical static load pcr:
\det\left[\left(\mathrm{MAP} + \mathrm{MAD}\cdot\mathrm{MATD} + \mathrm{MAG}\cdot\mathrm{MATG}\right) - p^{*}\,\mathrm{MAC}\right] = 0 \qquad (1)
where p* = p/E, and MAP, MAC, MAD, MAG, MATD, MATG are matrices whose elements are composed of the geometric and material plate parameters, the length b of the interval in the finite difference method and the number m of buckling waves. The detailed description of the problem solution has been presented in work [9].
The results of the plate dynamic stability calculations presented in this work have been limited to the regular, axially-symmetrical form of critical plate deformation. This form corresponds to the minimal value of the critical load for the analysed plates compressed on the inner facings perimeters. The solution then requires the formulation of:
- the dynamic equilibrium equations,
- the description of the core deformation taking into account the plate imperfection,
- the determination of the initial loading conditions,
- the assumption of the form of plate predeflection,
- the formulation of the system of equations using the finite difference method.
The description of the solution is presented in work [10]. The numerical calculations in the finite difference method require a proper choice of the number of discrete points to keep the accuracy of the results within a 5% technical error. The calculations were carried out for 14 discrete points.
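As a reminder of how such radial derivatives are typically discretised on a uniform grid of spacing b, a standard central-difference scheme reads (this is only an illustration; the paper's own discretisation in dimensionless variables may differ in detail):

\frac{dw}{d\rho}\Big|_{i} \approx \frac{w_{i+1}-w_{i-1}}{2b}, \qquad \frac{d^{2}w}{d\rho^{2}}\Big|_{i} \approx \frac{w_{i+1}-2w_{i}+w_{i-1}}{b^{2}},

where w_i denotes the deflection at the i-th of the discrete points distributed along the dimensionless radius ρ.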
3 Discussion of Computational Results
The discussion of the computational results of the analysed plates is presented separately for the static and the dynamic stability problems. A distinction is also made between plates whose core thickness is treated as medium (h2=0.005 m and 0.02 m) and thick (h2=0.06 m).
3.1 Critical Static Loads
The observed buckling forms of the analysed plates loaded on the inner edge are regular and axially-symmetrical, whereas the plates compressed on the outer perimeter lose their stability with different numbers m of transverse waves in the circumferential direction. Essentially, the global, quasi-Eulerian forms of critical plate buckling are observed. For plates with a thick core, primary local forms of the loss of stability are expected, in which the critical deformations of the plate layers (the core, in particular) do not occur with equal deflections.
3.1.1 Analysis of Plates with Medium Core
The computational results of the different models of plates loaded on the inner edge are presented in Table 1. The critical form of deformation is regular and axially-symmetrical for all plate examples; for each plate model it is presented in Fig. 3.

Table 1. Values of the critical static stress pcr [MPa] of plates loaded on the inner edge of the facings

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | model built of axisymmetric elements | model built of axisymmetric elements 1) | annular sector (1/8 part) | annular sector (1/8 part) 1) | annular sector (1/8 part) 2) | FDM
0.0005/0.005/5.0   | 57.84  | 57.48  | 57.52  | 57.73  | 57.82  | 57.78  | 64.12
0.001/0.005/5.0    | 64.08  | 64.00  | 64.00  | 62.94  | 63.15  | 63.14  | 75.61
0.0005/0.02/5.0    | 170.50 | 168.32 | 172.27 | 168.04 | 173.94 | 169.43 | 165.51
0.001/0.02/5.0     | 143.77 | 143.20 | 144.16 | 143.22 | 144.17 | 143.17 | 150.29
0.0005/0.005/15.82 | 137.93 | 136.49 | 136.63 | 137.41 | 137.66 | 137.46 | 149.91
0.001/0.005/15.82  | 120.30 | 119.92 | 119.94 | 119.21 | 119.43 | 119.38 | 149.34
0.0005/0.02/15.82  | 449.24 | 434.56 | 457.95 | 435.62 | 462.66 | 438.89 | 437.54
0.001/0.02/15.82   | 326.41 | 324.01 | 328.30 | 323.93 | 330.41 | 325.67 | 338.94

1) Plate layers connected with the condition of the equal deflection. 2) Facings connected with the condition of the equal deflection.
The consistency of the results of all FEM plate models is observed. Good compatibility between the critical loads calculated with the finite difference and finite element methods is particularly observed for plates with the core thickness equal to h2=0.02 m. The values of critical loads of plate models with the condition of the equal
Fig. 3. Regular axially-symmetrical form of plate buckling for: a) full annulus plate model, b) annular plate sector, c) model built of axisymmetric elements
layers deflection are generally slightly higher than the values obtained for plates without this condition. The increase of these values above the values calculated with the FDM plate model appears for plates with thin facings h'=0.0005 m and a thicker core (h2=0.02 m). The results for the plates compressed on the outer perimeter are presented in Table 2, which contains the minimal values of the critical loads pcr and the number of buckling waves m. Some forms of plate buckling are presented in Fig. 4.

Table 2. Values of the critical static stress pcr [MPa] of plates loaded on the outer edge of the facings

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | full annulus plate model 1) | FDM
0.0005/0.005/5.0   | 19.16 (m=7)   | 19.18 (m=7)   | 22.37 (m=8)
0.001/0.005/5.0    | 16.48 (m=5)   | 16.49 (m=5)   | 20.52 (m=5)
0.0005/0.02/5.0    | 66.56 (m=10)  | 67.39 (m=9)   | 69.49 (m=12)
0.001/0.02/5.0     | 43.71 (m=7)   | 43.97 (m=7)   | 46.95 (m=7)
0.0005/0.005/15.82 | 52.03 (m=9)   | 52.10 (m=8)   | 61.51 (m=9)
0.001/0.005/15.82  | 35.04 (m=6)   | 35.06 (m=6)   | 46.53 (m=6)
0.0005/0.02/15.82  | 193.15 (m=12) | 198.65 (m=12) | 200.62 (m=18)
0.001/0.02/15.82   | 115.10 (m=9)  | 116.31 (m=8)  | 125.11 (m=9)

1) Plate layers connected with the condition of the equal deflection.
Fig. 4. The forms of critical plate deformations (for m=5, m=9 and m=12 circumferential waves)
The results show a slight increase in the values of the critical loads of plates with the condition of the equal layers deflection. In such cases the plate buckling form can change, with a decrease in the number m of waves. Additionally, Table 3 presents the results of selected plate examples obtained for the annular sector plate models. These values of the critical loads are suitably higher than the results obtained for the full annulus plate model.

Table 3. Critical static loads pcr [MPa] for plate examples compressed on the outer edges, with the results of the annular sector plate models

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | full annulus plate model 1) | annular sector (1/6 part) | annular sector (1/8 part) | annular sector (1/8 part) 1) | FDM
0.0005/0.005/15.82 | 52.03 (m=9)  | 52.10 (m=8)  | 52.75 (m=9)  | 52.63 (m=8)  | 55.26 (m=8)  | 61.51 (m=9)
0.001/0.005/15.82  | 35.04 (m=6)  | 35.06 (m=6)  | 36.74 (m=6)  | -            | -            | 46.53 (m=6)
0.001/0.02/15.82   | 115.10 (m=9) | 116.31 (m=8) | 123.23 (m=9) | 124.43 (m=8) | 116.65 (m=8) | 125.11 (m=9)

1) Plate layers connected with the condition of the equal deflection.
3.1.2 Analysis of Plates with Thick Core
The results of various models of plates compressed on the inner perimeter are presented in Table 4. The results marked by * concern the plate models whose critical deformation does not have the regular, axially-symmetrical form. For all examined models, a decrease in the values of the critical loads and a change in the form of the critical deformation are observed for the plates with thin facings h'=0.0005 m and a thick core h2=0.06 m, particularly for the core Kirchhoff's modulus equal to G2=15.82 MPa. Example forms of critical deformations are presented in Fig. 5.

Table 4. Values of the critical loads pcr [MPa] of plates with thick core loaded on the inner edge

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | model built of axisymmetric elements | model built of axisymmetric elements 1) | annular sector (1/8 part) 2) | model built of axisymmetric elements 3)
0.0005/0.06/5.0   | 347.54   | 345.48   | 440.20  | 324.10 * | 315.53
0.001/0.06/5.0    | 293.90   | 292.68   | 317.34  | 293.01   | 288.45
0.0005/0.06/15.82 | 791.37 * | 774.10 * | 1238.81 | 670.41 * | 684.19 *
0.001/0.06/15.82  | 689.10 * | 686.66   | 796.31  | 677.64   | 659.77

h' [m]/h2 [m]/G2 [MPa] | annular sector (1/8 part) | annular sector (1/8 part) 1) | annular sector (1/8 part) 3) | model built of axisymmetric elements 4) | FDM
0.0005/0.06/5.0   | 329.58   | 445.26  | 251.81 * | 309.47   | 406.98
0.001/0.06/5.0    | 291.53   | 319.62  | 279.29   | 287.84   | 312.53
0.0005/0.06/15.82 | 718.51 * | 1252.27 | 511.49 * | 649.49 * | 1191.70
0.001/0.06/15.82  | 676.11   | 804.84  | 586.19 * | 655.17   | 749.53

1) Plate layers connected with the condition of the equal deflection. 2) Facings connected with the condition of the equal deflection. 3) The core mesh built of two layers of solid elements. 4) The core mesh built of four layers of solid elements.
The results obtained for plates whose layers are connected by the condition of the equal deflection are an obvious exception: the values of the loads and the global buckling forms correspond to the results obtained for the plate models calculated with FDM. It can be suspected that these values are too high. The values of the critical loads of the plate models built of two or four layers of core elements are lower than the values obtained for the models with a single core layer. In particular, an essential decrease in the values of the critical loads is observed for the plate model in the form of the annular sector with the double core layer.
Fig. 5. The forms of buckling of plate models: a) full annulus (pcr=791.37 MPa), b) annular sector (pcr=718.51 MPa), c) model built of axisymmetric elements (pcr=774.10 MPa), d) model built of two layers of core axisymmetric elements (pcr=684.19 MPa)
The sensitivity of plate models with thin facings and a thick core is also observed for plates compressed on the outer perimeter. Example results are presented in Table 5.

Table 5. Values of the critical loads pcr [MPa] of plates with thick core loaded on the outer edge

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | full annulus plate model 1) | annular sector (1/8 part) 1) | FDM
0.0005/0.06/5.0   | 149.37 *      | 189.45 (m=12) | 187.15 (m=20) | 185.68 (m=17)
0.001/0.06/5.0    | 102.86 (m=9)  | 111.60 (m=8)  | 124.92 (m=8)  | 113.39 (m=9)
0.0005/0.06/15.82 | 402.44 *      | 571.69 (m=14) | 519.03 (m=24) | 542.70 (m=27)
0.001/0.06/15.82  | 274.23 (m=12) | 318.34 (m=10) | 343.26 (m=12) | 320.63 (m=12)

1) Plate layers connected with the condition of the equal deflection.
Also for plates loaded on the outer edge, the values of the critical loads of the models with connected layers seem to be too high. The results obtained for the annular sector plate model with connected layers are, both in the values of the critical loads and in the forms of buckling, closer to the results obtained with the finite difference method.
3.2 Critical Dynamic Loads
The calculations of the critical dynamic loads have been carried out for plates compressed on the inner edges of the facings with a linear, rapidly increasing stress expressed by the formula:
p = s\,t, \qquad (2)

where p is the compressive stress, s the rate of plate loading growth and t the time.
The rate of plate loading growth s is the same for each numerically analysed plate. The value of the rate s results from the equation s = K7·pcr, where the parameter K7 is accepted as K7=20 and, solving the eigenproblem, the critical stress is pcr=217.3 MPa, calculated for the plate with facing thickness h'=0.001 m, core thickness h2=0.01 m and core Kirchhoff's modulus G2=15.82 MPa. The calculations are carried out for the regular, axially-symmetrical form of plate buckling; the plate predeflection has this form, too. As the criterion of the loss of plate stability, the criterion presented in work [11] was adopted. According to this criterion, the loss of plate stability occurs at the moment of time when the speed of the point of maximum deflection reaches its first maximum value. The results for some plate examples obtained using the finite element method, in the form of time histories of the plate maximum deflection and the velocity of deflection, are
Fig. 6. Time histories of deflection and velocity of deflection for plates with parameters: a) h'=0.001 m, h2=0.005 m, G2=5 MPa, b) h'=0.001 m, h2=0.06 m, G2=5 MPa

Table 6. Values of the critical dynamic loads pcrdyn [MPa] and critical deflections wcr·10^-3 [m] of plates loaded on the inner edge (each cell gives pcrdyn / wcr)

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | model built of axisymmetric elements | annular sector (1/8 part) | annular sector (1/8 part) 3) | FDM
0.001/0.005/5.0   | 91.27 / 4.25  | 86.93 / 3.28  | 86.93 / 4.38  | 86.92 / 4.38  | 98.88 / 3.92
0.001/0.02/5.0    | 152.11 / 3.13 | 147.78 / 2.94 | 147.78 / 3.56 | 147.76 / 3.57 | 159.30 / 3.77
0.001/0.06/5.0    | 304.22 / 4.16 | 304.22 / 4.35 | 299.87 / 4.80 | 291.18 / 4.42 | 321.65 / 5.21
0.001/0.005/15.82 | 136.90 / 3.08 | 139.10 / 3.67 | 134.74 / 4.31 | 134.73 / 4.31 | 166.69 / 3.87
0.001/0.02/15.82  | 330.30 / 2.99 | 326.0 / 3.08  | 326.0 / 3.88  | 325.95 / 3.9  | 346.64 / 4.25

3) The core mesh built of two layers of solid elements.
presented in Fig. 6. Table 6 shows the values of the critical dynamic loads and the critical deflections for the plate models built in the finite element and finite difference methods. The results obtained with FEM indicate mutually good consistency. For plates with a thicker core, h2=0.02 m and 0.06 m, these results also correspond to the results obtained with the finite difference method. The major fluctuations are observed for the plate critical deflections. The calculations show that in the dynamic problem, in the range of the global buckling observations, the influence of the plate model structure built in the finite element method on the final results is not as essential as in the static analysis.
4 Conclusions
The results obtained for the presented models, which can be applied in computational plate examinations, indicate a certain sensitivity of their structures. The observed differences in the values of the critical loads, particularly for plates with a thick core, are in the range of dozens of MPa; therefore, these differences are significant. This problem particularly concerns the static stability issue, when critical forms other than the global ones can occur. The study of the mesh structure of these plate models seems to be especially important. The lowest values of the critical loads are obtained for plate models with the core mesh composed of several layers of solid elements. The computational results of these models can be an essential complement to the plate examinations carried out for the basic model in the form of the full annulus. Comparing the computational results of plates calculated with the two methods, finite element and finite difference, a compatibility of the results can essentially be established for plate models with a medium core. For plates with a thick core, consistency of the results is observed for those plate models whose critical deformation has the global, quasi-Eulerian form; then the values of the critical static loads may really be too high.
References
1. Wang, H.J., Chen, L.W.: Axisymmetric dynamic stability of sandwich circular plates. Composite Structures 59, 99–107 (2003)
2. Chen, Y.R., Chen, L.W., Wang, C.C.: Axisymmetric dynamic instability of rotating polar orthotropic sandwich annular plates with a constrained damping layer. Composite Structures 73(2), 290–302 (2006)
3. Majewski, S., Maćkowski, R.: Creep of Foamed Plastics Used as the Core of Sandwich Plate. Engineering and Building Industry (Inżynieria i Budownictwo) 3, 127–131 (1975) (in Polish)
4. Romanów, F.: Strength of Sandwich Constructions. WSI, Zielona Góra, Poland (1995) (in Polish)
5. Kluesener, M.F., Drake, M.L.: Mathematical Modelling. Damped Structure Design Using Finite Element Analysis. Shock and Vibration Bulletin 52, 1–12 (1982)
6. Pawlus, D.: Homogeneous and sandwich elastic and viscoelastic annular plates under lateral variable loads. In: Proceedings of the Third International Conference on Thin-Walled Structures, pp. 515–522. Elsevier Science, Amsterdam (2001)
7. Hibbitt, Karlsson and Sorensen, Inc.: ABAQUS/Standard User's Manual, version 6.1 (2000)
8. Volmir, C.: Stability of Deformed Systems. Science, Moskwa (1967) (in Russian)
9. Pawlus, D.: Solution to the Static Stability Problem of Three-Layered Annular Plates With a Soft Core. Journal of Theoretical and Applied Mechanics 44(2), 299–322 (2006)
10. Pawlus, D.: Dynamic Stability Problem of Three-Layered Annular Plate under Lateral Time-Dependent Load. Journal of Theoretical and Applied Mechanics 43(2), 385–403 (2005)
11. Volmir, C.: Nonlinear Dynamics of Plates and Shells. Science, Moskwa (1972) (in Russian)
Parallel Quantum Computer Simulation on the CUDA Architecture Eladio Gutierrez, Sergio Romero, Maria A. Trenas, and Emilio L. Zapata Department of Computer Architecture, University of Malaga, 29071 Malaga, Spain {eladio,sromero,maria,ezapata}@ac.uma.es
Abstract. Due to their increasing computational power, modern graphics processing architectures are becoming more and more popular for general purpose applications with high performance demands. This is the case of quantum computer simulation, a problem with high computational requirements both in memory and in processing power. When dealing with such simulations, multiprocessor architectures are an almost obligatory tool. In this paper we explore the use of the new NVIDIA CUDA graphics processor architecture in the simulation of some basic quantum computing operations. This new architecture is oriented towards a more general exploitation of the graphics platform, allowing it to be used as a parallel SIMD multiprocessor. In this direction, some implementation strategies are proposed, showing that the effectiveness of the codes depends on the right exploitation of the underlying memory hierarchy.
1 Introduction
Contrary to classical computers, quantum computers are devices that process information on the basis of the laws of quantum physics. Due to this fact, they could provide an efficient implementation of different algorithms with respect to both computing time and storage requirements. This way, they would be able to solve some problems of non-polynomial complexity in a much smaller time [10]. Although this approach is quite promising, at the present time it is still necessary to face certain limitations. On the one hand, existing technology only allows the construction of quantum computers of very reduced dimensions [7], and on the other hand, only a small number of effective algorithms [5,8,13] are known. Nevertheless, the analysis of this computational model constitutes a topic of great interest for physicists, computer scientists and engineers. Actually, a quantum computer can be considered as a hardware accelerator of the classical processor, from which it receives the orders for the resolution of a concrete problem [7], as shown in Fig. 1. As it is not possible to know the inner state of a quantum computer, according to the laws that govern it, outputs must be obtained by means of a measurement, issuing a result with a certain probability. Quantum parallelism is one of the sources of the power of quantum computers, as it allows simultaneous operations to be performed on an exponential set of
superposed states. This causes quantum computer simulation to demand high computational power; parallelism is thus a suitable tool for mitigating such requirements [7,11]. The simulation of quantum computers not only requires a high computational effort but also presents data access patterns with low locality. Several interests arise for this kind of simulation, giving rise to the development of different simulators, both in software [2,3,7,11] and in hardware [6,9,14]. In this paper we show that modern architectures based on Graphics Processing Units (GPU) are suitable to accomplish an efficient simulation of quantum computers. GPUs are devices specialized in graphics algorithms involving very intensive and highly parallel computations which, due to their high computational power, are nowadays also used for general purpose applications. With this purpose, we have implemented several parallel approaches to the simulation of the basic operators of an ideal quantum computer, using the new compute unified device architecture (CUDA) [12], lately released by the GPU manufacturer NVIDIA. Different strategies are explored, looking for the exploitation of data reference locality in this sort of architecture.
2 Quantum Computing
The ideal quantum computer to be simulated follows the model presented in [4], consisting of the successive application of a network of quantum gates to a quantum register with a classical initial state. The quantum bit (qubit) can be imagined as the linear superposition of two homologous classical states, noted |0⟩ and |1⟩ in Dirac notation. The state of a qubit can be represented using a two-dimensional complex vector, with |0⟩ and |1⟩ as the basis. Thus, the state of a qubit can be written as Ψ = α_0|0⟩ + α_1|1⟩, where the coefficients, or amplitudes, verify |α_0|^2 + |α_1|^2 = 1; |α_0|^2 and |α_1|^2 are interpreted as the probability of measuring |0⟩ or |1⟩. In vector notation, we write Ψ = (α_0, α_1)^T, |0⟩ = (1, 0)^T and |1⟩ = (0, 1)^T. A quantum register generalizes the qubit definition. The state of an n-qubit quantum register is determined by the linear superposition of the 2^n possible classical states provided by n bits. Hence the state of a quantum register can be written as Ψ = Σ_{i=0}^{2^n−1} α_i|i⟩ with α_i ∈ C and Σ_{i=0}^{2^n−1} |α_i|^2 = 1, since |α_i|^2 is interpreted as the probability of obtaining |i⟩ when the register is measured. Thus, Ψ belongs to a 2^n-dimensional complex vector space, for which the states |i⟩, 0 ≤ i ≤ 2^n−1, constitute a basis. For example, for n = 3 we will write |6⟩ = |110⟩ = (0 0 0 0 0 0 1 0)^T. By applying the Kronecker tensor product, it is possible to represent the elements of the state space basis of the register as a function of the individual states of the qubits, for example |6⟩ = |110⟩ = |1⟩ ⊗ |1⟩ ⊗ |0⟩. The state of a quantum register evolves according to a transformation, which can be interpreted as an operator U applied to the register state. The laws of quantum physics settle that the operator U must be linear and unitary. It follows that for an n-qubit register, a 2^n×2^n matrix can be found verifying
Fig. 1. A quantum computer model
Fig. 2. A QFT implementation
Fig. 3. Representation of quantum transformations as quantum gate networks
U U* = I, where U* is the matrix U conjugated and transposed, and I is the identity matrix. Usually, this kind of transformation is represented in the manner of Fig. 3(a). As a particular instance, let us consider the application of a transformation to one particular qubit, as shown in Fig. 3(b). In this case, the global transformation will be the tensor product of all the 1-qubit transformations simultaneously applied to each individual qubit. This means that the resulting global transformation U_g is equivalent to also applying the identity transformation to the remaining qubits. If the 1-qubit operator U is applied to the k-th qubit, then:

U_g = I^{\otimes(n-k-1)} \otimes U \otimes I^{\otimes k}, \qquad (1)

i.e. the identity acts on the n−k−1 more significant qubits and on the k less significant ones.
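For small registers, U_g can be built explicitly by Kronecker products, as the following sketch illustrates (this direct construction is our own illustration of Eq. (1); the simulation scheme described below never builds the full 2^n×2^n matrix).

#include <complex>
#include <vector>

using cplx = std::complex<double>;
using Matrix = std::vector<std::vector<cplx>>;  // dense row-major matrix

// Kronecker (tensor) product of two dense matrices.
Matrix kron(const Matrix& a, const Matrix& b) {
    std::size_t ra = a.size(), ca = a[0].size();
    std::size_t rb = b.size(), cb = b[0].size();
    Matrix out(ra * rb, std::vector<cplx>(ca * cb));
    for (std::size_t i = 0; i < ra; ++i)
        for (std::size_t j = 0; j < ca; ++j)
            for (std::size_t k = 0; k < rb; ++k)
                for (std::size_t l = 0; l < cb; ++l)
                    out[i * rb + k][j * cb + l] = a[i][j] * b[k][l];
    return out;
}

// U_g = I^{(n-k-1)} (x) U (x) I^{(k)} for a 1-qubit gate U acting on qubit k.
Matrix embed_1qubit_gate(const Matrix& u, int n, int k) {
    const Matrix I = {{1, 0}, {0, 1}};
    Matrix ug = {{1}};                       // 1x1 identity, neutral element
    for (int q = 0; q < n - k - 1; ++q) ug = kron(ug, I);
    ug = kron(ug, u);
    for (int q = 0; q < k; ++q) ug = kron(ug, I);
    return ug;                               // size 2^n x 2^n, i.e. O(4^n) memory
}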
The transformation applied to a single qubit can be interpreted as a unitary quantum gate of order 2×2. Table 1 presents several well-known transformations. As an example, the Pauli transformation X = |0⟩⟨1| + |1⟩⟨0| projects the component |0⟩ over |1⟩ and vice versa, so that its application to a classical state 0 or 1 is equivalent to the logic operator NOT. The generalization to gates with more than one qubit is straightforward, resulting in an associated matrix of order 2^n×2^n for n qubits. A quantum computer can be thought of as a quantum device on which a sequence (or network) of transformations is applied successively to the state of a quantum register [4]. Different minimal universal sets of gates have been proposed, such that any n-qubit transformation can be expressed as a network of these gates. It is proven that a complete set can be built from the gates Φ(δ), Rz(α), Ry(θ) and CNOT (2 qubits), as stated in [1]. A remarkable transformation is the Quantum Fourier Transform (QFT). The QFT [10] is a key element of some quantum algorithms like the integer
Table 1. Inventory of quantum computer gates

1-qubit transformations:
- Identity: I = |0⟩⟨0| + |1⟩⟨1|
- Pauli X: X = |0⟩⟨1| + |1⟩⟨0|
- Pauli Y: Y = j|1⟩⟨0| − j|0⟩⟨1|
- Pauli Z: Z = |0⟩⟨0| − |1⟩⟨1|
- Hadamard: H = (1/√2)(X + Z)
- y-axis rotation: Ry(θ) = cos(θ/2) I + sin(θ/2) Y
- z-axis rotation: Rz(α) = e^{jα/2}|0⟩⟨0| + e^{−jα/2}|1⟩⟨1|

2-qubit transformations:
- Controlled NOT: CNOT = |0⟩⟨0| + |1⟩⟨1| + |2⟩⟨3| + |3⟩⟨2|
- Controlled Phase Shift: CPh(K) = |0⟩⟨0| + |1⟩⟨1| + |2⟩⟨2| + e^{j2π/K}|3⟩⟨3|
factorization proposed by Shor [13], of great interest in cryptography. An implementation of the QFT is depicted in Fig. 2 in terms of 1-qubit Hadamard gates and 2-qubit controlled-phase gates. The QFT is defined analogously to the classical transform, but a normalization coefficient 1/√(2^n) makes it unitary:

|\Psi\rangle_{out} = \frac{1}{\sqrt{2^{n}}} \sum_{c=0}^{2^{n}-1} \left( \sum_{k=0}^{2^{n}-1} \alpha_{k}^{in}\, e^{j 2\pi c k / 2^{n}} \right) |c\rangle \qquad (2)

3 Elementary Quantum Gate Simulation
The simulation of a quantum computer consists of determining the state of an n-qubit register after the application of a unitary linear transformation. This means that we have to compute the register's state vector |Ψ⟩_out = Σ_{i=0}^{2^n−1} α_i^{out}|i⟩ from the initial state |Ψ⟩_in = Σ_{i=0}^{2^n−1} α_i^{in}|i⟩, that is, to determine the coefficients α_i^{out} of the final state as a function of the coefficients α_i^{in} of the initial state and the unitary matrix U defining the transformation. In general, the application of this unitary transformation requires computations of complexity O(2^{2n}), as the matrix U is of order 2^n×2^n, 2^n being the dimension of the associated vector space. By decomposing this transformation into a set of successive transformations on a lower number of qubits (quantum gates or stages), the effective complexity of the simulation can be reduced. As case studies, the simulation of three basic quantum computing operations is analyzed: a generic gate U over one qubit, the operator U^{⊗n} applied to an n-qubit register, and the QFT.
Simulation of a 1-qubit Gate U. The application of a 1-qubit quantum gate performs the operation |Ψ⟩_out = U_g|Ψ⟩_in, where U_g comes from expression (1) as a function of the 1-qubit transformation U. If we consider that the initial state is a superposition of states Ψ_in = Σ_{i=0}^{2^n−1} α_i^{in}|i⟩, the effect of the transformation on the coefficients α_i^{in} can be determined.
Consider that the 1-qubit transformation U = u_{00}|0⟩⟨0| + u_{01}|0⟩⟨1| + u_{10}|1⟩⟨0| + u_{11}|1⟩⟨1| is applied to the q-th qubit of an n-qubit register, and that the initial state is a classical one, Ψ_in = |i⟩ = |b_{n−1} b_{n−2} ... b_1 b_0⟩, that is, an element of the basis of the state space, where the b_k are the bits of the binary expression of the natural number i. The transformation U over the q-th bit b_q results in:

\Psi_{out} = |b_{n-1}\rangle \otimes \dots \otimes U|b_q\rangle \otimes \dots \otimes |b_0\rangle =
\begin{cases}
u_{00}\,|b_{n-1}...0...b_1 b_0\rangle + u_{10}\,|b_{n-1}...1...b_1 b_0\rangle & \text{if } b_q = 0 \\
u_{01}\,|b_{n-1}...0...b_1 b_0\rangle + u_{11}\,|b_{n-1}...1...b_1 b_0\rangle & \text{if } b_q = 1
\end{cases} \qquad (3)

By means of the bitwise exclusive or (⊕), this can be expressed as:

\alpha_i^{out} = \alpha_{b_{n-1}...b_q...b_1 b_0}^{out} =
\begin{cases}
u_{00}\,\alpha_i^{in} + u_{01}\,\alpha_{i\oplus 2^q}^{in} & \text{if } b_q = 0 \\
u_{10}\,\alpha_{i\oplus 2^q}^{in} + u_{11}\,\alpha_i^{in} & \text{if } b_q = 1
\end{cases} \qquad (4)

This means we can compute the output coefficients from the input ones, but it requires traversing each one of the coefficients, with a complexity of O(2^n). Actually, the coefficients associated to both b_q = 0 and b_q = 1 can be computed simultaneously. This reduces the complexity of the simulation loop to O(2^{n−1}), each iteration performing an effort equivalent to a matrix-vector product of order 2×2 (computation in place).
Simulation of a Factorizable r-qubit Gate. Let us consider the case of applying an r-qubit gate to an n-qubit register, this gate being factorizable in terms of the Kronecker product of 1-qubit gates. Its simulation follows from the simulation of the single-qubit gate: basically, it consists of applying the gate U to one qubit after another, resulting in r consecutive steps. In order to analyze the effect of the transformation on the coefficient space, let us study the application of a 1-qubit transformation U to r consecutive qubits, from the q-th to the (q+r−1)-th qubit. Following the argument of equations (3) and (4), we can infer that in this case the number of input coefficients that contribute to the calculation of a given output coefficient α_k^{out} is 2^r. Furthermore, this set of coefficients is given by

G_k^{q,r} = \bigcup_{m=0}^{2^r-1} \{ \alpha_{k\oplus 2^q \cdot m} \} \qquad (5)
Observe that the groups just defined act as closed groups, because for every α^{in} ∈ G_k its corresponding output coefficient α^{out} can be determined without any information from outside G_k. For example, when transforming the 4th and 5th qubits we have q = 4, r = 2 and G_k^{4,2} = {α_k, α_{k⊕2^4}, α_{k⊕2·2^4}, α_{k⊕3·2^4}}. Thus, the coefficients α_k^{out}, α_{k⊕2^4}^{out}, α_{k⊕2·2^4}^{out} and α_{k⊕3·2^4}^{out} can be computed from those included in G_k^{4,2} only. The number of disjoint groups existing in the coefficient space is 2^{n−r}. Therefore, the computational complexity is equivalent to applying 2^{n−r} times the
1-qubit gate U to an r-qubit register. In order to simulate a given gate U^{⊗n} applied to an n-qubit register, the computations can be organized by splitting the register into subregisters of r qubits. This way we can work over closed groups of a desired size. Note that the partition of the coefficient space into groups is different for the different subregisters; a host-side reference of this qubit-by-qubit procedure is sketched below.
Simulation of the QFT. The n-qubit Quantum Fourier Transform [10,3] can be implemented as shown in Fig. 2. A straightforward simulation, gate by gate, will involve n(n−1)/2 steps, which is very inefficient. Nevertheless, a more efficient implementation can be performed by grouping the gates into n stages, one for each k-th qubit, denoted as U_k in Fig. 2: QFT = U_{n−1} ··· U_1 · U_0. It is remarkable that each one of these stages U_k operates only on one of the qubits through controlled phase transformations and Hadamard gates. Therefore, the computational workload of simulating each stage U_k is the same as that of simulating a 1-qubit gate. This way, the resulting sequence of stages follows a scheme similar to the previously described simulation of U^{⊗n}.
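As a CPU reference for this scheme, the following sketch applies a 1-qubit gate in place to qubit q of the state vector, and U^{⊗n} by repeating it for every qubit; this is our own illustration of equations (3)-(5), not code from the paper.

#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<float>;

// In-place application of a 2x2 gate u to qubit q of an n-qubit state vector.
// Each iteration updates one pair {alpha[k], alpha[k ^ (1<<q)]}, as in Eq. (4).
void apply_1qubit_gate(std::vector<cplx>& alpha, int q, const cplx u[2][2]) {
    const std::size_t stride = std::size_t(1) << q;
    for (std::size_t base = 0; base < alpha.size(); base += 2 * stride)
        for (std::size_t k = base; k < base + stride; ++k) {
            cplx a0 = alpha[k];            // coefficient with bit q = 0
            cplx a1 = alpha[k + stride];   // coefficient with bit q = 1
            alpha[k]          = u[0][0] * a0 + u[0][1] * a1;
            alpha[k + stride] = u[1][0] * a0 + u[1][1] * a1;
        }
}

// U^{(x)n}: apply the same 1-qubit gate to every qubit, one stage after another.
void apply_gate_to_all_qubits(std::vector<cplx>& alpha, int n, const cplx u[2][2]) {
    for (int q = 0; q < n; ++q)
        apply_1qubit_gate(alpha, q, u);
}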
4 Implementation
In this section we map our simulation schemes onto the CUDA architecture for the three cases under study: the 1-qubit gate, the U^{⊗n} transformation and the QFT. As mentioned before, our simulation model (Fig. 1) involves a classical computation (code running on the host) and a quantum computation which is simulated on the GPU (kernel code running in parallel on the device). The main challenge when designing the kernel code is to handle the limitations imposed by the CUDA architecture. The first limitation arises from the organization of the device's memory system. On the one hand, the vector of coefficients describing the state of the quantum register tends to be very large, and only the device's global memory is able to store all of it. Although shared memory is two orders of magnitude faster (it acts as a parallel data cache for each multiprocessor), it has a small size and can hold only a small portion of the coefficients. Subsets of coefficients can be transferred from global to shared memory (copy-in) when they are reused frequently, in which case a substantial performance increase is obtained; notice that the results then have to be transferred back to global memory (copy-out). On the other hand, efficient transfers between global and shared memory are restricted to contiguous words (coalescing). Another important limitation comes from the synchronization mechanism inherent to CUDA. It is only a short-range mechanism, as it allows synchronizing only threads belonging to the same block; synchronization of threads belonging to different blocks must be handled on the host side.

Simulation of a 1-qubit Gate U. The simulation of a 1-qubit gate U is derived from the parallel SIMD execution of (4), where U is applied to the q-th qubit of an n-qubit register. According to this expression, the computation of a coefficient
α_k^{out} requires access to the coefficient α_k^{in} itself and to the coefficient α_{k⊕2^q}^{in}. Moreover, once these two coefficients have been read, α_{k⊕2^q}^{out} can also be calculated. Consequently, the kernel code must determine in parallel every pair of the form {α_k, α_{k⊕2^q}}. Since there exist 2^{n−1} such pairs, the same number of threads is required, with each thread in charge of applying the transformation U to one pair. As a given coefficient belongs to one and only one pair, only one read and one write operation in global memory are necessary, assuming that each coefficient α_k is located at the k-th position of the state vector in global memory. Thus, when a single gate is simulated in isolation, the use of shared memory does not improve performance, because there is no reuse of the data transferred from global to shared memory. Due to the disjointness of the pairs, the coefficients computed after a transformation can directly overwrite the inputs (in-place computation); this way, a larger number of qubits can be simulated. Note that synchronization points become mandatory when consecutive 1-qubit gates are simulated, in order to guarantee the correctness of the computation.
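The following Java sketch is a sequential, host-side reference version of this pair-wise in-place update of equation (4); it is our own illustration rather than the paper's CUDA kernel, and the split real/imaginary array layout is an assumption of the sketch.

```java
// Host-side reference sketch of the in-place 1-qubit update of equation (4).
// Each pair {a_k, a_{k XOR 2^q}} is read once, transformed by the 2x2 matrix
// U = [[u00,u01],[u10,u11]], and written back in place. There are 2^(n-1) pairs.
public final class OneQubitGate {
    static void apply(double[] re, double[] im, int q,
                      double[] uRe, double[] uIm) {  // u arrays: {u00,u01,u10,u11}
        final int stride = 1 << q;
        for (int base = 0; base < re.length; base += 2 * stride) {
            for (int k = base; k < base + stride; k++) { // bit q of k is 0
                int k1 = k ^ stride;                      // partner: bit q set
                double a0r = re[k],  a0i = im[k];
                double a1r = re[k1], a1i = im[k1];
                // a_k'  = u00*a0 + u01*a1 ;  a_k1' = u10*a0 + u11*a1
                re[k]  = uRe[0]*a0r - uIm[0]*a0i + uRe[1]*a1r - uIm[1]*a1i;
                im[k]  = uRe[0]*a0i + uIm[0]*a0r + uRe[1]*a1i + uIm[1]*a1r;
                re[k1] = uRe[2]*a0r - uIm[2]*a0i + uRe[3]*a1r - uIm[3]*a1i;
                im[k1] = uRe[2]*a0i + uIm[2]*a0r + uRe[3]*a1i + uIm[3]*a1r;
            }
        }
    }
}
```

For the Hadamard gate, for instance, uRe = {1/√2, 1/√2, 1/√2, -1/√2} and uIm = {0, 0, 0, 0}; on the GPU each iteration of the inner loop corresponds to the work of one thread.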
Simulation of a Factorizable n-qubit Gate. Let us consider the simulation of a multiple-qubit gate that is factorizable in terms of the Kronecker product of 1-qubit gates and/or controlled gates. Without loss of generality, the same 1-qubit gate U is applied to every qubit of the register, that is, the transformation to be analyzed is U^{⊗n}. A first approach follows from the simulation of the single-qubit gate: the gate U is applied to one qubit after another, resulting in n consecutive stages. Due to the lack of inter-block synchronization on the GPU side, synchronization on the host side is necessary, which implies a separate kernel invocation for each qubit, i.e., for each stage. This solution has two main disadvantages: the time overhead due to host synchronization and the inefficiency of not being able to use the fast shared memory. In contrast, a more efficient proposal is introduced hereafter. The key idea is to copy in a subset of coefficients from global to shared memory, perform all possible computations there, and then copy the results back from shared to global memory. The coefficients held in shared memory can be reused several times, resulting in more accesses to fast shared memory and fewer accesses to global memory. Basically, we consider quantum subregisters of consecutive qubits such that the corresponding groups G_k of expression (5) fit into the multiprocessor's shared memory. That is, the resulting groups must fulfill the condition Card(G_k) = S, where S is the number of coefficients that fit in shared memory; the number of qubits in the subregister to be transformed is then log_2(S). This approach is expressed in Fig. 5, which corresponds to the scheme of Fig. 4(a) for q = 0, r = 4 and S = 2^r, where the gates are applied to the log_2(S) least significant qubits. Note that in this case, q = 0, the coefficients of a group G_k occupy consecutive memory positions, so the copy-in/out operations benefit from coalescing.
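As an illustration of this copy-in/compute/copy-out organization for the least significant qubits (cf. Fig. 5 below), the following CPU sketch reuses the one-qubit update shown earlier. It is our own code: the local buffer stands in for the GPU shared memory and the outer loop stands in for the grid of thread blocks.

```java
// CPU sketch of the blocked scheme for the log2(S) least significant qubits:
// copy S contiguous coefficients into a local buffer (the "shared memory"),
// apply U to every qubit of the subregister there, then copy the block back.
final class BlockedLsbGate {
    static void applyLsbBlocked(double[] re, double[] im, int S,
                                double[] uRe, double[] uIm) {
        int r = Integer.numberOfTrailingZeros(S);      // S = 2^r coefficients
        double[] locRe = new double[S], locIm = new double[S];
        for (int base = 0; base < re.length; base += S) {
            System.arraycopy(re, base, locRe, 0, S);   // copy-in (coalesced)
            System.arraycopy(im, base, locIm, 0, S);
            for (int q = 0; q < r; q++) {              // U on each LSB qubit
                OneQubitGate.apply(locRe, locIm, q, uRe, uIm);
            }
            System.arraycopy(locRe, 0, re, base, S);   // copy-out
            System.arraycopy(locIm, 0, im, base, S);
        }
    }
}
```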
Fig. 4. Using the shared memory makes it possible to reduce the number of GPU/CPU synchronizations (a), but for the most significant qubits (b) the lack of coalescing may degrade performance. However, contiguous memory locations can be used by defining coalesced groups of coefficients (c).

Let G = G_k^{r,q=0} = \bigcup_{m=0}^{2^r-1} \{\alpha_{k \oplus m}\},  with Card(G) = 2^r = S
  Copy_in(G)
  Apply U^{\otimes r} to the \alpha \in G
  Copy_out(G)

Fig. 5. Kernel for the LSB qubits of U^{\otimes n}

Let H = G_k^{r-m,q} \cup G_{k+1}^{r-m,q} \cup \ldots \cup G_{k+M}^{r-m,q},  with m = \log_2(M), q > \log_2(S),
  Card(G_k^{r-m,q}) = 2^{r-m} = S/M,  Card(H) = S
  Copy_in(H)
  Apply U^{\otimes(r-m)} \otimes I^{\otimes m} to the \alpha \in H
  Copy_out(H)

Fig. 6. Kernel for the MSB qubits of U^{\otimes n}
When proceeding with the remaining n − log_2(S) qubits (q > log_2(S)), a first attempt may consist of building new groups G_k^{q,r} of cardinality S, as described above. This situation is shown in Fig. 4(b). Observe that, according to equation (5), the coefficients of these groups are located far apart in global memory when q > log_2(S). This means that global memory accesses are not coalesced, which adversely affects performance. To overcome this lack of coalescing, a superset of coefficient groups can be copied in/out by selecting the groups appropriately. With M being the number of these new groups, log_2(S/M) qubits can now be transformed, because the maximum number of coefficients in shared memory is S. The M groups are selected in such a way that the coefficients altogether constitute S/M series of M consecutive coefficients. This idea is illustrated in Fig. 6, where H denotes the superset of groups. In contrast to the approach followed for the LSB qubits, where log_2(S) gates are simulated, this one reduces the number of qubits processed per kernel to log_2(S/M), but locality is greatly improved. This results in more kernel invocations, as depicted in Fig. 4(c), where q = 4, S = 16 and M = 4. Nevertheless, the extra copy-in/out operations, including host-side synchronizations, are worthwhile because the memory accesses are coalesced.

Simulation of the QFT. As discussed in Section 3, the QFT can be carried out in n stages, denoted U_k, one for each k-th qubit, as shown in Fig. 2. The resulting chain of stages is analogous to the simulation of U^{⊗n} previously
described, but the transformation to be applied differs from stage (qubit) to stage. We have considered two implementations. The first one proceeds stage by stage, working on the state-vector coefficients stored in the device's global memory; this requires n different kernel invocations, one for each stage U_k. Better performance can be achieved by using the shared memory to hold a portion of the coefficient space and transferring data between global and shared memory according to the coalescing criterion. In this case, however, the computational scheme becomes more complex for two reasons. On the one hand, each transformation U_k involves controlled (conditional) updates of the state-vector coefficients, so each coefficient is transformed in a different way, in contrast to U^{⊗n}, where the same matrix is applied to all qubits. On the other hand, the transformations U_k must be translated properly when working on the local space of coefficients copied into shared memory, since coalesced consecutive coefficients in shared memory do not necessarily correspond to consecutive coefficients in the global space.
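To illustrate the kind of conditional update a stage U_k performs, the following Java sketch (our own, not the paper's kernel) applies a single controlled phase rotation: each amplitude is touched at most once and is multiplied by a phase only when both the control and target bits are set, which is why a whole stage costs no more than the simulation of a 1-qubit gate.

```java
// Controlled phase rotation R(theta): multiply amplitude a_i by e^{i*theta}
// only when both the control and the target bits of the index i are 1.
// Together with the Hadamard update sketched earlier, this is the building
// block of a QFT stage U_k.
final class ControlledPhase {
    static void apply(double[] re, double[] im,
                      int control, int target, double theta) {
        final double c = Math.cos(theta), s = Math.sin(theta);
        final int mask = (1 << control) | (1 << target);
        for (int i = 0; i < re.length; i++) {
            if ((i & mask) == mask) {
                double r0 = re[i], i0 = im[i];
                re[i] = r0 * c - i0 * s;
                im[i] = r0 * s + i0 * c;
            }
        }
    }
}
```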
5 Experimental Results
Simulations of a multiple-qubit gate U^{⊗n} and of the QFT for n qubits have been implemented and tested. The target GPU platform was an NVIDIA GeForce 8800GTX with the following features: 16 multiprocessors of 8 processors at 1.35 GHz, 768 MB of device global memory with a latency of up to 600 clock cycles, and an 8 KB parallel data cache (shared memory) per multiprocessor with a latency of 4 cycles. The CUDA 1.1 SDK and toolkit were used. The goal was to evaluate the two main strategies previously introduced: the gate-by-gate (stage-by-stage for the QFT) and the coalescing-improved implementations. In the case of U^{⊗n}, we select U = H (the Hadamard gate), giving rise to the so-called Walsh gate [10]. Table 2 shows the execution time in milliseconds for the stage-by-stage and coalescing-improved strategies on the GPU platform for different sizes of the quantum register. As a time reference, it also shows a sequential simulation on an Intel Core2 6300 based platform at 1.86 GHz running Linux and using libquantum [2], one of the most popular quantum simulation packages. The upper register size limit is 26 qubits, which is imposed by the memory size of the device. Notice that each state-vector coefficient is a complex number made of two single-precision floating point numbers (32 bits each). Two other significant parameters for the coalescing-improved version are the number of coefficients fitting in shared memory (S) and the optimal number of consecutive coefficients to be transferred (M). These values are platform dependent, being S = 1024 and M = 32 for our device. Also, the number of threads is set to half the number of coefficients. Several facts can be highlighted concerning these results. Firstly, good scalability with respect to the number of coefficients can be observed for both parallel versions. Secondly, the coalescing-improved version exhibits better performance over the whole range. In other words, high performance requires good exploitation of the device memory hierarchy. On the one hand, data to be reused must be allocated in shared memory whenever possible. On the other hand, it is crucial to coalesce the accesses to global memory. Finally, the coalescing-improved GPU version reaches a speedup of up to 85 relative to the CPU implementation for the fastest execution.

Table 2. Execution time (msec) for the Walsh gate and the QFT simulations

                                        Number of qubits of the quantum register
Implementation                        15    16    17    18    19    20    21    22    23     24     25     26
H^{⊗n}  CPU sequential: libquantum     -     -     -    31    78   156   328   688  1453   3031   6281  13062
        GPU stage-by-stage             -     -     -   1.8   3.2   6.1  12.0  24.3  49.9    102    212    439
        GPU coalescing-improved        -     -     -   1.1   2.0   4.1   8.8  18.1  37.1   76.2    158    342
QFT     CPU sequential: libquantum    10    30    80   150   350   730  1560  3260  6710  13910  28160  56890
        GPU stage-by-stage          2.38  4.58  9.09  18.6  39.5  85.2   185   402   874   1897   4100   8847
        GPU coalescing-improved     0.20  0.43  0.81  1.83  3.78  7.84  16.9  35.1  72.9    150    313    667
References 1. Barenco, A., Bennett, C.H., Cleve, R., DiVicenzo, D.P., Margolus, N., Shor, P., Sleator, T., Smolin, J.A., Weinfurter, H.: Elementary Gates for Quantum Computation. Phys. Rev. A 52, 3457–3467 (1995) 2. Butscher, B., Weimer, H.: The libquantum Library, http://www.enyo.de/libquantum/ 3. De Raedt, K., Michielsen, K., De Raedt, H., Trieu, B., Arnold, G., Richter, M., Lippert, T., Watanabe, H., Ito, N.: Massively Parallel Quantum Computer Simulator. Computer Physics Communications 176, 121–136 (2007) 4. Deutsch, D.: Quantum Computational Networks. Proceedings of Royal Society of London, Series A 425, 73–90 (1989) 5. Deutsch, D., Jozsa, R.: Rapid Solution of Problems by Quantum Computation. Proceedings of Royal Society of London, Series A 439, 553–558 (1992) 6. Fujishima, M.: FPGA-Based High-Speed Emulator of Quantum Computing. In: IEEE Int’l Conference on Computer Design (2004) ¨ 7. Glendinning, I., Omer, B.: Parallelization of the QC-Lib Quantum Computer Simulator Library. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Wa´sniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 461–468. Springer, Heidelberg (2004) 8. Grover, L.K.: A Fast Quantum Mechanical Algorithm For Database Search. In: Annual ACM Symposium on the Theory of Computation, pp. 212–219 (1996) 9. Khalid, A.U., Zilic, Z., Radecka, K.: FPGA Emulation of Quantum Circuits. In: IEEE Int’l Conference on Field-Programming Technology (2003) 10. Nielsen, M., Chuang, I.: Quantum Computation and Quantum Information. Cambridge University Press, Cambridge (2004) 11. Niwa, J., Matsumoto, K., Imai, H.: General-Purpose Parallel Simulator for Quantum Computing. In: Calude, C.S., Dinneen, M.J., Peper, F. (eds.) UMC 2002. LNCS, vol. 2509, pp. 230–251. Springer, Heidelberg (2002) 12. NVIDIA CUDA Programming Guide, SDK and Toolkit, http://developer.nvidia.com/object/cuda.html 13. Shor, P.W.: Algorithms for Quantum Computation: Discrete Logarithm and Factoring. In: 35th Symposium on Foundations of Computer Science, pp. 124–134 (1995) 14. Udrescu, M., Prodan, L., Vladutiu, M.: Using HDLs for Describing Quantum Circuits: A Framework for Efficient Quantum Algorithm Simulation. In: Computing Frontiers Conference (2004)
Comparison of Numerical Models of Impact Force for Simulation of Earthquake-Induced Structural Pounding Robert Jankowski Faculty of Civil and Environmental Engineering, Gdańsk University of Technology, ul. Narutowicza 11/12, 80-952 Gdańsk, Poland [email protected]
Abstract. Structural pounding during earthquakes is a complex phenomenon involving plastic deformations, local cracking, etc. The aim of the present paper is to check the accuracy of three pounding force numerical models, such as: the linear viscoelastic model, the non-linear elastic model following the Hertz law of contact and the non-linear viscoelastic model. In the analysis, the results of numerical simulations have been compared with the results of an impact experiment conducted by dropping balls of different building materials. The results of the study indicate that the non-linear viscoelastic model is the most precise one in simulating the pounding force time history during impact. Keywords: pounding, earthquakes, impact force, numerical simulation.
1 Introduction Earthquake-induced pounding between neighbouring, inadequately separated buildings or bridge segments can lead to considerable damage or even collapse of the colliding structures (see, for example, [1,2]). Impact itself is a highly complex phenomenon involving plastic deformations at contact points, local cracking or crushing, friction, etc., which makes it difficult to model. Structural pounding has recently been studied intensively using different numerical models of impact force. The fundamental study on pounding between buildings in series using a linear viscoelastic model has been conducted by Anagnostopoulos [3]. Jankowski et al. [4] used the same model to study pounding of superstructure segments in bridges. In order to simulate the force-deformation relation more realistically, a non-linear elastic model following the Hertz law of contact has been adopted by a number of researchers (see, for example, [5,6]). For the purposes of a more precise simulation of the physical phenomenon, a non-linear viscoelastic model has also been considered [7-9]. In this model, a non-linear spring following the Hertz law of contact is applied together with an additional non-linear damper, which is activated during the approach period of collision in order to simulate the process of energy loss taking place mainly during that period. The aim of the present paper is to check the accuracy of these pounding force models for the simulation of impacts between different building materials. In the
analysis, the results of numerical simulations have been compared with the results of an impact experiment conducted by dropping balls of different mass.
2 Pounding Force Numerical Models

2.1 Linear Viscoelastic Model

The linear viscoelastic model is the most frequently used one for the simulation of structural pounding under earthquake excitation (see, for example, [3,4]). The pounding force during impact, F(t), for this model is expressed as:

F(t) = k\,\delta(t) + c\,\dot{\delta}(t),   (1)

where δ(t) describes the deformation of the colliding structural members, \dot{\delta}(t) denotes the relative velocity between them, k is the impact element's stiffness simulating the local stiffness at the contact point and c is the impact element's damping, which can be obtained from the formula [3]:

c = 2\xi \sqrt{k\,\frac{m_1 m_2}{m_1 + m_2}},   (2)

where m_1, m_2 are the masses of the structural members and ξ is the damping ratio related to the coefficient of restitution, e, which accounts for the energy dissipation during impact [10]. A value of e = 1 corresponds to a fully elastic collision, a value of e = 0 to a fully plastic one. The relation between ξ and e in the linear viscoelastic model is given by the formula [3]:

\xi = \frac{-\ln e}{\sqrt{\pi^2 + (\ln e)^2}}.   (3)
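For illustration, a direct Java transcription of formulas (2) and (3) might look as follows (our own sketch, not code from the paper):

```java
// Damping ratio and impact element's damping of the linear viscoelastic model.
final class LinearViscoelastic {
    static double xi(double e) {                       // equation (3)
        double ln = Math.log(e);
        return -ln / Math.sqrt(Math.PI * Math.PI + ln * ln);
    }
    static double damping(double k, double m1, double m2, double e) {
        return 2.0 * xi(e) * Math.sqrt(k * m1 * m2 / (m1 + m2));  // equation (2)
    }
}
```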
2.2 Non-linear Elastic Model
In order to model the pounding force-deformation relation more realistically, a non-linear elastic model following the Hertz law of contact has been adopted by a number of researchers [5,6]. The pounding force, F(t), for this model is expressed by the formula:

F(t) = \beta\,\delta^{3/2}(t),   (4)
where β is the impact stiffness parameter, which depends on material properties and geometry of colliding bodies. The disadvantage of the Hertz contact law model is that it is fully elastic and does not account for the energy dissipation during contact due to plastic deformations, local crushing, etc.
2.3 Non-linear Viscoelastic Model
For the purposes of a more precise simulation of an impact phenomenon, a non-linear viscoelastic model has been proposed [7]. The pounding force, F(t), for this model is expressed by the formula:

F(t) = \beta\,\delta^{3/2}(t) + c(t)\,\dot{\delta}(t)   for \dot{\delta}(t) > 0 (approach period),
F(t) = \beta\,\delta^{3/2}(t)                           for \dot{\delta}(t) \le 0 (restitution period),   (5)

where β is the impact stiffness parameter and c(t) is the impact element's damping, which at any instant of time can be obtained from the formula [7]:

c(t) = 2\xi \sqrt{\beta \sqrt{\delta(t)}\,\frac{m_1 m_2}{m_1 + m_2}},   (6)

where ξ denotes the damping ratio related to the coefficient of restitution, e. The approximate relation between ξ and e in the non-linear viscoelastic model is expressed by the formula [11]:

\xi = \frac{9\sqrt{5}}{2} \cdot \frac{1 - e^2}{e\,(e(9\pi - 16) + 16)}.   (7)
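A compact Java rendering of equations (5)-(7) could look like the following. It is an illustrative sketch of ours, not code used in the study; the handling of the out-of-contact case (δ(t) ≤ 0) is our addition, anticipating the drop-test setup of Section 3.

```java
// Non-linear viscoelastic pounding force (equations (5)-(7)).
// delta and deltaDot are the current deformation and relative velocity.
final class NonlinearViscoelastic {
    static double xi(double e) {                       // equation (7)
        return 9.0 * Math.sqrt(5.0) / 2.0
             * (1.0 - e * e) / (e * (e * (9.0 * Math.PI - 16.0) + 16.0));
    }
    static double force(double beta, double delta, double deltaDot,
                        double m1, double m2, double e) {
        if (delta <= 0.0) return 0.0;                  // not in contact
        double hertz = beta * Math.pow(delta, 1.5);    // beta * delta^(3/2)
        if (deltaDot > 0.0) {                          // approach period
            double c = 2.0 * xi(e)                     // equation (6)
                     * Math.sqrt(beta * Math.sqrt(delta) * m1 * m2 / (m1 + m2));
            return hertz + c * deltaDot;
        }
        return hertz;                                  // restitution period
    }
}
```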
3 Comparison of Pounding Force Numerical Models

In order to verify the accuracy of the pounding force models for different building materials, the results of the numerical simulations have been compared with the results of an impact experiment conducted by dropping steel, concrete and timber balls of different mass on a rigid surface. The balls have been dropped from various heights in order to obtain different impact velocities. The properties of the balls used in the experiment are specified in Table 1. The experimental setup is shown in Fig. 1.

Table 1. Properties of balls used in the experiment

Material                  Ball diameter (mm)   Ball mass (kg)
Steel (type 18G2A)         21                  0.053 – 0.054
                           50                  0.538 – 0.541
                           83                  2.013
Concrete (grade C30/37)   103                  1.329 – 1.350
                          114                  1.763 – 1.835
                          128                  2.531 – 2.636
Timber (pinewood)          55                  0.065 – 0.066
                           71                  0.109 – 0.112
                          118                  0.493 – 0.497
Fig. 1. Setup of the experiment
The numerical analysis has been conducted using the following equation of motion:

m\,\ddot{y}(t) + F(t) = m g,   (8)

where m is the mass of the ball, \ddot{y}(t) its vertical acceleration and g the acceleration of gravity. The pounding force, F(t), has been set to zero when y(t) ≤ h (h being the drop height) and has been calculated according to Equation (1), (4) or (5) when y(t) > h, whereas the deformation, δ(t), has been calculated as:

\delta(t) = y(t) - h.   (9)

A time-stepping integration procedure with a constant time step Δt = 1 × 10^{-6} s has been applied to solve Equation (8) numerically. The values of the impact stiffness parameters (k for the linear viscoelastic model and β for the two non-linear models) have been determined using the method of least squares. The difference between the results of the experiment and the results of the numerical analysis has been assessed by calculating the normalised error:

E = \frac{\| F - \bar{F} \|}{\| F \|} \cdot 100\%,   (10)

where F is the impact time-history vector obtained from the experiment, \bar{F} is the impact time-history vector obtained from the numerical analysis and \| F - \bar{F} \| denotes the Euclidean norm of F - \bar{F}.
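The following Java sketch shows one possible realization of this procedure for the non-linear viscoelastic model, using the force routine sketched in Section 2.3. It is our own illustration: the paper does not specify the integration scheme, and the semi-implicit Euler update, the value of g and the treatment of the rigid surface as an effectively infinite mass are assumptions of the sketch.

```java
// Drop-test integration of equations (8)-(9) with a constant time step, and the
// normalised error of equation (10).
final class DropTest {
    static double[] simulate(double m, double h, double v0, double beta,
                             double e, double dt, int steps) {
        double y = h, v = v0, g = 9.81;                // start at impact
        double[] force = new double[steps];
        for (int i = 0; i < steps; i++) {
            double delta = y - h;                      // equation (9)
            double F = (y > h)                         // contact only when y > h
                     ? NonlinearViscoelastic.force(beta, delta, v, m, 1e12, e)
                     : 0.0;                            // 1e12 ~ rigid surface mass
            double a = g - F / m;                      // from m*y'' + F = m*g (8)
            v += a * dt;                               // semi-implicit Euler step
            y += v * dt;
            force[i] = F;
        }
        return force;                                  // pounding force history
    }

    static double normalisedError(double[] fExp, double[] fNum) { // equation (10)
        double num = 0.0, den = 0.0;
        for (int i = 0; i < fExp.length; i++) {
            double d = fExp[i] - fNum[i];
            num += d * d;
            den += fExp[i] * fExp[i];
        }
        return 100.0 * Math.sqrt(num) / Math.sqrt(den);
    }
}
```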
The experimental study and the numerical analysis have been conducted for a large number of impact cases. In the following sections, examples of the results are presented.

3.1 Steel-to-Steel Impact
In the first example, the results of the numerical analysis are compared with the results of the experiment conducted for a steel ball of mass 2.013 kg impacting the steel surface with a velocity of 0.92 m/s. In the numerical analysis, the following values of the parameters defining the different pounding force models have been used: k = 4.82 × 10^8 N/m, ξ = 0.17 (e = 0.58) for the linear viscoelastic model, β = 7.55 × 10^{10} N/m^{3/2} for the non-linear elastic model, and β = 6.60 × 10^{10} N/m^{3/2}, ξ = 0.49 (e = 0.58) for the non-linear viscoelastic model. The pounding force time history measured during the experiment and the histories obtained from the numerical analysis for the considered example of steel-to-steel impact are presented in Fig. 2. Using Equation (10), the simulation errors for the pounding force histories have been calculated as: 15.9% for the linear viscoelastic model, 64.1% for the non-linear elastic model and 15.3% for the non-linear viscoelastic model.
Fig. 2. Pounding force time histories for steel-to-steel impact (experiment vs. the linear viscoelastic, non-linear elastic and non-linear viscoelastic models; pounding force [N] vs. time [ms])
3.2 Concrete-to-Concrete Impact
The second example concerns the comparison between the results of the numerical simulations and the experiment conducted for a concrete ball of mass 1.763 kg impacting the concrete surface with a velocity of 0.13 m/s. In the numerical analysis, the following values of the parameters defining the different pounding force models have been used: k = 4.91 × 10^7 N/m, ξ = 0.09 (e = 0.76) for the linear viscoelastic model, β = 1.04 × 10^{10} N/m^{3/2} for the non-linear elastic model, and β = 1.02 × 10^{10} N/m^{3/2}, ξ = 0.22 (e = 0.76) for the non-linear viscoelastic model. The pounding force time history measured during the experiment and the histories obtained from the numerical analysis for the considered example of concrete-to-concrete impact are presented in Fig. 3. The simulation errors for the pounding force histories from Fig. 3 have been calculated as: 12.7% for the linear viscoelastic model, 33.5% for the non-linear elastic model and 11.6% for the non-linear viscoelastic model.
Fig. 3. Pounding force time histories for concrete-to-concrete impact (experiment vs. the linear viscoelastic, non-linear elastic and non-linear viscoelastic models; pounding force [N] vs. time [ms])
3.3 Timber-to-Timber Impact
In the third example, the results of the numerical analysis are compared with the results of the experiment conducted for a timber ball of mass 0.109 kg impacting the
timber surface with a velocity of 0.39 m/s. In the numerical analysis, the following values of the parameters defining the different pounding force models have been used: k = 2.28 × 10^6 N/m, ξ = 0.16 (e = 0.61) for the linear viscoelastic model, β = 3.24 × 10^8 N/m^{3/2} for the non-linear elastic model, and β = 2.52 × 10^8 N/m^{3/2}, ξ = 0.43 (e = 0.61) for the non-linear viscoelastic model. The pounding force time history measured during the experiment and the histories obtained from the numerical analysis for the considered example of timber-to-timber impact are presented in Fig. 4. Using Equation (10), the simulation errors for the pounding force histories have been calculated as: 20.9% for the linear viscoelastic model, 61.0% for the non-linear elastic model and 19.5% for the non-linear viscoelastic model.
Fig. 4. Pounding force time histories for timber-to-timber impact (experiment vs. the linear viscoelastic, non-linear elastic and non-linear viscoelastic models; pounding force [N] vs. time [ms])
4 Conclusions

The results of the study indicate that the non-linear viscoelastic model is the most precise of the three in simulating the pounding force time histories during impact for the three different building materials. The model allows us to simulate the relatively rapid increase in the pounding force during the approach period of collision and the decrease in the force, with a lower unloading rate, during the restitution period. Because of the above, the model can be successfully used for numerical simulations of the pounding-involved response of structures under earthquake excitation in order to
enhance the accuracy of the analysis. On the other hand, Figs. 2-4 show the drawbacks of the two other models considered. In the case of the linear viscoelastic model, a negative force can be observed just before separation, which has no physical explanation. In the case of the non-linear elastic model following the Hertz contact law, the pounding force history in the approach and restitution periods is symmetric, due to the elastic behaviour, and the maximum pounding force attains a higher value than in the experimental results.
References 1. Rosenblueth, E., Meli, R.: The 1985 earthquake: causes and effects in Mexico City. Concrete international 8, 23–34 (1986) 2. Kasai, K., Maison, B.: Building pounding damage during the 1989 Loma Prieta earthquake. Engineering Structures 19, 195–207 (1997) 3. Anagnostopoulos, S.A.: Pounding of buildings in series during earthquakes. Earthquake Engineering and Structural Dynamics 16, 443–456 (1988) 4. Jankowski, R., Wilde, K., Fujino, Y.: Pounding of superstructure segments in isolated elevated bridge during earthquakes. Earthquake Engineering and Structural Dynamics 27, 487–502 (1998) 5. Jing, H.-S., Young, M.: Impact interactions between two vibration systems under random excitation. Earthquake Engineering and Structural Dynamics 20, 667–681 (1991) 6. Chau, K.T., Wei, X.X.: Pounding of structures modeled as non-linear impacts of two oscillators. Earthquake Engineering and Structural Dynamics 30, 633–651 (2001) 7. Jankowski, R.: Non-linear viscoelastic modelling of earthquake-induced structural pounding. Earthquake Engineering and Structural Dynamics 34, 595–611 (2005) 8. Jankowski, R.: Impact force spectrum for damage assessment of earthquake-induced structural pounding. Key Engineering Materials 293–294, 711–718 (2005) 9. Jankowski, R.: Pounding force response spectrum under earthquake excitation. Engineering Structures 28, 1149–1161 (2006) 10. Goldsmith, W.: Impact: The theory and physical behaviour of colliding solids. Edward Arnold Ltd., London (1960) 11. Jankowski, R.: Analytical expression between the impact damping ratio and the coefficient of restitution in the non-linear viscoelastic model of structural pounding. Earthquake Engineering and Structural Dynamics 35, 517–524 (2006)
Large-Scale Image Deblurring in Java Piotr Wendykier and James G. Nagy Dept. of Math and Computer Science, Emory University, Atlanta GA, USA [email protected], [email protected]
Abstract. This paper describes Parallel Spectral Deconvolution (PSD) Java software for image deblurring. A key component of the software, JTransforms, is the first, open source, multithreaded FFT library written in pure Java. Benchmarks show that JTransforms is competitive with current C implementations, including the well-known FFTW package. Image deblurring examples, including performance comparisons with existing software, are also given.
1 Motivation Instruments that record images are integral to advancing discoveries in science and medicine – from astronomical investigations, to diagnosing illness, to studying bacterial and viral diseases [1][2][3]. Computational science has an important role in improving image quality through the development of post-processing image reconstruction and enhancement algorithms and software. Probably the most commonly used post-processing technique is image deblurring, or deconvolution [4]. Mathematically this is the process of computing an approximation of a vector x_{true} (which represents the true image scene) from the linear inverse problem

b = A x_{true} + \eta.   (1)
Here, A is a large, usually ill-conditioned matrix that models the blurring operation, η is a vector that models additive noise, and b is a vector representing the recorded image, which is degraded by blurring and noise. Generally, it is assumed that the blurring matrix A is known (at least implicitly), but the noise is unknown. Because A is usually severely ill-conditioned, some form of regularization needs to be incorporated [5][6]. Many regularization methods, including Tikhonov, truncated singular (or spectral) value decomposition (TSVD), and Wiener filter, compute solutions of the form xreg = A†r b, where A†r can be thought of as a regularized pseudo-inverse of A. The precise form of A†r depends on many things, including the regularization method, the data b, and the blurring matrix A [4]. The actual implementation of computing xreg can often be done very efficiently using fast Fourier transforms (FFT) and fast discrete cosine transforms (DCT). This paper describes our development of Parallel Spectral Deconvolution (PSD) [7] Java software for image deblurring, including a plugin for the open source image processing system, ImageJ [8]. A key component of our software is the first, open source, multithreaded FFT library written in pure Java, which we call JTransforms [7].
Research supported by the NSF under grant DMS-05-11454.
This paper is organized as follows. In Section 2 we describe some basic image deblurring algorithms, and how fast transforms, such as FFTs and DCTs, can be used for efficient implementations. Section 3 describes the performance of our Java implementations, with a particular focus on JTransforms. Benchmarks show that our multithreaded Java approach is competitive with current C implementations, including the well-known FFTW package [9]. Image deblurring examples, including performance comparisons with existing software, are also given.
2 Deblurring Techniques

The deblurring techniques considered in this paper are based on filtering out certain spectral coefficients of the computed solution.

2.1 Regularization by Filtering

We begin by showing why regularization is needed, and how it can be done through spectral filtering. To simplify the discussion, we assume A is an n × n normal matrix [10], meaning that it has a spectral value decomposition (SVD)

A = Q^* \Lambda Q,   (2)

where Λ is a diagonal matrix containing the eigenvalues of A, Q is a matrix whose columns, q_i, are the corresponding eigenvectors, Q^* is the complex conjugate transpose of Q, and Q^* Q = I. (We realize that "SVD" usually refers to "singular value decomposition"; we do not think there should be any confusion, because our discussion of filtering can be done using the singular value decomposition in place of the spectral value decomposition.) We assume further that the eigenvalues are ordered so that |λ_1| ≥ |λ_2| ≥ ··· ≥ |λ_n| ≥ 0. Using the spectral decomposition, the inverse solution of (1) can be written as

x_{inv} = A^{-1} b = A^{-1}(A x_{true} + \eta) = x_{true} + A^{-1}\eta = x_{true} + \sum_{i=1}^{n} \frac{\tilde{\eta}_i}{\lambda_i}\, q_i,   (3)

where \tilde{\eta} = Q^* \eta. That is, the inverse solution is comprised of two terms: the desired true solution and an error term caused by noise in the data. To understand why the error term usually dominates the inverse solution, it is necessary to know the following properties of image deblurring [4][5]:
– Assuming the problem is scaled so that |λ_1| = 1, the eigenvalues, |λ_i|, decay to, and cluster at, 0, without a significant gap to indicate numerical rank.
– The eigenvectors q_i corresponding to small |λ_i| tend to have more oscillations than the eigenvectors corresponding to large |λ_i|.
These properties imply that the high frequency components in the error are highly magnified by division by the small eigenvalues. The computed inverse solution is dominated by
these high frequency components, and is in general a very poor approximation of the true solution, x_{true}. In order to compute an accurate approximation of x_{true}, or at least one that is not horribly corrupted by noise, the solution process must be modified. This process is usually referred to as regularization [5][6]. One class of regularization methods, called filtering, can be formulated as a modification of the inverse solution [5]. Specifically, a filtered solution is defined as

x_{reg} = A_r^{\dagger} b,   where   A_r^{\dagger} = Q^* \,\mathrm{diag}\!\left(\frac{\phi_1}{\lambda_1}, \frac{\phi_2}{\lambda_2}, \ldots, \frac{\phi_n}{\lambda_n}\right) Q.   (4)

The filter factors, φ_i, satisfy φ_i ≈ 1 for large |λ_i|, and φ_i ≈ 0 for small |λ_i|. That is, the large eigenvalue (low frequency) components of the solution are reconstructed, while the components corresponding to the small eigenvalues (high frequencies) are filtered out. Different choices of filter factors lead to different methods; popular choices are the truncated SVD (or pseudo-inverse), Tikhonov, and Wiener filters [5][6][11].

2.2 Tikhonov Filtering

To illustrate spectral filtering, consider the Tikhonov regularization filter factors

\phi_i = \frac{|\lambda_i|^2}{|\lambda_i|^2 + \alpha^2},   (5)

where the scalar α is called a regularization parameter, and usually satisfies |λ_n| ≤ α ≤ |λ_1|. Note that smaller α lead to more φ_i approximating 1. The regularization parameter is problem dependent, and in general it is nontrivial to choose an appropriate value. Various techniques can be used, such as the discrepancy principle, the L-curve, and generalized cross validation (GCV) [5][6]. There are advantages and disadvantages to each of these approaches [12], especially for large-scale problems. In this work we use GCV, which, using the SVD of A, requires finding α to minimize the function

G(\alpha) = \frac{ n \sum_{i=1}^{n} \left( \dfrac{\alpha^2\, \hat{b}_i}{|\lambda_i|^2 + \alpha^2} \right)^2 }{ \left( \sum_{i=1}^{n} \dfrac{\alpha^2}{|\lambda_i|^2 + \alpha^2} \right)^2 },   (6)

where \hat{b} = Q^* b. Standard optimization routines can be used to minimize G(α). Tikhonov filtering, and using GCV to choose regularization parameters, has proven to be effective for a wide class of inverse problems. Unfortunately, for large scale problems such as image deblurring, it may not be computationally feasible to compute the SVD of A. One way to overcome this difficulty is to exploit structure in the problem.

2.3 Fast Transform Filters

In image deblurring, A is a structured matrix that describes the blurring operation, and is given implicitly in terms of a point spread function (PSF). A PSF is an image of a point source object, and provides the essential information to construct A. The structure
of A depends on the PSF and on the imposed boundary condition [4]. In this subsection we describe two structures that arise in many image deblurring problems. However, due to space limitations, we cannot provide complete details; the interested reader should see [4] for more information. If the blur is assumed to be spatially invariant then the PSF is the same regardless of the position of the point source in the image field of view. In this case, if we also enforce periodic boundary conditions, then A has a circulant matrix structure, and the spectral factorization

A = F^* \Lambda F,   (7)

where F is a discrete Fourier transform (DFT) matrix; a d-dimensional image implies F is a d-dimensional DFT matrix. In this case, the matrix F does not need to be constructed explicitly; a matrix-vector multiplication F b is equivalent to computing a DFT of b, and similarly F^* b is equivalent to computing an inverse DFT. Efficient implementations of DFTs are usually referred to as fast Fourier transforms (FFT). The eigenvalues of A can be obtained by computing an FFT of the first column of A, and the first column of A can be obtained directly from the PSF. Thus, the computational efficiency of spectral filtering methods for image deblurring with a spatially invariant PSF and periodic boundary conditions requires efficient FFT routines. If the image has significant features near the boundary of the field of view, then periodic boundary conditions can cause ringing artifacts in the reconstructed image. In this case it may be better to use reflexive boundary conditions. But changing the boundary conditions changes the structure of A, and it no longer has the Fourier spectral decomposition given in (7). However, if the PSF is also symmetric about its center, then A is a mix of Toeplitz and Hankel structures [4], and has the spectral value decomposition

A = C^T \Lambda C,   (8)
where C is the discrete cosine transform (DCT) matrix; a d-dimensional image implies C is a d-dimensional DCT matrix. As with FFTs, there are very efficient algorithms for evaluating DCTs. Furthermore, computations such as the matrix vector multiplication Cb and CT b are done by calling DCT and inverse DCT functions. The eigenvalues of A can be obtained by computing a DCT of the first column of A, and the first column of A can be obtained directly from the PSF. Note that in the case of the FFT, F has complex entries and thus computations necessarily require complex arithmetic. However, in the case of the DCT, C has real entries, and all computations can be done in real arithmetic. Efficient FFT and DCT routines are essential for spectral deblurring algorithms. The next section describes our contribution to the development of efficient parallel Java codes for these important problems.
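As a concrete illustration of such spectral filtering, the following Java sketch applies the Tikhonov filter of (4)-(5) in the Fourier domain and evaluates the GCV function (6). It assumes that the eigenvalues of A and the transformed blurred image have already been obtained with an FFT, stored as interleaved real/imaginary arrays; it is our own reference sketch, not code from the PSD package.

```java
// Tikhonov filtering in the spectral domain. lambda holds the eigenvalues of A
// (e.g. the FFT of the PSF-derived first column of A) and bhat holds F*b, both
// as interleaved complex arrays {re0, im0, re1, im1, ...}.
final class TikhonovSpectral {
    static double[] filter(double[] lambda, double[] bhat, double alpha) {
        double[] xhat = new double[bhat.length];
        double a2 = alpha * alpha;
        for (int i = 0; i < bhat.length; i += 2) {
            double lr = lambda[i], li = lambda[i + 1];
            double br = bhat[i],   bi = bhat[i + 1];
            double denom = lr * lr + li * li + a2;     // |lambda_i|^2 + alpha^2
            // phi_i / lambda_i = conj(lambda_i) / (|lambda_i|^2 + alpha^2)
            xhat[i]     = (lr * br + li * bi) / denom;
            xhat[i + 1] = (lr * bi - li * br) / denom;
        }
        return xhat;   // an inverse FFT of xhat yields the regularized image
    }

    static double gcv(double[] lambda, double[] bhat, double alpha) { // eq. (6)
        double a2 = alpha * alpha, num = 0.0, den = 0.0;
        int n = bhat.length / 2;
        for (int i = 0; i < bhat.length; i += 2) {
            double l2 = lambda[i] * lambda[i] + lambda[i + 1] * lambda[i + 1];
            double b2 = bhat[i] * bhat[i] + bhat[i + 1] * bhat[i + 1];
            double w = a2 / (l2 + a2);
            num += w * w * b2;
            den += w;
        }
        return n * num / (den * den);
    }
}
```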
3 Using Java for Image Deblurring Java is ideally suited to provide efficient, open source image deblurring software that can be used in inexpensive imaging devices for point of care medical applications. Java implementations are available for virtually all computing platforms, and since May
2007 the source code of Java is distributed under the terms of the GNU General Public License. Moreover, Java has native support for multithreaded programming, which has become a mandatory paradigm in the era of multicore CPUs. Finally, sophisticated imaging functionality is built into Java, allowing for efficient visualization and animation of computational results. Significant improvements have been made to Java since the 1996 release of JDK 1.0, including Just-In-Time compilation, memory allocation enhancements, and utilization of performance features in modern x86 and x64 CPUs [13]. It is no longer the case that Java is too slow for high-performance scientific computing applications; this point is illustrated below for spectral image deblurring. There are disadvantages to using Java in scientific computing, including no primitive type for complex numbers, an inability to do operator overloading, and no support for IEEE extended precision floats. In addition, Java arrays were not designed for high-performance computing: a multi-dimensional array is an array of one-dimensional arrays, making it difficult to fully utilize cache memory. Moreover, Java arrays are not resizable, and only 32-bit array indexing is possible. Fortunately, open source numerical libraries, such as Colt [14], have been developed to overcome these disadvantages. For our work, we are implementing a fully multithreaded version of Colt, which we call Parallel Colt [7]. In the rest of this section we describe Java implementations of JTransforms, ImageJ and associated plugins for image deblurring.

3.1 JTransforms

Fast Fourier Transform. An FFT algorithm is the most efficient method to compute a DFT, with a complexity of Θ(N log(N)) for a DFT of a d-dimensional array containing N components. An FFT algorithm was first proposed by Gauss in 1805 [15], but it was the 1965 work by Cooley and Tukey [16] that is generally credited for popularizing its use. The most common variant of the algorithm, called radix-2, uses a divide-and-conquer approach to recursively split a DFT of size N into two parts of size N/2. Other splittings can be used as well, including mixed-radix and split-radix algorithms [17]. The split-radix algorithm has the lowest arithmetic operation count to compute a DFT when N is a power of 2 [18]. The algorithm was first described in 1968 by Yavne [19] and then reinvented in 1984 by Duhamel and Hollmann [20]. The idea here is to recursively divide a DFT of size N into one DFT of size N/2 and two DFTs of size N/4. Further details about the split-radix algorithm can be found in [17].

Parallel Implementation in Java. JTransforms is the first open source, multithreaded FFT library written in pure Java. The code was derived from the General Purpose FFT Package (OouraFFT) written by Ooura [21]. OouraFFT is a multithreaded implementation of the split-radix algorithm in C and Fortran. In order to provide more portability, both Pthreads and Windows threads are used in the implementation. Moreover, the code is highly optimized and in some cases runs faster than FFTW. Even so, the package has several limitations arising from the split-radix algorithm. First of all, the length of the
input data has to be a power of two. Second, the number of computational threads must also be a power of 2. Finally, one-dimensional transforms can only use two or four threads. JTransforms, with few exceptions, shares all the features and limitations of Ooura's C implementation. However, there are some important distinctions. First, JTransforms uses thread pools, while OouraFFT does not. Although thread pooling with Pthreads is possible, there is no code for this mechanism available in the standard library, and therefore many multithreaded applications written in C do not use thread pools; this has the added problem of incurring the overhead of creating and destroying threads every time they are used. Another difference between JTransforms and OouraFFT is the use of "automatic" multithreading. In JTransforms, threads are used automatically when computations are done on a machine with multiple CPUs. Conversely, both OouraFFT and FFTW require manually setting the maximum number of computational threads. Lastly, the JTransforms API is much simpler than that of OouraFFT, or even FFTW, since it is only necessary to specify the size of the input data; work arrays are allocated automatically and there is no planning phase. The release of Java 5 in 2004 came with a number of significant new language features [22]. One feature that we have found to be very useful is the cached thread pool, which creates new threads as needed and reuses previously constructed threads when they become available. This feature makes it possible to improve the performance of programs that execute many short-lived asynchronous tasks.

Benchmark. To show the performance of JTransforms we have benchmarked the code against the original OouraFFT and also against FFTW 3.1.2. The benchmark was run on a Sun Microsystems SunFire V40z server, with 4 Dual Core AMD Opteron 875 processors (2.2 GHz) and 32 GB of RAM. The machine had Red Hat Enterprise Linux version 5 (kernel 2.6.18-8.1.14.el5), gcc version 3.4.6 and Java version 1.6.0_03 (64-bit server VM) installed. The following Java options were used: -d64 -server -Xms15g -Xmx15g. For OouraFFT, we used the -O2 flag for the C compiler (one can get slightly better timings with unsafe flags: -O6 --fast-math). All libraries were set to use a maximum of eight threads and the DFTs were computed in place. The timings in Tables 1 and 2 are averages over 100 calls of each transform. This average execution time does not include the "warm-up" phase (the first two calls require more time) for JTransforms and OouraFFT. Similarly, for FFTW, the times do not include the planning phase. Table 1 presents the benchmark results for computing two-dimensional complex forward DFTs. For the 2^9 × 2^9, 2^10 × 2^10 and 2^12 × 2^12 sizes, JTransforms outperforms all other tested libraries.

Table 1. Average execution time (milliseconds) for 2-D, complex forward DFT

Library \ Size     2^7     2^8     2^9    2^10     2^11     2^12      2^13      2^14
JTransforms       2.43    3.76    6.21   32.84   198.31   529.81   4028.17  15682.78
OouraFFT          0.74    3.15   12.60   33.66   202.78   789.25   4165.33  16738.65
FFTW_ESTIMATE     1.15    4.84   31.75  131.80  1149.87  2715.39  26889.97  49670.29
FFTW_MEASURE      0.83    2.91   10.73   37.65   182.77   840.09   6665.73  14735.13
FFTW_PATIENT      0.67    2.81   11.73   36.84   179.55   884.39   3761.50  56522.40
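To give a flavour of the API simplicity mentioned above, a hypothetical usage of JTransforms for a 2-D complex forward DFT is sketched below. The class and method names follow the releases we are aware of from around the time of this paper (package edu.emory.mathcs.jtransforms; newer releases ship under org.jtransforms), so treat the exact names as an assumption.

```java
// Hypothetical JTransforms usage: the only required argument is the data size;
// work arrays and threads are handled automatically, with no planning phase.
import edu.emory.mathcs.jtransforms.fft.DoubleFFT_2D;

public class FftDemo {
    public static void main(String[] args) {
        int rows = 1024, cols = 1024;
        // Interleaved complex layout: data[2*(r*cols+c)] = Re, +1 = Im.
        double[] data = new double[2 * rows * cols];
        data[0] = 1.0;                        // a unit impulse as sample input
        DoubleFFT_2D fft = new DoubleFFT_2D(rows, cols);
        fft.complexForward(data);             // in-place forward transform
        fft.complexInverse(data, true);       // scaled inverse restores the input
    }
}
```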
Table 2. Average execution time (milliseconds) for 3-D, complex forward DFT

Library \ Size     2^2    2^3    2^4    2^5    2^6     2^7      2^8       2^9
JTransforms       0.12   1.09   2.35   5.02   6.43   46.85   553.21   7115.84
OouraFFT         0.001   0.02   0.15   1.67  11.38   58.63   847.13  12448.24
FFTW_ESTIMATE     0.48   0.39   0.44   1.59  11.18  110.14  1471.14  34326.50
FFTW_MEASURE      0.48   0.37   0.44   1.23   8.28   48.69   601.88   7432.08
FFTW_PATIENT     0.001   0.01   0.10   1.48   8.36   47.27   573.77   8936.34
Table 2 shows the benchmark results for three-dimensional, complex forward DFTs. Once again, our Java implementation is faster than OouraFFT for almost all sizes of input data. Moreover, starting from 2^6 × 2^6 × 2^6, JTransforms is faster than FFTW. More benchmark results, including discrete cosine and sine transforms, can be found at the JTransforms website [7].

3.2 Deconvolution Plugins for ImageJ

ImageJ [8] is an open source image processing program written in Java by Wayne Rasband, a researcher working at the U.S. National Institutes of Health (NIH). Besides having a large number of options for image editing applications, ImageJ is designed with a pluggable architecture that allows developing custom plugins (over 300 user-written plugins are currently available). Due to this unique feature, ImageJ has become a very popular application among a large and knowledgeable worldwide user community. DeconvolutionJ [23] is an ImageJ plugin written by Nick Linnenbrügger that implements spectral deconvolution based on the regularized Wiener filter [11]. The plugin has a number of limitations. It can handle arbitrary-sized two- and three-dimensional images, but it requires the PSF image to be the same size as the blurred image and centered in the field of view. In addition, the regularization parameter of the Wiener filter must be specified manually, and there is no update option to efficiently deblur the same image with different values of the regularization parameter. Last, but not least, DeconvolutionJ is a serial implementation, and therefore cannot take advantage of modern multicore processors. Our spectral deconvolution plugin, Parallel Spectral Deconvolution (PSD), does not suffer from any of these limitations. The current version (1.4) implements Tikhonov- and TSVD-based image deblurring [4]. Our multithreaded approach uses both JTransforms and Parallel Colt, so we were able to achieve superior performance compared to DeconvolutionJ. PSD's features include two choices of boundary conditions (reflexive and periodic), automatic choice of the regularization parameter using GCV, a threshold (the smallest nonnegative pixel value assigned to the restored image), single and double precision, a very fast parameter update option, and the possibility of defining the number of computational threads. By default, the plugin recognizes the number of available CPUs and uses that many threads. Nevertheless, the current implementation of PSD has a couple of limitations. First, color images are not supported (DeconvolutionJ is also limited to grayscale images). The second limitation arises due to JTransforms, where the size of the input data and the number of threads must
be powers of two. To support images of arbitrary size, PSD uses padding; the number of threads, however, must be a power of two. In order to test the performance of PSD, we also used the SunFire V40z, with ImageJ version 1.39s. The following Java options were used: -d64 -server -Xms15g -Xmx15g -XX:+UseParallelGC. The test image (see Fig. 1) is a picture of Ed White performing the first U.S. spacewalk in 1965 [24]. The true image is of size 4096 × 4096 pixels. The blurred image was generated by reflexive padding of the true data to size 6144 × 6144, convolving it with a Gaussian blur PSF (standard deviation = 20), adding 1% white noise and then cropping the resulting image to the size of 4096 × 4096 pixels.
Fig. 1. Astronaut image: blurred and restored data (panels: blurred image; blurred image, cropped; restored image, PSD; restored image, DeconvolutionJ)
Figure 1 shows the blurred data as well as the deblurred astronaut images obtained with DeconvolutionJ and PSD. To better illustrate the quality of the deblurring, we display a small region of the blurred and reconstructed images. In PSD, we used the Tikhonov method with reflexive boundary conditions and a regularization parameter equal to 0.004. Similarly, in DeconvolutionJ, we used no resizing (the image size was already a power of two), double precision for complex numbers and the same value of the regularization parameter. Table 3 presents the average execution times over 10 calls of each method. All timings are given in seconds and the numbers in brackets include the computation of the regularization parameter. One should notice a significant speedup, especially from 1 to 2 threads. The last row in Table 3 shows the execution time for DeconvolutionJ, which is over 11 times greater than the worst case of PSD (Tikhonov, FFT, 1 thread) and almost 30 times greater than the best case of PSD (Tikhonov, DCT, 8 threads). For 3-D deblurring we used exactly the same hardware and software. This time the test image (see Fig. 2) is a T1-weighted MRI image of Jeff Orchard's head [25].

Table 3. Average execution times (in seconds) for 2-D deblurring (numbers in brackets include the computation of the regularization parameter)

Method           1 thread       2 threads      4 threads      8 threads
Tikhonov, FFT    16.3 (54.3)    12.1 (37.8)    10.9 (28.8)    10.6 (27.8)
Tikhonov, DCT    14.8 (53.3)     9.1 (32.5)     6.7 (23.7)     6.1 (22.4)
DeconvolutionJ  181.7              -              -              -
Table 4. Average execution times (in seconds) for 3-D deblurring (numbers in brackets include the computation of the regularization parameter)

Method           1 thread      2 threads     4 threads     8 threads
Tikhonov, FFT    9.2 (27.8)    7.3 (18.7)    7.0 (15.6)    6.7 (14.4)
Tikhonov, DCT    6.2 (25.6)    3.9 (14.9)    2.4 (10.3)    2.0 (9.7)
DeconvolutionJ  31.6              -             -             -
The true image is of size 128 × 256 × 256 pixels. The blurred image was generated by zero padding of the true data to size 128 × 512 × 512, convolving it with a Gaussian blur PSF (standard deviation = 1), adding 1% white noise and then cropping the resulting image to the size of 128 × 256 × 256 pixels. Figure 2 shows the 63rd slice of the deblurred head images. In PSD, we used the Tikhonov method with reflexive boundary conditions and a regularization parameter equal to 0.02. In DeconvolutionJ, we used exactly the same parameters as for the 2-D astronaut image and 0.01 for the regularization parameter. In Table 4, we have collected all the timings. Once again, the execution time for DeconvolutionJ is over 3 times greater than the worst case of PSD (Tikhonov, FFT, 1 thread) and almost 16 times greater than the best case of PSD (Tikhonov, DCT, 8 threads).
Fig. 2. Head image (63rd slice): blurred and restored data (panels: blurred image; restored image, PSD; restored image, DeconvolutionJ)
4 Conclusion

In this paper we have described our research efforts to develop computationally efficient Java software for image deblurring. A key component of this software, JTransforms, is the first open source, multithreaded FFT library written in pure Java. Due to the use of a cached thread pool we are able to achieve superior performance and speedup on symmetric multiprocessing machines. Numerical results illustrate that our Parallel Spectral Deconvolution package outperforms the ImageJ plugin DeconvolutionJ, and that our Java FFT implementation, JTransforms, is highly competitive with optimized C implementations such as FFTW.
References 1. Sarder, P., Nehorai, A.: Deconvolution methods for 3D fluorescence microscopy images. IEEE Signal Proc. Mag., 32–45 (May 2006) 2. Roggemann, M.C., Welsh, B.: Imaging Through Turbulence. CRC Press, Boca Raton (1996) 3. Sechopoulos, I., Suryanarayanan, S., Vedantham, S., D’Orsi, C.J., Karellas, A.: Scatter radiation in digital tomosynthesis of the breast. Med. Phys. 34, 564–576 (2007) 4. Hansen, P.C., Nagy, J.G., O’Leary, D.P.: Deblurring Images: Matrices, Spectra and Filtering. SIAM (2006) 5. Hansen, P.C.: Rank-deficient and discrete ill-posed problems. SIAM (1997) 6. Vogel, C.R.: Computational Methods for Inverse Problems. SIAM (2002) 7. Wendykier, P.: JTransforms, Parallel Colt, Parallel Spectral Deconvolution (2008), http://piotr.wendykier.googlepages.com/ 8. Rasband, W.S.: ImageJ, U. S. National Institutes of Health, Bethesda, Maryland, USA (2008), http://rsb.info.nih.gov/ij/ 9. Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proceedings of the IEEE 93(2), 216–231 (2005) 10. Stewart, G.W.: Matrix Algorithms, Volume 1: Basic Decompositions. SIAM (1998) 11. Gonzalez, R.C., Wintz, P.: 5. Digital Image Processing. Addison-Wesley, Reading (1977) 12. Kilmer, M.E., O’Leary, D.P.: Choosing regularization parameters in iterative methods for ill-posed problems. SIAM J. Matrix Anal. Appl. 22, 1204–1221 (2001) 13. Doederlein, O.: Mustang’s HotSpot Client gets 58% faster! (2005), http://weblogs.java.net/blog/opinali/archive/2005/11/ mustangs_hotspo_1.html 14. Hoschek, W.: Colt Project (2004), http://dsd.lbl.gov/%7Ehoschek/colt/index.html 15. Heideman, M.T., Johnson, D.H., Burrus, C.S.: Gauss and the history of the fast Fourier transform. Archive for History of Exact Sciences 34, 265–277 (1985) 16. Cooley, J.W., Tukey, J.W.: An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation 19(90), 297–301 (1965) 17. Van Loan, C.: Computational Frameworks for the Fast Fourier Transform. SIAM (1992) 18. Johnson, S.G., Frigo, M.: A modified split-radix FFT with fewer arithmetic operations. IEEE Trans. Signal Processing 55(1), 111–119 (2007) 19. Yavne, R.: An economical method for calculating the discrete Fourier transform. In: AFIPS Fall Joint Computer Conference, pp. 115–125 (1968) 20. Duhamel, P., Hollmann, H.: Split Radix FFT Algorithms. Electronic Letters 20, 14–16 (1984) 21. Ooura, T.: General Purpose FFT (Fast Fourier/Cosine/Sine Transform) Package (2006), http://www.kurims.kyoto-u.ac.jp/%7Eooura/fft.html 22. Sun Microsystems: New Features and Enhancements J2SE 5.0 (2004), http://java.sun.com/j2se/1.5.0/docs/relnotes/features.html 23. Linnenbrügger, N.: FFTJ and DeconvolutionJ (2002), http://rsb.info.nih.gov/ij/plugins/fftj.html 24. NASA: Great Images in NASA. Ed White performs first U.S. spacewalk (1965), http://grin.hq.nasa.gov/ABSTRACTS/GPN-2006-000025.html 25. Orchard, J.: His Brain (2007), http://www.cs.uwaterloo.ca/%7Ejorchard/mri/
A New Signature-Based Indexing Scheme for Trajectories of Moving Objects on Spatial Networks∗ Jaewoo Chang, Jungho Um, and Youngjin Kim Dept. of Computer Eng., Chonbuk National Univ., Chonju, Chonbuk 561-756, Korea {jwchang,jhum,yzkim}@chonbuk.ac.kr
Abstract. Because moving objects usually move on spatial networks, their trajectories play an important role in indexing them for spatial network databases. In this paper, we propose a new signature-based indexing scheme for moving objects’ trajectories on spatial networks. For this, we design it so that we can efficiently deal with the trajectories of current moving objects as well as for maintaining those of past moving objects. In addition, we provide both an insertion algorithm to store the segment information of moving objects’ trajectories and a retrieval algorithm to find a set of moving objects whose trajectories match with a query trajectory. Finally, we show that our indexing scheme achieves much better performance on trajectory retrieval than the leading trajectory indexing schemes, such as TB-tree and FNR-tree. Keywords: signature-based index scheme, trajectory, spatial network.
1 Introduction
Most of the existing work in spatial databases considers Euclidean spaces, where the distance between two objects is determined by the ideal shortest path connecting them [6]. However, in practice, objects can usually move on road networks, where the network distance is determined by the length of the real shortest path connecting two objects on the network. For example, a gas station nearest to a given point in Euclidean spaces may be more distant in a road network than another gas station. Therefore, the network distance is an important measure in spatial network databases (SNDB). Recently, there have been some studies on SNDB for emerging applications such as location-based services (LBS) [1, 5, 7, 8]. First, Speicys et al. [8] dealt with a computational data model for spatial networks. Secondly, Shahabi et al. [7] presented k-nearest neighbors (k-NN) query processing algorithms for SNDB. Finally, Papadias et al. [5] designed a novel index structure for supporting query processing algorithms for SNDB. Because moving objects usually move on spatial networks, instead of on Euclidean spaces, their trajectories play an important role in indexing them for spatial network ∗
This work is financially supported by the Ministry of Education and Human Resources Development (MOE), the Ministry of Commerce, Industry and Energy (MOCIE) and the Ministry of Labor (MOLAB) through the fostering project of the Lab of Excellency. This work is also supported by the second stage of the Brain Korea 21 Project.
databases. However, there has been little research on trajectory indexing schemes for spatial networks, even though efficient index structures are required to gain good retrieval performance on their trajectories. In this paper, we propose a new signaturebased indexing scheme for moving objects’ trajectories on spatial networks. For this, we design it so that we can efficiently deal with the trajectories of current moving objects as well as for maintaining those of past moving objects. In addition, we provide both an insertion algorithm to store the segment information of moving objects’ trajectories and a retrieval algorithm to find a set of moving objects whose trajectories match with a query trajectory. The rest of the paper is organized as follows. In Sect. 2, we introduce related work. In Sect. 3, we propose a signature-based indexing scheme for moving objects’ trajectories. In Sect. 4, we provide the performance analysis of our indexing scheme. Finally, we draw our conclusion in Sect. 5.
2 Related Work
There has been little research on trajectory indexing schemes for spatial networks, so we overview both a predominant trajectory index structure for Euclidean spaces and a leading trajectory index structure for spatial networks. First, Pfoser et al. [4] proposed a hybrid index structure which preserves trajectories as well as allows for R-tree-style range search in Euclidean spaces, called the TB-tree (Trajectory-Bundle tree). The TB-tree has fast accesses to the trajectory information of moving objects, but it has a couple of problems in SNDB. First, because moving objects move on a predefined spatial network in SNDB, the paths of moving objects overlap due to frequently used segments, like downtown streets. This leads to a large volume of overlap among the MBRs of internal nodes. Secondly, because the TB-tree constructs a three-dimensional MBR including time, the dead space for the moving object trajectory is highly increased in the case of a long-duration movement. This leads to a large volume of overlap with other objects’ trajectories. Meanwhile, Frentzos [2] proposed a new indexing technique, called the FNR-tree (Fixed Network R-tree), for objects constrained to move on fixed networks in two-dimensional space. The general idea of the FNR-tree is to construct a forest of 1-dimensional (1D) R-trees on top of a 2-dimensional (2D) R-tree. The 2D R-tree is used to index the spatial data of the network, e.g. roads consisting of line segments, while the 1D R-trees are used to index the time interval of each object movement inside a given link of the network. The FNR-tree outperforms the R-tree in most cases, but it has a critical drawback: it has to maintain a tremendously large number of R-trees, thus leading to a great amount of storage overhead. This is because the FNR-tree constructs as many R-trees as the total number of segments in the network, which is greater than 1 million in some cases.
3 Signature-Based Indexing Scheme 3.1 Trajectory Indexing Scheme for Current Moving Objects Because moving objects change their locations continuously on road networks, the amount of trajectory information for a moving object is generally very large. To solve
the problems of the TB-tree mentioned in Sect. 2, we propose a new signature-based indexing scheme which can have fast accesses to moving object trajectories. Figure 1 shows the structure of our trajectory indexing scheme. The main idea of our trajectory indexing scheme is to create a signature of a moving object trajectory and maintain partitions which store a fixed number of moving object trajectories and their signatures together in the order of their start time. There are a couple of reasons for using partitions. First, because a partition is created and maintained depending on its start time, it is possible to efficiently retrieve the trajectories of moving objects at a given time. Next, because a partition can be accessed independently to answer a trajectory-based query, it is possible to achieve better retrieval performance by searching partitions in parallel.

Fig. 1. Signature-based trajectory indexing scheme: a memory-resident partition table points to partitions 1..n; each partition keeps its signature information and location information in main memory and its trajectory information on disk.
Our trajectory indexing scheme consists of a partition table and a set of partitions. A partition can be divided into three areas; trajectory information, location information, and signature information. A partition table maintains a set of partitions which store trajectories for current moving objects. The partition table is resided in a main memory due to its small size. To answer a user query, we find partitions to be accessed by searching the partition table. The trajectory information area maintains moving object trajectories which consist of a set of segments (or edges). The location information area contains the location of an object trajectory stored in the trajectory information area. This allows for accessing the actual object trajectories corresponding to potential matches to satisfy a query trajectory in the signature information area. The location information area also allows for filtering out irrelevant object trajectories based on the time condition of a query trajectory because it includes the start time, the current time, and the end time for a set of object trajectories. To create a signature from a given object trajectory in an efficient manner, we make use of a superimposed coding because it is very suitable to SNDB applications where the number of segments for an object trajectory is variable [10]. To achieve good retrieval performance, we store both the signature and the location information in a main memory.
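To make the superimposed-coding step concrete, the following sketch shows one way such a trajectory signature could be generated. It is an illustrative assumption, not the paper's actual implementation: the signature length F, the number of bits per segment K, and the helper names are all hypothetical choices.

    import hashlib

    F = 256   # assumed signature length in bits
    K = 3     # assumed number of bits set per trajectory segment

    def segment_bits(segment_id: str, k: int = K, size: int = F):
        """Map one trajectory segment (edge id) to k bit positions."""
        positions = []
        for i in range(k):
            digest = hashlib.md5(f"{segment_id}:{i}".encode()).hexdigest()
            positions.append(int(digest, 16) % size)
        return positions

    def trajectory_signature(segment_ids):
        """Superimpose (bitwise OR) the codes of all segments of a trajectory."""
        signature = 0
        for seg in segment_ids:
            for pos in segment_bits(seg):
                signature |= (1 << pos)
        return signature

    def may_contain(trajectory_sig: int, query_sig: int) -> bool:
        """Signature filter: every bit set in the query signature must also be set
        in the trajectory signature. False drops are impossible, false positives
        are possible, so candidates still need exact verification."""
        return (trajectory_sig & query_sig) == query_sig

A query trajectory is encoded in the same way, and a test like may_contain then acts as the filter that selects candidate entries for exact verification.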
3.2 Trajectory Indexing Scheme for Past Moving Objects To answer trajectory-based queries with a past time, it is necessary to efficiently search the trajectories of past moving objects which no longer move on road networks. The trajectories of moving objects can be divided into two groups; one being frequently used for answering queries based on current object trajectories (COTSS) and the other for answering queries based on past object trajectories (POTSS). Figure 2 shows an overall architecture of indexing schemes for moving object trajectories. When a current moving object trajectory in COTSS is no longer changed due to the completion of the object movement, the object trajectory should be moved from COTSS to POTSS. The signature and the location information areas of COTSS are resided in a main memory for fast retrieval, whereas all of three areas of POTSS are maintained in a secondary storage.
Fig. 2. Overall architecture of indexing schemes for moving objects’ trajectories: COTSS (Current Object Trajectory Storage Structure) and POTSS (Past Object Trajectory Storage Structure)
To move current object trajectories from COTSS to POTSS, we should consider three requirements; retrieval of past object trajectories in an efficient way, accesses of the small number of partitions to answer a trajectory-based query, and construction of an efficient time-based index structure. To satisfy the first requirement, we make use of a bit-sliced method [10] for constructing a signature-based indexing scheme in POTSS, instead of using a bit-string method in COTSS. In the bit-sliced method, we create a fixed-length signature slice for each bit position in the original signature string. When the number of segments in a query trajectory is m and the number of bits assigned to a segment is k, the number of page I/O accesses for answering the query in the bit-sliced method is less than k*m. Therefore, when the number of segments in a query trajectory is small, our indexing scheme requires the small number of page I/O accesses due to the small number of signature slices needed for the query. To satisfy the second requirement, we maintain all the partitions in POTSS so that they can hold the condition that if start_time(partition i)<start_time(partition i+1), end_time(partition i) ≤end_time(partition i+1). If this condition is not satisfied among partitions in POTSS, query processing may be inefficient depending on the time window distribution of partitions in POTSS, even for queries with the same time window. Actually, if all the trajectories of the partition i have completed their movements earlier than those of the partition i-1, the partition i should move from COTSS to POTSS earlier than the partition i-1, leading to the dissatisfaction of the above condition. To prevent it, we require a strategy to store partitions such that if all the trajectories of the partition i are no longer changed, but those of the partition i-1 are changed, we exchange trajectories being changed in the partition i-1 with those having the smallest end time in the partition I and then move the partition i-1 from COTSS to POTSS. To satisfy the final requirement, we construct a B+-tree by using
the end time of a partition as a key so as to have fast accesses to partitions in POTSS. Figure 3 shows the time-based B+-tree structure. A record, Rec, of a leaf node in the time-based B+-tree stores a pointer to the corresponding partition, as illustrated in Fig. 3.

Fig. 3. Time-based B+-tree structure for partitions in POTSS (leaf nodes of the sequence set hold partition pointers that delimit the search space)
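As a rough illustration of the bit-sliced organization described above (not the paper's code), each bit position of the signature can be stored as its own slice; a query then reads only the slices for the bits set in the query signature — at most k·m of them — and intersects them. The class layout below is an assumed, simplified in-memory stand-in for the on-disk slices.

    from typing import List, Set

    class BitSlicedSignatureFile:
        """Toy bit-sliced signature store: slices[p] is the set of entry ids whose
        signature has bit p set. A real system stores fixed-length bit vectors on disk."""

        def __init__(self, signature_length: int):
            self.slices: List[Set[int]] = [set() for _ in range(signature_length)]

        def insert(self, entry_id: int, signature: int):
            p = 0
            while signature:
                if signature & 1:
                    self.slices[p].add(entry_id)
                signature >>= 1
                p += 1

        def candidates(self, query_signature: int) -> Set[int]:
            """Intersect only the slices selected by the query signature."""
            result = None
            p = 0
            while query_signature:
                if query_signature & 1:
                    slice_p = self.slices[p]
                    result = slice_p.copy() if result is None else (result & slice_p)
                query_signature >>= 1
                p += 1
            return result if result is not None else set()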
3.3 Insertion Algorithms for Moving Object Trajectories The algorithms for inserting moving objects trajectories can be divided into an initial trajectory insertion algorithm and a segment insertion algorithm for its trajectory. For the initial trajectory insertion, we find the last partition in the partition table and obtain an available entry (NE) in the last partition. The initial trajectory insertion can be performed according to two cases; one with no expected future trajectories and the other with expected trajectories. The detailed algorithm will be omitted due to its space requirement. For the segment insertion of a moving object trajectory, we find a partition storing its trajectory from the partition table by using the start time (ST) of the moving object. In addition, we obtain the entry storing the trajectory information in the partition. Figure 4 shows the segment insertion algorithm (i.e., InsertSeg) for moving object trajectories. Here NE is the entry in the partition covering the object identified by MOid and Loc is the location of the NE entry in the trajectory information area. The segment insertion can be performed in two cases. First, for a segment insertion for trajectories with no expected future ones, we just store a new segment (TrajSeg) into the NE entry of the trajectory information area, being addressed by Loc. In addition, we generate a trajectory signature (SigTS) from the TrajSeg and store the SigTS into the NE entry of the signature information area. Then, we store <MOid, Loc, StartT, CurrentT, ExpectET> into the NE entry of the location information area. Secondly, for a segment insertion for trajectories with expected future ones, we can store a new segment according to three types of the discrepancy between a new segment and the expected segment of a trajectory. To check if a new segment ac-cords with an expected trajectory’s segment, we call a find-seg() function to find a segment coinciding with TrajSeg from the expected trajectory of the NE entry. First, in case of no segment coinciding with TrajSeg (seg_pos = 0), we perform the same procedure as the segment insertion algorithm with no expected future
segments. In addition, we move the trajectory’s expected segments backward by one and store the TrajSeg into the (#_actual_seg)-th segment of the NE entry. Secondly, in case the segment coinciding with TrajSeg is the first one (seg_pos = 1), we store only the TrajSeg into the (#_actual_seg)-th segment of the NE entry because the TrajSeg is the same as the first expected segment of the trajectory. Otherwise (seg_pos > 1), we delete the (seg_pos-1) number of segments from the expected segments of the NE entry, store the TrajSeg into the (#_actual_seg)-th segment, and move all the expected segments forward by seg_pos-2. If the ratio of mismatched segments (#_mismatch) over all the segments of the trajectory is less than a threshold (τ), we store the trajectory signature (SigTS) generated from the TrajSeg into the NE entry of the signature information area. Otherwise, we regenerate SigTS from the trajectory information by calling a signature regeneration function (regenerate_sig). Finally, we update the values of #_actual_seg, #_future_seg, and #_mismatch in the NE entry, and we update the CurrentT of the NE entry in the location information area and that of the partition P’s entry in the partition table.

    Algorithm InsertSeg(MOid, TrajSeg, ST)
    /* TrajSeg contains a segment for the trajectory of a moving object MOid,
       to be stored with an object trajectory’s start time, ST */
    1.  Generate a signature SigTS from TrajSeg
    2.  Locate a partition P covering ST in the partition table
    3.  Locate an entry E covering ST for the moving object with MOid and get its
        location, Loc, in the trajectory information area
    4.  Obtain #actual_seg, #future_seg, and #mismatch of the trajectory info entry E
        (i.e., TE) for the MOid in P
    5.  if (#future_seg = 0) {                  // no expected trajectory
    6.      Insert TrajSeg into the (#actual_seg+1)-th trajectory segment of TE
    7.      Store SigTS into the entry E of the signature info area in P }
    8.  else {                                  // expected trajectory exists
    9.      seg_pos = find_seg(TrajSeg, Loc)
    10.     #actual_seg++, #future_seg = #future_seg – seg_pos
    11.     case (seg_pos = 0) {                // found no segment
    12.         Insert TrajSeg into the segment of TE and relocate the future
                trajectory segments backward
    13.         Store SigTS into entry E of the signature info area in P }
    14.     case (seg_pos = 1)                  // found the first segment
    15.         Insert TrajSeg into the (#actual_seg)-th trajectory segment of TE,
                exchanging the old segment
    16.     case (seg_pos > 1) {                // found the (seg_pos)-th segment
    17.         #mismatch = #mismatch + seg_pos – 1
    18.         Insert TrajSeg into the (#actual_seg)-th segment of TE and relocate
                the future trajectory segments forward
    19.         if (#mismatch/(#future_seg+#actual_seg) > τ)
                    regenerate_sig(Loc, SigTS, E, P) }   // end of case
    20. }                                        // end of else
    21. Update #actual_seg, #future_seg, and #mismatch of TE
    22. CurrentT = te of TrajSeg
    23. Store CurrentT into the current_time of the entry E and into the
        current_time of the partition P entry
    End InsertSeg

Fig. 4. Segment insertion algorithm for moving object trajectories
3.4 Retrieval Algorithm for Moving Object Trajectories The retrieval algorithm for moving object trajectories finds a set of objects whose trajectories match the segments of a query trajectory. Figure 5 shows the retrieval algorithm (i.e., Retrieve) for moving object trajectories. To find a set of partitions satisfying the time interval (TimeRange) represented by
    Algorithm Retrieve(QSegList, TimeRange, MOidList)
    /* MOidList is the list of moving objects’ ids that satisfy QSegList for TimeRange */
    1.  QSig = 0, #qseg = 0, partList = Ø
    2.  t1 = TimeRange.lower, t2 = TimeRange.upper
    3.  for each segment QSj of QSegList {
    4.      Generate a signature QSSj from QSj
    5.      QSig = QSig | QSSj, #qseg = #qseg + 1 }
    6.  find_partition(TimeRange, partList)
    7.  for each partition Pn of partList {
    8.      Obtain a set of candidate entries, CanList, by examining the signatures
            of the signature info area in Pn
    9.      for each candidate entry Ek of CanList {
    10.         Let s, e, c be start_time, end_time, current_time of the entry Ek
                of the location information area
    11.         if ((s ≤ t2) AND (e ≥ t1 OR c ≥ t1)) {
    12.             #matches = 0
    13.             Obtain the first segment ESi of the entry Ek, TEk, and the first
                    segment QSj of QSegList
    14.             while (ESi ≠ NULL and QSj ≠ NULL) {
    15.                 if (match(ESi, QSj) = FALSE) Obtain the next segment ESi of TEk
    16.                 else { #matches = #matches + 1
    17.                        Obtain the first segment ESi of TEk }
    18.                 if (#matches = #qseg) MOidList = MOidList ∪ {TEk’s MOid}
    19.             } } }   // end of while // end of if // end of for - CanList
    20. }                   // end of for - partList
    End Retrieve

Fig. 5. Retrieval algorithm for moving object trajectories
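The retrieval logic of Fig. 5 can be paraphrased as the following sketch. It is illustrative only: the partition and entry attributes are simplified assumptions, and it reuses the hypothetical trajectory_signature and may_contain helpers from the earlier signature sketch.

    def retrieve(partitions, query_segments, t1, t2):
        """Return ids of moving objects whose trajectories contain all query
        segments (in order) and overlap the time window [t1, t2]."""
        query_sig = trajectory_signature(query_segments)  # from the earlier sketch
        result = set()
        for part in partitions:
            # partition-level time filter (the paper uses the partition table / B+-tree)
            if part.start_time > t2 or part.end_time < t1:
                continue
            for entry in part.entries:
                # signature filter: drop entries that cannot possibly match
                if not may_contain(entry.signature, query_sig):
                    continue
                # entry-level time filter on start, end and current times
                if entry.start_time > t2 or max(entry.end_time, entry.current_time) < t1:
                    continue
                # exact verification: query segments must appear in order
                it = iter(entry.segments)
                if all(any(seg == s for s in it) for seg in query_segments):
                    result.add(entry.mo_id)
        return result

Because the query signature combines all query segments into a single filter, the number of candidate entries that need exact verification stays small even for long query trajectories, which is consistent with the roughly constant retrieval time reported below.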
4 Performance Analysis
We implement our trajectory indexing scheme on a Pentium-IV 2.0 GHz CPU with 1 GB main memory, running Windows 2003. For our experiment, we use a road network consisting of 170,000 nodes and 220,000 edges [9]. We also generate 50,000 moving objects randomly on the road network by using Brinkhoff’s algorithm [1]. For performance analysis, we compare our indexing scheme with the TB-tree and the FNR-tree in terms of insertion time and retrieval time for moving object trajectories. First, Table 1 shows the insertion performance for storing one moving object trajectory. The result shows that our indexing scheme preserves nearly the same insertion performance as the TB-tree, whereas the FNR-tree provides about two orders of magnitude worse insertion performance than the TB-tree. This is because the FNR-tree constructs a tremendously large number of R-trees, i.e., one per segment in the road network.

Table 1. Trajectory insertion performance

                                      TB-tree   FNR-tree   Our indexing scheme
  Trajectory insertion time (sec)     1.232     401        1.606
We measure retrieval time for answering queries whose trajectories contain 2 to 20 segments. Figure 6 shows the trajectory retrieval performance. The results show that our indexing scheme requires about 20 ms while the FNR-tree and the TB-tree need 25 ms and 93 ms, respectively, when the number of segments in a query is 2. Our indexing scheme thus outperforms the existing schemes even when the number of segments in a query trajectory is small. On the contrary, the TB-tree achieves poor retrieval performance due to the large extent of overlap in its internal nodes even when the number of segments in a query trajectory is small. As the number of segments in queries increases, the retrieval time increases for both the FNR-tree and the TB-tree; our indexing scheme, however, requires roughly constant retrieval time. The reason is that our indexing scheme creates a query signature combining all the segments in a query and searches for potentially relevant trajectories of moving objects once, by using the query signature as a filter.
Fig. 6. Trajectory retrieval performance (wall time in seconds vs. number of query segments, 2–20, for the TB-tree, the FNR-tree, and our indexing scheme)
When the number of segments in a query is 20, our indexing scheme requires about 20 ms while the FNR-tree and the TB-tree need 150 ms and 850 ms, respectively. Thus our indexing scheme achieves about one order of magnitude better retrieval performance than the existing schemes. This is because our indexing scheme constructs an efficient signature-based index structure by using a superimposed coding technique. On the contrary, the TB-tree builds an MBR for each segment in a query and performs a range search for each MBR. Because the number of range searches increases in proportion to the number of segments, the TB-tree’s trajectory retrieval performance degrades dramatically when the number of segments is large. Similarly, the FNR-tree has to search an R-tree for each segment in a query. Because it accesses as many R-trees as there are segments in the query, the FNR-tree’s retrieval performance also degrades as the number of segments increases.
5 Conclusion Because moving objects usually move on spatial networks, instead of on Euclidean spaces, efficient index structures are needed to gain good retrieval performance on
their trajectories. However, there has been little research on trajectory indexing schemes for spatial network databases. Therefore, we proposed a signature-based indexing scheme for moving objects’ trajectories on spatial networks so that we might efficiently deal with the trajectories of current moving objects as well as for maintaining those of past moving objects. In addition, we provided both a segment insertion algorithm and a retrieval algorithm. Finally, we showed that our indexing scheme could achieve, to a large extent, about one order of magnitude better retrieval performance than the existing schemes, such as the FNR-tree and TB-tree. As future work, it is required to extend our indexing scheme to a parallel environment so as to achieve better retrieval performance due to the characteristic of signature files [10].
References
1. Brinkhoff, T.: A Framework for Generating Network-Based Moving Objects. GeoInformatica 6(2), 153–180 (2002)
2. Frentzos, R.: Indexing Moving Objects on Fixed Networks. In: International Conference on Spatial and Temporal Databases, Santorini Island, Greece, pp. 289–305 (2003)
3. Faloutsos, C., Christodoulakis, S.: Signature Files: An Access Method for Documents and Its Analytical Performance Evaluation. ACM Transactions on Office Information Systems 2(4), 267–288 (1984)
4. Pfoser, D., Jensen, C.S., Theodoridis, Y.: Novel Approach to the Indexing of Moving Object Trajectories. In: 27th International Conference on VLDB, Egypt, pp. 395–406 (2000)
5. Papadias, S., Zhang, J., Mamoulis, N., Tao, Y.: Query Processing in Spatial Network Databases. In: 29th International Conference on VLDB, Germany, pp. 802–813 (2003)
6. Shekhar, S.: Spatial Databases - Accomplishments and Research Needs. IEEE Transactions on Knowledge and Data Engineering 11(1), 45–55 (1999)
7. Shahabi, C., Kolahdouzan, M.R., Sharifzadeh, M.: A Road Network Embedding Technique for K-Nearest Neighbor Search in Moving Object Databases. GeoInformatica 7(3), 255–273 (2003)
8. Speicys, L., Jensen, C.S., Kligys, A.: Computational Data Modeling for Network-Constrained Moving Objects. In: 17th ACM International Symposium on Advances in Geographic Information Systems, New Orleans, Louisiana, USA, pp. 118–125 (2003)
9. Penn State University Libraries, http://www.maproom.psu.edu/dcw/
10. Zobel, J., Moffat, A., Ramamohanarao, K.: Inverted Files Versus Signature Files for Text Indexing. ACM Transactions on Database Systems 23(4), 453–490 (1998)
Effective Emission Tomography Image Reconstruction Algorithms for SPECT Data
J. Ramírez1, J.M. Górriz1, M. Gómez-Río2, A. Romero1, R. Chaves1, A. Lassl1, A. Rodríguez2, C.G. Puntonet4, F. Theis5, and E. Lang3
1 Dept. of Signal Theory, Networking and Communications, University of Granada, Spain [email protected] 2 Servicio de Medicina Nuclear, Hospital Universitario Virgen de las Nieves (HUVN), Granada, Spain 3 Institut für Biophysik und physikalische Biochemie, University of Regensburg, Germany 4 Dept. of Architecture and Computer Technology, University of Granada, Spain 5 Max Planck Institute for Dynamics and Self-Organisation, Bernstein Center for Computational Neuroscience, Göttingen, Germany
Abstract. Medical image reconstruction from projections is computationally intensive task that demands solutions for reducing the processing delay in clinical diagnosis applications. This paper analyzes reconstruction methods combined with pre- and post-filtering for Single Photon Emission Computed Tomography (SPECT) in terms of convergence speed and image quality. The evaluation is performed by means of an image database taken from a concurrent study investigating the use of SPECT as a diagnostic tool for the early onset of Alzheimer-type dementia. Filtered backprojection (FBP) methods combined with frequency sampling 2D pre- and post-filtering provides a good trade-off between image quality and delay. Maximum likelihood expectation maximization (ML-EM) improves the quality of the reconstructed image but with a considerable increase in processing delay. To overcome this problem the ordered subsets expectation maximization (OS-EM) method is found to be an effective algorithm for reducing the computational cost with an image quality similar to ML-EM.
1 Introduction
Emission-computed tomography (ECT) has been widely employed in biomedical research and clinical medicine during the last three decades. ECT differs fundamentally from many other medical imaging modalities in that it produces a mapping of physiological functions as opposed to imaging anatomical structure. Tomographic radiopharmaceutical imaging, or ECT, provides in vivo three-dimensional maps of a pharmaceutical labeled with a gamma-ray-emitting radionuclide. The distribution of radionuclide concentrations is estimated from a set of projectional images acquired at many different angles around the patient.
Single Photon Emission Computed Tomography (SPECT) imaging techniques employ radioisotopes which decay emitting predominantly a single gamma photon. This represents the fundamental difference between PET (Positron Emission Tomography) and SPECT. PET systems employ isotopes in which a couple of photons are produced in each individual annihilation. There is a rich variety of isotopes that decay, emitting a single-photon and which consequently can be utilized in SPECT. When the nucleus of a radioisotope disintegrates, a gamma photon is emitted with a random direction which is uniformly distributed in the sphere surrounding the nucleus. If the photon is unimpeded by a collision with electrons or other particles within the body, its trajectory will be a straight line or “ray”. In order for a photon detector external to the patient to discriminate the direction that a ray is incident from, a physical collimation is required. Typically, lead collimator plates are placed prior to the the detector’s crystal in such a manner that the photons incident from all but a single direction are blocked by the plates. This guarantees that only photons incident from the desired direction will strike the photon detector. Brain SPECT has become an important diagnostic and research tool in nuclear medicine. The ultimate value of this procedure depends on good technique in acquisition setup and proper data reconstruction [1,2]. This paper analyzes reconstruction methods combined with pre- and post-filtering for Single Photon Emission Computed Tomography (SPECT) in terms of convergence speed and image quality.
2 Filtered Backprojection Reconstruction
An image of the cross section of an object can be recovered or reconstructed from the projection data. In ideal conditions, projections are a set of measurements of the integrated values of some parameter of the object. If the object is represented by a two-dimensional function f(x, y) and each line integral by the (θ, t) parameters, the line integral Pθ(t) is defined as:

\[ P_\theta(t) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x,y)\,\delta(x\cos\theta + y\sin\theta - t)\,dx\,dy \qquad (1) \]
The function Pθ (t) is known as the Radon transform of the function f (x, y). A projection is formed by combining a set of line integrals. The simplest projection is a collection of parallel ray integrals and is given by Pθ (t) for a constant θ. Another type of projection is possible if a single source is placed in a fixed position relative to a line of detectors. This projection is known as fan beam projection because the line integrals are measured along fans. The key to tomographic imaging is the Fourier Slice Theorem which relates the measured projection data to the two-dimensional Fourier transform of the object cross section. The Fourier Slice Theorem is stated as follows: “The Fourier
transform Sθ(w) of a parallel projection Pθ(t) of an image f(x, y) taken at angle θ and defined to be:

\[ S_\theta(w) = \int_{-\infty}^{+\infty} P_\theta(t)\,\exp(-j 2\pi w t)\,dt \qquad (2) \]

gives a slice of the two-dimensional Fourier transform:

\[ F(u, v) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x, y)\,\exp(-j 2\pi (ux + vy))\,dx\,dy \qquad (3) \]

subtending an angle θ with the u-axis”, that is,

\[ S_\theta(w) = F(u = w\cos\theta,\; v = w\sin\theta) \qquad (4) \]

The above result is the essence of straight ray tomography and indicates that by having projections of an object function at angles θ1, θ2, ..., θk and taking the Fourier transform of them, the values of F(u, v) can be determined on radial lines. In practice only a finite number of projections of an object can be taken. In that case it is clear that the function F(u, v) is only known along a finite number of radial lines, so that one must then interpolate from these radial points to the points on a square grid. The filtered backprojection (FBP) algorithm can be easily derived from the Fourier Slice Theorem. An image of the cross section f(x, y) of an object can be recovered by:

\[ f(x, y) = \int_{0}^{\pi} Q_\theta(x\cos\theta + y\sin\theta)\,d\theta \qquad (5) \]

where

\[ Q_\theta(t) = \int_{-\infty}^{+\infty} S_\theta(w)\,|w|\,\exp(j 2\pi w t)\,dw \qquad (6) \]
The FBP algorithm then consists of two steps: the filtering part, which can be visualized as a simple weighting of each projection in the frequency domain, and the backprojection part.
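For readers who want to experiment, the two steps can be prototyped in a few lines of Python/NumPy. This is a didactic sketch under simplifying assumptions (parallel-beam geometry, plain ramp filter, nearest-neighbour backprojection), not the evaluation code used in this paper.

    import numpy as np

    def fbp_reconstruct(sinogram, thetas_deg):
        """sinogram: (num_angles, num_detectors) parallel projections P_theta(t).
        Returns an (N, N) reconstruction with N = num_detectors."""
        n_angles, n_det = sinogram.shape
        # 1) Filtering: weight each projection by |w| in the frequency domain (ramp filter)
        ramp = np.abs(np.fft.fftfreq(n_det))
        filtered = np.real(np.fft.ifft(np.fft.fft(sinogram, axis=1) * ramp, axis=1))
        # 2) Backprojection: smear each filtered projection back across the image
        N = n_det
        xs = np.arange(N) - N / 2.0
        X, Y = np.meshgrid(xs, xs)
        image = np.zeros((N, N))
        for proj, theta in zip(filtered, np.deg2rad(thetas_deg)):
            t = X * np.cos(theta) + Y * np.sin(theta) + N / 2.0
            idx = np.clip(np.round(t).astype(int), 0, n_det - 1)
            image += proj[idx]
        return image * (np.pi / (2 * n_angles))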
3 Maximum Likelihood Expectation Maximization (ML-EM)
In emission tomography, a compound containing a radioactive isotope is introduced into the body and forms an unknown emitter density λ(x, y) under the body’s functional activity. Emissions then occur according to a Poisson process. The acquisition system usually consists of D detectors, so that the measured data n*(1), ..., n*(D) represent the counts of photons emitted by the body and measured by each one of the detectors. The maximum likelihood expectation maximization (ML-EM) algorithm [3,4,5,6] determines an estimate λ̂ of λ which maximizes the probability p(n*(1), ..., n*(D) | λ) of observing the actual detector count data over all possible densities. Let n(b) represent the number of unobserved emissions in each of B boxes (pixels) partitioning an object containing an emitter, and let p(b, d) be the probability that an emission in box b is detected in detector unit d. ML-EM is an iterative reconstruction algorithm which starts with an initial estimate λ̂⁰ and gives the new estimate λ̂ from an old estimate λ̂′:

\[ \hat{\lambda}(b) = \hat{\lambda}'(b) \sum_{d=1}^{D} \frac{n^{*}(d)\, p(b, d)}{\sum_{b'=1}^{B} \hat{\lambda}'(b')\, p(b', d)} \qquad (7) \]
4 Ordered Subset Expectation Maximization (OS-EM)
The application of Expectation Maximization (EM) algorithms in emission tomography has led to the introduction of many related techniques. While quality of reconstruction is good, the application of EM is computer intensive and its convergence slow. Ordered subset expectation maximization (OS-EM) [7] algorithm for computed tomography groups projection data in ordered subsets. The standard EM algorithm (i.e., projection followed by backprojection) is then applied to each of the subsets in turn. The resulting reconstruction becomes the starting value for use with the next subset. An iteration of the OS-EM algorithm is defined as a single pass through all the specified subsets. Further iterations may be performed by passing through the same ordered subsets, using as a starting point the reconstruction provided by the previous iteration. By selecting mutually exclusive subsets, each OS-EM iteration has a similar computation to a single EM iteration. In SPECT, the sequential processing of ordered subsets is very natural, as projection data is collected separately for each projection angle (as a camera rotates around the patient in SPECT); counts on single projections can form successive subsets.
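A compact way to see the relationship between ML-EM (7) and OS-EM is the following sketch, in which rows of an assumed system matrix play the role of the ordered projection subsets; with a single subset it reduces to plain ML-EM. This is an illustrative prototype, not the implementation evaluated here, and the extra division by the subset sensitivity is the usual normalisation (Eq. (7) as printed assumes the sensitivities sum to one).

    import numpy as np

    def os_em(A, counts, n_iters=1, n_subsets=15, eps=1e-12):
        """A: (D, B) detection probabilities p(b, d), one row per detector element.
        counts: (D,) measured detector counts n*(d). Returns emitter estimate (B,)."""
        D, B = A.shape
        lam = np.ones(B)                           # initial estimate lambda^0
        subsets = np.array_split(np.arange(D), n_subsets)
        for _ in range(n_iters):
            for sub in subsets:                    # one EM-style update per ordered subset
                A_s = A[sub]                       # rows of the current subset
                forward = A_s @ lam + eps          # expected counts in this subset
                ratio = counts[sub] / forward
                lam = lam * (A_s.T @ ratio) / (A_s.sum(axis=0) + eps)
            # with n_subsets == 1, the inner loop is exactly one ML-EM update of Eq. (7)
        return lam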
5 Prefiltering and Postfiltering
A major drawback of FBP algorithms for tomographic image reconstruction is the undesired amplification of the high frequency noise and its impact on image quality. These effects are caused by the filtering operation or multiplication of Sθ (w) by |w| in equation 6. In order to attenuate the high frequency noise amplified during FBP reconstruction, a number of window function has been proposed. In this way, the reconstruction method described by equations 5 and 6 is normally redefined by applying a frequency window which returns to zero as the frequency tends to π. Among the most common window functions used for FBP reconstruction are: i) sinc (Shepp-Logan filter), ii) cosine, iii) Hamming and, iv ) Hanning window functions. However, even when the reconstruction noise is kept low using a noise controlled FBP approach, the noise captured by the acquisition system needs to be filtered out to improve the quality of reconstructed images.
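The windowed ramp filters mentioned above can be written down directly. The sketch below gives plausible definitions of the |w|-weighted Shepp-Logan, cosine, Hamming and Hanning windows; exact cut-off handling and constants vary between implementations, so treat it as an assumption-laden illustration rather than the filters used in this study.

    import numpy as np

    def windowed_ramp(n_det, window="hann"):
        """Return a frequency-domain FBP filter |w| * W(w) of length n_det."""
        w = np.abs(np.fft.fftfreq(n_det))      # ramp |w|
        x = w / (w.max() + 1e-12)              # normalised frequency in [0, 1]
        if window == "ramp":
            win = np.ones_like(w)
        elif window == "shepp-logan":
            win = np.sinc(x / 2.0)             # sinc window
        elif window == "cosine":
            win = np.cos(np.pi * x / 2.0)
        elif window == "hamming":
            win = 0.54 + 0.46 * np.cos(np.pi * x)
        elif window == "hann":
            win = 0.5 * (1.0 + np.cos(np.pi * x))
        else:
            raise ValueError(f"unknown window: {window}")
        return w * win

Such a filter would replace the plain ramp array in the FBP sketch above, attenuating the amplified high-frequency noise at the cost of some resolution.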
Fig. 1. a) SPECT sinogram acquired by a three head gammacamera, b) Filtered SPECT data, c) Frequency response of noise filter
Moreover, the preprocessing stage of most automatic SPECT image processing systems often incorporates prefiltering, reconstruction and postfiltering to minimize the noise acquired by the gammacamera as well as the noise amplified during FBP reconstruction.
6 Image Quality and Performance Evaluation
Image files were taken from a concurrent study investigating the use of SPECT as a diagnostic tool for the early onset of Alzheimer-type dementia. SPECT data were acquired by a three head gammacamera Picker Prism 3000. The patient is injected with a 99m Tc-ECD radiopharmeceutical which emits gamma rays that are detected by the detector. A total of 180 projections were taken with a 2-degree angular resolution. Fig. 1 shows the effects of noise and filtering on projection data acquired by the gammacamera. The acquired sinogram shows a visible high frequency noise that is effectively filtered by a 2D frequency sampling FIR filter. Fig. 2 shows the effect of noise on FBP image reconstruction and the effectiveness of pre- and post-filtering. Note that, when a simple ramp filter is used, the noise completely
Fig. 2. FBP reconstruction of a SPECT image with a) RAM-LAK filter, b) with FIR prefiltering, c) with pre- and post-filtering
corrupts the reconstructed image. Pre- and post-filtering improves the quality of FBP reconstruction by removing the huge high frequency noise present in SPECT data and the residual noise after reconstruction. Thus, FBP yields good image quality for analysis and display of SPECT data although residual noise is observed in the images as shown in Fig. 2a. ML-EM algorithm is an iterative algorithm for image reconstruction that better models the photons emitted by a radioactive source and yields better image quality when compared to FBP. Fig. 3 shows the slow convergence of ML-EM to a final reconstructed image. It is shown that the iterative algorithm described by equation 7 requires from 15 to 20 iterations to converge and yields a final image with improved quality when compared to FBP as shown in Fig. 4. OS-EM image reconstruction, that groups projection data in ordered subsets and performs projection followed by backprojection to each of the subsets in turn, yields an acceleration on the convergence of the image reconstruction process. Thus, with a number of subset equal to the number of iterations required by
Fig. 3. Convergence of ML-EM SPECT image reconstruction
Fig. 4. Comparison of FBP and ML-EM methods for SPECT image reconstruction. a) FBP, b) ML-EM.
Fig. 5. Result after one OSEM iteration. Image reconstruction partitioning the set of detectors into a) 10, b) 15 and c) 20 subsets.
ML-EM to converge, OS-EM converges in a single iteration performed over all the subsets. Thus, for the example shown in Fig. 3, OS-EM should converge by partitioning the whole set of detector elements into about 15-20 subsets. Fig. 5 shows the results of a single iteration performed by the OS-EM algorithm with a different number of subsets. It is clearly shown that with 15 and 20 subsets the image appear to be of similar quality to ML-EM with the advantage of a speedup of the order of the number of subsets.
7 Conclusions
Classical filtered backprojection and statistical maximum likelihood expectation maximization image reconstruction algorithms were evaluated in terms of image quality and processing delay. Image files were taken from a concurrent study investigating the use of SPECT as a diagnostic tool for the early onset of Alzheimer-type dementia. FBP image reconstruction needs careful noise control since it tends to amplify high-frequency noise. Pre- and post-filtering improves the quality of FBP reconstruction by removing the huge high frequency
noise present in SPECT data and the residual noise after reconstruction. ML-EM yields better image quality when compared to FBP since a precise statistical model of the emission is used. However, the processing delay is considerable due to its slow convergence. OS-EM was found to be a good trade-off between image quality and processing delay since it converges in a single iteration by partitioning the set of detector elements into about 15-20 subsets. Acknowledgments. This work has been funded by the PETRI project DENCLASES (PET2006-0253) of the Spanish MEC and the regional Excellence Project (TIC-02566) of the Consejería de Innovación Ciencia y Empresa (Junta de Andalucía, Spain). We also acknowledge financial support by the German Academic Exchange Service (DAAD).
References
1. Vandenberghea, S., D’Asselera, Y., de Wallea, R.V., Kauppinenb, T., Koolea, M., Bouwensa, L., Laerec, K.V., Lemahieua, I., Dierckx, R.: Iterative reconstruction algorithms in nuclear medicine. Computerized Medical Imaging and Graphics, 105–111 (2001)
2. Bruyant, P.P.: Analytic and iterative reconstruction algorithms in SPECT. The Journal of Nuclear Medicine 43, 1343–1358 (2002)
3. Shepp, L.A., Vardi, Y.: Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging MI-1, 113–122 (1982)
4. Vardi, Y., Shepp, L.A., Kaufman, L.: A statistical model for positron emission tomography. Journal of the American Statistical Association 80, 8–20 (1985)
5. Lange, K., Carson, R.: EM reconstruction for emission and transmission tomography. Journal of Computer Assisted Tomography 8, 306–312 (1984)
6. Chornoboy, E.S., Chen, C.J., Miller, M.I., Miller, T.R., Snyder, D.L.: An evaluation of maximum likelihood reconstruction for SPECT. IEEE Transactions on Medical Imaging 9, 99–110 (1990)
7. Hudson, H.M., Larkin, R.S.: Accelerated image reconstruction using ordered subsets of projection data. IEEE Transactions on Medical Imaging 13, 601–609 (1994)
New Sky Pattern Recognition Algorithm Wojciech Makowiecki1 and Witold Alda2 1
Astronomical Observatory of the Jagiellonian University, Kraków, Poland [email protected] http://www.wojciech.us 2 AGH University of Science and Technology, Kraków, Poland [email protected]
Abstract. We present here a new algorithm which enables the identification of stars in astronomical photographs using readily available star catalogs for this purpose. The algorithm was implemented in a standalone application called ‘Skyprint‘ capable of performing a matching process. The computational aspect of the problem can be designated to the wide class of image recognition methods and analysis of multidimensional data. The astronomical aspect concentrates on astrometry – the method of determining the coordinates of stars in the celestial sphere. The problem of identifying star patterns occurs most often in such areas as cosmic probe navigation, adjusting and merging numerous photographs of the sky together, or in recovering missing information in relation to a fragment of the sky represented in the photograph. Keywords: Pattern recognition, Astrometry, Geometric hashing.
1 Introduction
The concept of matching the stars in a photograph with stars in the catalog has been investigated for many years. The majority of algorithms which are able to successfully solve the problem are based on the idea of comparing polygonal shapes in the photograph with those in the catalog. The shapes in the photograph are created by connecting stars in the photograph together and likewise the shapes in the catalog. We can distinguish two strategies of approaching the problem of matching these two groups of shapes. The first one, used for example by Valdes [1], Groth [2] and Murtagh [3], relies on heavy restrictions that are put on the fragment of the catalog to be compared with the photograph. We must know that the chosen fragment of the catalog contains most of the stars visible in the photograph. Also the size of the fragment can not be larger than a few times the size of the field of view (FOV) shown on the photograph. The shapes (in this case triangles), made up of stars brighter than a certain limiting magnitude, are created in the picture and in the chosen part of the catalog. Then the triangles are saved in two separate lists so that they can be later investigated for similarities. Confirmation that two found shapes are made up of the same objects is done in the so called ‘voting process‘. M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 749–758, 2008. c Springer-Verlag Berlin Heidelberg 2008
In this paper the authors have concentrated on proposing a new algorithm belonging to the second group, that has emerged recently e.g. Lang [4]. Algorithms belonging to this group are able to recognize an arbitrary fragment of the catalog shown on the photograph, without being limited by cases where the approximate location is not known. Such algorithms are not usually able to identify particular stars belonging to the shape, however, for this purpose, one of the algorithms from the first group which has already been developed, can be chosen. The main idea behind the algorithm presented is to create and subsequently compare two sets of convex quads. The first set is created by using the stars found in the photograph, and the second one by using the stars present in the catalog.
2 Algorithm
One of the most important issues in the algorithm is its method of characterizing the shapes. We need to do this in a way which is invariant with respect to translation, rotation and scaling. Each shape consists of N points. Each point has 2 coordinates, therefore we need 2N parameters to determine the positions of all N points in the 2-dimensional space. As translation, rotation and scaling in a 2D coordinate system require 4 degrees of freedom, we end up with the 2N − 4 parameters which determine the shape unambiguously. This gives us 2 parameters for each triangle and 4 parameters for each quad. In our approach we have chosen to use quads and to compare them by using only 3 internal angles of each quad. Although such a measure is not sufficient to determine the convex quad unambiguously, we sacrifice this for the advantage of having only three parameters for each shape. Our tests show that such an approach performs well and is sufficient in most cases. The algorithm consists of several steps:
1. Creating the ‘grid of seeds‘. By ‘seed‘ we mean a virtual point having 2 coordinates (Right Ascension and Declination). We define ‘grid of seeds‘ as points distributed in one of any number of regular ways. For example one can choose to satisfy the following condition:
   (number of seeds per unit area) / (number of stars per unit area) ≈ const.   (1)
   We believe this rule of placing seeds is one of the best.
2. Choosing only the convex shapes for later comparison. Now we use the seeds to create shapes as shown in Fig. 1. We try to create a convex quad in the vicinity of each seed. A case with three collinear stars is treated as a correct convex quad.
3. Angle calculation, discretization and hashing. In this stage we first calculate angles by using one of three different methods mentioned later. Subsequently we discretize angles every 2 degrees. For each shape we store 5 values: 3 discrete angles and 2 coordinates of the seed which
Fig. 1. Stages of creating a convex quad. Black dots denote seeds of the grid, blue circles denote stars. Sequence of stages are as follows: a) select one seed and choose 4 closest stars b) create quad only if these 4 stars can make a convex one c) calculate internal angles of the quad d) choose the largest angle and two more angles by moving in a clockwise direction, remember values and sequence of these 3 angles.
was used to create this particular quad. We use a standard hash function with primes as coefficients in the form: (α ∗ 23827 + β ∗ 34693 + γ ∗ 46021) % 904997
(2)
which transforms 3 input angles (α, β and γ) to an array index. This method enables quick matching of quads. 4. Division of sky into fragments In this step, in order to use ‘voting‘, we divide the whole sky into fragments as shown in Fig. 2, 3 and 4. Fragments can have different shapes and sizes so there are many different ways of doing this. In our approach we use orthogonal division. We limit the minimum size of the fragments to the maximum FOV of the input photograph. In other words, the minimum size of the fragment must be larger than the photograph. However, the limiting size cannot be too large - because in such a case, identifying fragments would be imprecise or incorrect. Even when the photograph has angular dimensions which fit in the single fragment of the catalog, it may be only partially contained. Figure 2 shows our solution to this problem. We enlarge the fragment
752
W. Makowiecki and W. Alda
Fig. 2. Enlarging fragments of the catalog by a factor of 2 in order to be sure that the photograph is always entirely contained by at least one fragment
by a factor of 2 in order to be sure that the photograph is always entirely contained by at least one fragment. 5. Counting matched quads and voting on catalog fragments We then match the convex quads from the photograph with those in the catalog. Once a match is found, one or more areas in the sky receives votes as shown in Fig. 3. At the end of the process, the region of the catalog which has received the highest number of votes is the one we were looking for (Fig. 4). We use seeds for two reasons. Firstly, to introduce a unique and easy method for choosing stars, which will be used later to create quads. Secondly, using seeds is necessary for reducing the number of convex quads created from the catalog. The problem shows certain asymmetry. For the photograph, we can create as many convex quads as possible without using seeds at all. The number of convex quads is the main reason for this. A photograph always has many fewer stars than the whole sky catalog. However, as it turned out in our tests, such a method is not optimal and introduces difficulties. The main problem concerns the scale of quads. If we create all possible convex quads from the stars in the photograph
New Sky Pattern Recognition Algorithm
753
Fig. 3. Example of quads matching process
we would create convex quads in many different scales, whereas using a grid of seeds for the catalog would create only quads of similar sizes.
3
Software
The name ‘Skyprint‘ [10] was intended to be analogous with fingerprinting because it tries to find the same patterns of stars in a large database. Skyprint is a highly interactive application written in C++ with Qt interface. Sample screen of the program’s graphical interface is shown in Fig. 5. There are two algorithms implemented in Skyprint. First, the algorithm presented in the previous section and second - a version of the algorithm presented by Valdes [1]. The distortion effects originating from projecting a celestial sphere onto a 2D plane can no longer be neglected for photographs with FOV larger than 30 arcmin. For this reason Skyprint gives the option of choosing between three different types of projection. The first one is a ‘no projection‘ method (the angles are calculated on the sphere with use of spherical trigonometry). The second method is to calculate the angles on the sphere by treating them as if they were in a two dimensional plane. The last one, the ‘gnomonic projection‘,
754
W. Makowiecki and W. Alda
Fig. 4. Sample results of voting procedure
Fig. 5. Skyprint - view of the main window
New Sky Pattern Recognition Algorithm
755
is the best option as it uses exactly the same distortion as the one that appears when a picture of the sky in taken. The typical use case of Skyprint is as follows: – – – – –
4
Open the photograph. Select or correct position of stars on the photograph. Choose sky region (possibly the whole sky). Select between algorithm I and II. Choose matching parameters for the selected algorithm.
Results
Although we made our tests on a number of images, here we would like to concentrate on the comparison of results obtained by matching a single photograph while changing many parameters of the algorithm. We have used the simplest orthogonal division as shown in Fig. 3. This allowed us to divide the catalog into 128 parts. We placed seeds much more densely, but in precisely the same manner. Seed density is a parameter of the software. Typically we use two densities: 600 x 300 and 1200 x 600. However, these seeds are not uniformly distributed on the sphere. This particular distribution causes the seeds with the highest and lowest declinations to be close to other seeds. This is not desirable, because in this way there would be many more quads created for the regions close to the poles than for the parts of the sky which lie beside the equator. The same problem applies to choosing the fragments that the catalog is divided into. One solution is to use HEALPix [8] for determining the uniform distribution of seeds and regions of the catalog. The other is to utilize the constraint in equation 1. For the tests we used an M45 [7] photograph with a FOV of about 1 degree, which is very small in size when compared with 22.5 degrees (the largest part into which we divide the catalog). The larger the ratio of FOV to the size of the part of the catalog, the harder it is for the algorithm to work properly. This was one of the reasons for choosing this photograph. The second reason was to show that even at such a small FOV (small in the context of amateur photography) we need to make sure we pay attention to distortions. In Fig. 6, 7 and 8 we show the number of created quads compared with the number of votes in the regions with the two highest ranks, obtained for each method of angle calculation.

Fig. 6. Results obtained without using any projection. Each chart plots, against the maximal spacing between stars in the photograph (in pixels), the number of created quads together with the number of votes received by the catalog regions with the highest and the second highest number of votes.

Fig. 7. Results obtained by calculating the angles on the sphere with use of spherical trigonometry (same quantities and axes as in Fig. 6).

Fig. 8. Results obtained by using the gnomonic projection (same quantities and axes as in Fig. 6).
5 Conclusions
We have shown that the implemented algorithm works well with almost any class of photographs. We have made no assumptions about the photograph apart from limiting its size which makes the algorithm much more efficient. In practice the maximum size of the photograph is usually well known and can be set as a parameter, thus representing a very soft constraint. We have also determined the most important factors in the identification of the fragment of the sky. One of them is the number of stars selected in the photograph either by an automatic filter or the user. The more stars selected, properly and precisely, the higher the probability of successful matching. The projection and the method used to calculate angles of the shape is also very important, especially for larger FOVs. For matching real life photographs of the sky, the gnomonic projection is the best choice. Because there are countless methods of characterizing shapes we plan to explore some more possibilities further in our future research on this subject. The work of Arzoumanian et.al. [9] shows that different applications of the star pattern matching algorithms may exist in a disparate fields of science and are still to be explored. AGH Grant no. 11.11.120.777 is acknowledged.
References
1. Valdes, F.G., Campusano, L.E., Velasquez, J.D., Stetson, P.B.: FOCAS Automatic Catalog Matching Algorithms. PASP 107, 1119 (1995)
2. Groth, E.J.: A pattern-matching algorithm for two-dimensional coordinate lists. The Astrophysical Journal 91, 1244–1248 (1986)
3. Murtagh, F.: A new approach to point-pattern matching. PASP 104, 301–307 (1992)
4. Lang, D., Hogg, D.W., Mierle, K., Blanton, M., Roweis, S.: Making the sky searchable (submitted, 2007), http://www.astrometry.net
5. Harvey, C.: New Algorithms for Automated Astrometry. M.Sc. Thesis, University of Toronto (2004)
6. Roeser, S., Bastian, U.: PPM Star Catalog, vol. I & II. Astronomisches Rechen-Institut, Heidelberg (1991)
7. De Martin, D.: M45 picture, http://www.skyfactory.org
8. Górski, K.M., Hivon, E., Banday, A.J., Wandelt, B.D., Hansen, F.K., Reinecke, M., Bartelmann, M.: HEALPix: A Framework for High-Resolution Discretization and Fast Analysis of Data Distributed on the Sphere. The Astrophysical Journal 622(2), 759–771 (2005)
9. Arzoumanian, Z., Holmberg, J., Norman, B.: An astronomical pattern-matching algorithm for computer-aided identification of whale sharks Rhincodon typus. Journal of Applied Ecology 42(6), 999–1011 (2005)
10. Makowiecki, W.: Skyprint software for interactive Star Pattern Matching, http://www.skyprint.info
A Generic Context Information System for Intelligent Vision Applications
Luo Sun, Peng Dai, Linmi Tao, and Guangyou Xu
Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, 100084 Beijing, China
{sunluo00,daip02}@mails.tsinghua.edu.cn, {linmi,xgy-dcs}@tsinghua.edu.cn
Abstract. Future intelligent vision is expected to be highly context-aware such that it can perceive and be aware of the user's situation and react accordingly. In this paper, we propose a context representation mechanism and build a high-performance, extensible, distributed context information system based on it, in order to facilitate context-awareness development and information sharing. It pays attention to representing and organizing contextual information in an effective way and does not force any certain type of context reasoning algorithm. It can provide information-related services for distributed intelligent vision applications, mainly including representation, storing and retrieval, forming a whole pipeline of real-time semantic metadata generation and management. Besides user context, which is used to support runtime context communication between application components, our system also contains contextual descriptions of the running environment and system configuration, so that applications based on it can move to another environment or configuration seamlessly. Moreover, context representation in our system has a well-designed plugin-based architecture, helping users add their own context types without any modification of the original system. We introduce a context-aware meeting application based on our system, which employs a Dynamic Bayesian Network as its context reasoning algorithm. Experiment results show our context information system has excellent configurability, extensibility and performance. Keywords: Context Information System, Context Representation, Context Storing and Context Retrieval.
1 Introduction
The final goal of intelligent vision is to provide the right service to the right object in the right place at the right time, which requires it to be highly context-aware such that it can perceive and be aware of the user's situation and react accordingly [1]. By context, we mean a dynamic structure of information that is used to characterize the situation, viewed over a period of time, episode of use, social interaction, internal goals and local influence [2, 3]. In the real world, understanding of each person's behavior is affected by his relations with others and the surrounding environment,
therefore contextual information has to be considered. Moreover, with the development of multimedia sensing technology, more and more data can be acquired with much less effort than ever before. Facing a large amount of available data, the use of context can help decrease the complexity significantly by processing only relevant scenarios and using heuristics to choose only the objects that may be involved in a particular activity. The very first step of context-awareness is to represent context in an effective computer-applicable format [1, 2]. Besides representing contextual information, information representation and sharing plays an important role in the domain of computer vision. Over the last 20 years, we have seen a remarkable amount of progress in the abilities and usability of computer vision. Although there is lots of collaboration between groups working in the same field, the majority of researchers have their own ways of working and representation formats. This causes the issue that we still only see very individual and proprietary use of this work, both in research and in industry [4]. In this regard, a generic context information system to facilitate context-awareness development and information sharing, which does not force any certain type of context reasoning algorithm, is an urgent need in the development of intelligent computer vision applications. There have been several attempts for this purpose recently. Most of them represent information in XML because of its open, human-readable format and natural hierarchical structure. VEML (Video Event Markup Language) [5] has been implemented for representing and recognizing events in videos, in order to facilitate the development of applications such as video surveillance, video browsing and content-based video indexing. CVML (Computer Vision Markup Language) [4] summarized many common visual processing requirements and proposed a framework dedicated to describing computer vision information. MPEG-7, formally named "Multimedia Content Description Interface", is a widely accepted standard for describing multimedia content data, which aims to resolve the problems of management and retrieval. It has rich functionality, which also makes it too complex to employ. Moreover, MPEG-7 is not suitable for dynamic scenes. However, VEML and CVML are not as widely accepted as expected, and these three information-description languages do not address the concept of context. Context Toolkit [6, 7] aims to be a reusable solution to handle context in a distributed infrastructure. However, it doesn't provide adequate support for organizing context in a formal structured format and therefore cannot represent the dynamic nature of context [3]. It also adopts a hard-coded context ontology and cannot easily be extended or interoperate with others. Facing the issue that information systems don't consider context-awareness and context-aware systems don't take information sharing into account, we propose a context representation mechanism for intelligent computer vision and build a high-performance, extensible, distributed context information system based on it, in order to facilitate context-awareness and information sharing. It is used to support information-related services for distributed intelligent vision applications, including representation, storing and retrieval, forming a whole pipeline of real-time semantic metadata generation and management.
We have already implemented a flexible multi-server platform for distributed visual information processing in our previous work [8]. Combining it with our context information system, a complete infrastructure of
intelligent computer vision has emerged. The rest of this paper is organized as follows: Section 2 describes the system, including our basic idea, several design aspects of the architecture and some key technologies. Section 3 shows a context-aware meeting application based on our information system, as well as some performance results. In the end, some conclusions are drawn.
2 Context Information System
2.1 Distributed Architecture
As a complete information system, at least the following functions should be provided. They are the minimal requirements to form a whole pipeline of metadata generation and management.
1. Information representation, the mechanism to formalize information into metadata, representing information in an effective computer-applicable way.
2. Information storing, the ability to store formalized information persistently, usually to hard disks. Information is usually indexed for faster retrieval.
3. Information retrieval, the ability to retrieve stored information later with specified restrictions.
Among them, information representation is usually considered the most important one since it determines the representation scope and the ability to interoperate with others. However, as more and more sensors are employed in intelligent vision systems, it is not possible to finish all procedures on just one computer. In this regard, a distributed architecture should be taken into account.
Fig. 1. System Architecture. The whole system is organized in a distributed manner, including three separate services, shown in dashed boxes.
In our previous work, a flexible platform for distributed visual processing has been finished, which acts as a container for information processing and analysis algorithms to plug into. The platform is composed of a set of servers which collaborate with each other to accomplish tasks such as video capture, transmission, buffering and synchronization. With the help of this convenient platform, we designed a distributed architecture for our information system, as shown in Fig. 1. There are three kinds of services within the system, namely the local storing service, the local retrieval service and the global archive service. Each service is a separate process, using local or remote sockets to communicate with other services or visual processing components. The local storing and local retrieval services run on every computer in the vision system, providing information-related services to all visual processing components working on the same machine. The global archive service is unique within the whole system, usually running on a dedicated server for information archiving. We employ XML as the basic storing format in our system. Therefore we can use Berkeley DB XML [9] to store metadata; it is a high-performance, open-source, embeddable XML database with XQuery support. It stores XML in an optionally indexed way, providing extremely fast access. Our information system architecture features fast storing and retrieval access. This is important if we want to deploy it to existing intelligent vision applications, since they usually do not consider a unified information format carefully at the very beginning and slow access may cause a significant decrease in their performance. Fast storing benefits from the local information database, which is small and acts as a local cache. New information is stored into the local database directly by synchronized storing function calls, while a background thread is dedicated to synchronizing the local database with the global one. This design effectively avoids data loss in case of application crashes, while keeping the storing procedure fast enough not to slow down the applications based on it. On the other hand, retrieval access cannot be handled in the same way. It has to be synchronous since applications need results immediately. We optimize it by increasing the capacity of the internal cache of Berkeley DB XML and by using the UDP protocol with confirmation replies, instead of TCP. The experiments described later show our context information system's performance.
2.2 Context Representation and Plugin-Based Implementation
The very first step of context-awareness is to represent it in a computer-applicable format. Traditional context models deal with individuals or their relationships with environmental objects, which can be represented as single-level events such as changes of location, time, temperature, etc. This method suffers especially in dynamic interaction scenes since it ignores the dynamic nature of context. Greenberg refined context as a dynamic construct, which is viewed over a period of time, episode of use, social interaction, internal goals, and local influence [3]. Following his definition, we propose a computer-applicable version of context representation in the domain of computer vision. Context is defined as a multi-level hierarchy, which represents situations at different abstraction levels. Moreover, a complete context representation mechanism should not only include runtime dynamic information representing human physical and mental
states, but also needs to contain static environment and configuration settings. There are three kinds of descriptions in our information system.
1. Environment configuration. The external description of a system, used to describe the environment where a system would run, including PC configuration, room configuration, physical objects and their properties, etc.
2. System configuration. The static internal description of a system, used to describe how a system would run, including the connection relationships between visual processing components, what kind of services the system should provide to users, etc.
3. Run-time information communication. The dynamic internal description of a system, used to regulate the information format and share it through the whole system, helping all components understand and communicate with each other.
Respectively, context representation defined in our information system can be divided into three parts: environment context, system context and user context. User context is only valid during runtime and contains processing targets and their properties, from low-level features to high-level semantic descriptions. Environment context and system context are rather static, usually remaining unchanged while the system is running. User context highly depends on the interaction status, which changes frequently. It is organized in a hierarchical structure at different time scales and group sizes. Fig. 2 shows a user context example under a meeting scenario.
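As a concrete illustration of this hierarchy, the sketch below shows one possible data structure for a user context node. It is our own minimal example (all field names are assumptions, not taken from the paper): each node covers a time scale and a participant group and refines into shorter-scale child interactions.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical sketch of the multi-level user context hierarchy: a node such as
// "group meeting" refines into "discussion", which refines into "A talking to B".
struct ContextNode {
    std::string label;                                  // e.g. "discussion", "A talking to B"
    double startTime = 0.0, endTime = 0.0;              // time scale covered by this level
    std::vector<std::string> participants;              // group size at this level
    std::vector<std::shared_ptr<ContextNode>> children; // finer-grained interactions
};
```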
Fig. 2. A User Context Example under a Meeting Scenario. There are four people within a meeting. At one moment, A and B are talking to each other, while C is staring at A and D is staring at B. Some key nodes of the current user context are shown as a tree on the right side. At the longest time scale, it is a discussion during a group meeting. There are three short-term interactions currently, namely A talking to B, D staring at B and C staring at A.
Context representation in our system is implemented with a plugin architecture. Plugin-based systems are widely used nowadays, e.g. in Firefox and Eclipse. They have many advantages, such as rich extensibility and easy maintenance. They also help reduce
Fig. 3. Modules in Our Plugin System. The ID context plugin, which implements the interface IContextInterface, is a DLL (Dynamic Link Library) file under Windows or an SO (Shared Object) file under Linux and can produce the ID context representation, which implements the interface IContextRepresentation. The Location Context plugin works the same way. All plugins are controlled by a module named Context Controller, which provides several useful functions for plugin management, such as loading, refreshing and searching for a specified plugin.
Fig. 4. Some of the Implemented Context Representations in Our Information System. Blue rectangles are categories and blue arrows between categories represent their relationships and how they work together. Each category contains several context types, which can be used to describe objects or their properties in the real world. Except for the categories "System" and "Environment", all categories belong to the user context.
future development labor to some extent. In our design, each context type belongs to a
certain category, which is used to organize context types based on what kind of information they designate. A context category is actually a directory on the disk with a configuration file describing its relationships with other categories. The code of every context type is stored as a Dynamic Link Library (DLL) file under Windows, and as a Shared Object (SO) file under Linux, in the corresponding category directory. When the information system starts, it scans all possible directories and loads context representations dynamically. Therefore, users can add their own context representations, just following some predefined interfaces, without recompiling the whole information system. The implementation of our plugin system is based on Qt's plugin system [10] and the relationships among modules are shown in Fig. 3. Every context plugin needs to implement a common interface IContextInterface and every context representation needs to implement a common interface IContextRepresentation. Some of the implemented context representations in our system are shown in Fig. 4. The system contains many useful context types for intelligent vision. Although they may not cover every requirement, developers can add their own context types very easily with the help of our plugin architecture.
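A minimal sketch of how such interfaces and the loading step inside the Context Controller might look with Qt's plugin mechanism is given below. The method names on the two interfaces are our assumptions, since the paper only names the interfaces themselves.

```cpp
#include <QDir>
#include <QObject>
#include <QPluginLoader>
#include <QString>
#include <QVector>

// Assumed interface of a single context representation (e.g. an ID or Location context).
class IContextRepresentation {
public:
    virtual ~IContextRepresentation() {}
    virtual QString contextType() const = 0;   // hypothetical accessor
    virtual QString toXml() const = 0;         // hypothetical serialisation hook
};

// Assumed interface every context plugin has to implement.
class IContextInterface {
public:
    virtual ~IContextInterface() {}
    virtual IContextRepresentation* createRepresentation() = 0;
};

Q_DECLARE_INTERFACE(IContextInterface, "example.vision.IContextInterface/1.0")

// Context Controller side: scan one category directory and load every plugin in it.
QVector<IContextInterface*> loadCategory(const QString& categoryDir)
{
    QVector<IContextInterface*> plugins;
    QDir dir(categoryDir);
    foreach (const QString& file, dir.entryList(QDir::Files)) {   // DLL or SO files
        QPluginLoader loader(dir.absoluteFilePath(file));
        QObject* instance = loader.instance();                    // loads the shared library
        if (IContextInterface* p = qobject_cast<IContextInterface*>(instance))
            plugins.append(p);
    }
    return plugins;
}
```

A concrete plugin class would additionally declare Q_INTERFACES(IContextInterface) and be exported with Qt's plugin export macro so that qobject_cast can resolve the interface at load time.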
3 Experiment Systems and Results
In this section, we show a context-aware meeting application supported by our context information system, as well as some results of performance tests. The major task of this indoor meeting analysis and archiving system is to generate and archive a hierarchical context representation of multimodal meeting data, which can be further employed for retrieval and more sophisticated analysis [11]. Our
Fig. 5. Sensor Settings in Meeting Room. Three fixed cameras are set to monitor the meeting room from three distinct perspectives. A PTZ camera is placed on the table to focus on any
specific target. Three linear microphone arrays are assembled on the table to get audio information from various participants.
information system plays an important role in environment configuration, system configuration and user context representation. A flexible multi-server platform has been developed to support distributed multimedia data capturing and transferring. Multiple audio and visual sensors are installed in the meeting room so as to extract multimodal information in real time, as illustrated in Fig. 5. Since context cannot be determined by a simple collection of multimodal sensor data, this system employs a Dynamic Bayesian Network for context reasoning, which can generate analysis results at run-time. The structure of the user context is shown in Fig. 6, where different layers have very strong semantic meanings. Fig. 7 shows a description example. At one moment, person A is staring at B (not shown in the picture). Some body parts of A, including head position and orientation, hand positions and so on, are tracked as low-level features for later analysis.
Fig. 6. Hierarchical User Context Structure under the Meeting Scene. The top layer is the meeting scenario itself. The second layer describes what the current situation is from a high semantic perspective. The third layer represents how people interact with each other individually.
Fig. 7. A Description Example under Meeting Scene. The picture on the left side is from the original video overlapped with some color boxes designating different parts of a person. The
XML segments in the center and on the right are the corresponding low-level and high-level semantic descriptions, respectively. Orange arrows help show the relationships among them.
Storing and retrieval performance has been tested on data extracted from several meetings and from another context-aware outdoor surveillance application. Fig. 8 shows some results about storing performance. From the figure, we can see that the local storing speed is fairly fast, less than 2 ms even when the local database is 20 MB. However, the storing speed is affected by the size of the local database and by the concurrency. Synchronization time increases nearly linearly with the sizes of the global and local databases.
Fig. 8. Results of Storing Performance Experiment. The left plot shows some results of storing a specified event into the local database and different square colors represent different numbers of concurrent storing. The right one shows the background synchronization time and different square colors represent different sizes of global database.
Fig. 9. Results of Retrieval Performance Experiment. These figures show how query time changes with respect to database size and concurrency. The left plot is searching for a specified type in the surveillance application, while the right one for a specified event in the meeting application. Different square colors represent different numbers of concurrent queries.
Fig. 9 shows some experiment results about retrieval performance. An event query is a little more complex than a type query; therefore it costs a bit more time. Query time increases sublinearly with the size of the database and the number of concurrent query processes. When the database is small, queries are fast since the cache can stay very hot, playing an important role in retrieval. As the database grows, disk I/O costs more and more time.
4 Conclusion
Our work in this paper focused on the reusability of context by means of splitting it into representation and reasoning. Reasoning differs significantly across context-aware systems while representation remains similar. As a proof of concept, we presented a context representation mechanism and built a high-performance, extensible, distributed context information system based on it to facilitate context-awareness and information sharing. It can support information-related services for distributed intelligent vision applications, mainly including representation, storing and retrieval, forming a whole pipeline of real-time semantic metadata generation and management. Our proposed contextual information includes not only static environment and system settings but also dynamic information representing human physical and mental states. The former is used to describe where and how the application would run, while the latter is used for runtime intercommunication between application components. Contextual information has been implemented in a plugin manner, which means developers can add their own context types without any modification of the original system, just by following our interface definition. As a result of our basic idea, this context information system pays attention to representing and organizing contextual information in an effective way and does not force any certain type of context reasoning algorithm; therefore developers can employ any algorithm they prefer, rule-based or probability-based. We also introduced a context-aware meeting analysis and archiving application based on our system, which employs a Dynamic Bayesian Network as its context reasoning algorithm. Our information system plays an important role in it, showing its configurability and extensibility. The performance of storing and retrieval has been tested on a large amount of real data and the results show our system has excellent performance. In our previous work, we had already finished a flexible platform for distributed visual processing. Our information system can actually be considered an important complement to it. Combining them together, a complete infrastructure of intelligent vision has emerged. Acknowledgments. This research is supported by NSFC projects 60433030 and 60673189 of China. We are thankful for the thoughtful comments and suggestions of our reviewers.
References
1. Goh, E., Chieng, D., Mustapha, A., Ngeow, Y.-C., Low, H.-K.: A Context-Aware Architecture for Smart Space Environment. In: Proceedings of International Conference on Multimedia and Ubiquitous Engineering, Korea, pp. 908–913 (2007)
2. Dey, A.K., Abowd, G.D., Salber, D.: A Conceptual Framework and a Toolkit for Supporting the Rapid Prototyping of Context-Aware Applications. Human-Computer Interaction 16, 97–166 (2001)
3. Greenberg, S.: Context as a Dynamic Construct. Human-Computer Interaction 16, 257–268 (2001)
4. List, T., Fisher, R.B.: CVML – An XML-based Computer Vision Markup Language. In: Proceedings of International Conference on Pattern Recognition, Cambridge, vol. 1, pp. 789–792 (2004)
5. Nevatia, R., Hobbs, J., Bolles, B.: An Ontology for Video Event Representation. In: Proceedings of International Conference on Computer Vision and Pattern Recognition Workshop (2004)
6. Context Toolkit, http://www.cs.cmu.edu/~anind/context.html
7. Edwards, K., Bellotti, V., Dey, A.K., Newman, M.: Stuck in the Middle: The Challenges of User-Centered Design and Evaluation for Middleware. In: Proceedings of The International Conference on Human Factors in Computing Systems (2003)
8. Wang, Y., Tao, L., Liu, Q., Zhao, Y., Xu, G.: A Flexible Multi-server Platform for Distributed Video Information Processing. In: Proceedings of 5th International Conference on Computer Vision Systems (2007)
9. Oracle Berkeley DB XML, http://www.oracle.com/database/berkeley-db/xml/index.html
10. Qt product page, http://www.trolltech.com/products/qt
11. Dai, P., Di, H., Dong, L., Tao, L., Xu, G.: Group Interaction Analysis in Dynamic Context. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics (to appear, 2007)
Automated Positioning of Overlapping Eye Fundus Images
Povilas Treigys¹, Gintautas Dzemyda¹, and Valerijus Barzdziukas²
¹ Institute of Mathematics and Informatics, Akademijos str. 4, LT-08663 Vilnius, Lithuania
{treigys,dzemyda}@ktl.mii.lt
² Kaunas University of Medicine, Eiveniu str. 4, LT-3007 Kaunas, Lithuania
[email protected]
Abstract. Changes in eye fundus images can be associated with numerous vision-threatening diseases such as glaucoma, optic neuropathy, swelling of the optic nerve head, or can be related to some systemic disease. Tracking the progress of a possible disease of the patient becomes very difficult from separate retinal images. In this article we present a method which registers two retinal images so that the fundus images overlap each other in the best way. As a separate case, this article shows that in order to solve the optic nerve disc registration problem a linear transformation of the retinal image is sufficient. A human identification possibility via retinal image registration is discussed as well. Keywords: automated eye fundus registration, vasculature structure extraction, automated shifting, optic nerve registration, identification, retinal image transformation.
1 Introduction
At present ophthalmologists can collect and analyze the eye fundus from digital images. Whenever the image of the eye fundus becomes digital, the means of automatic image processing come into play. A high quality colour photograph of the eye fundus is helpful in the accommodation and follow-up of the development of the eye disease. Evaluation of the eye fundus images is complicated because of the variety of anatomical structures and possible fundus changes in eye diseases. The optic nerve disc (OD) appears in the normal eye fundus image as a yellowish disc with whitish central cupping (excavation) through which the central retinal artery and vein pass (Fig. 1, left and centre images). Changes of the optic nerve disc can be associated with numerous vision-threatening diseases such as glaucoma, optic neuropathy, swelling of the optic nerve head, or can be related to some systemic disease. Thus, one of the basic tasks in ophthalmology is to analyze the optic nerve disc. Also, the analysis of the vasculature can be helpful to indicate pathologic changes associated with diseases such as hypertension, diabetes, or atherosclerosis [7]. Vasculature extraction methods for retinal images can be classified into one of three groups: kernel-based, classifier-based and tracing-based [19]. In the kernel-based methods, an image is in most cases convolved with a predefined kernel. Further, the Gaussian filter
is introduced in order to model the cross-section of the vessels. Afterwards the vessel identification filters [6] are applied. Such a class of vasculature structure extraction algorithms is commonly combined with neural networks [8] and is very time-consuming. Classification-based methods are composed of two steps. During the first step, segmentation of an image is performed. Segmentation [16, 14] is basically accomplished by the kernel-based methods. In the second step, a set of features has to be provided for the algorithm. Such a set describes the vessels visible in the image. The methods that belong to this class allow processing of objects with complex structures [1]. This enables the algorithms to perform faster; however, these algorithms cannot be automatic in most cases. In the tracing-based class of algorithms [2], the algorithm traces the structure of a vessel between predefined points. Basically, tracing ends at the provided reference points. It is common that these reference points are provided interactively by a human. We present here a method for automatic retinal image registration. Image registration is the process of transforming different sets of data into one coordinate system. In this particular situation, the registration should be performed so that the visible structures in two images overlap each other in the resulting image (Fig. 1, right side image). The resulting image comes with a quality measurement parameter that can later be introduced into a decision support system for ophthalmologists. Besides, this article shows that in order to solve the optic nerve disc registration problem, a linear transformation is sufficient. It should be noted that the structure of the vasculature is commonly used for human identification purposes. The problem of automated human identification within the patient database is presented as well.
Fig. 1. The base retinal image (left); the retinal image committed for registration (centre); superimposed retinal images whose structures have not yet been registered (right)
2 Image Pre-processing and Scaling
The eye fundus images were collected in the Department of Ophthalmology of the Institute for Biomedical Research of Kaunas University of Medicine, using the fundus camera Canon CF-60UVi at a 60° angle. 6.3 Mpixel images (image size 3072x2048 pixels) were taken. The magnification quotient was 0.0065248 mm/pixel and the common magnification quotient for the eye–fundus-camera system was 0.556782±0.000827 (mean±SD). The scale for the fundus camera was 0.01171875 mm/pixel.
In order to register the position of the OD in two retinal images, first of all we have to pre-process the images. The first step of image pre-processing is accomplished by scaling down the retinal image to the size of 768x512 pixels. Scaling is performed in order to decrease the computation time. Basically, the morphological operations are the most time-consuming procedures, because each pixel in the spatial domain is probed with some structuring element, also known as a convolution kernel. Scaling therefore leads to a substantial acceleration of vessel structure extraction, which is very important at this stage. It is well known that every pixel in a colour image can be described by three components, namely the red (R), green (G), and blue (B) channel intensity values. Then, every image that consists of NxM pixels can be described by three separate matrices {R(x,y); G(x,y); B(x,y)}, where x = 1,…,N; y = 1,…,M. Here each function returns the intensity value of the corresponding channel at position (x,y). As usual, in order to calculate the monochrome luminance of a colour image, we need to apply coefficients related to the eye's sensitivity to each of the RGB channels. This is done according to the NTSC standard and can be expressed by
I(x,y) = 0.2989*R(x,y) + 0.5870*G(x,y) + 0.1140*B(x,y).
Here I is the intensity image with integer values ranging from a minimum of zero to a maximum of 255.
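For illustration, a straightforward implementation of this conversion (our own sketch, operating on an interleaved 8-bit RGB buffer) could look as follows.

```cpp
#include <cstddef>
#include <vector>

// Convert an interleaved 8-bit RGB image to a single-channel intensity image
// using the NTSC weights quoted above.
std::vector<unsigned char> toIntensity(const std::vector<unsigned char>& rgb,
                                       int width, int height)
{
    std::vector<unsigned char> intensity(static_cast<std::size_t>(width) * height);
    for (std::size_t p = 0; p < intensity.size(); ++p) {
        const unsigned char r = rgb[3 * p + 0];
        const unsigned char g = rgb[3 * p + 1];
        const unsigned char b = rgb[3 * p + 2];
        // round to the nearest integer intensity value in [0, 255]
        intensity[p] = static_cast<unsigned char>(0.2989 * r + 0.5870 * g + 0.1140 * b + 0.5);
    }
    return intensity;
}
```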
3 Mathematical Morphology and Point-Wise Operations
Morphological operations were originally developed to deal with binary images, but they can easily be applied to intensity images. In the case of intensity images, erosion and dilation are understood as a nonlinear search for the minimum or maximum under a filter window, while opening and closing are combinations of erosion and dilation. However, the fundamental concepts of grey-level morphology operations cannot be directly applied to colour images [5, 13]. Thus, we need to convert colour images to intensity ones as described in the pre-processing section. Morphological operations typically probe an image with a small shape or template known as a structuring element. The four basic morphological operations are erosion, dilation, opening, and closing [15]. The grey-scale erosion can be described as the calculation of the minimum pixel value within the structuring element centred on the current pixel $A_{i,j}$. Denoting an image by $I$ and a structuring element by $Z$, the erosion operation $I \ominus Z$ at a particular pixel $(x,y)$ is defined as

$$I \ominus Z = \min_{(i,j)\in Z} \left( A_{x+i,\,y+j} \right). \qquad (1)$$

Here $i$ and $j$ are indices of the pixels of $Z$. The grey-scale dilation is considered in a dual manner and thus can be written as

$$I \oplus Z = \max_{(i,j)\in Z} \left( A_{x+i,\,y+j} \right). \qquad (2)$$

The opening of an image is defined as erosion followed by dilation, while the image closing includes dilation followed by erosion. Thus, the morphological closing operation can be defined as follows:

$$I \bullet Z = (I \oplus Z) \ominus Z = \min_{(i,j)\in Z} \left( \max_{(i,j)\in Z} \left( A_{x+i,\,y+j} \right) \right). \qquad (3)$$
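A minimal sketch of grey-scale erosion, dilation and closing for an 8-bit intensity image is shown below; it assumes a square (2k+1)x(2k+1) structuring element, which is our simplification, since the paper does not fix a particular shape here.

```cpp
#include <algorithm>
#include <vector>

// Grey-scale erosion/dilation with a (2k+1)x(2k+1) square structuring element,
// following Eqs. (1)-(2); neighbours outside the image are simply skipped.
static std::vector<unsigned char> morph(const std::vector<unsigned char>& img,
                                        int w, int h, int k, bool dilate)
{
    std::vector<unsigned char> out(img.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int best = dilate ? 0 : 255;
            for (int j = -k; j <= k; ++j)
                for (int i = -k; i <= k; ++i) {
                    const int xx = x + i, yy = y + j;
                    if (xx < 0 || yy < 0 || xx >= w || yy >= h) continue;
                    const int v = img[yy * w + xx];
                    best = dilate ? std::max(best, v) : std::min(best, v);
                }
            out[y * w + x] = static_cast<unsigned char>(best);
        }
    return out;
}

// Closing = dilation followed by erosion (Eq. 3); the structuring element must be
// large enough to cover the vessel cross-sections it is meant to remove.
std::vector<unsigned char> closing(const std::vector<unsigned char>& img, int w, int h, int k)
{
    return morph(morph(img, w, h, k, true), w, h, k, false);
}
```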
The closing operator usually smoothes away the small-scale dark structures from colour retinal images. As closing only eliminates the image details smaller than the structuring element used, it is convenient to set the structuring element big enough to cover all possible vascular structures. Mendels et al. [9] applied the closing grey-level morphology operation to smooth the vascular structures out of retinal images. Let us assume that the scheme presented above is applied to two pre-processed images. Also, let us say that one image (P1) is the base image and the other image (P2) is the image that has to be registered on the base image. In order to see the differences between two spatial images, the technique of intensity value subtraction is frequently used. This operation can be defined as Q(x,y) = |P1(x,y) − P2(x,y)|. After this operation (for each x = 1,…,N; y = 1,…,M), if there are no changes at a particular position in the spatial domain, the subtracted intensity value becomes 0; otherwise, if there are some differences, the intensity value does not become 0. In order to visualize the subtracted image, we have to apply an intensity adjustment procedure. This is needed because the vessel intensity values in colour images are very low compared to the surrounding background of the retinal image. Basically, the intensity adjustment procedure can be described in this way. Let us assume that the distribution of intensity values of the subtracted image Q(x,y) and the transformation function f are continuous in the interval [0, 1] [4]. Moreover, assume that the transfer function is single-valued and monotonically increasing. Then the actual intensity levels in this interval are recalculated using the function f to the desired intensity levels in a desired interval. In our investigation we used the desired interval with a minimum value of 0 and a maximum value of 255. Also, the Gamma correction factor was set to 1 (this transfer function is nearly linear). By thresholding the intensity-adjusted image, we will next be able to apply the skeletonization operation in order to obtain the reference vasculature structures in both images. For the automated threshold level calculation we use Otsu's method based on the weighted histogram calculation [12]. Otsu's method maximizes the a posteriori between-class variance $\sigma_B^2(\tau_1)$ given by

$$\sigma_B^2(\tau_1) = w_0(\tau_1)\left[1 - w_0(\tau_1)\right] \left( \frac{\mu_T(\tau_1) - \mu_1(\tau_1)}{1 - w_0(\tau_1)} - \frac{\mu_1(\tau_1)}{w_0(\tau_1)} \right)^2. \qquad (4)$$

Here

$$w_0(\tau_1) = \sum_{i=0}^{\tau_1} \frac{n_i}{N}; \quad w_1(\tau_1) = 1 - w_0(\tau_1); \quad \mu_1(\tau_1) = \sum_{i=0}^{\tau_1} \frac{i\, n_i}{N}; \quad \mu_T(\tau_1) = \sum_{i=0}^{L-1} \frac{i\, n_i}{N}.$$

The optimal threshold $\tau_1$ is found by Otsu's method through a sequential search for the maximum of $\sigma_B^2(\tau_1)$ over $0 \le \tau_1 < L$, where $n_i$ represents the number of pixels in grey-level $i$, $L$ is the number of grey-levels, and $N$ is the total number of pixels in the image [18]. The Otsu thresholding method was applied because in the subsequent morphological operation we have to distinguish unambiguously what is foreground and what is background in the retinal image. The foreground is assumed to be the vasculature of the retinal image and the background is the remaining part of the retinal image.
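The following sketch computes the Otsu threshold of Eq. (4) directly from the grey-level histogram; it is our own minimal implementation, not the authors' code.

```cpp
#include <vector>

// Otsu threshold for an 8-bit intensity image: maximise the between-class
// variance of Eq. (4) over all candidate thresholds.
int otsuThreshold(const std::vector<unsigned char>& img)
{
    const int L = 256;
    std::vector<double> hist(L, 0.0);
    for (unsigned char v : img) hist[v] += 1.0;
    const double N = static_cast<double>(img.size());

    double muT = 0.0;                              // mu_T: mean grey-level of the whole image
    for (int i = 0; i < L; ++i) muT += i * hist[i] / N;

    int best = 0;
    double bestVar = -1.0, w0 = 0.0, mu1 = 0.0;
    for (int t = 0; t < L - 1; ++t) {
        w0  += hist[t] / N;                        // w0(t): cumulative class probability
        mu1 += t * hist[t] / N;                    // mu1(t): cumulative class mean term
        if (w0 <= 0.0 || w0 >= 1.0) continue;      // skip degenerate splits
        const double d = (muT - mu1) / (1.0 - w0) - mu1 / w0;
        const double varB = w0 * (1.0 - w0) * d * d;
        if (varB > bestVar) { bestVar = varB; best = t; }
    }
    return best;                                   // foreground: pixels above this level
}
```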
The next step is to extract the structure of the vasculature from both images: from the base image and from the image to be registered. To this end we have used the medial axis transform (skeletonization) [10]. Basically, the skeletonization operation is computed by shifting the origin of the structuring element (Fig. 2) to each possible pixel position in the image. Then, at each position it is compared with the underlying image pixels. If the foreground and background pixels in the structuring element match exactly the foreground and background pixels in the image, then the image pixel situated under the origin of the structuring element is set to the background; otherwise, it is left unchanged. Here a foreground pixel is assumed to be 1 and a background pixel 0. An empty cell means that a particular pixel is of no interest and is not taken into account for evaluation.
Fig. 2. Structuring elements used for skeletonization
In Fig. 2, both images are first skeletonized by the left-hand structuring element, and afterwards by the right-hand one. Then the above presented process is performed with the remaining six 90° rotations of those two elements during the same iteration. The iteration process is stopped when there are no changes in the images for the last two iterations.
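A sketch of one such skeletonization pass is shown below. The concrete 3x3 elements of Fig. 2 are not reproduced here, so the structuring element is passed in as a parameter (1 for required foreground, 0 for required background, -1 for a don't-care cell), and the driver loop over the eight rotated elements is left to the caller; evaluating all matches against the input of the pass is our simplification.

```cpp
#include <array>
#include <vector>

using Binary = std::vector<int>;                    // 0 = background, 1 = foreground
using Element = std::array<std::array<int, 3>, 3>;  // 1/0 must match, -1 = don't care

// Rotate a 3x3 structuring element by 90 degrees clockwise.
Element rotate90(const Element& e)
{
    Element r{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            r[j][2 - i] = e[i][j];
    return r;
}

// One thinning pass: every foreground pixel whose neighbourhood matches the element
// exactly (ignoring don't-care cells) is set to background. Returns true if anything changed.
bool thinOnce(Binary& img, int w, int h, const Element& e)
{
    bool changed = false;
    Binary out = img;                               // all matches tested against the pass input
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            if (img[y * w + x] != 1) continue;
            bool match = true;
            for (int j = -1; j <= 1 && match; ++j)
                for (int i = -1; i <= 1 && match; ++i) {
                    const int want = e[j + 1][i + 1];
                    if (want != -1 && want != img[(y + j) * w + (x + i)]) match = false;
                }
            if (match) { out[y * w + x] = 0; changed = true; }
        }
    img = out;
    return changed;
}
```

The full skeletonization repeats this pass over the eight rotations of the two elements of Fig. 2 until an iteration leaves the image unchanged.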
4 Transformation to the Frequency Domain
In order to register the two vasculature trees obtained by the proposed scheme, we have to incorporate some cross-correlation method. It is well known that for big images the spatial convolution methods designed for cross-correlation run very slowly. This problem can be solved by introducing the discrete Fourier transform (DFT) [3]. Usually the DFT is defined for a discrete function f(m,n) that is non-zero over the finite region $0 \le m \le M-1$ and $0 \le n \le N-1$. In our case, this function represents a retinal image in the spatial domain. Then, the two-dimensional discrete Fourier transform of the M by N matrix can be calculated as follows:

$$F(p,q) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m,n)\, e^{-i\left(\frac{2\pi}{M}\right)pm}\, e^{-i\left(\frac{2\pi}{N}\right)qn}, \qquad (5)$$

where p = 0,…,M−1 and q = 0,…,N−1. The inverse DFT can be obtained by applying

$$f(m,n) = \frac{1}{MN} \sum_{p=0}^{M-1} \sum_{q=0}^{N-1} F(p,q)\, e^{i\left(\frac{2\pi}{M}\right)pm}\, e^{i\left(\frac{2\pi}{N}\right)qn}. \qquad (6)$$

Here m = 0,…,M−1 and n = 0,…,N−1.
The Fourier transform produces a complex-valued output image. This image can be displayed with two images, either the real and imaginary parts or the magnitude and phase. In our investigation, we apply Eq. 5 to the base retinal image. The retinal image committed for the registration process is rotated by 180°, since the convolution operation itself reverses the provided pattern [17]. Then Eq. 5 is applied to the rotated pattern as well. This results in four arrays: the real and imaginary parts of the two images being convolved. Multiplying the real and imaginary parts of the base image by those of the image committed for registration (a pointwise complex multiplication) generates a new frequency image with real and imaginary parts. Taking the inverse DFT of the newly created frequency image, described by Eq. 6, completes the algorithm by producing the final convolved image. The value of each pixel in the convolved correlation image is a measure of how well the target image matches the searched image at a particular point. The calculated correlation image is composed of noise plus a single high peak, indicating the best match of the vasculature of the image to be registered within the base retinal image vasculature. Simply locating the highest peak in this image specifies the detected coordinates of the best match. The frequency transformation procedure described above is also applied by convolving the structure of the vasculature of the image to be registered with itself. This is done because additional coordinates are necessary that show where the best match of the image with itself is (Fig. 3).
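The whole frequency-domain matching step can be sketched as below. For clarity the DFT of Eqs. (5)–(6) is evaluated directly (an FFT library would be used in practice), the images are assumed to be real-valued skeleton images of equal size stored in row-major order, and the function only returns the coordinates of the correlation peak; as described above, this peak is then compared with the peak of the image correlated with itself to obtain the shift.

```cpp
#include <cmath>
#include <complex>
#include <utility>
#include <vector>

using cpx = std::complex<double>;

// Direct evaluation of Eq. (5)/(6); O((MN)^2), for illustration only.
std::vector<cpx> dft2d(const std::vector<cpx>& f, int M, int N, bool inverse)
{
    const double PI = 3.14159265358979323846;
    const double sign = inverse ? 1.0 : -1.0;
    std::vector<cpx> F(static_cast<std::size_t>(M) * N);
    for (int p = 0; p < M; ++p)
        for (int q = 0; q < N; ++q) {
            cpx sum(0.0, 0.0);
            for (int m = 0; m < M; ++m)
                for (int n = 0; n < N; ++n) {
                    const double phase = sign * 2.0 * PI * (double(p) * m / M + double(q) * n / N);
                    sum += f[m * N + n] * cpx(std::cos(phase), std::sin(phase));
                }
            F[p * N + q] = inverse ? sum / double(M * N) : sum;
        }
    return F;
}

// Locate the correlation peak of 'pattern' against 'base' (both M x N, row-major 0/1 values).
// Note that the product of DFTs gives a circular convolution; in practice the images
// can be zero-padded to avoid wrap-around effects.
std::pair<int, int> correlationPeak(const std::vector<double>& base,
                                    const std::vector<double>& pattern, int M, int N)
{
    std::vector<cpx> a(static_cast<std::size_t>(M) * N), b(a.size());
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            a[m * N + n] = base[m * N + n];
            b[m * N + n] = pattern[(M - 1 - m) * N + (N - 1 - n)];  // 180-degree rotation
        }
    std::vector<cpx> A = dft2d(a, M, N, false), B = dft2d(b, M, N, false);
    for (std::size_t i = 0; i < A.size(); ++i) A[i] *= B[i];        // pointwise complex product
    const std::vector<cpx> corr = dft2d(A, M, N, true);             // back to the spatial domain
    std::size_t best = 0;
    for (std::size_t i = 1; i < corr.size(); ++i)
        if (corr[i].real() > corr[best].real()) best = i;
    return { static_cast<int>(best % N), static_cast<int>(best / N) };  // (x, y) of the peak
}
```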
Fig. 3. Peaks indicating the shift along the x axis (left) and along the y axis (right)
In Fig. 3, on both sides, the smaller peak corresponds to the two different images convolved together. The biggest peak corresponds to the image convolved with itself. Then, by introducing a simple linear transform to the retinal image committed for registration, we shift its pixels by the calculated distance along the x and y axes. The result of closing, subtraction, histogram equalization, thresholding, skeletonization and shift calculation is shown in Fig. 4. Fig. 4 shows the two structures of vasculature extracted by the method proposed above. The stronger structure belongs to the base retinal image on which the retinal image intended for registration has to be placed. The weaker structure of vasculature belongs to the retinal image intended for registration.
Fig. 4. Superimposed structures of the extracted vasculature of two retinal images before registration (top) and the registered vasculature structures of the two retinal images (bottom)
5 Results
Eye fundus images were provided by the Department of Ophthalmology of the Institute for Biomedical Research of Kaunas University of Medicine (BRKU). The testing
set consisted of retinal images of both eyes of 19 patients. It should be noted that registration of images is possible only if those images are of the same patient and of the same eye. This comes from the fact that the structure of the eye vasculature of each human is unique. In order to verify this fact and to obtain a measure of the registration error, the proposed algorithm was applied to retinal images of the same eye taken from different patients (Fig. 5).
Fig. 5. Two correlation results: the correlation between the base retinal image and the image committed for registration (left), and the self-correlation of a retinal image (right)
Fig. 5 shows the magnitudes of the convolved images along the x axis. In this particular case, where the images are not of the same person, note that the magnitudes in the left image are dramatically lower than those in the image on the right. Thus, to evaluate the quality of the registration, we computed the peak-signal-to-noise ratio (PSNR). 119 possible pairs of eye fundus images have been investigated. The conditions for those images to be of the same person and also of the same eye have been satisfied. The results achieved are shown in Fig. 6.
Fig. 6. Histogram of the peak-signal-to-noise ratio
In Fig. 6, a histogram of the peak-signal-to-noise ratio is presented. According to [11], acceptable PSNR values are between 20 dB and 40 dB. The higher the value in decibels, the better the registration performed (Fig. 7). Here we can draw a conclusion on automated human identification from retinal images: if the PSNR is dramatically lower than 20 dB, one can decide that it is not the same person. In the case shown in Figs. 5 and 3 the calculated PSNR value was 4.3 dB.
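The paper does not spell out the exact PSNR formula it uses; the sketch below applies the standard definition from the image coding literature [11], computed here between the base image and the registered image over 8-bit intensities. Using that pair of images as the inputs is our assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Standard PSNR between two 8-bit intensity images of equal size (our assumption of
// the measurement setup, not necessarily the authors' exact procedure).
double psnr(const std::vector<unsigned char>& a, const std::vector<unsigned char>& b)
{
    double mse = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        mse += d * d;
    }
    mse /= static_cast<double>(a.size());
    if (mse == 0.0) return std::numeric_limits<double>::infinity();  // identical images
    return 10.0 * std::log10(255.0 * 255.0 / mse);                   // result in dB
}
```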
Fig. 7. Two overlapping images (left) and the registered image with a quality of 51 dB (right)
A comparative PSNR analysis can be made over the patients' database in order to identify a person automatically. This can also be used to address the problem of patient data protection, because the physician works only with the data about the state of the patient without knowing who the patient really is.
6 Conclusions
In this article the authors presented an automated technique for retinal image registration in which two images overlap each other in the best way. The task was accomplished by introducing intensity-level morphology operations for vessel extraction. Then the intensity adjustment procedure was performed to enhance the resulting image after subtraction. This operation was followed by image binarization, after which the skeletonization operation was introduced. In the next step the spatial domain of the extracted vasculature structure was converted into the frequency domain, which allowed a fast convolution of the two images. This enables us to calculate the image shift. The analysis of the provided retinal images showed that the registration quality parameter generally falls within the bounds of decibels accepted in the literature. Also, we have shown that for the fundus image registration problem a linear transformation is enough to obtain satisfactory results. The disclosed human identification problem revealed that the proposed algorithm is also suitable for solving identification-related problems. However, a more careful analysis should be made in order to evaluate the identification results. Acknowledgements. The research is partially supported by the Lithuanian State Science and Studies Foundation project "Information technology tools of clinical decision support and citizens wellness for e.Health system" No. B-07019.
References
1. Chanwimaluang, T., Guoliang, F., Fransen, S.R.: Hybrid retinal image registration. IEEE Transactions on Information Technology in Biomedicine 10(1), 129–142 (2006)
2. Dongxiang, X., Jenq-Neng, H., Chun, Y.: Atherosclerotic blood vessel tracking and lumen segmentation in topology changes situations of MR image sequences. In: Proceedings of the International Conference on Image Processing, vol. 1, pp. 637–640 (2000)
3. Kamen, E.W., Heck, B.S.: Fundamentals of Signals and Systems Using the Web and MATLAB (2000)
4. Gonzalez, R., Woods, R.: Digital Image Processing. Addison-Wesley, Reading (1992)
5. Goutsias, J., Heijmans, H., Sivakumar, K.: Morphological operators for image sequences. Computer Vision and Image Understanding 62, 326–346 (1995)
6. Hoover, A., Goldbaum, M.: Locating the optic nerve in a retinal image using the fuzzy convergence of the blood vessels. IEEE Transactions on Medical Imaging 22(8), 951–958 (2003)
7. Lowell, J., Hunter, A., Steel, D., Basu, A., Ryder, R., Kennedy, R.L.: Measurement of Retinal Vessel Widths from Fundus Images Based on 2-D Modeling. MedImg 23(10), 1196–1204 (2004)
8. Matsopoulos, G.K., Asvestas, P.A., Mouravliansky, N.A., Delibasis, K.K.: Multimodal registration of retinal images using self organizing maps. MedImg 23(12), 1557–1563 (2004)
9. Mendels, F., Heneghan, C., Thiran, J.: Identification of the optic disc boundary in retinal images using active contours. In: Proceedings of the Irish Machine Vision and Image Processing Conference, pp. 103–115 (1999)
10. Mukherjee, J., Kumar, A.M., Das, P.P., Chatterji, B.N.: Use of medial axis transforms for computing normals at boundary points. Pattern Recognition Letters 23(14), 1649–1656 (2002)
11. Netravali, A.N., Haskell, B.G.: Digital Pictures: Representation, Compression, and Standards, 2nd edn. Plenum Press, New York (1995)
12. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybernet. SMC-9(1), 62–66 (1979)
13. Peters, R.: Mathematical morphology for angle-valued images. Non-linear Image Processing. In: International Conference on Electronic Imaging, Society of Photo-optical Instrumentation Engineers, pp. 1–11 (1997)
14. Soares, J.V.B., Leandro, J.J.G., Cesar Jr., R.M., Jelinek, H.F., Cree, M.J.: Retinal vessel segmentation using the 2-D Gabor wavelet and supervised classification. MedImg 25(9), 1214–1222 (2006)
15. Soille, P.: Morphological Image Analysis. Springer, Berlin (1999)
16. Staal, J., Abramoff, M.D., Niemeijer, M., Viergever, M.A., van Ginneken, B.: Ridge-based vessel segmentation in color images of the retina. MedImg 23(4), 501–509 (2004)
17. Smith, S.W.: The Scientist & Engineer's Guide to Digital Signal Processing. California Technical Publishing (1997)
18. Tian, H., Lam, S.K., Srikanthan, T.: Implementing Otsu's Thresholding Process Using Area-Time Efficient Logarithmic Approximation Unit. In: IEEE International Symposium on Circuits and Systems (ISCAS), vol. 4, pp. 21–24 (2003)
19. Vermeer, K.A., Vos, F.M., Lemij, H.G., Vossepoel, A.M.: A model based method for retinal blood vessel detection. Computers in Biology and Medicine 34, 209–219 (2004)
Acceleration of High Dynamic Range Imaging Pipeline Based on Multi-threading and SIMD Technologies
Radoslaw Mantiuk and Dawid Pająk
Szczecin University of Technology, Zolnierska 49, 71-210 Szczecin, Poland
{rmantiuk,dpajak}@wi.ps.pl
http://zgk.wi.ps.pl
Abstract. In this paper we present a holistic approach to CPU based acceleration of the high dynamic range imaging (HDRI) pipeline. The high dynamic range representation can encode images regardless of the technology used to create and display them, with the accuracy that is only constrained by the limitations of the human eye and not a particular output medium. Unfortunately, the increase in accuracy causes significant computational overhead and effective hardware acceleration is needed to ensure a utility value of HDRI applications. In this work we propose a novel architecture of the HDRI pipeline based on CPU SIMD and multi-threading technologies. We discuss the impact on processing speed caused by vectorization and parallelization of individual image processing operations. A commercial application of the new HDRI pipeline is described together with evaluation of achieved image processing speed-up. Keywords: high dynamic range imaging, SIMD architecture, SSE, multi-threading architecture, image processing, computer visualization.
1 Introduction
The advances in high dynamic range imaging (HDRI), especially in display and camera technology, have a significant impact on existing imaging systems. The assumptions of traditional low dynamic range imaging, designed for paper print as a major output medium, are ill suited for the range of visual material that is shown on modern displays. The high dynamic range representation can encode images regardless of the technology used to create and display them, with an accuracy that is only constrained by the limitations of the human eye and not by a particular output medium. The disadvantage of HDRI technology is its computational complexity. A single pixel in a high dynamic range image consumes 4 times more memory than in a low dynamic range image (three 4-byte floating point numbers against 3 bytes in the low dynamic range representation). This complexity means that the HDRI pipeline is not used in many applications which would
significantly benefit from HDR accuracy. For example, processing of huge datasets from medical computed tomography or RAW photographs is problematic for typical personal computers or laptops. Moreover, the constant increase of image resolution and of the complexity of image processing algorithms will make this problem even worse in the future. In this paper we argue that by using modern CPU technologies, it is possible to accelerate the image processing of the HDRI pipeline significantly. The acceleration can be achieved based on existing CPU capabilities: the SIMD instruction set and multi-processor and multi-core architectures. SIMD (Single Instruction Multiple Data) instructions allow us to speed up the processing of floating point vector data, which can represent HDR pixels. Because of the independent transformation of individual pixels, most HDRI algorithms are well suited for parallel processing. A multi-threading architecture accelerates such processing almost by a factor equal to the number of available threads. A goal of efficient HDRI computing is to accelerate the whole HDRI pipeline rather than to speed up individual operations. In this paper we propose the architecture of an accelerated pipeline which benefits from careful optimization of HDRI algorithms and from effective RAM memory management. We also introduce queueing techniques that enable grouping many simple operations into one complex command path. In this way automatic optimization of CPU hardware usage is implemented and effective acceleration of complex algorithms is possible. We review existing SIMD and multi-threading based technologies in Section 2. The concept of the high dynamic range imaging pipeline and its possible acceleration techniques are discussed in Section 3. In Section 4 we present the architecture of our novel HDRI pipeline which uses SIMD and multi-threading technologies to speed up data processing. We then describe a software package that operates on the new HDRI pipeline (Section 5) and discuss the achieved results.
2 Previous Work
A general approach to image processing using SIMD and parallel hardware can be found in a few software packages. The VIPS (VASARI Image Processing System) library [7] seems to be the best-known LGPL system for processing huge images. It divides images into small arrays and uses multi-threading to process them effectively on SMP (Symmetric Multiprocessor) computers. Intel Integrated Performance Primitives (Intel IPP) [8] is a library of multi-core-ready, optimized software functions for multimedia data processing. Careful programming in plain C/C++ code and compilation based on IPP compilers can speed up applications. The Image Processing Toolbox (IPT) [10] provides a set of functions for image manipulation and analysis. The IPT capabilities include SIMD and multi-threading optimized color space transformations, linear filtering, mathematical morphology, geometric transformations, image filtering and analysis. Acceleration is achieved based on the O-Matrix engine [10] that supports fast matrix processing. The multi-threading and SIMD architecture is also exploited by the GENIAL (GENeric Image Array Library) library [9] to speed up the computation of signal processing
algorithms. The architecture of GENIAL is based on the same conventions as the Standard Template Library (STL), consisting of containers, iterators, adaptors, function objects and algorithms. The intensive use of templates makes it possible for the library to automatically adapt calculations on containers to the specified problem in order to achieve faster execution. There are a few acceleration toolkits in the medical imaging community. ITK (Insight Segmentation and Registration ToolKit) [11] is a library for image segmentation and registration. Another available choice, MITK (Medical Imaging ToolKit) [12], uses CPU SIMD instructions to accelerate matrix and vector computations, and linear and tri-linear interpolation computations. Both toolkits provide a general framework for medical imaging rather than a set of highly optimized image processing functions. All these approaches are general and not optimized for HDRI processing. In particular, they are not intended for rendering of the complete HDRI pipeline. In this paper we propose a more efficient solution at the cost of rejecting the generality of computations. We present a holistic approach to CPU based acceleration of the HDRI pipeline. We do not deal with GPU (Graphics Processing Unit) acceleration techniques [1]. The usage of the GPU seems promising for fast HDRI processing, but this technology is not common on many platforms (e.g. mobile phones or PDA devices). Moreover, advanced GPU capabilities (e.g. shader support) are not well standardized yet, so we leave GPU acceleration as future work.
3 Acceleration of the HDRI Pipeline
High dynamic range imaging [2] is a new paradigm that involves a highly accurate representation of images. As it originates from light simulation (computer graphics rendering) and measurements, the pixels of HDR images are assigned a physical meaning. This highly accurate representation of images gives a unique opportunity to create a common imaging framework that could meet the requirements of different imaging disciplines. Figure 1 illustrates an example of the HDRI pipeline [3] that starts with the acquisition of a real world scene or the rendering of an abstract model using computer graphics techniques and ends at a display. This pipeline overcomes the shortcomings of a typical graphics pipeline that doesn't support devices of a higher dynamic range or a wider color gamut [4]. The drawback of the HDRI pipeline is that much more data has to be processed in comparison to a typical 8-bit pipeline. Pixels of an HDR image are represented by a vector of floating point values: one 4-byte floating point number for each of 3 or more color channels (e.g. in the case of multi-spectral imaging more than 10 channels should be supported). The number of bpp (bits per pixel) is then four or more times higher than for 8-bit images. Additionally, it should be considered that most HDRI devices generate huge sets of data because of the growing sampling resolutions of input and output devices. Like in typical images, the dimensionality of HDR images is important and changes of context in both horizontal and vertical directions cannot be neglected. Summing up, to achieve
Acceleration of HDRI Pipeline
783
SIMD Acceleration Thread 1
Real scene
HDR camera
Image chunk 1
Image processing
Tone mapping
Image chunk 2
Image processing
Tone mapping
Image chunk 3
Image processing
Tone mapping
Image chunk 4
Image processing
Tone mapping
Image ch. n-1
Image processing
Tone mapping
Image chunk n
Image processing
Tone mapping
Thread 2 Image splitting
HDR Storage
Image merging
... Thread N
Abstract 3D model
Rendering engine (floating point processing)
LDR display
HDR display
Fig. 1. High dynamic range imaging pipeline accelerated by SIMD and multi-threading operations
the same performance as for the low dynamic range pipeline, the data in the HDRI pipeline need to be processed much faster. Recent developments in CPU hardware allow HDRI processing to be sped up significantly. In particular, SIMD instructions can be exploited effectively to process many data elements in one CPU cycle, and HDRI processing is amenable to parallelization, so multi-threading accelerates the computations as well. To exploit SIMD and multi-threading efficiently, we selected a set of algorithms crucial to HDRI processing and then propose a new architecture of the HDRI pipeline. The set of selected algorithms is depicted in Figure 2. From the acceleration viewpoint, the most important group of operations is matrix arithmetic. This group covers both matrix-by-matrix operations (e.g. matrix multiplication) and scalar-by-matrix operations (e.g. multiplication of all matrix elements by a scalar value). Also, matrix manipulation algorithms, like transposition or vertical and horizontal shifts, should be considered. Channel masking and pixel masking can eliminate selected color channels or pixels from pipeline processing based on conditional expressions. The accumulation algorithms, such as the computation of a sum of pixel values in an image area, are time consuming and should be accelerated. In many cases HDR images must be transformed by non-linear functions, and a Look-Up Table (LUT) is the fastest and simplest way to do so. Finally, a selected group of advanced image processing algorithms is accelerated; this group includes image scaling, color space conversions and color profile conversions.
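To make the scalar-by-matrix group concrete, the sketch below applies one such operation with SSE intrinsics. It is an illustrative fragment rather than code from the library described later; the function name and the alignment and size assumptions are ours.

#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Multiply every element of a float buffer by a scalar, 4 elements per SSE
// instruction. Assumes 'data' is 16-byte aligned and 'count' is a multiple of 4.
void scale_buffer_sse(float* data, std::size_t count, float scalar) {
    const __m128 s = _mm_set1_ps(scalar);          // broadcast the scalar to 4 lanes
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(data + i);          // load 4 packed floats
        v = _mm_mul_ps(v, s);                      // 4 multiplications at once
        _mm_store_ps(data + i, v);                 // store 4 results
    }
}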
Matrix operations:
- scalar-by-matrix: multiplication/division, addition/subtraction, exponential operations, logarithmic operations
- matrix-by-matrix: multiplication, matrix reciprocal, transposition

Accumulation operations: sum of pixel values, computation of the maximum/minimum value, arithmetic average, geometric average

Conditional expressions: channel masking, pixel masking, logic operations

Look-Up Table operations

Advanced operations: image scaling, image rotation, color transformations, color profile conversion, histogram computation, Gaussian pyramid, convolution, statistic coefficients
Fig. 2. HDRI operations crucial for fast image processing
4
Using SIMD Operations and Multi-threading in HDRI Processing
In Figure 1 a novel HDRI pipeline accelerated by CPU SIMD operations and multi-threading is presented. The main goal of the pipeline processing is to exploit the SIMD and multi-threading architectures efficiently. Moreover, data interchange between the CPU and the computer's RAM should be limited as much as possible: all intermediate and temporary results should be stored in CPU registers or in the CPU cache memory. If larger temporary storage is required (e.g. in local tone mapping operators [2]), the algorithm can exploit the L1 cache, as it offers lower access latency and greater performance than system memory. In this case, however, an HDR image must be divided into suitable chunks due to the limited cache size. Parallel processing is exploited in the new pipeline: we divide an HDR image into arrays/chunks and the chunks are processed independently. The size of a chunk is limited by the size of the CPU L1 cache rather than by the number of available threads or processors; we noticed that even in a single-threaded system it is faster to process small chunks of data than the whole image at once. In the pipeline implementation a fixed number of threads (equal to the number of execution units reported by the operating system) is created at run-time. Threads become active only when new tasks are assigned to them and go to sleep right after they finish processing the scheduled tasks. Thread management is handled by the operating system; it is the operating system's responsibility to identify Hyper-Threading, multi-core or multi-processor hardware and manage the threads efficiently. The goal of SIMD computation is to perform a single operation on many data elements. Modern CPU hardware is equipped with a set of SIMD arithmetic, logical, comparison and conversion instructions. They process 128-bit words in one CPU cycle, so a 4-times speed-up of basic operations is potentially possible. Almost all current CPUs offer a SIMD instruction set; examples are SSE [5] in Intel and AMD processors or AltiVec in IBM's PowerPC.
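A minimal, portable sketch of this chunked, thread-per-execution-unit processing is given below. It uses std::thread instead of the Win32 primitives mentioned later in the paper, and the chunk size and function names are illustrative assumptions, not the library's actual values.

#include <thread>
#include <vector>
#include <algorithm>
#include <cstddef>

// Illustrative chunk size: a few tens of KB so that one chunk plus temporaries
// fits comfortably in the L1 data cache. A real value would be tuned per CPU.
constexpr std::size_t kChunkFloats = 8 * 1024;

// Apply 'op' to an HDR buffer chunk by chunk, one worker per hardware thread.
template <typename Op>
void process_chunks(float* pixels, std::size_t count, Op op) {
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunks = (count + kChunkFloats - 1) / kChunkFloats;

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([=] {
            // Static interleaved assignment of chunks to workers.
            for (std::size_t c = w; c < chunks; c += workers) {
                const std::size_t begin = c * kChunkFloats;
                const std::size_t end = std::min(begin + kChunkFloats, count);
                op(pixels + begin, end - begin);   // process one cache-sized chunk
            }
        });
    }
    for (auto& t : pool) t.join();   // synchronize before the next pipeline stage
}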
HDRI processing is especially suited to acceleration based on the SIMD architecture. A common 4-channel RGBA representation (red, green, blue and alpha channels) of an HDR pixel can be considered as one 128-bit word, so all channels can be processed simultaneously. The CPU SIMD instruction set delivers most of the operations required in HDRI computing [6]. In the case of advanced operations which are not available (like logarithm computation or exponentiation), we use existing instructions to approximate the results. For example, to compute log2(x) we use specific features of the single precision floating-point number representation. As defined in the IEEE 754-1985 specification, a single precision number is described by s * 2^e * m, where s is the sign bit, e is the 8-bit exponent and m is the 24-bit normalized mantissa. We calculate the log2(x) value by extracting the exponent from the number representation and adding it to an approximation of log2(m) of the extracted mantissa: log2(x) = log2(2^e * m) = e + log2(m), where x > 0. In our implementation we use a fifth-degree Chebyshev minimax polynomial to approximate the function log2(m). This technique results in a small relative error (about 10^-6) and can be implemented very efficiently on the SIMD architecture. Most of the operations performed inside the HDRI pipeline are executed on the luminance (single channel) values of HDR pixels. This makes the calculation even more efficient, as we process 4 pixels with each instruction. An example of this set of operations is the Look-Up Table transformation, which lets the user apply a custom non-linear transformation to the input luminance values (e.g. a custom global TMO curve). Given a set of non-overlapping input ranges [a0, b0], ..., [an, bn] and their mapped counterpart ranges [c0, d0], ..., [cn, dn], we transform a luminance value l using the formula le = cx + ((l - ax) / (bx - ax)) * (dx - cx), where ax <= l < bx. The most time consuming task is finding the right input range for every pixel. By using SIMD selective write operations and bit masking we speed up the procedure by finding the ranges for 4 pixels at the same time. Because of the characteristics of image data (neighbouring pixels have relatively small differences in luminance values), the performance drawback coming from redundant loop passes (some pixels in the vector might have their range locked, but we continue the search until the ranges for all 4 pixels are found) is in most cases negligible.
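The exponent/mantissa decomposition can be sketched in scalar C++ as follows. This is not the authors' SIMD code: the bit manipulation mirrors the description above, while std::log2 stands in for the fifth-degree Chebyshev polynomial that the real implementation evaluates in SSE registers.

#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar illustration of the log2 trick described above: split x > 0 into
// exponent and mantissa, so that log2(x) = e + log2(m) with m in [1, 2).
float log2_via_exponent(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);            // reinterpret the IEEE 754 bits

    const int e = static_cast<int>((bits >> 23) & 0xFF) - 127;  // unbiased exponent

    // Force the exponent field to 127 so the remaining bits give m in [1, 2).
    const std::uint32_t mant_bits = (bits & 0x007FFFFFu) | 0x3F800000u;
    float m;
    std::memcpy(&m, &mant_bits, sizeof m);

    return static_cast<float>(e) + std::log2(m);    // the polynomial would replace std::log2
}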
5
Example Application: The HDRI Library
We implemented the HDRI pipeline as a Windows dynamic library. All public methods of the library are accessed through a simple C API, similar to OpenGL, with an internal state machine on the library side. The library uses multi-threading and vector processing capabilities. The functions are manually optimized with intensive usage of MMX/SSE/SSE2 intrinsic code. The initialization code is able to detect the features of the CPU (by invoking the cpuid instruction) and choose the best possible code path for the library methods. A special 64-bit version is also available for Win64 systems (an additional performance gain comes from the increased number of CPU SIMD registers). The library itself was implemented in C++.
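A sketch of how such run-time dispatch might look is shown below, using the MSVC __cpuid intrinsic. The feature bits tested are the documented CPUID leaf-1 bits for SSE/SSE2/SSE3; the structure and function names are ours and do not reflect the library's internal API.

#include <intrin.h>   // MSVC: __cpuid

struct CpuFeatures { bool sse, sse2, sse3; };

// Query CPUID leaf 1 and read the SSE feature bits (EDX bit 25/26, ECX bit 0).
CpuFeatures detect_cpu_features() {
    int regs[4] = {0, 0, 0, 0};       // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    CpuFeatures f;
    f.sse  = (regs[3] & (1 << 25)) != 0;
    f.sse2 = (regs[3] & (1 << 26)) != 0;
    f.sse3 = (regs[2] & (1 << 0))  != 0;
    return f;
}

The detected features would then be used to select, for example, an SSE2 or plain FPU code path for each exported function.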
Fig. 3. Selected HDRI library pipeline stages (simplified). The default configuration of the HDRI pipeline combines a specified number of input HDR images into one intermediate image, which is then processed and written as an output LDR image. Most of the functions work on the luminance channel, which is extracted in the pre-processing stage. At the end of the pipeline the results of the luminance processing are applied to all channels of the input HDR image.

The activity diagram in Figure 3 shows the simplified processing stages of the HDRI pipeline: blending and geometry transformation (e.g. cropping) of the input HDR images, luminance channel extraction, optional computation of luminance statistics (e.g. minimum and maximum values), optional luminance clipping, optional false colour image generation, tone mapping (TMO), RGB retrieval, optional sRGB gamma correction, and output of the LDR image. The actual implementation of the pipeline stages is much more complicated and includes additional features and modules; we skip them because they do not influence the design of the pipeline acceleration methods. A synchronization of all working threads is required to gather the results, calculate the required image-specific factors and then move to the next stage. The synchronization of threads is implemented via the Win32 event system, which is known for low latency and quick kernel dispatching time. Even though we use the most efficient synchronization method available, this task tends to be the slowest element of the pipeline processing (the delays are not even correlated with the image data size). By carefully designing the pipeline stages and the operation types at each stage we reduced the number of synchronizations to the absolute minimum. Besides the fixed functionality pipeline we have implemented a mechanism which lets the programmer combine many simple SIMD accelerated functions (addition, multiplication, division, etc.) with their corresponding arguments into one complex command, which is then executed by the multi-threaded, chunk processing engine of the pipeline. This queueing method adds flexibility to the computational abilities of the library, but it comes at the price of degraded locality of calculations and increased memory bandwidth usage; thus it cannot compete in terms of performance with the fixed functionality pipeline. A set of 5 HDR images (6 M-pixels each) was used to conduct performance tests of the library. The tests cover both low level (basic data processing) operations and advanced HDRI processing operations. The timings in the upper part of Table 1 are arithmetic means of the results obtained for the individual test images. The lower part of the table covers tests of the execution of the whole HDRI pipeline, starting from HDR image blending (for 5 input images) and ending at tone mapping (TMO) and LDR image generation. The test platform was a Windows XP based PC equipped with an Athlon X2 3800+ dual core CPU and 2 GB of DDR RAM. Noticeable speed-ups are visible both in internal functions and in overall pipeline performance. The computational power of the SIMD architecture is especially exposed in the calculation of the log-average (arithmetic mean of logarithms) or the sRGB gamma correction, where approximated functions are used instead of their built-in counterparts. The decreased acceleration ratio of functions with conditional execution (compare the timings for log-average and conditional log-average) is caused by the streaming nature of SIMD computation: we calculate the values of all elements in the vector and write the result depending on the condition bit mask, whereas the SISD implementation only calculates the results for the elements which pass the condition statement. Operations performing well on a super-scalar FPU (e.g. image blending) or resistant to vectorization due to the nature of the algorithm (tone mapping) still take advantage of parallel processing, with a speed-up close to the theoretical limit of 2 (the number of cores in our test system). The performance of false color image generation (a process of encoding the HDR image luminance into an LDR image where certain luminance values are mapped to predefined RGB triplets from a LUT)
does not scale linearly with the number of available execution units and seems to be bound by memory bandwidth.
Table 1. Computation time of internal pipeline functions. For the SIMD implementation both single-threaded (st) and multi-threaded (mt) tests were performed. The speed-up is the ratio between the measured time of the single-threaded FPU code and the multi-threaded SIMD implementation.

Operation                                     Time [ms]                      Speed-up
                                              FPU(st)  SIMD(st)  SIMD(mt)
HDR image blending                                257       238       139       1.85
Log-average                                       395        33        19      20.79
Conditional log-average                           238        43        25       9.52
XYZ to RGB conversion                              64        38        20       3.2
Conditional XYZ to RGB conversion                  90        49        22       4.09
sRGB gamma correction                            1971       302       157      12.55
Global photographic TMO [13]                     3146       590       360       8.74
Gamma TMO                                        3356       621       375       8.95
False color image generation (LUT example)        690       321       220       3.14
We have compared our implementation with the VIPS library in terms of low-level function performance. All tested functions of the HDRI library performed better than their VIPS equivalents. For example, the computation of the log-average is about 11 times faster in our solution (both test programs utilize the multi-threading architecture). This was expected, as the VIPS library is architecture independent and does not use the SIMD instruction set or math function approximations.
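The branch-free conditional execution responsible for the lower speed-up of the conditional variants in Table 1 (all lanes are always computed and the result is then selected by a bit mask) can be illustrated with a generic SSE sketch. This is not the library's conditional log-average code; the alignment and size assumptions are ours.

#include <xmmintrin.h>
#include <cstddef>

// Branch-free conditional sum: add only the elements greater than 'threshold'.
// All four lanes are always computed; the compare mask zeroes out the rejected
// ones before accumulation. Assumes 16-byte alignment and count % 4 == 0.
float masked_sum_sse(const float* data, std::size_t count, float threshold) {
    const __m128 thr = _mm_set1_ps(threshold);
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < count; i += 4) {
        const __m128 v    = _mm_load_ps(data + i);
        const __m128 mask = _mm_cmpgt_ps(v, thr);   // all-ones where v > threshold
        acc = _mm_add_ps(acc, _mm_and_ps(v, mask)); // rejected lanes contribute 0
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}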
6
Conclusions and Future Work
The limitations of the existing low dynamic range imaging technology can be addressed and eliminated in the HDR imaging pipeline, which offers a higher precision of visual data. In this work we outlined acceleration techniques for processing HDR images. We use the SIMD and multi-threading technologies available in present CPU hardware to overcome computation bottlenecks. Those technologies, together with the proposed novel data processing architecture, make the HDRI pipeline as fast as the traditional pipeline. In the final section of the paper we described the software library implemented based on the proposed design. The utility of this library has been proven in commercial applications (they will be available on the market in the near future). A further increase of the computation speed of the HDRI library could be achieved using specialized instructions from the SSE3/SSE4 extensions (e.g. the hardware dot product and horizontal data processing instructions could speed up accumulation operations by a factor of 4). At the time of writing the implementation, CPUs with SSE3/SSE4 support were not widespread enough in the market to include this code path in the library design, but we plan to use those operations in the future. We have also been extending the functionality of the library to accelerate more complex image processing operations (e.g. histogram computation or Gaussian pyramid usage). In the future we plan to port the library to GPU hardware.
Acknowledgments. The research work whose results are presented in this paper was sponsored by the Polish Ministry of Science and Higher Education (years 2006-2008).
References

1. Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A.E., Purcell, T.: A survey of general-purpose computation on graphics hardware. In: Proc. of Eurographics 2005, State of the Art Reports, September 2005, pp. 21–51 (2005)
2. Reinhard, E., Ward, G., Pattanaik, S., Debevec, P.: High Dynamic Range Imaging: Data Acquisition, Manipulation, and Display. Morgan Kaufmann, San Francisco (2005)
3. Mantiuk, R., Krawczyk, G., Mantiuk, R.: High Dynamic Range Imaging Pipeline: Merging Computer Graphics, Physics, Photography and Visual Perception. In: Spring Conference on Computer Graphics 2006 Posters and Conference Materials, April 20-22, pp. 37–40 (2006)
4. Mantiuk, R., Krawczyk, G., Mantiuk, R., Seidel, H.P.: High Dynamic Range Imaging Pipeline: Perception-motivated Representation of Visual Content. In: Proc. of SPIE, Human Vision and Electronic Imaging XII, vol. 6492, 649212 (2007)
5. Intel 64 and IA-32 Architectures Optimization Reference Manual (May 2007)
6. Intel Architecture Software Developer Manual, Instruction Set Reference, vol. 2 (1999)
7. Martinez, K., Cupitt, J.: VIPS - a highly tuned image processing software architecture. In: Proceedings of the IEEE International Conference on Image Processing, Genova, vol. 2, pp. 574–577 (2005)
8. Taylor, S.: Intel Integrated Performance Primitives Book. ISBN 0971786135, ISBN-13 9780971786134 (2004)
9. Laurent, P.: GENIAL - GENeric Image Array Library, http://www.ient.rwth-aachen.de/team/laurent/genial/genial.html
10. Harmonic Software Inc.: IPT - The Image Processing Toolbox for O-Matrix, http://www.omatrix.com/ipt.html
11. Ibanez, L., Schroeder, W., Ng, L., Cates, J.: The ITK Software Guide, 2nd edn. Kitware, Inc. (November 2005)
12. Zhao, M., Tian, J., Zhu, X., Xue, J., Cheng, Z., Zhao, H.: The Design and Implementation of a C++ Toolkit for Integrated Medical Image Processing and Analysis. In: Proc. of SPIE Conference, vol. 6, pp. 5367–5374 (2004)
13. Reinhard, E., Stark, M., Shirley, P., Ferwerda, J.: Photographic Tone Reproduction for Digital Images. ACM Trans. on Graphics 21(3), 267–276 (2002)
Monte Carlo Based Algorithm for Fast Preliminary Video Analysis Krzysztof Okarma and Piotr Lech Szczecin University of Technology, Faculty of Electrical Engineering, Chair of Signal Processing and Multimedia Engineering, 26. Kwietnia 10, 71-126 Szczecin, Poland {krzysztof.okarma,piotr.lech}@ps.pl
Abstract. In the paper a fast statistical image processing algorithm for video analysis is presented. Our method can be used on colour as well as grayscale or even binary images. The main component of the proposed approach is based on statistical analysis using the Monte Carlo method. A video’s statistical information is acquired by specifying a logical condition for the Monte Carlo technique. The results of the algorithm depend on the correct choice of threshold values; thus the application area is limited by the adaptability of the thresholds to videos with large heterogeneity: e.g. videos with objects moving into and out of the scene, rapidly varying illumination, etc. Keywords: statistical image analysis, Monte Carlo method.
1
Description of the Method
For a static image of an analysed scene a constant value can be defined, related to the number of pixels fulfilling a specified logical condition. Such a condition can be defined e.g. as the membership of an image sample in a specified luminance range. In that case the algorithm works as an area estimator for the objects fulfilling the specified luminance criterion. As the result of analysing the whole image, a corresponding binary image is created which stores the value 1 for the samples that fulfil the condition and 0 for the others. This gives quantitative information related to the object's features described by the logical condition, which can be obtained by summing all the "ones". The estimator L^, given as

L^ = m ,                                                        (1)

where m stands for the total number of "ones" in the binary image, can be related to the area of a single object located in an empty scene, assuming the object's pixels fulfil the specified logical condition. After some additional morphological operations it is possible to easily estimate some other parameters such as the object's perimeter, diameter, moments etc.
Fig. 1. The example of the distortions caused by camera lens
Counting all the "ones" in a high resolution image may be time consuming, because all image samples have to be analysed. In order to increase the speed of the algorithm it is sufficient to reduce the number of analysed samples; in that case a statistical experiment using the Monte Carlo method is useful. The number of analysed points is equal to the number of draws from a pseudo-random generator with uniform distribution. The binary image samples can be stored in a one-dimensional vector, numbered from 1 to N, where N is the total number of samples in the scene. Then n independent draws (with replacement) are performed from the vector and the number k of "ones" drawn is stored. The estimated number of "ones" is equal to

L^_MMC = (k / n) * N ,                                          (2)

where k is the number of "ones" drawn, n is the number of draws and N is the total number of samples. The estimation error can be expressed as

eps_alpha = (u_alpha / sqrt(n)) * sqrt( (K/N) * (1 - K/N) ) ,   (3)

where u_alpha is the value denoting the two-sided critical range and K is the total number of "ones" in the entire image. The considerations presented above are valid for a generator with uniform distribution; preventing an increase of the error requires good statistical properties of the generator.
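A minimal sketch of the estimator from Eq. (2) is given below. The Mersenne Twister generator stands in for the "good" uniform pseudo-random generator required by the method; the function name and the flat 0/1 image representation are our assumptions.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Monte Carlo area estimator from Eq. (2): draw n sample positions (with
// replacement) from a binary image stored as a flat vector of 0/1 values and
// scale the hit count k by N / n.
double estimate_area_mc(const std::vector<std::uint8_t>& binary, std::size_t n,
                        std::mt19937& rng) {
    const std::size_t N = binary.size();
    std::uniform_int_distribution<std::size_t> pick(0, N - 1);

    std::size_t k = 0;                     // number of "ones" drawn
    for (std::size_t i = 0; i < n; ++i)
        k += binary[pick(rng)];

    return static_cast<double>(k) * static_cast<double>(N) / static_cast<double>(n);
}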
2
Influence of Geometrical Distortions
In order to simplify the considerations it is assumed that the goal of the algorithm is the estimation of the object's area.

Fig. 2. The comparison of the normalised results for locations 1-9 and the three camera lenses, for both the classical "full-analysis" procedure and the Monte Carlo method.

The sources of the geometrical distortions
are mainly related to the camera optics and may be observed mostly near the image corners, as shown in Fig. 1. The experiment has been performed using a digital camera with adjustable optical parameters (various lenses): the first test with the most visible distortions and the third one with almost invisible ones. The camera has been installed directly over a scene containing a single object of known dimensions. In this experiment the scene has been constantly lit by uniformly distributed light in order to eliminate the influence of lighting changes. The location of the object has been changed (see Fig. 1). The comparison of the results obtained using three different lenses for the classical "full-analysis" procedure (counting all the pixels belonging to the object) and the Monte Carlo method is illustrated in Fig. 2. All presented results are normalised assuming the exact object's area is equal to 1. They show that the influence of lens quality is relevant for both methods, and that using the Monte Carlo method with a good quality lens does not introduce significant errors. In systems with poor quality cameras some additional digital image correction algorithms may be needed to ensure high accuracy.
Fig. 3. Relative errors of area estimation for distortions level below 10% and over 10%
3
Influence of Lighting Conditions
In addition to the static errors dependent on the lens quality, in most practical applications some dynamic errors, caused by changes of local or global lighting conditions, may also appear. In order to analyse the influence of lighting it is assumed that the object is located in the centre of the scene and the best lens is used. The experiment has been performed for a small amount of distortions (below 10% of distorted pixels in the whole scene, compared to the uniformly lit scene) and for a significantly higher number of distorted pixels. All the distortions have been caused by light changes with a direct influence on the number of pixels fulfilling the same logical condition in each case. The threshold value of 10% distorted pixels has been chosen experimentally for the specified lens and the analysed static scene to better illustrate the observed effects. During the experiment 16 measurements have been performed under various lighting conditions (changes of the number of light sources, their locations and parameters). The results obtained for the two cases described above are presented in Fig. 3. Comparing the relative errors obtained for the two cases analysed in the paper, it is worth noticing that for a higher amount of distortions the Monte Carlo approach leads to results similar to the "full analysis". However, for the lower distortion level the advantage of the Monte Carlo method is much more visible.
Fig. 4. Idea of the edge detection using the Monte Carlo approach
4
Applications of the Method
The universality of the Monte Carlo method makes it possible to use it in many areas of digital image and video analysis. Many articles and books related to more or less advanced statistical techniques applied to video analysis have been published in recent years. A popular approach seems to be the usage of Sequential Monte Carlo [5] or Markov Chain Monte Carlo methods [9], e.g. for video text segmentation [1] as well as for some tracking purposes [6]. Nevertheless, such algorithms usually require relatively high computational power, so their application in some real-time systems is limited. Apart from that, a good example of reasonable requirements (a Pentium PC) is the real-time low bit-rate video segmentation approach presented in the paper [4], also designed for specific applications. The most crucial features of the proposed approach are its easy implementation and low computational complexity. All the applications where analysis of the whole image is not necessary and high accuracy of the results is not required are a potential field of its usage.

4.1
Estimation of Geometrical Features
A typical application of the Monte Carlo method is area estimation. However, the presented idea can also be applied to the estimation of some other geometrical features. The simplest one is perimeter estimation by edge detection. The analysed binary image should be divided into T x S elements of r x r pixels each using a square grid, with the Monte Carlo method used for the area estimation. All the blocks with an area equal to zero or equal to the size of the elementary block are not used in further analysis, because they do not correspond to the figure's edge, as shown in Fig. 4. In the next step the area of each object's fragment in the elementary square elements is calculated. The estimated values are stored in an array of T x S elements. Based on the binary image, an array K of the same size is created. The elements of that array have the following values: zero if the corresponding element's value in the binary image is equal to the size of the elementary block
(all pixels belong to the object) and none of its neighbouring blocks (using the 8-directional neighbourhood) has a zero value; zero if the corresponding element's value in the binary image is equal to zero (background); and one for the others (representing the edge). The array K shows the projection of the edges detected in the source image. The counted number of non-zero elements of the array K represents the estimated value of the object's perimeter expressed in squares of r x r pixels. In order to obtain a better estimate, in the final step the number of square elements should be increased (smaller values of the parameter r) and the first steps repeated. Additionally, using the array K obtained in the third step, the analysis of blocks of the binary image with zero values is not needed, so further analysis is performed only for a strongly limited number of indexed elements corresponding to the edge obtained in the third step. The limit accuracy of the algorithm is determined by the elementary block size of 1 pixel, which is equivalent to using convolution edge detection filters. Dividing the scene into smaller squares, it is also possible to easily estimate some motion parameters such as direction and velocity by applying the Monte Carlo procedure to each block. If the whole binary image is divided into T x S square blocks containing r x r pixels each, there is also the possibility to estimate some additional geometrical parameters which may be treated as local (e.g. mean diameter or average area) or global ones (e.g. the number of objects inside a given area). For a single object on the image plane the most interesting parameters are those which are insensitive to image deformations introduced during acquisition and to typical geometrical transformations such as scaling, translation and rotation. In this sense the usefulness of the simplest parameters, such as area or perimeter, is strongly limited, but many other factors, such as moments, can often be determined on the basis of the simplest ones. Some typical geometrical parameters used in image analysis are the horizontal and vertical projection lengths (easily extended by the analysis of the presence of concavities in the object's shape) and Feret's diameters as the measure of the object's maximum horizontal and vertical size [3]. An interesting group of parameters are the shape coefficients, because of the possibility of fast estimation and the wide opportunities of their usage for classification and recognition purposes. Most of them can be easily calculated on the basis of the area, the perimeter or Feret's diameters. Another group of parameters used during binary image analysis is represented by the linear moments, e.g. the first order ones related to the object's centre of gravity and the second order moments used as the object's inertia measures. The binary array used in the Monte Carlo approach can be treated as the equivalent of a reduced resolution image, so geometrical parameters can be expressed in blocks of r x r pixels instead of individual pixels.
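To make the block-grid procedure described at the beginning of this subsection concrete, the following sketch marks edge blocks and counts them as a perimeter estimate expressed in r x r blocks. Block occupancy is computed exactly here for clarity; in the method above the Monte Carlo estimator of Eq. (2) would be applied per block instead, and the function name and image layout are our assumptions.

#include <cstddef>
#include <cstdint>
#include <vector>

// Blocks that are partially filled, or fully filled but touching an empty
// 8-neighbour, are marked as edge blocks (array K = 1); counting them gives
// the perimeter expressed in r x r blocks.
std::size_t estimate_perimeter_blocks(const std::vector<std::uint8_t>& img,
                                      std::size_t width, std::size_t height,
                                      std::size_t r) {
    const std::size_t T = width / r, S = height / r;         // grid dimensions
    std::vector<std::size_t> fill(T * S, 0);                  // pixels set per block
    for (std::size_t y = 0; y < S * r; ++y)
        for (std::size_t x = 0; x < T * r; ++x)
            fill[(y / r) * T + x / r] += img[y * width + x];

    const std::size_t full = r * r;
    std::size_t edge_blocks = 0;
    for (std::size_t by = 0; by < S; ++by)
        for (std::size_t bx = 0; bx < T; ++bx) {
            const std::size_t f = fill[by * T + bx];
            if (f == 0) continue;                              // background block
            bool boundary = (f < full);                        // partially filled block
            if (!boundary)                                     // full block: check 8-neighbours
                for (int dy = -1; dy <= 1 && !boundary; ++dy)
                    for (int dx = -1; dx <= 1 && !boundary; ++dx) {
                        if (dx == 0 && dy == 0) continue;
                        const std::ptrdiff_t nx = static_cast<std::ptrdiff_t>(bx) + dx;
                        const std::ptrdiff_t ny = static_cast<std::ptrdiff_t>(by) + dy;
                        if (nx < 0 || ny < 0 || nx >= static_cast<std::ptrdiff_t>(T) ||
                            ny >= static_cast<std::ptrdiff_t>(S))
                            continue;
                        if (fill[static_cast<std::size_t>(ny) * T +
                                 static_cast<std::size_t>(nx)] == 0)
                            boundary = true;
                    }
            if (boundary) ++edge_blocks;                       // element of array K set to 1
        }
    return edge_blocks;   // perimeter in units of r x r blocks
}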
The processing time is then significantly shortened due to the reduction of the number of analysed pixels, and the performed experiments showed a good accuracy of their computation (sufficient for some typical image recognition applications). The performed tests have shown that the proposed approach can be valuable for the estimation of shape coefficients which depend on the area and the perimeter; in that case the estimation errors depend on the accuracy of the area and perimeter estimation. It is worth noticing that the approximation of moments and of geometrical parameters based on the perimeter requires the usage of the Monte Carlo method with division of the image into blocks, while for the geometrical parameters based only on the area or the projection lengths the standard Monte Carlo method is sufficient. In order to determine the projection lengths, additional storing of the minimum and maximum coordinates of the analysed pixels representing the object is necessary. The estimation of the moments requires a slightly more sophisticated algorithm, because an accuracy limited to the size of the block (r x r pixels) is not sufficient for most applications. The simplest solution is similar to the technique used for the perimeter estimation: blocks corresponding to the object's contour can be divided into smaller elements and then adjustment vectors for the moments are calculated. Table 1 illustrates the results of the statistical experiment performed for the estimation of the area and the perimeter of a single object located in the scene. The perimeter has been estimated using two approaches: counting all 32x32 pixel blocks representing the contour (Perimeter1) and using the square root of the estimated area of each block (Perimeter2). The first method leads to overestimation and the second one produces too small values. The correct values of the area and the perimeter obtained by the analysis of all pixels are 30286 and 552 pixels respectively.

Table 1. Area and perimeter values (in pixels) obtained in the experiments for various numbers of analysed points in the image and in the block, respectively

Method: full-image Monte Carlo
Points (N)    100     200     500     1000    5000
Area          31518   26127   29030   28698   29528
Perimeter1    -       -       -       -       -
Perimeter2    -       -       -       -       -

Method: Monte Carlo on 32x32 pixel blocks
Points        10      20      50      100     500
Area          32957   31938   30945   30530   30276
Perimeter1    640     672     736     768     768
Perimeter2    437     417     440     436     434

4.2
Fast Image Quality Estimation
Conventional objective full-reference image quality metrics, mainly based on the Mean Square Error [2], are poorly correlated with the Human Visual System, so some other proposals have been presented in recent years. One of the most popular seems to be the Universal Image Quality Index proposed by Wang and Bovik [7]. Such a measure models image distortions as the combination of three elements: loss of correlation, luminance distortion and loss of contrast. Assuming xi,j and yi,j are the values of the luminance for the pixel (i, j) of the original
and distorted image respectively, it is defined as the local index for a single image block (N x N pixels, usually N = 8) as

Q = (sigma_xy / (sigma_x * sigma_y)) * (2 * x_mean * y_mean / (x_mean^2 + y_mean^2)) * (2 * sigma_x * sigma_y / (sigma_x^2 + sigma_y^2))
  = 4 * sigma_xy * x_mean * y_mean / [ (sigma_x^2 + sigma_y^2) * (x_mean^2 + y_mean^2) ],      (4)

where x_mean and y_mean are the mean values and sigma stands for the standard deviation in the original and distorted image blocks respectively [7]. The overall quality index is defined as the mean value of metric (4) obtained for all blocks using a sliding window approach. In the paper [8] definition (4) has been extended into the Structural Similarity (SSIM) index, introducing the possibility of choosing an importance exponent for each of the three factors in Eq. 4, with an additional stability enhancement for the regions where x_mean or sigma_x^2 are close to zero. The modified expression, based on the usage of two coefficients preventing such instability, can be expressed as

SSIM = [ (2 * x_mean * y_mean + C1) * (2 * sigma_xy + C2) ] / [ (x_mean^2 + y_mean^2 + C1) * (sigma_x^2 + sigma_y^2 + C2) ],      (5)
where C1 and C2 are small values chosen experimentally, as suggested by the authors of the paper [8]. However, analysis of all the image pixels is time consuming, and in many applications exact image quality assessment is not the most crucial element, because the image quality estimation should be performed quickly and not necessarily very accurately. Besides, all image quality metrics should actually be treated as estimators, because there is no ideal objective image quality measure. For applications where the image quality estimation should be fast enough to avoid introducing additional delays, the calculation of the SSIM index using the Monte Carlo approach is proposed. The estimation of the local SSIM index should be performed only for some randomly chosen pixels inside the current sliding window. Assuming a good quality of the pseudo-random generator, the expected number of drawn pixels inside the window should be almost the same for each window position. The advantage of this approach is an equal chance to analyse each pixel of the image, so there is no need to use any sophisticated method for decreasing the resolution in order to preserve some patterns. The results of the image quality estimation using the proposed approach for test images after low-pass and median filtering, JPEG compression and contamination by an achromatic impulse noise are shown in Tables 2 - 4. The usage of a limited number of randomly distributed pixels during the calculation of the Structural Similarity index may lead to a good quality estimate, assuming a pseudo-random generator with uniform distribution, even for a low number of samples for some typical distortions. The results obtained for the image 'Baboon' (Table 3) differ from the other images because of the specific character of this image, with many details. It leads to a lower quality index for the lossy JPEG compression (many pixels differ from their originals) and a better one for the images contaminated by an impulse ('salt and pepper') noise.
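A sketch of the sampled local SSIM computation for a single window position is given below; the overall index would average this value over all sliding-window positions. The constants C1 and C2 are placeholders (the paper chooses them experimentally following [8]), and the function signature is our assumption.

#include <cstddef>
#include <random>
#include <vector>

// Sampled local SSIM index from Eq. (5): statistics of an N x N window with
// top-left corner (x0, y0) are estimated from 'samples' randomly drawn pixel
// positions instead of all N*N pixels.
double sampled_ssim_window(const std::vector<double>& x, const std::vector<double>& y,
                           std::size_t width, std::size_t x0, std::size_t y0,
                           std::size_t N, std::size_t samples, std::mt19937& rng,
                           double C1 = 1e-4, double C2 = 1e-4) {
    std::uniform_int_distribution<std::size_t> off(0, N - 1);
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t s = 0; s < samples; ++s) {
        const std::size_t idx = (y0 + off(rng)) * width + (x0 + off(rng));
        const double a = x[idx], b = y[idx];
        sx += a; sy += b; sxx += a * a; syy += b * b; sxy += a * b;
    }
    const double n = static_cast<double>(samples);
    const double mx = sx / n, my = sy / n;
    const double vx = sxx / n - mx * mx;          // sample variance of the original block
    const double vy = syy / n - my * my;          // sample variance of the distorted block
    const double cxy = sxy / n - mx * my;         // sample covariance
    return ((2 * mx * my + C1) * (2 * cxy + C2)) /
           ((mx * mx + my * my + C1) * (vx + vy + C2));
}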
Table 2. The SSIM index obtained for various distortions using the Monte Carlo approach with various numbers of points - the image 'Kodim'

Number     low-pass  low-pass  median  median  5%      20%     JPEG    JPEG
of points  3x3       5x5       3x3     5x5     noise   noise   60%     10%
50         0.9280    0.8581    0.9418  0.8607  0.0880  0.0613  0.9448  0.8495
100        0.9355    0.8780    0.9471  0.8997  0.1144  0.0799  0.9472  0.8448
200        0.9366    0.8811    0.9437  0.8982  0.1094  0.0682  0.9428  0.8448
500        0.9405    0.8803    0.9508  0.9013  0.1183  0.0723  0.9450  0.8475
1000       0.9384    0.8718    0.9489  0.8945  0.1217  0.0768  0.9472  0.8502
5000       0.9387    0.8737    0.9498  0.8956  0.1220  0.0759  0.9454  0.8447
10000      0.9392    0.8762    0.9492  0.8961  0.1847  0.0746  0.9457  0.8485
all        0.9393    0.8768    0.9493  0.8974  0.2888  0.0741  0.9455  0.8468
Table 3. The SSIM index obtained for various distortions using the Monte Carlo approach with various numbers of points - the image 'Baboon'

Number     low-pass  low-pass  median  median  5%      20%     JPEG    JPEG
of points  3x3       5x5       3x3     5x5     noise   noise   60%     10%
50         0.6919    0.4456    0.7333  0.4613  0.5661  0.2323  0.8987  0.7059
100        0.6917    0.4790    0.7123  0.5160  0.5043  0.1826  0.9003  0.6790
200        0.6794    0.4540    0.7039  0.4633  0.5281  0.2342  0.8986  0.6838
500        0.6906    0.4593    0.7178  0.4846  0.5674  0.2363  0.8992  0.6880
1000       0.6877    0.4629    0.7168  0.4841  0.5409  0.2254  0.8954  0.6802
5000       0.6860    0.4591    0.7162  0.4796  0.5548  0.2270  0.8992  0.6790
10000      0.6846    0.4591    0.7125  0.4798  0.5532  0.2318  0.8977  0.6778
all        0.6861    0.4608    0.7147  0.4815  0.5518  0.2295  0.8986  0.6800
Table 4. The SSIM index obtained for various distortions using the Monte Carlo approach with various numbers of points - the image 'Lena'

Number     low-pass  low-pass  median  median  5%      20%     JPEG    JPEG
of points  3x3       5x5       3x3     5x5     noise   noise   60%     10%
50         0.9212    0.8533    0.9324  0.8807  0.2892  0.1098  0.9441  0.8438
100        0.8967    0.8123    0.9072  0.8398  0.3052  0.0948  0.9175  0.7926
200        0.9043    0.8130    0.9175  0.8456  0.3393  0.1115  0.9258  0.8121
500        0.9087    0.8305    0.9202  0.8565  0.3411  0.0920  0.9234  0.8136
1000       0.9091    0.8309    0.9210  0.8585  0.3469  0.0953  0.9250  0.8173
5000       0.9105    0.8340    0.9215  0.8612  0.3432  0.0935  0.9256  0.8159
10000      0.9097    0.8355    0.9203  0.8613  0.3391  0.0930  0.9243  0.8161
all        0.9103    0.8347    0.9212  0.8612  0.3397  0.0945  0.9251  0.8164
Analysing the presented results for many typical distortions, the relative errors of the SSIM values estimated using a strongly limited number of pixels are about 1-2%, so the proposed fast Monte Carlo SSIM estimation can be treated as an interesting alternative for applications in lower performance systems with a smaller amount of memory.
5
Conclusions
Considering the fact that the Monte Carlo method is much faster than full image analysis, it seems to be a good alternative to the classical methods in a wide area of applications. The presented examples are not comprehensive, but it is worth noticing that the presented approach can be very useful especially in embedded systems with low computational power and a limited amount of memory.
References

1. Chen, D., Odobez, J.-M.: Sequential Monte Carlo Video Text Segmentation. In: International Conference on Image Processing ICIP 2003, vol. 3, pp. 21–24. IEEE Press, New York (2003)
2. Eskicioglu, A., Fisher, P., Chen, S.: Image Quality Measures and Their Performance. IEEE Trans. Comm. 43(12), 2959–2965 (1995)
3. Kindratenko, V.: Development and Application of Image Analysis Techniques for Identification and Classification of Microscopic Particles. PhD thesis, Antwerp University (1997)
4. Luo, H., Eleftheriadis, A., Kouloheris, J.: Statistical Model-Based Video Segmentation and its Application to Very Low Bit-Rate Video Coding. Signal Processing: Image Communication 16(3), 333–352 (2000)
5. Quan, G., Chelappa, R.: Structure from Motion Using Sequential Monte Carlo Methods. Int. Journal of Computer Vision 59(1), 5–31 (2004)
6. Vermaak, J., Ikoma, N., Godsill, S.J.: Sequential Monte Carlo Framework for Extended Object Tracking. IEE Proc. Radar Sonar Navig. 152(5), 353–363 (2005)
7. Wang, Z., Bovik, A.: A Universal Image Quality Index. IEEE Signal Process. Letters 9(3), 81–84 (2002)
8. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image Quality Assessment: From Error Measurement to Structural Similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
9. Zhai, Y., Shah, M.: Video Scene Segmentation Using Markov Chain Monte Carlo. IEEE Trans. on Multimedia 8(4), 686–697 (2006)
Interactive Learning of Data Structures and Algorithmic Schemes Clara Segura, Isabel Pita, Rafael del Vado V´ırseda, Ana Isabel Saiz, and Pablo Soler Departamento de Sistemas Inform´ aticos y Computaci´ on Universidad Complutense de Madrid, Spain {csegura,ipandreu,rdelvado}@sip.ucm.es, {anussita,ileras}@gmail.com
Abstract. We present an interactive environment called Vedya for the visualization of data structures and algorithmic schemes which can be used as a very useful educational tool in Computer Science. The integration of Vedya and the Virtual Campus of the Complutense University of Madrid has allowed us to manage the whole administration of the individual students’ homework, including generating exercises, tests, grading delivered homework, and storing the achieved results. The part of the system concerning data structures has been evaluated during the last academic course 2006/07. By means of the Vedya tool, the students benefited from complementary and interactive material, facilitating the intuitive comprehension of most typical operations of classical data structures without any restriction of time or material.
1
Introduction
We present an interactive environment tool called Vedya for the visualization of data structures and algorithmic schemes. The pedagogical aim of Vedya is to facilitate the student's grasp of the core procedures taught in Computer Science by means of interactive learning, and to foster teamwork and communication between teachers and students. For this purpose, we have integrated Vedya in a motivating environment such as the Virtual Campus of the Complutense University of Madrid (https://www.ucm.es/info/uatd/CVUCM/index.php), facilitating the accessibility, understanding and visualization of the main data structures and algorithmic schemes. The combination of the Vedya tool and the Virtual Campus has allowed us to control the whole administration of the individual students' homework, including generating exercises, tests, grading delivered homework, and storing the achieved results. By means of the Vedya tool, the students have benefited from complementary and interactive material, facilitating the intuitive comprehension of the most typical operations of classical data structures without restrictions of time or material.
The authors have been partially supported by the Spanish National Projects MERITFORMS (TIN2005-09027-C03-03) and PROMESAS-CAM (S-0505/TIC/0407).
Moreover, the continuous utilization of the tool during the theoretical classes of the second four-month period has allowed us to reach one of the most useful educational aims in Computer Science in order to settle some of the academic deficiencies of this kind of subjects: to support the continuous, personal and interactive work across a virtual classroom. During the academic course 2006/07, Vedya has been freely accessible across the Virtual Campus. In this time, and within the framework of an Educational Innovation Project in Computer Science, we have evaluated the part of the tool dedicated to the study of the main data structures that provides interactive learning support to guide our students in their comprehension of a modern (imperative or declarative) programming language. At this moment, the tool has been widely used to illustrate in a graphical, visual and intuitive way the following well-known data structures [1]: linear data structures (stacks and queues), tree-like data structures (binary search trees, AVL trees, and heaps) and functional data structures (ordered and hash tables). Additionally, the flash animations incorporated in the tool have been used to illustrate other data structures like red-black trees, 2-3-4 trees and graphs; and to show how data structures are used to solve problems. Thanks to this effort, our students have assimilated, inside a motivating framework, one of the fundamental concepts of the subject, the difference between the formal description of the behavior of the data structures provided by the algebraic specification [2,3], and their implementation using a concrete programming language. The outline of the paper is the following. In Section 2 we describe Vedya from the point of view of the tool’s user. In Section 3 the implementation is explained. Section 4 provides the results obtained from the application of the tool in the last academic course. Finally, in Section 5 we conclude and outline future work.
2
The Vedya Tool: Description
Vedya is an interactive environment for learning data structures and algorithms. It covers the most common data structures: stacks, queues, binary search trees, AVL trees, priority queues, and sorted and hash tables. Moreover, it also provides other types of data structures, like one used to implement a doctor's office. Concerning the algorithmic schemes, it covers the most common resolution methods [4]: greedy, divide and conquer, dynamic programming, backtracking, and branch and bound. Lots of work has already been done on data structure and algorithm visualization. However, usually, tools are not complete, lack a common Graphical User Interface, or can only be executed on some operating systems. In [5] Chen and Sobh present a tool for data structure visualization and user-defined algorithm animation. The data structures available are arrays, stacks, queues, binary search trees, heaps and graphs. The most relevant improvement of this tool is the possibility of executing user-defined algorithms and visualizing the state of the data structures used by the algorithms.
Fig. 1. Main window for the data structures part of the Vedya tool
Nevertheless, Vedya is something more than a data structure execution tool. The main features that differentiate it from other interactive tools are:

(1) First of all, Vedya was created to supply the students with a tool that facilitates the study of both data structures and algorithms. Therefore, all the data structures and algorithmic methods taught in our courses are integrated in the same environment. The environment provides many facilities to make executions more user-friendly: Vedya can be executed using Sun's j2re 1.4.2.xx. It allows the execution of several data structures/algorithms and several sequences of operations on the same structure at the same time, making use of a multi-window and multi-frame system. It also allows saving sequences of operations in order to continue the execution later on. Operations in a sequence can be deleted and added.

(2) Vedya offers several learning possibilities. The main one is the interactive execution of data structures and algorithms, but it is also possible to create simulations that execute automatically, visualize tutorials, and solve tests within the same environment. It also integrates a set of animations that show how data structures are used to solve some problems.

(3) Concerning the data structures part, Vedya offers different views for the data structure behavior and the data structure implementations. Moreover, for most of the data structures both a static and a dynamic implementation are shown. Operations may be executed in any view, and the user can move from one view to another to see the changes at any moment.

(4) The following algorithms are implemented:
• divide and conquer: binary search and quicksort,
• dynamic programming: knapsack problem,
• greedy method: non-fractional knapsack problem, Dijkstra's algorithm,
• branch and bound: knapsack problem.
The students can visualize how different algorithmic schemes are applied to solve the same problem and under which conditions they can be used.

(5) The environment integrates documentation related to the algebraic specification, the implementation code and the operation cost of each data structure/algorithm.

Currently there exist two versions of the tool. The old version contains all the data structures and algorithmic schemes mentioned above, while the new one offers a subset of them in a more attractive visual environment. Figures in this paper correspond to the new version, which can be found at http://www.fdi.ucm.es/profesor/csegura/.
Tool Usage
When the application is started the user selects one area, data structures or algorithmic schemes, and chooses a particular one. Then the main window for the selected data structure or algorithmic scheme is opened. In all cases this window looks similar. The central panel is used to represent the data structure; on the left there is a list of the actions that can be executed. The allowed actions are highlighted while the non-allowed ones are disabled. The panel on the right shows the actions that have already been executed on the data structure. In Fig. 1 we show an example of the main window for binary search trees. The selected key type is string, but the system also allows char and string types. On this tree, the user may continue executing actions using the left hand operations panel, or use the simulation facilities of the menu to go up in the sequence of actions to see previous states, or restart the sequence from the beginning. The central panel offers some tree drawing facilities that allow the user to expand and contract the tree, as well as to move over the screen to see the hidden parts. Notice also that just above the central panel the result of the last action is shown. At the top of the screen there is a menu that facilitates managing the system. We can create a new data structure, open an existing one, or save the state of the one being edited. We can change the view, from behavior, which is the default, to implementation, either static or dynamic.

Fig. 2. Behavior and implementation views of queues

In Fig. 2 we can first see the behavior view of a queue. When we insert an item, a truck throws it on the top of the maze. Then the item falls down until it finds the end of the maze or a previous item. When we extract an item, the end of the maze opens and the first item falls down. The use of the maze illustrates that items cannot jump over the previous ones and the fact that in a queue items are extracted in the order in which they come in. Below the behavior view we have the representation of the static implementation based on a circular array and a dynamic implementation based on pointers. In the static implementation, the array is represented by a circle where items are inserted and extracted counterclockwise. This representation stresses the fact that there is no final element of the array and also that we
can always insert an item unless the array is completely full. In the dynamic implementation we use short arrows to represent the pointers between the items. Two long arrows point to the first item (written as primero in the figure) and to the last one (written as ultimo). The sequence of modifications is shown step by step so that students notice the order of the pointer updates needed in order not to lose any pointer. Using the menu we can also execute the operations on the data structure, use the simulation facilities and change the execution speed. In the documentation part the user can consult documentation about the data structure, such as the algebraic specification, the implementation code and the cost of each implementation, and finally switch to the associated Vedya-test tool, where the user can answer proposed tests about the selected data structure. The main window for the execution of algorithmic schemes looks similar. There are panels for drawing the execution of the algorithm, for introducing the input data and for showing the actions being executed. The simulation facilities are also available.

2.2
Data Types Animations
Vedya is complemented with a set of tutorials on data types and a set of algorithm animations showing the usage of a particular data type to solve a given problem. This set of tutorials and algorithm animations is developed in Flash, and can be accessed independently from Vedya's initial menu. We have tutorials for stacks, queues, binary search trees, red-black trees and priority queues. We have a tutorial about the heap-sort algorithm, an animation of the insertion in a 2-3-4 tree, and examples of:
– The use of stacks to evaluate an expression in postfix form, or to transform an infix expression to a postfix one.
– The use of queues to obtain the breadth-first tree traversal.
– The use of stacks, queues and double queues to check palindromes.
Finally, we have some animations on graphs: to obtain the minimum spanning tree using the Prim and Kruskal algorithms and to compute minimum paths using the Dijkstra algorithm [6].

2.3
The Vedya-test Tool
As we have said Vedya also offers facilities to solve tests about the data structures and algorithms that are being studied. The Vedya-test tool can be invoked from the Vedya tool, or it can be executed independently. The tool offers facilities to teachers that allow them to create/modify/delete questions in a database, and to create tests from the database of questions. The student visualizes the tests, solves them and obtains the solutions. Questions are grouped by subject on the database, but it is possible to mix questions about different data structures in the same test.
Fig. 3. Tool design
3
The Vedya Tool: Implementation
One of the main objectives when Vedya was implemented was that its extension with new data structures or algorithmic schemes should be as easy as possible. The main feature of the implementation is that both the data structures and the algorithms being represented are actually implemented in the tool, separately from their graphical representations. The design, shown in Fig. 3, is modularly divided into four blocks: interface, graphics, threads and implementations of the data structures and algorithms. This means that whenever we desire to change a graphical representation we do not need to change the data structure or the algorithm itself. The windows and frames act as a communication channel between the implementation and the graphics. The threads only communicate with the graphics. Inheritance is exploited in order to reuse as much code as possible. The different blocks communicate by means of "actions":
• When the user executes an operation in the interface over a data structure or executes an algorithm, the frame sends it to the implementation of the data structure/algorithm to be executed.
• As a result of such an execution, a vector of atomic actions to be graphically represented in an animated way is returned to the frame. For example, when inserting an element in a binary search tree, information about the path followed by the element being inserted is returned.
• The frame sends the vector of actions to the graphical module, which paints the necessary elements and sends to the threads block each action to be animated, depending on the kind of selected visualization (user, static implementation or dynamic implementation).
Fig. 4. Stack tracks design
Java threads are used in order to animate operations on data structures. Each operation divides into actions and the animation sequence of each action is divided into tracks where different kinds of movements are applied: circular, radial, horizontal, vertical or point to point. In Fig. 4 we show the circular tracks for the stack user representation, which consists of a dead-end pipe.
4 Interactive Learning
In order to obtain a detailed evaluation of the usage of Vedya, we have proposed several tests related to the behavior, implementation and application of the main data structures offered by the tool. We have also collected students' opinions on using Vedya in the Data Structures course in the second year and in the Programming Methodology and Technology course in the third year. The vast majority of our engineering and computer science students have taken an introductory programming course in the first academic year, typically in Pascal. Although learning the main algorithmic schemes and programming techniques is not a prerequisite for Data Structures, many students take Data Structures either prior to, or concurrently with, Programming Methodology and Technology. As a result, although a pseudocode programming language is the assumed language for Data Structures, many students have sufficient knowledge of C++ or Java through the integrated programming laboratories of parallel courses. Taking into account this profile, the skills and the background of our students, we have proposed 8 tests in the Virtual Campus of the Complutense University of Madrid. The number of engineering students registered in the Virtual Campus was just over 320, distributed in three groups (130 in group A, 59 in group B, and 131 in group C). Table 1 shows the number of students who answered each of the tests in the corresponding group.
Table 1. Students answering the tests

                Stacks 1  Stacks 2  Queues  Sequences  BST  AVL  RB  Heaps
Group A (130)      61        50       45       32       37   34   41    38
Group B (59)       26        23       23       19       18   17   17    18
Group C (131)      59        44       37       24       36   45   32    28
Total             147       118      105       75       91   96   90    76
We observe that, from the second test on, the number of students stabilizes at a level slightly below the number of students who regularly access the Virtual Campus. These numbers, though seemingly high, represent only between 23% (75 students of 320) and 37% (118 of 320) of registered students, which shows the high rate of students who give up on this subject from the beginning.
Table 2. Percentage of correct answers
           Stacks 1  Stacks 2  Queues  Sequences  BST    AVL    RB     Heaps
Group A     76.4%     82.5%    77.8%    65.6%     82.2%  84.9%   –     86.3%
Group B     78.9%     83.6%    85.0%    63.6%     86.2%  87.7%  90.9%  90.2%
Group C     76.2%     79.8%    73.5%    69.0%     83.5%   –     68.9%  86.8%
Table 2 shows the percentage of correct answers in the three groups. In general, it is high, which demonstrates the interest of the students who took part. In group B the percentage is slightly higher than in groups A and C, since 85% of the students of group B who decided to complete the tests in the Virtual Campus are not "new" students of this subject.
Table 3. Comparison of academic results with previous courses
               2002/03  2003/04  2004/05  2005/06  2006/07
Not attended    57.6%    45.3%    42.3%    64.7%    50.8%
Passed          15.3%    22.2%    20.2%    18.2%    30.1%
Failed          27.1%    32.5%    37.5%    17.1%    18.9%
Table 3 shows the percentage of students who did not attend the final exam, who passed, and who failed during the last five years. We observe that in the last course, in which we applied the Vedya tool, we reduced the percentage of students giving up the course by 14 percentage points with respect to the previous course, and at the same time increased the percentage of students who passed the exam by 12 points. The percentage of students who failed the exam increased by 2 points due to the rise in students attending the exam. Compared with earlier courses, the percentage of students who passed has increased by between 8 points (with respect to the course 2003/04) and 15 points (with respect to the course 2002/03).
5 Conclusions and Future Work
In this paper we have presented Vedya, a novel interactive tool for the visualization of data structures and algorithmic schemes, which can be used as an educational aid to help first-year engineering and computer science students learn Data Structures and Algorithms. The main benefit of this kind of software is that it facilitates the students' grasp of the target concepts and eases teamwork and communication between teachers and students. In this sense, the integration of the Vedya tool in the virtual classroom has allowed us to motivate the participation of the students, one of the most important goals from the educational viewpoint. Furthermore, the personalization and automation of the learning process has countered the lack of motivation for abstract subjects in Computer Science, because students find them 'useful'. The fact that the interface language is Spanish is, for our students, one additional reason to find the environment friendlier. Finally, a tool frequently turns out to be useful not only for the original purpose it was created for, but also for other subsidiary (though no less important) uses. In this sense, we believe that the development of this tool can help educators to use Vedya directly or to create other similar tools. As future work, we plan to develop alternative ways to integrate the Vedya tool in a Virtual Campus based on WebCT. We are interested in the application of the tool and the interactive learning methodology presented in this paper within different models of virtualization and e-learning in Computational Science: the development of digital repositories of Learning Objects about data structures and algorithmic schemes in the Vedya tool using IMS DRI and Moodle, the integration of the Vedya tool in an Intelligent Tutorial System, or the application of typical tools of the Web 2.0 philosophy.
References
1. Weiss, M.: Data Structures and Problem Solving Using Java. Addison-Wesley, Reading (1998)
2. Martí, N., Ortega, Y., Verdejo, A.: Estructuras de datos y métodos algorítmicos: ejercicios resueltos. Prentice-Hall, Englewood Cliffs (2003)
3. Peña Marí, R.: Diseño de programas. Formalismo y abstracción, 3rd edn. Prentice-Hall, Englewood Cliffs (2005)
4. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. The MIT Press/McGraw-Hill (2001)
5. Chen, T., Sobh, T.: A tool for data structure visualization and user-defined algorithm animation. In: Frontiers in Education Conference (2001)
6. Pita, I., Segura, C.: A tool for interactive learning of data structures and algorithms. In: 8th International Symposium on Computers in Education, SIIE 2006, vol. 1, pp. 141–148 (2006)
Prediction and Analysis of Weaning Results of Ventilator-Dependent Patients with an Artificial Neuromolecular System Jong-Chen Chen1, Shou-Wei Chien1, and Jinchyr Hsu2 1
Department of Information Management, National Yunlin University of Science and Technology, Touliu, Taiwan 2 Department of Internal Medicine, TaiChung Hospital, TaiChung, Taiwan [email protected], {jans.alex,jinchyr.hsu}@msa.hinet.net
Abstract. We have developed a vertical information processing model, motivated by physiological evidence, which integrates intra- and inter-neuronal information processing. Information processing at the intraneuronal level creates a repertoire of pattern processing neurons. Information processing at the interneuronal level groups appropriate pattern processing neurons to constitute an effective pattern processing system. The system was applied to a database of the weaning results of ventilator-dependent patients. A ventilator is used to support a patient's breathing, and weaning is the gradual process of removing it from ventilator-dependent patients. Experiments with the model show that the integrated system is able to learn to differentiate data in an autonomous manner, separating those patients who have successful weaning results from those who do not. Our parameter analysis shows that most of the parameters identified as significant by the system are the same as those identified by physicians, but some are not. Keywords: Weaning, Evolutionary Learning, Artificial Neural Networks.
1 Introduction
People are in great danger when they have difficulty in breathing, especially if they have chronic pulmonary disease or neuromuscular problems. Ventilators have been commonly used to support their breathing needs. Statistics show that most people in an intensive care unit (ICU) are in critical condition and around 40% of them need a ventilator to support their breathing [1], [18]. When patients are no longer in critical condition, they are transferred to a respiratory care center (RCC) for further care. Removing the ventilator from patients too early could put them in a dangerous situation. However, for those patients who can breathe naturally, placing them on a ventilator may be of little help. When to remove a ventilator from a patient is sometimes a tough question. Statistics show that the successful weaning rate is between 35% and 60% [16], [17]. The decision is usually based on the subjective assessment of physicians. The parameters that the physician takes into account are physiological factors, including maximal inspiratory pressure (Pimax) [10], [20], vital capacity
(VC) [20], rapid shallow breathing index (RSBI) [23], minute ventilation (VE) [10], [21], pH and PCO2 [2], [16], APACHE II score, and blood urea nitrogen (BUN). The above studies concern which parameters to consider when the physician tries to wean patients off the ventilator in the ICU. However, there are limited studies regarding weaning in the RCC, where one of the major tasks is to help patients progress from mechanical ventilation to spontaneous breathing as early as possible. At present, there are 27 parameters used to determine whether weaning is successful or not for ventilator-dependent patients in the RCC. These parameters are divided into four categories. The first category includes gender and age. The second category comprises physiological parameters: APACHE II score and coma scale at admission time, blood urea nitrogen (BUN), creatinine (Cr), albumin (Alb), hemoglobin (Hb), the ratio of respiratory frequency to tidal volume (RSBI), and coma scale before removing the ventilator. The third category of parameters is related to the diseases carried by patients. These include chronic obstructive pulmonary disease, cardiovascular disease, cerebral vascular disease, other internal factors, septic syndrome with multiple organ failure, respiratory tract disease, trauma, acute respiratory distress syndrome, brain surgery, and other surgeries. The fourth category of parameters concerns the treatment and complications occurring during the patient's admission time. These include the length of time staying in the ICU, the length of time using the ventilator, tracheostomy, respiratory tract infection, blood stream infection, urinary tract infection, and other infection. Successfully weaning patients off the ventilator requires careful assessment. It may be helpful to develop an intelligent system to assist physicians in making such a decision. In this study, we apply an artificial neuromolecular system (ANM), a biologically motivated model that integrates intra- and inter-neuronal information processing, to differentiate a database of ventilator-dependent patients. The system rests on two hypotheses. The first is that some neurons in the brain have significant intraneuronal information processing, which might directly or indirectly relate to their firing behavior [14], [13], [15]. Neurons of this type will be called cytoskeletal neurons or enzymatic neurons [4], [6]. They combine, or integrate, input signals in space and time to yield temporally patterned output signals to control other neurons. The second hypothesis is that some neurons in the brain serve as pointers to other neurons in a way that allows for memory manipulation. Neurons of this type are called reference neurons [6-7]. Reference neurons are used to select appropriate subsets of cytoskeletal neurons, which then control the manner in which input patterns are transduced to output patterns.
2 The Architecture
The ANM system as currently implemented comprises eight competing subnets, each consisting of 30 cytoskeletal neurons. Cytoskeletal neurons are manipulated by two levels of reference neurons. Low-level reference neurons select comparable cytoskeletal neurons in each subnet (i.e., neurons that have similar cytoskeletal structures). High-level reference neurons select different combinations of the low-level reference neurons. Fig. 1 provides a simplified picture (only two of the competing subnets are shown, each consisting of only four cytoskeletal neurons). In Fig. 1, the intraneuronal
Fig. 1. Connections between reference and cytoskeletal neuron layers. When Ra fires, it will fire r1 and r4. Similarly, the firing of Rb will cause r3 and r4 to fire, which in turn fires E3 and E4 in each subnet. (Ei stands for cytoskeletal neuron i.).
structures of E1, E2, E3, and E4 in subnet 1 are similar to those of E1, E2, E3, and E4 in subnet 2, respectively.
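To make the two-level selection concrete, the following minimal Python sketch (our own illustration, not code from the ANM system) treats reference neurons as pointers that select the same subset of cytoskeletal neurons in every subnet, as in Fig. 1:

```python
# Hypothetical sketch of the reference-neuron hierarchy of Fig. 1.
# Low-level reference neurons select "comparable" cytoskeletal neurons (same index
# in every subnet); high-level reference neurons fire combinations of low-level ones.
NUM_SUBNETS = 2
low_level = {"r1": 0, "r2": 1, "r3": 2, "r4": 3}   # r_i -> index of cytoskeletal neuron E_i
high_level = {"Ra": ["r1", "r4"], "Rb": ["r3", "r4"]}

def fire(high_ref: str):
    """Return, per subnet, the cytoskeletal neurons activated by a high-level reference neuron."""
    selected = sorted({low_level[r] for r in high_level[high_ref]})
    return {f"subnet {s + 1}": [f"E{i + 1}" for i in selected]
            for s in range(NUM_SUBNETS)}

print(fire("Rb"))   # {'subnet 1': ['E3', 'E4'], 'subnet 2': ['E3', 'E4']}
```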
3 Pattern Processing Neurons
Cytoskeletal neurons are the major pattern processing neurons in the ANM system. The information processing sketched in cytoskeletal neurons is motivated by physiological evidence that the internal dynamics of a neuron control its firing behavior [13], [14], [15]. Our hypothesis is that the cytoskeleton plays the role of signal integration; that is, it is capable of integrating signals in space and time to yield spatiotemporal output signals. The dynamics of cytoskeletal neurons are simulated with 2-D cellular automata [22]. Our implementation of the cytoskeletal neuron (Fig. 2) captures this signal integration capability (to be described below). When an external signal impinges on the membrane of a cytoskeletal neuron, a readin enzyme residing at the same site is activated. The activated readin enzyme then activates the cytoskeletal component at the same site, which in turn activates its neighboring component of the same type, and so on. Thus, the activation of a readin enzyme activates a chain of neighboring components of the same type (i.e., initiates a unidirectional signal flow). As to a neighboring component of a different type, an activated component will affect its state when there is an MAP (microtubule associated protein) linking them together. The interactions between two different types of neighboring components are assumed to be asymmetric. For example, in Fig. 3a, the activation of the readin enzyme at location (2,2) will trigger a cytoskeletal signal flow along the C2 components of the second column, starting from location (2,2) and running to location (8,2). When the signal arrives at location (8,2), the C2 component at that site will be activated, which in turn stimulates its neighboring C1 component at location (8,1) to a more excited state (but this is not sufficient to activate it). A counter example (Fig. 3b) is that the activation of the readin enzyme at location (4,1) will trigger a cytoskeletal signal flow along the C1 components of the first column, starting from location (4,1) and running to
location (8,1). The activation of the C1 component at location (8,1) will activate its neighboring C2 component at location (8,2). The activation of the latter will in turn activate the C2 component at location (7,2), its next neighboring component at location (6,2), and so on. Thus, it will trigger a signal flow on the second column, starting from location (8,2) and running to location (2,2).
Fig. 2. Cytoskeletal neurons. Each grid location has at most one of three types of components: C1, C2, or C3. Some sites may not have any component at all. Readin enzymes could reside at the same site as any one of the above components. Readout enzymes are only allowed to reside at the site of a C1 component. Each site has eight neighboring sites. The neighbors of an edge site are determined in a wrap-around fashion. Two neighboring components of different types may be linked by an MAP (microtubule associated protein).
We have described the feature that different types of components interact with each other in an asymmetric manner. The other feature is that different types of components transmit signals at different speeds. The summary of the above two features is that C1 components transmit signals at the slowest speed, but with the highest activating value, and that C3 components transmit signals at the fastest speed, but with the lowest activating value. The activation value of C2 components and their transmitting speed are intermediate between those of C1 and C3 components. When the spatiotemporal combination of cytoskeletal signals arriving at a site of readout enzyme is suitable, the readout will be activated and then the neuron will fire. Fig. 4 shows that there are three possible cytoskeletal signal flows, initiated by external signals, to activate the readout enzyme at location (8,3). One is a signal flow on the second column, the other on the third column, and another on the fourth column. Any two of the above three signal flows might activate the readout enzyme at location (8,3), which in turn will cause the neuron to fire. Nevertheless, the neuron might fire at different times in response to different combinations of signal flows along these fibers. One is that different types of components transmit signals at
different speeds. The other is that signals may be initiated by different readin enzymes. For example, the signal flow on the second column may be initiated either by the readin enzyme at location (2,2) or by the readin enzyme at location (3,2). Similarly, the signal flow on the fourth column may be initiated either by the readin enzyme at location (1,4) or by the enzyme at location (2,4). A signal initiated by a different enzyme will integrate with another signal at a different time. All of these factors affect the temporal firing behavior of a neuron.
Fig. 3. Interaction between different types of components via an MAP. (a) An external signal will trigger a signal flow on the second column, starting from location (2,2) and running to location (8,2). When this signal arrives at location (8,2), it will shift its neighboring C1 component at location (8,1) to a more excited state (i.e., a state that is much easier to activate later). (b) An external signal will trigger a signal flow on the first column, starting from location (4,1) and running to location (8,1). The activation of the C1 component at location (8,1) will in turn activate the C2 component at location (8,2) via the MAP, which in turn will trigger a signal flow on the second column, starting from location (8,2) and running to location (2,2).
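As an illustration of the temporal integration just described, the following minimal Python sketch (our own toy model, not the ANM implementation) computes when a readout enzyme would fire under the "any two of three signal flows" rule of Fig. 4, given per-component-type signal speeds and the distance from each readin enzyme to the readout site. The numeric speeds are assumptions; the paper only states the ordering C1 (slowest) < C2 < C3 (fastest).

```python
# Hypothetical per-type signal speeds (grid cells per time step).
SPEED = {"C1": 1.0, "C2": 2.0, "C3": 3.0}

def arrival_time(component_type: str, distance_cells: int) -> float:
    """Time for a signal to travel `distance_cells` along a fiber of one component type."""
    return distance_cells / SPEED[component_type]

def firing_time(flows, required=2):
    """
    flows: list of (component_type, distance_to_readout) for each initiated signal flow.
    The readout enzyme is activated once `required` flows have arrived; the neuron
    fires at that moment (returns None if too few flows were initiated).
    """
    arrivals = sorted(arrival_time(t, d) for t, d in flows)
    return arrivals[required - 1] if len(arrivals) >= required else None

# Signals started at readin enzymes on the 2nd, 3rd and 4th columns (cf. Fig. 4):
print(firing_time([("C2", 6), ("C1", 5), ("C3", 7)]))   # firing time = when the 2nd flow arrives
```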
4 Evolutionary Learning
Six levels of evolutionary variation are possible in the system: at the level of readin enzymes (initiating signal flows), at the level of readout enzymes (responding to signal flows), at the level of MAPs (modulating signal flows), at the level of cytoskeletal components (transmitting signal flows), at the level of connections between receptor neurons and cytoskeletal neurons, and at the level of reference neurons. In the current implementation we allow variation-selection operators to act on only one level at a time. One level, or aspect, is open to evolution for sixteen cycles; during this time all the other levels are held constant. The levels of evolutionary change are turned on in the sequence: reference neurons, readin enzymes, reference neurons, connections between receptor neurons and cytoskeletal neurons, reference neurons, cytoskeletal components, reference neurons, readout enzymes, reference neurons, and MAPs.
Fig. 4. Different combinations of cytoskeletal signals that fire the neuron. The figure demonstrates that the readout enzyme at location (8,3) might be activated by any two of the signal flows along the C1, C2, and C3 components.
Evolutionary learning at the cytoskeletal neuron level has three steps:
1. Each subnet is activated in turn to evaluate its performance.
2. The pattern of readout enzymes, readin enzymes, MAPs, connectivities, and other components of the best-performing subnets is copied to lesser-performing subnets, depending on which level of evolution is operative.
3. The pattern of readout enzymes, readin enzymes, MAPs, connectivities, and other components of the lesser-performing subnets is slightly varied.
Evolutionary learning at the reference neuron level has three steps:
1. The cytoskeletal neurons controlled by each reference neuron are activated to evaluate their performance.
2. The pattern of neural activities controlled by the best-performing reference neurons is copied to lesser-performing reference neurons.
3. Lesser-performing reference neurons control a slight variation of the neural grouping controlled by the best-performing reference neurons.
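The copy-and-vary cycle above can be condensed into a few lines. The sketch below is a schematic Python rendering of the subnet-level steps, with the fitness function `evaluate` and the level-specific `mutate` operator left abstract; it is an illustration under those assumptions, not the ANM implementation. The reference-neuron level follows the same template, with subnets replaced by the neuron groupings each reference neuron controls.

```python
import copy
import random

def evolve_subnets(subnets, evaluate, mutate, n_best=2):
    """
    One variation-selection cycle at the cytoskeletal-neuron level:
    1) evaluate every subnet, 2) copy the best-performing subnets over the
    lesser-performing ones, 3) slightly vary the copies.
    `evaluate(subnet)` returns a fitness score; `mutate(subnet)` perturbs one
    evolutionary level (readin/readout enzymes, MAPs, components, ...) in place.
    """
    ranked = sorted(subnets, key=evaluate, reverse=True)
    best, rest = ranked[:n_best], ranked[n_best:]
    new_rest = []
    for _ in rest:
        clone = copy.deepcopy(random.choice(best))   # step 2: copy a winner's pattern
        mutate(clone)                                # step 3: slight variation
        new_rest.append(clone)
    return best + new_rest
```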
5 Application Domains
In this study, the ANM system was employed to differentiate a clinical database of ventilator-dependent patients. In total, there were 189 records, of which 84 patients had been successfully weaned from their ventilators while the remaining 105 had failed. The following procedure describes how the ANM system was linked with the database. Note that in this model the cytoskeletal neurons serve as the major components responsible for information processing. We first explain how the connections between the 27 parameters and the cytoskeletal neurons are set up. Then we describe how each parameter value is mapped to an external signal for the cytoskeletal neurons. Finally, we explain how the system's performance is evaluated.
We note that the connections between the parameter layer and the cytoskeletal neuron layer were only partial. That is, each of these neurons was responsible for processing only a small subset of the stimuli generated from the 27 parameters (through evolutionary learning). However, it should be noted that all cytoskeletal neurons connected to a parameter would receive the same pattern of stimuli. The initial connections between the 27 parameters and the cytoskeletal neurons were randomly decided, but subject to change as learning continued. Through evolutionary learning, each cytoskeletal neuron would be trained to be a specific input-output pattern transducer. Each parameter was encoded with a 5-bit pattern; in total, 135 bits were required to encode all 27 parameters. For each parameter, the minimal and maximal values over the 189 records were determined (denoted by MIN and MAX, respectively). The difference between these two values was equally divided into 5 increments of size INCR. The transformation of each actual parameter value (denoted by ACTUAL) into the corresponding 5-bit pattern is
\[
\text{pattern} =
\begin{cases}
00001, & \text{if } \mathrm{MIN} \le \mathrm{ACTUAL} < \mathrm{MIN} + \mathrm{INCR}\\
00010, & \text{if } \mathrm{MIN} + \mathrm{INCR} \le \mathrm{ACTUAL} < \mathrm{MIN} + 2\,\mathrm{INCR}\\
00100, & \text{if } \mathrm{MIN} + 2\,\mathrm{INCR} \le \mathrm{ACTUAL} < \mathrm{MIN} + 3\,\mathrm{INCR}\\
01000, & \text{if } \mathrm{MIN} + 3\,\mathrm{INCR} \le \mathrm{ACTUAL} < \mathrm{MIN} + 4\,\mathrm{INCR}\\
10000, & \text{if } \mathrm{MIN} + 4\,\mathrm{INCR} \le \mathrm{ACTUAL} \le \mathrm{MAX}
\end{cases}
\]
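For concreteness, the 5-bit one-hot encoding above can be written as a small function. This is an illustrative Python rendering of the transformation (the example values for RSBI are made up), not code from the ANM system:

```python
def encode_5bit(actual: float, min_val: float, max_val: float) -> str:
    """Map a parameter value to the 5-bit one-hot pattern defined above."""
    if not (min_val <= actual <= max_val):
        raise ValueError("value outside the observed [MIN, MAX] range")
    incr = (max_val - min_val) / 5.0
    # Bin index 0..4; the top edge (ACTUAL == MAX) falls into the last bin.
    bin_index = min(int((actual - min_val) / incr), 4) if incr > 0 else 0
    return format(1 << bin_index, "05b")

# e.g. a record's RSBI value of 72 with MIN = 20, MAX = 120 falls in the third bin:
print(encode_5bit(72, 20, 120))   # '00100'
```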
Each bit with value '1' represented a specific set of stimuli to a cytoskeletal neuron. When a readin enzyme received an external stimulus, a cytoskeletal signal was initiated. Which readin enzymes of a neuron would receive the stimuli from a parameter was randomly decided in the beginning and subject to change during the course of learning. For each record, all stimuli were sent to the cytoskeletal neurons simultaneously; this means that all cytoskeletal signals were initiated at the same time. The cytoskeleton integrated these signals in space and time. For each input record, the class of the first firing cytoskeletal neuron was assigned as its output (to be described below). Cytoskeletal neurons were equally divided into two classes, corresponding to the two different groups of records: one class of neurons represented the group of patients with successful weaning, while the other represented those with unsuccessful weaning. For each record (input pattern), the system was defined to make a correct response when the class of the first firing neuron was in accordance with the group shown in the database. The ANM system was tested with each of the 189 records in sequence. The greater the number of correct responses made by the system, the higher its fitness.
6 Experimental Results
Two experiments were performed. The first applied the system to the RCC database. For comparison, we also tested the database with two other machine learning methods (a back-propagation neural network and a support vector machine) and one decision-tree-based toolkit (the Waikato Environment for Knowledge Analysis, Weka). In the second experiment, we examined the effectiveness of each parameter in determining weaning results.
6.1 Data Differentiation
The 189 records were divided into two sets: training and testing. The training set consisted of 120 records, about two thirds of the database. The testing set consisted of 69 records, about one third of the database. Ten runs were performed. For each run, 120 of the 189 records were selected at random as the training set, and the remaining 69 records formed the test set. The experimental results showed that the number of records recognized by the ANM system increased continuously during the course of learning; the average differentiation rate was 91%. The system was then tested after substantial learning. The average differentiation rate over the ten test sets was 71.0%, implying that it possessed a certain degree of differentiation capability. For comparison, we first tested the database with the Weka tool, a collection of data mining algorithms, using tree-based classification. The best results obtained with this tool were collected; the average differentiation rate of 10 runs on the testing set was 63.2%. We then tested it with the support vector machine; the average differentiation rate on the testing set was 69.9%. Lastly, we applied the back-propagation neural network to the database. Several combinations of the numbers of hidden layers (and nodes) and transfer functions were tested, and the ten best results were collected. The average differentiation rate on the testing set was 70.9%.
6.2 Parameter Analysis
In this experiment, we investigated the effectiveness of each parameter in determining the weaning results of ventilator-dependent patients. We noted that the system learned to treat a particular parameter as significant when it could always be used to give a correct response to all training records. In other words, altering the values of this parameter might lead to a completely different result. By contrast, the system tended to ignore insignificant parameters: for an insignificant parameter, any alteration of its values had no effect on the response. To generate a test set, we copied the training set and varied a specific parameter of each of the training patterns. The variation was made by setting the parameter at a specific value, from 0.1 to 1 (with an increment of 0.1). Thus, eleven testing sets were generated for each of the 27 parameters; in total, there were 297 testing sets. For each testing set, we set one parameter at a specific value but kept the values of the other 26 parameters unchanged. For example, the first testing set was identical to the training set except that the value of the first parameter of each record was set to 0.1. The second testing set was generated in a similar manner, but the value of the first parameter was set to 0.2. The system after substantial learning was employed. The experimental results showed that there were almost no changes in the system's outputs when the values of these 7 parameters (Cr, respiratory tract disease, cardiovascular disease, chronic obstructive pulmonary disease, septic syndrome with multiple organ failure, other surgeries, and the length of time staying in the ICU) were altered. That is, altering the values of these parameters had no significant effect on the system's outputs. This implied that there was essentially no direct relationship between these parameters and
patients' weaning results. By contrast, the outputs were quite different when the following 10 parameters (age, Alb, BUN, Hb, cerebral vascular disease, coma scale before removing the ventilator, RSBI, respiratory tract infection, urinary tract infection, other infection) were set at different values, illustrating that they played a vital role in affecting patients' weaning results. Among these 10 parameters, age and coma scale before removing the ventilator are the two most important. For the remaining 10 parameters, the results showed that the changes in the system's outputs fell within a certain range, suggesting that they were important to some extent. We were particularly interested in those parameters identified as very important by one side but as insignificant by the other. Our experimental results showed that four parameters were identified by the physicians as very significant but by the ANM system as almost insignificant: respiratory tract disease, cardiovascular disease, chronic obstructive pulmonary disease, and septic syndrome with multiple organ failure. All of these parameters are related to the diseases carried by the patients. Even though the result was not quite the same as that of the physicians, in some sense it provides another dimension of information to physicians: it implies that some of the parameters that physicians consider important might not be that critical.
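The parameter analysis of Sect. 6.2 amounts to a simple sensitivity scan. The sketch below is a schematic Python rendering of that procedure; `classify` stands for the trained system's record-level output and is assumed here, and we use eleven evenly spaced clamp levels on [0, 1] since the paper's grid (0.1 to 1 in steps of 0.1, yet eleven sets) is ambiguous about whether 0 is included.

```python
import numpy as np

def parameter_sensitivity(train_X, classify, n_levels=11):
    """
    For each (normalized) parameter, clamp it to fixed values 0.0, 0.1, ..., 1.0
    in every training record and count how often the system's output changes.
    Parameters whose clamping never changes the output are treated as insignificant.
    """
    baseline = np.array([classify(x) for x in train_X])
    n_records, n_params = train_X.shape
    changes = np.zeros(n_params)
    for p in range(n_params):
        for level in np.linspace(0.0, 1.0, n_levels):
            perturbed = train_X.copy()
            perturbed[:, p] = level                      # one synthetic test set
            outputs = np.array([classify(x) for x in perturbed])
            changes[p] += np.sum(outputs != baseline)
    return changes / (n_records * n_levels)              # fraction of flipped responses
```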
7 Conclusions
The experimental results with the RCC database are consistent with our previous results showing that the system has a high differentiation capability. Finally, we note that a parameter is significant if it can be used to differentiate the data effectively, and redundant if it cannot. The significance of each parameter depends on the structure of a data set; different sets possess different significant parameters. The parameter analysis result has some implications for clinical studies. It provides physicians with information about the effectiveness of each parameter in determining weaning for patients. Our experimental results show that some of these significant parameters are the same as those recognized by physicians whereas some are not. The finding of significant parameters not previously recognized by physicians may provide them with another dimension of information, which in turn may open up the possibility of exploring some unknown phenomena.
References
1. Adams, A.B., et al.: Survey of Long-Term Ventilator Support in Minnesota: From 1986 to 1992. Chest 103, 463–469 (1993)
2. Bouachour, G., et al.: Gastric Intramural pH: An Indicator of Weaning from Mechanical Ventilation in Patients. Eur. Respir. J. 9, 1868–1873 (1996)
3. Bremermann, H.J.: Optimization through Evolution and Recombination. In: Yovits, Jacobi, Goldstein (eds.) Self-Organizing Systems, pp. 93–106. Spartan Books, Washington, D.C. (1962)
4. Conrad, M.: Molecular Information Processing in the Central Nervous System, Parts I and II. In: Conrad, M., Güttinger, W., Dal Cin, M. (eds.) Physics and Mathematics of the Nervous System, pp. 82–127. Springer, Heidelberg (1974a)
5. Conrad, M.: Evolutionary Learning Circuits. J. Theor. Biol. 46, 167–188 (1974b)
6. Conrad, M.: Molecular Information Structures in the Brain. J. Neurosci. Res. 2, 233–254 (1976a)
7. Conrad, M.: Complementary Molecular Models of Learning and Memory. BioSystems 8, 119–138 (1976b)
8. Conrad, M.: Principle of Superposition-Free Memory. J. Theor. Biol. 67, 213–219 (1977)
9. Conrad, M., Kampfner, R.R., Kirby, K.G., Rizki, E.N., Schleis, G., Smalz, R., Trenary, R.: Towards an Artificial Brain. BioSystems 23, 175–218 (1989)
10. Feeley, T., Hedley-Whyte, J.: Weaning from Controlled Ventilation and Supplemental Oxygen. N. Engl. J. Med. 292, 303–306 (1975)
11. Fogel, D.: Evolutionary Computation: Towards a New Philosophy of Machine Intelligence. IEEE Press, Piscataway (1995)
12. Fogel, L., Owens, A., Walsh, M.: Artificial Intelligence through Simulated Evolution. Wiley, New York (1966)
13. Hameroff, S.R.: Ultimate Computing. North-Holland, Amsterdam (1987)
14. Liberman, E.A., Minina, S.V., Shklovsky-Kordy, N.E., Conrad, M.: Microinjection of Cyclic Nucleotides Provides Evidence for a Diffusional Mechanism of Intraneuronal Control. BioSystems 15, 127–132 (1982)
15. Matsumoto, G., Tsukita, S., Arai, T.: Organization of the Axonal Cytoskeleton: Differentiation of the Microtubule and Actin Filament Arrays. In: Warner, F.D., McIntosh, J.R. (eds.) Cell Movement: Kinesin, Dynein, and Microtubule Dynamics, pp. 335–356. Alan R. Liss, New York (1989)
16. Modawal, A., et al.: Weaning Success among Ventilator-Dependent Patients in a Rehabilitation Facility. Arch. Phys. Med. Rehabil. 83, 154–157 (2002)
17. Nava, S., et al.: Survival and Prediction of Successful Ventilator Weaning in COPD Patients Requiring Mechanical Ventilation for more than 21 Days. Eur. Respir. J. 7, 1645–1652 (1994)
18. Robinson, R.: Ventilator Dependency in the United Kingdom. Archives of Disease in Childhood 65, 1235–1236 (1990)
19. Rosen, R.: Dynamical System Theory in Biology. Wiley, New York (1970)
20. Sahn, S.A., Lakshminarayan, S.: Bedside Criteria for Discontinuation of Mechanical Ventilation. Chest 63, 1002–1005 (1973)
21. Stetson, J.B.: Introductory Essay. Int. Anesthesiology Clinics 8(4), 767–779 (1970)
22. Wolfram, S.: Cellular Automata as Models of Complexity. Nature 311, 419–424 (1984)
23. Yang, K.L., Tobin, M.J.: A Prospective Study of Indexes Predicting the Outcome of Trials of Weaning from Mechanical Ventilation. N. Engl. J. Med. 324(21), 1445–1450 (1991)
Licence Plate Character Recognition Using Artificial Immune Technique Rentian Huang, Hissam Tawfik, and Atulya Nagar Intelligent and Distributed Systems Lab, Deanery of Business and Computer Sciences, Liverpool Hope University, Liverpool, United Kingdom L16 9JD {10076507,TAWFIKH,NAGARA}@Hope.ac.uk
Abstract. This paper proposes the application of an Artificial Immune Technique to Licence Plate Character Recognition (LPCR). The use of the Clonal Selection Algorithm (CSA) is composed of two main stages: (1) dynamic training of samples; and (2) a choice of the best antibodies based on the three main clonal operations of cloning, clonal mutation and clonal selection. Once memory cells are established, the classification results are output using a fuzzy K-Nearest Neighbour (KNN) approach. The performance of the CSA is compared to that of Back Propagation Neural Networks (BPNN) on an LPCR problem. The experimental results show that the Artificial Immune Technique performs favourably, being more accurate and robust. Keywords: Artificial Immune System (AIS), Clonal Selection Algorithm (CSA), Licence Plate Recognition (LPR).
1 Introduction
A Licence Plate Recognition (LPR) system combines image processing and character recognition technology to identify vehicles by automatically reading their number plates. A typical LPR process consists of three stages: 1) licence plate location, 2) character segmentation and 3) character recognition. LPR is a particularly useful and practical vehicle identification technology, as it assumes no means of vehicle identification beyond the existing and legally required number plate. Furthermore, when the data gathered by an LPR system are stored and organized within a database, more complex information-driven tasks may potentially be performed, such as vehicle travel time calculations as well as border controls. However, in practice it is a very difficult task due to the variety of environmental conditions. LPR is usually conducted under certain restrictive conditions such as indoor scenes, stationary backgrounds, fixed illumination, prescribed driveways, limited vehicle speed, and a designated range of distances between camera and vehicle [1]. Despite these current limitations, LPR finds applications in private parking management, traffic monitoring, automatic toll payment, surveillance and security enforcement [2]. Numerous algorithms have previously been exploited, such as Hidden Markov Models (HMM) [3], Artificial Neural Networks (ANN) [4], Hausdorff Distance [5],
Support Vector Machine (SVM)-based character recognizers [6] and template matching [7], all of which leave room for improvement. The focus of this paper is to investigate a character recognition technique using the Artificial Immune System (AIS)-based CSA. A number of adjustments are made to the basic implementation of the CSA in order to improve performance, especially a new dynamic training scheme to establish the immune memory (a collection of antibodies) for classification. Additionally, neural-network results are presented for comparison. The experimental results show that the CSA has a better performance in terms of successful classification of licence plate characters; it proved more accurate and robust than the neural network. The rest of this paper is organized as follows: Section 2 presents the LPR architecture, including a review of relevant techniques used for tackling character recognition in LPR. Section 3 introduces our CSA and the features added to it for character recognition. Section 4 provides the experimental details and compares the performance of the CSA in character recognition. Section 5 gives the conclusion and proposes future work.
2 Car Plate Recognition
Car plate recognition algorithms reported in the literature are generally composed of three main steps: 1) locating licence plates, 2) segmenting licence numbers and 3) identifying the characters. Fig. 1 illustrates our proposed LPR process. For locating the licence plate, a colour edge detector is developed to detect the type of edges contained within the licence plate. While multiple licence plate candidates are normally detected, size and shape filtering is used to remove objects that do not satisfy specific conditions. The target regions selected are those that can serve as possible licence plate boundaries; to identify them, the area-to-perimeter ratio of each candidate area is compared with the standard ratio of a number plate. Once a licence plate candidate has been extracted from the image, the licence number segmentation preprocessing component performs three tasks: grey-level transform, median filtering and binarisation. A vertical projection is then performed to segment the characters, with each character image normalized to a size of 16x16 after segmentation. Following character segmentation from the plate region, a method needs to be selected for character recognition, which is the main subject of this work.
Fig. 1. Diagram of our LPR process
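As a rough illustration of the segmentation preprocessing just described (grey-level transform, median filtering, binarisation, vertical projection, and 16x16 normalization), the following Python sketch uses OpenCV and NumPy. It is an approximation with assumed thresholds, not the code of the reported system:

```python
import cv2
import numpy as np

def segment_characters(plate_bgr, min_col_sum=2, size=(16, 16)):
    """Return normalized 16x16 binary character images cut from a plate image."""
    gray = cv2.cvtColor(plate_bgr, cv2.COLOR_BGR2GRAY)           # grey-level transform
    gray = cv2.medianBlur(gray, 3)                                # median filtering
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)  # binarisation
    col_sum = (binary > 0).sum(axis=0)                            # vertical projection
    chars, start = [], None
    for x, s in enumerate(np.append(col_sum, 0)):                 # walk the projection
        if s >= min_col_sum and start is None:
            start = x                                             # a character begins
        elif s < min_col_sum and start is not None:
            chars.append(cv2.resize(binary[:, start:x], size))    # normalize to 16x16
            start = None
    return chars
```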
A large number of character recognition techniques have been reported. HMM-based recognition begins with preprocessing and parameterization of the regions of interest detected in the previous phase. Researchers report that the width of the plate in the image after rescaling lies between 25% and 75% for image widths ranging from 200 to 600 pixels. This reveals the necessity for good character analysis when implementing HMMs, which places a restriction on the effective distance of the recognition system [3]. Various types of ANN have been used for licence plate character identification, such as the work done by Broumandnia and Fathy [8]. A self-organized neural network based on Kohonen's Self-Organized Feature Maps (SOFMs) was implemented to tolerate noisy, deformed, broken, or incomplete characters acquired from licence plates which were bent or tilted with respect to the camera [9]. Probabilistic Neural Networks (PNNs) for LPR were also introduced by Anagnostopoulos et al. [10]. The Hausdorff distance is a method for LPCR that compares two binary images. Its main problem is the computational burden; its recognition rate is very similar to that obtained with neural-network classifiers, but it is slower [5]. Kim et al. designed a system implementing four SVMs and report an impressive average character recognition rate. The architecture, however, is strictly designed for Korean plates [6]. A suitable technique for the recognition of single-font, fixed-size characters is template matching, where the recognition process is based on the computation of normalized cross-correlation values for all shifts of each character template over the sub-image containing the licence plate [7]. The LPR problem continues to be a challenge for artificial intelligence solutions, and novel approaches are therefore needed to improve the performance and efficiency of LPCR algorithms. In this work, an AIS character recognition technique based on the Clonal Selection Algorithm is presented for solving the LPR problem.
3 Immune Techniques for Character Recognition
Artificial Immune Systems form a rapidly emerging technique for developing mechanisms for learning, prediction, memory and adaptation. An AIS mimics the biological immune system, which offers powerful and robust information processing capabilities for solving complex problems [11]. The immune system is a biological pattern recognition and classification system which learns to distinguish self from non-self. The immune system's behaviour is an emergent property of the entire population of diverse agents, and it improves performance by weeding out the weakest players, replacing them with agents as different as possible. The immune system is computationally one of the least understood biological paradigms but has drawn significant attention. AIS have started to be used in many application domains, including computer security, optimization, robotics, data mining, fault detection, anomaly detection, and pattern recognition [12].
3.1 Clonal Selection
Clonal Selection Theory, the famous theory in immunology, was put forward by Burnet in 1978 [13]. Its main idea lies in the fact that the antigen can selectively react
to the antibodies, which are natively produced and spread on the cell surface in the form of peptides. When cells are exposed to an antigen, the antigen stimulates an immune cell with appropriate receptors to proliferate (divide) and mature into terminal plasma cells. The process of cell division generates a clone, i.e., a set of cells that are the progenies of the single cell. In addition to proliferating and maturing into plasma cells, the immune cells can differentiate into long-lived memory cells. Memory cells circulate through the blood, lymph and tissues, and when exposed to a second antigenic stimulus they commence to differentiate into large immune cells (lymphocyte) capable of producing high affinity antibody preselected for the specific antigen that once stimulated the primary response. Fig. 2 depicts the clonal selection principle.
Fig. 2. Clonal Selection Principle
Clonal selection is a dynamic process of the immune system stimulated by the self-adapting antigen. Some biological features such as learning, memory and antibody diversity can be used in artificial immune systems to solve complex problems. De Castro and Von Zuben proposed the first clonal selection algorithm, called CLONALG, and suggested that it could be used for pattern recognition. They generated random antibodies to be used as the target patterns in the CSA, using a set of 12 x 10 binary images as the target patterns [14]. Garain et al. proposed a CSA for a 2-class problem to classify pairs of similar character patterns and claimed promising results. Setting aside the classification power, data reduction has been another capability of the CSA [15]. Garain et al. further explored the potential of the CSA in pattern recognition by applying it to a 10-class classification problem. An empirical study with two datasets shows that the CSA has very good generalization ability, with experimental results reporting an average recognition accuracy of about 96% [16].
3.2 Clonal Selection for LPCR
The proposed clonal selection algorithm for character recognition is composed of two main processes: firstly, selecting samples and training them using the CSA; secondly, using the fuzzy KNN approach to output the classification results for LPR. The essentials of clonal selection are established as follows. Antigens are images
stored in a matrix and represent the licence plate characters of the system. Antibodies are the candidates that go through the clonal process and try to capture and represent the common features of antigens. The affinity between antibody and antigen reflects the total binding strength between them. For the classification problem, Hamming distance (HD) and similarity functions are used to measure the affinity between antigen and antibody. The Hamming distance rule is presented below:
\[
\mathrm{difference} = \sum_{j=1}^{n} \sum_{i=1}^{len} \bigl( Ab_i \oplus Ag_{ij} \bigr),
\qquad \mathrm{Affinity} = -\,\mathrm{difference}
\tag{1}
\]
where Ab_i is the i-th bit in the antibody Ab, Ag_{ij} is the i-th bit in the j-th antigen pattern example Ag_j, n is the number of examples for a particular pattern class, len is the total length of an antibody, and ⊕ represents the exclusive-or (XOR) operator. Another formula used to measure the similarity (affinity) of the antigen-to-antibody interaction is given in Eqn. (2) below:
\[
S(Ag_1, Ag_2) = \frac{1}{2} \;-\; \frac{S_{10} S_{01} - S_{00} S_{11}}
{2 \sqrt{(S_{11}+S_{10})(S_{01}+S_{00})(S_{11}+S_{01})(S_{10}+S_{00})}}
\tag{2}
\]
where Ag_1 and Ag_2 are the two matrices to be compared, and S_{11}, S_{00}, S_{10}, and S_{01} denote the numbers of one-matches, zero-matches, and the two kinds of mismatches between corresponding bits. The value of S is in the range [0, 1], where 1 indicates the highest and 0 the lowest similarity between the samples. In immunology, cloning selects a number of antibodies with the highest affinity and clones them based on their antigenic affinities: the higher the antigenic affinity, the higher the number of clones generated. The total number of clones generated, N_c, is defined in Equation (3):
\[
N_c = \sum_{i=1}^{n} \operatorname{round}\!\left( \frac{\beta \cdot N}{i} \right)
\tag{3}
\]
where β is a multiplying factor, N is the total number of antibodies, and round(·) is the operator that rounds its argument to the closest integer.
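To make Eqns. (1)-(3) concrete, the following Python helpers implement them directly. This is an illustrative rendering (including the square root we read into the reconstructed Eqn. (2)), not the authors' code:

```python
import math

def hamming_affinity(ab, antigens):
    """Eqn. (1): negative count of mismatching bits over all antigen examples of a class."""
    difference = sum(a != g for ag in antigens for a, g in zip(ab, ag))
    return -difference

def similarity(ag1, ag2):
    """Eqn. (2): a correlation-style similarity in [0, 1] between two bit patterns."""
    s11 = sum(a == 1 and b == 1 for a, b in zip(ag1, ag2))
    s00 = sum(a == 0 and b == 0 for a, b in zip(ag1, ag2))
    s10 = sum(a == 1 and b == 0 for a, b in zip(ag1, ag2))
    s01 = sum(a == 0 and b == 1 for a, b in zip(ag1, ag2))
    denom = math.sqrt((s11 + s10) * (s01 + s00) * (s11 + s01) * (s10 + s00))
    if denom == 0:                      # degenerate case: a constant pattern
        return 1.0 if list(ag1) == list(ag2) else 0.0
    return 0.5 - (s10 * s01 - s00 * s11) / (2 * denom)

def total_clones(n, N, beta):
    """Eqn. (3): clone budget for the n highest-affinity antibodies (rank i = 1 is best)."""
    return sum(round(beta * N / i) for i in range(1, n + 1))
```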
In clonal mutation, the clone set C^i is used to produce the mutated offspring C^{i*}. The higher the affinity, the smaller the mutation rate. The mutation-rate control algorithm is described in Fig. 3, and the mutation step Δ is defined as follows:
\[
\Delta(t, y) = y \left( 1 - r^{\left( 1 - \frac{t}{T} \right)^{\lambda}} \right)
\tag{4}
\]
where t is the iteration number, T is the maximum number of iterations, r is a random value in the range [0, 1], and λ determines the degree of non-uniformity of the mutation.
Find the maximum and minimum in population C^i
For each Ab, do
    Generate a random value in the range [0, 1], named mr
    Generate another random value in the range [0, 1], named t0
    If mr < mutation_rate
        If t0 >= 0
            Ab = Ab + Δ(t, max − Ab)
        else
            Ab = Ab − Δ(t, Ab − min)
    return Ab
__________________________________________________________________
Fig. 3. Mutation rate control algorithm
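A runnable Python counterpart of Fig. 3 is sketched below, with antibodies shown as single real values for brevity (in the LPCR system they are 16x16 patterns, so the operation would be applied element-wise). One assumption is flagged in the code: as printed, the condition t0 >= 0 is always true for t0 drawn from [0, 1], so we treat it as a 0.5 threshold.

```python
import random

def delta(t, y, T, lam):
    """Eqn. (4): non-uniform mutation step that shrinks as iteration t approaches T."""
    r = random.random()
    return y * (1.0 - r ** ((1.0 - t / T) ** lam))

def mutate_population(population, t, T, lam=2.0, mutation_rate=0.05):
    """Fig. 3: each antibody (a real value here, for brevity) is nudged toward the
    population maximum or minimum with a step that decays over the iterations."""
    lo, hi = min(population), max(population)
    mutated = []
    for ab in population:
        mr, t0 = random.random(), random.random()
        if mr < mutation_rate:
            if t0 >= 0.5:                      # assumed threshold (see note above)
                ab = ab + delta(t, hi - ab, T, lam)
            else:
                ab = ab - delta(t, ab - lo, T, lam)
        mutated.append(ab)
    return mutated
```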
The final operation is clonal selection, which includes Hamming distance and similarity threshold selections. The Hamming distance threshold defines which antibodies are allowed to stay for further memory cell selection. The similarity threshold defines which antibodies are allowed to be added to the memory cells to become detectors. The value of the Hamming distance threshold should be adjusted empirically such that the antibodies have the ability to detect new cases correctly.
3.3 Dynamic Training Algorithm
One antigen (the UK mandatory typeface) from each class, together with antibodies generated by basic clonal selection, has been chosen to initialize the immune memory. After initialization, real characters are passed to a dynamic training algorithm, and training and testing of the immune memory cells go hand in hand to obtain better memory cells for classification. The clonal dynamic training algorithm is shown in Fig. 4.

While No. <= size of antigens
    Select an antigen Ag and start to train
    Classify the antigen using the current updated memory cells
    If the classification strategy recognizes the antigen, start with another antigen
    Otherwise generate antibodies Abs randomly and calculate the affinity
    Select the n Abs having the highest affinity and clone them
    Apply hypermutation to the clone set C^i to produce mutated offspring C^{i*}
    Re-calculate the HD between Ag and C^{i*}; select Abs for the next step
    Calculate the similarity between Ab and Ag
    Select matured Abs for the memory cells
    Stop training if the required number of matured antibodies is generated
End when all antigens have been trained
__________________________________________________________________
Fig. 4. Dynamic training algorithm
Classification is implemented with a fuzzy KNN approach, proposed by Keller et al. in [17], which provides an improvement over existing classification techniques. For testing, each pattern is passed through the memory cells, and the fuzzy KNN selects the k closest memory cells from the immune memory. The selected memory cells are then grouped according to their class labels, and the class of the largest group identifies the testing pattern.
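A bare-bones version of this classification step is sketched below in Python, using the similarity helper of Eqn. (2) given earlier and the simple largest-group vote described above (the full membership weighting of Keller's fuzzy KNN is omitted). It is an illustration, not the authors' implementation; k = 7 matches the setting the authors report as best overall.

```python
from collections import Counter

def classify(pattern, memory_cells, k=7):
    """
    memory_cells: list of (bit_pattern, class_label) pairs produced by dynamic training.
    Returns the label of the largest class group among the k most similar cells,
    using the similarity() helper of Eqn. (2) defined earlier.
    """
    ranked = sorted(memory_cells,
                    key=lambda cell: similarity(pattern, cell[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```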
4 Experiments and Results
Three different datasets, 'LPR0', 'LPR1' and 'LPR2', were collected from a car park, roads, streets and petrol stations within the UK. LPR0 consists of 950 samples of licence plates used for training. LPR1 and LPR2 are two datasets containing 400 samples, used to test the performance of the systems. Characters extracted from the LPR0 data were grouped into two parts: digits and letters. The digits have 10 classes (0 to 9) and the letters have 23 classes (A to Z without Q, O, I). Experiments were carried out with two different training methods: (1) single-pass training, where each antigen produces the same number of antibodies, and (2) dynamic training, described in Section 3.3. In both cases, all the antibodies were first generated from the mandatory typeface, with each antigen producing 30 antibodies; antigens taken from LPR0 generated only 10 antibodies each. The HD threshold was 25 for digits and 50 for letters; the similarity threshold was 0.93 for digits and 0.87 for letters. The initial population of antibodies was 30 and the hypermutation probability was 0.05. All parameters were determined by experimentation. Classification results also depend on the classification strategy. The effect of k in the fuzzy KNN classification was examined; k=5 for the digits and k=7 for the letters gave the best performance, with k=7 giving the best combined performance. Improvement can be further achieved by dividing the letters into two groups, G1 and G2. Table 1 presents the results for both training methods. The results show that dynamic training reduces the difficulties of single-pass training, such as large numbers of immune memory cells, low recognition accuracy, and time wasted in training and recognition.
Table 1. Two different training results for CSA

                 Digits Accuracy %        Letters Accuracy %
Parameter K      Single    Dynamic        Single    Dynamic
K=5              96.5      98.5           86        88
K=7              95        96.5           89        92.4
K=9              93        95             85        89
Table 2 presents the results for best training and testing (C=correct, I=Incorrect) for our Licence Plate Character Recognition.
Table 2. Training & testing results for CSA

Data set       Training LPR0        Testing LPR1           Testing LPR2
               Digits   Letters     C      1I     >2I      C      1I     >2I
Accuracy %     96.5     92.4        89.5   3.0    7.5      83.5   7.5    9.0
The performance of our CSA-based approach has also been compared to a Back Propagation Neural Network [18]; a feed-forward neural network consisting of three layers was employed. In this case, the multilayer perceptron (MLP) model had 256 nodes in the input layer and 20-50 nodes in the hidden layers, which were determined empirically. The letters are divided into two groups, N1 and N2, so that the confusion of similar characters can be corrected by the neural networks. The initial results for the performance of the digit network were sufficiently successful. The results are shown in Table 3.
Table 3. Training & testing results for ANN

Data set       Training LPR0        Testing LPR1           Testing LPR2
               Digits   Letters     C      1W     >2W      C      1W     >2W
Accuracy %     94.3     90.5        84.5   6.0    9.5      81.0   9.0    10.0
The experiments show that the clonal selection algorithm can generalize from examples and can successfully classify previously unseen examples of its training classes. Adjustments to the basic algorithm improved performance and illustrated how clonal selection can be used in LPCR. Compared to our neural network approach, the proposed AIS-based method was better than the ANN by more than 2 percentage points in training and more than 3 percentage points in testing. The algorithm has been designed to increase the number of good candidate memory cells and reduce the training time. Its main weakness, however, lies in efficiency: the time taken to generate the memory cells could make it unattractive for time-dependent applications such as most real-world problems.
5 Concluding Remarks
This paper reports on the use of the clonal selection algorithm for Licence Plate Character Recognition. The clonal selection algorithm can be characterized as a good alternative and a competitive approach in which individual antibodies compete while the whole population cooperates as an ensemble of individuals to present the final solution. The experimental results consistently show that the proposed algorithm has high classification precision.
Moreover, when compared with the BPNN, the average performance of the CSA is more accurate and robust. In particular, the proposed algorithmic structure was not evaluated at a fixed location; the images were acquired manually under various views and illumination conditions in order to closely resemble real-world situations. Licence Plate Recognition remains an important research topic for artificial intelligence systems. Future work includes increasing the size of the test set, improving the detection accuracy and classification speed, and investigating the hybridization of the CSA with Higher Order Neural Networks (HONNs).
References
1. Chang, S.L., Chen, L.S., Chung, Y.C., Chen, S.W.: Automatic licence plate recognition. IEEE Trans. Intell. Transp. Syst. 5(1), 42–53 (2004)
2. Wu, C., On, L.C., Weng, C.H., Kuan, T.S., Kengchung, N.G.: A Macao Licence Plate Recognition System. In: Proceedings of Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, vol. 7, pp. 4506–4510 (2005)
3. Duan, T.D., Hong Du, T.L., Phuoc, T.V., Hoang, N.: Building an automatic vehicle licence plate recognition system. In: Proc. Int. Conf. Comput. Sci. RIVF, pp. 59–63 (2005)
4. Hu, Y., Zhu, F., Zhang, X.: A Novel Approach for License Plate Recognition Using Subspace Projection and Probabilistic Neural Network. In: Wang, J., Liao, X.-F., Yi, Z. (eds.) ISNN 2005. LNCS, vol. 3497, pp. 216–221. Springer, Heidelberg (2005)
5. Martín, F., García, M., Alba, L.: New methods for automatic reading of VLP's (Vehicle Licence Plates). In: Proc. IASTED Int. Conf. SPPRA (2002)
6. Kim, K.I., Jung, K., Kim, J.: Color Texture-Based Object Detection: An Application to License Plate Localization. In: Lee, S.-W., Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, pp. 293–309. Springer, Heidelberg (2002)
7. Anagnostopoulos, C., Anagnostopoulos, I., Loumos, V., Kayafas, E.: A Licence Plate-Recognition Algorithm for Intelligent Transportation System Applications. IEEE Transactions on Intelligent Transportation Systems 7(3) (September 2006)
8. Broumandnia, A., Fathy, M.: Application of pattern recognition for Farsi licence plate recognition. In: The ICGST Int. Conf. Graphics, Vision and Image Processing (GVIP), vol. 2, pp. 25–31 (2005)
9. Chang, S.-L., Chen, L.-S., Chung, Y.-C., Chen, S.-W.: Automatic Licence Plate Recognition. IEEE Transactions on Intelligent Transportation Systems 5(1) (March 2004)
10. Anagnostopoulos, C., Alexandropoulos, T., Boutas, S., Loumos, V., Kayafas, E.: A template-guided approach to vehicle surveillance and access control. In: Proc. IEEE Conf. Advanced Video and Signal Based Surveillance, pp. 534–539 (2005)
11. de Castro, L.N., Timmis, J.: Artificial Immune Systems: A Novel Paradigm to Pattern Recognition. In: Corchado, A.J., Fyfe, C. (eds.) Artificial Neural Networks in Pattern Recognition, pp. 67–84. University of Paisley (2003)
12. Timmis, J., Knight, T., De Castro, L.N., Hart, E.: An Overview of Artificial Immune Systems. Natural Computation Series, 51–86 (2004)
13. Burnet, F.M.: Clonal Selection and After. In: Bell, G.I., Perelson, A.S., Pimbley Jr., G.H. (eds.) Theoretical Immunology, pp. 63–85. Marcel Dekker Inc., New York (1978)
14. de Castro, L.N., Von Zuben, F.J.: aiNet: An Artificial Immune Network for Data Analysis. In: Sacker, R.A., Newton, C.S. (eds.) Data Mining: A Heuristic Approach. Idea Publishing Group, Hershey (2001)
Integration of Ab Initio Nuclear Physics Calculations with Optimization Techniques
Masha Sosonkina¹, Anurag Sharda¹, Alina Negoita², and James P. Vary²
¹ Ames Laboratory/DOE, Iowa State University, Ames, IA 50011, USA {masha,anurag}@scl.ameslab.gov
² Physics Department, Iowa State University, Ames, IA 50011, USA {alina,jvary}@iastate.edu
Abstract. Optimization techniques are finding their inroads into the field of nuclear physics calculations where the objective functions are very complex and computationally intensive. A vast space of parameters needs searching to obtain a good match between theoretical (computed) and experimental observables, such as energy levels and spectra. In this paper, we propose a design integrating the ab initio nuclear physics code MFDn and the VTDIRECT95 code for derivative-free optimization. We experiment with the initial implementation of the design showing good matches for several single-nucleus cases. For the parallel MFDn code, we determine appropriate processor numbers to execute efficiently a multiple-nuclei parameter search. Keywords: No Core Shell Model, MFDn, Derivative-free Optimization, VTDIRECT95.
1 Motivation
Unlike electrons in the atom, the interaction between nucleons is not known precisely and is complicated. The shell model is the fundamental tool to study the structure of nuclei. The basic idea is that the nucleons move in an average potential generated by the mutual interactions of the nucleons. The strong Nucleon-Nucleon (NN) interaction as well as 3-nucleon (NNN) interactions generate the potential that describes the nucleon energy levels in the nucleus. In particular, NN and NNN interactions tuned to fit light nuclei are used in nuclear astrophysics for solar models, supernova modeling, and Big Bang nucleosynthesis. The techniques for solving these problems also find applications in the fields of quantum chemistry, condensed matter physics, and atomic, nuclear, and particle physics. Until recently, the No Core Shell Model [1,2] has been limited to nuclei up to atomic mass A of 16. Work is underway to extend this method to heavier nuclei [3]. The effective Hamiltonian operator derived from the CD-Bonn interaction [4] gives a poor description of nuclei with atomic mass near 48. Figure 1 describes the matches between the theoretical and experimentally obtained energy levels for 49Sc, where the initial version of the theory is given in the rightmost column.
A problem with the existing Hamiltonian is that the computed spectrum is too compressed compared with the experimental spectrum. The addition of the three terms — isospin-dependent V0, central V1, and tensor-interaction Vtens — results in reasonable low-lying spectra for the nuclei involved in the double-beta decay of 48Ca. One of the physics goals is to test whether the same modified Hamiltonian used for the isotopes of the nuclei with atomic mass of 48 is able to describe other heavy nuclei. These three terms and possibly other (up to 20) parameters need to be searched to obtain their best match to the experimental values (see Fig. 1, for example). To find this match according to some criteria, it is required to evaluate energies at many points in a parameter search space. In particular, a criterion, called χ2, may be calculated that quantifies the match using weights (see Sect. 3.2). This process may be automated by taking advantage of optimization techniques which will generate the points at which χ2 may be evaluated. Note that, since derivatives do not come into the picture for complex nuclear physics calculations on which χ2 depends, derivative-free optimization is considered. As a trade-off, a large number of function evaluations is typically required even to find a local minimum. The time taken by such an optimization algorithm is directly proportional to the cost of the objective function evaluation. Thus, parallel implementations of both the function evaluation and the optimization algorithm may be beneficial. The remainder of the paper is organized as follows. In Sect. 2, we give an overview of two parallel software packages under consideration: Many Fermion Dynamics nuclear (MFDn) and the optimization algorithm VTDIRECT95.
Fig. 1. Matching of experimental (Exp) and theoretical energy levels of 49Sc using the CD-Bonn potential in its initial version (CD-Bonn) and with the new searched terms (CD-Bonn + 3 terms). Each energy level is annotated with its spin value.
Fig. 2. Example of the scatter plot of boxes. The F-axis is function values, and the D-axis is box diameters.
In Sect. 3, we first present a design to make the MFDn code and the optimization algorithm work in concert. Then, we describe our initial implementation. Sect. 4 presents the computational experiments and analyzes them with respect to nuclear physics objectives. Sect. 5 concludes.
2 Overview of Nuclear Physics and Optimization Packages
Many Fermion Dynamics nuclear (MFDn) [5] is a parallel code used for large-scale nuclear structure calculations in the No Core Shell Model (NCSM) formalism [1,2], which has been shown to be successful for up to 16-nucleon problems on present-day computational resources. The MFDn code is tasked with computing a few lowest (≈15) converged solutions, called wave functions, to the many-nucleon Schrödinger equation:

H |φ⟩ = E |φ⟩ .    (1)

Then other properties, called observables, are formed from the calculated wave functions. The matrix H in (1) is the Hamiltonian operator, which is typically solved using Lanczos diagonalization since H is symmetric and sparse. However, the Lanczos iterative process may be very expensive due to the huge dimensionality of H with many off-diagonal elements. The number of Lanczos iterations also increases significantly for the energy levels beyond the ground state. For example, for the 16O nucleus in the 6ℏω basis space, the ground-state energy level requires only 35 Lanczos iterations, while 15 excited states need at least 200 Lanczos iterations to converge. Note that, in this case, the constructed Hamiltonian H has the dimension of 26,483,625. MFDn constructs the m-scheme basis space, evaluates the Hamiltonian matrix elements in this basis using efficient algorithms, diagonalizes the Hamiltonian to obtain the lowest eigenvectors and eigenvalues, then post-processes the wave functions to obtain a suite of observables and to compare them with experimental values.
VTDIRECT95 [6] is a Fortran 95 suite of parallel codes implementing the derivative-free optimization algorithm DIRECT [7], which takes a set of problem- and algorithm-dependent input parameters and finds the global minimum of an objective function f inside the feasible set D. Each iteration of DIRECT consists of the following steps.
1. INITIALIZATION. Normalize the feasible set D to be the unit hypercube. Sample the center point ci of this hypercube and evaluate f(ci). Initialize fmin = f(ci), evaluation counter m = 1, and iteration counter t = 0.
2. SELECTION. Identify the set S of "potentially optimal" boxes that are subregions of D. A box is potentially optimal if, for some Lipschitz constant, the function value within the box is potentially smaller than that in any other box (a formal definition with a parameter is given in [8]).
3. SAMPLING. For any box j ∈ S, identify the set I of dimensions with the maximum side length. Let δ equal one-third of this maximum side length. Sample the function at the points c ± δei for all i ∈ I, where c is the center of the box and ei is the ith unit vector.
4. DIVISION. Divide the box j containing c into thirds along the dimensions in I, starting with the dimension with the lowest value of wi = min{f(c + δei), f(c − δei)}, and continuing to the dimension with the highest wi. Update fmin and m.
5. ITERATION. Set S = S − {j}. If S ≠ ∅, go to 3.
6. TERMINATION. Set t = t + 1. If the iteration limit or the evaluation limit has been reached, stop. Otherwise, go to 2.
Initially, only one box exists in the system. As the search progresses, more boxes are generated, as illustrated by the scatter plot shown in Fig. 2, where each circle represents a box. The sizes of boxes increase along the D-axis (diameter) and the function values at box centers increase along the F-axis (function). All the boxes with the same diameter belong to a "box column". Reference [8] proves that all potentially optimal boxes in S are on the lower right convex hull of the scatter plot in Fig. 2. To produce more tasks in parallel, new points are sampled around all boxes in S along their longest dimensions during SAMPLING. This modification also removes the step ITERATION, thus simplifying the loop. In the DIVISION step, multiple new boxes are generated for each potentially optimal box. The multiple function evaluation tasks at each iteration give rise to a natural functional parallelism, which is especially beneficial for expensive objective functions. The parallel implementation distributes the work to multiple masters in the SELECTION phase. The functions are then evaluated by the pool of workers to accomplish SAMPLING. VTDIRECT95 also supports a user-level checkpointing method to restart function evaluations through log files. Several other options are provided by the optimization algorithm to improve the performance on large-scale parallel systems.
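As an illustration of the SELECTION-SAMPLING-DIVISION loop just described, the following is a minimal single-process sketch in Python. It is not the VTDIRECT95 implementation: the selection rule is simplified to keeping the best box per distinct diameter instead of the exact lower-right convex hull, and all names and the toy objective are illustrative. In VTDIRECT95 the evaluations performed inside the SAMPLING step are what get farmed out to the worker pool.

import numpy as np

def direct_sketch(f, lower, upper, max_evals=200):
    """Simplified DIRECT-style search over the box [lower, upper]."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    span = upper - lower

    def unscale(u):                      # unit hypercube -> original domain
        return lower + u * span

    n = lower.size
    boxes = [(np.full(n, 0.5), np.full(n, 1.0), f(unscale(np.full(n, 0.5))))]
    evals = 1
    while evals < max_evals:
        # SELECTION (simplified): best box for each distinct box diameter.
        best_per_diam = {}
        for idx, (c, s, fc) in enumerate(boxes):
            d = round(float(np.linalg.norm(s / 2.0)), 12)
            if d not in best_per_diam or fc < boxes[best_per_diam[d]][2]:
                best_per_diam[d] = idx
        for idx in set(best_per_diam.values()):
            c, s, fc = boxes[idx]
            longest = np.flatnonzero(np.isclose(s, s.max()))
            delta = s.max() / 3.0
            # SAMPLING: evaluate c +/- delta*e_i along every longest dimension.
            samples = {}
            for i in longest:
                for sign in (+1.0, -1.0):
                    p = c.copy()
                    p[i] += sign * delta
                    samples[(i, sign)] = (p, f(unscale(p)))
                    evals += 1
            # DIVISION: trisect along longest dims, lowest w_i first.
            order = sorted(longest, key=lambda i: min(samples[(i, +1.0)][1],
                                                      samples[(i, -1.0)][1]))
            for i in order:
                s = s.copy()
                s[i] /= 3.0
                for sign in (+1.0, -1.0):
                    p, fp = samples[(i, sign)]
                    boxes.append((p, s.copy(), fp))
            boxes[idx] = (c, s.copy(), fc)   # shrunken parent box keeps its center
    best = min(boxes, key=lambda b: b[2])
    return unscale(best[0]), best[2]

# Usage: minimize a toy quadratic over [-2, 2]^2.
x_best, f_best = direct_sketch(lambda x: float(np.sum((x - 0.3) ** 2)), [-2, -2], [2, 2])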
3 Design of Integrated System
In nuclear structure calculations, the computed and experimental results are matched for single as well as multiple nuclei using the χ2 function. Thus, a general case of multiple nuclei is considered in the design. At the initialization, the search algorithm divides the search domain into multiple sub-domains and assigns each sub-domain a set of masters (SM). Each sub-domain master is responsible for the generation of evaluation points within its own sub-domain. Figure 3 shows a diagram for the case of multiple nuclei, N1, N2, and N3, with one MFDn execution per nucleus. The overall hierarchy consists of three tiers. The first two tiers correspond to the set of processors used by VTDIRECT95 while the last tier corresponds to the set of processors used by MFDn. It is denoted as the MFDn pool and enclosed in a dash-lined oval. The evaluation points generated by the SM are then handed to the workers W assigned to VTDIRECT95 (represented as shaded squares in Fig. 3). These workers act as connectors between VTDIRECT95 and MFDn. Each worker W pre-processes the evaluation point and then delivers the pre-processed data to the MFDn pool. The number of processing elements (PE) required by MFDn typically depends on the size of the Hamiltonian matrix and the hardware characteristics.
Fig. 3. Design layout for the integration system
Another functionality of a W worker is therefore to assemble from the MFDn pool the needed number of PEs for an MFDn run. Once an instance of MFDn completes, its output is communicated back to the associated W worker, which is responsible for computing the χ2 function aggregating the MFDn outputs for all the nuclei and for delivering it to the proper SM.
3.1 Enabling Seamless Integration
The goal of this design is to treat both the VTDIRECT95 and the MFDn codes as "black-box" components and provide an infrastructure to integrate them. Here, we detail the implementation of the design. The MFDn code requires an input file which contains the set of parameters for every MFDn run. The output of the MFDn code is a text file including theoretical observables, such as excitation energies for a given nucleus for the desired number of states. The optimization algorithm produces points in a search space in each iteration and then accepts the function evaluation at those points. Thus, both MFDn and VTDIRECT95 have well-defined input and output interfaces. Consider the following external additions necessary to interface MFDn and VTDIRECT95:
1. Input Modifier (IM) inserts the points generated by VTDIRECT95 into an input file for MFDn.
2. Wait (W) waits for a completion signal from the MFDn code. The Wait stub may also gather the runtime performance information from MFDn.
3. Output Modifier (OM) post-processes the MFDn output file and creates an input for the χ2 function.
For a single iteration of VTDIRECT95, Fig. 4 shows a workflow between the VTDIRECT95 and MFDn codes. The main goal of the described additions is to provide interfaces and placeholders for pre- and post-processing. Thus, we denote them as stubs in Fig. 4. VTDIRECT95 generates an evaluation point which is inserted into the MFDn input file through the IM stub. MFDn produces
then a set of values to be captured by the OM stub and provided to the χ2 evaluator, which in turn returns the objective function value to VTDIRECT95. All the stubs related to the MFDn execution are grouped in a single box together with MFDn in Fig. 4. The χ2 evaluator is a generic module, the implementation of which may change depending on the application scientist's goals. Similarly, VTDIRECT95 is depicted as a separate box since it is a general-purpose optimization code. The integrated workflow, due to its flexibility, allows changing the objective function construction and the search algorithm without affecting the overall organization.
Fig. 4. Workflow diagram for the MFDn and VTDIRECT95
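A minimal Python sketch of the worker-side wiring described above follows. The stub names (IM, W, OM) come from the text, but the MFDn executable name, its input-file format, the output-line format, and the mpirun-style launch are all placeholders invented for illustration, since the paper does not specify them.

import re
import subprocess

def im_stub(point, template="mfdn_input.tmpl", path="mfdn.input"):
    """Input Modifier: write the evaluation point v = (V0, V1, Vtens) into an
    MFDn input file by filling hypothetical @V0@/@V1@/@VTENS@ markers."""
    with open(template) as f:
        text = f.read()
    for name, value in zip(("V0", "V1", "VTENS"), point):
        text = text.replace("@%s@" % name, "%.8f" % value)
    with open(path, "w") as f:
        f.write(text)
    return path

def w_stub(input_path, nprocs):
    """Wait stub: launch MFDn on nprocs PEs and block until it completes."""
    subprocess.run(["mpirun", "-np", str(nprocs), "mfdn", input_path], check=True)

def om_stub(output_path="mfdn.out"):
    """Output Modifier: extract theoretical levels (energy, spin) from the
    MFDn output; the 'LEVEL <energy> J=<spin>' line format is invented."""
    levels = []
    with open(output_path) as f:
        for line in f:
            m = re.match(r"\s*LEVEL\s+(\S+)\s+J=(\S+)", line)
            if m:
                levels.append((float(m.group(1)), m.group(2)))
    return levels

def evaluate_point(point, experimental, chi2, nprocs=15):
    """One objective-function evaluation performed by a worker W."""
    w_stub(im_stub(point), nprocs)
    return chi2(om_stub(), experimental)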
3.2 Construction of the χ2 Function
The χ2 is constructed using a theory file, an experimental file, and the base energy value of the given nucleus. The theory file is an output from the MFDn code which contains calculated observables. The experimental file has the energy levels as found experimentally by different national scientific organizations [9,10]. In addition to the energy values, each energy level is associated with the spin j of the protons/neutrons. There are many options in the construction of χ2. A particular choice depends on such parameters as the quality of the experimental data and the questions nuclear physicists want to answer comparing the theoretical and experimental energy levels. As an example consider the following χ2 definition:

χ2(v) = Σ_{1≤i_t≤15, 1≤i_e≤k} ( E_{i_e}(v) − Ẽ_{i_t}(v) )² · σ²_{i_e} ,    (2)

where v = (V0, V1, Vtens), and E and Ẽ are the absolute experimental and theoretical energies, respectively; i_e and i_t are the indices of the corresponding matched energy levels (one i_e paired with one i_t), i.e., those levels that have the same spin j; and k is the maximum desired number of matches. Each experimental energy level l_e is assigned the weight σ_{l_e}. This weight is inversely proportional to the distance of that energy level from the ground energy level. In (2), different weight schemes were considered, such as, for every l_e, σ_{l_e} = 1/l_e or σ_{l_e} = 1/2l_e.
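A small sketch of how Eq. (2) might be computed once the level matching has been done, in Python. The interpretation of l_e as the excitation energy of the experimental level above the ground state is one reading of the text and is an assumption, as are all names and the toy numbers.

def chi_squared(matched_pairs, ground_energy):
    """matched_pairs: list of (E_exp, E_theory) for levels paired by equal
    spin j, as in Eq. (2).  Weight sigma_le = 1/l_e, with l_e taken here as
    the excitation energy of the experimental level above the ground state."""
    total = 0.0
    for e_exp, e_th in matched_pairs:
        l_e = abs(e_exp - ground_energy)
        sigma = 1.0 / l_e if l_e > 0.0 else 1.0   # guard for the ground level
        total += (e_exp - e_th) ** 2 * sigma ** 2
    return total

# Usage with made-up energies (MeV): three matched levels.
print(chi_squared([(0.0, 0.001), (3.83, 3.79), (4.50, 4.61)], ground_energy=0.0))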
4 Experiments
Computing platforms at the National Energy Research Scientific Computing Center (NERSC) at the Lawrence Berkeley National Laboratory served as the testbed for the development and testing. The results presented here were obtained on the NERSC IBM p575 POWER 5 system, named Bassi. It is a distributed-memory computer with 888 processors (comprising 111 nodes, each sharing 32 GBytes of memory) used for scientific applications. Each Bassi processor has a theoretical peak performance of 7.6 GFlops, and the nodes are interconnected by the IBM "Federation" HPS switch. An implementation of the proposed design for single and multiple nuclei has been developed and tested. It integrates the serial VTDIRECT95 and parallel MFDn codes, while the usage of parallel VTDIRECT95 has been left as future work. The following three nuclei have been considered: 47K, 48Ca, and 49Ca. The Hamiltonian matrices are sparse and their sizes are 136231, 12000, and 15666, respectively, in the lowest available model space. The MFDn execution time depends heavily on the Hamiltonian size. Also, the complexity ("shape") of the objective function drives the time required by the optimization algorithm to find the minimum. Figure 5(a) shows the number of evaluations required per VTDIRECT95 iteration as the number of iterations grows for a sample nucleus. For the multiple-nuclei fit, the runtime is guided by the runtime of the heaviest nucleus since the evaluations for all the nuclei are needed to construct the χ2 in this case. Therefore, it is desirable to adjust the number of processors allocated to a particular nucleus based on its computational cost relative to the other nuclei in the set. In particular, for our example, 47K has the largest Hamiltonian matrix, so executing it on the largest subset of processors makes sense. Figure 5(b) depicts the case when all three nuclei are evaluated once with three different sets of processor numbers (shown on the x-axis). The parallel time to compute all three nuclei indeed decreases as the number of processors is increased compared with the base case of using a small equal number of processors for each nucleus. When about twice as many (15) processors are assigned to the calculation of each nucleus, the timings for the smaller Hamiltonians decrease by about half. The runtime for 47K, however, decreases only slightly. By augmenting the number of processors to 45 for 47K, the execution time is decreased dramatically, while for the smallest Hamiltonian of 48Ca, the time has actually increased with the increase in the number of processors to 21. The latter indicates that the parallel overhead starts to dominate the overall execution time. In general, besides the Hamiltonian matrix size, other factors such as communication overhead and hardware characteristics may affect the number of processors used to calculate a particular nucleus efficiently. The optimization of the parameters for 48Ca produced results (Fig. 6(a)) matching theoretical observables correctly to their counterparts in the experimental data file. The first six states have been matched correctly with their spins, and the difference in energy levels between theoretical observables and experimental data is less than 0.001 MeV. The second nucleus considered is 49Ca.
Fig. 5. Execution of MFDn for single- and multiple-nuclei cases. (a) Evaluations per iteration for 49Ca. (b) Execution time of 1 evaluation of VTDIRECT95 for 47K, 48Ca, and 49Ca on different numbers of processors on Bassi.
The ground-state level in the theoretical observables was matched within 0.02 MeV of the experimental ground-state energy value (Fig. 6(b)). Higher energy levels in the theoretical observables remained unmatched to counterparts in the experimental data. A likely reason is that some of the energy levels in the experimental data have uncertain spin assignments. The findings for 49Ca are important from the physics point of view since they give new directions for fitting nuclei with a similar mass starting with the obtained set of parameters. Similar arguments may be used to explain the unmatched energy levels in 47K. Checkpointing is a valuable feature of VTDIRECT95. Supercomputers with batch scheduling typically have an upper bound on the time any job is allowed to execute.
Fig. 6. Matching of experimental and theoretical energy levels. (a) 48Ca. (b) 49Ca.
For example, the maximum time permitted on NERSC supercomputers is only 48 hours, which is surely not enough to find the global or even a local minimum for such an expensive function evaluation as described in this paper. Hence, we have utilized the checkpointing feature as a routine procedure to restart the integrated code for the next maximum time allowed by the queuing system.
5 Summary and Future Work
We have proposed a design for the integration of the MFDn and VTDIRECT95 parallel codes. The integration uses the master-worker paradigm of the VTDIRECT95 code and produces a three-tier scheme. Our contribution is to show how an expensive multiprocessor function evaluation may fit into this scheme. In the paper, we have shown an implementation of the proposed design for the case of sequential VTDIRECT95, which produces one point at a time. In this situation, we have studied various definitions of the objective function (χ2 ) and obtained good matches between the theoretical and experimental energy levels for 48 Ca and the ground-state energy level for 49 Ca. We have also investigated the efficient executions of MFDn when the results from multiple nuclei are to be considered simultaneously by VTDIRECT95. We have found that assigning different numbers of processors to different MFDn executions, typically in accordance with the Hamiltonian matrix size, reduces the overall time for a function evaluation needed by VTDIRECT95. In the future, we plan to provide an implementation taking advantage of parallelism in both VTDIRECT95 and MFDn and also to consider a variety of nuclei in the multiple-nuclei case. We will also compare the results produced by VTDIRECT95 with other derivative-free optimization techniques, such as those from the Toolkit for Advanced Optimization (TAO) [11]. The integrated software will be a useful tool for a wide range of ab initio nuclear physics calculations. Acknowledgments. The work was supported in part by Iowa State University under the contract DE-AC02-07CH11358 with the U.S. Department of Energy, by the U.S. Department of Energy under the grants DE-FC02-07ER41457 (UNEDF SciDAC-2) and DE-FG02-87ER40371 (Division of Nuclear Physics), and by NERSC.
References 1. Navratil, P., Vary, J., Barrett, B.: Properties of 12 C in the ab-initio Nuclear Shell Model. Phys. Rev. Lett. 84, 5728 (2000) 2. Navratil, P., Vary, J., Barrett, B.: Large-basis ab-initio No-core Shell Model and its application to 12 C. Physical Review C62, 54311 (2000) 3. Vary, J., Popescu, S., Stocia, S., Navratil, P.: No Core Shell Model A=47 and A=49. nucl-th/0607041/ 4. Machleidt, R., Sammarruca, F., Song, Y.: Nonlocal nature of the nuclear force and its impact on nuclear structure. Phys. Rev. C53, C63, 024001, Ref 9 (1996, 2001)
5. Vary, J.: The Many-Fermion Dynamics Shell-Model Code. Iowa State University (1992) 6. He, J., Watson, L., Sosonkina, M.: Algorithm XXX: VTDIRECT95: Serial and parallel codes for the global optimization algorithm DIRECT. ACM Transactions on Math. Soft. (submitted, 2007) 7. Jones, D.: The DIRECT global optimization algorithm. Encyclopedia of Optimization 1, 431–440 (2001) 8. Jones, D., Perttunen, C., Stuckman, B.: Lipschitzian optimization without the Lipschitz constant. J. Optimization Theory and Applications 79, 157–181 (1993) 9. Electronic version of nuclear data sheets. telnet://bnlnd2.dne.bnl.gov 10. Burrows, T.: Nuclear data sheets, 74, 1; Nuclear data sheets 76, 191 (1995) 11. Benson, S., McInnes, L.C., Moré, J., Munson, T., Sarich, J.: TAO user manual (revision 1.9). Technical Report ANL/MCS-TM-242, Mathematics and Computer Science Division, Argonne National Laboratory (2007), http://www.mcs.anl.gov/tao
Non-uniform Distributions of Quantum Particles in Multi-swarm Optimization for Dynamic Tasks Krzysztof Trojanowski Institute of Computer Science, Polish Academy of Sciences Ordona 21, 01-237 Warsaw, Poland [email protected]
Abstract. This paper presents research on a mixed multi-swarm optimization approach applied to dynamic environments. One version of this approach, called mQSO, is the subject of our special interest. The mQSO algorithm works with a set of particles divided into sub-swarms, where every sub-swarm consists of two types of particles: classic and quantum ones. The research is focused on studying properties of the latter type. Two new distributions of new locations for the quantum particles are proposed: a static and an adaptive one. Both of them are based on an α-stable symmetric distribution. In contrast to already published methods of distribution of new locations, the proposed methods allow the locations to be distributed over the entire search space. The obtained results show the high efficiency of the mQSO approach equipped with the proposed two new methods.
1 Introduction
In the presented research a mixed multi-swarm optimization in dynamic environments is studied. Application of the mixed multi-swarm approach to dynamic optimization has already been tested and has proved its efficiency. Approaches with a static and a varying number of sub-swarms [1], [2], [3] as well as approaches with an adaptive number of species in the swarm [4], [5], [6] have been researched. A version with a static number of sub-swarms called mQSO, where two types of particles are in use (quantum particles and classic ones), and especially their rules of movement, became the subject of our interest. Classic particles use velocity vectors to evaluate their new positions. The rules for quantum particles are based on the random distribution of possible new locations of a particle around its current location, similarly to the distribution of the locations in the quantum cloud of the atom. The rules for quantum particles proposed in [2], [3] are based on the idea of uniform distribution over the space of a quantum cloud, which is a hyper-sphere of a constant radius with the current location in the middle. In this paper we examine strategies of the quantum particle movement that are alternative to the strategy mentioned above. Two methods of evaluation of new locations of the quantum particle are proposed and the efficiency of the mQSO equipped with these methods is experimentally verified.
As a dynamic test-bed the MPB [7] generator was selected. In MPB we optimize in a real-valued 5-dimensional search space and the fitness landscape is built of a set of unimodal functions individually controlled by parameters that allow creating different types of changes. The paper is organized as follows. In Sec. 2 there is a brief presentation of the optimization algorithm. Two new methods of generation of the quantum particle's new locations are described in Sec. 3. Section 4 presents the settings of the algorithm's parameters. Section 5 shows a measure used for evaluation of the results of experiments and the selected testing environment, while Sec. 6 presents the results of experiments performed with the environment. Section 7 concludes the presented research.
2 Quantum Multi-swarm
A simple scheme of the particle swarm optimization algorithm is given in Algorithm 1. A PSO optimizer is equipped with a set of particles xi where i ∈ [1, . . . , N]. Each of the particles represents a solution in an n-dimensional real-valued search space. For the search space a fitness function f(·) is defined which is used to evaluate the quality of the solutions. A particle yi represents the best solution found by the i-th particle (called the particle attractor), and a particle y∗ – the best solution found by the swarm (called the swarm attractor). The scheme is written for a maximization problem.

Algorithm 1. The particle swarm optimization
  Create and initialize the swarm
  repeat
    for i = 1 to N do
      if f(xi) > f(yi) then
        yi = xi
      end if
      if f(yi) > f(y∗) then
        y∗ = yi
      end if
    end for
    update location and velocity of all the particles
  until stop condition is satisfied
Search properties of the PSO scheme in Algorithm 1 are represented by the step "update location and velocity". In this step there are two main actions performed: first the velocity of each of the particles is updated and then all the particles change their location in the search space according to the new values of velocity vectors and the kinematic laws. Formally for every iteration t of the search process every j-th coordinate of the vector of velocity v as well as the coordinate of the location x undergo the following transformation [8]:

v_j^{t+1} = χ ( v_j^t + c1 r1^t (y_j^t − x_j^t) + c2 r2^t (y_j^{∗t} − x_j^t) ),
x_j^{t+1} = x_j^t + v_j^{t+1},    (1)
where r1^t and r2^t are random values uniformly generated in the range [0, 1], χ is a constriction factor with χ < 1, and c1 and c2 control the attraction to the best found personal and global solutions, respectively. The basic idea presented in Algorithm 1 has been developed for non-stationary optimization applications. One of the first significant changes in this scheme is the introduction of a multi-swarm. In the presented approach the number of sub-swarms is constant during the process of searching. Each of them is treated as an independent self-governing population which is not influenced by any of the neighbors. However, there are mechanisms which periodically perform some actions based on the information about the state of search of the entire swarm [3]. To guarantee the appropriate distribution of the sub-swarms over the entire search space, the exclusion mechanism eliminates sub-swarms which are located too close to each other. When the sub-swarms are too close to each other, the occupation of the same optimum is most likely to occur. In this case one of them is selected to be eliminated and a new one is generated from scratch. Any two sub-swarms are considered as located too close to each other if, for the best solutions from the compared two sub-swarms, the Euclidean distance is closer than the defined threshold ρ. In [3] yet another mechanism of sub-swarm management was proposed, called anti-convergence, which protects against convergence of sub-swarms. However, it was turned off in the presented experiments. The last of the sub-swarm management mechanisms described in [3] is based on mixing of types of particles in sub-swarms. In the presented research the mixed sub-swarms consist of two types of particles governed by two different rules of movement. While the location of the particles of the first type is evaluated according to the classic formulas as discussed above, the remaining ones are treated as quantum particles and change their location according to the analogy with the quantum dynamics of particles. All the particles in such a mixed sub-swarm share the information about the current best position and the best position ever found by the sub-swarm. The idea of the quantum particle proposed by Blackwell and Branke in [2] originates from the quantum model of the atom where the trajectories of electrons are described as quantum clouds. Adaptation of this idea to the model of movement of the particles rejects the kinematic laws used in classic PSO for evaluation of a distance traveled by the particle with a constant velocity in a period of time. Instead, a new position of the quantum particle is randomly generated inside a cloud of the given range rcloud surrounding y∗, i.e. the current sub-swarm attractor. In the quantum model the particle's speed becomes irrelevant, because every location inside the cloud can be chosen as a new location with a non-zero probability. The model of quantum particles has been extended in this paper. Since the model proposed in [2] assumes the uniform distribution of the set of possible new locations of the particle over the cloud's space, it was interesting to test and verify other types of distributions.
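A minimal Python sketch of the two movement rules just described is given below. The sub-swarm bookkeeping, exclusion, and anti-convergence mechanisms are omitted, and all names are illustrative; the constants come from the parameter settings quoted in Sec. 4.

import numpy as np

rng = np.random.default_rng(0)
CHI, C1, C2 = 0.7298, 2.05, 2.05        # constriction-factor settings (see Sec. 4)

def move_classic(x, v, y, y_star):
    """Constriction-factor update of Eq. (1) for one classic particle."""
    r1, r2 = rng.random(x.size), rng.random(x.size)
    v_new = CHI * (v + C1 * r1 * (y - x) + C2 * r2 * (y_star - x))
    return x + v_new, v_new

def move_quantum_uniform(y_star, r_cloud):
    """Baseline quantum rule: a point drawn uniformly from the hyper-sphere
    (ball) of radius r_cloud centred at the sub-swarm attractor y_star."""
    direction = rng.normal(size=y_star.size)
    direction /= np.linalg.norm(direction)
    radius = r_cloud * rng.random() ** (1.0 / y_star.size)   # uniform in the ball
    return y_star + radius * direction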
3 Movement of Quantum Particles
Two new types of distribution of new locations are considered in this paper. The first one is defined with static rules while the second one with adaptive rules. Both of them are based on a two-phase mechanism. In the first phase a direction θ is selected. In the second phase a distance d from the original is calculated. The direction θ can be obtained with use of a random variable from the angularly uniform distribution on the surface of a hyper-sphere [9]. The distance d is an α-stable random variate and is computed as follows:

d = SαS(0, σ),    (2)

and

σ = rSαS · (Dw/2),    (3)

where SαS(·, ·) represents an α-stable symmetric distribution variate and Dw is the width of the feasible part of the domain, i.e. the distance between a lower and an upper boundary of the search space. The new location is based on the found direction θ and a distance d from the original. This is an isotropic distribution. The α-stable distribution is controlled by four parameters: stability index α (0 < α ≤ 2), skewness parameter β, scale parameter σ and location parameter μ. The Chambers-Mallows-Stuck method of generation of the α-stable symmetric random variables [10] can be used. The method for σ = 1 and μ = 0, with a correction for the case where α = 1 given by Weron in 1996 [11], is presented in (4). To calculate the α-stable distributed random variate X two other independent random variates are needed: a random variate U, which is uniformly distributed on [−π/2, π/2], and an exponential random variate W obtained with rate parameter λ = 1:

X = S_{α,β} · [ sin(α(U + B_{α,β})) / (cos U)^{1/α} ] · [ cos(U − α(U + B_{α,β})) / W ]^{(1−α)/α},   if α ≠ 1,
X = (2/π) [ (π/2 + βU) tan U − β ln( (W cos U) / (π/2 + βU) ) ],   if α = 1,    (4)

where B_{α,β} = (1/α) arctan( β tan(πα/2) ) and S_{α,β} = [ 1 + β² tan²(πα/2) ]^{1/(2α)}. In the symmetric version of this distribution (called SαS, i.e. symmetric α-stable distribution) β is set to 0. For α = 2 the SαS(μ, σ) distribution reduces to the Gaussian N(μ, σ) and in the case of α = 1 the Cauchy C(μ, σ) is obtained. In (3) rSαS is a scale parameter. The difference between the static and the adaptive version of the distribution is in the way of calculation of the distance d. In the static version the distance depends merely on the α-stable generator. In the adaptive version it depends on the value returned by the generator multiplied by the normalized fitness of the particle. The latter version was inspired by a mutation operator introduced in [12], where it was a component of the immune optimization algorithm called opt-aiNet, designed for multimodal function optimization.
The mutation operator uses independent random variates for modification of each of the coordinates. This approach gives an isotropic distribution for Gaussian random variables but unfortunately turns into a non-isotropic one for any other α-stable distribution in a multidimensional search space. We wanted to keep to isotropic distributions, therefore the operator was not migrated as-is. In our adaptive approach the direction is evaluated in the same way as in the static one, but the distance is calculated with respect to the current fitness values of the remaining antibodies in P:

d = SαS(0, σ) · exp(−f̃(x_i)),    (5)

where σ is calculated as in (3) and f̃(x_i) is the fitness of the i-th solution x_i normalized in [0, 1] with respect to the fitness values of all the solutions in P:

f̃(x_i) = ( f(x_i) − f_min ) / ( f_max − f_min ),  where f_max = max_{j=1,...,N} f(x_j) and f_min = min_{j=1,...,N} f(x_j).    (6)
Fig. 1. Distribution of the new points in 2-dimensional search space for α: 2, 1, 0.5, and 0.1
Fig. 1 presents sample distributions of a set of points in the 2-dimensional search space generated for the same original with the static method of generation. There are four distributions for four different values of α: 2, 1, 0.5, and 0.1. It often happens that the domain of possible solutions is limited by a set of box constraints and only the solutions that fit within the constraints are classified as feasible. Both types of distribution presented above allow new locations to be generated over the entire domain, so it is possible to generate feasible as well as unfeasible locations. From the theoretical point of view we can easily cope with unfeasible locations simply by allowing them to stay where they are, because the evaluation function formula is usually defined for all points in Rn. However, from the engineering point of view we cannot accept such a free treatment, since in the real world the constraints are based on the knowledge of the modeled phenomenon and represent its features, e.g. temperature (which cannot be less than -273 C or higher than some reasonable limit: +100 C for water or the smoke point for an oil). Therefore in the presented research it is assumed that the domain of possible solutions is limited by a set of box constraints and only the solutions that fit within the constraints are classified as feasible.
Since the main focus of this paper is not constrained optimization, we selected a very simple procedure for immediately repairing unfeasible particles. Namely, the j-th coordinate of the solution x breaking its box constraints is trimmed to the exceeded limit, i.e.: if xj < loj then xj = loj, and if xj > hij then xj = hij. The procedure is applied in the same way to both types of particles, the classic and the quantum ones. In the case of classic particles the velocity vector v of the repaired particle stays unchanged even if it still leads the particle outside the acceptable search space.
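Below is a small Python sketch of the two proposed sampling rules (static and adaptive), combining Eqs. (2)-(6) with the trimming repair just described. It assumes the new location is generated around the sub-swarm attractor, as in the baseline quantum rule, and all function and variable names are illustrative.

import numpy as np

rng = np.random.default_rng(1)

def sas_variate(alpha):
    """Symmetric alpha-stable variate (beta = 0) via the
    Chambers-Mallows-Stuck method of Eq. (4)."""
    u = rng.uniform(-np.pi / 2.0, np.pi / 2.0)
    w = rng.exponential(1.0)
    if abs(alpha - 1.0) < 1e-12:
        return np.tan(u)                       # Cauchy case
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))

def new_quantum_location(center, alpha, r_sas, lo, hi, fitness_norm=None):
    """Static rule if fitness_norm is None, adaptive rule (Eq. 5) otherwise;
    fitness_norm is the particle's fitness normalized to [0, 1] as in Eq. (6)."""
    sigma = r_sas * (hi - lo) / 2.0            # Eq. (3), with Dw = hi - lo
    d = sigma * sas_variate(alpha)             # Eq. (2): distance from the original
    if fitness_norm is not None:
        d *= np.exp(-fitness_norm)
    theta = rng.normal(size=center.size)
    theta /= np.linalg.norm(theta)             # angularly uniform direction
    x = center + d * theta
    return np.clip(x, lo, hi)                  # trim to the box constraints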
4 Settings of the Algorithm's Parameters
The settings of the algorithm's parameters applied to the experiments presented below originate from [3]. In the cited publication the authors present results of experiments obtained for different configurations of swarms tested with the MPB benchmark where there are 10 moving peaks. Among the many tested configurations, the best results for the optimization problem with 10 moving peaks are obtained where there are 10 sub-swarms and each of them consists of five classic particles and five quantum ones (see Table III in [3]). The total population of particles consists of 100 solutions divided equally into 10 sub-swarms. The values of the pure PSO parameters are: c1,2 = 2.05 and χ = 0.7298. For QSO the range of exclusion is set to 31.5 (for the best performance the value of ρ should be set close to 30; however, the precision of this parameter's setting is not crucial, and in [3] the authors claim that the algorithm is not very sensitive to small changes of ρ). In the presented algorithm there is no strategy for detecting the appearance of a change in the fitness landscape. Since our main goal was studying the properties of the different distributions of the quantum particles, we assumed that it would just introduce yet another unnecessary bias into the obtained values of offline error and make their analysis even more difficult. Therefore a change is known to the system instantly as it appears and there is no additional computational effort for its detection. When the change appears, all the solutions stored in both classic and quantum particles are reevaluated and the swarm memory is forgotten. Classic particles' attractors are overwritten by the current solutions represented by these particles and sub-swarms' attractors are overwritten by the current best solutions in the sub-swarms.
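For reference, these settings can be collected into a single configuration object; a hedged sketch in Python follows (the key names are invented, the values are the ones quoted above).

MQSO_SETTINGS = {
    "num_subswarms": 10,
    "classic_particles_per_subswarm": 5,
    "quantum_particles_per_subswarm": 5,
    "c1": 2.05,
    "c2": 2.05,
    "chi": 0.7298,                 # constriction factor
    "exclusion_radius_rho": 31.5,  # range of exclusion
    "anti_convergence": False,     # switched off in the presented experiments
    "change_detection": None,      # changes are assumed to be known instantly
}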
5 Applied Measure and the Benchmark
In the performed experiments the offline error (briefly oe) measure [7,13] of obtained results was used. The offline error represents the average deviation from the optimum of the fitness of the best individual evaluated since the last change of the fitness landscape. Every time the solution’s fitness is evaluated,
an auxiliary variable is increased by the value which is the deviation of the best solution evaluated since the last change, including the one just evaluated as well. When the experiment is finished, the sum in the variable is divided by the total number of evaluations and returned as the offline error. Formally:

oe = (1/N_c) Σ_{j=1}^{N_c} [ (1/N_e(j)) Σ_{i=1}^{N_e(j)} ( f_j^∗ − f_{ji}^∗ ) ],    (7)

where N_c is the total number of changes of the fitness landscape in the experiment, N_e(j) is the number of evaluations of the solutions performed for the j-th state of the landscape, f_j^∗ is the value of the optimal solution for the j-th landscape (i.e. between the j-th and (j+1)-th change in the landscape) and f_{ji}^∗ is the current best found fitness value for the j-th landscape, i.e. the best value found among the ones belonging to the set from f_{j1} till f_{ji}, where f_{ji} is the value of the fitness function returned for its i-th call performed for the j-th landscape.
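A short Python sketch of Eq. (7), assuming the raw fitness values are logged per landscape state (all names and the toy numbers are illustrative):

def offline_error(optima, fitness_log):
    """optima[j]     : optimum value f_j* for the j-th landscape state,
    fitness_log[j]   : fitness values in the order they were evaluated
                       during the j-th state (N_e(j) entries)."""
    total = 0.0
    for f_opt, values in zip(optima, fitness_log):
        best_so_far = float("-inf")
        deviations = []
        for f in values:                    # f_ji* = best value seen so far
            best_so_far = max(best_so_far, f)
            deviations.append(f_opt - best_so_far)
        total += sum(deviations) / len(deviations)
    return total / len(optima)

# Usage with toy numbers: two states, optimum 70 each, a few evaluations.
print(offline_error([70.0, 70.0], [[40.0, 55.0, 60.0], [50.0, 65.0]]))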
6 Results of Experiments
We started our research by repeating some of the experiments presented in [3]. It was necessary to make the earlier results comparable to the current ones, since in our case the period of the first 10 changes in the environment is excluded from calculating the offline error, which makes the values of oe significantly smaller than those in [3]. We repeated experiments with the uniform distribution of new locations inside a quantum cloud (further called Mcloud) for a series of values of rcloud: from 0.05 to 4.5 with step 0.05. The three best values of oe were: 1.6264 (std.dev.: 0.4104), 1.6297 (std.dev.: 0.4062), 1.6298 (std.dev.: 0.5227) and they were obtained for rcloud: 0.30, 0.35 and 0.25, respectively.
Fig. 2. Offline error for Mαstatic, i.e. the static version (on the left), and for Mαadapt, i.e. the adaptive version (on the right): rSαS vs. α

Table 1. The three best values of offline error obtained for the two tested methods of the new location generation for the case with 10 moving peaks

method          offline error   std. dev.   σ      α
Mαstatic [1]    1.4603          0.3066      0.25   1.35
Mαstatic [2]    1.4665          0.4188      0.25   1.80
Mαstatic [3]    1.5023          0.3518      0.35   1.00
Mαadapt [1]     1.4614          0.3255      0.60   1.70
Mαadapt [2]     1.4722          0.3041      0.85   1.70
Mαadapt [3]     1.5008          0.3496      0.60   1.75
In our research we wanted to get performance characteristics of the tested distributions first. This way we are able to compare not only the best results possible to obtain for a given test-case but also information about the robustness of the searching engines and their sensitivity to changes in their parameter settings. Thus, two large groups of experiments were performed. We tested the static version of the α-stable symmetric distribution — Mαstatic, and the adaptive version of the α-stable symmetric distribution — Mαadapt. The sets of tests with the two methods were based on variation of the values of two of the methods' parameters: α and rSαS. The former parameter varied from 0.5 to 2 with step 0.05 while the latter – from 0.001 to 0.1 with step 0.001. It gave 3100 configurations for each of the approaches. They were tested on the same class of dynamic environments built by the MPB benchmark with 10 moving peaks and with its parameters set to the values defined above. Obtained values of offline error for the two groups of experiments are presented in Fig. 2. In Table 1, for each of the distributions, the best three configurations and the values of offline error for each of the three are presented. The offline error for Mcloud is higher than for the two remaining methods.
Fig. 3. The first 10% of values of offline error sorted ascending for the two tested distributions – left side graph, and a zoom to the best first 20 values – right side graph
The significance levels obtained with Student's t tests indicate a difference between Mcloud and the distributions over the unlimited area: for both tests between Mcloud and the two methods, the significance level is lower than the commonly accepted level of 0.05 (for Mαstatic – p = 0.025437, for Mαadapt – p = 0.029881). Apart from Fig. 2 and Table 1, yet another quantitative comparison was performed based on the same set of results. The analysis is presented in Fig. 3. To generate Fig. 3, for each of the methods the values of offline error obtained for the tested configurations of parameters were sorted ascending. This way we can observe the sensitivity of the methods to changes in the parameter values as well as to small disturbances in the optimized fitness landscape. We can compare not only the best results but also the number of other configurations giving satisfying results, i.e. the size of the area of useful configurations of the distributions. A graph with the first 10% of sorted values of offline error is in Fig. 3. Both curves in Fig. 3 start from the level of offline error which is presented in Table 1 and which is almost the same for each of them. However, they quickly branch. It is clearly visible that the adaptive method outperforms the static method. The advantage is that the adaptive method is much more tolerant of a poor fit between the method's parameters and the properties of the fitness landscape. In other words, an appropriately tuned algorithm does very well with each of the methods; however, if the algorithm is not perfectly tuned to the problem, the loss of performance is much smaller for the adaptive method of distribution.
7 Conclusions
In this paper two methods of evaluation of the quantum particle’s new position are experimentally verified: the static and the adaptive one. Both methods employ the α-stable symmetrically distributed random variable. The distribution is controlled by the parameter α, which is responsible for the density of distribution of new locations around the quantum particle. Obtained results are satisfactory: they are better than those for uniform distribution of new locations
in the limited area around the quantum particle. Besides, the results of the series of experiments visualized in Fig. 3 showed that the adaptive method of distribution is less sensitive to small changes in its parameters than the static one.
References 1. Blackwell, T.: Particle Swarm Optimization in Dynamic Environments. In: Evolutionary Computation in Dynamic and Uncertain Environments. Studies in Computational Intelligence, vol. 51, pp. 29–49. Springer, Heidelberg (2007) 2. Blackwell, T., Branke, J.: Multi-swarm optimization in dynamic environments. In: Raidl, G.R., Cagnoni, S., Branke, J., Corne, D.W., Drechsler, R., Jin, Y., Johnson, C.G., Machado, P., Marchiori, E., Rothlauf, F., Smith, G.D., Squillero, G. (eds.) EvoWorkshops 2004. LNCS, vol. 3005, pp. 489–500. Springer, Heidelberg (2004) 3. Blackwell, T., Branke, J.: Multiswarms, exclusion, and anti-convergence in dynamic environments. IEEE Trans. Evol. Comput. 10(4), 459–472 (2006) 4. Li, X.: Adaptively choosing neighborhood bests in a particle swarm optimizer for multimodal function optimization. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3102, pp. 105–116. Springer, Heidelberg (2004) 5. Li, X., Branke, J., Blackwell, T.: Particle swarm with speciation and adaptation in a dynamic environment. In: GECCO 2006: Proc. Conf. on Genetic and Evolutionary Computation, pp. 51–58. ACM Press, New York (2006) 6. Parrot, D., Li, X.: Locating and tracking multiple dynamic optima by a particle swarm model using speciation. IEEE Trans. Evol. Comput. 10(4), 440–458 (2006) 7. Branke, J.: Memory enhanced evolutionary algorithm for changing optimization problems. In: Proc. of the Congress on Evolutionary Computation, pp. 1875–1882. IEEE Press, Piscataway (1999) 8. Clerc, M., Kennedy, J.: The particle swarm-explosion, stability, and convergence in a multi-dimensional complex space. IEEE Trans. Evol. Comput. 6(1), 58–73 (2002) 9. Marsaglia, G.: Choosing a point from the surface of a sphere. Ann. Math. Statist. 43(2), 645–646 (1972) 10. Chambers, J.M., Mallows, C.L., Stuck, B.W.: A method for simulating stable random variables. J. Amer. Statist. Assoc. 71(354), 340–344 (1976) 11. Weron, R.: On the Chambers-Mallows-Stuck method for simulating skewed stable random variables. Statist. Probab. Lett. 28, 165–171 (1996) 12. de Castro, L.N., Timmis, J.: An artificial immune network for multimodal function optimization. In: Proc. of the IEEE Congress on Evolutionary Computation, vol. 1, pp. 674–699. IEEE Press, Piscataway (2002) 13. Branke, J.: Evolutionary Optimization in Dynamic Environments. Kluwer Academic Publishers, Dordrecht (2002) 14. Trojanowski, K.: B-cell algorithm as a parallel approach to optimization of moving peaks benchmark tasks. In: Sixth International Conf. on Computer Information Systems and Industrial Management Applications (CISIM 2007), IEEE Computer Society Conf. Publishing Services, pp. 143–148 (2007)
An Integer Linear Programming for Container Stowage Problem Feng Li, Chunhua Tian, Rongzeng Cao, and Wei Ding IBM China Research Laboratory, Beijing 100094, P.R. China {lfeng, chtian, caorongz, dingw}@cn.ibm.com
Abstract. So far most stowage planning approaches consider only the stowage of a single type of container or do not consider the stability of the containership. Here a 0-1 linear programming model for this problem is proposed. The objective is to maximize space utilization while minimizing the operation cost from the upload and download of different types of containers at each port of a multi-port journey, with vessel stability, industry regulations and customized rules as the constraints. The model is solved with the branch and cut algorithm from COIN-OR (Common Optimization Interface for Operations Research), and a simulation system is developed to verify the feasibility and practicability of the model. By using the branch and cut algorithm of COIN-OR, the simulation system has shown that our model is efficient and robust. Keywords: Container Stowage Planning, Integer Programming.
1 Introduction
Since the 1970s, containerization has increasingly facilitated the transportation of cargos. Nowadays over 60 percent of deep-sea general cargo is transported in containers, whereas some routes, especially between economically strong and stable countries, are containerized up to 100 percent [1]. There are lots of shipping companies competing around the world to provide profitable container transportation services. In order to increase the benefit of economy of scale, the size of containerships has grown. The increase in capacity has been typically from relatively small 350 Twenty Foot Equivalent Units (TEUs) to containerships with capacities of more than 8000 TEUs [2]. While the increase of containership size contributes to incremental profits of shipping companies, it also brings a drawback, enlarging the complexity and difficulty of the arrangement of containers. The arrangement of containers for a containership is usually called the Container Stowage Problem (CSP). CSP is complicated because of its combinatorial nature. In recent operations research and management science literature, the methods developed to solve it can be grouped into several main classes: mathematical modeling, simulation based upon probability, decision support systems and heuristics. Among them, there are several objectives for the problem: optimizing available space and preventing
damage, minimizing berthing time [3], minimizing the total number of shiftings of containers [4], maximizing the containership's utilization, and so on. Unfortunately those approaches share some common limitations and are mainly devoted to the loading problem. The well-known mathematical model for CSP is integer linear programming [5]. Although those models can provide optimal solutions for CSP, they have incorporated too many simplification hypotheses, which have made them unsuitable for practical applications. The first simulation approach was completed by Shields [6]. In his work, a small number of stowage plans are randomly created to be evaluated and compared by simulation of the voyage, and the best is selected. Although it is efficient in practice, it does not guarantee the optimality of the solutions, and it also takes a long computation time to find a reasonably good solution. Later, Saginaw and Perakis [7] and Shin and Nam [8] developed decision support systems that are based on the knowledge of a manager or an operator in charge of loading and unloading operations, and Wilson et al. presented a realistic model, taking into account all technical restrictions, in order to implement a commercially usable decision support system. They decomposed the CSP into two phases: a strategic process and a tactical process [9]. However, those decision support systems only show a solution of a small sample problem and their efficiency from the practical viewpoint is not shown. The first heuristic for CSP was proposed by Martin et al. [10]. They addressed CSP for the transtainer system, and developed a heuristic algorithm to solve it. Since then Todd and Sen implemented a genetic algorithm procedure with multiple criteria such as proximity in terms of container location on board and minimization of unloading-related re-handles, transverse moment and vertical moment [11], Haghani and Kaisar developed a heuristic algorithm for CSP that minimizes container handling cost while keeping the containership's GM acceptable [12], and Dubrovsky et al. implemented a genetic algorithm-based heuristic for CSP [13]. These heuristics can produce complete and practical but rarely near-optimum solutions to the container stowage problem. Considering the problems mentioned above, an extended model for CSP will be proposed in this paper, which is used to find an optimal plan for stowing containers of different sizes into a containership on a multi-port journey, with a set of structural and operational restrictions. The objective is to maximize the containership utilization while minimizing the operation cost from container re-allocation. Such constraints and the objective are first described in detail, then a basic 0-1 Linear Programming model is proposed. Subsequently, a simulation system is presented to illustrate the efficiency of the model and compare it with a random stowage strategy.
2 Container Stowage Problem
When solving CSP, of particular interest are the constraints related to the structure of the containership and the size of the hold and upper deck. We consider here two types of containerships, namely Ro-Ro (Roll on-Roll off), which load/unload containers through the ramps located either at the bow or the stern
Fig. 1. Containership Structure
of the ship, and Lo-Lo (Lift on-Lift off), which load/unload containers from the top (by using cranes). The basic structure of a containership and its cross section are shown in Figure 1, which provides an idea of how container stowage takes place. There is a given number of locations for placing containers, which can vary depending on the containership. The most common location is 8 feet in height, 8 feet in width and 20 feet in length. Each location is identified by three indices, each consisting of two numbers, giving its position with respect to the three dimensions. In practice, each location is addressed by the following identifiers: (1) bay, which gives its position relative to the cross section of the containership (counted from bow to stern); (2) row, which gives its position relative to the vertical section of the corresponding bay (counted from the center outwards); (3) tier, which gives its position relative to the horizontal section of the corresponding bay (counted from the bottom to the top of the containership). Thus a container will be located in a given bay, a given row and a given tier. In order to describe CSP in detail, we propose the following notation. Bay Index Sets: Let I denote the index set of bays, A and B the index sets of the anterior-part and back-part bays, respectively, and E and O the index sets of the even and odd bays, respectively, where I = A ∪ B, I = E ∪ O, A ∩ B = ∅ and E ∩ O = ∅. Row Index Sets: Let Ri denote the index set of rows for the i-th bay, and LRi and RRi the index sets of the left and right rows of the i-th bay, respectively. For example, the row set of the second bay of the containership shown in Figure 1 is {09, 07, 05, 03, 01, 02, 04, 06, 08, 10}; then R2 = {1, 2, ..., 10}, LR2 = {1, 2, ..., 5} and RR2 = {6, 7, ..., 10}. Tier Index Sets: Let Tij denote the index set of tiers for the i-th bay and the j-th row of Ri, and let UTij denote the index set of upper-deck tiers for the i-th bay and the
j-th row of Ri, and BTij the index set of below-deck tiers of the i-th bay and the j-th row of Ri. Obviously, UTij ∪ BTij = Tij and UTij ∩ BTij = ∅. Port Set: Let P denote the set of ports on the journey of the containership, and suppose P = {1, 2, ..., |P|}. Container Sets: Let Cp (p = 1, 2, ..., |P|) denote the set of containers to be loaded on the containership at port p, and let C = ∪_{p=1}^{|P|} Cp. Let wc and dc denote the weight and destination port of container c, ∀c ∈ C, respectively. In this paper only two types of containers are considered; let F denote the set of 40-foot containers and W the set of 20-foot containers. Let Ĉp (p = 1, 2, ..., |P|) denote the set of containers that were loaded at some earlier port i (i = 1, 2, ..., p − 1) and whose destination ports come after p. Obviously, when p = 1, Ĉp = ∅.
2.1 Constraints
Using these notations, some structural and operational restrictions can be described as follows. Because the number of locations provided by the containership is finite, the maximum number of containers that can be stowed on the containership is limited. Thus, at the current port p,

\sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} x_{ijkc} \cdot s_c \le m_t,    (1)

\sum_{i \in E} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} x_{ijkc} \le m_f,    (2)

where x_{ijkc} are the decision variables of the optimization, with the following specification:

x_{ijkc} = \begin{cases} 1, & \text{if container } c \text{ is stowed in location } \{ijk\}, \\ 0, & \text{otherwise,} \end{cases}    (3)

and location {ijk} means the i-th bay, the j-th row of R_i and the k-th tier of T_{ij}. s_c denotes the size of container c ∈ C, with

s_c = \begin{cases} 1, & \text{if container } c \text{ is a 20-foot container,} \\ 2, & \text{if container } c \text{ is a 40-foot container.} \end{cases}    (4)

m_f denotes the maximum number of 40-foot containers that can be loaded on the ship, and m_t the maximum number of 20-foot containers that can be stowed on the ship. At the same time, each location of the containership can hold at most one container and each container occupies one and only one location of the containership, therefore

\sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} x_{ijkc} \le 1, \quad \forall c \in C_p \cup \hat{C}_p,    (5)

\sum_{c \in C_p \cup \hat{C}_p} x_{ijkc} \le 1, \quad \forall i \in I, \forall j \in R_i, \forall k \in T_{ij}.    (6)
The total weight of all containers stowed on the containership cannot exceed the maximum weight capacity of the containership:

\sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} x_{ijkc} \cdot w_c \le Q,    (7)

where Q denotes the maximum weight capacity of the containership. In practice, a 40-foot container cannot be stowed in an odd bay and a 20-foot container cannot be stowed in an even bay; stowing a 20-foot container in an odd bay contiguous to an even-bay location already chosen for a 40-foot container is infeasible, and conversely; a 20-foot container cannot be stowed on top of a 40-foot container; and no container may be stowed hanging without support underneath. These rules give

\sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} = 0, \quad \forall i \in O, j \in R_i, k \in T_{ij},    (8)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{ijkc} = 0, \quad \forall i \in E, j \in R_i, k \in T_{ij},    (9)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i+1)jkc} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in T_{ij},    (10)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i-1)jkc} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in T_{ij},    (11)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i+1)j(k+1)c} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in UT_{ij}, (k+1) \in UT_{(i+1)j},    (12)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i-1)j(k+1)c} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in UT_{ij}, (k+1) \in UT_{(i-1)j},    (13)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i+1)j(k+1)c} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in BT_{ij}, (k+1) \in BT_{(i+1)j},    (14)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i-1)j(k+1)c} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in BT_{ij}, (k+1) \in BT_{(i-1)j},    (15)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{ij(k+1)c} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 0, \quad \forall i \in O, j \in R_i, k \in UT_{ij}, k+1 \in UT_{ij},    (16)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{ij(k+1)c} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 0, \quad \forall i \in O, j \in R_i, k \in BT_{ij}, k+1 \in BT_{ij},    (17)

2 \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ij(k+1)c} - 2 \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i-1)jkc} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i+1)jkc} \le 0, \quad \forall i \in E, j \in R_i, k \in UT_{ij}, k+1 \in UT_{ij},    (18)

2 \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ij(k+1)c} - 2 \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i-1)jkc} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i+1)jkc} \le 0, \quad \forall i \in E, j \in R_i, k \in BT_{ij}, k+1 \in BT_{ij}.    (19)
The stability of the ship is very important for deep-sea container transportation. Since an excessively unbalanced vertical, transverse and longitudinal distribution of the ship's weight, which is the factor with the greatest influence on stability, makes the ship unstable, a bad stowage plan may result in the instability of the ship. To assure the stability of a containership, a stowage plan should satisfy several operational constraints. In this study the following three factors are introduced to describe the vertical, transverse and longitudinal distribution of the ship's weight: metacentric height (GM), heel and trim. For a ship to be stable, GM must be greater than the minimum allowable metacentric height, the heel should be small or at least smaller than a given bound, and the trim should also be close to zero or at least within certain prespecified limits for good performance of the ship. Hence a good stowage plan should make GM as large as possible and heel and trim as small as possible. Thus, the following constraints should be satisfied:

w_c \cdot x_{ijkc} - w_e \cdot x_{ij(k+1)e} \ge 0, \quad \forall c, e \in C_p \cup \hat{C}_p, \forall i \in I, j \in R_i, k \in UT_{ij},    (20)

w_c \cdot x_{ijkc} - w_e \cdot x_{ij(k+1)e} \ge 0, \quad \forall c, e \in C_p \cup \hat{C}_p, \forall i \in I, j \in R_i, k \in BT_{ij},    (21)

-Q1 \le \sum_{i \in A} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} w_c \cdot x_{ijkc} - \sum_{i \in B} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} w_c \cdot x_{ijkc} \le Q2,    (22)

-Q3 \le \sum_{i \in I} \sum_{j \in LR_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} w_c \cdot x_{ijkc} - \sum_{i \in I} \sum_{j \in RR_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} w_c \cdot x_{ijkc} \le Q3,    (23)

where Q1, Q2 and Q3 are given tolerances for the stability of the ship. In order to guarantee that the containers in Ĉ_p stay on the containership when it sails on to its next destination, the following equations should be satisfied:

\sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} x_{ijkc} = 1, \quad \forall c \in \hat{C}_p.    (24)
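To make the flavour of these constraints concrete, the sketch below shows how restrictions of the form (5)-(7) could be written with the open-source PuLP modeller in Python. The slot list, container weights and capacity value are invented toy data, and the helper names are ours, not the authors'; the paper's own experiments use the COIN-OR branch-and-cut solver mentioned in the Conclusion.

import pulp

# toy data (hypothetical): three slots (bay, row, tier) and three containers with weights in tons
slots = [(1, 1, 1), (1, 1, 2), (2, 1, 1)]
weight = {"c1": 20.0, "c2": 24.0, "c3": 18.0}
Q = 60.0  # assumed maximum weight capacity of the ship

# binary variable x[s][c] = 1 if container c is stowed in slot s, as in definition (3)
x = pulp.LpVariable.dicts("x", (slots, list(weight)), cat=pulp.LpBinary)

model = pulp.LpProblem("stowage_sketch", pulp.LpMaximize)

# simplified stand-in for objective (26): maximise the number of stowed containers
model += pulp.lpSum(x[s][c] for s in slots for c in weight)

# (5): each container occupies at most one slot
for c in weight:
    model += pulp.lpSum(x[s][c] for s in slots) <= 1

# (6): each slot holds at most one container
for s in slots:
    model += pulp.lpSum(x[s][c] for c in weight) <= 1

# (7): total stowed weight must not exceed the capacity Q
model += pulp.lpSum(weight[c] * x[s][c] for s in slots for c in weight) <= Q

model.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[model.status])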
2.2 Objective Function
In the container transportation industry, containerships make repeated tours of a sequence of ports according to their planned routes. At each port on a tour of
a containership, containers are unloaded and additional containers destined for subsequent ports are loaded. The time required for loading and unloading depends on the arrangement of the cargo on board the ship, i.e. the stowage plan, which specifies where each container is loaded on the ship. Stowage plans, if not prepared well enough, may cause unnecessary handling time, namely the time required by the gantry cranes for temporarily unloading and re-loading containers at the ports. Consequently, port efficiency and ship utilization are largely affected by stowage plans. In order to minimize the total re-handling of the containers on the ship, we propose the following objective function:

\min \sum_{p=2}^{|P|-1} \sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} \left[ \sum_{m \in T_{ij};\, m > k} \left( \sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{ijmc} - \sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{ijkc} \right) \right].    (25)
While controlling the shifting number of the containers, we still want to improve the utilization of the containership. Thus we need another objective function, namely

\max \sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} s_c \cdot x_{ijkc}.    (26)

Definition 1. A vertical unit of containers means a set of containers with the same row and whose bay numbers differ by less than 1. Obviously, the value of (25) is not the real shifting number of the containers. In fact, the shifting number of the containers can be calculated as follows:

\sum_{p=2}^{|P|-1} \sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{m \in T_{ij};\, m > k} V_{ijmkp},    (27)

where

V_{ijmkp} = \begin{cases} 1, & \text{if } \sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{ijmc} > \sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{ijkc}, \\ 0, & \text{otherwise.} \end{cases}

Theorem 1. If the container stowage problem is built from the objective functions (25)-(26) and the constraints (1)-(24), it minimizes the container shifting number.

Proof. Obviously, we only need to prove that when (25) attains its minimum, (27) also attains its minimum. Since only the containers within a vertical unit can cause container shifting, it suffices to prove the claim for an arbitrary vertical unit. Without loss of generality, suppose the bay number of the vertical unit is i' and its row number is j'. When (25) attains its minimum we obtain

\sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{i'j'mc} \le \sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{i'j'kc}, \quad \text{if } m > k, \; m, k \in T_{i'j'}.
Therefore, when (25) attains its minimum, the shifting number is zero, and zero is the minimum of (27).
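For intuition about what (27) counts, note that within one vertical stack a shift is needed whenever a container higher in the stack is destined for a later port than a container beneath it. The following self-contained snippet (our own illustration, not the authors' code) counts such pairs for a single bay/row stack:

def shifts_in_stack(dest_bottom_to_top):
    """Count tier pairs (k, m) with m above k whose destination order d_m > d_k,
    i.e. the box below must be dug out before the box above is unloaded."""
    shifts = 0
    for k, d_low in enumerate(dest_bottom_to_top):
        for d_high in dest_bottom_to_top[k + 1:]:
            if d_high > d_low:
                shifts += 1
    return shifts

# destinations (port order) from bottom tier to top tier of one stack
print(shifts_in_stack([3, 1, 2]))  # 1 shift: a later-port box sits above the port-1 box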
3 Numerical Examples
According to this integer programming container stowage planning model, we can generate an optimal container stowage strategy for ocean shipping liners. To verify the efficiency of the integer model for the container stowage problem, we developed a simulation system. In this system there are two stowage engines: the first one is our model, the second one is "random". Here "random" means that the containers are stowed onto the containership randomly; that is, for any container we randomly generate a position given by bay, row and tier, and if the position is available we stow the container there, otherwise we generate another position until an available one is found (a sketch of this engine is given at the end of this section). In order to display the results of the two stowage engines, a 3D view of the containership was developed. The 3D view is built on top of a containership generator; by using this generator, any type of containership can be built. Besides the stowage engines, the 3D view and the containership generator, the simulation system contains data input, container generator, vessel operation calculation and report modules. The data input module reads the route (a sequence of ports) of the containership. The container generator produces containers with different types, sizes and destinations. The vessel operation calculation module calculates the shifting number of containers at each port, the utilization of the vessel, and the weight balance of the vessel, including the absolute weight difference between the front and rear parts and the absolute weight difference between the left and right parts. The report module shows all of the calculation results and the statistics of the containers on the vessel, such as how many containers share the same destination. To verify the efficiency of the model, a containership with 20 bays, 10 rows and 10 tiers was built. We suppose that the maximum capacity of the containership is 800 TEU and its maximum load capacity is 18,000 tons. For this vessel we build a route with 10 ports. At each port we randomly generate containers of different types, weights and destinations. Each time, we generate all containers and then sail the containership: we load all of the containers at the first port of the route, then unload the containers for the second port and load further containers for the coming ports, and so on. We ran the stowage process about 1000 times, and the stowage results are shown in Table 1 and Figures 2–5. Table 1 shows that the shifting number, the utilization of the containership, and the weight balance difference of the vessel are all better for the optimal stowage strategy than for the random strategy. By using our model, the number of container shifts has been cut down to zero. The irrationality of the random stowage strategy may result in some containers not being loaded on the vessel. Since the random stowage strategy does not consider the weight balance
Table 1. Comparison results for the random and optimal stowage strategies

                                     Random              Optimal
                                   Mean   Variation    Mean   Variation
  Shifting Number                  1800   200          0      0
  Utilization                      89%    0.10         95%    0.04
  Weight Balance Difference (TON)  50     10           8      2
Fig. 2. 3D view for the random stowage strategy (up deck)
Fig. 3. 3D view for the random stowage strategy (down deck)
Fig. 4. 3D view for the optimal stowage strategy (up deck)
Fig. 5. 3D view for the optimal stowage strategy (down deck)
of the vessel, the absolute weight differences between the front and back parts and between the left and right parts of the vessel are larger than those of the optimal stowage strategy. Thus the stability of the vessel is improved by the optimal stowage strategy.
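For reference, the "random" stowage engine described at the beginning of this section can be sketched as follows; the function and parameter names are hypothetical and the slot/container representation is simplified to plain identifiers.

import random

def random_stowage(containers, slots, seed=0):
    """Toy version of the 'random' engine: keep drawing bay/row/tier positions
    until a free one is found, then place the container there."""
    rng = random.Random(seed)
    free = set(slots)
    plan = {}
    for c in containers:
        if not free:
            break  # the ship is full; remaining containers stay ashore
        while True:
            s = rng.choice(slots)
            if s in free:
                plan[c] = s
                free.remove(s)
                break
    return plan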
4 Conclusion
In this paper, an integer programming model for the container stowage problem is proposed. It is suitable for multi-port and multimodal container transportation services. In this model, the container loading process and the re-handling process are considered in order to minimize the container shifting number and maximize the utilization of the containership. At the same time, several practical rules of the ocean shipping industry and vessel stability requirements are included as constraints of the model. A simulation system was developed to verify the feasibility and practicability of the model. Using the branch-and-cut algorithm of COIN-OR, the simulation system has shown that our model is efficient and robust.
References
1. Steenken, D., Voß, S., Stahlbock, R.: Container terminal operation and operations research: a classification and literature review. OR Spectrum 26, 3–49 (2004)
2. Methodologies for reducing truck turn time at marine container terminals. Research Report SWUT/05/167830-1
3. Atkins, W.H.: Modern marine terminal operations and management. Boyle, Oakland (1991)
4. Kang, J.G., Kim, Y.D.: Stowage planning in maritime container transportation. Journal of the Operational Research Society 53, 415–426 (2002)
5. Ambrosino, D., Sciomachen, A., Tanfani, E.: Stowing a containership: the master bay plan problem. Transportation Research Part A 38, 81–99 (2004)
6. Shields, J.J.: Container-ship stowage: a computer-aided preplanning system. Marine Technology 21, 370–383 (1984)
7. Saginaw, D.J., Perakis, A.N.: A decision support system for containership stowage planning. Marine Technology 26, 47–61 (1989)
8. Shin, J.Y., Nam, K.C.: Intelligent decision support system for containership autostowage planning. Journal of Korean Institute Port Research 9, 19–32 (1995)
9. Wilson, I.D., Roach, P.A., Ware, J.A.: Container stowage pre-planning: using search to generate solutions, a case study. Knowledge-Based Systems 14, 137–145 (2001)
10. Martin, J., Randhawa, S.U., McDowell, E.D.: Computerized container-ship load planning: A methodology and evaluation. Computers & Industrial Engineering 9, 357–369 (1988)
11. Todd, D.S., Sen, P.: A multiple criteria genetic algorithm for containership loading. In: Proceedings of the Seventh International Conference on Genetic Algorithms, pp. 674–681 (1997)
12. Haghani, A., Kaisar, E.I.: A model for designing container loading plans for containerships. In: Annual Conference of the Transportation Research Board (2001)
13. Dubrovsky, O., Levitin, G., Penn, M.: A genetic algorithm with a compact solution encoding for the container ship stowage problem. Journal of Heuristics 8, 585–599 (2002)
Using Padding to Optimize Locality in Scientific Applications
E. Herruzo (Dept. Electronics, University of Córdoba, Spain, [email protected]), O. Plata and E.L. Zapata (Dept. Computer Architecture, University of Málaga, Spain, [email protected], [email protected])
Abstract. Program locality exploitation is a key issue to reduce the execution time of scientific applications, so as many techniques have been designed for locality optimization. This paper presents new compiler algorithms based on array padding that optimize program locality either locally (at loop level) or globally (the whole program). We first introduce a formal cache model that is used to analyze how all cache levels are filled up when arrays inside nested loops are referenced. We further study the relation between the model parameters and the data memory layout of the arrays, and define how to pad those arrays in order to optimize cache occupation at all levels. Experimental evaluation on some numerical benchmarks shows the benefits of our approach.
1
Introduction
Over the last decades locality exploitation has been one of the main goals for improving the performance of scientific applications, giving rise to a wide range of software optimizations. Nowadays two locality-related trends can be observed: applications process larger and larger data sets, and the processor-memory gap problem is more and more significant. The memory latency problem has been attacked from two different fronts. On the one hand, by means of hardware solutions, like lockup-free caches, multithreading, prefetching, out-of-order execution, data and instruction speculation and so on. On the other hand, by means of compiler techniques for code and/or data transformations [1]. Array padding is a well-known data layout optimization technique that optimizes locality by reducing conflict misses. Although it is a global technique (it affects the whole program), its use can be localized to the nested loop (or few loops) where most of the execution time is spent (a frequent case in scientific applications). This paper presents a simple model of the cache that captures essential information of its behaviour during the execution of a loop nest. This model is used as a framework to define how to pad the arrays in the loop in order to optimize cache occupation. Our method establishes a relationship among a small set of cache parameters, how the array elements are referenced and how they are stored in memory in order to obtain the optimal padding that optimizes cache occupation. The proposed method is subsequently extended to optimize cache
locality for the whole program (global code optimization) as well as for the complete cache hierarchy. The rest of the paper is organized as follows. First, the cache model developed to design our padding methods is introduced. In the next section, our intra-array padding approach is developed for three scenarios: a single loop nest optimization, all loops in the program (global optimization) and all levels in the cache hierarchy. After that, the proposed techniques are experimentally evaluated for a wide set of numerical benchmarks. Finally, some related work is discussed.
2
Modelling the Cache Behaviour
A minimal set of characteristic parameters will be defined with the aim of optimizing cache occupation by using array padding. We consider an L-way set-associative cache of size C × L × W, where C is the number of cache sets, L is the number of blocks per set, and W is the block size in words. We also consider that the array access patterns come from referencing an M-dimensional array, X, within an N-depth nested loop. Expressions in the array dimensions are of the form f_k ∗ I_k, k = 1, . . . , M, where I = (I_1, I_2, . . . , I_N) is the iteration vector of the loop, but any general affine expression is perfectly valid. Without loss of generality, we assume that N = M. To simplify the explanation, we restrict the cache model to a single array X inside a perfectly nested loop. However, this model can be extended to several arrays appearing in the same loop and to not-perfectly nested loops. When the multidimensional array X is allocated in memory, it is linearized and laid out in some order. So, considering for instance a column-major order, the offset (in words) from the beginning of the array of the element of X referenced in the iteration I = (I_1, I_2, . . . , I_M) of the nested loop is given by

ArrOff(X, I) = f_1 \cdot I_1 + \ldots + f_k \cdot I_k \cdot \prod_{i=1}^{k-1} D_i + \ldots + f_M \cdot I_M \cdot \prod_{i=1}^{M-1} D_i,

where D_i is the size of the i-th dimension of X. The stride of array X on loop index I_k is defined as the distance in memory (in words) of array entries referenced by consecutive iterations of loop k, that is,

Stride(X, I_k) = f_k \cdot I_k^{l+1} \cdot \prod_{i=1}^{k-1} D_i - f_k \cdot I_k^{l} \cdot \prod_{i=1}^{k-1} D_i,

where I_k^{l} represents the l-th iteration of loop k. Let us consider now the cache. The execution of consecutive iterations of loop k generates array references separated in memory by a word distance equal to Stride(X, I_k). We now define two new cache-related strides. The above array references are contained in cache blocks (of size W words each). The distance (in blocks) between these cache blocks is defined as the cache block stride, that is,

BlockStride(X, I_k) = block(X, I_k^{l+1}) - block(X, I_k^{l}) = \lfloor MemAddr(X, I_k^{l+1})/W \rfloor - \lfloor MemAddr(X, I_k^{l})/W \rfloor.

Note that although Stride() is constant, BlockStride() may not be constant, depending on the relative offset of the memory references in their corresponding blocks. On the other hand, the distance (in cache sets) of the above blocks once they are placed into the cache is defined as the cache set stride, that is,

SetStride(X, I_k) = (block(X, I_k^{l+1}) - block(X, I_k^{l})) \bmod C = BlockStride(X, I_k) \bmod C.

We are assuming BlockStride is a positive integer, in order to simplify the expressions
(otherwise, we have to take absolute values and consider negative loop steps). With this definition, if we take as a reference the first iteration of loop k (iteration Lw), we can use the following expression to calculate the set in which the block referenced in the l-th iteration of the loop is located:

Set(X, I_k^{l}) = (Set(X, I_k^{Lw}) + l \cdot SetStride(X, I_k)) \bmod C.    (1)
This expression assumes that SetStride() is constant for all iterations of the loop k, which is true if BlockStride() is also constant for that loop. This condition is fulfilled if all array references have the same block offset, that is, if Stride(X, Ik ) is a multiple of W . Our method to optimize cache occupation considers that this is the case. Otherwise, we can always reach this condition by incrementing the k dimension of X by the amount Stride(X, Ik ) mod W .
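To make the three strides concrete, the short Python sketch below evaluates Stride and SetStride for a column-major array traversed along one dimension; the function names and example sizes are ours, and the 32 KB, 2-way, 32-byte-line L1 configuration is the one used later in Section 4.

def stride_words(dims, k, f_k=1):
    """Stride(X, I_k) in words for a column-major array with dimension sizes dims
    (k is 0-based here: k = 1 walks the second dimension)."""
    prod = 1
    for d in dims[:k]:
        prod *= d
    return f_k * prod

def set_stride(dims, k, W, C):
    """SetStride(X, I_k) = (Stride / W) mod C, assuming Stride is a multiple of W
    as the model requires."""
    return (stride_words(dims, k) // W) % C

# example: a 1600 x 1600 array, W = 8 words per block, C = 512 cache sets
print(stride_words((1600, 1600), 1))            # 1600 words
print(set_stride((1600, 1600), 1, W=8, C=512))  # 200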
3
Optimizing Cache Occupation by Array Padding
We propose a new approach to determine how to pad arrays in order to optimize cache occupation, and consequently reduce miss rates. Padding techniques modify the portion of memory reserved for storing array data by including empty memory zones [1]. We focus our analysis on intra-array padding, where the empty memory zones are included among the array dimensions; that is, arrays are redimensioned by increasing the size of some of their dimensions.
3.1 Single Loop Nest
Consider first the case of a single loop nest and a single-level cache hierarchy. In our analysis we are specially interested in those arrays with a stride greater than the cache block size (W ) in the innermost loop of the nest. The cache occupation for these arrays may suffer from block replacements by self-interference, and some cache blocks may remain unoccupied. This happens when SetStride(X, Ik ) > 1. This situation is very common in scientific applications, but the interesting problem occurs when the number of sets involved in the replacements is small. These cases shows an inefficient use of the cache space. Our goal is to change, in these cases, the array memory layout through padding to maximize the cache occupation, and consequently reduce the miss rate (due to self-replacements). Next lemma explains how to determine the dimension increments (intra-array padding) needed to maximize cache occupation. Lemma. Given a loop k in the nest and the array X, the occupation of the cache is maximized if SetStride(X, Ik ) and C are mutually prime, that is, GCD(SetStride(X, Ik ), C) = 1 (GCD condition). Proof. We will prove that if SetStride(X, Ik ) is relatively prime to C then the first C iterations of the loop k touch different cache sets. In that case, if the loop k is longer than C then all cache sets are occupied by references to array X in that loop. If it is shorter, the number of touched cache sets is maximum
Select all X_i arrays with stride greater than W on the innermost loop (I_M)
for (each X_i array) do
  if (SetStride(X_i, I_M) is even) then
    NewStride(X_i, I_M) = Stride(X_i, I_M) + W
  endif
endfor
Fig. 1. Intra-array padding algorithm for a single loop nest and a single cache
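The rule in Fig. 1 can be transcribed almost literally; the sketch below is our own rendering under the same assumption that the stride is already a multiple of W, and it reproduces the 1600 -> 1608 padding used in the evaluation section.

from math import gcd

def pad_single_nest(stride, W, C):
    """Return a padded stride for the innermost non-unit-stride reference.
    Adding one cache block (W words) makes SetStride odd and therefore
    coprime with the power-of-two set count C (the GCD condition)."""
    set_stride = (stride // W) % C
    if gcd(set_stride, C) != 1:   # with C a power of two this just tests evenness
        return stride + W
    return stride

print(pad_single_nest(1600, W=8, C=512))  # 1608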
(as many as the length of the loop). So, let us consider two of the first C iterations of the loop k, say p and q. Note that 0 ≤ p, q < C. Let us assume that references to X in those two iterations are located to the same cache set, that is, Set(X, Ikp ) = Set(X, Ikq ). In that case, from Eq. (1), we have, (Set(X, IkLw ) + p ∗ SetStride(X, Ik )) mod C = (Set(X, IkLw ) + q ∗ SetStride(X, Ik )) mod C, or, eliminating the module operation, Set(X, IkLw ) + p ∗ SetStride(X, Ik ) = Set(X, IkLw ) + q ∗ SetStride(X, Ik ) + C ∗ r, where r is an integer number. A bit of simplification gives, (p − q) ∗ SetStride(X, Ik ) = C ∗ r. Given that SetStride(X, Ik ) and C are mutually prime, p − q must be divisible by C, as r is an integer number. But | p − q | is lower than C, so we came to a contradiction. Thus, references to X in both iterations must touch different cache sets. And this must be true for any pair of such iterations. Q.E.D. In the case that SetStride(X, Ik ) and C are not mutually prime then the lemma does not hold. However, if β is the greatest common factor between both values, then SetStride (X, Ik ) = SetStride(X, Ik )/β and C = C/β are mutually prime. So, the lemma holds for these new reduced values. That means that the first C iterations of loop k touch different cache sets. The rest of them may cause set replacements. In addition, as SetStride(X, Ik ) is assumed constant, the cache sets touched by these first iterations are β sets apart. So, only the fraction C/β cache sets are assured to be occupied. The GCD condition can be used to define a simple algorithm (see Fig. 1) to compute the needed padding of an array in a nested loop for achieving maximum cache occupation. The key idea is touching the maximum number of cache sets when executing the iterations in the innermost loop. That permits to increase the cache set reuse by iterations of the next outer loop. In order to touch all the cache sets, we need to modify SetStride() to make it relatively prime to C. As C is a power of two, an increment of SetStride() by one is enough. This is obtained by increasing BlockStride() by also one, or by increasing Stride() of the array by W (remember that BlockStride() is assumed constant). 3.2
Multiple Loop Nests (Global Code Optimization)
In scientific codes it is very frequent that the same few arrays, usually referenced in the body of loop nests, are re-used during the whole program execution. To maximize the cache occupation for these arrays we may pad them satisfying the
for (each X array in the program) do
  i = 1, P(0) = 0
  for (each loop nest in the program) do
    if (SetStride(X, I_M) is odd) then P(0) = 1
    else P(i) = SetStride(X, I_M)/2, i = i + 1
  endfor
  n = i, even = 0, odd = 0
  if (P(0) == 0) then ΔSetStride = 1
  else
    for (i = 1 to n) do
      if (P(i) is even) then even = even + 1 else odd = odd + 1
    endfor
    if (odd > even) then ΔSetStride = 4 else ΔSetStride = 2
  NewStride(X_i, I_M) = Stride(X_i, I_M) + W ∗ ΔSetStride
endfor
Fig. 2. Intra-array padding procedure for the whole program
GCD condition for all loop nests in the program that include such arrays. We have to bear in mind that we consider the arrays which are referenced over a non-contiguous dimension in the innermost loop of a nest. Then, for a specific array under consideration, we take all SetStride()’s corresponding to all nested loops, and calculate the stride increment (from now on, ΔStride) needed to satisfy the GCD condition for every SetStride(). A new problem arises when for some loop nests SetStride() is even and for other loop nests SetStride() is odd. As ΔStride must be the same for all loop nests, we need to calculate it so that it satisfies the GCD condition for all of them or, at least, it minimizes GCD(SetStride(), C), in order to maximize the number of the used cache sets. In what follows we assume that the number of iterations of the different nested loops are similar. We also assume that the number of the odd and even values of SetStride()’s are similar. This is the worst case because we need to obtain a solution for the even values of SetStride()’s. In our approach, with a mixture of odd and even values for SetStride()’s, an even ΔStride will be calculated, in order to not turning odd values into even ones. To minimize the number of the non touched cache sets, this even increment should minimize GCD(SetStride(), C). In addition, it is convenient that the increment being the smallest possible, in order to minimize the number of (empty) pad locations. Based on this idea, we have defined the algorithm shown in Fig. 2, that determines the array stride increment needed to maximize the occupation of the cache for the whole program. In the procedure a vector P is computed containing the values SetStride()/2 for all even values of SetStride()’s. The first value of P shows if there are odd values of SetStride()’s or not. With the even values, the minimum even increment in stride to minimize GCD(SetStride(), C) is computed for the whole program.
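Summarising Fig. 2 in code form: if every loop nest already has an odd SetStride a single extra block suffices; otherwise an even increment of two or four blocks is chosen from the parity of the halved even SetStrides. The helper below is a sketch of that rule with names of our own choosing.

def global_delta(set_strides):
    """Stride increment, in cache blocks, for one array over the whole program (Fig. 2)."""
    if all(s % 2 == 0 for s in set_strides):
        return 1                     # one block turns every even SetStride odd
    halves = [s // 2 for s in set_strides if s % 2 == 0]
    odd = sum(1 for h in halves if h % 2 == 1)
    even = len(halves) - odd
    return 4 if odd > even else 2

def pad_whole_program(stride, set_strides, W):
    return stride + W * global_delta(set_strides)

print(pad_whole_program(1600, [200, 204, 51], W=8))  # mixed parities -> even increment of 2 blocks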
Select a X array inside a nested loop
Sort cache levels in decreasing order of their block size
for (each cache level take W_i in the sorted order) do
  Apply single-level procedure(Stride(X, I_M), C_i, W_i)
  Stride(X, I_M) = NewStride(X, I_M)
endfor
Fig. 3. Intra-array padding procedure for the complete cache hierarchy
3.3
Complete Cache Hierarchy
In the cache hierarchy the blocks may be the same size for all levels. In this case, the stride increment is the same for all cache levels, so we can apply the same algorithms described above. Otherwise, if block sizes are different over the cache hierarchy, we need to obtain a stride increment appropriate for each cache level. The algorithm in Fig. 1 calculates the stride increment for a specific cache level. To pad the array for other cache level, we need to carry out the same algorithm changing SetStride() and the cache block size (W ). The method to accomplish this process starts with the cache level with the largest block size, continuing with the rest of levels in decreasing order of block size. As an example, consider a two-level cache hierarchy with different block sizes, W 1 and W 2, where W 1 < W 2 (both are power of two). For this system, if Stride()/W 2 is integer, then all the quotients, Stride()/W 1, (Stride() + W 2)/W 2), (Stride() + W 1)/W 1) and (Stride() + W 2 + W 1)/W 1) are all also integer. However, the quotient (Stride() + W 2 + W 1)/W 2) is non-integer. Despite this, its integer part satisfies the GCD condition. Based on these results, a simple algorithm has been developed to extend our padding approach to a cache hierarchy (see Fig. 3).
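Fig. 3's pass over the cache hierarchy then simply reapplies the single-level rule from the largest block size downwards, feeding each level the stride produced by the previous one. A sketch reusing the hypothetical pad_single_nest helper shown earlier:

def pad_hierarchy(stride, levels):
    """levels: iterable of (W_i, C_i) pairs, one per cache level."""
    for W, C in sorted(levels, key=lambda wc: wc[0], reverse=True):
        stride = pad_single_nest(stride, W, C)
    return stride

# L1: 8-word blocks, 512 sets; L2: 32-word blocks, 16384 sets (the MIPS R10K system of Section 4)
print(pad_hierarchy(1600, [(8, 512), (32, 16384)]))  # 1640 = 1600 + W2 + W1, as discussed in the text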
4
Experimental Evaluation
We first tested the basic padding method (single loop nest and one-level cache) for a synthetic single loop nest example on a real machine, and for a small set of kernel benchmarks on a cache simulator. Second, we tested the full padding method (whole program optimization on a complete cache hierarchy) for a selection of benchmarks on a real machine. When using a real machine, different optimization levels were tested, obtaining similar results. In this paper we show those results for two optimization levels on two different processors.
4.1 Basic Padding Method
In this section we present the evaluation of the basic padding method for a simple test code, a double nested loop of 1000 × 1000 iterations (i, j), where the loop body is X(i, j) = 3, being X a 1600 × 1600 single-precision floating-point array (4-byte words). The experiments were conducted in a MIPS R10K-processor system running in exclusive mode. This system has a 2-way set associative 32KByte L1 data cache with a 32-Byte cache block (C = 512, L = 2, W = 8),
Table 1. L1 (left) and L2 (right) data cache misses for different array paddings

  Arr. Dim.  SetStride  GCD  Exec. Time  L1 Misses      Arr. Dim.  SetStride  GCD  Exec. Time  L2 Misses
  1600       200        8    0.212       1,001,000      1600       50         2    0.212       23,100
  1608       201        1    0.171       312,000        1632       51         1    0.214       19,500
  1616       202        2    0.214       1,001,000      1664       52         4    0.219       27,640
  1624       203        1    0.169       310,000        1696       53         1    0.217       20,400
  1632       204        4    0.214       1,001,000      1728       54         2    0.214       24,700
  1640       205        1    0.168       312,000        1760       55         1    0.212       18,600
  1648       206        2    0.216       1,001,000      2048       64         64   0.548       698,500
  1664       208        16   0.217       1,001,000      2176       68         4    0.217       22,200
Table 2. Cache miss ratio for a 16 KB cache using a simulator

                                        Cache Miss Ratio
  Benchmark  Array Dim.  GCD  C=1024  C=512  C=256   C=128   C=64    C=32
                               L=1     L=2    L=4     L=8     L=16    L=32
  test code  400         4    1       1      1       1       1       1
  test code  404         1    1.196   1.442  2.732   13.333  13.333  61.710
  test code  408         2    0.282   0.105  0.873   4.230   4.230   4.645
  test code  412         1    9.310   3.307  3.142   3.142   12.571  12.342
  liver      240         4    1       1      1       1       1       1
  liver      244         1    3.874   3.874  1.193   13.741  14.200  14.200
  liver      248         2    0.869   0.869  3.354   4.531   4.531   4.531
  liver      252         1    1.705   1.705  13.741  13.741  14.200  14.200
  mxm        200         2    1       1      1       1       1       1
  mxm        204         1    0.858   0.256  0.131   0.226   0.645   1.072
  mxm        208         4    2.502   1.031  0.432   0.140   0.145   0.093
  mxm        220         1    7.826   1.351  0.658   0.138   1.001   1.208
and a 2-way set associative 4-MByte L2 data cache with a 128-Byte cache block (C = 16384, L = 2, W = 32). The results correspond to codes compiled with the MIPSpro Fortran 90 (v. 7.30) compiler using the ”O0” optimization option. Considering the test code, we have Stride(X, i) = 1 and Stride(X, j) = 1600. For the L1 cache level we have, BlockStride(X, j) = 1600/8 = 200 (constant), SetStride(X, j) = 200 mod 512 = 200, and GCD(200, 512) = 8. This result means that between two touched cache sets there are 7 other sets which are never used. With an increment by one of SetStride(X, j), we have GCD(201, 512) = 1 and N ewStride(X, j) = 1600 + 8 = 1608. Table 1 (left) shows the experimental results (execution time and L1 data cache misses) for different values of the second dimension of the array X. Besides, the table shows the SetStride(X, j) for each case and the values of GCD(SetStride(X, j), C). Note that the best results are obtained when the GCD condition holds, as expected. For the L2 cache level, we have now BlockStride(X, j) = 1600/32 = 50 (constant), SetStride(X, j) = 50 mod 16384 = 50, and GCD(50, 16384) = 2. So, only half of the L2 cache is used. Table 1 (right) shows the experimental results for the L2 cache and different values for the array second dimension. The table shows three cases fulfilling the GCD condition. For these cases the minimum miss rate is obtained, as expected. A different scenario corresponds to the evaluation of our basic padding method using a cache simulator, in order to test easily different cache configurations. The obtained results are shown in table 2, for three different benchmarks: test
code (test code described above), liver (livermore kernel [3]), and mxm (matrix multiplication of square matrices). The first row in table 2 for each benchmark corresponds to the original dimension size of the array. The rest of rows include different padded dimension sizes. The smallest one (that is, the second row) corresponds to the increment given by our padding technique. For each cache configuration we measured the cache miss rate using our simulator. The results are given in the table as Cache Miss Ratio, that is obtained by dividing the cache miss rate for the original dimension size by the cache miss rate for the padded dimension size. This way, a value larger than 1 means that the miss rate was decreased after applying padding. In case of mxm, padding is not effective due to the small size of the data structures. 4.2
Full Padding Method
A different experimental setup was implemented testing a number of benchmarks from different suites (SPEC95 & 2000, perfectB and NAS) on an Intel Pentium-4 platform. The computer system has an 8-way set associative 16-KByte L1 data cache with a 64-Byte block size, and an 8-way set associative 1-MByte L2 data cache with a 64-Byte block size. The results shown here correspond to codes compiled with the Intel Fortran Compiler v. 9.0, using in this case the "O3" optimization level and the "-align none" option (in order to not interfere with padding). In addition, the code was executed in exclusive mode (hard real-time mode in the Linux scheduler). Table 3 (top) shows the performance results of applying our padding method for the complete cache hierarchy. Note that there is a significant performance improvement from padding the arrays. In order to optimize the whole code, we first have to find all the nested loops where array references in a non-contiguous dimension exist in the innermost loop. Next, we proceed by selecting one of these arrays. From this point on, SetStride()'s are calculated for every level of the cache hierarchy. For each one, the corresponding values of GCD(SetStride(), C) are also calculated, and from these, the stride increments. If the stride values are the same for all the loops, we only have to carry out the sum of the stride increments obtained for each level of the cache memory. Otherwise, we have to apply the corresponding padding algorithm. Table 3 (bottom) shows the improvement obtained by applying our padding method to several benchmarks on the Pentium-4 machine.
5
Related Work
There is a great amount of work in the literature related to the design of compiler techniques for program locality exploitation. In [6] authors present an iterative method that uses ILP (Integer Linear Programming) to compute optimal solutions of memory layout transformations. Other works [7,8] also develop heuristic solutions to data layout and loop transformations, based on data reuse vectors. Works based on Cache Miss Equations (CMEs) [5] are similar to our own proposal in some aspects but with different results. They analyze the cache
Table 3. Improvement for a single loop nest (top) and the whole code (bottom) for the complete cache hierarchy in Pentium 4

  Subroutine          Original      Padded        % Exec. time  % L1 num. misses  % L2 num. misses
                      array dim.    array dim.    improvement   improvement       improvement
  PerfectB Bench.
  adm      hyd        50,50,50      56,52,50      16.2          15.1              0
  arc2d    scaldt     500,500,4     544,500,4     11.3          15.7              13.6
  dyfesm   mnlbyx     500,500,3     544,500,3     7.1           1.2               44.4
  flo52    collc      200,200,200   200,205,200   5.4           5.5               5.7
  mg3d     march      100,100       100,112       3.2           0.7               8.6
  spec77   horiz2     500,500       544,500       16.8          15.5              34.2
  trfd     trfa       500,500       544,500       18.3          3.8               0
  NAS Bench.
  appsp    spentax3   660,33,33     660,35,34     0             6.3               2.0
  appbt    l2norm     50,50,50,50   56,52,56,50   0             0.2               0
  fftpde   transx     1024,1024     1064,1024     120.0         -1.7              90.0
  SPEC CFP Bench.
  tomcatv  cal.res.   513,513       540,520       588.2         211.5             103.8
  applu    rhs        20,50,50,100  20,50,70,100  5.7           4.7               2.9
  swim     calc3z     1335,1335     1352,1336     6.8           1.6               3.6
  hydro2d  v.th.p     1200,800      1224,800      433.2         907.4             173.3
  mgrid    norm2u3    320,200,200   344,201,200   4.2           19.5              18.4
  turb3d   wcal       33,64,64      34,68,64      8.3           -0.1              11.7
  wave     RADFG      100,100,100   116,140,100   5.4           5.5               5.7
  tfft2    transc     500,30,500    500,30,501    6.5           11.7              0
  Benchmark  % Exec. time  % L1 num. misses  % L2 num. misses
             improvement   improvement       improvement
  swim       1.1           3.2               -0.2
  mgrid      3.4           2.5               1.3
  appsp      4.8           2.8               3.7
  fftpde     1.2           3.9               10.4
occupation through references to arrays in loops and define the algorithms to generate the CMEs and some optimizations, as padding. In [11] authors also use CMEs to propose a cost model that combined with a genetic algorithm carries out padding (and tiling) for a multi-level cache memory system. The work in [2] is mainly focused on spatial locality optimization. The method provides a parameterized cost function based on polytopes and Ehrhart polynomials from the iteration space of a loop nest. We also use the polyhedron defined by the iteration space of a loop nest, but with a different objective. Our goal is to determine the cache occupation generated by this polyhedron. In [9] authors present a similar approach to ours for array padding. Their technique is iterative until no self-interference is caused. Our method, however, is a direct calculation, more accurate and with lower algorithmic complexity. Other work [10] computes the conflict distance of array references by directly linearizing the uniformly-generated references. In [12] authors analyze iterative stencil loops and use array padding to remove conflict misses after tiling the loops. Finally, the work in [4] shares with our approach a similar treatment of the cache and the program code. However, we also take into account how array
data is stored in memory, and we introduce the new notions of cache block and set strides and work with them to develop the padding algorithms. These facts, to some extent, complete their work, justifying some of the results they obtain.
6
Conclusions
We presented a parameter-based cache model that is used as a framework to determine how to pad arrays in order to optimize program locality. In fact, the model makes it possible to maximize cache occupation when arrays are referenced within loop nests. This model is used to develop array padding algorithms in different scenarios: single loop nest and multiple loops, single and multiple cache levels. We showed that simple padding techniques are very useful for obtaining respectable performance improvements for a variety of scientific codes.
References
1. Bacon, D.F., Graham, S.L., Sharp, O.J.: Compiler Transformations for High-Performance Computing. ACM Computing Surveys 26(4), 345–420 (1994)
2. Clauss, P., Meister, B.: Automatic Memory Layout Transformation to Optimize Spatial Locality in Parameterized Loop Nests. ACM Computer Architecture News 28(1), 11–19 (2000)
3. Coleman, S., McKinley, K.S.: Tile Size Selection Using Cache Organization and Data Layout. In: ACM Conf. on PLDI, La Jolla (CA), pp. 279–290 (1995)
4. Ferrante, J., Sarkar, V., Thrash, W.: On Estimating and Enhancing Cache Effectiveness. Workshop on Languages and Compilers for Parallel Computers (1991)
5. Ghosh, S., Martonosi, M., Malik, S.: Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behaviour. ACM TOPLAS 21(4), 703–746 (1999)
6. Kandemir, M., Banerjee, P., Choudhary, A., Ramanujam, J., Ayguade, E.: An Integer Linear Programming Approach for Optimizing Cache Locality. In: ACM Int'l. Conf. on Supercomputing, Rhodes, pp. 500–509 (1999)
7. Kandemir, M., Choudhary, A., Ramanujam, J., Banerjee, P.: Improving Locality Using Loop and Data Transformations in an Integrated Framework. In: ACM/IEEE Int'l. Symp. on Microarchitecture, Dallas (TX), pp. 285–297 (1998)
8. O'Boyle, M., Knijnenburg, P.: Integrating Loop and Data Transformations for Global Optimizations. In: IEEE Int'l. Conf. on Parallel Architectures and Compilation Techniques, Paris, pp. 12–19 (1998)
9. Panda, P., Nakamura, H., Dutt, N., Nicolau, A.: A Data Alignment Technique for Improving Cache Performance. In: Int'l. Conf. on Computer Design: VLSI in Computers and Processors, Austin (TX), pp. 587–592 (1997)
10. Rivera, G., Tseng, C.W.: Data Transformations for Eliminating Conflict Misses. In: ACM Conf. on PLDI, Montreal, pp. 38–49 (1998)
11. Vera, X., Abella, J., Llosa, J., González, A.: An Accurate Cost Model for Guiding Data Locality Transformations. ACM TOPLAS 27(5), 946–987 (2005)
12. Li, Z., Song, Y.: Automatic Tiling of Iterative Stencil Loops. ACM TOPLAS 26(6), 975–1028 (2004)
Improving the Performance of Graph Coloring Algorithms through Backtracking
Sanjukta Bhowmick (Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, [email protected]) and Paul D. Hovland (Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439-4844, [email protected])
Abstract. Graph coloring is used to identify independent objects in a set and has applications in a wide variety of scientific and engineering problems. Optimal coloring of graphs is an NP-complete problem. Therefore there exist many heuristics that attempt to obtain a near-optimal number of colors. In this paper we introduce a backtracking correction algorithm which dynamically rearranges the colors assigned by a top level heuristic to a more favorable permutation thereby improving the performance of the coloring algorithm. Our results obtained by applying the backtracking heuristic on graphs from molecular dynamics and DNA-electrophoresis show that the backtracking algorithm succeeds in lowering the number of colors by as much as 23%. Variations of backtracking algorithm can be as much as 66% faster than standard correction algorithms, like Culberson’s Iterated Greedy method, while producing a comparable number of colors. Keywords: Graph Coloring, Backtracking.
1 Introduction Graph coloring is used for partitioning a collection of objects into "independent" sets. Objects belonging to the same set are identified by having the same color. Objects with the same color are non-conflicting, that is, certain operations can be performed simultaneously on them. Coloring is used in many computational and engineering applications that require identification of concurrent tasks. Some examples include register scheduling, frequency assignments for mobile networking, the evaluation of sparse Jacobian matrices, etc. Optimal coloring strategies improve parallelism. The fewer colors required to classify the objects, the more the inherent parallelism of the problem can be exploited. The holy grail of graph coloring is achieving the chromatic number, the smallest number of colors required to color the graph so that no two adjacent vertices have the same color. Determining the chromatic number is NP-complete [1], and designing polynomial-time heuristics to obtain quasi-optimal solutions is an active area of research.
It has been observed that for many heuristics the order in which vertices are colored significantly affects the number of colors obtained [2]. Based on this observation, we present a backtracking correction heuristic that mitigates the effects of a “bad” vertex ordering. The backtracking algorithm dynamically rearranges the colors assigned by the top level coloring algorithm thereby, changing the vertex ordering to a more favorable permutation. We study the performance of backtracking algorithm in conjunction with several popular coloring heuristics and compare it with another correction technique, Culberson’s iterated greedy scheme [3]. Results from our experiments on graphs obtained from molecular dynamics [4] and DNA electrophoresis [5] show that backtracking improves upon the performance of the top level heuristic as well as the iterated greedy approach. The average reduction of colors is as much as 23% (16%), compared to the original (iterated greedy) method. Most correction algorithms necessarily take more time to determine the near-optimal number of colors. We have designed a variation of backtracking that is as much as 66% faster than the iterated greedy method, while giving the number of colors within 1% of that obtained by the correction method. The rest of the paper is arranged as follows. In Section 2 we present the mathematical description of relevant terms from graph theory and define graph coloring. We provide a brief review of some of the standard coloring heuristics in Section 3. In Section 4 we describe the backtracking algorithm. We discuss experimental results in Section 5 and present improved variations to the heuristic such as Multilevel and Reverse backtracking. Section 6 contains conclusions and discussion of our future research plans.
2 Mathematical Definitions In this section we define some terms used in graph theory. Unless mentioned otherwise, the terms used here are as they are defined in [6]. A graph G = (V, E) is defined as a set of vertices V and a set of edges E. An edge e ∈ E is associated with two vertices u, v which are called its endpoints. If a vertex v is an endpoint of an edge e, then e is incident on v. A vertex u is a neighbor of v if they are joined by an edge. The degree of a vertex u is the number of its neighbors. A walk, of length l, in a graph G is an alternating sequence of v0 , e1 , v1 , e2 , . . . , el , vl vertices and edges, such that for j = 1, . . . , l; vj−1 and vj are the endpoints of edge ej . An internal vertex is a vertex that is neither the initial nor final vertex in this sequence. A path is a walk with no edges or internal vertices repeated. Two vertices are said to be distance-k neighbors if the shortest path connecting them has length at most k [7]. A vertex-coloringof a graph G = (V, E) is a function φ : V → C from the set of vertices to a set C = {1, 2, . . . , n} of “colors”. A distance-k coloring of a graph G = (V, E) is a mapping φ : V → {1, 2, . . . , n} such that φ(u) = φ(v), whenever u and v are distance-k neighbors. The least possible number of colors required for a distance-k coloring of a graph G is called its k-chromatic number [7].
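As a concrete companion to these definitions, the snippet below (our own illustration, in Python) checks whether a color assignment is a valid distance-1 coloring of a graph stored as adjacency lists; a distance-2 check would additionally compare each vertex against its neighbors' neighbors.

def is_distance1_coloring(adj, color):
    """adj maps each vertex to its neighbors; color maps each vertex to an integer color."""
    return all(color[u] != color[v] for u in adj for v in adj[u])

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(is_distance1_coloring(adj, {0: 1, 1: 2, 2: 3, 3: 1}))  # True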
3 Review of Some Coloring Algorithms In this section we provide an overview of some standard coloring algorithms. It has been observed that the order in which vertices are colored is an important parameter in
lowering the number of colors. Consequently, many coloring heuristics focus on finding an efficient vertex ordering. Some apply well known graph traversal methods like the depth-first search [6] while others focus on orderings based on the degree of the vertices or the number of colored neighbors. Some examples of the later category include the largest first [8] ordering where the vertices are arranged in non-increasing order of their degrees and the smallest last [9] ordering which dynamically orders the vertices such that the last vertex in the sequence is one with the minimum degree in the subgraph induced by the yet uncolored vertices. In the incidence degree [10] ordering, the vertex with the maximum number of colored neighbors is the next one to be colored. The effectiveness of these heuristics depend on the underlying graph structure. The results can be improved by using correction algorithms such as Culberson’s Iterated Greedy method [3]. In this approach, once the initial algorithm has been applied, the iterated greedy method rearranges the vertices in decreasing order of color, and re-colors them. Culberson’s method guarantees that the reordering does not increase the number of colors.
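All of the ordering heuristics above share the same greedy core: visit the vertices in the chosen order and give each one the smallest color not used by its already-colored neighbors. A minimal sketch (ours, not the authors' implementation), here driven by a largest-first ordering:

def greedy_color(adj, order):
    color = {}
    for v in order:
        used = {color[n] for n in adj[v] if n in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
largest_first = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
print(max(greedy_color(adj, largest_first).values()))  # number of colors used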
4 The Backtracking Correction Heuristic The backtracking correction heuristic is based on dynamically reassigning colors amongst already colored vertices in order to restrict the number of colors to a user-specified minimum. The heuristic is implemented as follows: the user specifies a coloring threshold set to a lower bound on the chromatic number. The backtracking heuristic is invoked whenever this threshold is exceeded. It is easy to see that when backtracking is called there is only one vertex, designated as the last-vertex, that is assigned a color higher than the threshold. Evidently, the rest of the colors up to the threshold would have been used to color the neighbors of the last-vertex. These colors form the acceptable set of colors. The last-vertex is temporarily assigned a pseudo-color from the acceptable color set. The backtracking algorithm tries to determine whether there is an alternate assignment of colors from the acceptable set to the neighboring vertices that would allow the last-vertex to retain the pseudo-color and prevent conflicts. If such an assignment is found, then we have a coloring within the limits of the threshold. If no such arrangement can be obtained for any color from the acceptable set, the last-vertex is assigned its original color and the threshold is increased by one.
Pseudocode for Backtracking Heuristic
Set threshold to T
For all vertices v
  Color vertex v with initial coloring algorithm
  pseudocolor[v] = color[v]
  If color[v] > T
    For all colors c; 1 ≤ c ≤ T
      Set pseudocolor[v] to c
      Set fail to FALSE
      For all neighbors n of v
        If color[n] = c
          Reassign pseudocolor[n] to avoid conflicts
          If pseudocolor[n] > T
            Set fail to TRUE; Break
      If fail is FALSE
        Re-coloring is successful; Break
      Else continue for next color
    If fail is FALSE (alternative coloring assignment found)
      For all vertices v; set color[v] = pseudocolor[v]
    Else
      For all vertices v; set pseudocolor[v] = color[v]
      Increase T by 1
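A compact Python rendering of this heuristic is sketched below. It follows the pseudocode's structure (threshold T, pseudo-colors, a single level of reassignment for conflicting neighbors), but the data structures and helper names are our own simplifications, not the authors' code; an undirected graph with symmetric adjacency lists is assumed.

def smallest_free(used):
    c = 1
    while c in used:
        c += 1
    return c

def backtracking_coloring(adj, order, T):
    """Color vertices in the given order; when a vertex would exceed the threshold T,
    try to reassign its conflicting neighbors so that it still fits within T."""
    color = {}
    for v in order:
        c = smallest_free({color[n] for n in adj[v] if n in color})
        if c <= T:
            color[v] = c
            continue
        placed = False
        for cand in range(1, T + 1):              # try each pseudo-color for v
            trial = dict(color)
            trial[v] = cand
            ok = True
            for n in adj[v]:
                if n in color and color[n] == cand:
                    taken = {trial[m] for m in adj[n] if m in trial}
                    alt = smallest_free(taken)
                    if alt > T:                   # this neighbor cannot be moved within T
                        ok = False
                        break
                    trial[n] = alt
            if ok:
                color = trial
                placed = True
                break
        if not placed:
            color[v] = c                          # keep the original color ...
            T = c                                 # ... and raise the threshold (here c = T + 1)
    return color, T

# usage: backtracking_coloring(adj, list(adj), T=3) mirrors the distance-1 threshold of Section 5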
5 Performance of Backtracking Heuristic We report on the performance of the backtracking algorithm on two test suites each containing six matrices. We applied the coloring algorithms discussed in Section 3 to the adjacency graphs corresponding to these matrices. The first set obtained from molecular dynamics [4], consists of a group of graphs with fixed vertices (11414) and gradually increasing number of edges. The second set obtained from the Florida Sparse Matrix Collection [5], representing DNA electrophoresis, consists of graphs whose size increases with both vertices and edges. We used the following ordering heuristics; Natural Ordering (N), Smallest Last (S), Largest First (L), Incidence Degree (I), and Depth First Ordering (D). For each heuristic we conducted three sets of experiments using: i) only the heuristic, ii) Culberson’s Iterated Greedy method [3], with the current heuristic in the first iteration and iii) the heuristic with the Backtracking algorithm. Our experiments include results for both distance-1 and distance-2 coloring objectives. The threshold for distance-1 coloring was set to 3 and the threshold for distance-2 coloring was set to the minimum degree of the graph. 5.1 Reduction of Colors The results summarized in Tables 1 and 2 demonstrate that backtracking can significantly reduce the number of colors. The number of colors obtained is lower than that given by the iterated greedy algorithm. The reduction is higher for distance-1 coloring (maximum reduction of 23%) than for distance-2 coloring (maximum reduction of 18%). This is to be expected, since coloring vertices based on distance-2 neighbors requires fulfilling more constraints, thus reducing the possibility of color reassignments. 5.2 Running Time The time taken to color a graph is proportional to its size. Figure 1 plots the time taken to distance-1 color the two sets of graphs. The results show that though the execution time of backtracking is competitive with the iterated greedy heuristic for smaller
Fig. 1. Comparison of the time taken between Depth First Search (DF), DF using Culberson’s Iterated Greedy Algorithm and DF using Backtracking for distance-1 coloring. The left-hand side figure represents graphs from molecular dynamics and the right-hand side figure represents graphs from DNA-electrophoresis. Time is given in seconds.
graphs, it gets much larger as the size of the graph increases. The running time is worse for distance-2 coloring, and with particularly bad combinations of vertex ordering and threshold for dense matrices the execution time can go up to as much as 77 times the time taken by the top-level heuristic. The time required to backtrack can be reduced by increasing the level of the threshold. Since backtracking does not start until the threshold is reached, a judicious selection can significantly decrease the time, as shown in Figure 2, without compromising the number of colors.
Fig. 2. Comparison of the time taken to color a molecular dynamics graph with Natural (N) ordering, the Iterated Greedy method (G) and Backtracking (B) with different thresholds, given in parentheses. The values on top of the bars give the number of colors obtained. Time is given in seconds.
5.3 Improving Backtracking

Improvements to backtracking can include i) reduction of colors and ii) reduction of execution time. We will explore the first option with respect to distance-1 coloring and the second option with respect to distance-2 coloring.
Multilevel Backtracking: Multilevel backtracking reduces the number of colors by recursively invoking the correction heuristic. For example, if in the course of backtracking a neighbor of the last-vertex is colored higher than the threshold, we use multilevel backtracking to explore further reassignments of the vertices adjacent to that neighbor in order to limit the colors to the acceptable set. Figure 3 and Table 1 summarize the number of colors for distance-1 coloring of the representative graphs. The results show that 2-level backtracking (recursion used once) can reduce the number of colors by 10% compared to non-recursive backtracking.
Fig. 3. Number of colors required by the graphs for distance-1 coloring. The groups from left to right represent different top level heuristics: Natural Ordering (N), Smallest Last (S), Largest First (L), Incidence Degree (I), and Depth First Ordering (D). Suffix (G) represents the coloring obtained by the iterated greedy algorithm, suffix (B) represents the coloring obtained by backtracking and suffix (2) represents 2-level backtracking. The left-hand side figure represents graphs from molecular dynamics and the right-hand side figure represents graphs from DNA-electrophoresis.
Reverse Backtracking: Reverse backtracking is used to reduce the execution time of the backtracking algorithm. In this heuristic the vertices are first colored using a top-level algorithm and then backtracking is applied to see if it is possible to re-color all the vertices of the highest color to a lower color. The process is continued until a lower color cannot be assigned. This algorithm is similar to the iterated greedy algorithm in that the vertices are grouped according to color after the initial coloring is complete. However, instead of re-coloring all the vertices again as in the iterated greedy algorithm, reverse backtracking re-colors only the vertices with the highest colors (and their neighbors as required). Consequently reverse backtracking has a lower running time than the iterated greedy heuristic. Figure 4 compares the execution time for distance-2 coloring with respect to the most expensive algorithm (depth-first search). The results show that reverse backtracking is faster by as much as 66% compared to the iterated greedy algorithm. The performance of these algorithms with respect to the number of colors is summarized in Figure 5 and Table 2. For most algorithms, the number of colors obtained by reverse backtracking is competitive with that obtained by the iterated greedy method.
Table 1. Number of colors required by the graphs for distance-1 coloring. The groups from left to right represent different top level heuristics: Smallest Last (S), Largest First (L), Incidence Degree (I), and Depth First Ordering (D). Suffix (G) represents the coloring obtained by the iterated greedy algorithm, suffix (B) represents the coloring obtained by backtracking and suffix (2) represents 2-level backtracking.

Graphs          S    S-G  S-B  S-2   L    L-G  L-B  L-2   I    I-G  I-B  I-2   D    D-G  D-B  D-2
V=11K E=15K     5    5    5    5     5    5    5    5     5    5    5    5     5    5    5    5
V=11K E=64K     8    8    8    8     9    9    9    8     9    9    9    8     11   9    9    8
V=11K E=130K    14   14   14   13    16   16   15   14    15   14   14   13    18   16   15   14
V=11K E=412K    34   34   32   32    42   40   37   35    34   34   33   32    45   42   39   37
V=11K E=1.6M    115  115  110  109   137  135  124  117   117  117  112  109   164  156  142  130
V=11K E=2.6M    183  183  176  171   218  216  187  181   183  182  177  173   266  241  225  201
V=37 E=196      4    5    4    4     5    5    4    4     5    5    4    4     5    5    5    4
V=93 E=692      6    5    5    5     6    7    5    5     7    6    5    5     6    6    5    5
V=1K E=9K       8    7    7    7     8    8    7    6     8    7    7    7     8    7    8    7
V=3K E=38K      9    8    8    8     9    9    8    8     10   9    9    8     11   9    9    8
V=39K E=520K    10   10   10   9     13   11   10   9     12   11   10   10    14   12   11   10
V=130K E=1.9M   12   11   10   9     13   12   10   10    13   11   10   10    14   12   12   11
Fig. 4. Number of colors required by the graphs for distance-2 coloring. The groups from left to right represent different top level heuristics: Natural Ordering (N), Smallest Last (S), Largest First (L), Incidence Degree (I), and Depth First Ordering (D). Suffix (G) represents the iterated greedy algorithm, suffix (B) represents backtracking and suffix (R) represents reverse backtracking. The left-hand side figure represents graphs from molecular dynamics and the right-hand side figure represents graphs from DNA-electrophoresis.
Reverse Backtracking Heuristic
Color all vertices
Set C to maximum number of colors
While TRUE
    For all vertices colored with C
        Apply backtracking with threshold C-1
    If all vertices can be recolored, set C to C-1
    Else break
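Assuming the `backtrack_correct` helper sketched in Section 4, the reverse heuristic can be driven as below; this is an illustrative sketch rather than the authors' code.

```python
def reverse_backtracking(adj, color):
    """Repeatedly try to eliminate the highest color class (threshold C-1)."""
    C = max(color.values())
    while C > 1:
        recolored = dict(color)
        success = True
        for v in [u for u, c in color.items() if c == C]:
            result = backtrack_correct(adj, recolored, v, C - 1)
            if result is None:
                success = False               # some vertex of color C cannot be lowered
                break
            recolored = result
        if not success:
            break
        color, C = recolored, C - 1           # the whole highest class was removed
    return color
```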
Table 2. Number of colors required by the 12 graphs for distance-2 coloring. The groups from left to right represent different top level heuristics: Smallest Last (S), Largest First (L), Incidence Degree (I), and Depth First Ordering (D). Suffix (G) represents the iterated greedy algorithm, suffix (B) represents backtracking and suffix (R) represents reverse backtracking. Due to space constraints the colors for the graph with V=11K and E=2.6M are given in units of 10^3.

Graphs (V:E)  S     S-G   S-B   S-R    L     L-G   L-B   L-R    I     I-G   I-B   I-R    D     D-G   D-B   D-R
11K:15K       9     9     9     9      9     9     9     9      9     9     9     9      9     9     9     9
11K:64K       22    22    22    22     26    24    24    24     24    23    22    23     29    26    24    26
11K:130K      47    47    45    47     57    56    52    55     50    49    45    48     65    59    55    58
11K:412K      152   150   144   150    183   178   157   170    152   152   148   151    217   198   184   200
11K:1.6M      617   609   585   608    654   651   617   645    615   615   596   610    862   771   710   789
11K:2.6M      1.02  1.01  0.97  1.01   1.07  1.07  1.01  1.05   1.01  1.00  0.98  1.00   1.39  1.22  1.14  1.24
37:196        11    11    10    11     10    10    10    10     11    11    11    11     12    12    11    11
93:692        19    18    17    18     17    17    17    17     17    17    17    16     19    18    17    18
1K:9K         30    28    27    29     30    29    27    28     31    30    28    29     31    31    28    29
3K:38K        39    38    36    39     42    41    38    38     44    42    37    41     48    43    40    43
39K:520K      62    62    56    59     67    67    61    64     66    63    58    62     81    82    68    70
130K:1.9M     68    66    60    64     73    65    72    68     72    71    64    68     87    77    73    78
Fig. 5. Comparison of the time taken between Depth First Search (DF), DF using Culberson’s Iterated Greedy Algorithm and DF using Reverse Backtracking for distance-2 coloring. The left-hand side figure represents graphs from molecular dynamics and the right-hand side figure represents graphs from DNA-electrophoresis. Time is given in seconds.
5.4 Comparison with Results from Integer Programming

We can measure the effectiveness of backtracking by comparing the results with known optimal colorings. We used integer programming to find the optimal number of colors for some graphs of small size. The results provided in Table 3 demonstrate that we can also obtain optimal colorings within a few levels of backtracking. The threshold was set to the minimum number of nonzeros per column of the corresponding column intersection matrix.
Table 3. Number of colors for distance-1 coloring. The columns from left to right represent coloring using Natural ordering, Integer Programming, and Natural ordering with Backtracking. The number of levels of backtracking is given in parentheses. The IP execution for graphs with E=1555 and E=3925 could not be completed within the time bounds.

Graph          Natural  IP                 BT (Levels)
V=12 E=59      11       11                 11 (1)
V=12 E=32      8        8                  8 (1)
V=60 E=1555    50       between 45 and 50  50 (1)
V=20 E=115     10       10                 10 (1)
V=10 E=45      10       10                 10 (1)
V=14 E=77      8        7                  7 (2)
V=72 E=472     8        6                  6 (2)
V=68 E=2074    51       51                 51 (1)
V=100 E=3925   49       unknown            40 (1)
6 Discussion and Future Work

We have described a backtracking algorithm and shown that it is indeed successful in reducing the number of colors. The price for achieving a smaller number of colors is an increase in the time required to compute the coloring. The execution time is generally higher for dense matrices or coloring problems with more constraints, such as distance-2 coloring. This is to be expected, since as the number of neighbors increases, backtracking has more vertices to search. Backtracking has several provisions for lowering the execution time based on user-specific requirements, such as varying the threshold or invoking backtracking only at specific intervals. Reverse backtracking is also effective in reducing the computing time, and the number of colors given by this method is close to that of the original backtracking for distance-2 coloring. The trade-off between the reduction of colors and the execution time of the algorithm depends on the purpose of the coloring. If the underlying application uses the same partition multiple times, then an upfront large cost to obtain a near-optimal partition is justified. We have observed that the performance of backtracking is largely dependent on the top-level heuristic. One of our research goals, therefore, is to design more efficient variations of backtracking to match the top-level coloring techniques. Our other future plans include the application of backtracking in parallel coloring algorithms.

Acknowledgments. This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Computational Technology Research, U.S. Department of Energy under Contract W-31-109-Eng-38. The idea for the backtracking algorithm was inspired by reading the excellent review on graph coloring algorithms by Assefaw Gebremedhin, Fredrik Manne and Alex Pothen [7]. We are also grateful to Assefaw Gebremedhin and Rahmi Aksu for letting us use their graph coloring software. Our implementation of the backtracking algorithm was built on top of this software. We would also like to thank Sven Leyffer for his assistance in using integer programming to find optimal colorings for several small matrices. We thank Gail Pieper for proofreading a draft of this paper.
References
1. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York (1979)
2. Siek, J.G., Lee, L., Lumsdaine, A.: The Boost Graph Library: User Guide and Reference Manual. Addison Wesley Professional, Reading (2001)
3. Culberson, J.C.: Iterated Greedy Graph Coloring and the Difficulty Landscape. Technical Report (1992)
4. Carloni, P.: PDB Coordinates for HIV-1 Nef binding to Thioesterase II, http://www.sissa.it/sbp/bc/publications/publications.html
5. Davis, T.: University of Florida Sparse Matrix Collection (1997), http://www.cise.ufl.edu/research/sparse/matrices
6. Gross, J.L., Yellen, J.: Handbook of Graph Theory and Applications. CRC Press, Boca Raton (2004)
7. Gebremedhin, A., Manne, F., Pothen, A.: What Color is your Jacobian? Graph Coloring for Computing Derivatives. SIAM Review 47, 629–705 (2005)
8. Welsh, D.J.A., Powell, M.B.: An upper bound for the chromatic number of a graph and its application to timetabling problems. Computer J., 85–86 (1967)
9. Matula, D.W.: A min-max theorem for graphs with application to graph coloring. SIAM Review, 481–482 (1968)
10. Coleman, T., Moré, J.J.: Estimation of Sparse Jacobian Matrices and Graph Coloring Problems. SIAM Journal of Numerical Analysis 20(1), 187–209 (1983)
Automatic Identification of Fuzzy Models with Modified Gustafson-Kessel Clustering and Least Squares Optimization Methods Grzegorz Glowaty AGH University of Science and Technology, Department of Computer Science, Al. Mickiewicza 30, 30-059 Krakow, Poland [email protected]
Abstract. An automated method to generate fuzzy rules and membership functions from a set of sample data is presented. Our method is based on clustering and uses a modified version of the Gustafson-Kessel algorithm. The aim is to divide the product space into a set of clusters for which the system exhibits behavior close to linear. For each of the clusters we produce a fuzzy rule and generate a set of membership functions for the rule antecedent with use of an approach based on curve fitting. Weighted linear least-squares regression is used to obtain consequent functions for TSK models. Keywords: fuzzy modeling, fuzzy clustering, Gustafson-Kessel algorithm.
1 Introduction

Fuzzy models have proven to be effective function approximators. They are also easily interpretable because they are composed of human readable rules. Those rules can be used to understand the nature of a modeled system. This huge advantage of fuzzy modeling over many other modeling techniques motivates researchers to work on automatic methods of fuzzy modeling, as they eventually allow for easy generation of a human readable interpretation of the system. In this paper we focus on fuzzy model generation with use of fuzzy data clustering. First, we provide a general idea of the application of clustering in fuzzy model identification. We propose modifications to the Gustafson-Kessel fuzzy clustering algorithm with the purpose of producing clusters more suitable for usage in a fuzzy model. Then we show how to convert those clusters to TSK fuzzy models. At the end of this work models produced with the described method are compared with models produced by other classical fuzzy modeling approaches.
2 Clustering in Fuzzy Model Identification

Fuzzy rules introduce a natural partition of the system space. Antecedents of the rules introduce a partition of the input space. This partition defines a set of regions in which particular rules apply. The general idea behind the use of clustering techniques in
fuzzy model identification [6, 8, 10] is that if we are able to find groups of sample data that exhibit similar behavior in a given area of the system space, then we should be able to divide the problem of modeling into several smaller subspaces. In each of these subspaces we create a fuzzy rule that mimics the approximated system's behavior in this area. Fuzzy clustering methods not only find cluster centers, but also assign a membership degree of each of the samples to each of the clusters. We use this information in the generation of fuzzy rules. We modify the Gustafson-Kessel fuzzy clustering algorithm [9] and use it as the basis for our approach.
3 Finding Clusters

3.1 Desired Cluster Properties

The objective of our method is to create fuzzy rules for which antecedents are decomposed into a set of predicates for each of the variables of the input domain. This kind of model provides the best interpretability of produced rules. There are approaches [6] that use n-1 dimensional fuzzy sets as membership functions in rule antecedents (where n is the number of dimensions of the product space). However, those models are harder to interpret. In the best performing of the methods presented in [8], the Gath and Geva clustering algorithm is used and a transformation of input variables is applied. The goal of the transformation is to leverage clusters as if they were parallel to the axes of the space. That also reduces readability of the rules. In order to derive a fuzzy model from a set of fuzzy clusters in the product (input-output) space $X_1 \times \ldots \times X_{n-1} \times X_n$ (where $X_n$ is the output domain), a projection of each of the clusters onto each of the input space axes is obtained. Fuzzy clusters resulting from most of the fuzzy clustering algorithms have the shape of a sphere or hyper-ellipsoid. In the case of spheres it is easy to obtain a "projection" of a cluster onto an axis without a loss of information; however, in the case of hyper-ellipsoids the more axes of the ellipsoid are parallel to the axes of the space, the more information is preserved. Some of the approaches are based on this observation and look for clusters that have all of their axes parallel to the axes of the space [10]. For TSK fuzzy models the consequent of the fuzzy rule may be a linear function of the input variables. In this case there is no need to project the cluster onto the output axis. With this in mind we propose a modified version of the Gustafson-Kessel algorithm that finds clusters that are easily projected onto the input space, and not necessarily parallel to the output axis.

3.2 Gustafson-Kessel Algorithm

Let us assume a set of N samples in the n dimensional space. The target is to find K fuzzy clusters, such that

$$\forall i \in \{1,\ldots,N\}: \; \sum_{k=1}^{K} \mu_{k,i} = 1, \qquad (1)$$
where $\mu_{k,i}$ is the membership degree of sample i to cluster k. The Gustafson-Kessel algorithm finds clusters by minimizing the following function:
$$J_{X,m}(U, V) = \sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{k,i}^{m}\, D_{A_k}^{2}(x_i - v_k), \qquad (2)$$
where U is the set of membership degrees $\mu$, V is the set of cluster centers v, m is the fuzziness factor (usually a value close to 2), X is the set of N samples x, and $D_{A_k}^{2}$ is a norm induced by the matrix $A_k$. Every cluster has its own norm inducing matrix

$$A_k = \left[\sigma_k \det(F_k)\right]^{\frac{1}{n-1}} F_k^{-1}, \qquad (3)$$
where F is a fuzzy covariance matrix defined as follows:

$$F_k = \frac{\sum_{i=1}^{N} \mu_{k,i}^{m} (x_i - v_k)(x_i - v_k)^{T}}{\sum_{i=1}^{N} \mu_{k,i}^{m}}. \qquad (4)$$
The parameter $\sigma_k$ in (3) was introduced as a cluster capacity, so the objective function minimization is not a trivial process of minimizing all values of the matrix A. Usually for the Gustafson-Kessel algorithm a destination capacity of 1 for each of the clusters is assumed. The norm $D_{A_k}^{2}$ induced by matrix $A_k$ is calculated in the following way:

$$D_{A_k}^{2}(x) = (v_k - x)^{T} A_k (v_k - x). \qquad (5)$$
Given the membership degrees, the centers of the clusters are calculated as the weighted mean of the samples, with the membership degrees as weights:

$$v_k = \frac{\sum_{i=1}^{N} \mu_{k,i}^{m}\, x_i}{\sum_{i=1}^{N} \mu_{k,i}^{m}}. \qquad (6)$$
On the other hand, given the cluster centers and the norm inducing matrices it is possible to induce the desired membership degrees of the samples in the following way:

$$\mu_{k,i} = \frac{1 / D_{A_k}^{2}(x_i)}{\sum_{j=1}^{K} 1 / D_{A_j}^{2}(x_i)}. \qquad (7)$$
The Gustafson-Kessel algorithm minimizes the function given by (2) by iterative execution of the following steps:

1. Initialize U with random membership degrees
2. Calculate centers of clusters with (6)
3. Calculate new membership degrees with (7)
4. Calculate fuzzy covariance matrices using (4)
5. Calculate norms induced by those matrices using (3) and (5)
6. If the membership degrees have changed in this iteration by more than the assumed termination value, proceed to step 2

In [10] a modification of this algorithm was proposed to restrict it to finding clusters that are parallel to all the axes of the input-output space. In this method, we propose a modification that results in finding clusters parallel to the input space axes, and not necessarily parallel to the output axis. The Gustafson-Kessel algorithm needs the number of clusters as an input parameter. We identify several models with different numbers of clusters and choose the best one according to the testing set error.

3.3 Modification of Gustafson-Kessel Algorithm to Obtain Desired Clusters
Clusters that are parallel to one of the axes tend to have a significant non-zero variance along this axis and values of all covariances of this axis variable close to zero. As was noticed in [10], a desired covariance matrix for clusters parallel to the axes is a diagonal matrix. In this work, however, we are looking for a wider class of clusters, namely clusters that are parallel only to the input-space axes. To achieve this, we lessen the restriction on the covariance of the output variable, but still do not want to introduce any covariance between input variables. This leads to clusters induced by a covariance matrix of the form

$$F_0 = \begin{pmatrix} c_1 & 0 & \cdots & 0 & 0 \\ 0 & c_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & 0 & y_i \\ 0 & 0 & 0 & c_{n-1} & 0 \\ 0 & 0 & y_i & 0 & c_n \end{pmatrix}. \qquad (8)$$
If F is a fuzzy covariance matrix, F_0 is the matrix created from F by putting 0 everywhere except for the diagonal and a single place not on the diagonal in the last row and last column. The question to be answered is whether such a matrix is a valid covariance matrix. This is important because the covariance matrix needs to be positive semi-definite, so that it has a positive determinant and the norm inducing matrix A obtained by (3) exists in $\Re^{n \times n}$. It is easy to show that in general (if more of the original elements were preserved) such a matrix may not be a covariance matrix. However, restricting the values in the way shown in (8) leads to a covariance matrix in all cases.
Theorem 1. Let F be a covariance matrix. The matrix F_0 as in (8) created from F has the properties of a covariance matrix.

Proof. Let us consider only two variables, the i-th and the n-th, and their covariance matrix

$$F_{in} = \begin{pmatrix} c_i & y_i \\ y_i & c_n \end{pmatrix}. \qquad (9)$$
From properties of covariance matrices we have:
$$\det(F_{in}) \geq 0 \;\Rightarrow\; c_i c_n - y_i^{2} \geq 0. \qquad (10)$$
For F_0 to be a covariance matrix it is sufficient that it be symmetric positive semi-definite. A sufficient condition for a matrix to be positive semi-definite is that all determinants of the leading minors of the matrix are non-negative. Let $F_0^{j}$ be the j-th leading minor of $F_0$. From definition (8) and from the fact that the diagonal contains only non-negative numbers (variances), for all n-1 leading minors:
$$\forall j = 1 \ldots n-1: \; \det(F_0^{j}) = c_1 \cdot \ldots \cdot c_j \geq 0. \qquad (11)$$
The value of the last minor's determinant (of the whole matrix) is:

$$\det(F_0) = c_1 \cdot \ldots \cdot c_n - c_1 \cdot \ldots \cdot c_{i-1}\, y_i\, c_{i+1} \cdot \ldots \cdot c_{n-1}\, y_i = c_1 \cdot \ldots \cdot c_{i-1}\, c_{i+1} \cdot \ldots \cdot c_{n-1} (c_i c_n - y_i^{2}). \qquad (12)$$
From (10) and (12) we conclude $\det(F_0) \geq 0$, so the matrix (8) is a positive semi-definite symmetric matrix. It is worth noting that the condition stated by the above theorem does not hold in the general case when we leave more than one non-diagonal non-zero value in the last row and column.

We modify the Gustafson-Kessel algorithm so that it finds clusters that have covariance matrices of the form (8), meaning that their axes are not necessarily parallel to the output axis. We do this by introducing a step 4a to the algorithm:

4a. Convert the covariance matrix to the form (8) by preserving only the largest covariance value in the last row/column.

The intuition for this approach is that we would like to preserve the most significant relation in the shape of the obtained cluster. Because of the conclusions of Theorem 1, all calculations performed in the next steps of the algorithm may succeed.
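Step 4a amounts to zeroing the fuzzy covariance matrix outside the diagonal except for the single most significant input-output covariance. A minimal NumPy sketch follows; the function name and the magnitude-based selection are assumptions of this illustration, not the authors' code.

```python
import numpy as np

def restrict_covariance(F):
    """Convert a fuzzy covariance matrix F to the form (8): keep the diagonal
    and only the largest covariance in the last row/column (step 4a)."""
    F0 = np.diag(np.diag(F)).astype(float)
    off = F[-1, :-1]                          # covariances of the output variable
    i = int(np.argmax(np.abs(off)))           # most significant input-output relation
    F0[i, -1] = F0[-1, i] = F[-1, i]
    return F0
```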
4 Converting Clusters to Fuzzy Rules

Having obtained a set of fuzzy cluster centers and a set of norms induced for those clusters, the task is to create the membership functions of the rules' antecedents. In this example we use an asymmetric Gaussian type of membership function, but it must be noted that any classical type of membership function would fit our method. The membership function is based on 4 parameters determining the peak point and the shape of the left and right sides of the curve:

$$f_{\sigma_1,c_1,\sigma_2,c_2}(x) = \begin{cases} e^{-\frac{(x-c_1)^2}{2\sigma_1^2}}, & x < c_1 \\ e^{-\frac{(x-c_2)^2}{2\sigma_2^2}}, & x > c_2 \\ 1, & c_1 \leq x \leq c_2 \end{cases} \qquad (13)$$
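A direct, vectorized Python transcription of (13) may look as follows (the function name is illustrative):

```python
import numpy as np

def asym_gauss(x, sigma1, c1, sigma2, c2):
    """Asymmetric Gaussian membership function of Eq. (13)."""
    x = np.asarray(x, dtype=float)
    left = np.exp(-(x - c1) ** 2 / (2.0 * sigma1 ** 2))
    right = np.exp(-(x - c2) ** 2 / (2.0 * sigma2 ** 2))
    return np.where(x < c1, left, np.where(x > c2, right, 1.0))
```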
Some authors suggest projecting clusters onto each of the axes using fuzzy projection techniques [1, 10]. A curve fitting technique is applied to adjust the membership function parameters so that the degree of fulfillment of the premise of the rule corresponding to a given cluster reflects the membership degree of the measured samples to that cluster. In the TSK model we assume a prod-type AND operator for the rule premise. It should be noted that our technique also applies to different types of operators (e.g. min). The degree of fulfillment of a rule j is calculated as follows:

$$d_j(x) = \prod_{i=1}^{n-1} f^{(i)}(x^{(i)}), \qquad (14)$$
where $f^{(i)}(x^{(i)})$ is the value of the i-th function of the form (13) for the i-th coordinate of the vector x. We employ non-linear least squares optimization to obtain the parameters of $f^{(i)}$. The objective function under minimization for rule i is given by (15):

$$e(\Sigma^{(i)}, C^{(i)}) = \sum_{j=1}^{N} \left(\mu_{i,j} - d_i(x_j)\right)^2, \qquad (15)$$
where $\Sigma^{(i)}$, $C^{(i)}$ are the sets of parameters of the membership functions used in rule i and $\mu_{i,j}$ is the membership degree of sample j to cluster i. If the Jacobian of the objective function is analytically available we may use it in the calculations (this applies to standard Gaussian membership functions). In all other cases we may calculate a Jacobian approximation using finite differences. In this work we used a subspace trust region approach [3] available in the Matlab Optimization Toolbox, but other least squares curve fitting methods could also be applied. Numerical gradient based methods carry a risk of converging to a local minimum. In this case, however, we have a good starting point for the minimization. We can use the cluster center coordinates as initial values for the C parameters. Having the cluster center defined as very close to optimal, we have less chance of converging to some local minimum. We can also calculate a good initial guess for the Σ parameters, such that neighboring membership functions overlap. However, experiments have shown that this calculation is not necessary, as the function converges without it.
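For illustration, the premise fit minimizing (15) can be reproduced with SciPy's trust-region least-squares solver as a stand-in for the Matlab routine used in the paper. Here `asym_gauss` is the sketch given above, `X` holds the sample input coordinates, `mu` the membership degrees of one cluster, and `centers`/`spreads` are cluster-derived initial guesses; all names are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_premise(X, mu, centers, spreads):
    """Fit one asymmetric Gaussian per input variable so that the product of
    memberships (Eq. 14) matches the cluster membership degrees (Eq. 15)."""
    n_vars = X.shape[1]

    def residuals(p):
        params = p.reshape(n_vars, 4)          # rows: (sigma1, c1, sigma2, c2)
        d = np.ones(len(X))
        for i in range(n_vars):
            s1, c1, s2, c2 = params[i]
            d *= asym_gauss(X[:, i], s1, c1, s2, c2)
        return mu - d                           # residuals of Eq. (15)

    # initial guess: both peak parameters placed at the cluster center
    p0 = np.concatenate([[spreads[i], centers[i], spreads[i], centers[i]]
                         for i in range(n_vars)])
    sol = least_squares(residuals, p0, method="trf")
    return sol.x.reshape(n_vars, 4)
```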
5 Determination of Rule Consequents

Clusters detect groupings of sample data which, due to the construction of the covariance matrix, may be approximated with a linear function. We use weighted least squares linear regression to identify the parameters of the output function for a rule. The philosophy behind using a weighted method is that samples with small membership degrees in a given rule are likely to be evaluated by other rules, so their output should not influence the output function to a big extent. Conversely, samples with high membership
degrees to a rule are evaluated primarily by this rule, so they should have a big impact on the output function. Given the output function for a rule i, we can formulate an error function for the linear regression as shown below:

$$g_i(x^{(1)},\ldots,x^{(n-1)}) = a_0 + \sum_{j=1}^{n-1} a_j x^{(j)}, \qquad (16)$$

$$e_i(A) = \sum_{j=1}^{N} \mu_{i,j} \left[ x_j^{(n)} - g_i\!\left(x_j^{(1)},\ldots,x_j^{(n-1)}\right) \right]^{2}.$$
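The weighted regression of (16) reduces to ordinary least squares after scaling rows by the square roots of the membership degrees; a small NumPy sketch with illustrative names:

```python
import numpy as np

def fit_consequent(X, y, mu):
    """Weighted linear least squares for one TSK rule consequent (Eq. 16):
    minimizes sum_j mu_{i,j} * (y_j - a0 - a . x_j)^2."""
    Phi = np.hstack([np.ones((len(X), 1)), X])    # column of ones for a0
    w = np.sqrt(mu)
    a, *_ = np.linalg.lstsq(Phi * w[:, None], y * w, rcond=None)
    return a                                      # [a0, a1, ..., a_{n-1}]
```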
6 Experimental Results

6.1 Box-Jenkins Gas Furnace
The input data [11] is a series of pairs where u(t) is the rate of flow of gas into the furnace and y(t) is the CO2 concentration at time t. With use of the method described in [1] we conclude that the output y(t) can be predicted with use of 3 variables: y(t-1), u(t-4) and u(t-3). The variable u(t-3) does not significantly improve the performance of the model, while adding computational complexity. As in [6] we use only the y(t-1) and u(t-4) variables. As the learning data set we chose the first half of the samples; the second half is used to calculate the approximation error.
Fig. 1. Membership functions of input variables obtained for gas furnace problem
The membership functions obtained with our approach are depicted in Fig. 1. The resulting TSK rule base is:

IF u(t-4) IS mf 1,1 AND y(t-1) IS mf 1,2 THEN y = -1.38u(t-4) + 0.51y(t-1) + 25.78
IF u(t-4) IS mf 2,1 AND y(t-1) IS mf 2,2 THEN y = -1.39u(t-4) + 0.54y(t-1) + 23.86   (17)
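For illustration only, the rule base (17) can be evaluated as a standard TSK system: prod-type premise activation followed by a membership-weighted average of the linear consequents. The premise parameters below are invented placeholders (the fitted mf parameters are only shown graphically in Fig. 1), so the sketch demonstrates the inference mechanics rather than reproducing the reported numbers.

```python
import numpy as np

# consequent coefficients (a_u, a_y, a_0) taken from rule base (17)
consequents = [(-1.38, 0.51, 25.78), (-1.39, 0.54, 23.86)]

# placeholder premise membership functions built from asym_gauss (parameters invented)
premises = [
    (lambda u: asym_gauss(u, 1.0, -1.5, 1.0, -0.5),    # mf 1,1
     lambda y: asym_gauss(y, 3.0, 47.0, 3.0, 52.0)),   # mf 1,2
    (lambda u: asym_gauss(u, 1.0, 0.5, 1.0, 1.5),      # mf 2,1
     lambda y: asym_gauss(y, 3.0, 54.0, 3.0, 58.0)),   # mf 2,2
]

def tsk_predict(u, y_prev):
    """Prod-type rule activation, then weighted average of rule outputs."""
    weights, outputs = [], []
    for (mf_u, mf_y), (a_u, a_y, a0) in zip(premises, consequents):
        weights.append(float(mf_u(u)) * float(mf_y(y_prev)))
        outputs.append(a_u * u + a_y * y_prev + a0)
    weights = np.array(weights)
    return float(np.dot(weights, outputs) / weights.sum())
```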
Root mean square error (RMSE) for testing set approximation with rule base (17) is 0.391. Table 1 compares our result with results obtained with different methods summarized in [6].
Table 1. Comparison of RMSE for the gas furnace problem

Method        Num. of inputs  Num. of rules  RMSE
Pedrycz (84)        2              81        0.565
Xu (87)             2              25        0.572
Sugeno (91)         6               2        0.261
Sugeno (93)         3               6        0.435
Wang (96)           2               5        0.397
Delgado (99)        2               2        0.396
Rantala (02)        4               5        0.358
This method         2               2        0.391
As can be seen, our method provides good approximation accuracy with a simple model. The Delgado [6] model provides similar accuracy, but uses input membership functions in the product space, hence not achieving the same interpretability as our model. Wang [12] also provides a model of similar accuracy but with a significantly bigger number of rules. The Sugeno [13] model, providing the best accuracy, uses significantly more input information, so these two methods cannot be directly compared on this example.

6.2 Non-linear Function Identification
As another benchmark we use a non-linear function with two input variables:
$$z = (1 + x^{-2} + y^{-1.5})^{2}, \quad 1 \leq x, y \leq 5. \qquad (18)$$
We use 50 random samples for learning and 200 other random samples for the evaluation of the system performance once it is learned. We determined the number of clusters to be 4. Figure 2 presents the actual function vs. our modeling result. As we can see, our approximation does not perform well on the boundaries of the system space. This is due to the fact that very few random samples for learning were selected on the boundary.
Fig. 2. Original function and its fuzzy model
Table 2. Comparison of RMSE for the non-linear function (18)

Method        Num. of rules  RMSE
Wang (96)           6        0.281
Delgado (99)        2        0.266
This method         2        0.233
Table 2 presents the RMSE of our approach compared with other results from the literature.

6.3 Miles Per Gallon (MPG) Prediction
We ran the test against the standard miles per gallon prediction data set [14]. We divided the data set into two equal subsets, performed learning on one of them and measured the RMSE on the other half of the data. We selected 5 inputs for our model (displacement, horsepower, weight, acceleration and year) and 4 rules. The table below shows a comparison of our result with other approaches found in the literature. It must be noted that the differences in MPG prediction are so small that they can be due to the selection of the random learning and testing sets. Our model is more complicated than the Babuska [8] model but provides more interpretability, as the other method uses a transformation of the input variables for the rules. Optimized ANFIS provides similar results to our method but with a more complex underlying model.

Table 3. Comparison of RMSE for MPG approximation
Method                   Inputs  Rules  Training RMSE  Testing RMSE
Jang (96) (linear reg.)     6      -         3.45           3.44
Babuska (02)                5      2         2.72           2.85
ANFIS                       5      6         2.48           2.85
This method                 5      4         2.76           2.84
7 Conclusions

We have shown that existing clustering based approaches to fuzzy modeling may still be improved. By modifying the clustering algorithm in use we are able to obtain accurate fuzzy models and still preserve interpretability. Additionally, it has been shown that curve fitting techniques combined with linear regression methods are a valid approach to converting clusters into fuzzy rules. As the numerical results show, our method provides satisfactory results, very often delivering a simpler model than other approaches. Moreover, the method is extensible and can easily be adapted to find membership functions of types other than Gaussian. It can also be subject to later optimization, providing a very good starting point. Optimization of the model obtained with this method is in the scope of our future work in this area.
References
1. Sugeno, M., Yasukawa, T.: A Fuzzy-Logic-Based Approach to Qualitative Modeling. IEEE Trans. on Fuzzy Systems 1(1), 7–31 (1993)
2. Jang, J.S.R.: ANFIS: Adaptive-network-based fuzzy inference systems. IEEE Trans. on System, Man and Cybernetics 23(3), 665–685 (1993)
3. Coleman, T.F., Li, Y.: An Interior, Trust Region Approach for Nonlinear Minimization Subject to Bounds. SIAM Journal on Optimization 6, 418–445 (1996)
4. Rantala, J., Koivisto, H.: Optimized Subtractive Clustering for Neuro-Fuzzy Models. In: 3rd WSEAS International Conference on Fuzzy Sets and Fuzzy Systems (2002)
5. Wang, W., Zhang, Y.: On fuzzy cluster validity indices. Fuzzy Sets and Systems 158, 2095–2117 (2007)
6. Gomez-Skarmeta, A.F., Delgado, M., Vila, M.A.: About the use of fuzzy clustering techniques for fuzzy model identification. Fuzzy Sets and Systems 106, 179–188 (1999)
7. Parekh, G., Keller, J.M.: Learning the Fuzzy Connectives of a Multilayer Network Using Particle Swarm Optimization. In: IEEE Symposium on Foundations of Computational Intelligence, pp. 591–596 (2007)
8. Abonyi, J., Babuska, R., Szeifert, F.: Modified Gath-Geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models. IEEE Trans. on Systems, Man and Cybernetics 32(5), 612–621 (2002)
9. Gustafson, E.E., Kessel, W.C.: Fuzzy clustering with a fuzzy covariance matrix. In: Proc. of the IEEE Conference on Decision and Control, pp. 761–766 (1979)
10. Klawonn, F., Kruse, R.: Constructing a fuzzy controller from data. Fuzzy Sets and Systems 85, 177–193 (1997)
11. Box, G.E.P., Jenkins, G.M.: Time Series Analysis, Forecasting and Control. Holden Day, San Francisco (1970)
12. Langari, R., Wang, L.: Complex systems modeling via fuzzy logic. IEEE Trans. on Systems, Man, and Cybernetics 26(1), 100–106 (1996)
13. Sugeno, M., Tanaka, K.: Successive identification of a fuzzy model and its applications to prediction of a complex system. Fuzzy Sets and Systems 42(3), 315–334 (1991)
14. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository (2007)
Extending the Four Russian Algorithm to Compute the Edit Script in Linear Space Vamsi Kundeti and Sanguthevar Rajasekaran Department of Computer Science and Engineering University of Connecticut Storrs, CT 06269, USA {vamsik,rajasek}@engr.uconn.edu
Abstract. Computing the edit distance between two strings is one of the most fundamental problems in computer science. The standard dynamic programming based algorithm computes the edit distance and edit script in O(n^2) time and space. Often the edit script is of more importance than the value of the edit distance. The Four Russian Algorithm [1] computes the edit distance in O(n^2 / log n) time but does not address how to compute the edit script within that runtime. Hirschberg [2] gave an algorithm to compute the edit script in linear space but the runtime remained O(n^2). In this paper we present algorithms that compute both the edit script and the edit distance in O(n^2 / log n) time using O(n) space. Keywords: edit distance, edit script, linear space, four russian algorithm, hirschberg's algorithm.
1 Introduction
The edit distance between strings S1 = [a1, a2, a3 ... an] and S2 = [b1, b2, b3 ... bn] is defined as the minimal cost of transforming S1 into S2 using the three operations Insert, Delete, and Change (C) (see e.g., [3]). The first application (global alignment) of the edit distance algorithm to protein sequences was studied by Needleman [4]. Later, algorithms for several variations (such as local alignment, affine gap costs, etc.) of the problem were developed (for example) in [5], [6], and [7]. The first major improvement in the asymptotic runtime for computing the value of the edit distance was achieved in [1]. This algorithm is widely known as the Four Russian Algorithm and it improves the running time by a factor of O(log n) (with a run time of O(n^2 / log n)) to compute just the value of the edit distance. It does not address the problem of computing the actual edit script, which is of wider interest than just the value. Hirschberg [2] has given an algorithm that computes the actual script in O(n^2) time and O(n) space. The space saving idea from [2] was applied to biological problems in [8] and [9]. However the asymptotic complexity of the core algorithm in each of these remained O(n^2). Also, parallel algorithms for the edit distance problem and its application to sequence alignment of biological sequences were studied
extensively (for example) in [10] and [11]. In paper [12] linear space parallel algorithms for the sequence alignment problem were given; however, they assume that O(n^2) is the optimal asymptotic complexity of the sequential algorithm. Please refer to [13] for an excellent survey of all these algorithms. A special case is one where each of these operations is of unit cost. The edit script is the actual sequence of operations that converts S1 into S2. In particular, the edit script is a sequence Escript = {X1, X2, X3 ... Xn}, Xi ∈ {I, D, C}. Standard dynamic programming based algorithms solve both the distance version and the script version in O(n^2) time and O(n^2) space. The main result of this paper is an algorithm for computing the edit distance and edit script in O(n^2 / log n) time and O(n) space. The rest of the paper is organized as follows. In Sec. 2 we provide a summary of the Four Russian algorithm [1]. In Sec. 3 we discuss the O(n^2) time algorithm that consumes O(n) space, and finally in Sec. 4 we show how to compute the edit distance and script using O(n^2 / log n) time and O(n) space.
2 Four Russian Algorithm
In this section we summarize the Four Russian Algorithm. Let D be the dynamic programming table that is filled during the edit distance algorithm. The standard edit distance algorithm fills this table D row by row after initialization of the first row and the first column. Without loss of generality, throughout this paper we assume that all the edit operations cost unit time each. The basic idea behind the Four Russian Algorithm is to partition the dynamic programming table D into small blocks each of width and height equal to t, where t is a parameter to be fixed in the analysis. Each such block is called a t-block. The dynamic programming table is divided into t-blocks such that any two adjacent t-blocks overlap by either a row or column of width (or height) equal to t. See Fig. 1 for more details on how the dynamic programming table D is partitioned. After this partitioning is done the Four Russian algorithm fills up the table D block by block. Algorithm 1 has more details. A quick qualitative analysis of the algorithm is as follows. After the partitioning of the dynamic programming table D into t-blocks we have n^2/t^2 blocks, and if processing each block takes O(t) time then the running time is O(n^2/t). In the case of standard dynamic programming, entries are filled one at a time (rather than one block at a time). Each entry can be filled in O(1) time and hence the total run time is O(n^2). In the Four Russian algorithm, there are n^2/t^2 blocks. In order to be able to fill each block in O(t) time, some preprocessing is done. Theorem 1 is the basis of the preprocessing.

Theorem 1. If D is the edit distance table then |D[i, j] − D[i + 1, j]| ≤ 1 and |D[i, j] − D[i, j + 1]| ≤ 1 for all 0 ≤ i, j ≤ n.

Proof. Note that D[i, j] is defined as the minimum cost of converting S1[1 : i] into S2[1 : j]. Every element of the table D[i, j] is filled based on the values from
D[i − 1, j − 1], D[i − 1, j] or D[i, j − 1]. We have D[i, j] ≥ D[i − 1, j − 1] (the characters S1[i] and S2[j] may be the same or different), D[i, j] ≤ D[i, j − 1] + 1 (the cost of an insert is unity), and D[i, j − 1] ≤ D[i − 1, j − 1] + 1 (the same inequality as the previous one, rewritten for the element D[i, j − 1]). The following inequalities can be derived from the previous ones:

    −D[i, j] ≤ −D[i − 1, j − 1]
    D[i, j − 1] ≤ D[i − 1, j − 1] + 1
    −D[i, j] + D[i, j − 1] ≤ 1
    D[i, j − 1] − D[i, j] ≤ 1
    D[i, j] ≤ D[i, j − 1] + 1   {started with this}
    −1 ≤ D[i, j − 1] − D[i, j]
    |D[i, j − 1] − D[i, j]| ≤ 1

Along the same lines we can also prove that |D[i − 1, j] − D[i, j]| ≤ 1 and D[i − 1, j − 1] ≤ D[i, j].

Theorem 1 essentially states that the value of the edit distance in the dynamic programming table D will either increase by 1, decrease by 1, or remain the same compared to the previous element in any row or column of D. Theorem 1 helps us encode any row or column of D with a vector over {0, 1, −1}. For example, a row in the edit distance table D[i, ∗] = [k, k + 1, k, k, k − 1, k − 2, k − 1] can be encoded with a vector vi = [0, 1, −1, 0, −1, −1, 1]. To characterize any row or column we just need the vector vi and the k corresponding to that particular row or column. For example, if D[i, ∗] = [1, 2, 3, 4, . . . , n], then k = 1 for this row and vi = [0, 1, 1, 1, 1, 1, 1, . . . , 1]. For the computation of the edit distance table D, the leftmost column and the topmost row must be filled (or initialized) before the start of the algorithm. Similarly, in this algorithm we need the topmost row (A) and the leftmost column (B) to compute the edit distance within a t-block (see Fig. 1). Also see Algorithm 2. It is essential that we compute the edit distance within any t-block in constant time. In the Four Russian algorithm the computation of each t-block depends on the variables A, B, K, C, E (see Fig. 1). The variable A represents the top row of the t-block and B represents the left column of the t-block. C and E represent the corresponding substrings of the strings S1 and S2. K is the intersection of A and B. If the value of the variable K is k then from Theorem 1 we can represent A and B as vectors over {0, 1, −1} rather than with exact values along the row and column. As an example, consider the first t-block, which is the intersection of the first t rows and the first t columns of D. For this t-block the variables {A, B, K, C, E} have the following values: K = D[0, 0], A = D[0, ∗] = [0, 1, 1, 1, . . . , 1], B = D[∗, 0] = [0, 1, 1, 1, . . . , 1], C = S2[0, 1, . . . , t], and E = S1[0, 1, . . . , t]. For any t-block we have to compute {A′, B′, K′} as a function of {A, B, K, C, E} in O(1) time. In this example, plugging in {A, B, K, C, E} for the first t-block gives K′ = D[t, t], A′ = [D[0, t], . . . , D[t, t]], B′ = [D[t, 0], . . . , D[t, t]]. To accomplish the task of computing the edit distance in a t-block in O(1) time, we precompute
Fig. 1. Using the preprocessed lookup table {A′, B′, K′} = F(A, B, C, K, E)
all the possible inputs in terms of the variables {A, B, 0, C, E}. We don't have to consider all possible values of K since if K1 is the output value we get with input variables {A, B, 0, C, E}, then the output value for inputs {A, B, K, C, E} would be K1 + K. Thus this encoding (and some preprocessing) helps us in the computation of the edit distance of a t-block in O(1) time. The algorithm is divided into two parts: a pre-processing step and the actual computation.
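A small sketch of the row encoding introduced with Theorem 1: a DP row is stored as its first value k plus a vector of offsets in {−1, 0, 1} (function names are illustrative).

```python
def encode_row(row):
    """Encode a DP row as (k, offsets) with offsets[i] = row[i] - row[i-1]."""
    k = row[0]
    offsets = [0] + [row[i] - row[i - 1] for i in range(1, len(row))]
    return k, offsets

def decode_row(k, offsets):
    """Recover the exact row values from k and the offset vector."""
    row, cur = [], k
    for d in offsets:
        cur += d
        row.append(cur)
    return row

# the example from the text, with k = 5: [k, k+1, k, k, k-1, k-2, k-1]
assert encode_row([5, 6, 5, 5, 4, 3, 4])[1] == [0, 1, -1, 0, -1, -1, 1]
```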
Algorithm 1. Four Russian Algorithm, t is a parameter to be fixed.
INPUT: Strings S1 and S2, Σ, t
OUTPUT: Optimal edit distance
/* Pre-processing step */
F = PreProcess(Σ, t);
for i = 0; i < n; i += t do
    for j = 0; j < n; j += t do
        {A′, B′, D′} = LookUpF(i, j, t);
        [D[i + t, j] . . . D[i + t, j + t]] = A′;
        [D[i, j + t] . . . D[i + t, j + t]] = B′;
    end
end
2.1 Pre-Processing Step
As we can see from the previous description, at any stage of Algorithm 1 we need to do a lookup for the edit distance of a t-block and as a result get the row and column for the adjacent t-blocks. From Theorem 1 it is evident
Algorithm 2. LookUp routine used by Algorithm 1.
INPUT: i, j, t
OUTPUT: A′, B′, D′
A = [D[i, j] . . . D[i, j + t]];
B = [D[i, j] . . . D[i + t, j]];
C = [S2[j] . . . S2[j + t]];
E = [S1[i] . . . S1[i + t]];
K = D[i, j];
/* Encode A, B */
for k = 1; k < t; k++ do
    A[k] = A[k] − A[k − 1];
    B[k] = B[k] − B[k − 1];
end
/* Although K is not used in building the lookup table F, we maintain consistency with Fig. 1 */
return {A′, B′, D′} = F(A, B, C, K, E);
that any input {A, B, K, C, E} (see Fig. 1) to a t-block can be transformed into vectors over {−1, 0, 1}. In the preprocessing stage we try out all possible inputs to a t-block and compute the corresponding output row and column {A′, B′, K′} (see Fig. 1). More formally, the row (A′) and column (B′) that need to be computed for any t-block can be represented as a function F (lookup table) with inputs {A, B, K, C, E}, such that {A′, B′, K′} = F(A, B, K, C, E). This function can be precomputed since we have only a limited number of possibilities. For any given t, we can have 3^t vectors corresponding to each of A and B. For a given alphabet of size Σ we have Σ^t possible inputs corresponding to each of C and E. K will not have any effect since we just have to add K to A′[t] or B′[t] at the end to compute K′. The time to preprocess is thus O((3Σ)^{2t} t^2) and the space for the lookup table F would be O((3Σ)^{2t} t). Since t^2 ≤ (3Σ)^t, if we pick t = log n / (3 log(3Σ)), the preprocessing time as well as the space for the lookup table will be O(n). Here we make use of the fact that the word length of the computer is Θ(log n). This in particular means that a vector of length t can be thought of as one word.

2.2 Computation Step
Once the preprocessing is completed in O(n) time, the main computation step proceeds by scanning the t-blocks row by row and filling up the dynamic programming table D. Algorithm 1 calls Algorithm 2 in the innermost for loop. Algorithm 2 takes O(t) time to encode the actual values in D and calls the function F, which takes O(1) time and returns the row (A′) and column (B′) that are used as input for other t-blocks. The runtime of the entire algorithm is O((n/t)(n/t) · t) = O(n^2/t). Since t = Θ(log n), the run time of the Four Russian Algorithm is O(n^2 / log n).
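One way to see why the table fits in O(n) words is to pack each block input into a single integer key; the sketch below uses illustrative names (the paper does not prescribe a concrete encoding) and also shows the stated choice of t.

```python
import math

def block_key(A, B, C, E, sym_index):
    """Pack offset vectors A, B (entries in {-1,0,1}) and substrings C, E
    (symbols mapped to 0..|Sigma|-1 by sym_index) into one integer key."""
    key = 0
    for d in A + B:                            # 3 possibilities per position
        key = key * 3 + (d + 1)
    for ch in C + E:                           # |Sigma| possibilities per position
        key = key * len(sym_index) + sym_index[ch]
    return key

def choose_t(n, sigma_size):
    """t = log n / (3 log(3|Sigma|)) keeps the number of table entries at O(n)."""
    return max(1, int(math.log(n) / (3 * math.log(3 * sigma_size))))
```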
3 Hirschberg's Algorithm to Compute the Edit Script
In this section we briefly describe Hirschberg's [2] algorithm that computes the edit script in O(n^2) time using O(n) space. The key idea behind this algorithm is an appropriate formulation of the dynamic programming paradigm. We make some definitions before giving the details of the algorithm.

– Let S1 and S2 be strings with |S1| = m and |S2| = n. A substring from index i to j in a string S is denoted as S[i . . . j].
– If S is a string then S^r denotes the reverse of the string.
– Let D(i, j) stand for the optimal edit distance between S1[1 . . . i] and S2[1 . . . j].
– Let D^r(i, j) be the optimal edit distance between S1^r[1 . . . i] and S2^r[1 . . . j].

Lemma 1. D(m, n) = min_{0≤k≤m} {D[n/2, k] + D^r[n/2, m − k]}.

Lemma 1 essentially says that finding the optimal value of the edit distance between strings S1 and S2 can be done as follows: split S1 into two parts (p11 and p12) and S2 into two parts (p21 and p22); find the edit distance (e1) between p11 and p21; find the edit distance (e2) between p12 and p22; finally add both distances to get the final edit distance (e1 + e2). Since we are looking for the minimum edit distance we have to find a breaking point (k) that minimizes the value of (e1 + e2). We would not miss this minimum even if we break one of the strings deterministically and find the corresponding breaking point in the other string. As a result we keep the place where we break one of the strings fixed (say we always break it in the middle). Then we find a breaking point in the other string that gives us the minimum value of (e1 + e2). The k in Lemma 1 can be found in O(mn) time and O(m) space for the following reasons. To find the k at any stage we need two rows (D[n/2, ∗] and D^r[n/2, ∗]) from the forward and reverse dynamic programming tables. Since the values in any row of the dynamic programming table depend only on the previous row, we just have to keep track of the previous row while computing the tables D and D^r. Once we find k we can also determine the path from the previous row (n/2 − 1) to row (n/2) in both dynamic programming tables D and D^r (see Fig. 2). Once we find these subpaths we can continue to do the same for the two subproblems (see Fig. 2) and continue recursively. The run time of the algorithm can be computed from the following recurrence relation:

    T(n, m) = T(n/2, k) + T(n/2, m − k) + mn
    T(n/2, k) + T(n/2, m − k) = mn/2 + mn/4 + · · · = O(mn)

In each stage we use only O(m) space and hence the space complexity is linear.
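A linear-space sketch of the row computation used to locate k: the classical two-row recurrence with unit costs. This is the O(mn) version that Sec. 4 later replaces with block lookups; names are illustrative.

```python
def dp_row(s1, s2, row_index):
    """Return the edit-distance row D(row_index, *) over columns 0..len(s1),
    keeping only two rows in memory (rows range over s2, columns over s1)."""
    m = len(s1)
    prev = list(range(m + 1))                  # row 0
    for i in range(1, row_index + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if s2[i - 1] == s1[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete
                         cur[j - 1] + 1,       # insert
                         prev[j - 1] + cost)   # change / match
        prev = cur
    return prev

def split_point(s1, s2):
    """The k of Lemma 1: argmin of D(n/2, k) + D^r(n/2, m - k)."""
    half = len(s2) // 2
    fwd = dp_row(s1, s2, half)
    rev = dp_row(s1[::-1], s2[::-1], len(s2) - half)
    return min(range(len(s1) + 1), key=lambda k: fwd[k] + rev[len(s1) - k])
```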
Fig. 2. Illustration of Hirschberg's recursive algorithm
4 Our Algorithm
Our algorithm combines the framework of the Four Russian algorithm with that of Hirschberg's algorithm. It finds the edit script in O(n^2 / log n) time using linear space. We extend the Four Russian algorithm to accommodate Lemma 1 and to compute the edit script in O(n) space. At the top level of our algorithm we use a dynamic programming formulation similar to that of Hirschberg. Our algorithm is recursive and in each stage of the algorithm we compute k and also find the sub-path as follows:

    D(m, n) = min_{0≤k≤m} {D(n/2, k) + D^r(n/2, m − k)}

The key question here is how to use the Four Russian framework in the computation of D(n/2, k) and D^r(n/2, m − k) for any k in time better than O(n^2). Hirschberg's algorithm needs the rows D(n/2, ∗) and D^r(n/2, ∗) at any stage of the recursion. In Hirschberg's algorithm, at recursive stage R(m, n), D(n/2, k) and D^r(n/2, m − k) are computed in O(mn) time. We cannot use the same approach since the run time would be Ω(n^2). We have to find a way to compute the rows D(n/2, ∗) and D^r(n/2, ∗) with a run time of O(n^2 / log n). The top-level outline of our algorithm is illustrated by the pseudo-code in TopLevel (see Algorithm 3). The algorithm starts with input strings S1 and S2 of length m and n, respectively. At this level the algorithm applies Lemma 1 and finds k. Since the algorithm requires D(n/2, ∗) and D^r(n/2, ∗) at this level, it calls the algorithm FourCompute to compute the rows D(n/2, ∗), D(n/2 − 1, ∗), D^r(n/2, ∗) and
D^r(n/2 − 1, ∗). Note that although for finding k we require the rows D(n/2, ∗) and D^r(n/2, ∗), to compute the actual edit script we require the rows D(n/2 − 1, ∗) and D^r(n/2 − 1, ∗). Also note that these are passed to the algorithm FindEditScript to report the edit script around index k. Once the algorithm finds the appropriate k for which the edit distance is minimum at this stage, it divides the problem into two sub-problems (see Fig. 2): (S1[1 . . . k1 − 1], S2[1 . . . n/2 − 1]) and (S1[m − k2 + 1 . . . m], S2[n/2 + 1 . . . n]). Observe that k1 and k2 are returned by FindEditScript. FindEditScript determines whether the sub-path passes through the row n/2 (at the corresponding level of recursion) and updates k so that we can create the sub-problems (please see the arcs (sub-paths) in Fig. 2). Once the sub-problems are properly updated, the algorithm solves each of them recursively.

We now describe the algorithm FourCompute, which finds the rows D(n/2, ∗) and D^r(n/2, ∗) (required at each recursive stage of TopLevel (Algorithm 3)) in time O(nm/t), where t is the size of the blocks used in the Four Russian Algorithm. We do exactly the same pre-processing as the Four Russian Algorithm and create the lookup table F. FourCompute is called for both the forward (S1, S2) and reverse strings (S1^r, S2^r). The lookup table F(A, B, K, C, E) has been created for all the strings from Σ of length t, so we can use the same lookup table F for all the calls to FourCompute. A very important fact to remember is that in the Four Russian algorithm, whenever a lookup call is made to F the outputs {A′, B′} are always aligned at rows which are multiples of t, i.e., at any stage of the Four Russian algorithm we only require the values of the rows D(i, ∗) such that i mod t = 0. In our case we cannot directly use the Four Russian Algorithm in the algorithm FourCompute because the lengths of the strings passed to FourCompute from each recursive level of TopLevel are not necessarily multiples of t. Suppose that in some stage of the FourCompute algorithm a row i is not a multiple of t. We apply the Four Russian Algorithm and compute up to row D(⌊i/t⌋t, ∗), find the values in the row D(⌊i/t⌋t − t, ∗) and apply lookups for rows ⌊i/t⌋t − t, ⌊i/t⌋t − t + 1, . . ., and ⌊i/t⌋t − t + i mod t. Basically we need to slide the t-block from the row ⌊i/t⌋t − t to ⌊i/t⌋t − t + i mod t. Thus we can compute any row that is not a multiple of t with an extra (i mod t) · (m/t) time (where m is the length of the string represented across the columns). We can also use the standard edit distance computation in rows ⌊i/t⌋t, ⌊i/t⌋t + 1, . . ., ⌊i/t⌋t + i mod t, which also takes the same amount of extra time. Also consider the space used while we compute the required rows in the FourCompute algorithm. We used only O(m + n) space to store the arrays D[0, ∗] and D[∗, 0] and reused them. So the space complexity of the algorithm FourCompute is linear. The run time is O((n/t)(m/t) · t) = O(nm/t) to compute a row D(n, ∗) or D^r(n, ∗). We arrive at the following Lemma.

Lemma 2. Algorithm FourCompute computes the rows D^r(n/2, ∗), D(n/2, ∗) required by Algorithm TopLevel at any stage in O(mn/t) time and O(m + n) space.
The run time of the complete algorithm is as follows, where c is a constant:

    T(n, m) = T(n/2, k) + T(n/2, m − k) + c·mn/(2t)
    T(n, m) = c(mn/(2t) + mn/(4t) + · · ·) = O(mn/t)

Since t = Θ(log n), the run time is O(n^2 / log n).

Algorithm 3. TopLevel, which calls FourCompute at each recursive level.
Input: Strings S1, S2, |S1| = m, |S2| = n
Output: Edit distance and edit script
D(n/2, ∗) = FourCompute(n/2, m, S1, S2, D(∗, 0), D(0, ∗));
D^r(n/2, ∗) = FourCompute(n/2, m, S1^r, S2^r, D^r(∗, 0), D^r(0, ∗));
/* Find the k which gives the minimum edit distance at this level */
Minimum = (m + n);
for i = 0 to m do
    if (D(n/2, i) + D^r(n/2, m − i)) < Minimum then
        k = i;
        Minimum = D(n/2, i) + D^r(n/2, m − i);
    end
end
/* Compute the edit scripts at this level */
k1 = FindEditScript(D(n/2, ∗), D(n/2 − 1, ∗), k, Forward);
k2 = FindEditScript(D^r(n/2, ∗), D^r(n/2 − 1, ∗), k, Backward);
/* Make recursive calls if necessary */
TopLevel(S1[1 . . . k1 − 1], S2[1 . . . n/2 − 1]);
TopLevel(S1[m − k2 + 1 . . . m], S2[n/2 + 1 . . . n]);

4.1 Space Complexity
The space complexity is the maximum space required at any stage of the algorithm. We have two major stages where we need to analyze the space complexity: the first during the execution of the entire algorithm and the second during preprocessing and storing the lookup table.

4.2 Space during the Execution
The space for algorithm TopLevel is clearly linear since we need to store just 4 rows at any stage: the rows D(n/2, ∗), D(n/2 − 1, ∗), D^r(n/2, ∗) and D^r(n/2 − 1, ∗). From Lemma 2 the space required for FourCompute is also linear. So the space complexity of the algorithm during execution is linear.

4.3 Space for Storing Lookup Table F
We also need to consider the space for storing the lookup table F. The space required to store the lookup table F is also linear for an appropriate value of t (as has been shown in Sec. 2.1). The runtime of the algorithm is O(n^2 / log n).
5 Conclusion
In this paper we have shown that we can compute both the edit distance and the edit script in $O(n^2/\log n)$ time using $O(n)$ space. Acknowledgments. This research has been supported in part by NSF Grant ITR-0326155 and a UTC endowment.
Accuracy of Baseline and Complex Methods Applied to Morphosyntactic Tagging of Polish Marcin Kuta, Michal Wrzeszcz, Pawel Chrzaszcz, and Jacek Kitowski Institute of Computer Science, AGH-UST, al. Mickiewicza 30, Kraków, Poland {mkuta,kito}@agh.edu.pl
Abstract. The paper presents baseline and complex part-of-speech taggers applied to the modified corpus of Frequency Dictionary of Contemporary Polish. The accuracy of 5 baseline part-of-speech taggers is reported. On the basis of these results, complex methods are developed. Thematic split and attribute split methods are proposed and evaluated. Finally, the tagging accuracy of voting methods is evaluated. The most accurate baseline taggers are SVMTool (for the simple tagset) and fnTBL (for the complex tagset). The voting method called Total Precision achieves the top accuracy among all examined methods. Keywords: part-of-speech tagging, natural language processing.
1
Introduction
Part-of-speech (POS) tagging algorithms are intensively exploited in a wide range of applications including syntactic and semantic parsing, speech recognition and generation, ontology construction, machine translation, text understanding, information retrieval and many others [1]. Unfortunately, POS tagging of highly inflecting languages like Polish is much more challenging than its application to analytic languages (e.g. English or French), as the former are annotated with large, complex tagsets describing many morphological categories. POS tagging algorithms are computationally time demanding, especially when applied to inflecting languages; in this domain, training times exceeding 24 h are nothing extraordinary. The time requirements of complex algorithms like split models or voting methods are moreover one order of magnitude higher. The paper examines the accuracy of selected baseline algorithms applied to morphosyntactic tagging of Polish. Next, more complex methods are investigated, both originally proposed (split models) and already known but not yet evaluated on Polish (voting methods). We also present results for the simple tagset (only the first attribute of each tag considered) to provide a point of reference to English and other languages described by small tagsets. The taggers are evaluated on the modified corpus of Frequency Dictionary of Contemporary Polish (m-FDCP), the authors' improved version [2] of the standard FDCP corpus available at the [3] site. The m-FDCP corpus is annotated with the complex tagset, containing over 1200 tags. Tags consist of a set of attributes,
each attribute describing a selected morphological category. A token is the entity subject to tagging. A word segment is a token containing at least one letter or digit. By raw text we mean a sequence of tokens without their tags.
2
Baseline Tagging Algorithms
POS tagging algorithms are roughly divided into statistical and rule-based methods. For a given token w, all methods examine its context, a window of N tokens and tags centred on the token w. The rule-based methods take into account a wide context, which is desirable for languages containing long-distance syntactic dependencies. Statistical algorithms map a sequence of tokens into a sequence of tags with a probability model, which describes the occurrence of the most probable sequence of tags for a given sequence of tokens. 2.1
Evaluated Algorithms
In the paper we first investigate five baseline algorithms, providing new results compared to [2] [4]. The methods furthermore serve as components for the construction of more sophisticated taggers.

Hidden Markov Model (HMM). The algorithm belongs to the statistical methods. Given a sequence of tokens, $w_1, \dots, w_n$, the HMM tagger assigns a sequence of tags, $T = (t_1, \dots, t_n)$, according to the formula
$$\hat{T} = \arg\max_{T} \prod_{i}^{n} p(w_i \mid t_i) \cdot p(t_i \mid t_{i-1}, \dots, t_{i-N}), \qquad (1)$$
where $p(w_i \mid t_i)$ is the conditional probability of occurrence of word $w_i$ given that tag $t_i$ occurred, and $p(t_i \mid t_{i-1}, \dots, t_{i-N})$ is the conditional probability of occurrence of tag $t_i$ given that the tag sequence $t_{i-1}, \dots, t_{i-N}$ previously occurred.

Maximum entropy. This statistical method [5] aims to maximize the entropy function by selection of binary features reflecting dependencies in a training corpus. The model assumes a set of binary features, $f_j$, defined on the combination of a tag $t_i$ and its context $c$. The probabilistic model is built from the family of models
$$p(t_i, c) = \pi \mu \prod_{j} \alpha_j^{f_j(t_i, c)}, \qquad (2)$$
where $p(t_i, c)$ stands for the joint distribution of tags and contexts and $\pi$, $\mu$ are normalisation factors ensuring that $p(\cdot,\cdot)$ forms a probability function.

Memory-based learning. The algorithm exploits a rule-based approach. The tagger [6] acquires examples from training corpora, which are later used in the tagging process. During the learning process, memory-based taggers store in memory a set of examples $(t_i, c_i)$, where $t_i$ denotes the tag and $c_i$ its context. Given a token
w in context c, the memory-based tagger assigns it a tag $t_k$ such that the distance between $c$ and $c_k$ is minimal.

Transformation-based error-driven learning. This rule-based method [7] starts with assigning a trivial sequence of tags to a given tokenised text. The target sequence of tags is determined by applying a series of transformations. Each transformation, F, is a rule of the form: "Replace the value of tag t with value y if the current context c fulfils condition φ." The core of the learning process is the algorithm for finding suitable transformations.

Support vector machines. The SVM is a statistical algorithm which maps input data to a higher dimensional space and then constructs a linear separating hyperplane with the maximal margin. The mapping is done with kernel functions, of which linear, polynomial, radial basis (RBF) and sigmoid functions are the most used. The SVM approach achieves 97.16% accuracy for English [8]. As implementations of the above algorithms, the following taggers have been chosen for evaluation:

Table 1. Taggers used in experiments

  Algorithm             | Tagger name
  ----------------------|------------------------------------------
  HMM                   | TnT [9]
  Maximum entropy       | MXPost [5] (referred to further as MXP)
  Transformation based  | fnTBL [10]
  Memory based          | MBT [6]
  SVM                   | SVMTool [8] (referred to further as SVM)
Next the complex methods have been worked out: the split models have been elaborated and the collective methods evaluated.
3
Split Models
3.1
Thematic Split
To benefit from the thematic split approach, the corpus structure should be nonuniform, i.e., it should consist of segments diversified in language style. This is the case for the m-FDCP corpus, containing 5 segments (thematic parts) differing in vocabulary, style, etc.; e.g., the average sentence length varies from 10.42 tokens/sentence (artistic drama) to 23.27 tokens/sentence (popular science). These differences mean that tagging rules useful in one segment may become inefficient for another one. Thus, instead of providing one overall language model created from the whole corpus, it is worth considering building a number of separate models, each acquired with a baseline tagger from a different thematic part.
3.2
Attribute Split
The attribute split method is applicable only to corpora annotated with complex tagsets. Assuming the tagset provides for the presence of K morphological categories, the entire corpus is replicated K times, each copy corresponding to one morphological category. The i-th copy (1 ≤ i ≤ K) contains the whole text (all tokens), annotated with a small tagset $T_i$. The tagset $T_i$ is obtained from the complex tagset by removing from each tag all attributes except the attribute describing the i-th morphological category. If the i-th morphological category is not applicable to a given token, the token is annotated with a special tag none in the relevant copy. Next, the training procedure and the tagging of the raw test set are performed separately on each copy with a given baseline algorithm. The partition into training and test sets remains the same as in the original corpus annotated with the complex tagset. Finally, the K output files generated by tagging the raw test set are merged into one file, where each token is again annotated with all the relevant morphological categories. The merged file is evaluated against the test set of the corpus annotated with the complex tagset.
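As an illustration, the bookkeeping of the attribute split method is sketched below. This is only a minimal sketch: the tag representation (a K-tuple of attribute values) and the none marker are assumptions for illustration, and the actual training and tagging would be delegated to one of the baseline taggers of Table 1.

```python
# Hypothetical sketch of attribute split: a complex tag is treated as a tuple
# of K attribute values, the corpus is split into K single-attribute copies,
# each copy is tagged independently, and the outputs are merged back.

K = 9  # number of morphological categories in the tagset

def split_corpus(corpus):
    """corpus: list of (token, complex_tag) with complex_tag a K-tuple."""
    copies = [[] for _ in range(K)]
    for token, tag in corpus:
        for i in range(K):
            # attribute not applicable to this token -> special tag 'none'
            copies[i].append((token, tag[i] if tag[i] is not None else "none"))
    return copies

def merge_outputs(tagged_copies):
    """Zip the K per-attribute outputs back into complex tags."""
    merged = []
    for rows in zip(*tagged_copies):
        token = rows[0][0]
        merged.append((token, tuple(tag for _, tag in rows)))
    return merged
```

Only the splitting and merging steps are shown; evaluation against the complex-tagset test set proceeds exactly as for the baseline taggers.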
4
Collective Methods
Collective methods and their performance on English and Dutch are shown in the fundamental work [11]. The idea of collective methods is based on the assumption that different baseline methods make errors in different places, i.e., baseline methods are to some extent complementary. The higher the complementarity of the taggers, the bigger the chance that the combined system compensates the errors of its constituents and performs better than the components alone. A few independent baseline taggers (components) propose a tag simultaneously [11] [12] [13] according to their algorithms described in Sect. 2.1. The proposed tags are compared and the best tag is selected according to an arbitration mechanism. Several voting strategies are available as the arbitration. Assuming the reference test corpus contains n tokens $w_1, \dots, w_n$, a token $w_i$ is annotated in the corpus with a tag $t_i$ ($1 \le i \le n$), and a tagger A guesses for a token $w_i$ a tag $t_i^A$, the accuracy of the tagger A is defined as follows:
$$\mathrm{accuracy} \stackrel{\mathrm{df}}{=} \frac{\#\text{correctly tagged tokens}}{\#\text{all tokens}} = \frac{\sum_{i=1}^{n} \delta(t_i^A, t_i)}{n}, \qquad (3)$$
where $\delta$ is the Kronecker delta function. The sentence accuracy is the ratio of entirely correctly tagged sentences to the total number of sentences. If a tagger B guesses a tag $t_i^B$ respectively, the complementarity of the tagger B to the tagger A, comp(B|A), is determined as follows [14]:
$$\mathrm{comp}(B|A) \stackrel{\mathrm{df}}{=} 1 - \frac{\#\text{common errors of taggers A and B}}{\#\text{errors of tagger A}}. \qquad (4)$$
Given a tag X, the precision and recall of the tagger A on the tag X are given as:
$$\mathrm{prec}_X \stackrel{\mathrm{df}}{=} \frac{\#\text{tokens tagged and annotated with X}}{\#\text{tokens tagged with X}}, \qquad (5)$$
$$\mathrm{recall}_X \stackrel{\mathrm{df}}{=} \frac{\#\text{tokens tagged and annotated with X}}{\#\text{tokens annotated with X}}. \qquad (6)$$
Majority is a simple voting method, with exactly one vote assigned to each tagger. With weighted voting methods, each tagger votes with its accuracy (Total Precision method) or with $\mathrm{prec}_X$ (Tag Precision or Precision-Recall method). Additionally, a tagger may be obliged to support tags other than the one it suggested itself (Precision-Recall method, with weight $1 - \mathrm{recall}_Y$). Ties are resolved by a random selection amongst the winning tags. The vote strength of the particular methods is summarised in Table 2.

Table 2. Vote strength of a component tagger in various voting methods; X stands for the tag proposed by the tagger, Y (Y ≠ X) represents each tag proposed by the opposition

  Voting method     | Tag X     | Tag Y
  ------------------|-----------|-------------
  Majority          | 1         | 0
  Total Precision   | accuracy  | 0
  Tag Precision     | precX     | 0
  Precision-Recall  | precX     | 1 − recallY
For each tagger parameters accuracy, precX and recallX (for each tag X) are to be determined with help of a tuning set, disjoint from a test set, to avoid artificial boosting of the above methods.
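A minimal sketch of the per-token arbitration under these weighting schemes is shown below. The data structures (dictionaries of tagger accuracies and per-tag precision/recall estimated on the tuning set) are illustrative assumptions, not the interfaces of the original system.

```python
import random
from collections import defaultdict

def vote(proposals, accuracy, prec, recall, method="total_precision"):
    """proposals: {tagger: proposed_tag}; returns the winning tag for one token."""
    scores = defaultdict(float)
    tags = set(proposals.values())
    for tagger, tag_x in proposals.items():
        if method == "majority":
            scores[tag_x] += 1.0
        elif method == "total_precision":
            scores[tag_x] += accuracy[tagger]
        elif method == "tag_precision":
            scores[tag_x] += prec[tagger].get(tag_x, 0.0)
        elif method == "precision_recall":
            scores[tag_x] += prec[tagger].get(tag_x, 0.0)
            for tag_y in tags - {tag_x}:            # support for opposing tags
                scores[tag_y] += 1.0 - recall[tagger].get(tag_y, 0.0)
    best = max(scores.values())
    return random.choice([t for t, s in scores.items() if s == best])  # random tie-break
```

The accuracy, prec and recall weights correspond directly to Table 2 and would be estimated on the tuning set described above.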
5
Evaluated Data and Experiments Setup
We used the modified corpus of Frequency Dictionary of Contemporary Polish [15], annotated with the slightly abridged version of the IPI PAN tagset [16]. The m-FDCP corpus is partially corrected and disambiguated version of the FDCP corpus [3], both with manual checking and automatic procedures. The whole process of corpus improving has been described in details in [2]. The corpus is balanced between five thematic parts: (A) popular science, (B) news dispatches, (C) editorials and longer articles, (D) artistic prose and (E) artistic drama, each standing for approximately 20% of the corpus and representing different style of the language. The used tagset provides for 9 morphological categories: grammatical class (part of speech or POS), number, case, gender, person, degree, aspect, negation, vocalicity. The main parameters of the corpus are gathered in Table 3 (4th column). The baseline algorithms have been evaluated with the split of the m-FDCP corpus to a training and test set, standing for 90% and 10% of the corpus, respectively. The balanced character of the training and test set was carefully preserved within the split. Their characteristics are gathered in Table 3. The setup for the thematic split approach was the following: for each of 5 thematic parts of the m-FDCP corpus the separate experiment with the baseline
Table 3. Main parameters of the m-FDCP corpus [2]
                         Training (90%)  Test (10%)  Full (100%)
  tokens                 592729          65927       658656
  word segments          496907          55139       552046
  sentences              36601           4211        40812
  different tokens       87097           19557       92872

  Simple tagset
  tagset size            30              30          30
  ambiguous tokens, %    26.15           26.19       26.16
  mean token ambiguity   1.44            1.43        1.44

  Complex tagset
  tagset size            1191            724         1243
  ambiguous tokens, %    47.76           47.65       47.74
  mean token ambiguity   3.12            3.12        3.12
taggers was performed. Each part was split to a training and test set in 90%/10% ratio (18% and 2% of the whole corpus respectively). In the attribute split approach we used the entire m-FDCP corpus, divided to training and test sets as for baseline tagging. This corpus was replicated 9 times, each with an appropriate tagset, as there are 9 morphological categories. The setup for voting methods was slightly more complex. We avoided partition of the corpus to training, tuning and test sets in 80%/10%/10% ratio [12] but instead the tuning set was created on the base of the 90% training set, used already within baseline tagging experiments. The 90% training set was divided into 9 equal parts among which 8 parts served for training and one part for testing. The procedure was repeated 9 times, with a different part serving for testing each time. The 9 output files were merged into one file, standing for the tuning set; cf. [11]. In this way two important aims have been achieved. We got bigger training set (90% instead of 80% of the corpus) for training the baseline taggers and at the same time 9 times bigger tuning set (90% instead of 10% of the corpus). As baseline components we used the taggers from Table 1, prepared already for the baseline experiments. But when applied to the complex tagset, voting methods omitted the SVM tagger, as achieving much lower accuracy than the rest of components. All experiments were performed at the ACC Cyfronet AGH-UST site on the SMP supercomputer, SGI Altix 3700, equipped with 128 1.5 GHz Intel Itanium 2 processors, 256 GB RAM and 4.75 TB disk storage. The taggers have been compiled with standard optimization level. Depending on baseline tagger chosen training time varies considerably from 3 to 980 · 102 seconds and tagging speed from 5 to 220 · 102 tokens/second [4].
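The tuning-set construction just described amounts to a 9-fold jackknife over the training data; a compact sketch is given below, with train and tag as stand-ins for an actual baseline tagger rather than any real tool's interface.

```python
def build_tuning_set(train_set, train, tag, folds=9):
    """train_set: list of (token, gold_tag).  Returns (token, gold, predicted)
    triples where each fold is tagged by a model trained on the other folds."""
    size = len(train_set) // folds
    parts = [train_set[i * size:(i + 1) * size] for i in range(folds - 1)]
    parts.append(train_set[(folds - 1) * size:])      # last part takes the remainder
    tuning = []
    for i, held_out in enumerate(parts):
        rest = [item for j, part in enumerate(parts) if j != i for item in part]
        model = train(rest)                           # train on the other 8 parts
        predicted = tag(model, [tok for tok, _ in held_out])
        tuning.extend((tok, gold, pred)
                      for (tok, gold), pred in zip(held_out, predicted))
    return tuning
```

The merged output plays the role of the tuning set on which the accuracy, precision and recall weights of the voting methods are measured.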
6 Results

6.1 Results for Baseline Methods
The accuracy of the 5 baseline tagging algorithms is presented in Table 4.

Table 4. Accuracy of baseline taggers trained on 90% of the m-FDCP corpus, [%]

  Simple tagset                     TnT    MXP    fnTBL  MBT    SVM
  All tokens                        96.20  96.30  96.51  95.74  96.74
  Known tokens                      96.98  97.01  97.51  97.10  97.52
  Unknown tokens                    88.65  89.43  86.89  82.60  89.14
  Ambiguous tokens                  89.50  91.09  91.36  89.94  91.36
  Word segments                     95.46  95.57  95.83  94.91  96.10
  Word segments with known tags     96.94  96.79  97.55  97.08  97.60
  Word segments with unknown tags   0.00   28.21  3.85   0.00   0.00
  Unknown word segments             88.65  89.43  86.90  82.60  89.15
  Sentences                         61.48  62.15  63.71  58.54  65.16

  Complex tagset                    TnT    MXP    fnTBL  MBT    SVM
  All tokens                        86.33  85.00  86.79  82.31  73.54
  Known tokens                      88.97  87.53  89.76  85.75  81.12
  Unknown tokens                    60.86  60.55  58.09  49.06  0.23
  Ambiguous tokens                  78.66  78.71  80.34  72.48  63.44
  Word segments                     83.66  82.07  84.21  78.85  68.36
  Word segments with known tags     90.73  87.44  90.27  86.58  80.69
  Word segments with unknown tags   0.00   29.84  30.50  0.66   0.00
  Unknown word segments             60.86  60.55  58.09  49.05  0.23
  Sentences                         28.95  26.88  29.87  22.51  12.04
6.2 Results for Split Models
The observed decline of accuracy with the thematic split model (Table 5, columns 2-7) is due to the considerable reduction of training set sizes compared to the setup for the baseline algorithms, which might mask the benefits of the thematic split. To prevent this masking effect we performed another experiment: we examined the baseline algorithms with a uniform (balanced) training and test set of the same sizes as for the thematic split, i.e., 18% and 2% of the entire corpus (Table 5, column 8). The higher accuracy of the thematic split model over this additional experiment is apparent (column 8 vs. column 7 of Table 5). We also note an accuracy degradation of the attribute split model (Table 6) compared to the baseline taggers. Only the attribute split model for SVM shows an improvement over its baseline tagger, which can be explained by the considerably low accuracy of the SVM baseline.
Table 5. Accuracy of taggers trained on thematic parts A-E, average accuracy (arithmetic mean) over all thematic parts, and accuracy of taggers trained on the uniform 18%/2% set, [%]

  Simple tagset   A      B      C      D      E      average  uniform
  TnT             95.07  95.24  95.24  96.49  95.39  95.49    94.82
  MXP             94.83  94.85  94.63  95.98  94.66  94.99    94.46
  fnTBL           94.58  95.40  94.98  95.63  95.20  95.16    94.64
  MBT             94.00  94.69  94.14  95.61  94.40  94.57    93.80
  SVM             95.39  95.83  95.44  96.62  95.49  95.75    94.93

  Complex tagset  A      B      C      D      E      average  uniform
  TnT             82.61  83.35  82.24  82.39  85.79  83.28    83.26
  MXP             79.05  80.34  79.47  79.83  84.02  80.54    79.22
  fnTBL           80.74  80.80  80.43  80.99  84.00  81.39    80.92
  MBT             77.91  77.80  77.13  78.42  82.92  78.84    78.03
  SVM             66.69  66.72  65.42  67.94  74.53  68.26    67.69
Table 6. Tagging accuracy of the attribute split model applied to the complex tagset; the following rows give the accuracy of its individual components, [%]

                    TnT     MXP     fnTBL   MBT     SVM
  attribute split   81.73   82.55   81.73   81.00   84.06
  POS               96.20   96.30   96.44   95.74   96.74
  number            97.08   97.10   96.74   96.77   97.42
  case              91.00   92.81   92.08   91.03   92.81
  gender            92.46   91.95   91.13   91.86   93.01
  person            99.31   99.45   99.54   99.38   99.53
  degree            98.13   98.44   98.59   98.11   98.58
  aspect            97.73   97.91   98.03   97.54   98.19
  negation          98.82   98.91   98.91   98.70   98.92
  vocalicity        100.00  100.00  100.00  100.00  100.00
6.3 Results for Voting Methods
The potential for improving the baseline algorithms by voting methods is expressed by complementarity (Table 7). A complementarity value near 0% indicates that a pair of taggers gives similar results and no accuracy increase can be expected from voting methods. The closer the complementarity values are to 100%, the higher the margin for an accuracy increase of voting methods. The accuracy of the voting methods themselves is given in Table 8. The methods are ordered according to their growing complexity. The effort connected with gathering the weights required by the voting methods and with the creation of the tuning set paid off: voting methods achieve the highest accuracy among all presented algorithms. The results of the Total Precision method are especially encouraging.
Table 7. Complementarity, comp(B|A), of baseline taggers trained on 90% of the m-FDCP corpus, [%]

  (a) simple tagset
  B \ A    TnT    MXP    fnTBL  MBT    SVM
  TnT      -      35.72  31.29  39.21  25.74
  MXP      37.32  -      36.72  46.58  30.30
  fnTBL    36.84  40.35  -      41.03  25.65
  MBT      31.80  38.55  28.03  -      23.79
  SVM      36.16  38.55  30.47  41.60  -

  (b) complex tagset
  B \ A    TnT    MXP    fnTBL  MBT    SVM
  TnT      -      39.13  32.28  39.25  55.39
  MXP      33.20  -      33.25  43.19  56.42
  fnTBL    34.56  41.22  -      41.07  54.05
  MBT      21.39  33.01  21.09  -      39.96
  SVM      13.62  23.12  7.94   10.17  -
Table 8. Accuracy of voting methods, [%]

  Voting method     Simple tagset  Complex tagset
  Majority          96.93          87.14
  Total Precision   96.95          88.03
  Tag Precision     96.93          87.41
  Precision Recall  96.93          87.21
7
Conclusions
According to our experiments, the following conclusions can be drawn. The SVM tagger achieves the highest, state-of-the-art accuracy among the baseline methods for the simple tagset, although it is useless when applied to the complex tagset. For the complex tagset, fnTBL yields both the highest overall accuracy and the highest sentence accuracy among the baseline methods. If the amount of unknown tokens is prevailing, MXPost should be considered. Attribute split models give lower results than the baseline taggers except in the SVM case, whose baseline accuracy is, however, significantly low. The thematic split model has not outperformed any baseline tagger. The additional experiment with the uniform 20% set of the m-FDCP corpus proved, however, the potential usefulness of the model. The condition of its applicability is that a corpus consists of several thematic segments. The segments should be large enough that the benefit of separate thematic models dominates the accuracy degradation tied to the reduction of the training set size. Voting methods are the most complex methods, requiring the preparation of several baseline taggers. Elaboration of a reliable tuning set is another computationally demanding issue. Each of these methods outperforms the most accurate baseline method. The Total Precision method achieves the highest accuracy and at the same time is simpler than the Tag Precision and Precision Recall methods, yielding in simplicity only to the Majority method. Acknowledgments. This research is partially supported by the Polish Ministry of Science and Higher Education, grant no. 11.11.120.777. ACC CYFRONET AGH is acknowledged for the computing time.
References 1. Mauco, M., Leonardi, M.: A derivation strategy for formal specifications from natural language requirements models. Computing and Informatics 26(4), 421–445 (2007) P., Kitowski, J.: Increasing quality of the Corpus of Frequency 2. Kuta, M., Chrzaszcz, Dictionary of Contemporary Polish for morphosyntactic tagging of the Polish language. Computing and Informatics (to appear) 3. Corpus of Frequency Dictionary of Contemporary Polish, http://www.mimuw.edu.pl/polszczyzna P., Kitowski, J.: A case study of algorithms for morphosyntac4. Kuta, M., Chrzaszcz, tic tagging of Polish language. Computing and Informatics 26(6), 627–647 (2007) 5. Ratnaparkhi, A.: A maximum entropy model for part-of-speech tagging. In: Proc. of the 1st Conf. on Empirical Methods in Natural Language Processing, Univ. of Pennsylvania, USA, pp. 133–142 (1996) 6. Daelemans, W., Zavrel, J., Berck, P., Gillis, S.: MBT: A memory-based part of speech tagger-generator. In: Proc. of the 4th Workshop on Very Large Corpora, Copenhagen, Denmark, pp. 14–27 (1996) 7. Brill, E.: Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics 21(4), 543–565 (1995) 8. Gim´enez, J., M` arquez, L.: SVMTool: A general POS tagger generator based on Support Vector Machines. In: Proc. of the 4th Int. Conf. on Language Resources and Evaluation, Lisbon, Portugal, pp. 43–46 (2004) 9. Brants, T.: TnT - a statistical part-of-speech tagger. In: Proc. of the 6th Applied Natural Language Processing Conf., Seattle, USA, pp. 224–231 (2000) 10. Florian, R., Ngai, G.: Fast Transformation-Based Learning Toolkit manual. John Hopkins Univ., USA (2001), http://nlp.cs.jhu.edu/~rflorian/fntbl 11. van Halteren, H., Zavrel, J., Daelemans, W.: Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics 27(2), 199–229 (2001) 12. van Halteren, H., Zavrel, J., Daelemans, W.: Improving data driven wordclass tagging by system combination. In: Proc. of the 36th Annual Meeting on Association for Computational Linguistics, Montr´eal, Canada, vol. 1, pp. 491–497 (1998) 13. Schr¨ oder, I.: A Case Study in Part-of-Speech Tagging Using the ICOPOST Toolkit, Technical report FBI-HH-M-314/02, Univ. of Hamburg, Germany (2002) 14. Brill, E., Wu, J.: Classifier combination for improved lexical disambiguation. In: Proc. of the 7th Int. Conf. on Computational Linguistics, San Francisco, USA, pp. 191–195 (1998) 15. Modified corpus of Frequency Dictionary of Contemporary Polish, http://nlp.icsr.agh.edu.pl 16. IPI PAN Corpus resources, http://korpus.pl
Synonymous Chinese Transliterations Retrieval from World Wide Web by Using Association Words Chung-Chian Hsu and Chien-Hsing Chen National Yunlin University of Science and Technology, Taiwan {hsucc,g9423809}@yuntech.edu.tw
Abstract. We present a framework for mining synonymous transliterations from a set of Web pages collected via a search engine. An integrated statistical measure is proposed to form search keywords for a search engine in order to retrieve relevant Web snippets. We employ a scheme of comparing the similarity between two transliterations to aid in identifying synonymous transliterations. Experimental results show that the average number of harvesting synonymous transliterations is about 5.04 for an input transliteration. The retrieval results could be beneficial for constructing ontology, especially, in the domain of foreign person names. Keywords: synonymous transliteration, cross lingual information retrieval, Chinese transliteration, person names, ontology.
1 Introduction A transliteration is a local representation of a foreign word, rendering its pronunciation in the alphabet of the local language. With many different translators working without a common standard, there may be many different transliterations for the same proper noun. For example, the inconsistent Chinese transliterations 賓拉登 (bin la deng), 本拉登 (ben la deng) and 本拉丹 (ben la dan) are all translated from the foreign name "Bin Laden". Unfortunately, a person may know only one of those transliterations. As a result, the synonymous transliteration problem may create comprehension obstacles while one is reading. More importantly, it also results in incomplete search results when a user inputs only one of the transliterations to a search engine. For instance, using 賓拉登 (bin la deng) as a search keyword cannot retrieve the Web pages which use 本拉登 (ben la deng) as the transliteration for Bin Laden. In this paper, we attempt to propose a framework for automatically extracting as many synonymous transliterations as possible from the Web with respect to a given input transliteration, as a first step towards the problem. The research result is beneficial to constructing ontology, especially in the domain of famous person names. Some major tasks in natural language processing such as machine translation, named entity recognition, automatic summarization, information extraction and cross-language information retrieval (CLIR) have treated Web corpora as a good knowledge
source for extracting useful information. Search engines have been considered an important tool to retrieve relevant documents. However, a simple, short query usually fail in returning only highly relevant documents and instead a huge amount of Web pages in diversified topics are usually returned. A short query expanded by additional relevant search keywords could help to limit the retrieved pages to what the user is intended. Work in the literature such as query extension [1] proposed some techniques for identifying proper keywords for extension. We follow this idea for collecting high quality candidate snippets which might contain synonymous transliterations. The traditional approaches in CLIR usually require a parallel corpus which suffers from bias and time-consuming due to manually collecting. Instead, we propose an effective framework to mining synonymous transliterations from Web snippets returned by a search engine. A critical step is to use proper keywords for collecting a limited amount of snippets which could include as many synonymous transliterations as possible. To achieve this goal, we use a measure which integrates several statistic approaches of keyword determination so as to raise the keyword quality. After retrieving relevant documents via a search engine, we apply a comparison scheme to determine whether an unknown word segmented from the retrieved snippets is indeed a synonymous transliteration. Our scheme is based on comparing digitalized physical sounds of Chinese characters. The traditional approaches in CLIR are usually grapheme-based or phonetic-based. Compared to those approaches, our approach possesses more powerful discrimination capability.
2 Candidate Snippets Collection We propose a procedure, presented in Fig. 1, for collecting candidate Web snippets in which synonymous transliterations may appear. First, the transliteration (TL) is input to collect a set of n snippets, called core snippets. After text preprocessing, a set of m keywords, called association words, which are highly associated with the TL, is extracted from the core snippets. The association words are used to form search keywords to retrieve a set of k snippets from the Web, called candidate snippets, which are considered likely to contain synonymous transliterations.
[Fig. 1 flowchart: TL → Downloading → Core Snippets → Feature Selection → Association Words → Keywords Formation → Search Keywords → Retrieving → Candidate Snippets]
Fig. 1. A procedure of collecting candidate Web snippets
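The pipeline of Fig. 1 can be outlined as follows. This is only a hypothetical sketch: search_snippets stands in for a real search-engine API, and select_association_words and form_queries are placeholders for the steps detailed in Sects. 2.1 and 2.2.

```python
def collect_candidate_snippets(tl, search_snippets, select_association_words,
                               form_queries, n=20, k=300):
    """Collect candidate Web snippets for an input transliteration `tl`."""
    core = search_snippets(tl, limit=n)            # core snippets for the TL
    assoc = select_association_words(tl, core)     # Sect. 2.1: fused ranking
    candidates = []
    for query in form_queries(tl, assoc):          # Sect. 2.2: query strategies
        candidates.extend(search_snippets(query, limit=k))
    return candidates
```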
2.1 Association Words Selection Several statistical methods [2] can be used to select feature terms with respect to a document category by measuring the association strength between a term and the category, including Information gain (IG), Mutual information (MI), Chi-square (CHI), Correlation coefficient (CC), Relevance score (RS), Odds ratio (OR) and GSS Coefficient (GSS). A fusion approach which integrates features selected by different methods may improve the quality of features, reduce noise, and avoid overfitting [2].
Therefore, to estimate the strength of association between a term $t_k$ and an input transliteration $c_i$, we employ a fusion model integrating six popular feature selection functions. To calculate the strength, we need to compute various joint and conditional probabilities. Recently, several researchers proposed using the returned count of a query to a search engine for estimating term relationships. Cheng et al. [3] used the returned page counts from the search engine to estimate the association strength between two terms. Cilibrasi and Vitanyi [4] used the returned page counts to measure the information distance so as to estimate the similarity among the names of objects. We follow their idea for our needs. To take GSS as an example, $GSS(t_k, c_i) = p(t_k, c_i)\,p(\bar{t}_k, \bar{c}_i) - p(t_k, \bar{c}_i)\,p(\bar{t}_k, c_i)$, where $p(t_k, c_i)$ represents the probability of co-occurrence of $t_k$ and $c_i$, which can be estimated via the returned page counts of a query "$t_k$" + "$c_i$" to a search engine, in which $c_i$ is a transliteration and $t_k$ is a term. Here $t_k$ and $c_i$ denote the presence of the term and the transliteration in a Web page, while $\bar{t}_k$ and $\bar{c}_i$ indicate the opposite, their absence. In practice, we first download a fixed number of Web snippets $D$ for a transliteration $c_i$ via a search engine. Let $T = \{t_1, \dots, t_k, \dots, t_K\}$ be the set of terms extracted from the core snippets $D$; the scores of the six functions for the association strength between each $t_k$ and $c_i$ are measured. Six rankings of each $t_k$ with respect to the six functions are obtained, where $r_{k,m}$ represents the rank of $t_k$ under the $m$-th evaluation function. The average rank $f_k$ is defined as $f_k = \sum_m r_{k,m}/M$. A lower average rank indicates a more important term.

2.2 Search-Keywords Formation

Based on the ranked association words selected in the previous step, there are several alternatives to form a query for further collecting candidate snippets which may contain synonymous transliterations. We consider several strategies and empirically compare their performance. Three entities are used to form the different strategies for synonymous transliterations (ST): the transliteration TL, the association term (AS), and TL's original word (ORI). Three strategies are defined as follows. Strategy 1 (Direct strategy). An ST may appear in the same snippet with a TL or ORI. Therefore, the TL or ORI can be used as the query term. Given a transliteration, its foreign origin can be determined automatically by several techniques found in CLIR [5-12]. Strategy 2 (Indirect strategy). Association words highly related to the TL may retrieve snippets containing an ST. Therefore, in the indirect strategy, we make a query Q out of association words; specifically, a query Qm-As is an m-term query formed by m association words. We select significant association words and use their combinations to generate a set of queries Q. Then, each of the queries is used to collect several hundred snippets which collectively form the set of candidate Web
snippets. For instance, given the top four association words of 賓拉登 (Bin Laden), say { 恐怖份子 (terrorist), 阿富汗 (Afghanistan), 攻擊 (attack), 恐怖主義 (terrorism) }, and m = 2, a query Q2-As is a 2-term query such as (恐怖份子, 阿富汗). The query set Q consists of all two-term combinations of the four ASs. The size of Q is C(4,2) = 6. The set of search keywords in query Q2-As is {q1 = (恐怖份子, 阿富汗); q2 = (阿富汗, 攻擊); q3 = (恐怖主義, 攻擊); …; q6}.

Strategy 3 (Integrated strategy). A combination of the direct and the indirect strategy may improve retrieval effectiveness. Therefore, an integrated strategy containing the Qm-As and the QORI or QTL is considered. Empirically, the integration with ORI is much better than with TL. Thus, we integrate association words with the ORI to produce a query Qm-AsOri. For example, Q1-AsOri = (恐怖份子, Bin Laden) or Q2-AsOri = (恐怖主義、阿富汗, Bin Laden).
3 Synonymous Transliterations Extraction from Candidate Snippets After collecting candidate snippets from the Web, we apply several processes to extract synonymous transliterations. Transliterations are unknown to an ordinary dictionary, so we first discard known words in the snippets with the help of a dictionary and then extract n-gram terms from the remaining text. The length parameter n is set to the range from |TL| - 1 to |TL| + 1 since the length of an ST with respect to an input TL is most likely in that range. A process of dynamic alignment is employed to select candidate synonymous transliterations (CSTs) from the n-gram terms. Then, we compare the similarity between CSTs and the TL. A highly similar CST to the TL is considered a synonymous transliteration. The extraction procedure is presented in Fig. 2. Candidate Snippets
Term Segmenting
n-gram terms
CSTs
Dynamic Aligning
STs
Comparing
Fig. 2. The procedure of extracting synonymous transliterations
3.1 N-Gram Terms and Candidate Synonymous Transliterations Generations The size of n-gram terms segmented from the remaining text after discarding known words is still huge. Most of them are obviously not an ST. We apply a heuristic to discard those n-gram terms and the remaining terms are referred to as candidate synonymous transliterations (CSTs). In particular, we observed that most of synonymous transliterations are highly matching in the first and the last Character, for instance,
戈巴契夫 (ge ba qi fu), 戈爾
巴喬夫 (ge er ba qiao fu) and 戈巴卓夫 (ge ba zhuo fu). That is to say, two terms which does not match well in the first and the last character are very likely not synonymous, for instance, 本拉丹 (ben la dan) and 拉丹襲 (la dan xi) in which the
Synonymous Chinese Transliterations Retrieval from World Wide Web
917
first character of the latter matches with the second character of the former while the
拉丹襲 which has the first two characters from a true synonymous transliteration 本拉丹 is last character of the former matches with the second of the latter. In fact,
generated due to the use of 3-gram segmentation. An extra-last-character exception has to be taken care. Several final foreign phonemes might be ignored in the transliterations by some translators but not be ignored by some other translators. Those phonemes include “m”, “er”, “d” and “s”
姆 (mu), 兒 (er), 爾 (er), 德 (de), and 斯 (si) when they are not ignored. For example, 貝克漢 (bei ke han) and 貝克漢姆 (bei ke han mu).
and usually be transliterated as
Therefore, when a mismatched last character pair is attributed to this exception, we need to further explore the matching between the last second character of the longer word and the last character of the shorter word. According to the above observations, we resort to a dynamic programming technique [12, 13] to determine the optimal alignment between an n-gram term and the TL so as to eliminate the n-gram terms which do not match well with the TL in the first and the last character and neither fall in the extra-last-character exception. Those n-gram terms which match well with the TL in the first and the last character or are well handled by the extra-last-character exception are considered CSTs. Note that the alignment is based on pronunciation similarity of Chinese characters. 3.2 Candidate Synonymous Transliterations Comparison
A transliteration usually has pronunciation close to their original foreign words. Therefore, synonymous transliterations usually have similar pronunciations. We use the Chinese Sound Comparison method (CSC) [12] to compare the pronunciation of two Chinese words, which has advantages over grapheme-based and conventional phoneme-based approaches. Grapheme-based approaches are mainly based on the number of identical alphabets in the two words. Phoneme-based approaches are mainly based on the pronunciation similarity between phones. In the conventional phoneme-based approaches [14, 15], the similarity scores between phones are assigned by some predefined rules which take articulatory features of phones into consideration. Instead, CSC compares two words by their digitalized physical sounds, which raise the effectiveness by embedding more discriminative information in the digitalized sound signals Given two Chinese words A ={a1a2…aN} and B ={b1b2…bM} where an is the nth character in Chinese word A and bm is the mth character in Chinese word B. N is not necessarily equal to M. A dynamic programming based approach to comparing the similarity of smallest distortion for A and B by adjusting the warp on the time axis is employed. The recurrence formula is defined as follows in which T(N, M) is the similarity of {a1a2…aN} and {b1b2…bM}, and sim(an, bm) is the similarity for two Chinese characters. ,
max
N 1, M N 1, M N, M 1
1
N, M
918
C.-C. Hsu and C.-H. Chen
We constructed two similarity matrices for comparing the similarity between Chinese characters [12]. One is for the 37 phonetic symbols which are used to make of the pronunciation of a Chinese character. The other is for the 412 basic sounds which include all pronunciations of Chinese characters without considering tones. The similarity between two Chinese characters is measured by where an.IC . , . 1 , , and bm.IC represent their initial consonant (IC). According to our experience, final sound heavily influences speech sound comparison. Therefore, we adopt an initialweighted comparison approach, which involved a balancing adjustment: weighting the initial consonants of the characters to balance the bias caused by the final sounds. The 37 phonetic symbol similarity matrix is used to provide the similarity data between the initials of the characters. sim(an, bm) is the weighted similarity between character an and bm obtained from the similarity matrices of the 37 phonetic symbols and the 412 character pronunciations. w represents a trade-off between weighting the initial consonant and the whole character and is set to 0.4 empirically. For example,
森
生
the similarity between two Chinese characters (sen) and (sheng) is measured by first converting them to the representation of their corresponding phonetic symbols,
ㄙㄣ(sen) and ㄕㄥ(sheng), respectively. They have initial consonants ㄙ(si) and ㄕ(shi), respectively. Then, the score is calculated by the formula, sim(森,生) = sim( ㄙㄣ,ㄕㄥ) = 0.4 sim (ㄙ, ㄕ) + 0.6 sim (ㄙㄣ, ㄕㄥ). According the two similarity matrices s37 and s412, sim (ㄙ,ㄕ) =0.66 and sim (ㄙㄣ,ㄕㄥ) = 0.69. The result is 0.68, the measured similarity between two Chinese characters 森 and 生.
namely,
s37
s412
s37
s412
The normalized similarity between two words A and B which takes into account the length of the words is defined as scoreCSC(A,B) = T(N,M)/(0.5(N+M)) where N and M are the lengths of A and B, respectively. The choice of normalization operation significantly influences the similarity comparison. We set it to the average length of N and M according empirical results indicated in [12]. A high score between an CST and the TL implies the CST is very likely an ST of the TL.
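A minimal sketch of this dynamic-programming comparison and its length normalization is given below. The two similarity matrices and the 0.4/0.6 initial-consonant weighting are abstracted behind a char_sim callable, which is an assumption of the sketch rather than the paper's actual data structures.

```python
def csc_score(a, b, char_sim):
    """Normalized CSC-style similarity between two Chinese words a and b.
    char_sim(x, y) must return the weighted character-pair similarity."""
    n, m = len(a), len(b)
    T = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(T[i - 1][j - 1] + char_sim(a[i - 1], b[j - 1]),  # align pair
                          T[i - 1][j],                                     # skip a char of a
                          T[i][j - 1])                                     # skip a char of b
    return T[n][m] / (0.5 * (n + m))   # scoreCSC(A, B) = T(N, M) / (0.5 (N + M))
```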
4 Experiments We collected a total of 50 Chinese transliterations (TLs) from the Web. The data were drawn from two major types of proper nouns, i.e., locations and personal names. Their length is 2, 3 or 4, which are most commonly seen in Chinese transliterations. The number of transliterations in each group is 10, 30 and 10, respectively. 4.1 Quality of Query Strategies Each of the 50 TLs was submitted to Google search engine and the first 20 snippets were collected as the core snippets of the TL. For each TL, the top five association words were used to collect various sets of the candidate snippets according to different strategies mentioned in section 2.2. Google also suggests synonymous
Synonymous Chinese Transliterations Retrieval from World Wide Web
919
transliterations with respect to some user queries. We therefore consider their recommendation as well in the experiment. QTL: collecting snippets by using the TL; QOri: collecting snippets by using the original foreign word; Qm-As: collecting snippets by using the query consisting of m associated words; Qm-AsOri: collecting snippets by using a query consisting of m associated word plus the foreign word; QGR: Google recommendation. The following discusses how effective each strategy is able to collect a better set of candidate snippets, which shall contain as many synonymous transliterations as possible. The second row in Table 1 shows the ratio of TL having at least one synonym in the collected snippets under a certain query strategy. Under the strategy QOri, the ratio is 74%; in other words, 37 out of 50 TLs have at least one synonym in the retrieved snippets. Q2-AsOri brings the best performance, which is 92%. The result also shows that only 4% has recommendation from the search engine. Among the inputs, three of the 50 TLs do not has any synonym in the collected snippets, including
托拉斯
赫爾利
雅典娜
(Trust) and (Hurley). (Athena), Surprisingly, the combination of the original word along with association words performs better than using the original word alone. For instance, the transliterations
馬斯哈托夫 (Maskhadov), 巴薩拉 (Basra), 賽普拉斯 (Cypress), 費雪 (Fisher), 蓋亞 (Gaea), and 鮑爾 (Powell) have no STs in the collected snippets by Q , but they do Ori
have by Q2-AsOri or Q1-AsOri. The reason is that these transliterations are more popular than their synonymous transliterations. As a result, all the returned snippets, of which the number is limited to about 1000 by the search engine, by QOri contain only the most commonly seen transliterations, no other synonymous transliterations. A stricter query strategy which additionally include association words along with the original foreign word help to bring the Web pages containing synonymous transliterations to the set of the returned first 1000 pages. Second, we test how many synonymous transliterations could be retrieved in average under different methods with respect to a given TL. Experimental results in the third row of Table 1 show that including Ori along with their association words in the query outperforms their counterpart, which does not include Ori. Furthermore, the parameter m (the number of association words in a query) is better not to be greater than 3. Requesting too many association words in a snippet would limit the number of snippets that we can retrieve. Given the 50 TLs, we retrieved in total 366 STs, of which 252(69%), 246(67%), 136(37%), 145(40%), 86(23%), 54(15%), 40(11%), 22(6%), 2(0.5%) by 2-AsOri, 1AsOri, 3-AsOri, Ori, TL, 2-As, 3-As, 1-As, and GR, respectively. 322 (88%) out of 366 can be retrieved by 2AsOri and 1AsOri together. Finally, we inspect how uniqueness the retrieved result of a method is, i.e., how many words which are retrieved uniquely by the method but not by the other methods.
920
C.-C. Hsu and C.-H. Chen
Table 1. Probability of a TL having at least one synonym in the collected snippets and the average number of retrieved synonymous transliterations Method 2-AsOri 1-AsOri 3-AsOri Ori TL 2-As 3-As 1-As GR ST. Occurrence Probability 0.92 0.9 0.82 0.74 0.50 0.50 0.38 0.36 0.04 Average number of STs 5.04 4.92 2.72 2.90 1.72 1.08 0.80 0.44 0.10 Uniqueness 0.318 0.419 0.062 0.093 0.023 0.054 0.016 0.016 0.000
Table 1 shows among those words retrieved by only one method, about 40% and 30% are by Q1-AsOri and Q2-AsOri, respectively. Except for GR, other methods can retrieve more or less some unique STs. 4.2 Performance of Synonymous Transliterations Extraction This section presents how well the confirmation model can recognize those identified candidate synonymous transliterations (CSTs) as true synonymous transliterations (STs). The evaluation measures include precision, the average number of retrieved STs and the inclusion rate. Because Q2-AsOri was more effective in retrieving candidate snippets in which STs appear, we use the set of CSTs extracted from the candidate snippets by Q2-AsOri via the dynamic alignment process. The dynamic alignment approach reduced the size of the n-gram terms 355,943 to the size of CST terms 56,408. We further utilize the CSC approach [12] to measure the similarity between a CST term and the TL. The initial consonant weight is set to w = 0.4 which is suggested in [12]. A high score indicates high pronunciation similarity between the CST and the TL and implies that they are likely synonymous. Fig. 3 shows retrieval precision and the average number of retrieved STs with respect to various similarity thresholds by the CSC. The result shows that all extracted STs acquire at least a 0.5 CSC similarity score. It also shows that the precision is high (over 0.89) when the score is greater than 0.9.
Fig. 3. Precision and average number of collected synonyms under various similarity scores by CSC
AR (average ranking), ARR (average reciprocal rank) and the inclusion rate, which are commonly used for the evaluation in information retrieval, are calculated for the data set according to the rank of the similarity score of a true ST to the TL. AR and ARR are 7.22 and 0.74, respectively. For the inclusion rate, 67% of STs are included
Synonymous Chinese Transliterations Retrieval from World Wide Web
921
in top-1, 81% are included in top-5, 88% are included in top-10 and 99% are included in top-100. The lowest rank of a true ST is 324.
5 Conclusions and Future Directions In this paper we present a framework for collecting synonymous transliterations from the Web with respect to a given input transliteration. The research result can be applied to construct ontology of famous person names. Our method uses the online retrieved Web pages collection as the corpus. Unlike the conventional approaches in information retrieval, a manually pre-collected set of documents is used as the corpus which may engender bias. Moreover, to extract synonymous transliterations from the retrieved Web snippets, we compare the similarity between unknown words and the input transliteration by an approach based on comparing digitalized physical sounds. We will continue to improve the precision of identified synonymous transliterations in our future work. Acknowledgement. This work is supported by National Science Council, Taiwan under grant NSC 96-2416-H-224-004-MY2.
Reference 1. Carpineto, C., Bordoni, F.U., Mori, R.D., Avignon, U.O., Romano, G., Bordoni, F.U., Bigi, B.: An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems 19(1), 1–27 (2001) 2. Huang, S., Chen, Z., Yu, Y., Ma, W.-Y.: Multitype features coselection for Web document clustering. IEEE Transactions on Knowledge and Data Engineering 18(4), 448–459 (2006) 3. Cheng, P.-J., Teng, J.-W., Chen, R.-C., Wang, J.-H., Lu, W.-H., Chien, L.-F.: Translating unknown queries with Web corpora for cross-language information retrieval. In: Proceedings of ACM SIGIR, Sheffield, South Yorkshire, UK (2004) 4. Cilibrasi, R.L., Vitanyi, P.M.B.: The Google similarity distance. IEEE Transactions on Knowledge and Data Enginerring 19(3), 370–383 (2007) 5. Tsuji, K.: Automatic extraction of translational Japanese-Katakana and English word pairs from bilingual corpora. International Journal of Computer Processing of Oriental Language, 261–280 (2002) 6. Stalls, B.G., Kevin, K.: Translating names and technical terms in arabic text. In: Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages (1998) 7. Somers, H.L.: Similarity metrics for aligning children’ s articulation data. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pp. 1227–1231 (1998) 8. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. In: IEEE Trans. Acoustics, Speech, and Signal Proc. ASSP, pp. 43–49 (1978) 9. Lin, W.H., Chen, H.H.: Backward machine transliteration by learning phonetic similarity. In: Proceedings of the Sixth Conference on Natural Language Learning, Taipei, Taiwan, pp. 139–145 (2002)
922
C.-C. Hsu and C.-H. Chen
10. Lin, W.H., Chen, H.H.: Similarity measure in backward transliteration between different character sets and its applications to CLIR. In: Proceedings of Research on Computational Linguistics Conference XIII, Taipei, Taiwan, pp. 97–113 (2000) 11. Lee, C.J., Chang, J.S., Jang, J.-S.R.: Alignment of bilingual named entities in parallel corpora using statistical models and multiple knowledge sources. ACM Transactions on Asian Language Information Processing 5(2), 121–145 (2006) 12. Hsu, C.-C., Chen, C.-H., Shih, T.-T., Chen, C.-K.: Measuring similarity between transliterations against noise data. ACM Transactions on Asian Language Information Processing (2007) 13. Kuo, J.-S., Li, H., Yang, Y.-K.: A phonetic similarity model for automatic extraction of transliteration pairs. ACM Trans. Asian Language Information Processing (2007) 14. Kondrak, G.: Phonetic alignment and similarity. Computers and the Humanities 37(3), 273–291 (2003) 15. Chen, H.H., Lin, W., Yang, C.C., Lin, W.H.: Translating/transliterating named entities for multilingual information access. Journal of the American Society for Information Science and Technology, 645–659 (2006)
Parallel Approximate Finite Element Inverses on Symmetric Multiprocessor Systems Konstantinos M. Giannoutakis and George A. Gravvanis Department of Electrical and Computer Engineering, School of Engineering, Democritus University of Thrace, 12, Vas. Sofias street, GR 671 00 Xanthi, Greece {kgiannou, ggravvan}@ee.duth.gr
Abstract. A new parallel normalized optimized approximate inverse algorithm for computing explicit approximate inverses is introduced for symmetric multiprocessor (SMP) systems. The parallelization of the approximate inverse has been implemented by an antidiagonal motion, in order to overcome the data dependencies. The parallel normalized explicit approximate inverses are used in conjunction with parallel normalized explicit preconditioned conjugate gradient schemes, for the efficient solution of finite element sparse linear systems. The parallel design and implementation issues of the new algorithms are discussed and the parallel performance is presented, using OpenMP. The speedups tend to the upper theoretical bounds for all cases, making approximate inverse preconditioning suitable for SMP systems.
1
Introduction
Sparse matrix computations, which have inherent parallelism, are of central importance in computational science and engineering and are the most time-consuming part of such computations. Hence research efforts have focused on the production of efficient parallel computational methods and related software suitable for multiprocessor systems, [1,2,3,10]. Until recently, direct methods were effectively used, but the increase of problem size, even with the use of modern computer systems, has become a barrier to such methods, [2,3,10]. Additionally, the solution of sparse linear systems, because of its applicability to real-life problems, is obtained by iterative methods, which are in competitive demand after the emergence of Krylov subspace methods, [4,8]. An important achievement over the last decades is the appearance and use of Explicit Preconditioned Methods, [4], for solving sparse linear systems; the preconditioned form of a linear system Au = s is M Au = M s, where M is the preconditioner, [2,4,8,10]. The preconditioner M has therefore to satisfy the following conditions: (i) M A should have a "clustered" spectrum, (ii) M can be efficiently computed in parallel and (iii) finally "M × vector" should be fast to compute in parallel, [2,4,8,9,10]. Hence, the derivation of parallel methods was the main objective for which several families of parallel inverses are proposed. The main motive for the derivation of the parallel explicit approximate inverse matrix algorithms is that they
result in parallel iterative methods in conjunction with parallel preconditioned conjugate gradient - type schemes respectively, for solving finite element linear systems on SMP systems. The important feature of the proposed parallel approximate inverse preconditioning is that the approximate inverse is computed explicitly and in parallel, eliminating the forward-backward substitution, which does not parallelize easily, [10]. For the implementation of the parallel programs, the OpenMP application programming interface has been used. OpenMP has emerged as a shared-memory programming standard and it consists of compiler directives and functions for supporting both data and functional parallelism. The parallel for pragma with static scheduling has been used for the parallelization of loops on both the construction of the approximate inverse and the preconditioned conjugate gradient scheme.
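The role of an explicit approximate inverse in such a scheme can be sketched as follows. This is an illustration of preconditioning by multiplication with M only, with dense NumPy arrays standing in for the actual sparse banded storage; it is not the paper's NOROBAIFEM or preconditioned conjugate gradient code.

```python
import numpy as np

def epcg(A, b, M, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradient where the preconditioner is an explicit
    approximate inverse M ~ A^{-1}: applying it is a matrix-vector product,
    so no forward/backward substitution is needed."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M @ r                      # "M x vector" step, easy to parallelize
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M @ r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```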
2
Parallel Explicit Approximate Inverses
Let us consider the arrow-type linear system, i.e., Au = s, where A is a sparse symmetric arrow-type (n × n) matrix (1) with diagonal entries a_1, …, a_n, off-diagonal entries b_1, …, b_{n−1} retained in semi-bandwidth m, and an arrow block V ≡ (v_{k,η}) (2).
Let us assume the normalized finite element approximate factorization of the coefficient matrix A, such that: A ≈ Dr Trt Tr Dr ,
r ∈ [1, . . . , m − 1),
(3)
where r is the “fill-in” parameter, i.e. the number of outermost off-diagonal entries retained in semi-bandwidth m, Dr is a diagonal matrix and Tr is a sparse upper (with unit diagonal elements) triangular matrix of the same profile as the coefficient matrix A, [7].
Parallel Approximate Finite Element Inverses on SMP Systems
927
The elements of the Dr , Tr decomposition factors were computed by the FEANOF algorithm, cf. [12]. The memory requirements of the FEANOF algorithm are ≈ (r + 2 + 2)n words, while the computational work for m << n is ≈ 1/2(r + )(r + + 3)n mults + n square roots, cf. [7]. Let Mrδl = (μi,j ), i ∈ [1, n] j ∈ [max(1, i − δl + 1), min(n, i + δl − 1)], be the normalized approximate inverse of the coefficient matrix A, i.e. −1 −1
rδl Dr−1 , with M
rδl = Trt Tr −1 . Mrδl = Dr−1 Trt Tr Dr = Dr−1 M
(4)
The elements of the approximate inverse were determined by retaining a certain number of elements of the inverse, i.e. only δl elements in the lower part and δl − 1 elements in the upper part of the inverse (by applying the so-called “position-principle”), next to the main diagonal, by the Normalized Optimized Banded Approximate Inverse Finite Element Matrix -2D algorithmic procedure (henceforth called the NOROBAIFEM-2D algorithm, without inverting the decomposition factor Tr , [7]. The challenge of computing parallel approximate inverses is to overcome its data dependencies, which create a critical path and an order of computations, hence any parallel approximate inverse matrix algorithm must abide by those dependencies in order to avoid any data loss. ⎤ ⎡ (15) δl (14) " (13) b " " μ 1,1 μ μ 1,3 " " b " 1,2 " " ⎥ ⎢"" " "(12) "" (14) ""(13) (11)b ⎥ ⎢ μ μ 2,2 " μ 2,3" μ 2,4 " b ⎥ ⎢ 2,1 " " " ⎥ ⎢ " " " " ⎥ ⎢" (13) "" (12) "(11) (10) "" (9) b ⎥ ⎢ μ 3,1 " μ 3,2 " μ 3,3 "μ 3,4" μ 3,5 " "b " " ⎥ ⎢b " " " b " ⎥ ⎢ b " " "(11) " (10) " (9) (8) (7) " ⎥ ⎢ " " b μ μ μ μ μ " " " 4,2 4,3 4,4 4,5 4,6 " " ⎥ ⎢ b "
δl = ⎢ " "" " b M " " ⎥ . (5) " r b ""(9) "" (8) " (7) (6) " (5) " ⎢ "b ⎥ μ 5,3 μ 5,4" μ 5,5 " μ 5,6" μ 5,7 " ⎢ b " b⎥ " " " ⎥ ⎢ " b" "" "(6) "" " ⎥ ⎢ (7) (5) (4) (3) " " " ⎢ μ μ μ μ μ b" 6,4 6,5" 6,6" 6,7 " 6,8 ⎥ " ⎥ ⎢ " " " " b" ⎢ "(5) "" " (4) (3) (2) ⎥ " ⎥ ⎢ " " b μ μ μ μ 7,5 7,6 " 7,7" 7,8 " ⎦ ⎣ " b "" " " " (3) (2) (1) " " b"μ μ μ b 8,6" 8,7 " 8,8 For the parallelization of the NOROBAIFEM-2D algorithm, an antidiagonal motion (wave-like pattern), starting from the element μ 8,8 down to μ 1,1 , has been used, because of the dependency of the elements of the inverse during its construction. More specifically, any element within the banded approximate inverse requires its corresponding right or lower element to be computed first. This sequence of computations, without any loss of generality and for simplicity reasons, is shown for the normalized banded approximate inverse in equation (5) (with n = 8 and δl = 3). The values of the parentheses at the superscript of each (k) i,j was computed at the (k)-th element (e.g. μ i,j ), indicate that the element μ sequential step of the algorithm (k-th antidiagonal), while the elements with the same superscript (i.e. (k)) were computed concurrently. It should be noted that
928
K.M. Giannoutakis and G.A. Gravvanis
due to the data dependencies, for δl = 1, 2 the parallel algorithm will execute sequentially. Let us consider that the command forall denotes the parallel for instruction (forks/joins threads), for executing parallel loops. Then, the algorithm for the implementation of the Parallel ANti Diagonal NOROBAIFEM-2D algorithm (henceforth called the PAND-NOROBAIFEM-2D algorithm), on symmetric multiprocessor systems, can be described as follows: for k = 1 to δl forall l = 1 to k call inverse(n − l + 1,n − k + l) m=2 for k = (δl + 1) to n forall l = m to (k − m + 1) call inverse(n − l + 1,n − k + l) if (k − δl) mod 2 = 0 then m=m+1 m=m−1 for k = (n − 1) downto (δl + 1) forall l = m to k − m + 1 call inverse(l,k − l + 1) if (k − δl) mod 2 = 1 then m=m−1 for k = δl downto 1 forall l = 1 to k call inverse(l,k − l + 1) where the function inverse(i,j) computes the element μi,j of the normalized optimized approximate inverse, and can be described as follows, [7]: function(inverse) Let r = r + , m = m + , mr = m − r, nmr = n − mr, nm = n − m. if i >= j then if j > nmr then if i = j then if i = n then (6) μ 1,1 = 1 else n−j,δl+1 (7) μ n−i+1,1 = 1 − gj · μ else n−i+1,i−j (8) μ n−i+1,i−j+1 = −gj · μ else if j ≥ r and j ≤ nmr then if i = j then nmr−j μ n−i+1,1 = 1 − gj · μ n−j,δl+1 − hr−1−k,j+1−r+k · μ x,y (9) k=0
call mw(n, δl, i, mr + j + k, x, y)
Parallel Approximate Finite Element Inverses on SMP Systems
929
else μ n−i+1,i−j+1 = −gj · μ n−i+1,i−j −
nmr−j
hr−1−k,j+1−r+k · μ x,y
(10)
k=0
call mw(n, δl, i, mr + j + k, x, y) else if j > nm + 1 and j ≤ r − 1 then if i = j then n−j,δl+1 − hj,k · μ x1 ,y1 μ n−i+1,1 = 1 − gj · μ − call mw(n, δl, i, m + k − 1, x1 , y1 ) else
k=j+1−r k>0
nm
hj−1−λ,+1+λ · μ x2 ,y2
call mw(n, δl, i, m + λ, x2 , y2 )
μ n−i+1,i−j+1 = −gj · μ n−i+1,i−j − − call mw(n, δl, i, m + k − 1, x1 , y1 ) else if j ≤ nm + 1 then if i = j then if i = 1 then
(11)
λ=0
nm
hj,k · μ x1 ,y1
k=j+1−r k>0
hj−1−λ,+1+λ · μ x2 ,y2
(12)
λ=0
call mw(n, δl, i, m + λ, x2 , y2 )
n−1,δl+1 − μ n,1 = 1 − g1 · μ
h1,k · μ x,y
(13)
k=1
call mw(n, δl, 1, m + k − 1, x, y) else μ n−i+1,1 = 1 − gj · μ n−j,δl+1 −
k=j+1−r k>0
hj,k · μ x1 ,y1 −
j−1
hj−λ,+λ · μ x2 ,y2
(14)
λ=1
call mw(n, δl, i, m + k − , x1 , y1 ) call mw(n, δl, i, m + λ − 1, x2 , y2 ) else j−1 n−j,δl+1 − hj,k · μ x1 ,y1 − hj−λ,+λ · μ x2 ,y2 (15) μ n−i+1,1 = −gj · μ k=j+1−r k>0
call mw(n, δi, i, m + k − , x1 , y1 ) if i <> j then n−i+1,i−j+1 μ n−i+1,δl+i−j = μ
λ=1
call mw(n, δl, i, m + λ − 1, x2 , y2 ) (16)
The procedure mw(n, δl, s, q, x, y), [5], reduces the memory requirements of the approximate inverse to only n × (2δl − 1)-vector spaces. The computational process is logically divided into 2n − 1 sequential steps representing the 2n − 1 antidiagonals, while synchronization between processes is needed after the
930
K.M. Giannoutakis and G.A. Gravvanis
computation of each antidiagonal, to ensure that the elements of the matrix are correctly computed.
3
Parallel Normalized Preconditioned Conjugate Gradient method
In this section we present a class of parallel Normalized Explicit Preconditioned Conjugate Gradient (NEPCG) method, based on the derived parallel optimized approximate inverse, designed for symmetric multiprocessor systems.The NEPCG method for solving linear systems has been presented in [7]. The computational complexity of the NEPCG method is O [(2δl + 2 + 11) nmults + 3n adds] ν operations,where ν is the number of iterations required for the convergence to a certain level of accuracy, [7]. The Parallel Normalized Explicit Preconditioned Conjugate Gradient (PN EPCG) algorithm for solving linear systems can then be described as follows: forall j = 1 to n (r0 )j = sj − A (u0 )j if δl = 1 then forall j = 1 to n (r0∗ )j = (r0 )j / d2 j else forall j = 1 to n j (r0∗ )j =
(17)
(18)
μ n+1−i,i+1−k k=max(1,j−δl+1) min(n,j+δl−1) +
k=j+1
(r0 )k /dk
μ n+1−k,δl+k−j (r0 )k /dk
/ (d)j
(19)
forall j = 1 to n (σ0 )j = (r0∗ )j (20) forall j = 1 to n (reduction+p0 ) p0 = (r0 )j ∗ (r0∗ )j (21) Then, for i = 0, 1, . . ., (until convergence) compute in parallel the vectors ui+1 , ri+1 , σi+1 and the scalar quantities αi , βi+1 as follows: forall j = 1 to n (qi )j = A (σi )j (22) forall j = 1 to n (reduction +ti ) (23) ti = (σi )j ∗ (qi )j (24) αi = pi /ti forall j = 1 to n (ui+1 )j = (ui )j + αi (σi )j (25) (ri+1 )j = (ri )j − αi (qi )j (26) if δl = 1 then forall 1 to n ∗j = (27) ri+1 j = (ri+1 )j / d2 j
Parallel Approximate Finite Element Inverses on SMP Systems
else forall j = 1 to
n ∗ ri+1 j =
j
μ n+1−i,i+1−k k=max(1,j−δl+1) min(n,j+δl−1) +
k=j+1
(ri+1 )k /dk
μ n+1−k,δl+k−j (ri+1 )k /dk
931
/ (d)j (28)
forall j = 1 to n (reduction+p i+1 ) ∗ pi+1 = (ri+1 )j ∗ ri+1 j βi+1 = pi+1 /pi forall j = 1 to n ∗ + βi+1 (σi )j (σi+1 )j = ri+1 j
(29) (30) (31)
It should be noted that the parallelization of the coefficient matrix A×vector operation has been implemented by taking advantage of the sparsity of the coefficient matrix A.
4
Numerical Results
In this section we examine the applicability and effectiveness of the proposed parallel schemes for solving sparse finite element linear systems. Let us now consider a 2D-boundary value problem: uxx + uyy + u = F,
(x, y) ∈ R,
with
u (x, y) = 0,
(x, y) ∈ ∂R,
(32)
where R is the unit square and ∂R denotes the boundary of R. The domain is covered by a non-overlapping triangular network resulting in a hexagonal mesh. The right hand side vector of the system (1) was computed as the product of the matrix A by the solution vector, with its components equal to unity. The “fill-in”parameter was set to r = 2 and the width parameter was set to = 3. The iterative process was terminated when ri ∞ < 10−5 . It should be noted that further details about the convergence behavior and the impact of the “retention”parameter on the solution can be found in [6]. The numerical results presented in this section were obtained on an SMP machine consisting of 16 2.2 GHz Dual Core AMD Opteron processors, with 32 GB RAM running Debian GNU/Linux (National University Ireland Galway). For the parallel implementation of the algorithms presented, the Intel C Compiler v9.0 with OpenMP directives has been utilized with no optimization enabled at the compilation level. It should be noted that due to administrative policies, we were not able to explore the full processor resources (i.e. more than 8 threads). In our implementation, the parallel for pragma has been used in order to generate code that forks/joins threads, in all cases. Additionally, static scheduling has been used (schedule(static)), whereas the use of dynamic scheduling has not produced improved results. The speedups and efficiencies of the PAND-NOROBAIFEM-2D algorithm for several values of the “retention”parameter δl with n = 10000 and m = 101,
932
K.M. Giannoutakis and G.A. Gravvanis
Table 1. Speedups for the PAND-NOROBAIFEM-2D algorithm for several values of δl Speedups for the PAND-NOROBAIFEM-2D algorithm Retention parameter 2 processors 4 processors 8 processors δl = m 1.8966 3.8458 6.8653 δl = 2m 1.9600 3.8505 7.4011 δl = 4m 1.9741 3.9260 7.5768 δl = 6m 1.9986 3.9501 7.8033
Table 2. Efficiencies for the PAND-NOROBAIFEM-2D algorithm for several values of δl Efficiencies for the PAND-NOROBAIFEM-2D algorithm Retention parameter 2 processors 4 processors 8 processors δl = m 0.9483 0.9615 0.8582 δl = 2m 0.9800 0.9626 0.9251 δl = 4m 0.9870 0.9815 0.9471 δl = 6m 0.9993 0.9875 0.9754
are given in Table 1 and 2. In Fig. 1 the parallel speedups for several values of the “retention”parameter δl are presented for the PAND-NOROBAIFEM2D method, for n = 10000 and m = 101. The speedups and efficiencies of the PNEPCG algorithm for several values of the “retention”parameter δl with n = 10000 and m = 101, are given in Table 3 and 4. In Fig. 2 the parallel speedups for several values of the “retention”parameter δl are presented for the PNEPCG method, for n = 10000 and m = 101. Table 3. Speedups for the PNEPCG algorithm for several values of δl Speedups for the PNEPCG method Retention parameter 2 processors 4 processors 8 processors δl = 1 1.1909 1.5365 1.6097 δl = 2 1.5261 2.2497 2.7299 δl = m 1.8070 3.4351 6.3522 δl = 2m 1.8576 3.4824 6.3636 δl = 4m 1.9103 3.5453 6.4043 δl = 6m 1.9735 3.5951 6.6106
It should be mentioned that for large values of the “retention”parameter, i.e. multiples of the semi-bandwidth m, the speedups and the efficiency tend to the upper theoretical bound, for both the parallel construction of the approximate inverse and the parallel normalized preconditioned conjugate gradient method, since the coarse granularity amortizes the parallelization overheads. For small
Parallel Approximate Finite Element Inverses on SMP Systems
Fig. 1. Speedups versus the NOROBAIFEM-2D algorithm
“retention”parameter
δl
for
the
933
PAND-
Table 4. Efficiencies for the PNEPCG algorithm for several values of δl Efficiencies for the PNEPCG algorithm Retention parameter 2 processors 4 processors 8 processors δl = 1 0.5955 0.3841 0.2012 δl = 2 0.7631 0.5624 0.3412 δl = m 0.9035 0.8588 0.7940 δl = 2m 0.9288 0.8706 0.7954 δl = 4m 0.9551 0.8863 0.8005 δl = 6m 0.9867 0.8988 0.8263
Fig. 2. Speedups versus the “retention”parameter δl for the PNEPCG algorithm
values of the “retention”parameter, i.e. δl = 1, 2, the fine granularity is responsible for the low parallel performance, since the parallel operations reduces to simple ones, like inner products, and the utilization of the hardware platform is decreasing.
934
5
K.M. Giannoutakis and G.A. Gravvanis
Conclusion
The design of parallel explicit approximate inverses results in efficient parallel preconditioned conjugate gradient method for solving finite element linear systems on multiprocessor systems. Finally, further parallel algorithmic techniques will be investigated in order to improve the parallel performance of the normalized explicit approximate inverse preconditioning on symmetric multiprocessor systems, particularly by increasing the computational work output per processor and eliminating process synchronization and any associated latencies. Acknowledgments. The author would like to thank indeed Dr. John Morrison of the Department of Computer Science, University College of Cork for the provision of computational facilities and support through the WebCom-G project funded by Science Foundation Ireland.
References 1. Akl, S.G.: Parallel Computation: Models and Methods. Prentice-Hall, Englewood Cliffs (1997) 2. Dongarra, J.J., Duff, I., Sorensen, D., van der Vorst, H.A.: Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia (1998) 3. Duff, I.: The impact of high performance computing in the solution of linear systems: trends and problems. J. Comp. Applied Math. 123, 515–530 (2000) 4. Gravvanis, G.A.: Explicit Approximate Inverse Preconditioning Techniques. Archives of Computational Methods in Engineering 9(4), 371–402 (2002) 5. Gravvanis, G.A.: Parallel matrix techniques. In: Papailiou, K., Tsahalis, D., Periaux, J., Hirsch, C., Pandolfi, M. (eds.) Computational Fluid Dynamics I, pp. 472–477. Wiley, Chichester (1998) 6. Gravvanis, G.A., Giannoutakis, K.M.: Normalized Explicit Finite Element Approximate Inverses. I. J. Differential Equations and Applications 6(3), 253–267 (2003) 7. Gravvanis, G.A., Giannoutakis, K.M.: Normalized finite element approximate inverse preconditioning for solving non-linear boundary value problems. In: Bathe, K.J. (ed.) Computational Fluid and Solid Mechanics 2003. Proceedings of the Second MIT Conference on Computational Fluid and Solid Mechanics, vol. 2, pp. 1958–1962. Elsevier, Amsterdam (2003) 8. Grote, M.J., Huckle, T.: Parallel preconditioning with sparse approximate inverses. SIAM J. Sci. Comput. 18, 838–853 (1977) 9. Huckle, T.: Approximate sparsity patterns for the inverse of a matrix and preconditioning. Applied Numerical Mathematics 30, 291–303 (1999) 10. Saad, Y., van der Vorst, H.A.: Iterative solution of linear systems in the 20th century. J. Comp. Applied Math. 123, 1–33 (2000)
Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the Synergistic Processing Element of the CELL Processor Wesley Alvaro1 , Jakub Kurzak1 , and Jack Dongarra1,2,3 2
1 University of Tennessee, Knoxville TN 37996, USA Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA 3 University of Manchester, Manchester, M13 9PL, UK {alvaro, kurzak, dongarra}@eecs.utk.edu
Abstract. Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least square problems, and singular and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single precision, floating point performance. In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, fast implementation of the matrix multiplication operation is essential. The crutial component is the matrix multiplication kernel crafted for the short vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix multiplication kernels are presented implementing the C = C − A × B T operation and the C = C −A×B operation for matrices of size 64×64 elements. For the latter case, the performance of 25.55 Gflop/s is reported, or 99.80 percent of the peak, using as little as 5.9 KB of storage for code and auxiliary data structures.
1
Introduction
The CELL Broadband Engine (CBE) processor has been developed jointly by the alliance of Sony, Toshiba and IBM (STI). The CELL processor is an innovative multi-core architecture consisting of a standard processor, the Power Processing Element (PPE), and eight short-vector Single Instruction Multiple Data (SIMD) processors, referred to as the Synergistic Processing Elements (SPEs). The SPEs are equipped with scratchpad memory referred to as the Local Store (LS) and a Memory Flow Controller (MFC) to perform Direct Memory Access (DMA) transfers of code and data between the system memory and the Local Store. All components are interconnected with the Element Interconnection Bus (EIB). This paper is only concerned with the design of computational micro-kernels for the SPE in order to fully exploit Instruction Level Parallelism (ILP) provided by its SIMD architecture. Issues related to parallelization of code for execution on multiple SPEs, including intra-chip communication and synchronization, are M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 935–944, 2008. c Springer-Verlag Berlin Heidelberg 2008
936
W. Alvaro, J. Kurzak, and J. Dongarra
not discussed here. SPE architercural details important to the discussion are presented in Sect. 4.1 and also throughout the text, as needed. Plentiful information about the design of the CELL processor and CELL programming techniques is in public the domain [1].
2
Motivation
The current trend in processor design is towards chips with multiple processing units, commonly referred to as multi-core processors [2]. It has been postulated that building blocks of future architectures are likely to be simple processing elements with shallow pipelines, in-order execution, and SIMD capabilities [3]. It can be observed that the Synergistic Processing Element of the CELL processor closely matches this description. By the same token, investigation into microkernel development for the SPE may have a broader impact by providing an important insight into programming future multi-core architectures. 2.1
Performance Considerations
State of the art numerical linear algebra software utilizes block algorithms in order to exploit the memory hierarchy of traditional cache-based systems [4,5]. Public domain libraries such as LAPACK and ScaLAPACK are good examples. These implementations work on square or rectangular submatrices in their inner loops, where operations are encapsulated in calls to Basic Linear Algebra Subroutines (BLAS), with emphasis on expressing the computation as Level 3 BLAS, matrix-matrix type, operations. Frequently, the call is made directly to the matrix multiplication routine GEMM. At the same time, all the other Level 3 BLAS can be defined in terms of GEMM and a small amount of Level 1 and Level 2 BLAS [6]. 2.2
Code Size Considerations
In the current implementation of the CELL BE architecture, the SPEs are equipped with a Local Store of 256 KB. It is a common practice to use tiles of 64 × 64 elements for dense matrix operations in single precision, which occupy 16 KB buffers in the Local Store. Between six and eight such buffers are necessary to efficiently implement common matrix operations. In general, it is reasonable to assume that half of the Local Store is devoted to application data buffers leaving only 128 KB for the application code, necessary libraries and the stack. Owing to that, the Local Store is a scarse resource and any real-world application is facing the problem of fitting tightly coupled components together in the limited space.
3
Related Work
Implementation of matrix multiplication C = C + A × B T using Intel Streaming SIMD Extensions (SSE) was reported by Aberdeen and Baxter [7]. Analysis
Fast and Small Short Vector SIMD Matrix Multiplication Kernels
937
of performance considerations of various computational kernels for the CELL processor, including the GEMM kernel, was presented by Williams et al. [8]. The first implementation of the matrix multiplication kernel C = A × B for the CELL processor was reported by Chen et al. [9]. Performance of 25.01 Gflop/s was reported on a single SPE, with code size of roughly 32 KB. More recently assembly language implementation of the matrix multiplication C = C − A × B was reported by Hackenberg[10,11]. Performance of 25.40 Gflop/s was reported with code size close to 26 KB.
4 4.1
Implementation SPU Architecture Overview
The core of the SPE is the Synergistic Processing Unit (SPU). The SPU is a RISC-style SIMD processor feturing 128 general purpose registers and 32bit fixed length instruction encoding. SPU includes instructions that perform single precision floating point, integer arithmetic, logicals, loads, stores, compares and branches. SPU includes nine execution units organized into two pipelines, referred to as the odd and even pipeline. Instructions are issued in-order and two independent instructions can be issued simultaneously if they belong to different pipelines. SPU executes code form the Local Store and operates on data residing in the Local Store, which is a fully pipelined, single-ported, 256 KB of Static Random Access Memory (SRAM). Load and store instructions are performed within local address space, which is untranslated, unguarded and noncoherent with respect to the system address space. Loads and stores transfer 16 bytes of data between the register file and the Local Store, and complete with fixed six-cycle delay and without exception. SPU does not perform hardware branch prediction and omits branch history tables. Instead, the SPU includes a Software Managed Branch Target Buffer (SMBTB), which holds a single branch target and is loaded by software. A mispredicted branch flushes the pipelines and costs 18 cycles. A correctly hinted branch can execute in one cycle. Since both branch hint and branch instructions belong to the odd pipeline, proper use of SMBTB can result in zero overhead from branching for a compute-intensive loop dominated by even pipeline instructions. 4.2
Loop Construction
The main tool in loop construction is the technique of loop unrolling. In general, the purpose of loop unrolling is to avoid pipeline stalls by separating dependent instructions by a distance in clock cycles equal to the corresponding pipeline latencies. It also decreases the overhead associated with advancing the loop index and branching. On the SPE it serves the additional purpose of balancing the ratio of instructions in the odd and even pipeline, owing to register reuse between interations.
938
W. Alvaro, J. Kurzak, and J. Dongarra
In the canonical form, matrix multiplication Cm×n = Am×k × Bk×n coinsists of three nested loops iterating over the three dimensions m, n and k. Loop tiling is applied to improve the locality of reference and to take advantage of the O(n3 )/O(n2 ) ratio of arithmetic operations to memory accesses. This way register reuse is maximized and the number of loads and stores is minimized. Conceptually, tiling of the three loops creates three more inner loops, which calculate a product of a submatrix of A and a submatrix of B and updates a submatrix of C with the partial result. Practically, the body of these three inner loops is subject to complete unrolling to a single block of a straight-line code. The tile size is picked such that the cross-over point between arithmetic and memory operations is reached, which means that there is more FMA or FNMS operations to fill the even pipeline than there is load, store and shuffle or splat operations to fill the odd pipeline. The resulting structure consists of three outer loops iterating over tiles of A, B and C. Inevitably, nested loops induce mispredicted branches, which can be alleviated by further unrolling. Aggressive unrolling, however, leads quickly to undesired code bloat. Instead, the three-dimensional problem can be linearized by replacing the loops with a single loop performing the same traversal of the iteration space. This is accomplished by traversing tiles of A, B and C in a predefined order derived as a function of the loop index. A straightforward row/column ordering can be used and tile pointers for each iteration can be constructed by simple transformations of the bits of the loop index. At this point, the loop body still contains auxiliary operations that cannot be overlapped with arithmetic operations. These include initial loads, stores of final results, necessary data rearrangement with splats and shuffles, and pointer advancing operations. This problem is addressed by double-buffering, on the register level, between two loop iterations. The existing loop body is duplicated and two separate blocks take care of the even and odd iteration, respectively. Auxiliary operations of the even iteration are hidden behind arithmetic instructions of the odd iteration and vice versa, and disjoint sets of registers are used where necessary. The resulting loop is preceeded by a small body of prologue code loading data for the first iteration, and then followed by a small body of epilogue code, which stores results of the last iteration. 4.3
C=C-A×B
T
Before going into details, it should be noted, that matrix storage follows C-style row-major format. It is not as much a carefull design decision, as compliance with the common practice on the CELL processor. It can be attributed to C compilers being the only ones allowing to exploit short-vector capabilities of the SPEs through C language SIMD extensions. An easy way to picture the C = C − A × B T operation is to represent it as the standard matrix vector product C = C − A × B, where A is stored using row-major order and B is stored using column-major order. It can be observed that in this case a row of A can readily be multiplied with a column of B to yield a vector containing four partial results, which need to be summed up to
Fast and Small Short Vector SIMD Matrix Multiplication Kernels
939
produce one element of C. The vector reduction step introduces superfluous multiply-add operations. In order to minimize their number, four row-column products are computed, resulting in four vectors, which need to be internally reduced. The reduction is performed by first transposing the 4 × 4 element matrix represented by the four vectors and then applying four vector multiply-add operations to produce a result vector containing four elements of C. The basic scheme is depicted in Fig. 1 (left).
Fig. 1. Basic operation of the C = C − A × B T micro-kernel (left). Basic operation of the C = C − A × B micro-kernel (right).
The crucial design choice to be made is the right amount of unrolling, which is equivalent to deciding the right tile size in terms of the triplet {m, n, k} (Here sizes express numbers of individual floating-point values, not vectors). Unrolling is mainly used to minimize the overhead of jumping and advancing the index variable and associated pointer arithmetic. It has been pointed out in Sect. 4.1 that both jump and jump hint instructions belong to the odd pipeline and, for compute intensive loops, can be completely hidden behind even pipeline instructions and thus introduce no overhead. In terms of the overhead of advancing the index variable and related pointer arithmetic, it will be shown in Sect. 4.5 that all of these operations can be placed in the odd pipeline as well. In this situation, the only concern is balancing even pipeline, arithmetic instructions with odd pipeline, data manipulation instructions. Simple analysis can be done by looking at the number of floating-point operations versus the number of loads, stores and shuffles, under the assumption that the size of the register file is not a constraint. The search space for the {m, n, k} triplet is further truncated by the following criteria: only powers of two are considered in order to simplify the loop construction; the maximum possible number of 64 is chosen for k in order to minimize the number of extraneous floating-point instructions performing the reduction of partial results; only multiplies of four are selected for n to allow for efficient reduction of partial results with eight shuffles per one output vector of C. Under these constraints, the entire search space can be easily analyzed.
940
W. Alvaro, J. Kurzak, and J. Dongarra
Table 1 (left) shows how the number of each type of operation is calculated. Table 2 (left) shows the number of even pipeline, floating-point instructions including the reductions of partial results. Table 2 (center) shows the number of even pipeline instructions minus the number of odd pipeline instructions including loads, stores and shuffles (not including jumps and pointer arithmetic). In other words, Table 2 (center) shows the number of spare odd pipeline slots before jumps and pointer arithmetic are implemented. Finally, Table 2 (right) shows the size of code involved in calculations for a single tile. It is important to note here that the double-buffered loop is twice the size. Table 1. Numbers of different types of operations in the computation of one tile of the C = C − A × B T micro-kernel (left) and the C = C − A × B micro-kernel (right) as a function of tile size ({m, n, 64} triplet) Type of Operation Floating point Load A Load B Load C Store C Shuffle
Pipeline
Number of Operations
Type of Operation
(m × n × 64) ⁄ 4 + m × n
Floating point
Even Odd
m × 64 ⁄ 4
Load A
64 × n ⁄ 4
Load B
m×n ⁄4
Load C
m×n ⁄4
Store C
m×n ⁄4×8
Pipeline
Number of Operations
Even Odd
Splat
(m × n × k) ⁄ 4
m×k ⁄4
k×n ⁄4 m×n ⁄4 m×n ⁄4 m×k
Table 2. Unrolling analysis for the C = C − A × B T micro-kernel: left - number of even pipeline, floating-point operations, center - number of spare odd pipeline slots, right - size of code for the computation of one tile M/N 1 2 4 8 16 32 64
4 68 136 272 544 1088 2176 4352
8 16 32 64 136 272 544 1088 272 544 1088 2176 544 1088 2176 4352 1088 2176 4352 8704 2176 4352 8704 17408 4352 8704 17408 34816 8704 17408 34816 69632
M/N 1 2 4 8 16 32 64
4 -22 20 104 272 608 1280 2624
8 16 32 64 -28 -40 -64 -112 72 176 384 800 272 608 1280 2624 672 1472 3072 6272 1472 3200 6656 13568 3072 6656 13824 28160 6272 13568 28160 57344
M/N 1 2 4 8 16 32 64
4 1.2 1.0 1.7 3.2 6.1 12.0 23.8
8 1.2 1.8 3.2 5.9 11.3 22.0 43.5
16 2.3 3.6 6.1 11.3 21.5 42.0 83.0
32 4.5 7.0 12.0 22.0 42.0 82.0 162.0
64 8.9 13.9 23.8 43.5 83.0 162.0 320.0
It can be seen that the smallest unrolling with a positive number of spare odd pipeline slots is represented by the triplet {2, 4, 64} and produces a loop with 136 floating-point operations. However, this unrolling results in only 20 spare slots, which would barely fit pointer arithmetic and jump operations. Another aspect is that the odd pipeline is also used for instruction fetch and near complete filling of the odd pipeline may cause instruction depletion, which in rare situations can even result in an indefinite stall. The next larger candidates are triplets {4, 4, 64} and {2, 8, 64}, which produce loops with 272 floating-point operations, and 104 or 72 spare odd pipeline slots, respectively. The first one is an obvious choice, giving more room in the odd pipeline and smaller code.
Fast and Small Short Vector SIMD Matrix Multiplication Kernels
941
C=C-A×B
4.4
Here, same as before, row major storage is assumed. The key observation is that multiplication of one element of A with one row of B contributes to one row of C. Owing to that, the elementary operation splats an element of A over a vector, multiplies this vector with a vector of B and accumulates the result in a vector of C (Fig. 1). Unlike for the other kernel, in this case no extra floating-point operations are involved. Same as before, the size of unrolling has to be decided in terms of the triplet {m, n, k}. This time, however, there is no reason to fix any dimension. Nevertheless, similar constraints to the search space apply: all dimensions have to be powers of two, and additionally only multiplies of four are allowed for n and k to facilitate efficient vectorization and simple loop construction. Table 1 (right) shows how the number of each type of operation is calculated. Table 3 (left) shows the number of even pipeline, floating-point instructions. Table 3 (center) shows the number of even pipeline instructions minus the number of odd pipeline instructions including loads, stores and splats (not including jumps and pointer arithmetic). In other words, Table 3 (center) shows the number of spare odd pipeline slots before jumps and pointer arithmetic are implemented. Finally, Table 3 (right) shows the size of code involved in calculations for a single tile. It is should be noted again that the double-buffered loop is twice the size. It can be seen that the smallest unrolling with a positive number of spare odd pipeline slots produces a loop with 128 floating-point operations. Five possibilities exist, with the triplet {4, 16, 8} providing the highest number of 24 spare odd pipeline slots. Again, such unrolling would both barely fit pointer arithmetic and jump operations and be a likely cause of instruction depletion. The next larger candidates are unrollings producing loops with 256 floatingpoint operations. There are 10 such cases, with the triplet {4, 32, 8} being the obvious choice for the highest number of 88 spare odd pipeline slots and the smallest code size. Table 3. Unrolling analysis for the C = C − A × B micro-kernel: left - number of even pipeline, floating-point operations, center - number of spare odd pipeline slots, right size of code for the computation of one tile K 4 4 4 4 4 4 4 8 8 8 8 8 8 8 16 16 16 16 16 16 16
M/N 1 2 4 8 16 32 64 1 2 4 8 16 32 64 1 2 4 8 16 32 64
4 4 8 16 32 64 128 256 8 16 32 64 128 256 512 16 32 64 128 256 512 1024
8 8 16 32 64 128 256 512 16 32 64 128 256 512 1024 32 64 128 256 512 1024 2048
16 16 32 64 128 256 512 1024 32 64 128 256 512 1024 2048 64 128 256 512 1024 2048 4096
32 64 32 64 64 128 128 256 256 512 512 1024 1024 2048 2048 4096 64 128 128 256 256 512 512 1024 1024 2048 2048 4096 4096 8192 128 256 256 512 512 1024 1024 2048 2048 4096 4096 8192 8192 16384
K 4 4 4 4 4 4 4 8 8 8 8 8 4 4 16 16 16 16 16 16 16
M/N 1 2 4 8 16 32 64 1 2 4 8 16 32 64 1 2 4 8 16 32 64
4 -7 -10 -16 -28 -52 -100 -196 -12 -16 -24 -40 -72 -136 -264 -22 -28 -40 -64 -112 -208 -400
8 -9 -10 -12 -16 -24 -40 -72 -14 -12 -8 0 16 48 112 -24 -16 0 32 96 224 480
16 -13 -10 -4 8 32 80 176 -18 -4 24 80 192 416 864 -28 8 80 224 512 1088 2240
32 64 -21 -37 -10 -10 12 44 56 152 144 368 320 800 672 1664 -26 -42 12 44 88 216 240 560 544 1248 1152 2624 2368 5376 -36 -52 56 152 240 560 608 1376 1344 3008 2816 6272 5760 12800
K 4 4 4 4 4 4 4 8 8 8 8 8 4 4 16 16 16 16 16 16 16
M/N 1 2 4 8 16 32 64 1 2 4 8 16 32 64 1 2 4 8 16 32 64
4 0.1 0.1 0.2 0.4 0.7 1.4 2.8 0.1 0.2 0.3 0.7 1.3 2.5 5.0 0.2 0.4 0.7 1.3 2.4 4.8 9.6
8 0.1 0.2 0.3 0.6 1.1 2.2 4.3 0.2 0.3 0.5 1.0 1.9 3.8 7.6 0.3 0.6 1.0 1.9 3.6 7.1 14.1
16 0.2 0.3 0.5 1.0 1.9 3.7 7.3 0.3 0.5 0.9 1.7 3.3 6.4 12.6 0.6 1.0 1.7 3.1 6.0 11.8 23.3
32 0.3 0.5 1.0 1.8 3.4 6.8 13.4 0.6 1.0 1.7 3.1 5.9 11.5 22.8 1.1 1.8 3.1 5.6 10.8 21.0 41.5
64 0.6 1.0 1.8 3.4 6.6 12.9 25.5 1.2 1.8 3.2 5.8 11.1 21.8 43.0 2.2 3.4 5.8 10.6 20.3 39.5 78.0
942
4.5
W. Alvaro, J. Kurzak, and J. Dongarra
Advancing Tile Pointers
The remaining issue is the one of implementing the arithmetic calculating the tile pointers for each loop iteration. Due to the size of the input matrices and the tile sizes being powers of two, this is a straightforward task. The tile offsets can be calculated from the tile index and the base addresses of the input matrices using integer arithmetic and bit manipulation instructions (bitwise logical instructions and shifts). Although a few variations are possible, the resulting assembly code will always involve a similar combined number of integer and bit manipulation operations. Unfortunately, all these instructions belong to the even pipeline and will introduce an overhead, which cannot be hidden behind floating point operations, like it is done with loads, stores, splats and shuffles. One way of minimizing this overhead is extensive unrolling, which creates a loop big enough to make the pointer arithmetic negligible. An alternative is to eliminate the pointer arithmetic operations from the even pipeline and replace them with odd pipeline operations. With the unrolling chosen in Sect. 4.3 and Sect. 4.4, the odd pipeline offers empty slots in abundance. It can be observed that, since the loop boundaries are fixed, all tile offsets can be calculated in advance. At the same time, the operations available in the odd pipeline include loads, which makes it a logical solution to precalculate and tabulate tile offsets for all iterations. It still remains necessary to combine the offsets with the base addresses, which are not known beforehand. However, under additional alignment constraints, offsets can be combined with bases using shuffle instructions, which are also available in the odd pipeline. The precalculated offsets have to be compactly packed in order to preserve space consumed by the lookup table. Since tiles are 16 KB in size, offsets consume 14 bits and can be stored in a 16-bit halfword. Three offsets are required for each loop iteration. With eight halfwords in a quadword, each quadword can store offsets for two loop iterations or a single interation of the pipelined, double-buffered loop. The size of the lookup table constructed in this manner equals N 3 /(m × n × k) × 8 bytes. The last arithmetic operation remaining is the advancement of the itaration variable. It is typical to decrement the iteration variable instead of incrementing it, and branch on non-zero, in order to eliminate the comparison operation, which is also the case here. This still leaves the decrement operation, which would have to occupy the even pipeline. In order to annihilate the decrement, each quadword containing six offsets for one itaration of the double-buffered loop also contains a seventh entry, which stores the index of the quadword to be processed next (preceeding in memory). In other words, the iteration variable, which also serves as the index to the lookup table, is tabulated along with the offsets and loaded instead of being decremented. At the same time, both the branch instruction and the branch hint belong to the odd pipeline. Also, a correctly hinted branch does not cause any stall. As a result, such an implementation produces a continuous stream of floating-point operations in the even pipeline, without a single cycle devoted to any other activity.
Fast and Small Short Vector SIMD Matrix Multiplication Kernels
5
943
Results
Both presented SGEMM kernel implementations produce a continuous stream of floating-point instructions for the duration of the pipelined loop. In both cases, the loop iterates 128 times, processing two tiles in each iteration. The C = C − A × B T kernel contains 544 floating-point operations in the loop body and, on a 3.2 GHz processor, delivers 25.54 Gflop/s (99.77 % of peak) if actual operations are counted, and 24.04 Gflop/s (93.90 % of peak) if the standard formula, 2N 3 , is used for operation count. The C = C −A×B kernel contains 512 floating-point operations in the loop body and delivers 25.55 Gflop/s (99.80 % of peak). Here, the actual operation count equals 2N 3 . If used on the whole CELL processor with 8 SPEs, performance in excess of 200 Gflop/s should be expected. Table 4 shows the summary of the kernels’ properties. Table 4. Summary of the properties of the SPE SIMD SGEMM mikro-kernels CharacteristicT Performance
C=C-A×BT C=C-A×BT 24.04
25.55
Gflop/s
Gflop/s
Execution time
21.80 s
20.52 s
Fraction of peak
93.90 %
99.80 %
99.77 %
99.80%
68.75 %
82.81 %
69
69
Code segment size
4008
3992
Data segment size
2192
2048
Total memory footprint
6200
6040
USING THE 2× M× N× K FORMULA
Fraction of peak USING ACTUAL NUMBER OF FLOATING–POINT INSTRUCTIONS
Dual issue rate ODD PIPELINE WORKLOAD
Register usage
The code is freely available, under the BSD license and can be downloaded from the author’s web site http://icl.cs.utk.edu/∼alvaro/.
6
Conclusions
Computational micro-kernels are architecture specific codes, where no portability is sought. It has been shown that systematic analysis of the problem combined with exploitation of low-level features of the Synergistic Processing Unit of the CELL processor leads to dense matrix multiplication kernels achieving peak performance without code bloat.
944
W. Alvaro, J. Kurzak, and J. Dongarra
References 1. IBM Corporation: Cell Broadband Engine Programming Handbook, Version 1.1 (April 2007) 2. Geer, D.: Industry Trends: Chip Makers Turn to Multicore Processors. Computer 38(5), 11–13 (2005) 3. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, Electrical Engineering and Computer Sciences Department, University of California, Berkeley (2006) 4. Dongarra, J.J., Duff, I.S., Sorensen, D.C., van der Vorst, H.A.: Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia (1998) 5. Demmel, J.W.: Applied Numerical Linear Algebra. SIAM, Philadelphia (1997) 6. K˚ agstr¨ om, B., Ling, P., van Loan, C.: GEMM-Based Level 3 BLAS: HighPerformance Model Implementations and Performance Evaluation Benchmark. ACM Trans. Math. Soft. 24(3), 268–302 (1998) 7. Aberdeen, D., Baxter, J.: Emmerald: A Fast Matrix-Matrix Multiply Using Intel’s SSE Instructions. Concurrency Computat.: Pract. Exper. 13(2), 103–119 (2001) 8. Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., Yelick, K.: The Potential of the Cell Processor for Scientific Computing. In: ACM International Conference on Computing Frontiers (2006) 9. Chen, T., Raghavan, R., Dale, J., Iwata, E.: Cell Broadband Engine architecture and its first implementation, A performance view (November 2005), http://www-128.ibm.com/developerworks/power/library/pa-cellperf/ 10. Hackenberg, D.: Einsatz und Leistungsanalyse der Cell Broadband Engine. Institut f¨ ur Technische Informatik, Fakult¨ at Informatik, Technische Universit¨ at Dresden, Großer Beleg (February 2007) 11. Hackenberg, D.: Fast matrix multiplication on CELL systems (July 2007), http://tu-dresden.de/die tu dresden/zentrale einrichtungen/zih/forschun/ architektur und leistungsanalyse von hochleistungsrechnern/cell/
Tridiagonalizing Complex Symmetric Matrices in Waveguide Simulations W.N. Gansterer1 , H. Schabauer1 , C. Pacher2, and N. Finger2 1
University of Vienna, Research Lab Computational Technologies and Applications {wilfried.gansterer,hannes.schabauer}@univie.ac.at 2 Austrian Research Centers GmbH - ARC, Smart Systems Division {christoph.pacher,norman.finger}@arcs.ac.at
Abstract. We discuss a method for solving complex symmetric (nonHermitian) eigenproblems Ax = λBx arising in an application from optoelectronics, where reduced accuracy requirements provide an opportunity for trading accuracy for performance. In this case, the objective is to exploit the structural symmetry. Consequently, our focus is on a nonHermitian tridiagonalization process. For solving the resulting complex symmetric tridiagonal problem, a variant of the Lanczos algorithm is used. Based on Fortran implementations of these algorithms, we provide extensive experimental evaluations. Runtimes and numerical accuracy are compared to the standard routine for non-Hermitian eigenproblems, LAPACK/zgeev. Although the performance results reveal that more work is needed in terms of increasing the fraction of Level 3 Blas in our tridiagonalization routine, the numerical accuracy achieved with the nonHermitian tridiagonalization process is very encouraging and indicates important research directions for this class of eigenproblems. Keywords: Tridiagonalization, complex symmetric eigenvalue problems, waveguide simulation, optoelectronic.
1
Introduction
We discuss methods for efficiently tridiagonalizing a complex symmetric (nonHermitian) matrix. The term complex matrix is used to denote a matrix which has at least one element with a nonzero imaginary part. Tridiagonalization is an important preprocessing step in reduction-based (tridiagonalization-based) methods for computing eigenvalues and eigenvectors of dense real symmetric or complex Hermitian matrices. In the context considered here, the underlying complex symmetric eigenvalue problem (EVP) has similar ˆ B ˆ ∈ Cn×n structural, but different mathematical properties. Given matrices A, H H ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ with A = A (but A = A ) and B = B (but B = B ), the objective is to efficiently compute eigenvalues λ and eigenvectors y of the generalized EVP ˆ = λBy ˆ . Ay
(1)
The main challenge is to find ways for utilizing the structural symmetry in the absence of the mathematical properties of Hermitian matrices. M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 945–954, 2008. c Springer-Verlag Berlin Heidelberg 2008
946
W.N. Gansterer et al.
Although problems of the type (1) do not occur as frequently in practice as real symmetric or complex Hermitian problems, there are some important applications where they arise (see, for example, [1,2,3]). The efforts summarized in this paper are motivated by the simulation of guided-wave multi-section devices in optoelectronics. As described in Section 2, techniques for numerically solving Maxwell’s equations in this context lead to dense EVPs of the type (1). Analogously to Hermitian problems, one possible approach for solving probˆ = In . Complex symlem (1) starts with reducing it to standard form where B metry allows for special techniques in this reduction step which are not discussed here. After that, a tridiagonalization process is performed on the standard EVP which results in a similar complex symmetric tridiagonal matrix T . After this tridiagonalization step, eigenvalues and eigenvectors of T are computed and the eigenvectors are backtransformed. In the following, we focus on symmetry-preserving approaches for efficiently tridiagonalizing a complex symmetric matrix. This functionality constitutes a central step in the solution process outlined above and is one way of exploiting the available structure. A more detailed discussion of the other steps involved in solving (1) will be provided in a forthcoming paper. Mathematically speaking, structural symmetry is not a very distinctive feature of complex matrices, since every matrix A ∈ Cn×n is similar to a complex symmetric matrix [1]. In contrast to a real symmetric matrix a complex symmetric matrix A is not necessarily diagonalizable. Nevertheless, structural symmetry is of great interest for the development of space- and time-efficient algorithms. Obviously, half of the information in a complex symmetric matrix is redundant, and efficient algorithms should be able to take advantage of this fact in terms of memory requirements as well as in terms of computational effort. The utilization of this purely structural property in the absence of important mathematical properties of Hermitian matrices requires a trade-off in numerical stability. In order to perform a symmetry preserving similarity transformation, the transformation matrix Q ∈ Cn×n needs to be complex orthogonal (but not unitary), that is, it has to satisfy Q Q = In . Related Work. Various non-reduction based methods for solving complex symmetric EVPs have been proposed, for example, based on the Jacobi method [4], on the Lanczos method [5], or on variants of the Jacobi-Davidson method [6]. For dense matrices, reduction-based methods tend to be more efficient. A modified conventional Householder-based reduction method has been described in [2]. The tridiagonalization of a dense complex symmetric matrix has also been investigated in [3]. In [2], the resulting tridiagonal complex symmetric problem is solved using a modified QR algorithm. Other related approaches for computing eigenvalues of a complex symmetric tridiagonal matrix were discussed in [7,8]. Synopsis. In Section 2 of this paper, the motivating application from optoelectronics is summarized. In Section 3, the tridiagonalization method investigated in this paper is discussed in detail, and Section 4 contains experimental results. Conclusions and future work are summarized in Section 5.
Tridiagonalizing Complex Symmetric Matrices in Waveguide Simulations
2
947
Guided-Wave Multisection Devices
The use of high-index contrast waveguides (WGs) in novel guided-wave devices for telecom- and sensing applications allows a very versatile tailoring of the flow of light. However, an efficient design requires the direct numerical solution of Maxwell’s equations in inhomogeneous media. In many important cases such devices can be successfully modeled as follows: (i) in the x-direction (direction of propagation) the material parameters are piecewise constant, (ii) the material parameters and the optical fields do not depend on the y-coordinate, and (iii) in the z-direction the material parameters are allowed to vary arbitrarily. Usually, the z-dimension is of the order of up to several tens of wavelengths whereas the device extension into the x-direction is several hundreds of wavelengths. A powerful numerical method for the solution of Maxwell’s equations in such WG-based devices is the eigenmode expansion technique (which is often referred to as mode-matching (MM ) technique) [9,10,11], where the electromagnetic field components in each subsection being homogeneous in the x-direction are represented in terms of a set of local eigenmodes. MM requires a small computational effort compared to other numerical techniques like two-dimensional finite-elements or FDTD which can be regarded as “brute-force” methods from the viewpoint of device physics. However, MM can only be as stable and efficient as the algorithms used to determine the required set of local WG modes. Due to the open boundary conditions (see Section 2.1) and materials with complex dielectric permittivities these local eigenmodes have typically complex eigenvalues which makes their correct classification very difficult: numerical instabilities can arise from an improper truncation of the mode spectrum. In a recently developed variant of the MM technique — the so-called variational mode-matching (VMM ) [12] — this stability problem is avoided by applying a Galerkin approach to the local wave equations and taking into account the whole spectrum of the discretized operators. 2.1
The VMM-Approach
Within the 2D-assumption ∂y (·) = 0, Maxwell’s equations for dielectric materials characterized by the dielectric permittivity ε(x, z) take the form ∂x a∂x φ + ∂z a∂z φ + k02 bφ = 0 ,
(2)
where φ = Ey , a = 1, b = ε for TE- and φ = Hy , a = 1ε , b = 1 for TMpolarization, respectively; k0 = 2π λ0 (vacuum wavelength λ0 ). In the z-direction, the simulation domain is 0 ≤ z ≤ L. To permit an accurate description of radiation fields, an artificial absorber (that mimics an open domain) has to be installed near z = 0 and z = L. For this purpose so-called perfectly-matched layers (PMLs) are used by employing the complex variable stretching approach [13], i. e., in the vicinity of the domain boundaries the coorz dinate z is extended into the complex plane: z → z˜ = z + ı 0 dτ σ(τ ), where σ is the PML parameter determining the absorption strength. At z = 0 and z = L
948
W.N. Gansterer et al.
Dirichlet- or Neumann boundary conditions (BCs) are set. However, they should not have a significant influence on the overall optical field since the physical BCs must be given by the PMLs. In the x-direction, the structure is divided into nl local WGs, which expand over xl−1 ≤ x ≤ xl = xl−1 + dl with 1 ≤ l ≤ nl . Under the necessary condition that ε does not depend on x Eq. (2) can be solved inside each local WG l with the separation ansatz φ(l) (x, z˜) =
nϕ j=1
ϕj (˜ z)
nϕ
(l) (l) (l) (l) (l) cjρ αρ,+ eık0 νρ (x−xl−1 ) + αρ,− e−ık0 νρ (x−xl ) ,
(3)
ρ=1
where ρ labels the local waveguide modes. The transverse shape functions ϕj (˜ z) (the same set is used for all local WGs) must satisfy the outer boundary conditions. Apart from this constraint, the ϕj ’s may be chosen rather freely allowing for adaptive refinement in z-regions where rapid variations of the field are expected. This ansatz reduces the 2D problem to a set of nl 1D problems. After inserting Eq. (3) into Eq. (2), Galerkin’s method is applied to obtain a (l) discretized version of Eq. (2) for each local WG l. Finally, the coefficients αρ,± are “mode-matched” by imposing the physical boundary conditions at all the xl -interfaces [12]. 2.2
The Complex Symmetric Eigenvalue Problem
For each local WG, the discretized version of Eq. (2) is a generalized EVP of the form Acρ = (νρ )2 Bcρ , (4) where we have suppressed the index l for simplicity. Here, the νρ ’s are the modal refractive indices and the cjρ ’s are the corresponding modal expansion coefficients is a sum of a mass- and a stiffness-matrix, occurring in Eq. (3).1 A z (∂z˜ϕm (˜ z ϕm (˜ z )b(˜ z )ϕj (˜ z ) − k2 d˜ z ))a(˜ z )(∂z˜ϕj (˜ z )), whereas B is a Amj = d˜ 0 z ϕm (˜ z )a(˜ z )ϕj (˜ z ). pure mass-matrix: Bmj = d˜ The generalized EVP (4) has the following properties: (i) A and B are complex symmetric: the complex coordinate z˜ originating from the PMLs (and the possibly complex material constants a and b) are responsible for the complexvaluedness; (ii) B is indefinite (due to the open boundary conditions represented by the PMLs and a possibly negative material constant a); (iii) the typical order of the matrices for 2D problems is 100–1000 (depending on the geometry and the required truncation order of the modal expansion—in 3D models the order can be much higher); (iv ) the full spectrum of eigenpairs is required; (v ) the required accuracy is of the order 10−8 for the eigenpairs corresponding to the lowest order WG modes (approx. 10% of the mode spectrum); a somewhat lower accuracy (approx. 10−6 ) is acceptable for the remainder of the spectrum; (vi) depending on the WG geometry, some of the eigenvalues (especially those corresponding to the lowest order WG modes) may almost degenerate. It is evident that an efficient eigenvalue solver which utilizes the symmetry of the EVP (4) as well as its special properties is a very important building block for efficient 2D and 3D optical mode solvers.
Tridiagonalizing Complex Symmetric Matrices in Waveguide Simulations
3
949
Methodology
The standard approach to solving a dense complex symmetric EVP (as available, for example, in Lapack [14]) is to treat it as a nonsymmetric EVP: the complex symmetric matrix is reduced to upper Hessenberg form, from which eigenvalues and eigenvectors are computed using a QR iteration. The main motivation behind investigating tridiagonalization-based approaches as an alternative is the obvious potential for reducing storage and runtime requirements. In order to preserve symmetry, complex orthogonal similarity transformations (COTs) Q are needed which satisfy Q Q = In . In general, Q2 ≥ 1 and thus the application of complex orthogonal matrices can increase numerical errors. 3.1
Splitting Methods
The real part R and the imaginary part S of a complex symmetric matrix A = R + iS are real symmetric matrices. One basic idea, which has been introduced earlier [3], is to separate the tridiagonalization of R from the tridiagonalization of S as much as possible. More specifically, part of a column of R can be eliminated using a (real) orthogonal Householder transformation Q_R. After that, a (smaller) part of the corresponding column of S can be eliminated without causing any fill-in in R using another (real) orthogonal Householder transformation Q_I. Both of these operations are performed in real arithmetic, and both transformation matrices have norm one. Eventually, a single nonzero element below the subdiagonal in S remains to be eliminated. This operation has to be performed in complex arithmetic, using a 2 × 2 COT, whose norm cannot be bounded a priori. When the column elimination is first performed in R and then in S, we call the procedure the RI variant. Analogously, it is possible to eliminate first in S and then in R. We call this procedure the IR variant. The advantages of splitting methods seem obvious: most of the computation can be done in real arithmetic, only one third of the transformations are potentially unstable, and this danger can easily be monitored because of the low dimensions of the COTs (only order two).

Complex Orthogonal Transformations. The transformation matrix

  G := (1/√(z^2 + s^2)) [ z, s; −s, z ],   z = z_1 + i z_2 ∈ C,  z_1, z_2, s ∈ R,   (5)

(rows separated by semicolons) defines a COT since G^T G = I_2. Consequently, G A G^T is a similarity transformation of A. In the RI variant, a COT G_RI has to be determined such that

  G_RI [ a + ib; ic ] = [ d + ie; 0 ],

where a, b, c, d, e ∈ R and c ≠ 0. Choosing the parameters z = s(b/c − i a/c), s ≠ 0 arbitrary, the COT is given as

  G_RI = (1/√(b^2 − a^2 + c^2 − i(2ab))) [ b − ia, c; −c, b − ia ].   (6)
In the IR variant, a COT G_IR has to be determined such that

  G_IR [ a + ib; c ] = [ d + ie; 0 ].

With z = s(a/c + i b/c), s ≠ 0 arbitrary, the COT is given as

  G_IR = (1/√(a^2 − b^2 + c^2 + i(2ab))) [ a + ib, c; −c, a + ib ].   (7)
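A minimal sketch of Eqs. (6)–(7) follows (Python/NumPy; the function names and test values are invented for illustration). It builds G_RI and G_IR from given real a, b, c, checks complex orthogonality G^T G = I_2, and verifies that the targeted entry is annihilated; note that the 2-norm of G is in general larger than one.

# Sketch of the 2x2 COTs of Eqs. (6)-(7); variable names are illustrative only.
import numpy as np

def cot_ri(a, b, c):
    # G_RI from Eq. (6): maps (a + ib, ic)^T to (d + ie, 0)^T, requires c != 0
    denom = np.sqrt(complex(b**2 - a**2 + c**2, -2.0 * a * b))
    return np.array([[b - 1j * a, c], [-c, b - 1j * a]]) / denom

def cot_ir(a, b, c):
    # G_IR from Eq. (7): maps (a + ib, c)^T to (d + ie, 0)^T, requires c != 0
    denom = np.sqrt(complex(a**2 - b**2 + c**2, 2.0 * a * b))
    return np.array([[a + 1j * b, c], [-c, a + 1j * b]]) / denom

a, b, c = 0.3, -1.2, 0.8
G = cot_ri(a, b, c)
print(np.allclose(G.T @ G, np.eye(2)))            # complex orthogonal: G^T G = I_2
print(G @ np.array([a + 1j * b, 1j * c]))         # second component is (numerically) zero
print(np.linalg.norm(G, 2))                       # ||G||_2 >= 1, not bounded a priori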
3.2 Numerical Aspects
In a splitting method, the complex orthogonal transformations (5) are the only non-unitary transformations; all other transformations used have unit norm. If ‖G‖_2 ≫ 1, the accuracy of the tridiagonalization process could be influenced negatively. G is a normal matrix, and thus its spectral norm is given by its largest eigenvalue in modulus:

  ‖G‖_2 = ( (1 + γ) / (1 − γ) )^{1/4}   with   γ = 2|z_2 s| / (z_1^2 + z_2^2 + s^2).   (8)

If γ approaches one, the accuracy of the tridiagonalization process may deteriorate. For G_RI and G_IR, respectively, γ in (8) becomes

  γ_RI = 2|ac| / (a^2 + b^2 + c^2),   γ_IR = 2|bc| / (a^2 + b^2 + c^2).
We observe that the freedom in choosing the parameter s does not help in controlling the norm of the COT, since γRI and γIR are independent of s. During the tridiagonalization process, monitoring the norms of the COTs makes it possible to detect potentially large errors. Various strategies have been suggested to avoid large norms, such as the recovery transformations proposed in [3]. Adaptive Elimination Order. The order of processing R and S can be determined independently in each iteration of the tridiagonalization process. For both variants, the norm of each COT can be precomputed with only marginal overhead. Based on this information, the COT with the smaller norm can be selected and the corresponding variant carried out. Obviously, this heuristic choice is only a local minimization and there is no guarantee that it minimizes the accumulated norm of all COTs in the tridiagonalization process. Comparison to and combination with recovery transformations are topics of ongoing work.
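The adaptive heuristic can be stated in a few lines. The sketch below (illustrative only; it shows just the local decision, not the full tridiagonalization) evaluates γ_RI and γ_IR for the current column data a, b, c, converts them to COT norms via (8), and selects the variant with the smaller norm; a value of γ close to one would flag a potentially unstable COT.

# Sketch of the adaptive elimination order heuristic: before eliminating the last
# subdiagonal entry of the current column, compare the norms that the RI and IR
# variants would produce (via Eq. (8)) and pick the smaller one.  The inputs a, b, c
# are the real quantities appearing in Eqs. (6)-(7).

def cot_norm(gamma):
    # ||G||_2 from Eq. (8); gamma close to 1 signals a potentially unstable COT
    return ((1.0 + gamma) / (1.0 - gamma)) ** 0.25

def choose_variant(a, b, c):
    s = a * a + b * b + c * c
    gamma_ri = 2.0 * abs(a * c) / s
    gamma_ir = 2.0 * abs(b * c) / s
    if gamma_ri <= gamma_ir:
        return "RI", cot_norm(gamma_ri)
    return "IR", cot_norm(gamma_ir)

print(choose_variant(0.3, -1.2, 0.8))   # -> ('RI', ...): |a| < |b|, so RI gives the smaller norm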
4 Experimental Evaluation
In our experiments, we used the following routines: zsysta reduces a generalized EVP (Â, B̂) to a standard EVP (A), and zsyevv solves the standard complex symmetric EVP. The latter consists of a newly implemented RI tridiagonalization (zsytridi), compev [15] for computing eigenvalues and inverm [15] for computing corresponding eigenvectors of the complex symmetric tridiagonal matrix. zsyevg tests the accuracy of the tridiagonalization process by first calling zsytridi, followed by a call of LAPACK/zgeev on the resulting tridiagonal matrix. The codes were run on a Sun Fire v40z with 4 dual-core Opteron 875 CPUs (2.2 GHz) and 24 GB main memory. Suse Linux Enterprise Server 10, the GNU Fortran 95 compiler, Lapack version 3.1.1, Goto Blas 1.20, and the AMD Core Math Library (Acml 4.0.1) were used. We experimented with random test matrices with elements in [0, 2] as well as with a real application case.

4.1 Numerical Accuracy
Denoting with (λ_i, x_i) the eigenpairs computed by LAPACK/zgeev, and with (λ̃_i, x̃_i) the eigenpairs computed by zsyevv, an eigenvalue error E and a residual error R have been computed according to

  E := max_i |λ̃_i − λ_i| / |λ_i|,   R := max_i ‖(A − λ̃_i I_n) x̃_i‖_2 / ‖A‖_2,   i ∈ {1, ..., n}.
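For reference, the two metrics can be computed as follows (an illustrative NumPy sketch, not the authors' test harness; it assumes the two spectra have already been matched so that λ̃_i and λ_i refer to the same eigenvalue):

# Sketch of the accuracy metrics E and R defined above; lam_ref would come from
# LAPACK/zgeev and lam, X from zsyevv (here they are just arrays).
import numpy as np

def eigen_error(lam_ref, lam):
    return np.max(np.abs(lam - lam_ref) / np.abs(lam_ref))

def residual_error(A, lam, X):
    # columns of X are the computed eigenvectors x_i
    normA = np.linalg.norm(A, 2)
    res = [np.linalg.norm(A @ X[:, i] - lam[i] * X[:, i], 2) for i in range(len(lam))]
    return max(res) / normA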
Fig. 1 illustrates that the loss of accuracy in the tridiagonalization process itself is surprisingly low! Although the total values of E and R of zsyevv increase up to 10^{-6}, most of this error is due to the Lanczos variant used for solving the tridiagonal problem. The error introduced by the RI tridiagonalization is only about two orders of magnitude higher than the one of LAPACK/zgeev.

1D Waveguide Problem. The waveguide structure is a Si/SiOx twin waveguide operated in TM-polarization at a wavelength λ_0 = 1.55 μm. The dielectric constants are ε_Si = 12.96 and ε_SiOx = 2.25. The core thickness and core separation are 0.5 μm and 0.25 μm, respectively. The z-extension of the model domain, terminated by electrically perfectly conducting walls, is 10 μm. The PML-layer thickness is 1 μm with the PML-parameter σ = 1. As shape functions, localized linear hat functions and polynomial bubble functions with a degree up to 24 were used. For reducing the generalized problem (4) to standard form, we computed a generalized (complex) symmetric Cholesky factor F of B̂. With ‖B̂ − F F^T‖_2 = 1.8 · 10^{-16}, the accuracy of this factorization is satisfactory for our test case. The eigenpairs (λ_i, x_i) of the resulting standard problem computed using Gnu Octave were compared with the eigenpairs (λ̃_i, x̃_i) computed by our routine zsyevv. Backtransformation of the eigenvectors leads to a weighted residual error

  max_{i=1,...,n} ‖(Â − λ̃_i B̂) ỹ_i‖_2 / (‖Â‖_2 ‖B̂‖_2) = 3.8 · 10^{-14},

which is a very satisfactory accuracy (for this test case, ‖Â‖_2 = 928, ‖B̂‖_2 = 2).
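The reduction to standard form used here can be sketched as follows (again illustrative only, not the zsysta routine): a lower triangular F with F F^T = B is computed by a Cholesky-like recursion in complex arithmetic without conjugation and without pivoting, which may break down for unlucky pivots, and the pencil is then transformed to F^{-1} A F^{-T}, which is again complex symmetric.

# Sketch of the reduction idea: compute a complex symmetric factor F with F F^T = B
# (no conjugation, no pivoting -- it can break down for unlucky pivots), then transform
# the pencil (A, B) to the standard form F^{-1} A F^{-T}.
import numpy as np

def complex_symmetric_cholesky(B):
    n = B.shape[0]
    F = np.zeros_like(B, dtype=complex)
    for j in range(n):
        pivot = B[j, j] - np.dot(F[j, :j], F[j, :j])
        F[j, j] = np.sqrt(pivot + 0j)               # complex square root
        for i in range(j + 1, n):
            F[i, j] = (B[i, j] - np.dot(F[i, :j], F[j, :j])) / F[j, j]
    return F

def to_standard_form(A, B):
    F = complex_symmetric_cholesky(B)
    Finv = np.linalg.inv(F)
    return Finv @ A @ Finv.T, F                     # standard complex symmetric EVP

# factorization residual, analogous to ||B - F F^T||_2 quoted above:
# F = complex_symmetric_cholesky(B); print(np.linalg.norm(B - F @ F.T, 2))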
Fig. 1. Accuracy of zsyevv, LAPACK/zgeev, and zsyevg operating on random matrices: the eigenvalue error E and the residuals R of zsyevv, zsyevg, and LAPACK/zgeev (roughly between 10^{-14} and 10^{-5}) plotted against the order n = 100–4000 of the eigenproblem Ax = λx.
4.2 Runtime Performance
We compared our routine zsyevv to LAPACK/zgeev using two different implementations of the Blas. Fig. 2 shows that the current version of zsyevv is faster than zgeev only if the Acml Blas is used. With the overall faster Goto Blas, zgeev outperforms our implementation. At first sight, this result is disappointing. Despite the exploitation of the structure, the new routine is slower than the more general routine for nonsymmetric problems for the best Blas available. A more detailed analysis helps to pinpoint the reason. Table 1 shows the percentages of the total runtimes which each of the two routines spent in their different parts for the two different Blas versions. For our routine zsyevv, the tridiagonalization part zsytridi clearly dominates the computation time for all problem sizes and for both Blas versions. This shows that our current code zsytridi is unable to take advantage of the faster Goto Blas. Three different parts of LAPACK/zgeev have been timed separately: zgehrd reduces the complex matrix A to upper Hessenberg form, zhseqr computes the eigenvalues of the Hessenberg matrix, and ztrevc computes corresponding eigenvectors. The runtime for all other code parts of LAPACK/zgeev is summed under “rest”. The picture here is quite different. For the Acml Blas, the operations on the Hessenberg matrix clearly dominate for all problem sizes, whereas for the faster Goto Blas, the percentages for the three dominating parts become very similar for large problem sizes. Summarizing, we observe that our current code cannot utilize a faster Blas. This is not surprising, since so far it is dominated by Level 2 Blas operations and more effort is needed to increase the fraction of Level 3 Blas operations.
Fig. 2. Runtimes [s] of zsyevv and LAPACK/zgeev (each with the ACML and Goto Blas) operating on random matrices, plotted against the order n = 100–4000 of the eigenproblem Ax = λx.

Table 1. Percentages of runtimes spent in parts of zsyevv and LAPACK/zgeev

                    zsyevv                      LAPACK/zgeev
BLAS     n    zsytridi  compev  inverm    zgehrd  zhseqr  ztrevc  rest
Acml   500      87.2      6.5     6.3       8.1    82.6     6.2    3.1
      2000      93.9      1.5     4.6       7.2    84.2     6.2    2.4
      4000      94.5      0.8     4.7       4.4    90.3     3.8    1.5
Goto   500      87.3      6.5     6.2      15.3    66.9    12.0    5.9
      2000      92.7      1.9     5.4      22.7    50.6    18.9    7.8
      4000      93.7      1.0     5.3      28.6    37.8    23.9    9.7
5 Conclusions and Future Work
Motivated by application problems arising in optoelectronics, a tridiagonalization process for complex symmetric matrices based on complex orthogonal transformations has been investigated. Compared to the standard Lapack routine for nonsymmetric eigenproblems, the loss of numerical accuracy caused by the potentially unstable tridiagonalization process is surprisingly low in practice. However, partly in contrast to results published earlier [16], the performance benefits achieved are not yet satisfactory, especially for highly optimized Blas. The effort summarized here motivates various further research activities. Methodologically, the performance results indicate the need for blocked approaches. This suggests that non-splitting methods, where A is not split into real and imaginary parts, can be an attractive alternative. For the optoelectronics problem, the matrices in (4) can be made banded in some situations by choosing appropriate shape functions. This motivates the investigation of efficient algorithms for generalized banded complex symmetric eigenvalue problems.
References
1. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985)
2. Ohnami, K., Mikami, Y.: Resonance scattering in a two-dimensional non-integrable system. J. Phys. A 25, 4903–4912 (1992)
3. Bar-On, I., Ryaboy, V.: Fast diagonalization of large and dense complex symmetric matrices, with applications to quantum reaction dynamics. SIAM J. Sci. Comput. 18, 1412–1435 (1997)
4. Leung, A.Y.T., Liu, Y.F.: A generalized complex symmetric eigensolver. Comput. and Structures 43, 1183–1186 (1992)
5. Cullum, J.K., Willoughby, R.A.: A practical procedure for computing eigenvalues of large sparse nonsymmetric matrices. In: Cullum, J.K., Willoughby, R.A. (eds.) Proceedings of the IBM Europe Institute Workshop on Large Scale Eigenvalue Problems, pp. 193–223. North-Holland, Amsterdam (1986)
6. Arbenz, P., Hochstenbach, M.E.: A Jacobi–Davidson method for solving complex symmetric eigenvalue problems. SIAM J. Sci. Comput. 25(5), 1655–1673 (2004)
7. Luk, F., Qiao, S.: Using complex-orthogonal transformations to diagonalize a complex symmetric matrix. In: Luk, F.T. (ed.) Advanced Signal Processing: Algorithms, Architectures, and Implementations VII, Proc. SPIE, vol. 162, pp. 418–425 (1997)
8. Cullum, J.K., Willoughby, R.A.: A QL procedure for computing the eigenvalues of complex symmetric tridiagonal matrices. SIAM J. Matrix Anal. Appl. 17, 83–109 (1996)
9. Sudbo, A.S.: Film mode matching: A versatile numerical method for vector mode field calculations in dielectric waveguides. Pure and Appl. Optics 2, 211–233 (1993)
10. Franza, O.P., Chew, W.C.: Recursive mode matching method for multiple waveguide junction modeling. IEEE Trans. Microwave Theory Tech. 44, 87–92 (1996)
11. Bienstman, P., Baets, R.: Optical modelling of photonic crystals and VCSELs using eigenmode expansion and perfectly matched layers. Optical and Quantum Electronics 33, 327–341 (2001)
12. Finger, N., Pacher, C., Boxleitner, W.: Simulation of Guided-Wave Photonic Devices with Variational Mode-Matching. American Institute of Physics Conference Series, vol. 893, pp. 1493–1494 (April 2007)
13. Teixeira, F.L., Chew, W.C.: General closed-form PML constitutive tensors to match arbitrary bianisotropic and dispersive linear media. IEEE Microwave Guided Wave Lett. 8, 223–225 (1998)
14. Anderson, E., Bai, Z., Bischof, C.H., Blackford, S., Demmel, J.W., Dongarra, J.J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.C.: Lapack Users' Guide, 3rd edn. SIAM Press, Philadelphia (1999)
15. Cullum, J.K., Willoughby, R.A.: Lanczos Algorithms for Large Symmetric Eigenvalue Computations, vol. 1: Theory, vol. 2: Programs. Birkhäuser, Boston, MA (1985)
16. Bar-On, I., Paprzycki, M.: High performance solution of the complex symmetric eigenproblem. Numerical Algorithms 18, 195–208 (1998)
On Using Reinforcement Learning to Solve Sparse Linear Systems Erik Kuefler and Tzu-Yi Chen Computer Science Department, Pomona College, Claremont CA 91711, USA {kuefler,tzuyi}@cs.pomona.edu
Abstract. This paper describes how reinforcement learning can be used to select from a wide variety of preconditioned solvers for sparse linear systems. This approach provides a simple way to consider complex metrics of goodness, and makes it easy to evaluate a wide range of preconditioned solvers. A basic implementation recommends solvers that, when they converge, generally do so with no more than a 17% overhead in time over the best solver possible within the test framework. Potential refinements of, and extensions to, the system are discussed. Keywords: iterative methods, preconditioners, reinforcement learning.
1 Introduction
When using an iterative method to solve a large, sparse, linear system Ax = b, applying the right preconditioner can mean the difference between computing x accurately in a reasonable amount of time, and never finding x at all. Unfortunately choosing a preconditioner that improves the speed and accuracy of the subsequently applied iterative method is rarely simple. Not only is the behavior of many preconditioners not well understood, but there are a wide variety to choose from (see, for example, the surveys in [1,2]). In addition, many preconditioners allow the user to set the values of one or more parameters, and certain combinations of preconditioners can be applied in concert. Finally, there are relatively few studies comparing different preconditioners, and the guidelines that are provided tend to be general rules-of-thumb. To provide more useful problem-specific guidelines, recent work explores the use of machine learning techniques such as decision trees [3], neural networks [4], and support vector machines [5,6] for recommending preconditioned solvers. This line of research attempts to create a classifier that uses assorted structural and numerical features of a matrix in order to recommend a good preconditioned solver (with parameter settings when appropriate). At a minimum, these techniques recommend a solver that should be likely to converge to the solution vector. However, each paper also describes assorted extensions: [3] attempts to recommend a preconditioned solver that converges within some user-defined parameter of optimal, [5] attempts to give insight into why certain solvers fail,
and [4] considers different use scenarios. In addition, [7] tries to predict the efficiency of a solver in terms of its time and memory usage, and [3] describes a general framework within which many machine learning approaches could be used. Other work explores statistics-based data mining techniques [8]. A drawback of the existing work is its dependence on supervised learning techniques. In other words, to train the classifier they need access to a large body of data consisting not only of matrix features, but also information on how different preconditioned solvers perform on each matrix. If the goal is predicting convergence, the database needs to keep track of whether a particular preconditioned solver with particular parameter settings converges for each matrix. However, if time to convergence is also of interest, the database must have consistent timing information. Furthermore, there must be an adequate number of test cases to allow for accurate training. These requirements may become problematic if such techniques are to be the basis of long term solutions. An appealing alternative is reinforcement learning, which differs from previously applied machine learning techniques in several critical ways. First, it is unsupervised which means the training phase attempts to learn the best answers without being told what they are. This makes it easier to consider a large variety of preconditioned solvers since no large collection of data gathered by running examples is necessary for training the system. Second, it allows the user to define a continuous reward function which it then tries to maximize. This provides a natural way to introduce metrics of goodness that might, for example, depend on running time rather than just trying to predict convergence. Third, reinforcement learning can be used to actually solve linear systems rather than just recommending a solver. After describing how reinforcement learning can be applied to the problem of choosing between preconditioned solvers, results of experiments using a basic implementation are discussed. Extensions and refinements which may improve the accuracy and utility of the implementation are also presented.
2 Using Reinforcement Learning
Reinforcement learning is a machine learning technique that tries to gather knowledge through undirected experimentation, rather than being trained on a specially-crafted body of existing knowledge [9]. This section describes how it can be applied to the problem of selecting a preconditioned iterative solver. Applying reinforcement learning to a problem requires specifying a set of allowable actions, a reward (or cost) associated with each action, and a state representation. An agent then interacts with the environment by selecting an option from the allowable actions, and keeps track of the environment by maintaining an internal state. In response to the actions taken, the environment gives a numerical reward to the agent and may change in a way that the agent can observe by updating its state. As the agent moves within the environment, the agent attempts to assign a value to actions taken while in each state. This value is what the agent ultimately wishes to maximize, so computing an accurate
action-value function is the agent’s most important goal. Note that the reward from taking an action in a state is different from its value: the former reflects the immediate benefit of taking that single action whereas the latter is a long-term estimate of the total rewards the agent will receive in the future as a result of taking that action. The agent learns the action-value function through a training process consisting of some number of episodes. In each episode, the agent begins at some possible starting point. Without any prior experiences to guide it, the agent proceeds by performing random actions and observing the reward it receives after taking such actions. After performing many actions over several episodes, the agent eventually associates a value with every pair of states and actions. As training continues, these values are refined as the agent chooses actions unlike those it has taken previously. Eventually the agent will be able to predict the value of taking each action in any given state. At the end of the training the agent has learned a function that gives the best action to take in any given state. When the trained system is given a matrix to solve, it selects actions according to this function until it reaches a solution.

2.1 Application to Solving Sparse Linear Systems
Reinforcement learning can be applied to the problem of solving sparse linear systems by breaking down the solve process into a series of actions, specifying the options within each action, and defining the allowable transitions between actions. Fig. 1 shows an example which emphasizes the flexibility of the framework. For example, the two actions labelled “scale” and “reorder,” with transitions allowed in either direction between them, can capture the following (not unusual) sequence of actions: equilibrate the matrix, permute large entries to the diagonal, scale the matrix to give diagonal entries magnitude 1, apply a fill-reducing ordering. The implementation simply needs to allow all those matrix manipulations as options within the “scale” and “reorder” actions. Similarly, the single “apply iterative solver” step could include all the different iterative methods described in [10] as options. And every action can be made optional by including the possibility of doing nothing. Of course, increasing the flexibility in the initial specification is likely to increase the cost of training the system. The state can be captured as a combination of where the agent is in the flowchart and assorted matrix features. These features should be cheap to compute and complete enough to represent the evolution of the matrix as it undergoes assorted actions. For example, features might include the matrix bandwidth or a matrix norm: the former is likely to change after reordering and the latter after scaling. While the framework in Fig. 1 does allow for unnecessary redundant actions such as computing and applying the same fill-reducing heuristic twice, a well-chosen reward function will bias the system against such repetition. For example, a natural way to define the reward function is to use the time elapsed in computing each step. This not only allows the algorithm to see the immediate, short-term effects of the actions it plans to take, but also allows it to estimate
Fig. 1. One set of actions that could be used to describe a wide variety of solvers for sparse linear systems
the remaining time that will be required once that action is completed. In other words, the algorithm should be able to learn that taking a time-consuming action (e.g., computing a very accurate preconditioner) could be a good idea if it puts the matrix into a state that it knows to be very easy to solve. Notice that this means the framework gracefully allows for a direct solver (essentially a very accurate, but expensive to compute, preconditioner). In addition, if there are actions that result in failures from which there is no natural way to recover, those could be considered to result in essentially an infinite amount of time elapsing. If later a technique for recovery is developed, it can be incorporated into the framework by adding to the flowchart. Training the system consists of giving it a set of matrices to solve. Since the system must explore the space of possibilities and uses some randomness to do so, it should attempt to solve each matrix in the training set several times.
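The training loop described above can be illustrated with a deliberately tiny toy (Python; everything here, from the action names to the failure model, is a placeholder and not the paper's C++ implementation, which also folds matrix features into the state). It only shows the shape of the procedure: epsilon-greedy action selection per flowchart step, rewards equal to the negative elapsed time, an extra penalty on failure, and a one-step value update.

# Toy illustration of the training loop: tabular value estimates over the flowchart
# positions only.  Rewards are the negative elapsed times; a failed solve gets a
# large extra penalty.  All names and numbers are placeholders.
import random, time

ACTIONS = {"scale": ["none", "equilibrate"],
           "reorder": ["natural", "rcm"],
           "solve": ["ilu+gmres(weak)", "ilu+gmres(strong)"]}
STEPS = ["scale", "reorder", "solve"]
Q = {(s, a): 0.0 for s in STEPS for a in ACTIONS[s]}   # optimistic initial values
alpha, eps, fail_penalty = 0.1, 0.2, 1.0e3

def run_step(step, option, matrix):
    """Placeholder for actually performing the operation; returns (elapsed, failed)."""
    t0 = time.perf_counter()
    failed = step == "solve" and "weak" in option and random.random() < 0.5
    return time.perf_counter() - t0 + random.random(), failed

def episode(matrix):
    for i, step in enumerate(STEPS):
        opts = ACTIONS[step]
        a = random.choice(opts) if random.random() < eps else max(opts, key=lambda o: Q[(step, o)])
        elapsed, failed = run_step(step, a, matrix)
        reward = -elapsed - (fail_penalty if failed else 0.0)
        nxt = 0.0 if i + 1 == len(STEPS) else max(Q[(STEPS[i + 1], o)] for o in ACTIONS[STEPS[i + 1]])
        Q[(step, a)] += alpha * (reward + nxt - Q[(step, a)])

for _ in range(1000):
    episode(matrix=None)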
2.2 Implementation Details
The general framework for applying reinforcement learning to this problem is described above; important details that are specific to the implementation discussed in this paper are presented here. First, the set of steps and allowable actions are restricted to those shown in Fig. 2. There are fewer actions than in Fig. 1, and the options within each action are restricted to the following:
– Equilibrate: The matrix can be initially equilibrated, or left alone.
– Reorder: The rows and columns of the matrix can be left unpermuted (natural), or one or the other could be reordered using a permutation computed using: MC64 [11,12], Reverse Cuthill-McKee [13], or COLAMD [14,15].
– Precondition: The preconditioner is restricted to the ILUTP Mem [16] variant of incomplete LU, with one of 72 combinations of parameter settings: lfil between 0 and 5 inclusive, a droptol of 0, .001, .01, or .1, and a pivtol of 0, .1, or 1.
– Solve: The iterative solver is restricted to GMRES(50) [17] with a maximum of 500 iterations and a relative residual of 1e−8.
The reinforcement learning framework allows for many more combinations of preconditioners than earlier studies which also restrict the solver to restarted
Fig. 2. Possible transitions between steps and their associated actions
GMRES and/or the preconditioner to a variant of ILU [4,5,6,7,18]. Observe, for example, that equilibration is now optional. Hence a total of 576 preconditioned solvers are described by the above framework; this is notably more than used to evaluate systems based on other machine learning techniques [3,4,5]. A system for automatically selecting from amongst so many options is particularly valuable given previous work that shows the difficulty of presenting information accurately comparing different preconditioned solvers across a range of metrics [19]. Note that because the state keeps track of where the program is in the flowchart, the system can restart the entire preconditioned solve if and only if the incomplete factorization breaks down or if GMRES fails to converge. As a result, the final system will be more robust since it can try different approaches if the first fails. While such step-based restrictions are not strictly necessary, incorporating domain knowledge by requiring the agent to perform computations in a logical order should reduce the training time and improve the accuracy of the trained system. The state also keeps track of 32 structural and numerical features derived from the matrix itself. These are the same features as those used in [4], which are a subset of those used in [3,18]. Since each action changed the values of some of the features, this allowed the agent to observe the changes it made to the matrix during the computation and to react to those changes. Finally, since the overall goal is minimizing the total time required to solve the matrices in the training set, the reward function used is the negative of the time required to complete that step. To bias the system against actions which are very fast but do not lead to a successful solve, the agent receives an additional reward (penalty) if GMRES fails to converge or if the ILU preconditioner cannot be computed. Without this safeguard, the agent might repeatedly take an action that cannot succeed and thus make no progress in learning the action-value function. The action-value function is initialized to 0, even though all true action values are negative. This is the “optimistic initial values” heuristic described in [9] that has the beneficial effect of encouraging exploration during early iterations of the
algorithm. Since the agent is effectively expecting a reward of 0 for each action, it will be continually “disappointed” with each action it takes after receiving a negative reward, and will thus be encouraged to experiment with a wide range of actions before eventually learning that they will all give negative rewards. The high-level reinforcement learning algorithm was implemented in C++, with C and Fortran 77 used for the matrix operations. The code was compiled using g++, gcc, and g77 using the -O2 and -pthread compiler flags. The testing, training, and exhaustive solves were run on a pair of Mac Pro computers each running Ubuntu with 2 GB of RAM and four 2.66 GHz processors.
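A small sketch (illustrative, with invented identifiers) makes the size of the restricted search space explicit and shows the optimistic zero initialization: 2 equilibration choices × 4 orderings × 72 ILUTP Mem parameter settings give the 576 preconditioned solvers mentioned above.

# Counting the preconditioned solvers allowed by the restrictions above:
# 2 equilibration choices x 4 orderings x (6 lfil x 4 droptol x 3 pivtol) ILUTP_Mem
# settings, followed by GMRES(50), gives 576.  Names are illustrative.
from itertools import product

equilibrate = [False, True]
reorder = ["natural", "MC64", "RCM", "COLAMD"]
ilutp = list(product(range(6),                  # lfil in 0..5
                     [0.0, 1e-3, 1e-2, 1e-1],   # droptol
                     [0.0, 0.1, 1.0]))          # pivtol
solvers = list(product(equilibrate, reorder, ilutp))
print(len(ilutp), len(solvers))   # 72 576

# optimistic initialization as described above: start every value estimate at 0 even
# though all true rewards (negative times) are below 0, which encourages exploration
Q = {cfg: 0.0 for cfg in solvers}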
3 Experimental Results
The system described above was tested on a pool of 664 matrices selected from the University of Florida sparse matrix collection [20]. So that the results could be compared against the best results possible, all 576 preconditioned solvers allowed for by Fig. 2 were run on each matrix. However, due to time constraints, only 608 of the 664 matrices completed all 576 runs. Fig. 3 plots the number of matrices (out of 608) that converged for a given number of runs; note that each bar represents a decile. Every matrix converged for at least one setting, and 7 converged for all settings. Overall, 42% of the tested preconditioned solvers converged. For each matrix the fastest time taken to solve it was also saved so that the results using the trained system could be compared to it.
Fig. 3. Convergence results from testing all 576 possible preconditioned solvers on 608 of the matrices in the test suite. The y-axis gives the number of matrices which converged for some number of the solvers, the x-axis partitions 576 into deciles.
3.1 Methodology
The following protocol for training and testing was repeated 10 times. The system was trained on 10% of the matrices, chosen at random, by solving each of those matrices 40 times. Since the framework restarts if the ILU factorization fails or GMRES does not converge, potentially many more than 40
attempts were made. As demonstrated in Fig. 3, every matrix can be solved by at least one solver, so eventually repeated restarts should result in finding the solution. After the training phase, the algorithm was tested on two sets of matrices. The first was equivalent to the training set; the second contained all 664 matrices. From each testing set the number of matrices successfully solved on the algorithm’s first attempt (without a restart on failure) was calculated. Next, the time taken to solve each matrix was divided by the fastest time possible within the framework described in Section 2.2. If the reinforcement learning algorithm did a good job of learning, this ratio should be close to 1. If it equals 1, then the algorithm learned to solve every matrix in the best possible way.
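The two reported quantities can be computed directly from the per-matrix timings, as in the following sketch (illustrative; failed solves are marked with NaN, and the best times come from the exhaustive runs):

# Sketch of the two reported metrics: the fraction solved on the first attempt and
# the median of (time taken) / (fastest time found by exhaustive search).
import numpy as np

def summarize(solve_times, best_times):
    """solve_times[i] is the trained system's time for matrix i (np.nan if it failed);
    best_times[i] is the fastest of the 576 exhaustive runs for matrix i."""
    solve_times, best_times = np.asarray(solve_times, float), np.asarray(best_times, float)
    solved = ~np.isnan(solve_times)
    ratios = solve_times[solved] / best_times[solved]
    return solved.mean(), np.median(ratios)

print(summarize([1.2, np.nan, 0.7, 2.0], [1.0, 0.5, 0.6, 1.9]))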
3.2 Results
Table 1 gives both the percentages of matrices that the system successfully solves on its first try and the time it took to solve them. These numbers are given both when the algorithm is tested on matrices in its training set and when it is tested on a more diverse set of matrices.

Table 1. Percent of systems successfully solved, and the median ratio of the time taken to solve those systems vs. the fastest solver possible, both when the testing and training sets are equivalent and when the testing set is larger and more diverse

                  testing = training    testing = all matrices
percent solved          81.8%                   56.4%
ratio of time           1.14                    1.16
As expected, convergence results are best when the training and testing set are identical, with a success rate of 81.8%. When tested on the entire set of matrices, 56.4% of matrices were successfully solved (note that both of these percentages should go up if restarts are allowed). As was done for Fig. 3, Fig. 4 plots the number of matrices that were successfully solved in a given number of trials. Note that there were 10 trials overall and that, on average, a matrix should only be in the training set once. Comparing Fig. 4 to Fig. 3, observe that matrices were more likely to be solved in a greater percentage of cases, and that a larger number of cases converged overall (56% vs 42%). This indicates that the system has learned an action-value function that appropriately penalizes preconditioned solvers which cannot solve a system. Since the time taken to solve each matrix must be compared to the optimal time (as computed through exhaustive search), the second row in Table 1 takes the ratio of solved time to best possible time and gives the median of those ratios. Note that this ratio could only be computed for the 608 matrices on which the full set of exhaustive runs was completed. While the results were slightly better when the training and testing sets were equivalent, overall half the matrices that were solved were done so with no more than 16% overhead over the fastest
Fig. 4. The number of matrices which were correctly solved on the first try for a given number of trials (out of 10)
solution possible regardless of whether the matrix was in the testing set as well as the training set.
4 Discussion
This paper describes a framework for using reinforcement learning to solve sparse linear systems. This framework differs from that of previous systems based on other machine learning techniques because it can easily factor running time into the recommendation, it makes it practical to consider a far larger number of potential preconditioned solvers, and it actually solves the system. In addition, the framework is extensible in the sense that it is simple to add new operations such as a novel iterative solver or a new choice of preconditioner. An initial implementation that focussed on solving systems using ILU preconditioned GMRES is described. And while the convergence results presented in Section 3 are not as good as those in papers such as [4], the problem being solved here is more complex: rather than predicting if any of a set of preconstructed solvers would be likely to solve a particular matrix, this architecture creates its own solver as an arbitrary combination of lower level operations. Furthermore, the results are based on the system’s first attempt at solving a problem — there was no possibility of a restart on failure since, without learning (which injects some randomness) in the final trained system, a restart without some matrix modification would result in the same failure. Note that either incorporating randomness (say by enabling learning) and allowing a restart after any kind of failure, or trying something more complex such as adding αI to A [21] upon a failure to compute the ILU preconditioner, should improve the convergence results. Of course, restarts would take time, so the ratio of time solved to best possible time would increase. The fact that the code had trouble solving general-case matrices when the testing set is much more diverse than the training set suggests that the algorithm may not be generalizing sufficiently. This is a known issue in reinforcement learning (and all other machine learning techniques), and there are standard ways to attempt to improve this. Possibilities include a more sophisticated state
encoding (e.g., Kanerva Coding [22]), or reducing the set of matrix features used to define the state to those that are particularly meaningful (work on determining these features is currently underway). As with other machine learning techniques, there are also many opportunities to find better constants in the implementation. For the tested implementation, values for parameters such as the number of training episodes, the learning rate, the eligibility trace decay, the size of tiles, and the number of tilings were chosen based on general principles and were experimented with only slightly. An intriguing direction for future work is exploring alternative reward functions. Even within the current implementation a modified reward function that, say, punished failure more might improve the behavior of the trained system. But, in addition, the reward function could be modified to use any metric of goodness. For example, a function that depended on a combination of space and time usage could be used to build a recommendation system that would take into account both. And, in fact, one could imagine a personalized system for solving sparse linear systems that allows users to define a reward function which depends on the relative utilities they assign to a wide variety of resources.

Acknowledgements. The authors would like to thank Tom Dietterich for helpful discussions. This work was funded in part by the National Science Foundation under grant #CCF-0446604. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References
1. Benzi, M.: Preconditioning techniques for large linear systems: A survey. J. of Comp. Physics 182(2), 418–477 (2002)
2. Saad, Y., van der Vorst, H.A.: Iterative solution of linear systems in the 20th century. J. Comput. Appl. Math. 123(1-2), 1–33 (2000)
3. Bhowmick, S., Eijkhout, V., Freund, Y., Fuentes, E., Keyes, D.: Application of machine learning to the selection of sparse linear solvers. International Journal of High Performance Computing Applications (submitted, 2006)
4. Holloway, A.L., Chen, T.-Y.: Neural networks for predicting the behavior of preconditioned iterative solvers. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 302–309. Springer, Heidelberg (2007)
5. Xu, S., Zhang, J.: Solvability prediction of sparse matrices with matrix structure-based preconditioners. In: Proc. Preconditioning 2005, Atlanta, Georgia (2005)
6. Xu, S., Zhang, J.: SVM classification for predicting sparse matrix solvability with parameterized matrix preconditioners. Technical Report 450-06, University of Kentucky (2006)
7. George, T., Sarin, V.: An approach recommender for preconditioned iterative solvers. In: Proc. Preconditioning 2007, Toulouse, France (2007)
8. Ramakrishnan, N., Ribbens, C.J.: Mining and visualizing recommendation spaces for elliptic PDEs with continuous attributes. ACM Trans. on Math. Softw. 26(2), 254–273 (2000)
9. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
10. Barrett, R., Berry, M., Chan, T.F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., van der Vorst, H.: Templates for the solution of linear systems: Building blocks for iterative methods. SIAM, Philadelphia (1994)
11. Duff, I.S., Koster, J.: The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J. Matrix Anal. Appl. 20(4), 889–901 (1999)
12. Duff, I.S., Koster, J.: On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM J. Matrix Anal. Appl. 22(4), 973–996 (2001)
13. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse symmetric matrices. In: Proc. of the 24th Natl. Conf. of the ACM, pp. 157–172 (1969)
14. Davis, T., Gilbert, J., Larimore, S., Ng, E.: Algorithm 836: COLAMD, a column approximate minimum degree ordering algorithm. ACM Trans. on Math. Softw. 30(3), 377–380 (2004)
15. Davis, T., Gilbert, J., Larimore, S., Ng, E.: A column approximate minimum degree ordering algorithm. ACM Trans. on Math. Softw. 30(3), 353–376 (2004)
16. Chen, T.-Y.: ILUTP Mem: A space-efficient incomplete LU preconditioner. In: Laganà, A., Gavrilova, M.L., Kumar, V., Mun, Y., Tan, C.J.K., Gervasi, O. (eds.) ICCSA 2004. LNCS, vol. 3046, pp. 31–39. Springer, Heidelberg (2004)
17. Saad, Y., Schultz, M.H.: GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7(3), 856–869 (1986)
18. Xu, S., Zhang, J.: A data mining approach to matrix preconditioning problem. Technical Report 433-05, University of Kentucky (2005)
19. Lazzareschi, M., Chen, T.-Y.: Using performance profiles to evaluate preconditioners for iterative methods. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganà, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3982, pp. 1081–1089. Springer, Heidelberg (2006)
20. Davis, T.: University of Florida sparse matrix collection. NA Digest 92(42), October 16, 1994; NA Digest 96(28), July 23, 1996; and NA Digest 97(23), June 7, 1997. http://www.cise.ufl.edu/research/sparse/matrices/
21. Manteuffel, T.A.: An incomplete factorization technique for positive definite linear systems. Mathematics of Computation 34, 473–497 (1980)
22. Kanerva, P.: Sparse Distributed Memory. MIT Press, Cambridge (1988)
Reutilization of Partial LU Factorizations for Self-adaptive hp Finite Element Method Solver
Maciej Paszynski and Robert Schaefer
Department of Computer Science, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland
paszynsk,[email protected]
http://home.agh.edu.pl/~paszynsk
Abstract. The paper presents theoretical analysis of the extension of the new direct solver dedicated for the fully automatic hp adaptive Finite Element Method. The self-adaptive hp-FEM generates in a fully automatic mode (without any user interaction) a sequence of meshes delivering exponential convergence of the numerical error with respect to the mesh size. The consecutive meshes are obtained by performing h, p or hp refinements. The proposed solver constructs an initial elimination tree based on the nested dissection algorithm executed over the initial mesh. The constructed elimination tree is updated each time the mesh is refined, by adding the elimination sub-tree related to the executed refinement. We propose a new strategy for reutilization of partial LU factorizations computed by the direct solver on the previous mesh, when solving a consecutive mesh from the sequence. We show that the number of LU factorizations that must be recomputed is linearly proportional to the number of singularities in the problem.
1 Motivation and the Basic Idea of Solution
The paper presents a theoretical analysis of the extension of the sequential and parallel solvers [1], [2] dedicated for the self-adaptive hp Finite Element Method [3], [4], [5]. The self-adaptive hp-FEM generates a sequence of approximation spaces delivering exponential convergence of the numerical error of the resulting approximation of the variational problem under consideration. The exponential convergence of the error is obtained with respect to the dimension of the approximation space. The self-adaptive hp-FEM starts from an initial approximation space, constructed by utilizing a given uniform initial finite element mesh. The first order polynomial basis functions ("pyramids") are related to vertices of the mesh, and the higher order polynomial basis functions are related to finite element edges and interiors [3]. The consecutive spaces from the produced sequence are obtained by performing so-called h or p refinements. The h refinement consists in breaking a selected finite element into new son-elements, and adding new basis functions related to the just created elements. The p refinement consists in adding higher order basis functions associated with selected element edges or interiors. The refinements performed to improve the quality of the approximation
Fig. 1. Updating of the elimination tree when the mesh is h refined
space are selected by utilizing a knowledge-driven algorithm [6] based on the graph grammar formalism. An efficient solver must be utilized to compute coefficients of the projection of the considered weak (variational) problem solution onto the current approximation space. The coefficients are called degrees of freedom (d.o.f.). These coefficients, denoted by u^i_{hp}, are computed by solving the system of equations

  Σ_{i=1}^{dim} u^i_{hp} b(e_i, e_j) = l(e_j)   ∀ j = 1, ..., dim,   (1)
where dim denotes the dimension of the approximation space (number of the basis functions), {e_k}_{k=1}^{dim} denote the basis functions, and b(e_i, e_j) and l(e_j) are matrix and right-hand-side vector entries obtained by computing some integrals resulting from the considered problem. Here we present a short description of direct solvers utilized by FEM. The frontal solver browses finite elements in the order prescribed by the user and aggregates d.o.f. to the so-called frontal matrix. Based on the element connectivity information it recognizes fully assembled degrees of freedom and eliminates them from the frontal matrix [7]. This is done to keep the size of the frontal matrix as small as possible. The key to efficient work of the frontal solver is the optimal ordering of finite elements. The multi-frontal solver constructs the d.o.f. connectivity tree based on analysis of the geometry of the computational domain [7]. The frontal elimination pattern is utilized on every tree branch. Finite elements are joined into pairs and d.o.f. are assembled into the frontal matrix associated with the branch. The process is repeated until the root of the assembly tree is reached. Finally, the common dense problem is solved and partial backward substitutions are recursively executed on the assembly tree. The sub-structuring method solver is a parallel solver working over a computational domain partitioned into multiple sub-domains [8]. First, the sub-domains' internal d.o.f. are eliminated with respect to the interface d.o.f. Second, the interface problem is solved. Finally, the internal problems are solved by executing backward substitution on each sub-domain. This can be done by performing frontal decomposition on each sub-domain, and then solving the interface problem by a sequential frontal solver (this method is called the multiple fronts solver [9]). A better method is to
Fig. 2. Elimination tree for a simple two-element mesh. Fully aggregated degrees of freedom from element interiors are eliminated in parallel, the resulting Schur complement contributions are added, and the common interface problem is finally solved. The process is followed by recursive backward substitutions (not presented in the picture).
solve the interface problem also by a parallel solver (this is called the direct sub-structuring method solver). The parallel implementation of the multi-frontal solver is called the sparse direct method solver. The MUlti frontal Massively Parallel Solver (MUMPS) [10] is an example of such a solver. A new efficient sequential and parallel solver for self-adaptive hp-FEM has been designed [1], [2], utilizing an elimination tree constructed based on the history of mesh refinements. The elimination tree for the initial mesh is created by utilizing the nested dissection algorithm. An exemplary two-element mesh with its elimination tree is presented in the first panel of Fig. 1. Each time a decision about mesh refinement is made, the elimination tree is dynamically expanded by adding a sub-tree related to the performed refinements. An example of two h refinements performed on the initial mesh, with the resulting expansion of the elimination tree, is presented in Fig. 1. Thus, we can distinguish two levels on the elimination tree. The first level is related to the initial mesh elements, and the second level is related to refinements performed over the initial mesh. The following observation is the key idea of the designed solver [1], [6]. The integral b(e_i, e_j) is non-zero only if the intersection of the supports of e_i and e_j is not empty. The support of a vertex basis function spreads over the finite elements sharing the vertex, the support of an element edge basis function spreads over the two finite elements adjacent to the edge, and finally the support of an element interior basis function spreads only over the element. Thus, the integral b(e_i, e_j) is zero if the basis functions are related to distant elements. The solver first constructs partially aggregated sub-matrices related to single finite elements, then it eliminates those entries that have already been fully assembled, and then it recursively merges the resulting sub-matrices and eliminates fully assembled entries until it reaches the top of the elimination tree. Finally, it executes recursive backward substitutions, from the root of the tree down to the leaves. An exemplary execution of the solver on the two-element initial mesh from Fig. 1 is presented in Fig. 2. The resulting LU factorizations computed at every node of the elimination tree can be stored at tree nodes for further reutilization. Each time the mesh
Fig. 3. The problem is solved over the first mesh. All LU factorizations (black and grey) are computed. Then, the mesh is refined, and the problem is solved again. Grey LU factorizations are reutilized from the previous mesh, but all brown LU factorizations must be recomputed. Black LU factorizations from previous mesh are deleted.
is refined, the LU factorizations from the unrefined parts of the mesh can be reutilized. There is a need to recompute the LU factorizations over the refined elements, as well as on the whole path from any refined leaf up to the root of the elimination tree. An example of the reutilization of partial LU factorizations after performing two local refinements is presented in Fig. 3.
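A minimal sketch of the bookkeeping implied by Fig. 3 (illustrative Python, not the solver's code): given the set of refined leaves and the parent pointers of the elimination tree, every node on a path from a refined leaf to the root is marked for recomputation, while all other nodes keep their stored factorizations.

# Sketch of which factorizations must be redone after local refinements: every node
# on a path from a refined leaf to the root is marked dirty, everything else keeps
# its stored LU/Schur data.  The tree here is just parent pointers.
def nodes_to_recompute(refined_leaves, parent):
    dirty = set()
    for leaf in refined_leaves:
        node = leaf
        while node is not None and node not in dirty:
            dirty.add(node)
            node = parent.get(node)
    return dirty

# tiny elimination tree:  root -> (a, b),  a -> (a1, a2),  b -> (b1, b2)
parent = {"a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b", "b2": "b", "root": None}
print(nodes_to_recompute({"a1"}, parent))          # {'a1', 'a', 'root'}
print(nodes_to_recompute({"a1", "b2"}, parent))    # two paths, shared root counted once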
2 Theoretical Analysis of the Solver Efficiency
We start this section with a sketch of the recursive solver algorithm, with reutilization of LU factorizations.

matrix function recursive solver(tree node)
  if (tree node has no son nodes) then
    eliminate leaf element stiffness matrix internal nodes;
    store Schur complement sub-matrix at tree node;
    return (Schur complement sub-matrix);
  else if (tree node has son nodes) then
    do (for each tree node son)
      if (sub-tree has been refined) then
        son matrix = recursive solver(tree node son);
      else
        get the Schur complement sub-matrix from tree node son;
      endif
      merge son matrix into new matrix;
    enddo
    decide which unknowns of new matrix can be eliminated;
    perform partial forward elimination on new matrix;
    store Schur complement sub-matrix at tree node;
    return (Schur complement sub-matrix);
  endif
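The same recursion can be written down concretely for dense sub-matrices. The sketch below (Python/NumPy, an illustration of the idea rather than the actual hp-FEM solver) eliminates the internal unknowns of a node via a Schur complement, merges the sons' contributions through a user-supplied merge function (the d.o.f. numbering is problem dependent), and returns a cached Schur complement unchanged when the corresponding subtree has not been refined; backward substitution and the right-hand side are omitted.

# Dense, single-matrix sketch of the recursion above: each tree node eliminates its
# fully assembled (internal) unknowns and returns the Schur complement for its
# interface unknowns; an unrefined subtree returns the Schur complement cached in a
# previous solve instead of recomputing it.
import numpy as np

class TreeNode:
    def __init__(self, sons=None, leaf_matrix=None, internal=None, interface=None):
        self.sons = sons or []
        self.leaf_matrix = leaf_matrix      # element matrix (leaves only)
        self.internal = internal            # indices eliminated at this node
        self.interface = interface          # indices kept for the parent
        self.refined = True                 # False => reuse the cached Schur complement
        self.schur = None

def eliminate(M, internal, interface):
    Mii, Mib = M[np.ix_(internal, internal)], M[np.ix_(internal, interface)]
    Mbi, Mbb = M[np.ix_(interface, internal)], M[np.ix_(interface, interface)]
    return Mbb - Mbi @ np.linalg.solve(Mii, Mib)          # Schur complement

def recursive_solver(node, merge):
    # merge(list_of_son_schurs) assembles the joint matrix for this node;
    # its definition depends on the shared d.o.f. numbering and is not shown here
    if node.schur is not None and not node.refined:
        return node.schur                                 # reutilization
    if not node.sons:
        M = node.leaf_matrix
    else:
        M = merge([recursive_solver(s, merge) for s in node.sons])
    node.schur = eliminate(M, node.internal, node.interface)
    return node.schur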
Computational Complexity of the Sequential, Recursive Solver Without Reutilization of LU Factorizations. Let us first estimate the number of operations performed by a sequential recursive solver during forward elimination over a square-shaped 2D finite element mesh with N = 2^n × 2^n finite elements. The order of approximation in the interior of an element is assumed to be equal to (p_1, p_2). The orders of approximation on element edges are assumed to be equal to the corresponding orders in the interior. From this assumption it follows that there are 2 edges with order p_1 and 2 edges with order p_2. The total number of d.o.f. in such an element is nrdof = (p_1 + 1)(p_2 + 1) = O(p_1 p_2). To estimate the efficiency of the sequential solver, we assume that p_1 = p_2 = p, e.g. by taking p = max{p_1, p_2}. Thus, the total number of d.o.f. satisfies nrdof = (p + 1)^2 = O(p^2), the number of interior d.o.f. can be evaluated as interior nrdof = (p − 1)^2 = O(p^2), and the number of interface d.o.f. satisfies interface nrdof = (p + 1)^2 − (p − 1)^2 = 4p = O(p). The recursive solver eliminates d.o.f. related to element interiors. The computational complexity of this step is 2^{2n} × O(p^6), since there are 2^{2n} such finite elements and the internal d.o.f. elimination cost is O(p^6) on every element. Then, the solver joins elements into pairs and eliminates d.o.f. related to common edges. The computational complexity of this operation is 2^{2n−1} × ((2 + 4 + 1) p)^2 × (2 + 4) p, since there are 2^{2n−1} such pairs of elements, there are 7 edges in total within a pair, and only one edge is eliminated. In the next step elements are joined into sets of four, and d.o.f. related to two common edges are eliminated. The computational complexity of this step is 2^{2n−2} × ((4 × 2 + 2) p)^2 × (4 × 2) p, since there are 2^{2n−2} such sets of elements, there are 10 edges in every set, and only 2 edges are eliminated.
Fig. 4. Two tested meshes with uniform p = 4 and p = 5
The process is repeated until we reach the root of the elimination tree. The total computational complexity of this process is

  2^{2n} p^6 + 2^{2n−1} (2 + 4 + 1)^2 p^2 (2 + 4) p
  + Σ_{k=1,...,n} [ 2^{2n−2k−1} (2 × 2^{k+1} + 2 × 2^k + 2^k)^2 p^2 (2 × 2^{k+1} + 2 × 2^k) p
                  + 2^{2n−2k} (2 × 2^k + 2 × 2^k + 2^k)^2 p^2 (2 × 2^k + 2 × 2^k) p ].
Fig. 5. The execution time of the parallel solver over the second tested mesh
This can be estimated by utilizing the sum of the geometric series as

  T_1 = O(2^{2n} p^6) + O(2^{2n−1} p^3) + O( Σ_{k=1,...,n} 2^{2n+k+5} p^3 )
      = O( 2^{2n} p^6 + (2^{2n−1} + 2^{3n+6} − 2^{2n+4}) p^3 ) = O(2^{2n} p^6 + 2^{3n} p^3 + 2^{2n} p^3).   (2)

Computational Complexity of the Sequential Solver With Reutilization of LU Factorizations. In this section we perform the same analysis of the computational complexity as in the previous section, but this time we assume that the problem over the computational mesh has already been solved, and only one element has been h refined in the direction of a mesh corner singularity. In this case, there is a need to compute all LU factorizations related to the elimination sub-tree associated with the broken corner element. It is also necessary to recompute all LU factorizations on the single path from the refined element (represented by a leaf in the original elimination tree) up to the root of the tree. The computational complexity over the broken element is

  4 p^6 + 2 (2 + 4 + 1)^2 p^2 (2 + 4) p + (4 × 2 + 2)^2 p^2 (4 × 2) p,   (3)

since there are 4 element interiors, two single common edges and 1 twofold edge. The computational complexity of the recomputation of the whole path from the refined leaf up to the elimination tree root can be estimated by utilizing equation (2), with the correction that there is only one set of elements on every level of the tree, and without the leaf element computations already estimated in (3):

  (2 + 4 + 1)^2 p^2 (2 + 4) p
  + Σ_{k=1,...,n} [ (2 × 2^{k+1} + 2 × 2^k + 2^k)^2 p^2 (2 × 2^{k+1} + 2 × 2^k) p
                  + (2 × 2^k + 2 × 2^k + 2^k)^2 p^2 (2 × 2^k + 2 × 2^k) p ].   (4)
Table 1. Execution time at different elimination tree nodes on the two tested meshes

                              First mesh              Second mesh
Tree level  Nodes number  min time [s]  max time [s]  min time [s]  max time [s]
    1            1           0.115         0.115         0.212         0.212
    2            2           0.854         0.883         1.631         1.674
    3            4           0.864         2.406         1.617         4.625
    4            8           0.828         2.542         1.675         4.535
    5           16           0.904         2.750         1.621         4.686
    6           32           0.049         0.230         1.606         4.763
    7           64          < 10^-2       < 10^-2       < 10^-2        0.110
  8-14       128-9216       < 10^-3       < 10^-3       < 10^-3       < 10^-3
The total computational complexity of the solver reutilizing LU factorizations is equal to the sum of (3) and (4), that is

  T_1^1 = O(p^6) + O(p^3) + O( Σ_{k=1,...,n} 2^{3k+6} p^3 )
        = O( p^6 + (1 + 2^{3n+6} − 2^6) p^3 ) = O(p^6 + 2^{3n} p^3).   (5)

In the case of multiple refined leaves, the pessimistic estimation is that each leaf will generate a separate path to be totally recomputed. Thus, the total computational complexity with r refined leaves (resulting from r/4 singularities) is

  T_1^r = O( r p^6 + (r + r 2^{3n+6} − r 2^6) p^3 ) = O(r p^6 + r 2^{3n} p^3).   (6)
We conclude this section with a comparison of the execution time of the sequential solver with reutilization of LU factorizations with respect to the sequential solver without the reutilization:

  T_1 / T_1^r = O( 2^{2n} / r ) = O( N / r ).   (7)

The solver with reutilization of partial LU factorizations is O(N/r) times faster.

Complexity of the Parallel Solver Without Reutilization of LU Factorizations. The parallel version of the solver exchanges the partially aggregated matrices between same-level nodes [1]. Leaves of the elimination tree are assigned to different processors. When traveling up the elimination tree, the local Schur complements are sent from the second child node to the first one (to the first processor in every set). To estimate the computational complexity of the parallel recursive solver, we assume that the number of processors is P = 2^{2m}. Each processor is responsible for its part of the mesh, with 2^{2n−2m} finite elements. Thus, each processor performs

  O( 2^{2(n−m)} p^6 + 2^{3(n−m)} p^3 )   (8)
operations on its part of the mesh. After this step, all computations over the elimination tree are performed fully in parallel:

  Σ_{k=m+1,...,n} [ (2 × 2^{k+1} + 2 × 2^k + 2^k)^2 p^2 (2 × 2^{k+1} + 2 × 2^k) p
                  + (2 × 2^k + 2 × 2^k + 2^k)^2 p^2 (2 × 2^k + 2 × 2^k) p ]
  = O( p^3 Σ_{k=m+1,...,n} 2^{2k} ) = O( p^3 Σ_{k=1,...,n−m} 2^{2(m+k)} ) = O( 2^{2(n−m)} p^3 ).   (9)
The communication complexity involves 2(n − m + 1) parallel point-to-point communications where sub-matrices related to local Schur complements are exchanged between pairs of tree nodes. The communication complexity is then

  Σ_{k=m+1,...,n} 2 × (2^k × p)^2 = O( p^2 Σ_{k=1,...,n−m} 2^{2(m+k)} ) = O( 2^{2(n−m)} p^2 )   (10)
since the size of every sub-matrix is 2^k × p. The total complexity of the parallel solver without reutilization of the LU factorizations is then

  T_P = ( 2^{2(n−m)} p^6 + 2^{3(n−m)} p^3 + 2^{2(n−m)} p^3 ) × t_comp + 2^{2(n−m)} p^2 × t_comm   (11)

with P = 2^{2m} the number of processors, and p the order of approximation.

Complexity of the Parallel Solver With Reutilization of LU Factorizations. In the case of the parallelization of the reutilization, the maximum number of processors that can be utilized is equal to r, the number of elements refined within the actual mesh. Each refinement requires the recomputation of the whole path from the refined leaf up to the tree root, which is purely sequential. If the number of processors P = 2^{2m} is larger than or equal to the number of executed refinements, 2^{2m} ≥ r, then the total computational complexity can be roughly estimated as the parallel execution of r paths from a leaf to the root of the tree, which is equal to (5). The communication complexity remains unchanged, since there is still a need to exchange the LU factorizations, even if they are taken from local tree nodes. Thus the communication complexity is equal to (10). The total complexity of the parallel solver with reutilization of LU factorizations is
(12)
This is the "best parallel time" that can be obtained by the parallel solver with reutilization of partial LU factorizations, under the assumption that we have enough available processors (P = 2^{2m} ≥ r). In other words, it is not possible to utilize more processors than the number of refined elements r. We can compare the execution time of the parallel solver with reutilization to that of the parallel solver without the reutilization (as usual under the assumption that we have enough processors, P = 2^{2m} ≥ r):
  T_P / T_P^r = O( 2^{2(n−m)} ) = O( N / 2^{2m} ) = O( N / P ) ≤ O( N / r ).   (13)

The parallel solver with reutilization is O(N/r) times faster than the parallel solver without the reutilization.
Reutilization of Partial LU Factorizations
3
973
Test Results
We conclude the presentation with two numerical experiments, presented in Fig. 4. The goal of these experiments is to illustrate the limitation of the scalability of the solver by the sequential part of the algorithm - the longest path from the root of the elimination tree down to the deepest leaf. For more numerical experiments executed for much larger problems, with more detailed discussion on the performance of the solver, as well as for the detailed comparison with the MUMPS solver, we refer to [1]. Both numerical experiments have been performed for the 3D Direct Current (DC) borehole resistivity measurement simulations [11]. The 3D problem has been reduced to 2D by utilizing the Fourier series expansions in the non-orthogonal system of coordinates. We refer to [11] for the detailed problem formulation. The first mesh contains 9216 finite elements with polynomial order of approximation p = 4, and 148, 257 d.o.f. The second mesh contains 9216 finite elements with polynomial order of approximation p = 5, and 231, 401 d.o.f. Both meshes have been obtained by performing two global hp refinements from the initial mesh with 32 × 18 = 576 finite elements with polynomial order of approximation p = 2 or p = 3, respectively. There are necessary 10 nested dissection cross-sections of the initial mesh, since 32 × 18 ≤ 25 × 25 . Thus, the depth of the initial elimination tree is 10. Each global hp refinement consists in breaking each finite element into 4 son elements and increasing polynomial order of approximation by 1. Thus, each global hp refinement adds 2 levels to the elimination tree, so the total number of levels in the elimination tree is 14. Table 1 contains the total number of nodes at given elimination tree level, as well as the minimum and maximum Schur complement computation times for nodes located at given level of the elimination tree. The time of computing the entire path of partial LU factorization from a tree leaf up to the elimination tree root varies from 4 sec. to 9 sec. on the first mesh and from about 10 sec. up to 17 sec. on the second mesh. The execution time of the sequential solver with reutilization of LU factorizations over r times refined mesh will be within (4 × r, 9 × r) sec. over the first and (10 × r, 17 × r) sec. over the second mesh. The execution time of the parallel solver with reutilization of LU factorizations over r times refined mesh will be within (4, 9) sec. over the first and (10, 17) sec. over the second mesh, if there are more processors than refined elements. We present also in Fig. 5 the execution time of the parallel solver over the first mesh with N = 231, 401 unknowns, for increasing number of processors. We observe that the parallel solver execution time is limited by the maximum time required to solve the entire path, which is about 9 second in this case.
4
Conclusions
We proposed a new algorithm for the sequential and parallel solver, that allows for significant reduction of the solver execution time over a sequence of meshes generated by the self-adaptive hp-FEM. The solver reutilized partial LU factorizations computed in previous iterations over unrefined parts of the mesh.
974
M. Paszynski and R. Schaefer
Every local h refinements requires a sequential recomputation of all LU factorization on a path from the refined leaf up to the root of the elimination tree. The maximum number of processors that can be utilized by the parallel solver with reutilization is equal to the number of refined elements. Both, the sequential and parallel solver with reutilization is O Nr faster than the solver without the reutilization, where N is number of elements and r is number of refinements. Acknowledgments. We acknowledge the support of Polish MNiSW grant no. 3TO8B05529 and Foundation for Polish Science under Homming Programme.
References 1. Paszy´ nski, M., Pardo, D., Torres-Verdin, C., Demkowicz, L., Calo, V.: Multi-Level Direct Sub-structuring Multi-frontal Parallel Direct Solver for hp Finite Element Method. ICES Report 07-33 (2007) 2. Paszy´ nski, M., Pardo, D., Torres-Verdin, C., Matuszyk, P.: Efficient Sequential and Parallel Solvers for hp FEM. In: APCOM-EPMSC 2007, Kioto, Japan (2007) 3. Demkowicz, L.: Computing with hp-Adaptive Finite Elements, vol. I. Chapman & Hall/Crc Applied Mathematics & Nonlinear Science, New York (2006) 4. Demkowicz, L., Pardo, D., Paszy´ nski, M., Rachowicz, W., Zduneka, A.: Computing with hp-Adaptive Finite Elements, vol. II. Chapman & Hall/Crc Applied Mathematics & Nonlinear Science, New York (2007) 5. Paszy´ nski, M., Kurtz, J., Demkowicz, L.: Parallel Fully Automatic hp-Adaptive 2D Finite Element Package. Computer Methods in Applied Mechanics and Engineering 195(7-8), 711–741 (2007) 6. Paszy´ nski, M.: Parallelization Strategy for Self-Adaptive PDE Solvers. Fundamenta Informaticae (submitted, 2007) 7. Duff, I.S., Reid, J.K.: The Multifrontal Solution of Indefinite Sparse Symmetric Linear Systems. ACM Trans. on Math. Soft. 9, 302–325 (1983) 8. Giraud, L., Marocco, A., Rioual, J.-C.: Iterative Versus Direct Parallel Substructuring Methods in Semiconductor Device Modelling. Numerical Linear Algebra with Applications 12(1), 33–55 (2005) 9. Scott, J.A.: Parallel Frontal Solvers for Large Sparse Linear Systems. ACM Trans. on Math. Soft. 29(4), 395–417 (2003) 10. Milti-frontal Massively Parallel Sparse Direct Solver (MUMPS), http://graal.ens-lyon.fr/MUMPS/ 11. Pardo, D., Calo, V.M., Torres-Verdin, C., Nam, M.J.: Fourier Series Expansion in a Non-Orthogonal System of Coordinates for Simulation of 3D Borehole Resistivity Measurements; Part I: DC. ICES Report 07-20 (2007)
Linearized Initialization of the Newton Krylov Algorithm for Nonlinear Elliptic Problems Sanjay Kumar Khattri Stord/Haugesund University College, Bjørnsonsgt. 45 Haugesund 5528, Norway [email protected]
Abstract. It is known that the Newton Krylov algorithm may not always converge if the initial assumption or initialization is far from the exact solution. We present a technique for initializing Newton Krylov solver for nonlinear elliptic problems. In this technique, initial guess is generated by solving linearised equation corresponding to the nonlinear equation. Here, nonlinear part is replaced by the equivalent linear part. Effectiveness of the technique is presented through numerical examples.
1
Introduction
The past fifty to sixty years have seen generous improvement in solving linear systems. Krylov subspace methods are the result of the tremendous effort by the researchers during the last century. It is one among the ten best algorithms of the 20th century. There exists optimal linear solvers [16]. But, still there is no optimal nonlinear solver, or the one that we know of. Our research is in the field of optimal solution of nonlinear equations generated by the discretization of the nonlinear elliptic equations [15], [14], [13], [12]. Let us consider the following nonlinear elliptic partial differential equation [15] div(−K grad p) + f (p) = s(x, y) D
p(x, y) = p ˆ g(x, y) = (−K ∇p) · n
in Ω
(1)
on ∂ΩD on ∂ΩN
(2) (3)
Here, Ω is a polyhedral domain in Rd , the source function s(x, y) is assumed to be in L2 (Ω), and the medium property K is uniformly positive. In the equations (2) and (3), ∂ΩD and ∂ΩN represent Dirichlet and Neumann part of the boundary, respectively. f (p) represents nonlinear part of the equation. p is the unknown function. The equations (1), (2) and (3) models a wide variety of processes with practical applications. For example, pattern formation in biology, viscous fluid flow phenomena, chemical reactions, biomolecule electrostatics and crystal growth [9], [5], [6], [7], [8], [10]. There are various methods for discretizing the equations (1), (2) and (3). To mention a few: Finite Volume, Finite Element and Finite Difference methods [12]. These methods convert nonlinear partial differential equations into a system M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 975–982, 2008. c Springer-Verlag Berlin Heidelberg 2008
976
S.K. Khattri
of algebraic equations. We are using the Newton Krylov algorithm for solving the discrete nonlinear system of equations formed by the Finite Volume method [15]. Since, initial guess or initialization is very important for the convergence of the Newton’s algorithm. Thus, for starting the Newton Krylov algorithm, we are solving the corresponding linearised equation, and use this solution as the initial guess for the Newton Krylov algorithm. The corresponding linearized equations to the nonlinear equaion (1) is div(−K grad p)+ f (p) = s. Here, f (p) is the linear representation of the nonlinear part f (p).
2
Newton Krylov Algorithm
For formulating Newton algorithm, equation (1) is discretized in the residual form [15] div(−K grad p) + f (p) − s = 0. Let the discretization of the nonlinear partial differential equations result in a system of nonlinear algebraic equations A(p) = 0. Each cell in the mesh produces a nonlinear algebraic equation [15], [12]. Thus, discretization of the equations (1), (2) and (3) on a mesh with n cells result in n nonlinear equations, and let these equations are given as ⎛ ⎞ A1 (p) ⎜ A2 (p) ⎟ ⎜ ⎟ (4) A(p) = ⎜ . ⎟ . ⎝ .. ⎠ An (p) We are interested in finding the vector p which makes the operator A vanish. The Taylors expansion of nonlinear operator A(p) around some initial guess p0 is A(p) = A(p0 ) + J(p0 ) Δp + hot, (5) where hot stands for higher order terms. That is, terms involving higher than the first power of Δp. Here, difference vector Δp = p − p0 . The Jacobian J is a n × n linear system evaluated at the p0 . The Jacobian J in the equation (5) is given as follows ⎞ ⎛ ∂A1 ∂A1 ∂A1 ··· ⎜ ∂p1 ∂p2 ∂pn ⎟ ⎟ ⎜ ⎜ ∂A2 ∂A2 · · · ∂A2 ⎟ ⎟ ⎜ ∂Ai ∂p1 ∂p2 ∂pn ⎟ =⎜ J= ⎟ ⎜ . . . ∂pj ⎜ .. .. . . . .. ⎟ ⎟ ⎜ ⎝ ∂An ∂An ∂An ⎠ ··· ∂p1 ∂p2 ∂pn Since, we are interested in the zeroth of the non-linear vector function A(p). Thus, setting the equation (5) equals to zero and neglecting higher order terms will result in the following well known Newton Iteration Method
Linearized Initialization of the Newton Krylov Algorithm
J(pk ) Δpk = −A(pk ), pk+1 = pk + Δpk+1 ,
k = 0, . . . , n.
977
(6)
The linear system (6) is solved by the Conjugate Gradient algorithm [16]. The pseudo code is presented in the Algorithm 1. The presented algorithm have been implemented in the C++ language. Three stopping criteria are used in the Algorithm 1. The first criterion is the number of iterations. Second and third criteria are based on the residual vector, A(p) and difference vector Δpk . If the method is convergent, L2 norm of the difference vector, Δp, and the residual vector, A(p), converge to zero [see 11]. We are reporting convergence of both of these vectors. For better understanding the error reducing property of the method, we report variation of A(pk )L2 /A(p0 )L2 and Δ(pk )L2 /Δ(p0 )L2 with iterations (k). Algorithm 1. Newton Krylov algorithm. 1 2 3 4 5 6 7 8 9
Mesh the domain; Form the non-linear system, A(p); Find initial guess p0 ; Set the counter k = 0 ; while k ≤ maxiter or Δpk L2 ≤ tol or A(pk )L2 ≤ tol do Solve the discrete system J(pk )Δpk = −A(pk ); pk+1 = pk + Δpk ; k ++ ; end
Our research work is focus on the initialization step of the above algorithm. Initialization (step three of the Algorithm 1) is a very important part of the Newton Krylov algorithm.
3 3.1
Numerical Work Example 1
Without loss of generality let us assume that K is unity, and the boundary is of Dirichlet type. Let f (p) be γ exp(p). Thus, the equations (1), (2) and (3) are written as −∇2 p + γ exp(p) = f p(x, y) = p
D
in Ω,
(7)
on ∂ΩD .
(8)
Here, γ is a scalar. Let γ be 100. For computing the true error and convergence behavior of the methods, let us further assume that the exact solution of the equations (7) and (8) is the following bubble function
978
S.K. Khattri
0.07 0.06 0.05 0.04 0.03 0.02 0.01 0 1 0.8 0.6 0.4 0.2 0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Fig. 1. Surface plot of the exact solution of example 3.1
p = x (x − 1) y (y − 1). Let our domain be a unit square. Thus, Ω = [0, 1] × [0, 1]. Figure 1 displays the surface plot of the exact solution. We are discretizing equations (7) and (8) on a 40 × 40 mesh by the method of Finite Volumes [11], [12], [13], [15]. Discretization results in a nonlinear algebraic vector (4) with 1600 nonlinear equations. For making initial guess, we are using two approaches. In the first tradtional approach, we make a random initialization. The second approach is based on the linearization of the nonlinear part. Let us now form a linear approximation to the nonlinear part through Taylor series expansion. The Taylor series expansion of the nonlinear part (exponential funciton) is given as ep =
∞
pi i=0
i
,
=1+p+
p3 p2 + + ···. 2 3
From the above expansion, the linear approximation of ep is (1 + p). For forming a corresponding linearized equation to the nonlinear equation (7), we replace, ep by (1 + p). Thus, for finding an initial guess for the Newton algorithm, we are solving the following corresponding linearised equation −∇2 p + γ (1 + p) = f. The Newton iteration for both of these initial guesses are reported in the Fig. 2(a). Figure 2(a) presents the convergence of the residual vector, while Fig. 2(b) presents the convergence of the difference vector for first eight
Linearized Initialization of the Newton Krylov Algorithm
979
0
10
Random Initialization Linearized Initialization −2
10
−4
||A(pk)||L2/||A(p0)||L2
10
−6
10
−8
10
−10
10
−12
10
0
1
2
3
4 Iterations [ k ]
5
6
7
8
(a) Newton iteration vs A(pk )L2 for two different initialization. 0
10
Random Initialization Linearized Initialization −2
10
−4
||Δpk||L2/||Δp0||L2
10
−6
10
−8
10
−10
10
−12
10
0
1
2
3
4 Iterations [ k ]
5
6
7
8
(b) Newton Iteration vs Δ(pk )L2 for different initialization. Fig. 2. Example 3.1
iterations. We are solving the Jacobian system by the ILU preconditioned Conjugate Gradient with a tolerance of 1 × 10−10 . It is clear from the Figs. 2(a) and 2(b) that solving the corresponding linearized equation for the initial guess can make a big difference. With random initialization, the residual after five iterations is about 1/100 of the initial residual. While with linearized initialization, the residual after five iteration is about 1/1012 of the initial residual. It is interesting to note in the Fig. 2(b), with random initialization the Newton Krylov algortithm is not converging in the L2 norm of the difference vector. On the other hand, with a linearized initialization the Newton Krylov algorithm is still reducing the error in difference vector by 1/1012 of the initial error.
980
S.K. Khattri
3.2
Example 2
Let us solve the following equations −∇2 p + ξ sinh(exp(p)) = f p(x, y) = p
in Ω, D
(9)
on ∂ΩD .
(10)
Here, ξ is a scalar. We choose ξ to be 10. Let the exact solution be given as p = cosx + y cos3 x − y + cosx − y sinhx + 3 y + 5 e−(x
2
+y 2 )/8
Let our domain be a unit square. Thus, Ω = [0, 1] × [0, 1]. Figure 3 portrays the surface plot of the exact solution. For forming a corresponding linearized equation. The Taylor series expansion of sinh(exp(p)) around p = 0 is given as 1 1 1 1 + e+ sinhep = e − p 2 2e 2 2e 5e 1 1 + + e p2 + − p3 + . . . . 2 12e 12 The above series expansion is found through the Maple by using the command “taylor(sinh(exp(p)), p = 0, 5)”. From the above expansion, the linear approximation of sinhep is 1 1 1 1 e− e+ + p. 2 2e 2 2e
7 6.5 6 5.5 5 4.5 4 3.5 3 2.5 1 0.8 0.6 0.4 0.2 0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Fig. 3. Surface plot of the exact solution of example 3.2
0.8
0.9
1
Linearized Initialization of the Newton Krylov Algorithm
981
For forming a corresponding linearized equation to the nonlinear equation (9), we replace, sinhep by (1/2 e − 1/2 e) + (1/2 e + 1/2 e) p. Thus, for finding an initial guess for the Newton algorithm, we are solving the following linearised equation 1 1 1 1 e− + e+ −∇2 p + ξ p = f. 2 2e 2 2e
4
Conclusions
Robust initialization of the Newton Krylov algorithm is very crucial for the convergence. Initialization plays very important role in the convergence of the Newton Krylov algorithm. We presented a technique for forming the initial guess. Numerical work shows that initializing the Newton Krylov algorithm through the solution of the corresponding linearized equation is computationally efficient.
Bibliography [1] Khattri, S.K.: Newton-Krylov Algorithm with Adaptive Error Correction For the Poisson-Boltzmann Equation. MATCH Commun. Math. Comput. Chem. 1, 197– 208 (2006) [2] Khattri, S.K., Hellevang, H., Fladmark, G.E., Kvamme, B.: Simulation of longterm fate of CO2 in the sand of Utsira. Journal of Porous Media (to be published) [3] Khattri, S.K.: Grid generation and adaptation by functionals. Computational and Applied Mathematics 26, 1–15 (2007) [4] Khattri, S.K.: Numerical Tools for Multicomponent, Multiphase, Reactive Processes: Flow of CO2 in Porous Media. PhD Thesis, The University of Bergen (2006) [5] Host, M., Kozack, R.E., Saied, F., Subramaniam, S.: Treatment of Electrostatic Effects in Proteins: Multigrid-based Newton Iterative Method for Solution of the Full Nonlinear Poisson-Boltzmann Equation. Proteins: Structure, Function, and Genetics 18, 231–245 (1994) [6] Holst, M., Kozack, R., Saied, F., Subramaniam, S.: Protein electrostatics: Rapid multigrid-based Newton algorithm for solution of the full nonlinear PoissonBoltzmann equation. J. of Bio. Struct. & Dyn. 11, 1437–1445 (1994) [7] Holst, M., Kozack, R., Saied, F., Subramaniam, S.: Multigrid-based Newton iterative method for solving the full Nonlinear Poisson-Boltzmann equation. Biophys. J 66, A130–A130 (1994) [8] Holst, M.: A robust and efficient numerical method for nonlinear protein modeling equations. Technical Report CRPC-94-9, Applied Mathematics and CRPC, California Institute of Technology (1994) [9] Holst, M., Saied, F.: Multigrid solution of the Poisson-Boltzmann equation. J. Comput. Chem. 14, 105–113 (1993) [10] M. Holst: MCLite: An Adaptive Multilevel Finite Element MATLAB Package for Scalar Nonlinear Elliptic Equations in the Plane. UCSD Technical report and guide to the MCLite software package. Available on line at, http://scicomp.ucsd.edu/∼ mholst/pubs/publications.html [11] Khattri, S.: Convergence of an Adaptive Newton Algorithm. Int. Journal of Math. Analysis 1, 279–284 (2007)
982
S.K. Khattri
[12] Khattri, S., Aavatsmark, I.: Numerical convergence on adaptive grids for control volume methods. The Journal of Numerical Methods for Partial Differential Equations 9999 (2007) [13] Khattri, S.: Analyzing Finite Volume for Single Phase Flow in Porous Media. Journal of Porous Media 10, 109–123 (2007) [14] Khattri, S., Fladmark, G.: Which Meshes Are Better Conditioned: Adaptive, Uniform, Locally Refined or Locally Adjusted? In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 102– 105. Springer, Heidelberg (2006) [15] S. Khattri, Nonlinear elliptic problems with the method of finite volumes. Differential Equations and Nonlinear Mechanics. Article ID 31797 (2006) [16] van der Vorst, H.A.: Iterative Krylov Methods for Large Linear Systems. Cambridge monographs on applied and computational mathematics. Cambridge University Press, New York (2003)
Analysis and Comparison of Reordering for Two Factorization Methods (LU and WZ) for Sparse Matrices Beata Bylina and Jaroslaw Bylina Department of Computer Science Institute of Mathematics Marie Curie-Sklodowska University Pl. M. Curie-Sklodowskiej 1, 20-031 Lublin, Poland [email protected], [email protected]
Abstract. The authors of the article make analysis and comparison of reordering for two factorizations of the sparse matrices – the traditional factorization into the matrices L and U as well as the factorization into matrices W and Z. The article compares these two factorizations regarding: the produced quantity of non-zero elements alias their susceptibility to a fill-in; the algorithms reorganizing matrix (for LU it will be the algorithm AMD but for WZ it will be a modification of the Markowitz algorithm); as well as the time of the algorithms. The paper also describes the results of a numerical experiment carried for different sparse matrices from Davis Collection.
1
Introduction
It is a very important issue for the numerical linear algebra to solve different linear systems of equations both when the matrix of coefficients is a dense one (that is including few non-zero elements) or when the matrix is sparse. In this paper we deal with a question of solving linear systems with a sparse matrix of coefficients by a factorization of the matrix. Solving sparse systems demands applying direct or iterative methods. Both kinds of methods have their own merits and flaws. However, in this paper we only handle the direct methods based on Gaussian elimination. As far as the direct methods are concerned, they demand applying the coefficient matrix factorization into factors of two matrices, e.g. into LU, WZ or QR as well as into three factors, e.g. into LDLT . We will assume that A is a square (n×n), nonsingular and sparse matrix of not any particular structure. Usually during a factorization of a sparse matrix, matrices which come into existence have far more non-zero elements comparing to the primary matrix. During the matrix A factorization into a product, one has to do with this fill-in problem – consisting in generating additional non-zero elements (except the ones which were non-zero in the matrix A). The fill-in causes a substantial increase in memory requirements and (what comes with that) a worsening M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 983–992, 2008. c Springer-Verlag Berlin Heidelberg 2008
984
B. Bylina and J. Bylina
of a solver performance. Some problems connected to the fill-in are: a reduction of the very fill-in (by some reordering or approximation) and forecasts of positions of non-zeros (for more efficient storing of the matrices’ elements). The fill-in is the reason for applying algorithms and data structures to reduce it and act in due time. A sparse factorization usually consists of two parts. The first part is a reorganization of the matrix and its analysis where a symbolic factorization is done, pointing in anticipation of the places where non-zero elements appear. The second part is a usual numerical sparse matrix factorization into factors. We can find some examples of this approach – as MUMPS [2] and SuperLU [13]. In [3] we can find analysis and comparison of two solvers mentioned above. In this paper we focus on the first part of the algorithm – that is the reordering. Reducing of non-zero elements quantity demands applying different permutations of rows and columns (known as reordering). The number of all possible permutations is n! (for an n × n matrix) and finding, which of them is the best one, belongs to the class of NP-complete problems. For structured matrices (like symmetric ones) we can use the Minimum Degree Algorithm [16] or the Nested Dissection [15]. Of course, we do not always know the structure of the matrix so there are heuristic algorithms which reorganize the matrix. Some of them include the Markowitz scheme [14] and the Markowitz scheme with threshold pivoting (for stability) [10]. In papers [4] and [5] some other modifications of the Markowitz scheme are considered. The article considers reordering for the LU and WZ factorizations [9,10,16,18] for a sparse square (n × n) matrix of not any particular structure. The article describes and examines a matrix transformation leading to a reduction of non-zero elements in the output matrices L, U by applying the AMD (Approximate Minimum Degree) algorithm [1] as well as in the output matrices W, Z by applying the modified Markowitz algorithm (for the WZ factorization) given by the authors. The aim of the paper is to compare the algorithms in their effectiveness of the fill-in reduction. The performance time of the modified Markowitz algorithm is also considered. The reasons for choosing AMD is its popularity, accessibility and wide application. The rest of the paper is organized as follows. Section 2 presents the WZ factorization. Section 3 presents the modifications of the Markowitz scheme for the WZ factorization which ensures the growth of the matrices W and Z sparsity and also a factorization stability. Section 4 describes an environment used to numerical experiments conducted for plenty of matrices from Davis Collection and we also present the results of the examination. We will make an analysis, how many non-zero elements we will find in the matrices L + U and W + Z, and also how the AMD algorithm and the modified Markowitz algorithm influence the number of non-zero elements as well as the time of algorithms performance. In this article we mark the well-known numerical algorithm of the LU factorization simply by LU. The numerical algorithm LU with reordering [1] we mark by AMD – in the same way as it is marked in the literature.
Analysis and Comparison of Reordering for Two Factorization Methods
2
985
WZ Factorization
The WZ factorization was proposed by Evans and Hatzopoulos [12] as the factorization compatible to SIMD computers. SIMD according to Flynn classification means a Single Instruction stream and a Multiple Data stream, so the SIMD architecture is characterized by multiplexing of processing units. The papers [6,7,11,17] develop and examine the modifications of the WZ factorization method and consider its parallel implementations. Let A be a nonsingular matrix. The WZ factorization causes a division of the matrix A into W and Z factors (so that A = WZ) assuming forms which can be described like follows: (for an even n): ⎤ ⎡ 1 0 ⎢ w21 1 0 w2n ⎥ ⎥ ⎢ ⎢ ··· ··· ··· ··· ··· ··· ⎥ ⎥ ⎢ ⎢ ··· ··· ··· ··· ··· ··· ··· ··· ⎥ ⎥ ⎢ ⎢ ··· ··· ··· ··· 1 0 ··· ··· ··· ··· ⎥ ⎥ (1) W=⎢ ⎢ ··· ··· ··· ··· 0 1 ··· ··· ··· ··· ⎥, ⎥ ⎢ ⎢ ··· ··· ··· ··· ··· ··· ··· ··· ⎥ ⎥ ⎢ ⎢ ··· ··· ··· ··· ··· ··· ⎥ ⎥ ⎢ ⎣ wn−1,1 0 1 wn−1,n ⎦ 0 1 ⎡ ⎤ z11 · · · · · · · · · · · · · · · · · · · · · · · · z1,n ⎢ ⎥ z22 · · · · · · · · · · · · · · · · · · z2,n ⎢ ⎥ ⎢ ⎥ ··· ··· ··· ··· ··· ··· ⎢ ⎥ ⎢ ⎥ ··· ··· ··· ··· ⎢ ⎥ ⎢ ⎥ zpp zpq ⎥, Z=⎢ (2) ⎢ ⎥ zqp zqq ⎢ ⎥ ⎢ ⎥ ··· ··· ··· ··· ⎢ ⎥ ⎢ ⎥ ··· ··· ··· ··· ··· ··· ⎢ ⎥ ⎣ ⎦ zn−1,2 · · · · · · · · · · · · · · · · · · zn−1,n zn1 · · · · · · · · · · · · · · · · · · · · · · · · zn,n where m = (n − 1)/2,
p = (n + 1)/2,
An example for an odd n (n = 5): ⎡ ⎤ 1 0 0 0 0 ⎢ w21 1 0 0 w25 ⎥ ⎢ ⎥ ⎥ W=⎢ ⎢ w31 w32 1 w34 w35 ⎥ , ⎣ w41 0 0 1 w45 ⎦ 0 0 0 0 1
⎡
z11 ⎢ 0 ⎢ Z=⎢ ⎢ 0 ⎣ 0 z51
q = (n + 1)/2.
(3)
⎤ z15 0 ⎥ ⎥ 0 ⎥ ⎥. 0 ⎦ z55
(4)
z12 z22 0 z42 z52
z13 z23 z33 z43 z53
z14 z24 0 z44 z54
See also Fig. 1 and Fig. 2. The numerical algorithm of the WZ factorization in this article is marked simply by WZ.
986
B. Bylina and J. Bylina
Fig. 1. The form of the output matrices in the WZ factorization (left: W; right: Z)
Fig. 2. The kth step of the WZ factorization (actually, of the transformation of the matrix A into Z); here k2 = n − k + 1
3
Modification of Markowitz Scheme for WZ Factorization
The original Markowitz scheme was first presented in [14]. It consists in a special way of the pivoting – not regarding to the value of pivot element but to the quantity of non-zero elements in rows and columns left to process. The row having the fewest non-zeros is chosen to be swapped with the current row and similarly columns are swapped. Thus, the number of newly generated non-zeros
Analysis and Comparison of Reordering for Two Factorization Methods
987
(that is the amount of the fill-in) can be reduced significantly. Unfortunately, such an algorithm can lead to a zero pivot and hence make the factorization fail. There are modifications of the Markowitz scheme which ensure success of the factorization (as in [4,5,10]). Here we show a modified Markowitz scheme version for the WZ factorization. Let A(k) be the matrix obtained from the kth step of the WZ factorization with (k) the size (n − 2k + 2) × (n − 2k + 2) (as in Fig. 2), let ri be the number of (k) non-zero values in the ith row of the matrix A . We choose i1 = arg
min
(k)
i∈{k,...,k2 }
ri
(5)
and i2 = arg
(k)
min
i∈{k,...,k2 }\{i1 }
ri .
(6)
Then we swap the kth row with the i1 st row and the k2 nd row with the i2 nd row. (We consider only rows, because in the WZ factorization there would be much more comparisons if we considered also columns because of two pivot rows [instead of only one in LU] and two pivot columns [instead of only one in LU]). Of course, such swapping can lead to the situation where the determinant (k) (k)
(k)
(k)
d = akk ak2 k2 − ak2 k akk2
(7)
(which is the pivot by which we divide in the WZ factorization) will be zero – then the continuation of the factorization will not be possible. That is why we must additionally choose i1 and i2 in the way the determinant d will not equal zero (what is not shown in the above paragraph). It means that in the modified Markowitz scheme (as in the original one) during each turn of completely external loop there is a need to make many comparisons to choose two rows including the smallest number of non-zero elements. The algorithm, which consists of the WZ factorization with our modification of the Markowitz algorithm, we mark as MWZ.
4
Numerical Experiment
Here we try to compare the performance of some algorithms and study the reordering influence on the number of non-zero elements. The algorithms’ implementation was done using C language. Data structures to store the matrices A, W, Z, L, U were two-dimensional arrays located in RAM. The numerical experiment was done using a Pentium IV 2.80 GHz computer with 1 GB RAM. The algorithms were tested in a GNU/Linux environment and the compilation was done using the compiler gcc with an optimization option -O3. Tests were done for matrices from Davis Collection [8]. The tests were done for a set of 40 sparse matrices from different applications. We have not managed to do the WZ factorization for 11 matrices – they were singular. For 14 matrices we needed the WZ factorization with the modified
988
B. Bylina and J. Bylina Table 1. Test matrices chosen from Davis Collection # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
matrix name lfat5 e bcsstko1 nos4 olm100 rdb2001 orsirr 1 comsol rdb2048 ex29 rdb3200 rdb5000 uym5940 raefsky5 fp pd
matrix size 14 48 100 100 2001 1030 1500 2048 2870 3200 5000 5940 6316 7548 8081
number of non-zeros 30 224 347 396 1120 6858 97645 12032 23754 18880 2960 85842 167178 884222 13036
is it symmetric? no yes yes no no no no no no no no no no no no
Table 2. The comparison of non-zero elements quantity for the algorithms LU, WZ, MWZ, AMD # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
LU 44 272 447 639 7674 145528 1176657 258298 217840 505914 990394 2991163 212829 39967153 23526
WZ 44 272 447 639 7368 207125 1101656 254516 131951 274908 980888 2673569 226487 53861147 23818
MWZ 44 272 447 545 4730 86392 934350 114862 120198 216135 409015 1045803 227613 20092097 23088
AMD 52 930 1164 494 3730 50374 213582 82234 127970 150256 82234 656730 226632 2875348 20599
Markowitz scheme (as a kind of pivoting) what enabled the numerical WZ factorization (with no pivoting such factorizations were impossible). Table 1 includes the set of the matrices where the WZ and MWZ algorithms were successfully applied. Table 2 includes information how many non-zero elements (nz) were created while doing the algorithms WZ, LU, AMD and MWZ. By using data from Davis Collection [8] we placed the number of elements for the matrices created by the algorithm AMD; the results for LU, WZ and MWZ are from the authors’ tests.
Analysis and Comparison of Reordering for Two Factorization Methods
989
Table 3. The comparison of the performance times for the algorithms LU, WZ, MWZ (times given in seconds) # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
LU 0.01 0.01 0.01 0.01 0.01 2.58 7.73 19.14 51.96 71.19 246.53 411.44 506.38 901.20 1135.50
WZ 0.02 0.03 0.05 0.06 0.10 1.44 4.30 10.41 28.95 37.53 143.66 237.18 286.07 503.96 591.44
MWZ 0.04 0.07 0.09 0.13 0.20 1.63 7.07 10.59 29.67 38.19 146.96 248.03 280.85 854.40 599.64
Table 3 presents time during which the algorithms WZ, LU and MWZ were being done. The quantities of non-zero elements and the performance times for chosen four matrices are also presented in Fig. 3 and Fig. 4. (They are scaled for every matrix to show the relative changes of the number of non-zeros and the performance time.) By comparing the algorithms LU and WZ we can notice that the number of non-zero elements generated by these two factorizations is approximately similar. It is possible to find matrices for which the WZ factorization generates fewer nonzero elements than the LU factorization, for example the matrix ex29. But we can find the matrices for which the LU factorization generates fewer non-zero elements, e.g. the matrix fp. For the tested matrices the algorithm WZ generates on the average 2% fewer non-zero elements than the algorithm LU. Applying the Markowitz scheme before the further WZ factorization caused a considerable decline of created non-zero elements number. Applying the Markowitz algorithm for the WZ factorization causes an increase of non-zero elements number for the only one matrix among all the tested matrices. For the rest ones, MWZ causes a decrease of non-zero elements number of average 25% comparing to the WZ algorithm. Applying the AMD algorithm for the tested matrices considerably reduced the quantity of non-zero elements of average 36%. We managed to find such matrices for which the WZ factorization as well as the MWZ factorization produce fewer non-zero elements than the AMD algorithm, e.g. the matrix ex29.
990
B. Bylina and J. Bylina
Fig. 3. Relative numbers of non-zeros in the four algorithms for four sample matrices
Fig. 4. Relative performance times of the three algorithms for four sample matrices
In the Markowitz scheme comparing to the algorithm which does not use any permutation, time for the tested matrices grows 17% on the average.. It is worth noticing that the time for LU is 50% longer than for WZ.
5
Conclusions
In this paper we have presented a detailed analysis and comparison of two reordering schemes. The first, called AMD, is used for the LU factorization; the second – MWZ – proposed by the authors, is used for the WZ factorization. The algorithms’ functioning was presented with some sparse matrices taken from concrete engineering applications. Our analysis is based on experiments with the use of a usual PC. The analysis addresses two aspects of the efficiency of the factorization: the role of the reordering step and the time needed for the factorization. We can summarize
Analysis and Comparison of Reordering for Two Factorization Methods
991
our observations as follows: there exist matrices for which MWZ (proposed by the authors) is worth using instead of AMD. Moreover, it appeared that the time of the WZ algorithm was on the average 50% shorter comparing to the LU algorithm. It results from the fact that loops in the WZ factorization are two times shorter what enables better use of modern processors architecture: threading (possibility to use parallel calculations) and the organization of the processor access to the memory (particularly an optimal use of the multilevel cache memory). Our future works would research problems of the influence of reordering on the results’ numerical accuracy. The other future issue is to name properties of the matrices for which using MWZ is better then using AMD. Acknowledgments. This work was partially supported within the project Metody i modele dla kontroli zatloczenia i oceny efektywno´sci mechanizm´ ow jako´sci uslug w Internecie nastepnej generacji (N517 025 31/2997). This work was also partially supported by Marie Curie-Sklodowska University in Lublin within the project R´ ownolegle algorytmy generacji i rozwiazywania mechanizm´ ow kontroli przecia˙zenia w protokole TCP modelowanych przy u˙zyciu la´ ncuch´ ow Markowa.
References 1. Amestoy, P., Davis, T.A., Duff, I.S.: Algorithm 837: AMD, An approximate minimum degree ordering algorithm. ACM Trans. Math. Soft. 23, 1129–1139 (1997) 2. Amestoy, P.R., Duff, I.S., L’Excellent, J.-I., Koster, J.: A full asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J. Matr. Anal. Apl. 23(1), 15–41 (2001) 3. Amestoy, P.R., Duff, I.S., L’Excellent, J.-I., Li, X.S.: Analysis and Comparison of Two General Sparse Solvers for Distributed Memory Computers. ACM Trans. Math. Soft. 27(4), 388–421 (2001) 4. Amestoy, P., Li, X.S., Ng, E.G.: Diagonal Markowitz Scheme with Local Symmetrization. Report LBNL-53854 (2003); SIAM. J. Matr. Anal. Appl. 29, 228 (2007) 5. Amestoy, P., Pralet, S.: Unsymmetric Ordering Using a Constrained Markowitz Scheme. SIAM J. Matr. Anal. Appl.; Report LBNL-56861 (submitted, 2005) 6. Bylina, B., Bylina, J.: The Vectorized and Parallelized Solving of Markovian Models for Optical Networks. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3037, pp. 578–581. Springer, Heidelberg (2004) 7. Chandra Sekhara Rao, S.: Existence and uniqueness of WZ factorization. Parall. Comp. 23, 1129–1139 (1997) 8. Davis, T.: University of Florida Sparse Matrix Collection. NA Digest 92(42) (1994), NA Digest 96(28) (1996), and NA Digest 97(23) (1997), http://www.cise.ufl.edu/research/sparse/matrices 9. Duff, I.S.: Combining direct and iterative methods for the solution of large systems in different application areas. Technical Report RAL-TR-2004-033 (2004) 10. Duff, I.S., Erisman, A.M., Reid, J.: Direct Methods for Sparse Matrices. Oxford University Press, New York (1986)
992
B. Bylina and J. Bylina
11. Evans, D.J., Barulli, M.: BSP linear solver for dense matrices. Parall. Comp. 24, 777–795 (1998) 12. Evans, D.J., Hatzopoulos, M.: The parallel solution of linear system. Int. J. Comp. Math. 7, 227–238 (1979) 13. Li, X.S., Demmel, J.W.: A scalable sparse direct solver using static pivoting. In: Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing (1999) 14. Markowitz, H.M.: The elimination form of the inverse and its application to linear programming. Management Science 3, 255–269 (1957) 15. Reid, J., Duff, I.S., Erisman, A.M.: On George’s nested dissection method. SIAM J. Numer. Anal. 13, 686 (1976) 16. Tinney, W.F., Walker, J.W.: Direct solution of sparse network equations by optimally ordered triangular factorization. Proc. IEEE 55, 1801–1809 (1967) 17. Yalamov, P., Evans, D.J.: The WZ matrix factorization method. Parall. Comp. 21, 1111–1120 (1995) 18. Zlatev, Z.: On some pivotal strategies in Gaussian elimination by sparse technique. SIAM J. Numer. Anal. 17, 18–30 (1980)
KCK-Means: A Clustering Method Based on Kernel Canonical Correlation Analysis Chuan-Liang Chen1, Yun-Chao Gong2, and Ying-Jie Tian3,∗ 1
Department of Computer Science, Beijing Normal University, Beijing 100875, China 2 Software Institute, Nanjing University, Nanjing, China 3 Research Centre on Fictitious Economy & Data Science, Chinese Academy of Sciences, 100080, Beijing, China [email protected], [email protected], [email protected]
Abstract. Kernel Canonical Correlation Analysis (KCCA) is a technique that can extract common features from a pair of multivariate data, which may assist in mining the ground truth hidden in the data. In this paper, a novel partitioning clustering method called KCK-means is proposed based on KCCA. We also show that KCK-means can not only be run on two-view data sets, but also it performs excellently on single-view data sets. KCK-means can deal with both binary-class and multi-class clustering tasks very well. Experiments with three evaluation metrics are also presented, the results of which reflect the promising performance of KCK-means. Keywords: Kernel Canonical Correlation Analysis, K-means clustering, Similarity Measure, Clustering Algorithm.
1 Introduction Clustering is one of the most commonly techniques which is widely applied to extract knowledge, especially when lacking any a priori information (e.g., statistical models) about the data. Generally, the problem of clustering deals with partitioning a data set consisting of n points embedded in m-dimensional space into k distinct set of clusters, such that the data points within the same cluster are more similar to each other than to data points in other clusters [3]. There are two main approaches of clustering algorithms, hierarchical (e.g., agglomerative methods) and partitional approaches (e.g., k-means, k-medoids, and EM). Most of these clustering algorithms are based on elementary distance properties of the instance space [4]. In some interesting application domains, instances are represented by attributes that can naturally be split into two subsets, either of which suffices for learning [5], such as web pages which can be classified based on their content as well as based on the anchor texts of inbound hyperlinks. Intuitively, there may be some projections in these two views which should have strong correlation with the ground truth. Kernel ∗
Corresponding author.
M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 995–1004, 2008. © Springer-Verlag Berlin Heidelberg 2008
996
C.-L. Chen, Y.-C. Gong, and Y.-J. Tian
Canonical Correlation Analysis (KCCA) is such a technique that can extract common features from a pair of multivariate data, which can be used as a statistical tool to identify the correlated projections between two views. Therefore, KCCA is expected to be used to measure the similarity between data points excellently. In this paper, we propose two algorithms based on KCCA which can improve the performances of traditional clustering algorithms—K-means, namely KCK-means for two-view data sets and single-view data sets that could not be split naturally. The results of experiments show that their performances are much better than those of the original algorithms. Our empirical study shows that these two algorithms can not only perform excellently on both two-view and single-view data, but also be able to extract better quality clusters than traditional algorithms. The remainder of this paper is organized as follows. We demonstrate KCCA and propose the algorithms in Sect. 2. Performance measures, experiment results and their analysis are presented in Sect. 3. Finally, Sect. 4 presents the main conclusions.
2 KCK-Means Method 2.1 Canonical Correlation Analysis Firstly, we briefly review Canonical Correlation Analysis (CCA), then its kernel extension—Kernel Canonical Correlation Analysis (KCCA). CCA is computationally an eigenvector problem. It attempts to find two sets of basis vectors, one for each view, such that the correlation between the projections of these two views into the basis vectors are maximized. Let X = {x1, x2, … , xl} and Y = {y1, y2, … , yl} denote two views, i.e. two attribute sets describing the data. CCA finds projection vectors wx and wy such that the correlation coefficient between wTx X and
wTy Y is maximized. That is [12], ⎛
⎞ ⎟ , ⎜ wT C w wT C w ⎟ x xx x y yy y ⎠ ⎝ wTx Cxy wy
ρ = arg max ⎜ wx , wy
(1)
⎧⎪ w Cxx wx = 1 , w.r.t ⎨ ⎪⎩ w C yy wy = 1 where Cxy is the between-sets covariance matrix of X and Y, Cxx and Cyy are respectively the within-sets covariance matrices of X and Y. The maximum canonical correlation is the maximum of ρ with respect to wx and wy. Assume that C yy is invertible, then T x T y
wy =
1
λ
C yy−1C yx wx ,
(2)
and C xy C yy−1C yx wx = λ 2 Cxx wx .
(3)
KCK-Means: A Clustering Method Based on Kernel Canonical Correlation Analysis
997
By first solving for the generalized eigenvectors of Eq. 3, we can therefore obtain the sequence of wx ’s and then find the corresponding wy ’s using Eq. 2. However, in complex situations, CCA may not extract useful descriptors of the data because of its linearity. In order to identify nonlinearly correlated projections between the two views, kernel extensions of CCA (KCCA) can be used [12]. Kernel CCA offers an alternative solution by first projecting the data into a higher dimensional feature space, i.e. mapping xi and yi to φ ( xi ) and φ ( yi ) respectively (i = 1, 2, … , l). And then
φ ( xi ) and φ ( yi ) are treated as instances to run CCA routine. Let Sx = { (φ ( x1 ), φ ( x2 ),..., φ ( xl )) }and Sy = { (φ ( y1 ), φ ( y2 ),..., φ ( yl )) }. Then the directions wx and wy can be rewritten as the projection of the data onto the direction α and β ( α , β ∈ ℜl ): wx = S xα and wy = S y β . Let Kx = S xT S x and Ky= S Ty S y be the kernel matrices corresponding to the two views. Substituting into Eq. 1 we can obtain the new objective function
ρ = max α ,β
α T Kx K y β α T K x2α ⋅ β T K y2 β
.
(4)
α can be solved from ( K x + κ I ) −1 K y ( K y + κ I )−1 K xα = λ 2α ,
(5)
where κ is used for regularization. Then β can be obtained from
β=
1 ( K y + κ I ) −1 K xα . λ
(6)
Let Κx(xi, xj) = φx ( xi )φxT ( x j ) and Κy(yi, yj) = φ y ( yi )φ yT ( y j ) are the kernel functions of the two views. Then for any for any x* and y*, their projections can be obtained from P(x*)= Κx(xi, X) α and P(y*)= Κy(yi, Y) β respectively. A number of α and β (and corresponding λ) can be solved from Eq. 5 and Eq. 6. If the two views are conditionally independent given the class label, the most strongly correlated pair of projections should be in accordance with the ground-truth [9]. However, in real-world applications the conditional independence rarely holds, and therefore, information conveyed by the other pairs of correlated projections should not be omitted [9]. So far we have considered the kernel matrices as invertible, although in practice this may not be the case [20]. We use Partial Gram-Schmidt Orthogonolisation (PGSO) to approximate the kernel matrices such that we are able to re-represent the correlation with reduced dimensionality [12]. In PGSO algorithm, there is a precision parameter—η, which is used as a stopping criterion. For low-rank approximations, we need keep eigenvalues greater than η and the number of eigenvalues we need to consider is bounded by a constant that depends solely on the input distribution [20]. Since the dimensions of the projections rely on the N×M lower triangular matrix
998
C.-L. Chen, Y.-C. Gong, and Y.-J. Tian
output by PGSO which relies on this stopping criterion, we discuss the influence of η to our algorithm in Sect. 3. More detail about PGSO is described in [20]. 2.2 Two KCK-Means Algorithms
In our method, the similarity between data points is measured partly by the projections obtained by KCCA and extends the K-means algorithm. In [7], Balcan et al. showed that given appropriately strong PAC-learners on each view, an assumption of expansion on the underlying data distribution is sufficient for co-training to succeed, which implies that the stronger assumption of independence between the two views is not necessary, and the existence of sufficient views is sufficient. Similarly, the distance function fsim described below is also calculated based on the assumption that X and Y are sufficient to describe the data respectively, which is the same as the assumption of expansion about the co-training method. Actually, our method is intuitively derived from co-training [10]. Since the two views are sufficient to describe the data, both of them may be consist of some projections correlate with the ground truth. So we intend to measure the similarity between instances using information from two views of data. KCCA is an excellent tool that can carry out this task. Therefore, measuring by the use of KCCA may be a promising way of solving the problem of traditional distance measures. Let m denote the number of pairs of correlated projections that have been identified, then x* and y* can be projected into Pj(x*) and Pj(y*) (j = 1, 2, … ,m). Let fsim denote distance functions, which is L2-norm • in this paper. Of course, other similarity distance functions also could be. Based on the projections obtained by KCCA, a new similarity measure can be defined as follows, 2
f sim ( xi , x j ) = μ xi − x j
2
m
+ ∑ Pk ( xi ) − Pk ( x j )
2
,
(7)
k =1
where μ is a parameter which regulates the proportion of the distance between the original instances and the distance of their projections. Based on this similarity measure, we propose the first algorithm as follows. Input: Output:
X and Y, two views of a data set with n instances k, the number of clusters desired C1 and C2, two vectors containing the cluster indices of each point of X and Y.
Process: 1. Identify all pairs of correlated projections, obtaining α i , β i by solving Eqs. 5 and 6 on X and Y. 2. for i = 1, 2, …, l do Project xi and yi into m pairs projections and obtain P(xi) and P(yi). 3. Get the new data sets by unite X and P(X), Y and P(Y), i.e. Mx = X P(X), My = Y P(Y). Fig. 1. KCK-means Algorithm for two-view data sets
KCK-Means: A Clustering Method Based on Kernel Canonical Correlation Analysis
999
Cluster Mx and My respectively as follows: 4. Randomly assign each instance of Mx (My) to one cluster of the k clusters. 5. Calculate the cluster means, i.e., calculate the mean value (both the original value and the projections’ value) of the instance of each cluster. 6. repeat 7. (re)assign each instances to the cluster to which the instance is the most similar by calculating Eq. 7. update the cluster means. 8. 9. until no change. Fig. 1. (continued)
However, two-view data sets are rare in real world, which is the cause that though co-training is a powerful paradigm, it is not widely applicable. In [6], it points out that if there is sufficient redundancy among the features, we are able to identify a fairly reasonable division of them, and then co-training algorithms may show similar advantages to those when they perform on the two-view data sets. Similarly, in this paper, we try to randomly split the single-view data set into two parts and treat them as the two views of the original data set to perform KCCA and then KCK-means. Input:
X , a single-view data set with n instances k, the number of clusters desired C, a vector containing the cluster indices of each point of X.
Output: Process: 1. Randomly spilt X into two views with the same attributes, X1 and X2. 2. Identify all pairs of correlated projections, obtaining D i , E i by solving Eqs. 5 and 6 on X1 and X2. 3. for i = 1, 2, …, l do Project x1, i and x2, i into m pairs projections and obtain P(x1, i) and P(x2, i). 4. Unite P(X1) and P(X2) into P(X), i.e. P(X) = P(X1)ĤP(X2). 5. Get the new data sets by unite X and P(X), i.e. Mx = XĤP(X). Cluster Mx: 6. Randomly assign each instance of Mx to one cluster of the k clusters. 7. Calculate the cluster means, i.e., calculate the mean value (both the original value and the projections’ value) of the instance of each cluster. 8. repeat 9. (re)assign each instances to the cluster to which the instance is the most similar by calculating Eq. 7. 10. update the cluster means. 11. until no change. Fig. 2. The KCK-means Algorithm for single-view data sets
3 Experiments and Analysis Two standard multi-view data sets are applied to evaluate the effectiveness of the first version of KCK-means. They are
1000
C.-L. Chen, Y.-C. Gong, and Y.-J. Tian
Course: The course data set has two views and contains 1,051 examples, each corresponding to a web page, which is described in [10]. 200 examples are used in this paper and there are 44 positive examples. Ads: The url and origurl data sets are derived from the ads data set which is described in [16] and has two categories. 300 examples are used in this paper, among which 42 examples are positive. In this paper, we construct a two-view dataset by using the url view and origurl view. In order to find out how well the second version of KCK-means performs on single-view data sets, we use three single-view data sets . F1
A3a: The a3a is a single-view data set derived from Adult Data Set of UCI, which is described in [11]. It has two categories and 122 features. 3,185 examples are used and there are 773 positive examples. W1a: The w1a is a single-view data set derived from web page dataset which is described in [9]. It has two categories and 300 sparse binary keyword attributes. 2,477 examples are used, among which 72 examples are positive. DNA: The DNA is a single-view data set which is described in [8]. It has three categories and 180 attributes. 2,000 examples are used, among which 464 examples are 1st class, 485 examples are 2nd class, and 1,051 examples are 3rd class. We use three performance measures, Pair-Precision, Intuitive-Precision and Mutual Information, to measure the quality of the clusters obtained by the KCK-means. Pair-Precision: The evaluation metric in [2] is used in our experiments. We evaluate a partition i.e. the correct partition using accuracy =
num(correct decisions ) . n(n − 1) / 2
Mutual Information: Though entropy and purity are suitable for measuring a single cluster’s quality, they are both biased to favor smaller clusters. Instead, we use a symmetric measure called Mutual Information to evaluate the overall performance. The Mutual Information is a measure of the additional information known about one when given another [1], that is MI ( A, B) = H ( A) + H ( B ) − H ( A, B) , where H(A) is the entropy of A and can be calculated by using n
H ( A) = −∑ p( xi ) log 2 ( p( xi )) . i =1
Intuitive-Precision: We choose the class label that share with most samples in a cluster as the class label. Then, the precision for each cluster A is defined as: P( A) =
1
1 max( {xi | label ( xi ) = C j } ) . A
On http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets, all these single-view data sets can be downloaded. H
H
KCK-Means: A Clustering Method Based on Kernel Canonical Correlation Analysis
1001
In order to avoid the possible bias from small clusters which have very high precision, the final precision is defined by the weighted sum of the precision for all clusters, as shown in the following equation G
Ak
k =1
N
P=∑
P ( Ak ) ,
where G is the number of categories (classes) and N is the total number of instances.
Fig. 3. Clustering results on two two-view data sets (course and ads, on the left column) and three single-view data sets (a3a, w1a and DNA, on the right column) using KCK-means comparing with two traditional clustering algorithms, K-means and Agglom (agglomerative hierarchical clustering) with three performance measures, P-Precision (Pair-Precision), IPrecision (Intuitive-Precision), and MI (Mutual Information)
The comparison among between KCK-means and K-means, agglomerative hierarchical clustering, are performed. In order to better reflect the performance of the three algorithms, for all experiments demonstrated below with the two partitioning algorithms, K-means and KCK-means, the diagrams are based on averaging over ten
1002
C.-L. Chen, Y.-C. Gong, and Y.-J. Tian
100%
100% 95%
Kmeans(View1) Kmeans(View2) Agglom(View1) Agglom(View2) KCK-means(View1) KCK-means(View2)
P-Precision
80%
90% 85%
P-Precision
90%
70%
Kmeans Agglom KCK-means
80% 75%
60%
70%
50%
65%
40%
100%
100%
95%
95%
90%
90%
85%
85%
1
9
8
0.
7
5 0.
0.
4 0.
6
3 0.
0.
2 0.
η
0.
1 0.
1
0. 9
0. 8
0. 7
0. 5
0. 6
0. 4
0. 3
0. 2
0. 1
60%
η
80% Kmeans(View1) Kmeans(View2) Agglom(View1) Agglom(View2) KCK-means(View1) KCK-means(View2)
75% 70% 65% 60%
I-Precision
I-Precision
Kmeans Agglom KCK-means
80% 75% 70% 65%
1
0. 9
0. 8
9 0.
0. 6
8 0.
η
0. 7
7 0.
0.9
0. 5
6 0.
0.5
0. 4
5 0.
1.0
0. 2
4 0.
0.6
0. 3
3 0.
0. 1
2 0.
1
1 0.
60%
η
0.8 Kmeans
Mutual Information
Mutual Information
0.4
0.7
Kmeans(View1) Kmeans(View2) Agglom(View1) Agglom(View2) KCK-means(View1) KCK-means(View2)
0.3 0.2 0.1
Agglom KCK-means
0.6 0.5 0.4
7
8
9
0.
0.
η
1
6
0.
5
1
0.
9 0.
0.
8 0.
3
7 0.
4
.6 η0
0.
5 0.
2
4 0.
0.
3 0.
1
2 0.
0.
0.3 1 0.
0.
0
Fig. 4. The influence of η on the performance of KCK-means on the two-view data set course and the single-view data set DNA, where η changes from 0.1 to 1.0, all of the three evaluation metrics, Pair-Precision, Intutitive-Precision and Mutual Information, are used
clustering runs to compensate for their randomized initialization. And that is also beneficial for measuring the performance of the second version of KCK-means on the single-view data sets for its randomly splitting these data sets. The performances of the three algorithms are showed in Fig. 3. In Fig. 3, the performances of KCK-means are much better than those of other two traditional clustering algorithms. On some data sets such as a3a, the Pair-Precision and Intuitive-Precision of the results of KCK-means are both almost 100%, but PairPrecision and Intuitive-Precision of the results of K-means and agglomerative hierarchical clustering are 59.74%, 75.73% and 58.87%, 75.73% respectively. KCKmeans also performs excellently on the multi-class data set—DNA and gets 85.03% Pair-Precision, for K-means and agglomerative hierarchical clustering 72.39% and 67.13% respectively. For other two evaluation metrics, KCK-means is also much better than those of the others’.
In our experiments, we also note that when the proportion parameter μ is set to be very small or even zero, the performance of KCK-means is the best, which means that with the projections obtained from KCCA the similarity between instances can already be measured well enough. In all experiments described in this paper, μ is set to 10^{-6}. In Sect. 2.1 we stated that there is a precision parameter (or stopping criterion) η in the PGSO algorithm, on which the dimension of the projections relies. Now we demonstrate its influence on the performance of KCK-means. In order to better measure this influence, we use two data sets, course and DNA, in the experiments described below. Because course is a two-view data set with two classes and DNA is a single-view data set with three classes, we can examine the behaviour of KCK-means on a two-view data set and a single-view data set simultaneously. The results, averaged over more than ten clustering runs, are shown in Fig. 4. In Fig. 4 we can see that, as η changes, the performance of KCK-means changes only a little. Furthermore, even considering this influence, the performance of KCK-means on both data sets is still much better than that of the other two clustering algorithms. However, in the experiments we find that when η is larger than some threshold, which depends on the given data set, the performance of KCK-means degrades considerably, becoming even worse than that of K-means and agglomerative hierarchical clustering. After careful observation, we find that in such situations the number of dimensions of the projections is always very small, sometimes even only one. Just as described in Sect. 2.1, in real-world applications the conditional independence rarely holds, and therefore the information conveyed by the other pairs of correlated projections should not be omitted [9]. Therefore, this performance degradation may be caused by the lack of information conveyed by the other projections.
4 Conclusion In this paper, we propose a novel partitioning method, KCK-means, based on KCCA and inspired by co-training. By using KCCA, which mines the ground truth hidden in the data, KCK-means measures the similarity between instances. Experiments are performed on two two-view data sets, course and ads, and three single-view data sets, a3a, w1a and DNA, using three performance measures: Pair-Precision, Intuitive-Precision and Mutual Information. The results show that KCK-means obtains clusters of much better quality than K-means and the agglomerative hierarchical clustering algorithm. However, we also observe that when the number of dimensions of the projections obtained from KCCA is very small, the performance of KCK-means degrades considerably, becoming even worse than that of the two traditional clustering algorithms. This shows that in real-world applications we need to consider the information conveyed by the other pairs of correlated projections obtained from KCCA, instead of only the strongest projection or very few strong projections. That is, the number of dimensions of the projections obtained from KCCA and then used in KCK-means must be large enough.
Acknowledgments. The research work described in this paper was supported by grants from the National Natural Science Foundation of China (Project No. 10601064, 70531040, 70621001).
References 1. Butte, A.J., Kohane, I.S.: Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In: Pacific Symposium on Biocomputing, Hawaii, pp. 415–426 (2000) 2. Wagstaff, K., Claire, C.: Clustering with Instance-level Constraints. In: the 17th International Conference on Machine Learning, pp. 1103–1110. Morgan Kaufmann press, Stanford (2000) 3. Khan, S.S., Ahmadb, A.: Cluster center initialization algorithm for K-means clustering. Pattern Recognition Letters 25, 129–1302 (2004) 4. Kirsten, M., Wrobel, S.: Relational distance-based clustering. In: Page, D.L. (ed.) ILP 1998. LNCS, vol. 1446, pp. 261–270. Springer, Heidelberg (1998) 5. Bickel, S., Scheffer, T.: Multi-View Clustering. In: The 4th IEEE International Conference on Data Mining, pp. 19–26. IEEE press, Brighton (2004) 6. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: the 9th international conference on Information and knowledge management, pp. 86–93. ACM press, McLean (2000) 7. Balcan, M.F., Blum, A., Yang, K.: Co-training and expansion: Towards bridging theory and practice. In: The 18th Annual Conference on Neural Information Processing Systems, pp. 89–96. MIT press, Vancouver (2005) 8. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13, 415–425 (2002) 9. Zhou, Z.H., Zhan, D.C., Yang, Q.: Semi-supervised learning with very few labeled training examples. In: The 22nd AAAI Conference on Artificial Intelligence, pp. 675–680. AAAI press, Vancouver (2007) 10. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: The Conference on Computational Learning Theory, pp. 92–100. Morgan Kaufmann press, Madison (1998) 11. Kohavi, R.: Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. In: The Second International Conference on Knowledge Discovery and Data Mining, pp. 202–207. AAAI press, Oregon (1996) 12. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis; An overview with application to learning methods. Technical report, Department of Computer Science Royal Holloway, University of London (2003)
Application of the Variational Iteration Method for Inverse Stefan Problem with Neumann’s Boundary Condition Damian Slota Institute of Mathematics Silesian University of Technology Kaszubska 23, 44-100 Gliwice, Poland [email protected]
Abstract. In this paper, the possibility of application of the variational iteration method for solving the inverse Stefan problem with a Neumann boundary condition is presented. This problem consists in a calculation of temperature distribution as well as in the reconstruction of the function which describes the heat flux on the boundary, when the position of the moving interface is known. The validity of the approach is verified by comparing the results obtained with the analytical solution. Keywords: Inverse Stefan problem, Variational iteration method, Heat equation, Solidification.
1 Introduction
In this paper, the author is trying to solve the one-phase inverse design Stefan problem with a Neumann boundary condition. This problem consists in a calculation of temperature distribution as well as in the reconstruction of the function which describes the heat flux on the boundary, when the position of the moving interface is known. This paper applies the variational iteration method to the discussed problems. The variational iteration method was developed by Ji-Huan He [1, 2, 3, 4, 5] and is useful for solving a wide range of problems [1, 2, 3, 7, 5, 8, 9, 4, 6, 10, 11]. The application of the variational iteration method for direct and inverse Stefan problems with a Dirichlet boundary condition is considered in paper [12]. It is possible to find an exact analytical solution of the inverse Stefan problem only in a few simple cases. In other cases we are left with approximate solutions only [15, 17, 18, 16, 14, 13]. For example, in papers [14, 13], the authors used the Adomian decomposition method combined with optimization for an approximate solution of a one-phase inverse Stefan problem. However, in paper [17], the authors compare selected numerical methods to solve a one-dimensional, one-phase inverse Stefan problem.
Fig. 1. Domain of the problem
2 Problem Formulation
Let D = {(x, t); t ∈ [0, t*), x ∈ [0, ξ(t)]} be a domain in R² (Figure 1). On the boundary of this domain, three components are distributed:
Γ0 = {(x, 0); x ∈ [0, v]}, v = ξ(0),   (2.1)
Γ1 = {(0, t); t ∈ [0, t*)},   (2.2)
Γg = {(x, t); t ∈ [0, t*), x = ξ(t)},   (2.3)
where the initial and boundary conditions are given. In domain D, we consider the heat conduction equation:
α ∂²u(x, t)/∂x² = ∂u(x, t)/∂t,   (2.4)
with the initial condition on boundary Γ0:
u(x, 0) = ϕ(x),   (2.5)
the Neumann condition on boundary Γ1:
−k ∂u(0, t)/∂x = q(t),   (2.6)
and the condition of temperature continuity and the Stefan condition on the moving interface Γg:
u(ξ(t), t) = u*,   (2.7)
−k ∂u(x, t)/∂x |_{x=ξ(t)} = κ dξ(t)/dt,   (2.8)
where α is the thermal diffusivity, k is the thermal conductivity, κ is the latent heat of fusion per unit volume, u∗ is the phase change temperature, x = ξ(t) is
the function describing the position of the moving interface Γg , and u, t and x refer to temperature, time and spatial location, respectively. The discussed inverse Stefan problem consists in finding a function to describe the temperature distribution u(x, t) in domain D, and function q(t) describing the heat flux on the boundary Γ1 , which will satisfy equations (2.4)–(2.8). All other functions (ϕ(x), ξ(t)) and parameters (α, k, κ, u∗ ), are known.
3 Solution of the Problem
Using the variational iteration method we are able to solve the nonlinear equation:
L(u(z)) + N(u(z)) = f(z),   (3.1)
where L is the linear operator, N is the nonlinear operator, f is a known function and u is a sought function. At first, we construct a correction functional:
u_n(z) = u_{n−1}(z) + ∫₀^z λ [L(u_{n−1}(s)) + N(ũ_{n−1}(s)) − f(s)] ds,   (3.2)
where ũ_{n−1} is a restricted variation [1, 2, 3, 4], λ is a general Lagrange multiplier [19, 1, 2], which can be identified optimally by the variational theory [20, 1, 2, 3], and u_0(z) is an initial approximation. Next, we determine the general Lagrange multiplier and identify it as a function λ = λ(s). Finally, we obtain the iteration formula:
u_n(z) = u_{n−1}(z) + ∫₀^z λ(s) [L(u_{n−1}(s)) + N(u_{n−1}(s)) − f(s)] ds.   (3.3)
The correction functional for equation (2.4) can be expressed as follows:
u_n(x, t) = u_{n−1}(x, t) + ∫₀^x λ [∂²u_{n−1}(s, t)/∂s² − (1/α) ∂ũ_{n−1}(s, t)/∂t] ds.   (3.4)
From equation (3.4), the general Lagrange multiplier can be identified as follows:
λ(s) = s − x.   (3.5)
Hence, we obtain the following iteration formula:
u_n(x, t) = u_{n−1}(x, t) + ∫₀^x (s − x) [∂²u_{n−1}(s, t)/∂s² − (1/α) ∂u_{n−1}(s, t)/∂t] ds.   (3.6)
Next, we select an initial approximation in the form: u0 (x, t) = A + B x,
(3.7)
where A and B are parameters. For the determination of parameters A and B, we will use the Neumann boundary condition (2.6) and the condition of temperature
continuity (2.7). To this end, we require that the initial approximation u_0(x, t) fulfils the above conditions. The boundary condition (2.6) requires:
B = −(1/k) q(t),   (3.8)
whilst the condition (2.7) leads to the result:
A = u* + (1/k) ξ(t) q(t).   (3.9)
Hence, the initial approximation has the form:
u_0(x, t) = u* + (1/k) q(t) (ξ(t) − x).   (3.10)
Finally, we obtain the following iteration formula:
u_0(x, t) = u* + (1/k) q(t) (ξ(t) − x),   (3.11)
u_n(x, t) = u_{n−1}(x, t) + ∫₀^x (s − x) [∂²u_{n−1}(s, t)/∂s² − (1/α) ∂u_{n−1}(s, t)/∂t] ds,   n ≥ 1.   (3.12)
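As a small illustration of formulas (3.11)–(3.12), the following sympy sketch performs a single correction step. Purely for concreteness it uses the data of the example in Sect. 4 (α = 0.1, k = 1, u* = 1, ξ(t) = t/10) together with the exact flux q(t) = e^{t/10}; it is only a sketch, not the author's implementation.

import sympy as sp

x, t, s = sp.symbols('x t s')
alpha, k, u_star = sp.Rational(1, 10), 1, 1
xi, q = t / 10, sp.exp(t / 10)

u0 = u_star + q * (xi - x) / k                          # initial approximation (3.11)
res = sp.diff(u0, x, 2) - sp.diff(u0, t) / alpha        # residual of the heat equation (2.4)
u1 = sp.expand(u0 + sp.integrate((s - x) * res.subs(x, s), (s, 0, x)))  # one step of (3.12)
print(u1)   # u1 gains x**2 and x**3 terms that were missing from u0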
Because function u_n (3.6) depends on an unknown function q(t), we have derived this function in the form of a linear combination:
q(t) = Σ_{i=1}^{m} p_i ψ_i(t),   (3.13)
where p_i ∈ R and the basis functions ψ_i(t) are linearly independent. The coefficients p_i are selected so as to minimize the deviation of function u_n (3.6) from the initial condition (2.5) and the Stefan condition (2.8). Thus, we are looking for the minimum of the following functional:
J(p_1, …, p_m) = ∫₀^v [u_n(x, 0) − ϕ(x)]² dx + ∫₀^{t*} [k ∂u_n(ξ(t), t)/∂x + κ dξ(t)/dt]² dt.   (3.14)
After substituting equations (3.12) and (3.13) into functional J, differentiating it with respect to the coefficients p_i (i = 1, …, m) and equating the obtained derivatives to zero:
∂J(p_1, …, p_m)/∂p_i = 0,   i = 1, …, m,   (3.15)
a system of linear algebraic equations is obtained. In the course of solving this system, coefficients p_i are determined, and thereby, the approximated distributions of the heat flux q(t) on boundary Γ1 and temperature u_n(x, t) in domain D are obtained.
4 Example
The theoretical considerations introduced in the previous sections will be illustrated with an example, where the approximate solution will be compared with an exact solution. We consider an example of the inverse Stefan problem, in which: α = 0.1, k = 1, κ = 10, u* = 1, t* = 1/2 and
ϕ(x) = e^{−x},   ξ(t) = t/10.   (4.1)
Next, an exact solution of the inverse Stefan problem will be found by means of the following functions:
u(x, t) = e^{t/10−x},   (x, t) ∈ D,   (4.2)
q(t) = e^{t/10},   t ∈ [0, t*].   (4.3)
As basis functions we take:
ψ_i(t) = t^{i−1},   i = 1, …, m.   (4.4)
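The whole procedure of Sect. 3 can be imitated in a few lines of computer algebra. The sketch below is only an illustration under the stated assumptions, not the author's implementation: it uses m = 2 basis functions (4.4), performs one correction step (3.12) and minimizes the functional (3.14); note that for this example v = ξ(0) = 0, so only the Stefan-condition term of (3.14) contributes.

import sympy as sp

x, t, s = sp.symbols('x t s')
p1, p2 = sp.symbols('p1 p2')
alpha, k, kappa, u_star, t_star = sp.Rational(1, 10), 1, 10, 1, sp.Rational(1, 2)
xi = t / 10
q = p1 + p2 * t                      # linear combination (3.13) with m = 2

# initial approximation (3.11) and one correction step (3.12)
u0 = u_star + q * (xi - x) / k
res = sp.diff(u0, x, 2) - sp.diff(u0, t) / alpha
u1 = u0 + sp.integrate((s - x) * res.subs(x, s), (s, 0, x))

# functional (3.14); here v = xi(0) = 0, so only the Stefan-condition term remains
J = sp.integrate((k * sp.diff(u1, x).subs(x, xi) + kappa * sp.diff(xi, t))**2,
                 (t, 0, t_star))

# normal equations (3.15) and the reconstructed heat flux
sol = sp.solve([sp.diff(J, p1), sp.diff(J, p2)], [p1, p2])
print(sol, q.subs(sol))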
In Figures 2 and 3, we present an exact and reconstructed distribution of the heat flux on the boundary Γ1 for n = 1, m = 5 and for n = 2, m = 2. The left figure presents the exact (solid line) and the determined approximate position (dash line), whereas the right figure shows diagrams of the distribution of errors which occur when reconstructing the heat flux.
Fig. 2. Heat flux on boundary Γ1 (a) and error distribution in the reconstruction of this heat flux (b) for n = 1 and m = 5 (solid line – exact value qe , dash line – reconstructed value qr )
Figure 4 presents error distributions in the reconstruction of the phase change temperature (left figure) and error distributions in the reconstruction of the Stefan condition along the moving interface (right figure) for n = 1 and m = 5. The calculations were made for an accurate moving interface position and for a position disturbed with a pseudorandom error with a size of 1%, 2% and 5%. Table 1 presents values of the absolute error (δf ) and a percentage relative error
Fig. 3. Heat flux on boundary Γ1 (a) and error distribution in the reconstruction of this heat flux (b) for n = 2 and m = 2 (solid line – exact value qe , dash line – reconstructed value qr )
Fig. 4. Error distribution in the reconstruction of phase change temperature (a) and in the reconstruction of the Stefan condition (b)
(Δf) with which the heat flux on the boundary Γ1 (f = q) and the distribution of the temperature in domain D (f = u) were reconstructed for different perturbations. The values of the absolute errors are calculated from the formulas:
δq = [ (1/t*) ∫₀^{t*} (q_e(t) − q_r(t))² dt ]^{1/2},   (4.5)
δu = [ (1/|D|) ∫∫_D (u_e(x, t) − u_r(x, t))² dx dt ]^{1/2},   (4.6)
where q_e(t) is an exact value of function q(t), q_r(t) is a reconstructed value of function q(t), u_e(x, t) is an exact distribution of temperature in domain D and u_r(x, t) is a reconstructed distribution of temperature in this domain, and:
|D| = ∫∫_D 1 dx dt.   (4.7)
However, percentage relative errors are calculated from the formulas:
Δq = δq · [ (1/t*) ∫₀^{t*} q_e(t)² dt ]^{−1/2} · 100%,   (4.8)
Fig. 5. Error distribution in the reconstruction of heat flux for perturbation equal to 2% (a) and 5% (b) (qe – exact value, qr – reconstructed value)
Δu = δu · [ (1/|D|) ∫∫_D u_e(x, t)² dx dt ]^{−1/2} · 100%.   (4.9)
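The error measures are straightforward to evaluate numerically; for instance, a trapezoidal-rule approximation of (4.5) and (4.8) may look as follows (the "reconstructed" flux below is hypothetical and only serves to make the snippet self-contained).

import numpy as np

def trapez(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

t_star = 0.5
t = np.linspace(0.0, t_star, 201)
q_e = np.exp(t / 10)          # exact flux (4.3)
q_r = 1.0 + t / 10            # hypothetical reconstruction, e.g. (3.13) with m = 2

delta_q = np.sqrt(trapez((q_e - q_r) ** 2, t) / t_star)            # (4.5)
Delta_q = delta_q / np.sqrt(trapez(q_e ** 2, t) / t_star) * 100.0  # (4.8)
print(delta_q, Delta_q)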
As shown in the results, the presented algorithm is stable in terms of the input data errors. Each time when the input data were burdened with errors, the error of the heat flux reconstruction did not exceed the initial error. Table 1. Values of errors in the reconstruction of heat flux and distribution of temperature (n = 2, m = 2, δ – absolute error, Δ – percentage relative error)
Per.   δq         Δq          δu         Δu
0%     0.001225   0.11944%    0.000785   0.07721%
1%     0.002957   0.28830%    0.000843   0.08292%
2%     0.008244   0.80389%    0.001065   0.10473%
5%     0.016487   1.60768%    0.001385   0.13620%

5 Conclusion
In this paper, the solution of one-phase inverse Stefan problems is presented. The problem consists in a calculation of the temperature distribution and of a function which describes the heat flux on the boundary, when the position of the moving interface is known. The proposed solution is based on the variational iteration method. The calculations show that this method is effective for solving the problems under consideration. The advantage of the proposed method compared with classical methods consists in obtaining the heat flux and temperature distribution in the form of continuous functions, instead of a discrete form. The method applied does not require discretization of the region, as in the case of classical methods based on the finite-difference method or the finite-element method. The proposed method produces a wholly satisfactory result already after a small number of iterations,
whereas the classical methods require a suitably dense lattice in order to achieve similar accuracy, which considerably extends the time of calculations.
References 1. He, J.-H.: Approximate analytical solution for seepage flow with fractional derivatives in porous media. Comput. Methods Appl. Mech. Engrg. 167, 57–68 (1998) 2. He, J.-H.: Approximate solution of nonlinear differential equations with convolution product nonlinearities. Comput. Methods Appl. Mech. Engrg. 167, 69–73 (1998) 3. He, J.-H.: Variational iteration method – a kind of non-linear analytical technique: some examples. Int. J. Non-Linear Mech. 34, 699–708 (1999) 4. He, J.-H.: Non-Perturbative Methods for Strongly Nonlinear Problems. Dissertation.de-Verlag im Internet GmbH, Berlin (2006) 5. He, J.-H.: Variational iteration method – Some recent results and new interpretations. J. Comput. Appl. Math. 207, 3–17 (2007) 6. Abdou, M.A., Soliman, A.A.: New applications of variational iteration method. Physica D 211, 1–8 (2005) 7. He, J.-H.: Variational iteration method for autonomous ordinary differential systems. Appl. Math. Comput. 114, 115–123 (2000) 8. He, J.-H., Liu, H.-M.: Variational approach to diffusion reaction in spherical porous catalyst. Chem. Eng. Technol. 27, 376–377 (2004) 9. He, J.-H., Wu, X.-H.: Construction of solitary solution and compacton-like solution by variational iteration method. Chaos, Solitions and Fractals 29, 108–113 (2006) 10. Momani, S., Abuasad, S.: Application of He’s variational iteration method to Helmholtz equation. Chaos, Solitions and Fractals 27, 1119–1123 (2006) 11. Momani, S., Abuasad, S., Odibat, Z.: Variational iteration method for solving nonlinear boundary value problems. Appl. Math. Comput. 183, 1351–1358 (2006) 12. Slota, D.: Direct and Inverse One-Phase Stefan Problem Solved by Variational Iteration Method. Comput. Math. Appl. 54, 1139–1146 (2007) 13. Grzymkowski, R., Slota, D.: One-phase inverse Stefan problems solved by Adomian decomposition method. Comput. Math. Appl. 51, 33–40 (2006) 14. Grzymkowski, R., Slota, D.: An application of the Adomian decomposition method for inverse Stefan problem with Neumann’s boundary condition. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3516, pp. 895–898. Springer, Heidelberg (2005) 15. Zabaras, N., Yuan, K.: Dynamic programming approach to the inverse Stefan design problem. Numer. Heat Transf. B 26, 97–104 (1994) 16. Grzymkowski, R., Slota, D.: Numerical method for multi-phase inverse Stefan design problems. Arch. Metall. Mater. 51, 161–172 (2006) 17. Liu, J., Guerrier, B.: A comparative study of domain embedding methods for regularized solutions of inverse Stefan problems. Int. J. Numer. Methods Engrg. 40, 3579–3600 (1997) 18. Slodiˇcka, M., De Schepper, H.: Determination of the heat-transfer coefficient during soldification of alloys. Comput. Methods Appl. Mech. Engrg. 194, 491–498 (2005) 19. Inokuti, M., Sekine, H., Mura, T.: General use Lagrange multiplier in non-linear mathematical physics. In: Nemat-Nasser, S. (ed.) Variational Method in the Mechanics of Solids, pp. 156–162. Pergamon Press, Oxford (1978) 20. Finlayson, B.A.: The Method of Weighted Residuals and Variational Principles. Academic Press, New York (1972)
Generalized Laplacian as Focus Measure Muhammad Riaz1, Seungjin Park2, Muhammad Bilal Ahmad1, Waqas Rasheed1, and Jongan Park1 1
School of Information & Communications Engineering, Chosun University, 501-759 South Korea 2 Dept of Biomedical Engineering, Chonnam National University Hospital, Kwangju, South Korea [email protected]
Abstract. Shape from focus (SFF) uses focus measure operator for depth measurement from a sequence of images. From the analysis of defocused image, it is observed that the focus measure operator should respond to high frequency variations of image intensity and produce maximum values when the image is perfectly focused. Therefore, an effective focus measure operator must be a high-pass filter. Laplacian is mostly used as focus measure operator in the previous SFF methods. In this paper, generalized Laplacian is used as focus measure operator for better 3D shape recovery of objects. Keywords: Shape from focus, SFF, Laplace filter, 3D shape recovery.
1 Introduction The well-known examples of passive techniques for 3D shape recovery from images include shape from focus (SFF). Shape From Focus (SFF) [1], [2] for 3D shape recovery is a search method which searches the camera parameters (lens position and/or focal length) that correspond to focusing the object. The basic idea of image focus is that the objects at different distances from a lens are focused at different distances. Fig. 1 shows the basic image formation geometry. In SFF, the cam-era parameter setting, where the blur circle radius R is zero is used to determine the distance of the object. In Fig. 1, if the image detector (ID) is placed exactly at a distance v, sharp image P’ of the point P is formed. Then the relationship between the object distance u, focal distance of the lens f, and the image distance v is given by the Gaussian lens law:
1/f = 1/u + 1/v   (1)
Once the best-focused camera parameter settings over every image point are determined, the 3D shape of the object can be easily computed. Note that a sensed image is in general quite different from the focused image of an object. The sensors
Fig. 1. Image formation of a 3D object
are usually planar image detectors such as CCD arrays; therefore, for curved objects only some parts of the image will be focused whereas other parts will be blurred. In SFF, an unknown object is moved with respect to the imaging system and a sequence of images that correspond to different levels of object focus is obtained. The basic idea of image focus is that the objects at different distances from a lens are focused at different distances. The change in the level of focus is obtained by changing either the lens position or the focal length of the lens in the camera. A focus measure is computed in the small image regions of each of the image frames in the image sequence. The value of the focus measure increases as the image sharpness or contrast increases and it attains the maximum for the sharpest focused image. Thus the sharpest focused image regions can be detected and extracted. This facilitates auto-focusing of small image regions by adjusting the camera parameters (lens position and/or focal length) so that the focus measure attains its maximum value for that image region. Also, such focused image regions can be synthesized to obtain a large image where all image regions are in focus. Further, the distance or depth of object surface patches that correspond to the small image regions can be obtained from the knowledge of the lens position and the focal length that result in the sharpest focused images of the surface patches. A lot of research has been done on image focus analysis to automatically focus the imaging system [6], [7] or to obtain sparse depth information from the observed scene [2], [3], [4], [8], [9]. Most previous research on Shape From Focus (SFF) concentrated on the developments and evaluations of different focus measures [1], [9]. From the analysis of the defocused image [1], it is shown that defocusing is a low-pass filtering process, and hence, the focus measure should respond to high frequency variations of image intensity and produce maximum values when the image is perfectly focused. Therefore, most of the focus measures in the literature [1], [9] somehow maximize the high frequency variations in the images. The common focus measures in the literature
are: maximization of high frequency energy in the power spectrum using the FFT, variance of image gray levels, L1-norm of the image gradient, L2-norm of the image gradient, L1-norm of second derivatives of the image, energy of the Laplacian, Modified Laplacian [2], histogram entropy of the image, histogram of local variance, Sum-Modulus-Difference, etc. There are other focus measures based on moments, wavelets, DCT and median filters. The traditional SFF (SFFTR) [2] uses the modified Laplacian as focus measure operator. There are spikes in the 3D shape recovery using the modified Laplacian. The Laplacian and modified Laplacian operators are fixed and are not suitable in every situation [5]. In this paper, we have used a generalized Laplacian as focus measure operator, which can be tuned for the best 3D shape results. This paper is organized as follows. Section 2 describes the image focus and defocus analysis and the traditional SFF method. Section 3 describes the generalized Laplacian, and simulation results are shown in Section 4.
2 Image Focus and Defocus Analysis If the image detector (CCD array) coincides with the image plane (see Fig. 1) a clear or focused image f(x,y) is sensed by the image detector. Note that a sensed image is in general quite different from the focused image of an object. The sensors are usually planar image detectors such as CCD arrays; therefore, for curved objects only some parts of the image will be focused whereas other parts will be blurred. The blurred image h(x,y) usually modeled by the PSF of the camera system. In a small image region if the imaged object surface is approximately a plane normal to the optics axis, then the PSF is the same for all points on the plane. The defocused image g(x,y) in the small image region on the image detector is given by the convolution of the focused image with the PSF of the camera system, as:
g ( x, y ) = h ( x, y ) ⊗ f ( x, y )
(2)
where the symbol ⊗ denotes convolution. Now we consider the defocusing process in the frequency domain (w1, w2). Let G(w1, w2), H(w1, w2) and F(w1, w2) be the Fourier Transforms of the functions g(x, y), h(x, y) and f(x, y), respectively. Then, we can express Eq. (2) in the frequency domain, using the fact that convolution in the spatial domain is multiplication in the frequency domain, as:
G ( w1 , w2 ) = H ( w1 , w2 ).F ( w1 , w2 )
(3)
The Gaussian PSF model is a very good model of the blur circle. So the PSF of the camera system can be given as:
h(x, y) = (1/(2πσ²)) exp(−(x² + y²)/(2σ²))   (4)
The spread parameter σ is proportional to the blur radius R in Fig. 1. The Fourier Transform of the PSF is the OTF of the camera system and is given as:
H(w1, w2) = exp(−(w1² + w2²) σ²/2)   (5)
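The low-pass character of (5) is easy to verify numerically. The following sketch (not from the paper) blurs a broadband test image with Gaussian PSFs of increasing σ and measures how the high-frequency part of the spectrum shrinks.

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
img = rng.standard_normal((256, 256))          # broadband test "image"

for sigma in (1.0, 2.0, 4.0):
    blurred = gaussian_filter(img, sigma)      # convolution with the PSF h(x, y)
    spec = np.abs(np.fft.fftshift(np.fft.fft2(blurred)))
    r = np.hypot(*np.meshgrid(np.arange(-128, 128), np.arange(-128, 128)))
    hf_ratio = spec[r > 64].sum() / spec.sum() # energy in the outer (high-frequency) band
    print(sigma, round(hf_ratio, 4))           # decreases as sigma (the blur) grows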
We note that low frequencies are passed unattenuated, while higher frequencies are reduced in amplitude, significantly so for frequencies above about 1/σ. Now σ is a measure of the size of the original PSF; therefore, the larger the blur, the lower the frequencies that are attenuated. This is an example of the inverse relationship between scale changes in the spatial domain and corresponding scale changes in the frequency domain. In fact the product R”ρ is constant, where R” is the blur radius in the spatial domain, and ρ is the radius in its transform. Hence, defocusing is a low-pass filtering process where the bandwidth decreases with increase in defocusing. A defocused image of an object can be obtained in three ways: by displacing the sensor with respect to the image plane, by moving the lens, or by moving the object with respect to the object plane. Moving the lens or sensor with respect to one another causes the following problems: (a) The magnification of the system varies, causing the image coordinates of focused points on the object to change. (b) The area on the sensor over which light energy is distributed varies, causing a variation in image brightness. However, object movement is easily realized in industrial and medical applications. This approach ensures that the points of the object are perfectly focused onto the image plane with the same magnification. In other words, as the object moves, the magnification of the imaging system can be assumed to be constant for image areas that are perfectly focused. To automatically measure the sharpness of focus in an image, we must formulate a metric or criterion of “sharpness”. The essential idea underlying practical measures of focus quality is to respond to high-frequency content in the image and, ideally, to produce maximum response when the image area is perfectly focused. From the analysis of the defocused image, it is shown that defocusing is a low-pass filtering, and hence, the focus measure should respond to high frequency variations of image intensity and produce maximum values when the image is perfectly focused. Therefore, most of the focus measures in the literature somehow maximize the high frequency variations in the images. Generally, the objective has been to find an operator that behaves in a stable and robust manner over a variety of images, including those of indoor and outdoor scenes. Such an approach is essential while developing automatically focusing systems that have to deal with general scenes. An interesting observation can be made regarding the application of focus measure operators. Equation (2) relates a defocused image g(x, y) to the focused image through the blurring function. Assume that a focus measure operator o(x, y) is applied by convolution to the defocused image g(x, y). The result is a new image r(x, y), expressed as:
r ( x, y ) = o( x, y ) ⊗ g ( x, y ) = o( x, y ) ⊗ (h( x, y ) ⊗ f ( x, y ))
(6)
Since convolution is linear and shift-invariant, we can rewrite the above expression as: r ( x, y ) = h( x, y ) ⊗ (o( x, y ) ⊗ f ( x, y ))
(7)
Therefore, applying a focus measure operator to a defocused image is equivalent to defocusing a new image obtained by convolving the focused image with the operator. The operator only selects the frequencies (high frequencies) in the focused image that will be attenuated due to defocusing. Since, defocusing is a low-pass filtering process, its effects on the image are more pronounced and detectable if the image has strong
high-frequency content. An effective focus measure operator, therefore, must high-pass filter the image. One technique for passing the high spatial frequencies is to determine its second derivative, such as the Laplacian, given as:
∇²I = ∂²I/∂x² + ∂²I/∂y²   (8)
The Laplacian masks for the 4-neighbourhood and the 8-neighbourhood are given in Fig. 2.

4-neighbourhood:        8-neighbourhood:
  0  -1   0               -1  -1  -1
 -1   4  -1               -1   8  -1
  0  -1   0               -1  -1  -1

Fig. 2. Laplacian masks
The Laplacian is computed for each pixel of the given image window and the criterion function can be stated as:
Σ_x Σ_y ∇²I(x, y)   for ∇²I(x, y) ≥ T   (9)
Nayar noted that in the case of the Laplacian the second derivatives in the x and y directions can have opposite signs and tend to cancel each other. He, therefore, proposed the modified Laplacian (ML) as:
∇²_M I = |∂²I/∂x²| + |∂²I/∂y²|   (10)
The discrete approximation to the Laplacian is usually a 3 x 3 operator. In order to accommodate for possible variations in the size of texture elements, Nayar computed the partial derivatives by using a variable spacing (step) between the pixels used to compute the derivatives. He proposed the discrete approximation of the ML as:
∇²_ML I(x, y) = |2 I(x, y) − I(x − step, y) − I(x + step, y)| + |2 I(x, y) − I(x, y − step) − I(x, y + step)|   (11)
Finally, the depth map or the focus measure at a point (x, y) was computed as the sum of ML values, in a small window around (x, y), that are greater than a threshold value T1:
F(x, y) = Σ_{i=x−N}^{x+N} Σ_{j=y−N}^{y+N} ∇²_ML I(i, j)   for ∇²_ML I(i, j) ≥ T1   (12)
The parameter N determines the window size used to compute the focus measure. Nayar referred to the above focus measure as the sum-modified-Laplacian (SML) or traditional SFF (SFFTR).
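A direct NumPy transcription of (11)–(12) may look as follows; the window half-size N, the spacing step and the threshold T1 follow the notation above, and the concrete default values are only illustrative (this is a sketch, not the authors' code).

import numpy as np

def modified_laplacian(img, step=1):
    ml = np.zeros_like(img, dtype=float)
    ml[step:-step, step:-step] = (
        np.abs(2 * img[step:-step, step:-step]
               - img[:-2*step, step:-step] - img[2*step:, step:-step])
        + np.abs(2 * img[step:-step, step:-step]
                 - img[step:-step, :-2*step] - img[step:-step, 2*step:]))
    return ml                                   # the ML of (11)

def sml(img, x, y, N=4, step=1, T1=7.0):
    # (x, y) must be at least N + step pixels away from the image border
    ml = modified_laplacian(img.astype(float), step)
    win = ml[x - N:x + N + 1, y - N:y + N + 1]
    return win[win >= T1].sum()                 # focus measure F(x, y) of (12)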
3 Generalized Laplacian as Focus Measure For a given camera, the optimally accurate focus measure may change from one object to the other depending on their focused images. Therefore, selecting the optimal focus measure from a given set involves computing all focus measures in the set. In applications where computation needs to be minimized by computing only one focus measure, it is recommended to use simple and accurate focus measure filter for all conditions [5]. Laplacian has some desirable properties such as simplicity, rotational symmetry, elimination of unnecessary in-formation, and retaining of necessary information. Modified Laplacian [2] takes the absolute values of the second derivatives in the Laplacian in order to avoid the cancellation of second derivatives in the horizontal and vertical directions that have opposite signs. In this paper, we tried to use tuned Laplacian [5] as focus measure operator. A 3x3 Laplacian (a) should be rotationally symmetric, and (b) should not respond to any DC component in image brightness. The structure of the Laplacian by considering the above conditions is shown in Fig. 3. The last condition is satisfied if the sum of all elements of the operator equals zero: a + 4b + 4c = 0
(13)
(a) general 3×3 kernel:        (b) tuned kernel (b = −1):
  c   b   c                      c    −1     c
  b   a   b                     −1  4(1−c)  −1
  c   b   c                      c    −1     c

Fig. 3. (a) The 3x3 Laplacian kernel (b) Tuned Laplacian kernel with c = 0.4, b = -1 (c) The Fourier Transform of (b) when c = 0 and (d) when c = 0.4
If b = -1, then a = 4(1-c). Now we have only one variable, c. The problem is now to find c such that the operator's response has sharp peaks. The frequency response of the Laplacian for c = 0 and for c = 0.4 is shown in Fig. 3 (c) and (d). From Fig. 3 (d), we see that the response of the tuned focus measure operator (c = 0.4) has much sharper peaks than the Laplacian (c = 0). The 4-neighbourhood kernel in Fig. 2 is obtained with c = 0, b = -1, and the 8-neighbourhood kernel in Fig. 2 is obtained with c = -1, b = -1.
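The construction of the tuned kernel is easily reproduced. The sketch below (an illustration, not the authors' code) builds the 3×3 kernel of Fig. 3 from the zero-DC condition a + 4b + 4c = 0 and checks that it does not respond to a constant image.

import numpy as np
from scipy.ndimage import convolve

def tuned_laplacian_kernel(c=0.4, b=-1.0):
    a = -4.0 * b - 4.0 * c        # zero response to a constant (DC) image
    return np.array([[c, b, c],
                     [b, a, b],
                     [c, b, c]])

k = tuned_laplacian_kernel()
print(k.sum())                                    # 0.0: no DC response
flat = np.full((16, 16), 7.0)
print(np.abs(convolve(flat, k)).max())            # ~0 on a constant image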
4 Simulation Results We analyze and compare the results of 3D shape recovery from image sequences using the SFFTR with modified Laplacian and generalized Laplacian. Experiments were conducted on three different types of objects to show the performance of the new operator. The first object is a simulated cone whose images were generated using camera simulation software. A sequence of 97 images of the simulated cone was generated corresponding to 97 lens positions. The size of each image was 360 x 360. The second object is a real cone whose images were taken using a CCD camera system. The real cone object was made of hard-board with black and white stripes drawn on the surface so that a dense texture of ring patterns is viewed in images. All image frames in the image sequences taken for experiments have 256 gray levels.
Fig. 4. Images of simulated cone at different lens steps: (a) at lens step 15, (b) at lens step 40, (c) at lens step 70
Fig. 5. Images of real cone at different lens steps: (a) at lens step 20, (b) at lens step 40, (c) at lens step 90
Figs. 4 and 5 show the image frames recorded at different lens positions controlled by the motor. In each of these frames, only one part of the image is focused, whereas the other parts are blurred to varying degrees. We apply the Modified Laplacian and the Generalized Laplacian as focus measure operators using the SFFTR method on the simulated and real cone images. The improvements in the results (Fig. 6) on the simulated cone are not very prominent except a slight sharpening of the peak. However, on the real cone, we see in Fig. 7 (a) that there are some erroneous peaks using the Modified Laplacian which are removed, as shown in Fig. 7 (b), using the generalized Laplacian.
Fig. 6. (a) 3D shape recovery of the Simulated cone using SFFTR with Modified Laplacian as Focus Measure Operator (b) with Tuned Laplacian as Focus Measure operator with b= -0.8, c = 0.45
Fig. 7. (a) 3D shape recovery of the Real cone using SFFTR with Modified Laplacian as Focus Measure Operator (b) with Tuned Laplacian as Focus Measure operator with b= -1, c = 0.4
5 Conclusions In this paper, we have proposed a generalized Laplacian method as focus measure operator for shape from focus. Some improvements in the 3D shape recovery results are obtained. It is also noticed through simulation that erroneous peaks can be reduced
by using the generalized (tuned) Laplacian, as discussed in the previous section. Further investigation is in progress on generalized focus measure operators instead of fixed operators.
Acknowledgement This research was supported by the second BK 21 program of the Korean Government.
References 1. Krotkov, E.: Focusing. International Journal of Computer Vision 1, 223–237 (1987) 2. Nayar, S.K., Nakagawa, Y.: Shape from focus. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(8) (August 1994) 3. Subbarao, M., Choi, T.-S.: Accurate recovery of three dimensional shape from im-age focus. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(3) (March 1995) 4. Nayar, S.K., Watanabe, M., Noguchi, M.: Real-time focus range sensor. In: Proc. of Intl. Conf. on Computer Vision, pp. 995–1001 (June 1995) 5. Subbarao, M., Tyan, J.K.: Selecting the Optimal Focus Measure for Autofocusing and Depth-from-Focus. IEEE Trans. Pattern Analysis and Machine Intelligence 20(8), 864–870 (1998) 6. Schlag, J.F., Sanderson, A.C., Neumann, C.P., Wimberly, F.C.: Implementation of Automatic Focusing Algorithms for a Computer Vision System with Camera Control. Carnegie Mel-lon University, CMU-RI-TR-83-14 (August 1983) 7. Tenenbaum, J.M.: Accommodation in Computer Vision. Ph.D. dissertation, Standford University (1970) 8. Hiura, S., Matsuyama, T.: Depth Measurement by the Multi-Focus Camera. In: Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, June 1998, pp. 953–959 (1998) 9. Jarvis, R.A.: A Perspective on Range Finding Techniques for Computer Vision. IEEE Trans. Pattern Analysis and Machine Intelligence 5(2) (March 1983)
Application of R-Functions Method and Parallel Computations to the Solution of 2D Elliptic Boundary Value Problems Marcin Detka and Czesław Cichoń Chair of Applied Computer Science, Kielce University of Technology, Al. Tysiąclecia Państwa Polskiego 7, 25-314 Kielce, Poland {Marcin.Detka,Czeslaw.Cichon}@tu.kielce.pl
Abstract. In the paper, the R-function theory developed by Rvachew is applied to solve 2D elliptic boundary value problems. Unlike the well-established FEM or BEM method, this method requires dividing the solution into two parts. In the first part, the boundary conditions are satisfied exactly and in the second part, the differential equation is satisfied in an approximate way. In such a way, it is possible to formulate in one algorithm the so-called general structural solution of a boundary-value problem and use it for an arbitrary domain and arbitrary boundary conditions. The usefulness of the proposed computational method is verified using the example of the solution of the Laplace equation with mixed boundary conditions. Keywords: structural solution, R-functions, parallel computations.
1 Introduction Mathematical models of engineering problems are often defined as boundary-value problems involving partial-differential equations. For the description of such problems it is required to have analytical information connected with the equation itself (or a set of equations) and geometrical information necessary to define boundary conditions. This information concerns the solution domain, shapes of particular parts of the boundary, distribution and forms of the imposed constraints and the like. It is accounted for in a different way in various solution methods. In the paper, such problems are solved in a uniform way using the R-function theory, developed by Rvachew et al. [3]. In this theory, the so-called structural solutions are constructed with the use of elaborated tools of the analytical geometry. As a result, the structural solution exactly satisfying the boundary conditions contains some unknown parameters that have to be computed. The paper is limited to elliptic problems in two dimensions. Such problems are still dealt with because of their well-known relation to many physical models. Furthermore, theoretical and numerical results obtained in this area are very useful in practice.
The discrete solution is determined using orthogonal, structured grid nodes over the Cartesian space C which contains the solution domain Ω . The unknown function of the problem is approximated by means of assuming a set of simple spline functions of the first order. The property of the support locality and density of these functions make it possible to compute, in an effective way, parameters of the structural solution, by redistributing the solution procedure into processors. In the algorithm of the parallel solution, the meshless method, proposed by Yagawa et al. [6], is applied. In this method, the resulting system of linear equations is constructed in a “row-by-row” fashion. The usefulness of the proposed method of computations is verified with the example of the solution of the Laplace equation with mixed boundary conditions.
2 Problem Statement and the Method of Solution Consider the linear operator equation of the form:
Au = f   in Ω ⊂ ℜ²,   (1)
where f ∈ L²(Ω). It is well known that when A is a linear positive-definite operator on a linear set D_A in a separable Hilbert space H, and f ∈ H, the generalized solution u of Eq. (1) is an element of the so-called energy space H_A that minimizes the functional [2]:
J(u) = (1/2) B(u, u) − (f, u)_H,   (2)
where B(u, u) = (Au, u)_H and (f, u)_H are bilinear and linear functionals, respectively. Because of the equivalence of Eqs. (1) and (2), in numerical computations it is preferred to solve Eq. (2) using the Ritz method. It is assumed that for the most general boundary conditions the solution can be represented in the structural form:
u = φ0 + ωϕ (φ1 ) ,
(3)
where ω is a known function that takes on zero values on the boundary ∂Ω and is positive in the interior of Ω . Functions φ0 and φ1 are chosen in such a way so as to satisfy all boundary conditions. The specification of the function
ϕ
depends on the
problem under consideration (see Section 4). It should be noted that functions φ0 and
φ1 can by specified in a piece-wise fashion with different values prescribed to them at each part of the boundary ∂Ω . The advantage of the solutions in the form of Eq. (3)
is that the function ω describes completely all the geometrical information of a particular boundary value problem. The equation ω = 0 defines the geometry of the domain implicitly. The functions ω are constructed using the theory of R-functions developed by Rvachev [3]. Finally, functions φ0 and φ1 can be expressed by only one function φ [5] that in the Ritz approximation is sought in the form:
φ_N = Σ_{j=1}^{N} c_j ψ_j,   (4)
where N is a positive integer, c_j are unknown parameters and {ψ_j} are some basis functions. The sole purpose of the function φ_N is to satisfy the analytical constraints of the boundary value problem. It means that the structure of Eq. (3) does not place any constraints on the choice of the functions ψ_j [4]. After integrating over the domain Ω, the functional J(u) becomes an ordinary function of the parameters c_1, c_2, …, c_N. Therefore, the condition δJ = 0 is
equivalent to the solution of the linear algebraic equation, characterized by the matrix equation:
Kc = F .
(5)
3 Parallel Procedure of Computations Basis functions {ψ_j} can be defined globally in the domain Ω, or locally with dense local supports. As regards the parallel solution of the problem, the local approach is preferable, therefore it is chosen in the paper. Let us define the Cartesian space C ⊂ ℜ² and assume that the solution domain is a subspace Ω ⊂ C, Fig. 1. Then, the space C is discretized using the regular mesh points of a structured grid. It is necessary to choose integers n and m, and define step sizes h and k by h = (x_b − x_a)/n and k = (y_b − y_a)/m, where points a(x_a, y_a) and b(x_b, y_b) are given a priori. For each point of the grid, a simple spline of the first order based on the six triangles containing the grid vertex j is defined, Fig. 2. The basis function ψ_j is composed of the six linear functions:
ψ j = {ψ 1j ,ψ 2j ,...,ψ 6j } ,
(6)
where functions ψ_j^k, k = 1, 2, ..., 6, have the following form in the local coordinate system (s1 = (x − x_j)/h, s2 = (y − y_j)/k):
ψ_j^1 = 1 − s1 − s2   if (s1, s2) ∈ T1,
ψ_j^2 = 1 − s2        if (s1, s2) ∈ T2,
ψ_j^3 = 1 + s1        if (s1, s2) ∈ T3,
ψ_j^4 = 1 + s1 + s2   if (s1, s2) ∈ T4,
ψ_j^5 = 1 + s2        if (s1, s2) ∈ T5,
ψ_j^6 = 1 − s1        if (s1, s2) ∈ T6,
ψ_j = 0 otherwise.   (7)
Fig. 1. Cartesian space C and the solution domain Ω
Fig. 2. Linear dashed basis function ψ_j
The algorithm of parallel computations is shown in Fig. 3. The main steps of the computations are as follows (a schematic sketch of the row-by-row assembly is given after the list):
1. Decomposition of the space C into Cp subdomains, p = 1, 2, …, P, where P is the number of processors.
2. Parallel identification of nodes in each subdomain Cp according to the rule shown in Fig. 4.
3. Parallel modification of subdomain Cp in order to balance the number of nodes in each processor.
4. Parallel supplement of the node set in the domain Cp with neighbouring nodes which are active in the solution of the problem.
5. For each node j, parallel computation of the elements Kjk and Fj, k = 1, 2, …, 7 (max), of the matrix equation (5).
6. Parallel solution of the matrix equation (5) by the conjugate gradient method using the Portable Extensible Toolkit for Scientific Computation (PETSc) library.
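As an illustration of steps 5 and 6, the following is a minimal petsc4py sketch of such a row-by-row assembly and conjugate-gradient solve. It is only a schematic reading of the algorithm above, not the authors' implementation; the helper row_entries(j) is a hypothetical stand-in for the numerical integration of Sect. 3 that returns the (at most 7) nonzero entries K_jk and the value F_j of node j.

from petsc4py import PETSc

def assemble_and_solve(n_nodes, my_nodes, row_entries, comm=PETSc.COMM_WORLD):
    K = PETSc.Mat().createAIJ([n_nodes, n_nodes], nnz=7, comm=comm)
    F = PETSc.Vec().createMPI(n_nodes, comm=comm)
    K.setUp()
    for j in my_nodes:                      # nodes owned by this processor
        cols, kvals, fval = row_entries(j)  # hypothetical element integration
        K.setValues([j], cols, kvals, addv=True)
        F.setValue(j, fval, addv=True)
    K.assemblyBegin(); K.assemblyEnd()
    F.assemblyBegin(); F.assemblyEnd()

    c = F.duplicate()
    ksp = PETSc.KSP().create(comm=comm)
    ksp.setOperators(K)
    ksp.setType('cg')                       # conjugate gradient, as in step 6
    ksp.solve(F, c)
    return c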
Fig. 3. The algorithm of parallel computations
Numerical integration is needed to calculate the matrix K and the vector F . Integration over triangles is performed with the use of 4-point Gaussian quadrature. For the case when the boundary ∂Ω crosses the triangle an additional procedure has been applied in order to divide the integration region into subregions. The rule states
Fig. 4. Decomposition of the solution domain (P=3), identification of nodes
that “the active subregion” is such a part of the triangle that belongs to the Ω domain and contains any of integration points. Next, integrals over the new triangle or quadrilateral subregions are also computed numerically. The above rule has also been applied to the identification of nodes in the subdomains.
4 Example The proposed solution method has been verified with a simple example taken from [1]. Consider the Laplace equation on the domain Ω , shown in Fig. 5
− ∇ 2u ( x, y ) = 0 in Ω ,
(8)
with the boundary conditions on ∂Ω:
∂u/∂y |_{∂Ω1} = 0,   ∂u/∂x |_{∂Ω3} = −2,   u(x, y)|_{∂Ω4} = 80,
u(x, y)|_{∂Ω2} = −4x⁴ + 33x² − 2x + 17.   (9)
The exact solution is equal to:
u(x, y) = 81 − 2x + x² − y².   (10)
The geometric domain Ω can be defined as a Boolean set combination of four primitives
Ω = Ω1 ∩ Ω2 ∩ Ω3 ∩ Ω4,   (11)
Fig. 5. Solution domain Ω of the Laplace equation (8)
defined as
Ω i = {( x, y ) ∈ ℜ 2 : ωi ( x, y ) ≥ 0}.
(12)
Functions ωi , normalized to the first order, have the form:
ω1 = y,   ω2 = (8 − 2x² − y)/√(16x² + 1),   ω3 = x,   ω4 = (√2/2)(y − 1 + x),   (13)
Ω can be expressed in the following way:
ω = ω1 ∧ 0 ω 2 ∧ 0 ω3 ∧ 0 ω 4 , where
(14)
∧ 0 is the R0 – conjunction.
After some manipulations, the structural form of the solution (3) takes the final form
u = g 01 − ωD1 ( g 01 ) + ωg11 − ωD1 (φg 02 ) + φg 02 , where D1 (•) = ∂ω ∂ (•) + ∂ω ∂ (•) and
∂x ∂x
∂y ∂y
(15)
Application of R-Functions Method and Parallel Computations
g 01 =
(−4 x 4 + 33 x 2 − 2 x + 17)ω134 + 80ω123 , ω 234 + ω134 + ω124 + ω123
g11 =
− 2ω1 , ω1 + ω3
ω234 + ω124 , g 02 = ω 234 + ω134 + ω124 + ω123
1029
(16)
where ωijk = ωi ω j ω k .The functional (2) takes the form
⎡⎛ ∂u ⎞ 2 ⎛ ∂u ⎞ 2 ⎤ J (u ) = ∫ ⎢⎜ ⎟ + ⎜⎜ ⎟⎟ ⎥dΩ + 4 ∫ u d∂Ω 3 . ∂x ∂y Ω⎢ ∂Ω 3 ⎣⎝ ⎠ ⎝ ⎠ ⎥⎦
(17)
The formulae for the calculation of the matrix K coefficients and the column vector F , Eq. (5), are given explicitly in [1]. Computations have been made for h=k=0.5, 0.2 and 0.1, which has led to the sets of the basis functions {ψ j }, j = 1,2,3,..., N , where N= 63, 307 and 1124. The quality of the solution has been verified calculating the absolute, relative and least square errors:
ε 1 = max | u exac − u approx | ,
(18)
u exac − u approx |, u exac
(19)
∑ (u exac − u approx ) 2 .
(20)
i
ε 2 = max | i
ε3 =
1 N
i
The results of computations are given in Table 1. It should be noted that the improvement in the calculation accuracy at higher mesh density is smaller than expected. Probably, the reason is that the basic functions ψ j are too simple. The last column in Table 1 presents the data given in [1], where the global approximations are assumed in the form of the third degree complete polynomial. It should be stressed that although the final solution is worse, it is obtained with notably less numerical effort. Table 1. Approximation errors
ε1 ε2 ε3
h=k=0.5 6.00
h=k=0.2 2.34
h=k=0.1 1.98
[1] 10.32
0.15
0.05
0,04
0.41
2.62
0.93
0.85
19.18
1030
M. Detka and C. Cichoń
The graphs of the u function for different vertical and horizontal cross-sections of the Ω domain are shown in Fig. 6.
Fig. 6. Graphs of the u function for the different cross-sections, +++ discretization k=h=0.5, △△△ discretization k=h=0.1,--- polynomial N=3 [1], exact solution
◇◇◇ discretization k=h=0.2,
Fig. 7. Speedup and parallel efficiency in the function of the number of processors (time of the parallel solving of the linear equations set has been omitted), theoretical ideal speedup, +++ discretization k=h=0.5, ◇◇◇ discretization k=h=0.2, △△△ discretization k=h=0.1
As expected, the assumption of simple linear basis functions yields quite satisfactory computational results for suitably dense mesh nodes. Some inaccuracies that occur in the interior of the solution domain probably result from approximate calculations of the function derivatives which appear in the formulae. In the program, these derivatives are calculated using the GNU Scientific Library (GSL). Fig. 7 shows how the speedup and parallel efficiency vary with the number of processors for various problem sizes. The presented algorithm has been parallelized using the Message Passing Interface (MPI, MPICH ver. 1.2.7p1) library functions and the GNU C Compiler (ver. 3.2). It has been tested on a 9-node cluster with two Intel Xeon 2.4 GHz processors and 1 GB of RAM per node. The nodes were connected by Gigabit Ethernet.
5 Conclusions In the paper, the so-called structural solution has been applied to the solution of the elliptic partitial-differential equations. In the algorithm of the computations some properties of the structural solution have been exploited, namely the fact that the solution is composed of two parts, one of them fulfils exactly the boundary conditions and the others fulfils the differential equation in an approximate way. This feature of the solution can be employed effectively if we assume simple, linear basis functions over local simplexes and use the structured grid of nodes. That, together with the “row-by-row” method of computing the coefficients of the resulting system of linear algebraic equations, leads to the effective parallel algorithm of the solution. In the authors’ opinion, the efficiency of the proposed method should be particularly observable in the analysis of the problems with real great domain solutions. On the other hand, if more complex boundary value problems are to be solved, the local basis spline functions of the higher order will probably be needed.
References 1. Grzymkowski, R., Korek, K.: On R-function Theory and its Application in Inverse Problem of Heat Conduction. Information Technology Interfaces. In: Proceedings of the 23rd International Conference on Pula, Croatia, pp. 393–402 (2001) 2. Reddy., J.N.: Applied Functional Analysis and Variational Methods in Engineering. McGraw–Hill Book Company, New York (1986) 3. Rvachew, W.L., Sliesarienko, A.P.: Algiebra łogiki i intierwalnyje prieobrazowanija w krajewych zadaczach (in Russian), Izd. Naukowa Dumka, Kijów (1976) 4. Shapiro, V.: Theory of R-functions and Applications, Technical Report, Cornell University (1988) 5. Wawrzynek., A.: Modelling of solidification and cooling of metals and heat diffusion problems by R-function method (in Polish), Zesz. Nauk. Pol Śląskiej, Mechanika 119, Gliwice, Poland (1994) 6. Yagawa, G.: Node-by-node parallel finite elements: a virtually meshless method. Int. J. Numer. Meth. Eng. 60(1), 69–102 (2004)
Using a (Higher-Order) Magnus Method to Solve the Sturm-Liouville Problem Veerle Ledoux , Marnix Van Daele, and Guido Vanden Berghe Vakgroep Toegepaste Wiskunde en Informatica, Ghent University, Krijgslaan 281-S9, B-9000 Gent, Belgium {Veerle.Ledoux,Marnix.VanDaele,Guido.VandenBerghe}@UGent.be
Abstract. The main purpose of this paper is to describe techniques for the numerical solution of a Sturm-Liouville equation (in its Schr¨ odinger form) by employing a Magnus expansion. With a suitable method to approximate the highly oscillatory integrals which appear in the Magnus series, high order schemes can be constructed. A method of order ten is presented. Even when the solution is highly-oscillatory, the scheme can accurately integrate the problem using stepsizes typically much larger than the solution “wavelength”. This makes the method well suited to be applied in a shooting process to locate the eigenvalues of a boundary value problem.
1
Introduction
In this paper we are concerned with the numerical approximation of problems of the form (1) y (x) = [V (x) − E] y(x), a ≤ x ≤ b This equation is the Sturm-Liouville equation in its Liouville normal form, also called Schr¨ odinger form. Mathematically, Schr¨ odinger problems arise from the standard separation of variables method applied to a linear partial differential equation, and in connection with the inverse scattering transform for solving nonlinear partial differential equations. The Schr¨ odinger equation is also well known as the fundamental equation in quantum physics or quantum chemistry but arises for instance also in geophysical applications, and vibration and heat flow problems in mechanical engineering. Many Schr¨ odinger problems have explicit solutions, and are therefore important in the analytic investigation of different physical models. However most (boundary value) problems cannot be solved analytically, and computationally efficient approximation techniques are of great applicability. Although we focus in this paper on the basic Schr¨ odinger equation in a finite domain and with a smooth potential V (x), our scheme can be extended to a more general Sturm-Liouville problem −(p(x)y (x)) + q(x)y(x) = Ew(x)y(x). The parameter E (also called the eigenvalue) in (1) is unknown, and is to be found subject to some kind of boundary conditions in the endpoints a and b.
Postdoctoral Fellow of the Fund for Scientific Research - Flanders (Belgium) (F.W.O.-Vlaanderen).
It is well known that as E grows, the solutions of (1) become increasingly √ oscillatory. In fact, as E → +∞ the solution “wave length” approaches 2π/ E. This highly oscillatory character of the solution is the reason why standard integrators encounter difficulties in efficiently estimating the higher eigenvalues: a naive integrator will be forced to make increasingly smaller steps severely increasing the running time. By taking advantage of special methods, one can construct numerical algorithms having special advantages over these standard (naive) methods. Pruess suggested to approximate the coefficients of the problem by piecewise constant approximations, solving the problem analytically on the piecewise constant intervals (see [15,16]). For such a coefficient approximation method the step size is not restricted by the oscillations in the solution but the scheme is only second order, unless Richardson extrapolation approximations are made. Two approaches have been suggested to construct higher order schemes, both being natural extensions of the Pruess ideas. A first approach is based on a technique from mathematical physics: the perturbation approximation, leading to the so-called Piecewise Perturbation Methods (PPM) (see [8,9,10,11]). In [2] it was shown that the piecewise perturbation approach may be viewed as the application of a modified Neumann series. The second approach consists in the application of another integral series: the Magnus series. During the last decade, numerical schemes based on the Magnus expansion received a lot of attention due to their preservation of Lie group symmetries (see [5],[14], and references cited therein). More generally, Magnus methods have been applied in spectral theory, Hamiltonian systems, symplectic and unitary integration, control theory, stochastic systems, and quantum chemistry; see [1] for a list of applications. Moan [13] was the first to consider a Magnus method in the context of Sturm-Liouville problems. He applied a Magnus series integrator directly to eq. (1) with a piecewise polynomial V (x). However poor approximations can then be expected for large eigenvalues. Later Degani and Schiff [2,3] and Iserles [4] showed that it is a better idea for oscillatory ordinary differential equations to apply the Magnus series integrator not directly to the equation but to the so-called modified equation. In [12] such a modified Magnus scheme of order eight was constructed for the Schr¨odinger problem and applied in a shooting procedure to compute the eigenvalues of the boundary value problem. In the current paper we present the construction of a modified Magnus method of order ten. In order to reach tenth order, the Filon-based quadrature rule for the oscillatory integrals appearing in the Magnus series, had to be extended to triple integrals. Also this new modified Magnus integrator can be used in a shooting process to efficiently compute eigenvalues.
2 The (Modified) Magnus Method
The differential equation (1) is converted into a system of first-order ODEs
\[ y'(x) = A(x, E)\,y(x), \qquad y(a) = y_0, \qquad (2) \]
where
\[ A(x, E) = \begin{pmatrix} 0 & 1 \\ V(x) - E & 0 \end{pmatrix} \qquad (3) \]
and y = [y(x), y'(x)]^T. Suppose that we have already computed y_i ≈ y(x_i) and that we wish to advance the numerical solution to x_{i+1} = x_i + h_i. We first compute a constant approximation V̄ of the potential function V(x),
\[ \bar V = \frac{1}{h_i}\int_{x_i}^{x_i + h_i} V(x)\,dx. \qquad (4) \]
Next we change the frame of reference by letting
\[ y(x) = e^{(x - x_i)\bar A}\,u(x - x_i), \qquad x_i \le x \le x_{i+1}, \qquad (5) \]
where
\[ \bar A(E) = \begin{pmatrix} 0 & 1 \\ \bar V - E & 0 \end{pmatrix}. \qquad (6) \]
We treat u as our new unknown, which itself obeys the linear differential equation
\[ u'(\delta) = B(\delta, E)\,u(\delta), \qquad \delta = x - x_i \in [0, h_i], \qquad u(0) = y_i, \qquad (7) \]
where
\[ B(\delta, E) = e^{-\delta \bar A}\bigl(A(x_i + \delta) - \bar A\bigr)\,e^{\delta \bar A}. \qquad (8) \]
The matrix B can be computed explicitly. With ξ(Z) and η_0(Z) defined as
\[ \xi(Z) = \begin{cases} \cos(|Z|^{1/2}) & \text{if } Z \le 0, \\ \cosh(Z^{1/2}) & \text{if } Z > 0, \end{cases} \qquad (9) \]
\[ \eta_0(Z) = \begin{cases} \sin(|Z|^{1/2})/|Z|^{1/2} & \text{if } Z < 0, \\ 1 & \text{if } Z = 0, \\ \sinh(Z^{1/2})/Z^{1/2} & \text{if } Z > 0, \end{cases} \qquad (10) \]
we can write B as
\[ B(\delta, E) = \Delta V(\delta)\begin{pmatrix} \delta\,\eta_0(Z_{2\delta}) & \dfrac{1 - \xi(Z_{2\delta})}{2(E - \bar V)} \\[4pt] -\dfrac{1 + \xi(Z_{2\delta})}{2} & -\delta\,\eta_0(Z_{2\delta}) \end{pmatrix}, \qquad (11) \]
where ΔV(δ) = V̄ − V(x_i + δ) and Z_γ = Z(γ) = (V̄ − E)γ². Note that the PPM formulation in, e.g., [8,9] uses the same functions ξ(Z) and η_0(Z). We apply a Magnus method to the modified equation (7). The Magnus expansion is then (where the bracket denotes the matrix commutator)
\[ \sigma(\delta) = \sigma_1(\delta) + \sigma_2(\delta) + \sigma_3(\delta) + \sigma_4(\delta) + \dots, \qquad (12) \]
where
\[ \sigma_1(\delta) = \int_0^{\delta} B(x)\,dx, \qquad \sigma_2(\delta) = -\frac{1}{2}\int_0^{\delta}\!\!\int_0^{x_1} [B(x_2), B(x_1)]\,dx_2\,dx_1, \]
\[ \sigma_3(\delta) = \frac{1}{12}\int_0^{\delta}\Bigl[\int_0^{x_1} B(x_2)\,dx_2, \Bigl[\int_0^{x_1} B(x_2)\,dx_2, B(x_1)\Bigr]\Bigr]\,dx_1, \qquad \sigma_4(\delta) = \frac{1}{4}\int_0^{\delta}\Bigl[\int_0^{x_1}\Bigl[\int_0^{x_2} B(x_3)\,dx_3, B(x_2)\Bigr]dx_2, B(x_1)\Bigr]\,dx_1, \]
and u(δ) = e^{σ(δ)} y_i, δ ≥ 0. Thus, to compute y_{i+1} = e^{h\bar A} e^{σ(h)} y_i with h = h_i, we need to approximate σ(h) by truncating the expansion (12) and replacing the integrals by quadrature (see the next section). The 2 × 2 matrix exponentials e^{h\bar A} and e^{σ(h)} can be written down explicitly. e^{h\bar A} is the matrix exponential of a constant matrix, and thus
\[ \operatorname{expm}\begin{pmatrix} 0 & h \\ h(\bar V - E) & 0 \end{pmatrix} = \begin{pmatrix} \xi(Z_h) & h\,\eta_0(Z_h) \\ Z_h\,\eta_0(Z_h)/h & \xi(Z_h) \end{pmatrix}, \qquad Z_h = Z(h). \qquad (13) \]
To write down an expression for e^{σ(h)}, we note that σ(h) is always a two-by-two matrix with zero trace. For such matrices the following is true:
\[ \operatorname{expm}\begin{pmatrix} a & b \\ c & -a \end{pmatrix} = \begin{pmatrix} \xi(\omega) + a\,\eta_0(\omega) & b\,\eta_0(\omega) \\ c\,\eta_0(\omega) & \xi(\omega) - a\,\eta_0(\omega) \end{pmatrix}, \qquad \omega = a^2 + bc. \qquad (14) \]
Here a, b, c, ω are functions of x and E.
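The closed forms (9), (10) and (14) are straightforward to code. The following short Python sketch (with assumed helper names xi, eta0 and expm_traceless, which are not part of the paper) evaluates ξ, η_0 and the exponential of a traceless 2 × 2 matrix; expm_hAbar reproduces (13) as the special case a = 0, b = h, c = h(V̄ − E).

import math

def xi(Z):
    # xi(Z) = cos(|Z|^(1/2)) for Z <= 0, cosh(Z^(1/2)) for Z > 0, cf. (9)
    return math.cos(math.sqrt(-Z)) if Z <= 0 else math.cosh(math.sqrt(Z))

def eta0(Z):
    # eta_0(Z) of (10)
    if Z < 0:
        r = math.sqrt(-Z)
        return math.sin(r) / r
    if Z == 0:
        return 1.0
    r = math.sqrt(Z)
    return math.sinh(r) / r

def expm_traceless(a, b, c):
    # expm([[a, b], [c, -a]]) via (14) with omega = a^2 + b*c
    w = a * a + b * c
    xi_w, eta_w = xi(w), eta0(w)
    return [[xi_w + a * eta_w, b * eta_w],
            [c * eta_w, xi_w - a * eta_w]]

def expm_hAbar(h, Vbar, E):
    # e^{h*Abar} of (13): Abar = [[0, 1], [Vbar - E, 0]] scaled by h
    return expm_traceless(0.0, h, h * (Vbar - E))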
3 Integration of the Integrals
As shown in [4], the regular Magnus quadrature formulae ([7]) are useless in the presence of high oscillation. For E ≫ V̄ the matrix function B in (11) is highly oscillatory, and quadrature must be used that respects the high oscillation. Filon-type quadrature can be used to approximate highly oscillatory integrals to a suitable precision in a small number of function evaluations per step. As in [12], we apply Filon-type quadrature not only in the oscillatory region E > V̄, but also in the non-oscillatory region E < V(x) (where it is just as good as regular Gauss-Christoffel Magnus quadrature). The univariate Filon rule is discussed in [4] and has the nice property that, while regular quadrature is ineffective in the presence of high oscillation, Filon quadrature delivers accuracy which actually improves as the oscillation increases. Here we use this Filon rule to approximate the univariate (modified) Magnus integral ∫_0^h B(δ) dδ. In fact, this means that ΔV(δ) in (11) is replaced by the Lagrange polynomial L_{ΔV}(δ) = Σ_{k=1}^{ν} ΔV(c_k h) ℓ_k(δ), where ℓ_k is the kth cardinal polynomial of Lagrangian interpolation and c_1, c_2, ..., c_ν are distinct quadrature nodes.
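To illustrate the Filon idea in its simplest form, the sketch below (an illustration only, not the specific rule of this paper, which works with the kernels ξ and η_0) approximates ∫_0^h f(δ) cos(aδ) dδ by interpolating the smooth factor f at a few nodes and integrating the moments ∫_0^h δ^k cos(aδ) dδ exactly, so that the accuracy does not degrade as the frequency a grows.

import numpy as np

def cos_sin_moments(nmax, a, h):
    """C[k] = int_0^h d^k cos(a d) dd and S[k] = int_0^h d^k sin(a d) dd."""
    C = np.zeros(nmax + 1)
    S = np.zeros(nmax + 1)
    C[0] = np.sin(a * h) / a
    S[0] = (1.0 - np.cos(a * h)) / a
    for k in range(1, nmax + 1):
        C[k] = h**k * np.sin(a * h) / a - (k / a) * S[k - 1]
        S[k] = -h**k * np.cos(a * h) / a + (k / a) * C[k - 1]
    return C, S

def filon_cos(f, a, h, deg=4):
    """Approximate int_0^h f(d) cos(a d) dd by interpolating f at deg+1 nodes."""
    nodes = 0.5 * h * (1.0 - np.cos(np.linspace(0.0, np.pi, deg + 1)))
    coeffs = np.polyfit(nodes, f(nodes), deg)[::-1]   # coeffs[k] multiplies d**k
    C, _ = cos_sin_moments(deg, a, h)
    return float(np.dot(coeffs, C[: deg + 1]))

# Example: slowly varying f against a rapidly oscillating cosine.
if __name__ == "__main__":
    f = lambda d: 1.0 + d - 0.3 * d**2
    print(filon_cos(f, a=200.0, h=1.0))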
The resulting integrals can then be solved analytically. An alternative way to obtain the interpolating polynomial L_{ΔV}(δ) is to approximate V by a series of shifted Legendre polynomials,
\[ V(x_i + \delta) \approx \sum_{s=0}^{\nu-1} V_s\,h^s\,P_s^*(\delta/h). \qquad (15) \]
By the method of least squares the expressions for the coefficients V_s are obtained:
\[ V_s = \frac{2s + 1}{h^{s+1}}\int_0^h V(x_i + \delta)\,P_s^*(\delta/h)\,d\delta, \qquad s = 0, 1, 2, \dots. \qquad (16) \]
It can then be noted that V̄ = V_0 and ΔV(δ) ≈ L_{ΔV}(δ) = −Σ_{s=1}^{ν−1} V_s h^s P_s^*(δ/h). To compute the integrals (16), tenth-order Gauss-Legendre quadrature is used, requiring ν = 5 function evaluations of V (Gauss-Lobatto is another option). With
ξ = ξ(Z_{2h}), η_0 = η_0(Z_{2h}), Z_{2h} = 4Z_h = 4(V̄ − E)h²,
and V̂_s = h^{s+1} V_s, s = 1, ..., 4, we then obtain the following:
\[ \frac{1}{h}\int_0^h \Delta V(\delta)\,\delta\,\eta_0(Z_{2\delta})\,d\delta \approx \Bigl(\tfrac{1}{2}\hat V_1 + \tfrac{3}{2}\hat V_2 + 3\hat V_3 + 5\hat V_4\Bigr)\frac{\eta_0}{Z_h} + \frac{(-\hat V_1 - \hat V_2 - \hat V_3 - \hat V_4)\,\xi - \hat V_1 + \hat V_2 - \hat V_3 + \hat V_4}{4Z_h} + \frac{(-3\hat V_2 - 15\hat V_3 - 45\hat V_4)\,\xi + 3\hat V_2 - 15\hat V_3 + 45\hat V_4}{4Z_h^2} + \frac{(15\hat V_3 + 105\hat V_4)\,\eta_0}{2Z_h^2} + \frac{-\tfrac{105}{4}\hat V_4\,\xi + \tfrac{105}{4}\hat V_4}{Z_h^3}, \]
\[ \int_0^h \Delta V(\delta)\,(1 + \xi(Z_{2\delta}))\,d\delta = \int_0^h \Delta V(\delta)\,\xi(Z_{2\delta})\,d\delta, \qquad \int_0^h \Delta V(\delta)\,(1 - \xi(Z_{2\delta}))\,d\delta = -\int_0^h \Delta V(\delta)\,\xi(Z_{2\delta})\,d\delta, \qquad (17) \]
\[ \int_0^h \Delta V(\delta)\,\xi(Z_{2\delta})\,d\delta \approx (\hat V_1 + \hat V_2 + \hat V_3 + \hat V_4)\,\eta_0 + \frac{(3\hat V_2 + 15\hat V_3 + 45\hat V_4)\,\eta_0}{Z_h} + \frac{(-\hat V_1 - 3\hat V_2 - 6\hat V_3 - 10\hat V_4)\,\xi + \hat V_1 - 3\hat V_2 + 6\hat V_3 - 10\hat V_4}{2Z_h} + \frac{210\hat V_4\,\eta_0 + (-15\hat V_3 - 105\hat V_4)\,\xi + 15\hat V_3 - 105\hat V_4}{2Z_h^2}, \qquad (18) \]
which allows us to approximate ∫_0^h B(δ) dδ. Including only this first Magnus term is sufficient to obtain a fourth-order method.
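The per-interval preprocessing of this section can be sketched as follows in Python (assumed function names): the coefficients V_s of (16) are computed with a 5-point Gauss-Legendre rule, V̄ = V_0, and the scaled coefficients V̂_s = h^{s+1} V_s then enter (17)-(18).

import numpy as np
from numpy.polynomial.legendre import leggauss, legval

def shifted_legendre(s, t):
    # P_s^*(t) = P_s(2t - 1), the shifted Legendre polynomial on [0, 1]
    c = np.zeros(s + 1)
    c[s] = 1.0
    return legval(2.0 * t - 1.0, c)

def legendre_coefficients(V, xi, h, nu=5):
    # V_s = (2s+1)/h^(s+1) * int_0^h V(x_i + delta) P_s^*(delta/h) d(delta), cf. (16),
    # evaluated with a nu-point Gauss-Legendre rule (exact up to degree 2*nu - 1).
    g, w = leggauss(nu)              # nodes/weights on [-1, 1]
    t = 0.5 * (g + 1.0)              # delta/h in [0, 1]
    wt = 0.5 * w                     # weights for int_0^1 ... d(delta/h)
    Vvals = np.array([V(xi + h * tk) for tk in t])
    return np.array([(2 * s + 1) / h**s * np.sum(wt * Vvals * shifted_legendre(s, t))
                     for s in range(nu)])     # Vbar = Vs[0]

def scaled_coefficients(Vs, h):
    # Vhat_s = h^(s+1) * V_s for s = 1, ..., 4
    return np.array([h**(s + 1) * Vs[s] for s in range(1, 5)])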
However, to construct a method of order ten, we need to include more Magnus terms. First we consider the approximation of σ_2. We extend the Filon idea to the computation of the double integral. As in [12] we write the double integral as
\[ \int_0^h\!\!\int_0^{\delta_1} [B(\delta_2), B(\delta_1)]\,d\delta_2\,d\delta_1 = 2\Bigl(\int_0^h\!\!\int_0^{\delta_1} \Delta V(\delta_1)\Delta V(\delta_2)\,K_1(\delta_1,\delta_2)\,d\delta_2\,d\delta_1\Bigr) U_1 + 2\Bigl(\int_0^h\!\!\int_0^{\delta_1} \Delta V(\delta_1)\Delta V(\delta_2)\,K_2(\delta_1,\delta_2)\,d\delta_2\,d\delta_1\Bigr) U_2 + 2\Bigl(\int_0^h\!\!\int_0^{\delta_1} \Delta V(\delta_1)\Delta V(\delta_2)\,K_3(\delta_1,\delta_2)\,d\delta_2\,d\delta_1\Bigr) U_3, \qquad (19) \]
where K_1(x, y) = y η_0(Z_{2y}) − x η_0(Z_{2x}), K_2(x, y) = ξ(Z_{2x}) − ξ(Z_{2y}), K_3(x, y) = (x − y) η_0(Z_{2(x−y)}) and
\[ U_1 = \begin{pmatrix} 0 & \frac{1}{2(E - \bar V)} \\ \frac{1}{2} & 0 \end{pmatrix}, \qquad U_2 = \begin{pmatrix} -\frac{1}{4(E - \bar V)} & 0 \\ 0 & \frac{1}{4(E - \bar V)} \end{pmatrix}, \qquad U_3 = \begin{pmatrix} 0 & \frac{1}{2(E - \bar V)} \\ -\frac{1}{2} & 0 \end{pmatrix}. \qquad (20) \]
The three integrals in (19) must be replaced by quadrature. We again replace ΔV by the polynomial L_{ΔV} and solve the resulting integrals analytically (Maple). For brevity we do not list the full expressions of the resulting formulae here; we show only the expression for the third integral:
\[ \int_0^h\!\!\int_0^{\delta_1} \Delta V(\delta_1)\Delta V(\delta_2)\,K_3(\delta_1,\delta_2)\,d\delta_2\,d\delta_1 \approx \frac{\hat V_4^2 + \hat V_2^2 - \hat V_3^2 - \hat V_1^2 + 2(\hat V_4\hat V_2 - \hat V_3\hat V_1)}{4Z_h} + \Bigl( \frac{190\hat V_4^2 - \hat V_1^2 + 15\hat V_2^2 - 66\hat V_3^2 - 42\hat V_3\hat V_1 + 156\hat V_4\hat V_2}{4Z_h^2} + \frac{9\hat V_2^2 - 405\hat V_3^2 + 4335\hat V_4^2 - 30\hat V_3\hat V_1 + 1110\hat V_4\hat V_2}{4Z_h^3} + \frac{\hat V_1^2 - 3\hat V_2^2 + 6\hat V_3^2 - 10\hat V_4^2 - 225\hat V_3^2 + 20475\hat V_4^2 + 630\hat V_4\hat V_2}{4Z_h^4} + \frac{11025\hat V_4^2}{4Z_h^5} \Bigr)\eta_0 + \Bigl( \frac{7\hat V_3\hat V_1 - 13\hat V_4\hat V_2}{4Z_h^2} + \frac{-1110\hat V_4^2 - 270\hat V_4\hat V_2 + 30\hat V_3\hat V_1 - 9\hat V_2^2 + 105\hat V_3^2}{4Z_h^3} + \frac{225\hat V_3^2 - 630\hat V_4\hat V_2 - 5775\hat V_4^2}{4Z_h^4} - \frac{11025\hat V_4^2}{4Z_h^5} \Bigr)\xi + \frac{-\hat V_1^2/12 - \hat V_2^2/20 - \hat V_3^2/28 - \hat V_4^2/36}{Z_h} + \frac{-7\hat V_4\hat V_2 - 5\hat V_3\hat V_1}{4Z_h^2}. \qquad (21) \]
As shown in [12], the inclusion of this second Magnus term leads to an eighth-order algorithm. Next we consider the approximation of σ_3 and σ_4 in order to obtain a tenth-order scheme. The same procedure is applied again: the function
ΔV appearing in the expressions for σ_3 and σ_4 is replaced by a polynomial. By symbolic computation it can be shown that it is sufficient here to replace ΔV(δ) by a third-degree polynomial. Therefore we take ΔV(δ) ≈ −Σ_{s=1}^{3} V_s h^s P_s^*(δ/h), where the coefficients V_s are still the same ones as before. Also, only the terms where the degree in h is smaller than 11 have to be considered: e.g., we do not take into account the V̂_3^3-term. We used the symbolic software package Maple to compute the expressions of the 2 × 2 matrix ς = σ_3 + σ_4. As an illustration, we show some terms of the diagonal elements:
\[ \varsigma_{11} = -\varsigma_{22} = \frac{135\hat V_1^2\hat V_3 + 49\hat V_1^3 + 240\hat V_1\hat V_2\hat V_3 + 45\hat V_2^3 + 150\hat V_1^2\hat V_2 + 123\hat V_1\hat V_2^2}{480Z_h^2} + \Bigl( \frac{961\hat V_1^2\hat V_2 + 105\hat V_1^3 + 8382\hat V_1\hat V_3\hat V_2 + 2475\hat V_1^2\hat V_3 + 2025\hat V_1\hat V_2^2 + 1161\hat V_2^3}{96Z_h^3} + \frac{5859\hat V_1\hat V_2^2 + 59662\hat V_1\hat V_3\hat V_2 + 7245\hat V_1^2\hat V_3 + 8055\hat V_2^3 + 736\hat V_1^2\hat V_2}{32Z_h^4} + \frac{549\hat V_2^3 + 16305\hat V_1\hat V_3\hat V_2/4}{Z_h^5} \Bigr)\xi + \dots \qquad (22) \]
The formulas in (17), (21) and (22) may be problematic for E close to V̄ due to near-cancellation of like terms. Therefore alternative formulas are used for small Z_h values (see [12]). These alternative formulas are obtained by applying a Taylor expansion. The alternative for expression (17) is then, e.g.,
\[ \frac{1}{h}\int_0^h \Delta V(\delta)\,\delta\,\eta_0(Z_{2\delta})\,d\delta \approx \Bigl(\tfrac{1}{3}\hat V_1 + \tfrac{1}{15}\hat V_2\Bigr)Z_h + \Bigl(\tfrac{4}{45}\hat V_1 + \tfrac{4}{105}\hat V_2 + \tfrac{1}{105}\hat V_3 + \tfrac{1}{945}\hat V_4\Bigr)Z_h^2 + \Bigl(\tfrac{1}{105}\hat V_1 + \tfrac{1}{189}\hat V_2 + \tfrac{2}{945}\hat V_3 + \tfrac{2}{3465}\hat V_4\Bigr)Z_h^3 + \dots \qquad (23) \]
The alternative formulae are used in the interval |Z_h| < 0.15; in this case it is found to be sufficient to go up to Z_h^8.
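The switching strategy can be illustrated generically in Python (this is not the paper's actual alternative formula, only the pattern): a quantity such as (1 − ξ(Z))/Z, which suffers cancellation as Z → 0, is evaluated from the closed form above the threshold and from a truncated Taylor series below it.

import math

def one_minus_xi_over_Z(Z, threshold=0.15):
    # (1 - xi(Z))/Z = -1/2 - Z/24 - Z^2/720 - ... for both signs of Z
    if abs(Z) >= threshold:
        xi = math.cos(math.sqrt(-Z)) if Z <= 0 else math.cosh(math.sqrt(Z))
        return (1.0 - xi) / Z
    total, term = 0.0, -0.5          # term for Z^0 is -1/2!
    for k in range(1, 10):           # series up to Z^8
        total += term
        term *= Z / ((2 * k + 1) * (2 * k + 2))
    return total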
4 Shooting for Eigenvalues
As mentioned before, a shooting procedure can be used to locate the eigenvalues of the boundary value problem associated with (1). The modified Magnus method presented here is well suited for the repeated solution of the initial value problems which appear in the shooting procedure. These initial value problems are solved for a fixed potential V but for different values of E. For our modified Magnus integrator, a mesh can be constructed which depends only on V and not on E (a procedure similar to the one in [12] can be used to construct the mesh). This mesh has to be computed only once and is then used in all eigenvalue computations.
Algorithm 1. A Sturm-Liouville solver based on a modified Magnus method
1: Use the stepsize selection algorithm to construct a mesh a = x_0 < x_1 < ... < x_n = b
2: for i = 1 to n do
3:   Compute V̄ and V_s, s = 1, ..., 4, for the ith interval (Gauss-Legendre with 5 nodes).
4: end for
5: Choose a meshpoint x_m (0 ≤ m ≤ n) as the matching point.
6: Set up initial values for y_L satisfying the BC at a and initial values for y_R satisfying the BC at b. Choose a trial value for E.
7: repeat
8:   for i = 0 to m − 1 do
9:     y_L(x_{i+1}) = e^{h_i Ā} e^{σ(h_i)} y_L(x_i)
10:  end for
11:  for i = n down to m + 1 do
12:    y_R(x_{i−1}) = e^{−σ(h_i)} e^{−h_i Ā} y_R(x_i)
13:  end for
14:  Adjust E by comparing y_L(x_m) with y_R(x_m) (Newton iteration).
15: until E sufficiently accurate
Moreover, the value V̄ and the coefficients V_s are computed and stored once and for all before the start of the shooting process. Algorithm 1 shows the basic shooting procedure, in which the modified Magnus algorithm is used to propagate the left-hand and right-hand solutions. For more details on such a shooting procedure we refer to [12].
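A schematic Python rendering of this shooting procedure is given below; the per-step propagators step_forward and step_backward (standing for multiplication by e^{h_i Ā} e^{σ(h_i)} and its inverse) and the bisection-based root search are assumptions made for the sketch, whereas the paper adjusts E by Newton iteration.

import numpy as np

def mismatch(E, mesh, m, step_forward, step_backward, yL0, yR0):
    """Propagate yL from a to x_m and yR from b to x_m; return a matching
    function (Wronskian-like determinant) whose zeros are eigenvalues."""
    yL = np.array(yL0, dtype=float)
    for i in range(0, m):                    # left-hand sweep
        yL = step_forward(i, E, yL)
    yR = np.array(yR0, dtype=float)
    for i in range(len(mesh) - 1, m, -1):    # right-hand sweep
        yR = step_backward(i, E, yR)
    return yL[0] * yR[1] - yL[1] * yR[0]

def shoot(E_lo, E_hi, mesh, m, step_forward, step_backward, yL0, yR0, tol=1e-12):
    """Bisection on the mismatch; assumes [E_lo, E_hi] brackets one eigenvalue."""
    f_lo = mismatch(E_lo, mesh, m, step_forward, step_backward, yL0, yR0)
    while E_hi - E_lo > tol:
        E_mid = 0.5 * (E_lo + E_hi)
        f_mid = mismatch(E_mid, mesh, m, step_forward, step_backward, yL0, yR0)
        if f_lo * f_mid <= 0.0:
            E_hi = E_mid
        else:
            E_lo, f_lo = E_mid, f_mid
    return 0.5 * (E_lo + E_hi)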
5 Numerical Examples
As test potentials we take two well-known test problems from the literature [17]. The Coffey-Evans problem is a Schrödinger equation with
\[ V(x) = -2\beta\cos(2x) + \beta^2\sin^2(2x) \qquad (24) \]
and y(−π/2) = y(π/2) = 0 as boundary conditions. Here we take β = 30. The second problem is the Woods-Saxon problem, defined by
\[ V(x) = -50\,\frac{1 - \dfrac{5t}{3(1+t)}}{1 + t} \qquad (25) \]
with t = e^{(x−7)/0.6} over the interval [0, 15]. The eigenvalue spectrum of this Woods-Saxon problem contains 14 eigenenergies E_0, ..., E_13. We use here an equidistant mesh; note, however, that an automatic stepsize selection algorithm can be constructed as in [12]. We performed eigenvalue computations at different step lengths. The absolute errors ΔE_k = E_k^{exact} − E_k^{comput} are collected in Table 1. For the Coffey-Evans problem some lower eigenvalues come in very close clusters, and to distinguish between them the search algorithm must rely on a highly accurate integrator. Our modified Magnus method deals very well with these close eigenvalues.
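For reference, the two test potentials (24)-(25) written out in Python:

import numpy as np

def coffey_evans(x, beta=30.0):
    # V(x) = -2*beta*cos(2x) + beta^2 * sin^2(2x) on [-pi/2, pi/2], cf. (24)
    return -2.0 * beta * np.cos(2.0 * x) + beta**2 * np.sin(2.0 * x) ** 2

def woods_saxon(x):
    # V(x) = -50 * (1 - 5t/(3(1+t))) / (1+t), t = exp((x-7)/0.6), on [0, 15], cf. (25)
    t = np.exp((x - 7.0) / 0.6)
    return -50.0 * (1.0 - 5.0 * t / (3.0 * (1.0 + t))) / (1.0 + t)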
Table 1. Absolute values of the (absolute) errors ΔE_k for the Coffey-Evans and Woods-Saxon problems. n is the number of (equidistant) steps. aE-b means a·10^{-b}.

Coffey-Evans problem
 k   E_k                        n = 128   n = 256
 0   0.0000000000000000         3.4E-10   2.2E-13
 1   117.9463076620687587       1.5E-9    1.4E-12
 2   231.6649292371271088       2.1E-9    1.1E-12
 3   231.6649293129610125       1.1E-9    1.1E-12
 4   231.6649293887949167       2.1E-9    7.9E-13
 5   340.8882998096130157       4.5E-9    4.4E-12
 6   445.2830895824354620       4.4E-9    3.6E-12
 8   445.2832550313310036       4.4E-9    2.7E-12
10   637.6822498740469991       4.8E-9    4.2E-12
15   802.4787986926240517       2.8E-9    1.7E-12
20   951.8788067965913828       2.3E-9    3.7E-12
30   1438.2952446408023577      2.0E-9    2.5E-12
40   2146.4053605398535082      1.5E-9    2.7E-12
50   3060.9234915114205911      1.0E-9    2.7E-12

Woods-Saxon problem
 k   E_k                  n = 64    n = 128
 0   -49.45778872808258   3.9E-11   8.5E-14
 1   -48.14843042000639   3.8E-10   2.6E-13
 2   -46.29075395446623   2.0E-9    1.6E-12
 3   -43.96831843181467   7.2E-9    6.3E-12
 4   -41.23260777218090   2.0E-8    1.9E-12
 5   -38.12278509672854   4.8E-8    4.6E-11
 6   -34.67231320569997   9.7E-8    9.7E-11
 7   -30.91224748790910   1.7E-7    1.7E-10
 8   -26.87344891605993   2.8E-7    2.9E-10
 9   -22.58860225769320   3.9E-7    4.3E-10
10   -18.09468828212811   5.1E-7    5.7E-10
11   -13.43686904026007   5.9E-7    6.7E-10
12   -8.67608167074520    6.0E-7    7.2E-10
13   -3.90823248120989    5.0E-7    6.6E-10
No systematic deterioration of the accuracy is observed as k increases. This tenth-order method of course gives more accurate approximations than the eighth-order method of [12]: for the first eigenvalue of the Coffey-Evans problem, for example, that method gives an error of 1.0E-7 (n = 128) and 4.0E-10 (n = 256).
6 Conclusion
In this paper we discussed a modified Magnus method of order ten for the integration of a Sturm-Liouville problem in Schrödinger form. To this end, the modified Magnus method described earlier by Degani and Schiff and by Iserles had to be extended to the non-oscillatory region E < V, and a Filon-like quadrature rule had to be defined for the multivariate integrals appearing in the Magnus series. The modified Magnus method can be applied in a shooting procedure in order to compute the eigenvalues of a boundary value problem. Since an E-independent mesh can be constructed, all function evaluations can be done before the actual shooting process, which makes the method well suited to compute large batches of eigenvalues or just particularly large eigenvalues.
References
1. Blanes, S., Casas, F., Oteo, J.A., Ros, J.: Magnus and Fer expansions for matrix differential equations: the convergence problems. J. Phys. A: Math. Gen. 31, 259–268 (1998)
2. Degani, I., Schiff, J.: RCMS: Right Correction Magnus Series approach for oscillatory ODEs. J. Comput. Appl. Math. 193, 413–436 (2006)
3. Degani, I.: RCMS - Right Correction Magnus Schemes for oscillatory ODEs and cubature formulae and commuting extensions. Thesis (PhD), Weizmann Institute of Science (2004)
4. Iserles, A.: On the numerical quadrature of highly oscillatory integrals I: Fourier transforms. IMA J. Numer. Anal. 24, 365–391 (2004)
5. Iserles, A., Nørsett, S.P.: On the solution of linear differential equations in Lie groups. Phil. Trans. R. Soc. Lond. A 357, 983–1019 (1999)
6. Iserles, A.: On the global error of discretization methods for highly-oscillatory ordinary differential equations. BIT 42, 561–599 (2002)
7. Iserles, A., Munthe-Kaas, H.Z., Nørsett, S.P., Zanna, A.: Lie-group methods. Acta Numerica 9, 215–365 (2000)
8. Ixaru, L.G.: Numerical Methods for Differential Equations and Applications. Reidel, Dordrecht-Boston-Lancaster (1984)
9. Ixaru, L.G., De Meyer, H., Vanden Berghe, G.: SLCPM12 - A program for solving regular Sturm-Liouville problems. Comput. Phys. Commun. 118, 259–277 (1999)
10. Ledoux, V., Van Daele, M., Vanden Berghe, G.: CP methods of higher order for Sturm-Liouville and Schrödinger equations. Comput. Phys. Commun. 162, 151–165 (2004)
11. Ledoux, V., Van Daele, M., Vanden Berghe, G.: MATSLISE: A MATLAB package for the numerical solution of Sturm-Liouville and Schrödinger equations. ACM Trans. Math. Software 31, 532–554 (2005)
12. Ledoux, V., Van Daele, M., Vanden Berghe, G.: Efficient numerical solution of the 1D Schrödinger eigenvalue problem using Magnus integrators. IMA J. Numer. Anal. (submitted)
13. Moan, P.C.: Efficient approximation of Sturm-Liouville problems using Lie group methods. Technical report, DAMTP, University of Cambridge (1998)
14. Munthe-Kaas, H., Owren, B.: Computations in a free Lie algebra. Phil. Trans. R. Soc. Lond. A 357, 957–981 (1999)
15. Pruess, S.: Solving linear boundary value problems by approximating the coefficients. Math. Comp. 27, 551–561 (1973)
16. Pruess, S., Fulton, C.T.: Mathematical software for Sturm-Liouville problems. ACM Trans. Math. Software 19, 360–376 (1993)
17. Pryce, J.D.: Numerical Solution of Sturm-Liouville Problems. Clarendon Press (1993)
Stopping Criterion for Adaptive Algorithm
Sanjay Kumar Khattri
Stord/Haugesund University College, Bjørnsonsgt. 45, Haugesund 5528, Norway
[email protected]
Abstract. An adaptive algorithm involves several parameters, for example an adaptivity index, an adaptivity criterion and a stopping criterion. The adaptivity index drives the adaptive algorithm by selecting some elements for further refinement. Apart from this driving force, another important aspect of an algorithm is its stopping criterion. We present a new stopping criterion for adaptive algorithms.
1 Introduction
The convergence rate of the finite volume method on uniform meshes depends on the regularity or singularity of the solution. We develop a finite volume method on adaptive meshes and present its pointwise (infinity-norm) convergence. It is shown that the convergence of the presented adaptive method is independent of the regularity or singularity of the underlying problem. An adaptive technique depends on several factors, such as the error indicator and the adaptive algorithm; we present a simple adaptive criterion and adaptive algorithm. Now let us consider the steady-state pressure equation of a single phase flowing in a porous medium Ω,
\[ -\operatorname{div}(K\,\operatorname{grad} p) = f \quad \text{in } \Omega, \qquad (1) \]
\[ p(x, y) = p^{D} \quad \text{on } \partial\Omega_D. \qquad (2) \]
Here, Ω is a polyhedral domain in R², the source function f is assumed to be in L²(Ω), and the diagonal tensor coefficient K(x, y) is positive definite and piecewise constant; K is allowed to be discontinuous in space. In porous media flow [7,4,1], the unknown function p = p(x, y) represents the pressure of a single phase, K is the permeability or hydraulic conductivity of the porous medium, and the velocity u of the phase is given by Darcy's law as u = −K grad p. The next section presents the finite volume method and the adaptive algorithm.
2 Finite Volume Discretization and Adaptive Algorithm
For solving partial differential equations (PDEs) in a domain by numerical methods such as the finite volume method, the domain is divided into smaller elements called finite volumes or cells.
Fig. 1. Computation of flux across an edge: (a) flux on a matching grid; (b) flux on a non-matching grid.
Finite volume discretization of Equation (1) for a finite volume is given as [7]
\[ \sum_{i=1}^{4} F_i = \int_V f\,d\tau. \qquad (3) \]
Here, F_i is the flux through interface i. Now let us compute the flux for the interface MN shared by cells 1 and 2 (see Fig. 1(a)). The flux [12,13,14,7] through the edge MN is given as
\[ F_{MN} = \Phi_{MN}\,(p_2 - p_1), \qquad (4) \]
where the scalar Φ_MN is referred to as the transmissibility of the interface MN and is given as
\[ \Phi_{MN} = \frac{l\,K_1 K_2}{h_1 h_2\,(K_1/h_1 + K_2/h_2)}. \qquad (5) \]
Here, K_1 and K_2 refer to the permeabilities of cells 1 and 2 in Fig. 1(a). The perpendicular distance of the interface MN from the center of cell 1 is h_1; similarly, h_2 is the perpendicular distance of the interface MN from the center of cell 2. The length of the interface MN is l. Adaptive discretization can result in a non-matching grid as shown in Fig. 1(b); we use the same flux approximation for computing the flux on a non-matching grid. We use the following expression for computing the error of cell i in a mesh [7]:
\[ \epsilon_i \overset{\mathrm{def}}{=} \|f\|_{L^2(\Omega_i)}\,|\Omega_i|^{1/2} + \|(K\,\nabla p_h)\cdot\hat n\|_{L^2(\partial\Omega_i)}\,|\partial\Omega_i|^{1/2}. \qquad (6) \]
Here, |Ω_i| is the area of the finite volume, |∂Ω_i| is the circumference of the finite volume, and n̂ is the unit outward normal. The quantity ||(K ∇p_h)·n̂||_{L²(∂Ω_i)} |∂Ω_i|^{1/2} is the total flux associated with cell i. Let us further define a quantity named the adaptivity index for cell i in a mesh,
\[ \eta_i \overset{\mathrm{def}}{=} \frac{\epsilon_i}{\max_{j\in\mathrm{cells}} \epsilon_j}. \qquad (7) \]
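A small Python sketch of the quantities (4)-(7) defined above (the data layout and function names are assumptions):

import numpy as np

def transmissibility(K1, K2, h1, h2, l):
    # Phi_MN = l * K1 * K2 / (h1 * h2 * (K1/h1 + K2/h2)), Equation (5)
    return l * K1 * K2 / (h1 * h2 * (K1 / h1 + K2 / h2))

def flux(Phi, p1, p2):
    # F_MN = Phi_MN * (p2 - p1), Equation (4)
    return Phi * (p2 - p1)

def error_indicator(f_norm, area, flux_norm, perimeter):
    # eps_i = ||f|| * |Omega_i|^(1/2) + ||(K grad p_h).n|| * |dOmega_i|^(1/2), Equation (6)
    return f_norm * np.sqrt(area) + flux_norm * np.sqrt(perimeter)

def adaptivity_index(eps):
    # eta_i = eps_i / max_j eps_j, Equation (7): values lie in [0, 1]
    eps = np.asarray(eps, dtype=float)
    return eps / eps.max()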
It can be seen from the above definition of the adaptivity index that for a cell with zero error (ε_i = 0) the adaptivity index η_i is zero, and for the cell with maximum error η_i is 1; thus for any cell the adaptivity index η_i lies in the range [0, 1]. The adaptivity index (7) is the driving force of Algorithm 1: it selects the finite volumes for further refinement. Apart from the driving force, another important aspect of an algorithm is its stopping criteria. The two obvious stopping criteria of an adaptive algorithm are the maximum allowable number of degrees of freedom (DOF_max), or the maximum allowed mesh refinement, and the maximum allowed number of adaptive iteration steps, "Iter ≤ Iter_max". For defining a third criterion, let us compute the maximum error associated with a finite volume (cell) on the mesh formed after k iterative steps of the algorithm. Let this error be ξ_k; thus ξ_k = max_{i∈cells} ε_i.
Thus, ξ_0 is the maximum error of a cell on the initial mesh. Our third stopping criterion is defined as ξ_k/ξ_0 ≥ tol. The quotient "ξ_k/ξ_0" measures the error reduction after k iteration steps of the adaptive algorithm: ξ_k denotes the maximum error (the maximum value of ε_i on a mesh) on the adaptively refined mesh after k iteration steps of the adaptive Algorithm 1, and ξ_k/ξ_0, which measures the reduction of the a posteriori error estimate ε_i, provides information on the relative error reduction. Thus ξ_k/ξ_0 can be used as a stopping criterion apart from the maximum number of degrees of freedom; the number of degrees of freedom and the maximum number of iterations of the adaptive algorithm do not provide information about the error reduction. Algorithm 1 is used for adaptive refinement. When a finite volume is selected for further refinement based on the value of the adaptivity index (7), this finite volume is divided into four equal finite volumes. During the adaptive refinement process, all finite volumes Ω_i in the mesh for which the adaptivity index η_i is greater than a given tolerance δ are refined. The tolerance δ lies between 0 and 1. Tolerance δ equal to 0 means uniform refinement (all finite volumes are refined), while δ equal to 1 means that the adaptive algorithm refines a single finite volume per iteration step, which can be costly. Both of these extremes can be computationally expensive and may not be optimal.
Algorithm 1. Adaptive Algorithm with a new stopping criterion [ξ_Iter/ξ_0] ≥ tol.
1: Mesh the domain;
2: Compute ξ_0;
3: Set iteration counter Iter = 0;
4: while DOF ≤ DOF_max or Iter ≤ Iter_max or [ξ_Iter/ξ_0] ≥ tol do
5:   Discretize the PDE on the mesh;
6:   Solve the discrete system to a given tolerance;
7:   forall finite volumes j in the mesh do
8:     if η_j ≥ δ then
9:       Divide the finite volume j into four elements;
10:    end
11:  end
12:  Form a new mesh;
13:  Iter++;
14:  Compute ξ_Iter;
15: end
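The structure of Algorithm 1, with the three stopping criteria, can be sketched as follows (the mesh and solver objects and their methods are assumptions, not a real API; the loop is read here as: stop as soon as any of the three criteria is violated):

def adaptive_solve(mesh, solver, delta=0.5, dof_max=10**6, iter_max=50, tol=1e-3):
    ph = solver.solve(mesh)
    eps = solver.error_indicators(mesh, ph)      # eps_i of Equation (6)
    xi0 = xi_k = max(eps)
    it = 0
    while mesh.dof() <= dof_max and it <= iter_max and xi_k / xi0 >= tol:
        eta = [e / xi_k for e in eps]            # adaptivity index (7)
        marked = [j for j, e in enumerate(eta) if e >= delta]
        mesh = mesh.refine(marked)               # each marked cell -> four cells
        ph = solver.solve(mesh)                  # discretize and solve on the new mesh
        eps = solver.error_indicators(mesh, ph)
        xi_k = max(eps)
        it += 1
    return mesh, ph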
A small δ will refine many finite volumes and thus introduce many new cells per iteration step of the adaptive algorithm. On the other hand, a large value of δ will refine fewer cells and thus introduce fewer new finite volumes per iteration step. It should be kept in mind that during each iteration step of the adaptive algorithm a discrete system needs to be solved. Typically a value of δ = 0.5 is used [15]. To measure the effectiveness of the adaptivity index (7) in selecting the cells with maximum error, we use the relation
\[ \Gamma \overset{\mathrm{def}}{=} \frac{\text{cell number with } \eta = 1.0}{\text{cell number with maximum point-wise error } |p - p_h|}. \qquad (8) \]
Here, Γ is the robustness of the indicator η. If Γ is close to 1, the cells with the maximum point-wise error and the cells with the maximum error given by the error indicator (6) are the same. We compute the robustness quantity Γ of the adaptivity index during each iteration step of the adaptive Algorithm 1.
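A sketch of this robustness check (array names are assumptions):

import numpy as np

def robustness(eta, p_exact, p_h):
    """Gamma = (number of the cell with eta = 1) / (number of the cell with max |p - p_h|)."""
    cell_indicator = int(np.argmax(eta)) + 1      # eta attains 1 at its maximum; 1-based numbering
    cell_pointwise = int(np.argmax(np.abs(np.asarray(p_exact) - np.asarray(p_h)))) + 1
    return cell_indicator / cell_pointwise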
3 Numerical Examples
Let p be the exact solution vector and p_h the finite volume solution vector on a mesh. Let us further assume that p^k is the exact pressure at the center of cell k and p_h^k the discrete pressure given by the finite volume approximation at the same location. The error in the L∞ norm is defined as
\[ \|p - p_h\|_{L^\infty} \overset{\mathrm{def}}{=} \max_{k\in\mathrm{cells}} |p^k - p_h^k|. \qquad (9) \]
The finite volume solution is enforced inside the domain by the Dirichlet boundary condition and the source term. For solving the discrete systems of
equations formed on the sequence of adaptive and uniform meshes, we use the ILU-preconditioned Conjugate Gradient (CG) iterative solver unless mentioned otherwise. Let the domain be Ω = [−1, 1] × [−1, 1], divided into four sub-domains according to the permeability K (see Fig. 2). Let the permeability in sub-domain Ω_i be K_i. It is assumed that the permeability in Ω_1 equals the permeability in Ω_3 and the permeability in Ω_2 equals the permeability in Ω_4, that is, K_1 = K_3 and K_2 = K_4. Let us further assume that K_1 = K_3 = R and K_2 = K_4 = 1. The parameter R is defined below. Let the exact solution in polar form be
\[ p(r, \theta) = r^{\gamma}\,\eta(\theta), \qquad (10) \]
cf. [8,9]. The parameter γ denotes the singularity in the solution [9], and it depends on the permeability distribution in the domain. For the singularity γ = 0.1, Fig. 3 presents the permeability distribution. η(θ) is given as
\[ \eta(\theta) = \begin{cases} \cos[(\pi/2 - \sigma)\gamma]\,\cos[(\theta - \pi/2 + \rho)\gamma], & \theta \in [0, \pi/2], \\ \cos(\rho\gamma)\,\cos[(\theta - \pi + \sigma)\gamma], & \theta \in [\pi/2, \pi], \\ \cos(\sigma\gamma)\,\cos[(\theta - \pi - \rho)\gamma], & \theta \in [\pi, 3\pi/2], \\ \cos[(\pi/2 - \rho)\gamma]\,\cos[(\theta - 3\pi/2 - \sigma)\gamma], & \theta \in [3\pi/2, 2\pi], \end{cases} \qquad (11) \]
and the parameters R, γ, ρ and σ satisfy the nonlinear equations
\[ R = -\tan[(\pi - \sigma)\gamma]\,\cot(\rho\gamma), \qquad 1/R = -\tan(\rho\gamma)\,\cot(\sigma\gamma), \qquad R = -\tan(\sigma\gamma)\,\cot[(\pi/2 - \rho)\gamma], \qquad (12) \]
under the nonlinear constraints
\[ 0 < \gamma < 2, \qquad \max\{0, \pi\gamma - \pi\} < 2\gamma\rho < \min\{\pi\gamma, \pi\}, \qquad \max\{0, \pi - \pi\gamma\} < -2\gamma\sigma < \min\{\pi, 2\pi - \pi\gamma\}. \qquad (13) \]
The constrained nonlinear equations (12) can be solved for the parameters R, σ, and ρ by Newton's iteration for different degrees of singularity γ. The analytical solution p(r, θ) satisfies the usual interface conditions: p and K ∂p/∂n are continuous across the interfaces. It can be shown that the solution p belongs to the fractional Sobolev space H^{1+κ}(Ω) with κ < γ [10]. Let the singularity be γ = 0.1. Parameters that satisfy the relations (12) under the constraints (13) are R ≈ 161.4476, ρ ≈ 0.7854 and σ ≈ −14.9225.
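As an illustration, the system (12) can be solved for (R, ρ, σ) at a given γ with a standard Newton-type solver; the starting guess below is an assumption and must be chosen compatibly with the constraints (13):

import numpy as np
from scipy.optimize import fsolve

def residual(u, gamma):
    # Residuals of the three relations in (12); a root gives (R, rho, sigma).
    R, rho, sigma = u
    return [R + np.tan((np.pi - sigma) * gamma) / np.tan(rho * gamma),
            1.0 / R + np.tan(rho * gamma) / np.tan(sigma * gamma),
            R + np.tan(sigma * gamma) / np.tan((np.pi / 2.0 - rho) * gamma)]

gamma = 0.1
R, rho, sigma = fsolve(residual, [100.0, 0.8, -10.0], args=(gamma,))
print(R, rho, sigma)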
The permeability distribution is shown in Figure 3. The exact solution belongs to the fractional Sobolev space H^{1+κ} with κ < 0.1.
Fig. 2. Domain Ω is divided into four subdomains Ω_i, i = 1, ..., 4, according to the permeability (K_1 = K_3 ≈ 161.45 I, K_2 = K_4 ≈ I).
Fig. 3. Permeability distribution for the singularities γ = 0.1 and γ = 0.1269. The solution is singular at O = (0, 0).
Fig. 4. Surface plots of the exact solution and of the error for the singularity parameter γ = 0.1: (a) the exact solution given by equation (10); (b) a surface plot of the error (u − u_h)/||u||_{L∞}.
We have solved this problem on adaptive and uniform meshes. The outcome of our numerical work is reported in Figs. 4 and 5. Figure 4(a) is a surface plot of the exact solution; the solution is singular at the origin. Figure 4(b) presents a surface plot of the error, which is seen to be largest at the singularity. Figure 5 compares the convergence behaviour on adaptive and uniform meshes in the L∞ norm. We did not notice any convergence in the L∞ norm on uniform meshes up to one million degrees of freedom. A similar behaviour was also observed in [11] on uniform meshes for singular problems, and it was suggested there that adaptive meshes may be ideal for such solutions. On adaptive meshes we obtain ||p − p_h||_{L∞} ≈ DOF^{−P/2} with convergence rate P ≈ 1 (see Figure 5). Because of the regularity of the solution, this convergence is quasi-optimal [9,8].
Fig. 5. ||p − p_h||_{L∞} versus degrees of freedom on adaptive and uniform meshes; on adaptive meshes ||p − p_h||_{L∞} ≈ DOF^{−1/2}.
Fig. 6. ||p − p_h||_{L∞} versus degrees of freedom on adaptive and uniform meshes; on adaptive meshes ||p − p_h||_{L∞} ≈ DOF^{−1/2}.
Let the singularity be γ = 0.1269. Parameters that satisfy the relations (12) under the constraints (13) are R ≈ 99.999999, ρ ≈ 0.7853982 and σ ≈ −11.59263.
The exact solution belongs to the fractional Sobolev space H^{1.126}. Figure 6 compares the convergence behaviour of the finite volume method on adaptive and uniform meshes. Again, we did not observe any convergence up to one million degrees of freedom on uniform meshes (there is some convergence during the last refinement, see Figure 6).
Fig. 7. Decrease of the stopping criterion ξ_k/ξ_0 in Algorithm 1 with adaptive refinement, plotted against the degrees of freedom for γ ≈ 0.10 and γ ≈ 0.13.
Fig. 8. Robustness Γ (defined by Equation (8)) of the adaptivity index for finding the cells with most error, plotted against the iterations of the algorithm [Iter]. Solutions are in the spaces H^{1.126902} and H^{1.1}.
On adaptive meshes we still obtain ||p − p_h||_{L∞} ≈ DOF^{−P/2} with P ≈ 1. Figure 8 is a plot of the robustness against the iterations of the adaptive algorithm. As can be seen in Figure 8, the robustness is almost always equal to 1.0 over all adaptive iterations. This means that the cells with the maximum point-wise error and the cells with the maximum value of the error indicator given by Equation (6) are the same.
References
1. Khattri, S.K.: Nonlinear elliptic problems with the method of finite volumes. Differential Equations and Nonlinear Mechanics, Article ID 31797, 16 pages (2006), doi:10.1155/DENM/2006/31797
2. Khattri, S.K.: Newton-Krylov Algorithm with Adaptive Error Correction for the Poisson-Boltzmann Equation. MATCH Commun. Math. Comput. Chem. 1, 197–208 (2006)
3. Khattri, S.K., Fladmark, G.: Which Meshes Are Better Conditioned: Adaptive, Uniform, Locally Refined or Locally Adjusted? In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 102–105. Springer, Heidelberg (2006)
4. Khattri, S.K.: Analyzing Finite Volume for Single Phase Flow in Porous Media. Journal of Porous Media 10, 109–123 (2007)
5. Khattri, S.K., Hellevang, H., Fladmark, G.E., Kvamme, B.: Simulation of long-term fate of CO2 in the sand of Utsira. Journal of Porous Media (to be published)
6. Khattri, S.K.: Grid generation and adaptation by functionals. Computational and Applied Mathematics 26, 1–15 (2007)
7. Khattri, S.K.: Numerical Tools for Multicomponent, Multiphase, Reactive Processes: Flow of CO2 in Porous Media. PhD Thesis, The University of Bergen (2006)
8. Morin, P., Nochetto, R.H., Siebert, K.G.: Data oscillation and convergence of adaptive FEM. SIAM J. Numer. Anal. 38, 466–488 (2000)
9. Chen, Z., Dai, S.: On the efficiency of adaptive finite element methods for elliptic problems with discontinuous coefficients. SIAM J. Sci. Comput. 24, 443–462 (2002)
10. Strang, G., Fix, G.J.: An Analysis of the Finite Element Method, vol. 1. Wiley, New York (1973)
11. Eigestad, G., Klausen, R.: On the convergence of the multi-point flux approximation O-method: Numerical experiments for discontinuous permeability. Numerical Methods for Partial Differential Equations 21, 1079–1098 (2005)
12. Aavatsmark, I.: An introduction to multipoint flux approximations for quadrilateral grids. Comput. Geosci. 6, 405–432 (2002)
13. Ewing, R., Lazarov, R., Vassilevski, P.: Local refinement techniques for elliptic problems on cell-centered grids. I. Error analysis. Math. Comp. 56, 437–461 (1991)
14. Ewing, R., Lazarov, R., Vassilevski, P.: Local refinement techniques for elliptic problems on cell-centered grids. III. Algebraic multilevel BEPS preconditioners. Numer. Math. 59, 431–452 (1991)
15. Riviere, B.: Discontinuous Galerkin finite element methods for solving the miscible displacement problem in porous media. PhD Thesis, The University of Texas at Austin (2000)