Developments in Teracomputing: The Use of High Performance Computing in Meteorology

Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology DEVELOPMENTS IN TERACOM...

Author: Walter Zwiefhoger

15 downloads 809 Views 20MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology

DEVELOPMENTS IN TERACOMPUTING

DEVELOPMENTS IN TERACOMPUTING

Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology

DEVELOPMENTS IN TERACOMPUTING Reading, UK

November 13-17,2000

Editors

Walter Zwieflhof er Norbert Kreitz European Centre for Medium-Range Weather Forecasts, Reading, UK

V f e World Scientific WB

New Jersey • London • Singapore • Hong Kong

Published by World Scientific Publishing Co. Pte. Ltd. P O Box 128, Farrer Road, Singapore 912805 USA office: Suite IB, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.

DEVELOPMENTS IN TERACOMPUTING Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology Copyright © 2001 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN 981-02-4761-3

Printed in Singapore by Mainland Press

V

PREFACE

In an earlier Workshop of this series, TeraComputing was defined as achieving a sustained performance of at least one teraflop per second in a production environment. In November 2000, around 140 researchers in the fields of meteorology and climatology and experts in high-performance computing gathered at the European Centre for Medium-Range Weather Forecasts (ECMWF) to discuss the latest developments in the quest for TeraComputing. Such high levels of computational performance are required in order to realise the potential for prediction of weather and climate on all timescales. Effective highperformance computing is needed for accurate numerical modelling of the earth's atmosphere and of conditions at its ocean, ice and land boundaries and for the assimilation of the wealth of in-situ and satellite-based observations that provide the high-quality initial conditions essential for accurate forecasts. It is also needed for running the ensembles of predictions that characterize the probability of occurrence of events and thereby enable informed decision-making and exploitation of the value inherent in the forecasts. The challenges and opportunities for medium- and extended-range weather prediction were discussed comprehensively by ECMWF's Deputy Director, Anthony Hollingsworth, in his keynote lecture published in the Proceedings of the Eighth Workshop on the Use of Parallel Processors in Meteorology, and reiterated by Adrian Simmons and Martin Miller in their introductory lectures to this Ninth Workshop. As the week progressed, the lectures provided an insight into what various high-performance computing sites around the world consider the most appropriate algorithmic techniques and parallelisation strategies to make effective use of the available computer architectures. The roundtable discussion on the last day of the Workshop showed that the relative merits of vector parallel systems versus SMP systems continue to be hotly debated and that the performance gains expected of techniques such as combining OpenMP and MPI have yet to be demonstrated. The next Workshop in this series is planned for November 2002 and one can confidently expect significant progress towards effective TeraComputing between now and then. Walter Zwieflhofer

CONTENTS

Preface

v

Research and development of the earth simulator K. Yoshida and S. Shingu

1

Parallel computing at Canadian Meteorological Centre J. -P. Toviessi, A. Patoine, G. Lemay, M. Desgagne, M. Valin, A. Qaddouri, V. Lee and J. Cote

14

Parallel elliptic solvers for the implicit global variable-resolution grid-point GEM model: Iterative and fast direct methods A. Qaddouri and J. Cote

22

IFS developments D. Dent, M. Hamrud, G. Mozdzynski, D. Salmond and C. Temperton

36

Performance of parallelized forecast and analysis models at JMA Y. Oikawa

53

Building a scalable parallel architecture for spectral GCMS T. N. Venkatesh, U. N. Sinha and R. S. Nanjundiah

62

Semi-implicit spectral element methods for atmospheric general circulation models R. D. Loft and S. J. Thomas

73

Experiments with NCEP's spectral model J. -F. Estrade, Y. Tremolet and J. Sela

92

The implementation of I/O servers in NCEP's ETA model on the IBM SP J. Tuccillo

100

VIII

Implementation of a complete weather forecasting suite on PARAM 10 000 S. C. Purohit, A. Kaginalkar, J. V. Ratnam, J. Raman and M. Bali

104

Parallel load balance system of regional multiple scale advanced prediction system J. Zhiyan

110

Grid computing for meteorology G.-R. Hoffmann

119

The requirements for an active archive at the Met Office M. Carter

127

Intelligent support for high I/O requirements of leading edge scientific codes on high-end computing systems — The ESTEDI project K. Kleese and P. Baumann

134

Coupled marine ecosystem modelling on high-performance computers M. Ashworth, R. Proctor, J. T. Holt, J. I. Allen and J. C. Blackford

150

OpenMP in the physics portion of the Met Office model R. W. Ford and P. M. Burton

164

Converting the Halo-update subroutine in the Met Office unified model to co-array Fortran P. M. Burton, B. Carruthers, G. S. Fischer, B. H. Johnson and R. W. Numrich

177

Parallel ice dynamics in an operational Baltic Sea model T. Wilhelmsson

189

ix

Parallel coupling of regional atmosphere and ocean models S. Frickenhaus, R. Redler and P. Post

201

Dynamic load balancing for atmospheric models G. Karagiorgos, N. M. Missirlis and F. Tzaferis

214

HPC in Switzerland: New developments in numerical weather prediction M. Ballabio, A. Mangili, G. Corti, D. Marie, J. -M. Bettems, E. Zala, G. de Morsier and J. Quiby

228

The role of advanced computing in future weather prediction A. E. MacDonald

240

The scalable modeling system: A high-level alternative to MPI M. Govett, J. Middlecoff, L. Hart, T. Henderson and D. Schaffer

251

Development of a next-generation regional weather research and forecast model J. Michalakes, S. Chen, J. Dudhia, L. Hart, J. Klemp, J. Middlecoff and W. Skamarock

269

Parallel numerical kernels for climate models V. Balaji

277

Using accurate arithmetics to improve numerical reproducibility and stability in parallel applications Y. He and C. H. Q. Ding

296

Parallelization of a GCM using a hybrid approach on the IBM SP2 S. Cocke and Z. Christidis

318

Developments in high performance computing at Fleet Numerical Meteorology and Oceanography Center K. D. Pollak and R. M. Clancy

327

X

The computational performance of the NCEP seasonal forecast model on Fujitsu VPP5000 at ECMWF H. -M. H. Juang and M. Kanamitsu

338

Panel experience on using high performance computing in meteorology — Summary of the discussion P. Prior

348

List of participants

353

1

RESEARCH & DEVELOPMENT OF THE EARTH SIMULATOR KAZUO YOSHIDA Earth Simulator Research and Development Center, Global Change Research National Space Development Agency of Japan (NASDA)

Division,

SATORU SHINGU Earth Simulator Development

Team, Japan Atomic Energy Research Institute

(JAERI)

A high-speed parallel computer system based on vector processors for research into global environmental changes called the Earth Simulator (ES: previously also known as GS40) has been studied and developed at the Earth Simulator Research and Development Center (ESRDC). This project is promoted by the Science and Technology Agency of Japan (STA). In addition to facilitating research into global environmental changes, ES will contribute to the development of computational science and engineering. The target of sustained performance for ES is 5 T flop/s on our Atmospheric General Circulation Model (CCSR/NASDA-SAGCM). Two years ago the design of the ES and its estimated performance were introduced at the Eighth ECMWF Workshop on the Use of Parallel Processors in Meteorology [1] and there has been little change in this area since. Therefore, in this paper we describe the present status of development of the hardware system, operating support software, application software and facilities. The current schedule is to accomplish the ES by February 2002 and we plan to bring it into service after March 2002.

1

Overview

Earth observation, research on physical processes and the computer simulations are very important elements in global environmental change studies, but global change mechanisms are tightly coupled and typically form a complex system. Large-scale simulation is the best tool to understand these kind of complicated phenomena. As part of the Earth Simulator project, we are developing a high-speed parallel computer system based on vector processors named the Earth Simulator (ES), which we called GS40 before. The Science and Technology Agency of Japan (STA) proposed the ES project in 1996. STA promotes studies into global change by an integrated three-in-one research and development approach. Those are the earth observation, the basic research into physical modeling and the numerical simulation by means of a high performance computer system. The National Space Development Agency of Japan (NASDA) wants to improve the usefulness of earth observation satellite data. The Japan Atomic Energy Research Institute (JAERI) wants to study the effectiveness of using atomic energy and its effect on environmental changes applying computer science and simulation. The Japan Marine Science and Technology Center (JAMSTEC) wants to do research on marine observation data and do global change

2

studies. So, STA requested and ordered NASDA, JAERI and JAMSTEC to establish the ES project and the cooperate organization of the Earth Simulator Research and Development Center (ESRDC) in 1997. The main purposes of the ES are understanding and prediction of global climate changes, understanding of plate tectonics and the development of advanced parallel software. The requirements for the resolution of the Atmospheric General Circulation Model (AGCM) are shown in Table 1. Table 1. Requirement for ES

Global model Regional model ^ ^ . ^ _ ,

50 - 100 km mesh 20 - 30 km mesh ^ s _ _ _

5 - 10 km mesh , 1 km mesh , 100-200 ~

At ESRDC, NASDA, JAERI and JAMSTEC are cooperating with each other in the research and development of the ES system. Each of us will contribute about 1/3 of the ES system. NASDA and JAERI will develop the operating support software. NASDA and JAMSTEC will develop the AGCM and the Ocean General Circulation Model (OGCM) respectively. The ES facilities are constructed by JAMSTEC and JAMSTEC alone will be responsible for the operation and management of the ES system after its accomplishment in order to avoid questions of competence. NASDA has launched the Tropical Rain Measuring Mission (TRMM) satellite in cooperation with the National Aeronautics and Space Administration (NASA). Furthermore, NASDA will launch the Advanced Earth Observation Satellite II (ADEOS-II) and other Global Change Observation Mission (GCOM) satellites in the future and is going to provide the data to the ES for use by all global climate change researchers. As for example with Sea Surface Temperature (SST) data retrieved from TRMM, we think, satellite data can show more clearly the detail of the climatic phenomenon, but if satellite data will be used effectively within the ES, simulations may predict more precisely the global environmental change in future. 2

Hardware system

ES is a distributed memory parallel system, which consists of 640 Processor Nodes (PNs) connected by a single-stage crossbar network. Each PN is a shared memory system composed of 8 Arithmetic Processors (APs), a shared memory system of 16GBytes, a Remote Control Unit (RCU), an Input-Output Processor (IOP) and system disk space of about 576GBytes. The peak performance of each AP is 8Gflop/s. Therefore, the total number of APs is 5,120 and the total peak performance, main memory and system disk are 40TFlop/s, lOTBytes and about 360TBytes respectively. The interconnection Network (IN) is a single stage

3

crossbar network, which connects PN to PN independently. The IN makes the system appear completely flat and the bandwidth of the IN is 16 G Bytes/sec x 2 on a node to a node transfer. To achieve these specifications, we reduced the size of the cabinets containing PNs. For comparison, the present SX-5 supercomputer with a peak performance of 128GFlop/s houses 16APs in 1PN whereas the ES has 16APs in 2PNs. Both, the SX-5 and the ES adopt an air-cooling system. Main memory is 128GBytes compared to 32GBytes, which depends on our budgetary limitation, but power consumption is about 90KVA compared to 20KVA. The sizes are 3.2m (L) x 6.8m (W) x 1.8m (H) as opposed to 1.4m ( L ) x l m (W) x 2m (H). So, the volume of ES will reduce to about 1/14 compared with the SX-5. We also adopt many new technologies for the ES. The important ones are discussed in the following.

Figure 1. Arithmetic Processor Package

2. /

LSI technology In semiconductor's fabrication, exposure technology has currently achieved line width of 0.35 microns. This technology is expected to advance through the

4

development of 0.25-micron technology to eventually reach 0.15-micron technology at the beginning of the 21s1 century. In a trial run we manufactured CMOS LSI and decided to use 0.15-micron technology for the ES. Large sized LSI can achieve high efficiency, but it is very difficult to manufacture and gives cause to heat and pin-connection problems. Currently sizes of approximately 10mm x 10mm are usually used, but we need a high efficiency so we decided to use about 21mm x 21mm CMOS LSI for the AP and 18mm x 18mm CMOS LSI for the Main Memory Control (MMC). The clock frequency of the ES is 500 MHz (2nsec) for the AP, but 1 GHz (lnsec) for the vector units.

Figure 2. Main Memory Unit Package

2.2

Packaging technology Multi layered Printed Circuit Boards (PCB) are very useful for compact packages. Now a maximum of 12 layers is used in another case. But the ES uses more of 121ayered build-up PCBs and flip chip mounting with 5,185

5

solder bumps for the AP instead of pins. So, AP, MMC and some other chips are not mold, but bare chips. 2.3

Cooling technology Reducing the power voltage and increasing the clock speed in semiconductors leads to reduced power efficiency, increased power response time and currents. Usually, high-density LSI causes heat problems and it limits the performance of the AP. The ES uses an air-cooled heat sink system with a heat pipe for the AP to solve this problem and limitation. Also, we use a wind funnel for cooling the AP more effectively.

2.4

Interconnection technology The transfer capabilities, within a parallel computer network, must have maximum data transfer capacity and minimum access latency in order to accommodate various types of parallel software. We estimated the performance for inter-node communication. Latency of one-sided communication is about 7 micro-seconds, and point-to-point communication is about 10 micro-seconds. These values are a little bit worse than we estimated before, but still within our specification. The ES uses 1.25GHz balanced serial transmission and achieves a maximum of 40m-transmission distance.

We have finished the design and trial manufacturing of key elements of the ES in 1999, and defined the detail design early in 2000. Now, we are conducting manufacturing tests of the AP, MMU and integration tests of the PN and IN. So, we will start to manufacture PNs and INs and hopefully go into full production in early 2001. One PN consists of 8 APs, Main Memory Unit (MMU), Remote access Control Unit (RCU), I/O Processor (IOP), Diagnostic Processor (DGP), AC-DC Converter, Serial/Parallel Converter, half a Power Supply Unit (PSU), cooling fan and more. In order to reduce the overall size, our design accommodates two PNs in one cabinet. The cabinet size for PNs is about 1.4m (L) x lm (W) x 2m (H). An IN consists of a Crossbar Switch (XSW), AC-DC converter, DC-DC converter, serial/parallel Converter, half a power supply unit, cooling fan and more. We also reduced the size of the IN's cabinet and housed two INs in one cabinet. The cabinet size for INs is about 1.3m (L) x 1.2m (W) x 2m (H). The ES will have system disks, user disks, a Cartridge Tape Library (CTL) of about 12 PBytes for data archiving, file server, user work stations (WS), user graphic work stations (GWS), 3-dimensional graphic display (3D) and more. The ES will have 3 networks to connect peripherals and the system itself. One is the

6

Interconnection Network, which connects a TSS cluster and Batch clusters. The second one is the Data Transfer Network, which connects system disks, CTL and file server. The third one is the Backbone Network, which connects TSS cluster, file server, WSs, GWSs, the 3D display system and the Internet through a Firewall. The ES will occupy an area of 40m x 40m and the power consumption will be between 5,000 and 6,000 KVA in total for the main system. We will place 65 IN cabinets at the center with 320 PN cabinets surrounding these INs. The magnetic disk system and network equipment will be situated around the PNs. The CTL will be placed near the magnetic disk system. 3

Operating system environment

Figure 3. Peripheral Devices and Network

The operating system (OS) of the ES is based on the vender's Unix system (Super-UX) providing the usual Unix environment. However, some functions will be newly developed for the ES, such as: large scalability of distributed-memory parallel processing, large scale parallel I/O, interface to the operation support software and so on.

7

In order to decrease the difficulty of the ES operation, we divide the 640 nodes into 40 groups which we call "clusters". Each cluster consists of 16 nodes, a Cluster Control Station (CCS), an I/O Control Station (IOCS) and system disks. With the ES system we employ parallel I/O techniques for both disks and massstorage systems to enhance data throughput. That is, each node has an I/O processor with access to a disk system via high-speed networks and each cluster is connected to a driver of the mass-storage system via IOCS. Each cluster is controlled by a CCS and the total system is controlled by the Super Cluster Control Station (SCCS). All the clusters can be seen transparently by the users on the other hand.

Figure 4. Batch Job Queuing System

ES is basically a batch-job system and most of the clusters are called batch clusters. Batch clusters are mainly for multi-node batch jobs and have system disks. These system disks are used for system files used by the OS and for staging user files. However, we are preparing a very special cluster called the TSS cluster. In the TSS cluster, one or two nodes will be used for interactive processing and the rest of the nodes will be used for small-scale (single-node) batch jobs. Note that, user disks are only connected to the TSS cluster in order to save the cost of peripheral devices. Therefore, most of the user files, which will be used on the batch clusters, are to be stored in the mass storage system. All batch clusters and the TSS cluster are

8

connected directly to the mass storage system. A file management system automatically retrieves and migrates data between system disks and the mass storage system for batch jobs. To support the efficient operation of the ES, we are developing operation support software. Operation support software will comprise a job scheduler, a file management system, a broadcasting system, an automatic power control system and an accounting system. The job scheduler will manage node allocation assigning free nodes to a job. The file management system will provide the data movement such that data will be available on the nodes where a job will run. The job scheduler will run a batch job queuing system. If a large-scale job, such as a multi-node batch job, is submitted, the job scheduler will pass it to a batch cluster under the control of the job scheduler and the Network Queuing System (NQS) and allocate nodes and time. But when a small-scale job, such as a singlenode batch job, is submitted, the job scheduler will pass it to a batch cluster controlled by NQS only and allocate APs and time. The file management system will dynamically move user files from/to the CTL. It will retrieve specified files before the job will run and will save specified files after the job has finished. Users will have to declare such files in the job script. The file management system will maintain a virtual file tree in the database, so users see a unified file system. 4

Parallel vector processing

Table 2.3-levels of parallel vector processing

1

distributed

2

inter-node

PN

640

intra-node

AP

8

AP

vector register length

shared 3

256

MPI HPF OpenMP microtasking MPI HPF vectorization

Possible programming languages for the ES are Fortran90, C and C++. From these languages, we think that Fortran90 is more effective than other languages to achieve good vector performance and we recommend using it. The following parallel programming models will be supported. MPI-2 for message passing programming, HPF2.0/JA for data parallelism, OpenMP and

9 microtasking are for shared memory applications and vectorization. MPI-2 is the standard Message Passing Interface including one-sided communication and parallel I/O. HPF2.0/JA is High Performance FortranV2 plus extended JAHPF. MPI and HPF are available for both distributed and shared memory. OpenMP is the standard shared memory programming Application Program Interface (API). Microtasking is a shared memory programming API similar to OpenMP for NEC SX series and the Fortran90 compiler provides an automatic parallelization feature. There are 3-levels of parallel vector processing (see Table 2) on the ES. The first level is parallel processing among distributed PNs via the crossbar network (inter-node). The second level is parallel processing with shared memory in a PN (intra-node). The third level is vector processing in an AP. Users will have to select a suitable combination within these environments in order to achieve good performance. For example, one recommended combination is MPI for level-1, microtasking for level-2 and vectorization for level-3. This combination is used with CCSR/NASDA-SAGCM. If a user wants to go for relatively simple programming, he/she might prefer to select MPI or HPF for both level-1 and 2. If a user wants to achieve portability and performance, he/she might choose MPI for level-1 and OpenMP for level-2. 5

Application software

We are going to prepare as standard application software an Atmospheric General Circulation Model (AGCM) and an Ocean General Circulation Model (OGCM) to enable easy and efficient use of the ES. To develop software for general use achieving high performance combined with high resolution we found difficult. Therefore, we decided to create 3 software categories: One is for common use, the second one is for performance evaluation and the third one is for high-resolution experiments and future models. For common use, we are developing CCSR/NASDA-SAGCM, previously known as NJR-SAGCM. This is a global climate spectral model based on the CCSR model developed by the University of Tokyo. CCSR/NASDA-SAGCM is Fortran95 compliant enabling readability, portability, modularity and expandability for users. Also, JAMSTEC is developing and optimizing two types of OGCM for the ES. One is based on MOM3 and the other is based on POM. For performance evaluation, we are developing CCSR/NASDA-SAGCM (TF5) for the pursuit of high performance on the ES. The target of its sustained performance is higher than 5TFlop/s. The original program is the same as CCSR/NASDA-SAGCM, but CCSR/NASDA-SAGCM (TF5) is optimized to get the best performance on higher resolution experiments compared to the common use version. For high-resolution and future models, we are developing for instance a global lOKm resolution version of the AGCM in collaboration with many scientists. Also, we are working on a high resolution, non-hydrostatic global coupled atmosphere-

10

ocean model. Furthermore, we are developing a regional model, which will implement a new algorithm of Cubic Interpolation Pseudo-particle method (OP), etc. We measured the performance of CCSR/NASDA-SAGCM (T159L20) on a single node of SX-4 with SDRAM. Parallel programming in this version is using MPI only. Its speed-up ratios are 1.977 for 2 APs, 3.908 for 4 APs and 7.604 for 8 APs. Effective parallel ratios are 99.85%, 99.21% and 99.26%, respectively.

Figure 5. Performance of CCSR/NASDA-SAGCM

In order to estimate the performance for the ES, let's look at Amdahl's law in the following form: S = l / ( 1 - P + P/M), where S is the speed-up ratio, P is the parallel ratio, and M is the number of processors. For the ES, the situation is more complex and we have to think about parallel processing on two levels. The first level is the inter-node (max 640 nodes) and the second level is the intra-node (max 8 APs) parallelism. The updated Amdahl's law for a hybrid parallel programming model may then be formulated as:

11

S = l/( 1 - P + P/M( l - t + t/m) + a + P ) , where S is the speed-up ratio, P is the parallel ratio for inter-node, t is the parallel ratio for intra-node, M is the number of PNs, m is the number of APs, a is the overhead for MPI, for example message transfer throughput and latency, loadimbalance, etc., and fj is the overhead for microtasking. When a user runs a parallel program using a lot of nodes, P may become the most significant factor determining the speedup. In addition, not to reduce the length of vector loops will become a significant factor, too.

6

Facility and development schedule

It has been decided to locate the ES facilities in Yokohama City, Kanagawa Prefecture near Tokyo, Japan. The ES facility, which consists of the ES building itself, the research building, the chiller etc., is under construction and will be accomplished by December 2000.

Figure 6. Scale Model of ES Building

12

The ES building is a two-story, 50m wide and 65m long building, which looks like a gymnasium, but equipped with a seismic isolation system, electro-magnetic shielding, a double floor for cables and a powerful air conditioning system etc. We will begin the cabling for networks and power in January 2001.

Figure 7. Construction of ES Building (September, 2000)

As mentioned before, the detailed hardware design for the ES components has been finished and manufacturing started early in 2000. PNs and INs will be manufactured continuously in a factory, so we will install and set them up in the ES building unit by unit from spring 2001 onwards. Peripheral systems will be installed at the same time and we will begin to test and evaluate individual PNs including operation support software and then combine them into clusters. When all clusters are installed and working, we will begin to test and tune the entire ES system. This should be accomplished by the end of February 2002. We plan to start operational use of the ES at the beginning of March 2002 by JAMSTEC. NASDA, JAERI and JAMSTEC, including the Frontier Research System for Global Change (FRSGC), will begin to use the ES thereafter.

13

We expect not to complete the development of the application software such as AGCM and OGCM before the end of 2002. So, we have to continue the development of these after March 2002. Joint research with institutes from around the world interested in climate change will be welcome. Furthermore, JAMSTEC plans to issue a Research Announcement of Opportunity in the summer of 2001. Following the evaluation of the applications selected participants should be able to carry out their programs after October 2002.

References 1. Mitsuo Yokokawa and Satoru Shingu, et al, Performance Estimation of the Earth Simulator, Towards Teracomputing (1998) pp. 34-53. 2. Kazuo Yoshida and Shinichi Kawai, Development of the Earth Simulator, New Frontiers of Science and Technology (1999) pp. 439-444. 3. Keiji Tani, Earth Simulator Project in Japan, High Performance Computing (2000) pp. 33-42. 4. Ian Foster, DESIGNING and BUILDING PARALLEL PROGRAMS (1995), Addison-Wesley Publishing Company. 5. http://www.hlrs.de/organization/par/services/models/index.html 6. http://www.openmp.org 7. http://www.gaia.jaeri.go.jp

14

PARALLEL COMPUTING AT CANADIAN METEOROLOGICAL CENTRE JOSEPH-PIERRE TOVIESSI, ALAIN PATOINE, GABRIEL LEMAY Canadian Meteorological Centre, Meteorological Service of Canada, 2121 Trans-Canada Highway, 4th floor, Dorval, QC, Canada, H9P 1J3 MICHEL DESGAGNE, MICHEL VALIN, ABDESSAMAD QADDOURI, VIVIAN LEE, JEAN C6TE Recherche en prevision numerique, Meteorological Service of Canada, 2121 Trans-Canada Highway, 5th floor, Dorval, QC, Canada, H9P 1J3 The Canadian Meteorological Centre has four NEC SX supercomputers (2 SX-4 and 2 SX-5) used for operational, development and research activities in the area of data assimilation, forecasting, emergency response, etc. One of the major components is the Global Environmental Multiscale (GEM) model. The GEM model can be configured to run with a uniform-resolution grid to support long-range forecasting, as well as with a variableresolution grid to support limited-area forecasting. First designed as a shared memory model, it is being migrated to a distributed memory environment. This paper concentrates on some techniques used in the distributed GEM model. We will also discuss the performance of the model on NEC SX-5 supercomputer nodes.

1

Introduction

Few years ago the Canadian Meteorological Centre (CMC) undertook a major transition in its operational suite. The main reason for this was the advent of a unified model, the Global Environmental Multiscale (GEM) model, which can be used for both global and regional forecasting. Originally designed to run in a shared memory environment, in the context of OpenMp paradigm or NEC SX multitasking programming, it is now running operationally on a single 32-processor NEC SX-4 node. The development of the distributed-memory version of the GEM model (GEM-DM) is in its final stage. This paper outlines the special features of the GEM-DM model, and presents the performance obtained so far. 2

Hardware infrastructure

A multi-tiered networking topology is used to closely couple the varied components of the CMC supercomputer infrastructure as represented in Fig. 1. There are currently four shared-memory NEC vector supercomputers nodes at CMC. The two SX-4s have been in-service for some time while the two SX-5s have been recently acquired.

15

IXS

SX-4-32 asa

IXS

SiX-4-32 hiru

SX-S-16 yonaka

SX-5-16 kaze

HIPPI Switch <^^Disks^> Figure 1. CMC supercomputers.

In Table 1 below, a comparison is presented between one SX-4 node and one SX-5 node. Note that the SX-5 has a slower memory but the ratio of total peak Gigaflops to the central memory capacity is one, allowing each processor to access memory with little conflict. Table 1. Supercomputer architecture.

Number of processing elements (PE) Clock (MHz) Peak Gigaflops/PE (total) Vector Pipes/PE (total) Type of Main Memory Unit (MMU) Main Memory Unit in GBytes Memory Bandwidth (GBytes/s) Extended Memory Unit (XMU) in GBytes XMU Bandwidth (GBytes/s) IXS Cross-Bar Bandwidth (GBytes/s)

SX-4 32 125 2(64) 8 (256) SSRAM 8 128 16 8 8

SX-5 16 250 8 (128) 16 (256) SDRAM 128 1024 none N/A 16

Presendy, CMC uses only one SX-4 to run its shared-memory operational model. The distributed-memory model version of the operational GEM will run on two SX-5 nodes which will allow an increase of resolution as well as the inclusion of more sophisticated physics schemes.

16

3

The unified Global Environmental Multiscale (GEM) model

Many papers have described the design, the characteristics and the features of the GEM model. Cote" et al. [1, 2] presented some design considerations and results. This section outlines the main features of the operational GEM model, and the impact of those characteristics on the programming strategies in the distributed implementation. The GEM system uses an arbitrary rotated latitude-longitude mesh to focus resolution on any part of the globe. Figure 2 shows the grid of the current operational regional model, which has a resolution of 24 km over Canada and uses 28 levels. The future regional configuration will have instead, a resolution of about 16 km and 35 - 45 levels. Performance data presented further in this paper are based on this new grid and levels.

Figure 2. Grid of the operational regional model (resolution of 24 km in the uniform area).

The computational domain is discretised in Nx x Ny x Nz points. For distributed-memory computation, a regular block partitioning of the global computational domain across a P = Px x Py logical processor mesh is employed. Since no partitioning of the vertical domain is performed, each processor computes on a sub-domain covering Nx/Px x Ny/Py x Nz points (Fig. 3). The choice of Px and Py for a particular problem size should in principle be made according to the best achievable single-processor performance on a particular machine architecture [4]. For example, on the NEC SXs, given the fact that the code is largely vectorised along the X-axis, we tend to choose Px in such a way that Nx/Px is either 256 or slightly below in order to take full advantage of the vector registers. A halo (or ghost) region used for communication with neighbouring processors surrounds most computing arrays within the dynamical core of the model. These

17

communications are performed using a local application programming interface (API) that uses MPI primitives to exchange data between processors [5, 6]. The need for such communications arises mainly from the finite-difference stencil used by the model along with the number of intermediate results that must be stored and consequently exchanged before being used. A careful analysis and redesign of the data flow have therefore been performed in order to limit the need for communication as much as possible.

Figure 3. Memory layout (regular block partitioning).

Apart from grid considerations, the methods used to solve the dynamical problem include two items, that necessitates heavy communications: ® an implicit time treatment of the non-advective terms responsible for the fastest-moving acoustic and gravitational oscillations which leads to the solution of an elliptic problem; ® a semi-Lagrangian treatment of advection to overcome the stability limitations encountered in Eulerian schemes due to the convergence of the meridians and to strong jet streams, the semi-Lagrangian handling of the polar areas is particularly delicate as physically close points may lie in distant processors.

18

4

Performance results.

For any operational centre, there is a constraint on the time available to run the different models so that the products are disseminated to the users before a specific deadline. At CMC, the most restrictive deadline is imposed on the regional integration, which must be completed (including its outputs), in a time window of about 50 minutes (wall time). For that reason the performance results we present are those of the anticipated regional implementation of the GEM-DM model at 16 km. All the experiments are performed using two SX-5 nodes. 4.1

Timing one time step

A one time step execution of the new regional model on two SX-5 nodes gives 5.1 seconds for the dynamics and 4.1 seconds for the physics. This execution uses 16 PEs in a 4 x 4 processor mesh. The dynamics sustains 2.7 Gigaflops per PE while the physics reaches 1.3 Gigaflops per PE. The details of the timing of the dynamical core are listed in Table 2. Table 2. Timing the dynamical core on a two-node SX-5, (4 x 4) PE mesh.

Number Step calls 1 2 2 4 4 4 1

of

WT(s)

GF/PE

%ofWT

RHS ADV PREP NLIN SOL BAC DIFF

0.60 2.50 0.06 0.24 1.52 0.20 0.55

1.04 1.13 2.06 1.70 6.02 1.50 0.34

1.17 48.73 1.17 4.68 29.63 3.90 10.72

Total

5.13

N/A

100.00

A dynamical time step is composed (Table 2) of one right-hand-side computation (RHS), two advection computations (ADV), two linear preparations (PREP), four non-linear updates (NLIN), four solver calls (SOL), four back substitutions (BAC) and one horizontal diffusion (DIFF). The advection, solver and horizontal diffusion calls make up 89 percent of the dynamics timings. Qaddouri [3] presents a faster version of the solver and horizontal diffusion. The CMC/RPN GEM-DM team is rewriting the advection subroutine and its communication strategy, to improve its

19

performance for the operational model. In the following section, the results presented use the current subroutines only. 4.2

Timing a 48-h forecast

Figure 4 displays timings of the GEM-DM model on two SX-5 nodes for three families of domain decompositions, each run is a 48-hour integration, the length of the operational forecast. The (P x Q) domain decomposition divides the X-axis among P processors, and the Y-axis among Q processors, for a total of P x Q PEs for the model integration. The number of PEs along the X-axis is fixed for each member of the family, to obtain the same vectorisation optimisation for all the members. In general, the longer the length of the X-axis vectors, the more the vector pipes are being utilised. This is seen on Fig. 4, where the three families give results that are nearly parallel to each other. The perfect speedup curve is represented by the dotted line. The solid line, the (1 x n) decomposition, is the closest to the ideal speed-up. That the other curves are above this line reflects the decrease in the optimal usage of the vector pipes for the other types of domain decomposition. On the SX-5, the X-axis vector length should be close to a multiple of 256.

20 GEM-DM o ntwo SX-5 nodes 170 160

8X

1 50 140 130

I

1

X\ V1

„

-

120 110

-

100

I

I

—

„ (1 x n) PEs

at-

m (2 * n) PEs

m

«

(3x

n)

-

PEs

X* N

\ V \

SO 8O 70

-

-

60

—

—

SO X

V 40

N

\

~

* grid (509X625 or 16km)

^**xs

* 35 levels 30

\

*'\ \{

" no output

i

K

i

i

\

i

10 Number of PEs

F i g u r e 4 . GEM-DM on two SX-5 nodes.

The number of PEs needed to meet the operational deadline for this version of the GEM-DM model in its 16-km regional configuration is 18. This implies that the model must run on two nodes. When the number of PEs is above 20 the parallel performance diminishes as a result of the "regular block reduction effect". 5

Conclusion

The 48-hr integration of the distributed-memory GEM model on a two-node SX-5 cluster already displays acceptable performance. With further optimisation, we expect a regional model at 16 km to be feasible within the operational deadline.

21 References: 1.

Cote J., S. Gravel, A. Methot, A. Patoine, M. Roch, and A. Staniforth, 1998: The Operational CMC-MRB Global Environmental Multiscale (GEM) Model. Part I: Design Considerations and Formulation. Monthly Weather Review, 126, 1373-1395. 2. Cote, J., J.-G. Desmarais, S. Gravel, A. Methot, A. Patoine, M. Roch, and A. Staniforth, 1998: The Operational CMC-MRB Global Environmental Multiscale (GEM) Model. Part II: Results. Monthly Weather Review, 126, 1397-1418. 3. Qaddouri, A., 2000: Parallel elliptic solvers for the implicit global variableresolution grid-point GEM model: iterative and fast direct methods. Proceedings of the Ninth ECMWF Workshop, Use of High Performance Computing in Meteorology, Reading, UK, November 13-18, (this volume). 4. Thomas, S. J., M. Desgagn6, and R. Benoit, 1999: A real-time North American Forecast at 10-km Resolution with the Canadian MC2 Meso-LAM. Journal of Atmospheric and Oceanic Technology, 16, 1092-1101. 5. Gropp, W., E. Lusk, and A. Skjellum: Using MPI, 1994: Portable Parallel Programming with the Message-Passing Interface. The MIT Press, 307 pp. 6. Samir, M., S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, 1996: MPI: The Complete Reference. The MIT Press, 336 pp.

22 PARALLEL ELLIPTIC SOLVERS FOR THE IMPLICIT GLOBAL VARIABLE-RESOLUTION GRID-POINT GEM MODEL: ITERATIVE AND FAST DIRECT METHODS ABDESSAMAD QADDOURI AND JEAN COTE Recherche en prevision numerique, Meteorological Service of Canada, 2121 Trans-Canada Highway, 5* floor, Dorval, QC, Canada, H9P 1J3 The direct solution of separable elliptic boundary value problems in spherical geometry arising in numerical weather prediction involves, in the variable-mesh case, a full-matrix multiplication. To decrease the execution time of this direct solution we investigate two alternative ways of performing the matrix product: the Strassen method, and the exploitation of the mirror symmetry of the mesh. An iterative Preconditioned Conjugate Gradient (PCG) solution has also been implemented and compared to the direct solution. The direct and iterative solvers have been tested on the NEC SX-4 and SX-5.

1

Introduction

Most models in operational numerical weather prediction (NWP) use either an implicit or semi-implicit time discretisation on tensor products spatial grids. This gives rise to the need to solve a separable 3D elliptic boundary value (EBV) problem at each model time step. This is the case for the Global Environmental Multiscale (GEM) model in operation at the Canadian Meteorological Centre [1]. In the GEM model the separable EBV problem is currently solved with a direct method which is very efficiently implemented on Environment Canada NEC SX-4 system of vector multiprocessors. A parallel distributed-memory implementation with explicit message passing was recently described in Qaddouri et al. [2]. This direct solver, see section 3 below, can be implemented either with a fast or a slow Fourier transform. In the case of the slow transform, a full-matrix multiplication, the cost per grid point rises linearly with the number of grid points along the transform direction. The object of the present paper is to report on attempts to improve the performance of the EBV problem solution either by accelerating the full-matrix multiplication in the direct solver or by using a PCG iterative solver. The paper is organised as follows: section 2 presents the problem, section 3 describes the direct method, its parallelisation, its adaptation to solve a high-order horizontal diffusion problem, and its acceleration by using either the Strassen method or by exploiting the mirror symmetry of the mesh. Section 4 presents the iterative method, section 5 the performance results and finally section 6 the conclusion.

23

2

ELLIPTIC PROBLEM

The problem to solve is a 3D separable positive-definite EBV problem, which after vertical separation reduces to a set of horizontal Helmholtz problems, each of the form {T]-A)(/> = R , T])0 , where the real constant 7], the solution

0 ^(A,0) ^

=(Ai,0j),{i = l,NI;j = l,NJ} ,

on the grid

0
a _

K

a

_

n

and the grid spacings

M = 4 + 1 -A i ,i = 0,NI,(A^ = AXNI) , A sin 9j = sin &j+l - sin 6j, j = 0,NJ .

24

For finite-differencing we also introduce the following auxiliary grids

\ =—

, i = \,NI,X0 = XNI - In ,AOT+1 =

\+2n

*, = ^ 4 y-uw-U.--§.«»-§. and corresponding spacings

AA,. =A,.+1-A,.,i = 0,M , A sin 9 • = sin 9j+l - sin 9j, j = 0, NJ -1 . The discrete form of each Helmholtz problem is written as:

A0 = r , where A = TJP(9) ® P(A) - P\9)

®PU-

Pee ® />(*) ,

and

r = P{9)®P{X)R , with AJ 0

P(X) = AI J V / - 1 1

1 AAj

AA 0

AA

_

1

1 AA 0

A AX

1

1

A AX

AAj

1

1

AA2

AA 2 1

1 A

AA 2 1

1 AAo

A

*NI-1

%/-1 1

AAW_i

1 AAW

25

A sing0 2 cos #j

A sin § 0

A sin # A sing

p(e)=

,**(*) =

«2

Asinew_!

Asin

cos 2 _ COS 0 j A sin *1

P

=

cos

2

gw-i e

w

h

A sin 0\ 2 _ 02

2 _ cos g j

2 _ cos $i

cos

A sin *1

Asin#j

A sin^ 2

cos

2 _ ^2

A sin 02

2 cos § 2

cos

A sin «2

QNJ-I

Asin

cos

Asin

gjyy-i *JVy-l

%y-i

cos

gyyy

A sin Ow

The matrix A is symmetric and positive definite provided J] > 0, which is the case here. The non-zero elements of the matrix A are shown below for the case NI = 4 and NJ = 3. It is the classical pattern of the five-point finite-difference scheme, except that periodicity implies that the diagonal blocks have entries in the bottomleft and top-right corners.

26

3

3.1

Direct solution

Direct solution algorithm

We obtain the direct solution of the NK Helmholtz problems by exploiting the separability, and expanding 6 in A-direction eigenvectors that diagonalise A, thus NI

1=1

In the case of a uniform mesh, the eigenvectors C^71) are the usual Fourier modes, whereas in the case of a variable mesh they are the solutions of the following generalised eigenvalue problem: Pu(pU] = £lP(A)(p['\l=l,NI , with the orthogonality property (p{nTP(X)(pU]=8IV,I,r=\,NI . The projection of each Helmholtz problem on these modes gives

A [ / V m =[?lP(0)-eIP,(0)-Pg0]
=r [ / ] ,

which in our discretisation is simply a set of symmetric positive-definite tridiagonal problems, and are solved by Gaussian elimination without pivoting. The algorithm can then be summarised as: NI

1.

analysis of the right-hand-sides: r ,

= / t(p\ xu

,j=l,NJ,

;=i

2.

solution of a set of NI tridiagonal problems, of dimension NJ each:

3.

synthesis of the solutions: 0tj = ^ ^ V ] ^ J=1>NJ.

NI

i=i

In the case of the uniform mesh, the analysis and the synthesis steps (1) and (3) are implemented with the Fast Fourier transform (FFT) algorithm. In the case of the variable mesh these steps are implemented with a full-matrix multiplication. The operation count for each of these steps in 3D and for large NI, is NKxNJxNIx(Cxln NI) and NKxNJxNIx(NI) where C is some constant, for the uniform and variable case respectively. The direct solution algorithm presented above has a great potential for vectorisation since at each step we have recursive work along only one direction at time, leaving the two others, in 3D, available for vectorisation. The principle to be followed to obtain good performance on the SX-4/5 are the same as those to obtain good performance on vector processors in general, see Levesque and Williamson

27

[3] for an introduction to the subject. For optimal performance, one must: avoid memory bank conflicts by using odd strides, maximise vector length by vectorising in the plane normal to the recursive work direction, avoid unnecessary data motion. We have at our disposal the multiple FFTs of Temperton [4] which vectorise very well, furthermore there exists on the SX-4/5 a highly optimised matrix product subroutine (MXMA), which on large problems nearly sustains the peak performance of the processors. The direct solver, in either the uniform mesh or the variable mesh case, is therefore highly optimised. 3.2

High-order horizontal diffusion

Implicit high-order horizontal diffusion has been recently implemented in the GEM model. It gives rise to a generalised Helmholtz problem of the form

(V + (-Arwer)

= R ,/7>0,

where power is an integer ranging from 1 to 4 and is half the order of the horizontal diffusion. The method used for the direct solution of this generalised Helmholtz problem proceeds similarly to Lindzen [5], and leads to almost the same algorithm as the previous one, except that in the second step we now have to solve block tridiagonal problems of size NJ x power. 3.3

Parallelisation of the direct solution algorithm

Observing that the communication bandwidth on our systems is large enough so that the volume of data communicated is less important than the communication latency (from software and hardware) in determining the communication cost of an algorithm, then minimising the number of communication steps would likely minimise the communication cost. This means that the remapping (or transposition) method might be a good choice. In this method the algorithm reads for a PxQ processor grid: 1. transpose the right-hand-sides to bring all the /l-direction grid points in each processors, 2. analyse the right-hand-sides, 3. transpose the data to bring all the ^-direction grid points in each processors, 4. solve the (block) tridiagonal problems along the ^-direction, 5. inverse step 3 with the solutions of step 4, 6. synthesise the solution, 7. inverse step 1; the solutions are again distributed in X-9 subdomains. This algorithm necessitates 4 global communication steps (1,3,5,7). The data partitioning used in the GEM model is described in Toviessi et al. [6].

28

3.4

Acceleration of the direct solution algorithm for the variable mesh case

To improve the speed of the direct solution algorithm on variable meshes, we need to find better algorithms to implement the full-matrix multiplication. A known possibility is the Strassen method of matrix multiplication [7]; another one is to take advantage of symmetries to reduce the number of operations. We have tried both separately. We have replaced the ordinary matrix product MXMA by the Strassen method. The operation count in the matrix product is then reduced to NKxNJxNIx((7/syxNI) where x is the number of recursive calls to the Strassen subroutine. We always construct variable 2-meshes that are symmetrical about n, i.e. if X is a grid point then 2n-X is also a grid point. A consequence of this mirror symmetry is that the eigenvectors ( ^ ) used for the separation are either odd or even under this symmetry operation, and can be represented with about half the number of degrees of freedom (NI/2). Furthermore the orthogonality of these modes implies that only the even (odd) part of the right-hand-side will project on the even (odd) modes. Since the separation of the right-hand-side in even and odd parts is almost free, exploiting this symmetry leads to a reduction by a factor of two of the cost of the projection step, the NIxNI projection matrix being replaced by 2 matrices of size NI/2xNI/2 each. We have the same reduction for the synthesis (inverse projection) step. The operation count in the matrix product is then reduced to NKxNJxNIx(NI/2). 4

Iterative solution

Since our problem is symmetric positive definite, the Preconditioned Conjugate Gradient (PCG) method is the natural choice for an iterative method. The problem is then to choose an adequate preconditioner. It should make the scheme converge rapidly while requiring only a few operations per grid point, two conflicting requirements, and finally it should vectorise because of the large discrepancy between vector and scalar performance on the SX-4/5. Preconditioning with the diagonal would have a low cost per iteration but it would require too many iterations because of two difficulties: the metric factors reduce the diagonal dominance in the polar regions, and the variable resolution induces large variations in the mesh sizes. Both difficulties would be alleviated by non-diagonal preconditioners, which are more effective at transporting global information at each iteration. We have examined preconditioners that have the same complexity as ILU(O), the simplest incomplete LU factorisation, and we have developed code for the preconditioning step (triangular solve) that exploits the data regularity and yields vector performance.

29 Writing A as

A = L,+D A +L r A , where LA is the strict lower triangular part of A and DA its diagonal, and writing the preconditioner M as

M = (I + LAD) D_1(I + DL^) , we obtain different preconditioners depending on the choice of the diagonal matrix D. Choosing D = DA, gives the symmetric Gauss-Seidel (SGS) preconditioner. The ILU(O) factorisation which gives a preconditioner that has the same sparsity structure as the matrix A is obtained by requiring that the diagonal of M be the same as the diagonal of A, viz.

diag(M) = DA , from which D is determined recursively. Note that M is also symmetric and positive definite [8]. The modified incomplete LU(0) factorisation [MILU(O)] satisfies the constraint that the row sum of M equals the row sum of A, thus

Me = Ae,e = (lX-J)T . We have experimented with these three preconditioners: SGS, ILU(O), and MILU(O). The set-up of each factorisation is different, but the triangular solve can proceed with the same subroutine. It is worth emphasising that our particular application requires the solutions of several linear systems with the same coefficient matrix but different right-handsides. In this case the cost of the set-up is amortised over many solutions and can be neglected. The operation count per iteration is NIxNJxNKx(2+3+5+5) where the various constants are for the two scalar products, the three vector updates, the matrix-vector product and the preconditioning step respectively in our implementation. 5

Numerical results

Uniform-mesh experiments were presented in Qaddouri et al. [2], and we present here results of variable-mesh experiments carried out on the SX-4 and SX-5. Table 1 summarises the three test problems considered, they are denoted PI, P2 and P3 respectively.

30 Table 1. Test problems.

5.1

Problem

Resolution

NI

NJ

NK

PI

0.22°

353

415

28

P2

0.14°

509

625

28

P3

0.09°

757

947

28

Performance of the direct solution algorithm

Figure 1 gives the execution time per grid point for the three test problems on a single SX-4 processor of the basic code (MXMA, power=l), while Fig. 2 gives the equivalent SX-5 results. As predicted by the operation count, and since all the operations involved in the direct solution algorithm are well vectorised, we obtain a straight line. The problem PI shows some overhead on the SX-5 related to the fact that the SX-5 is more sensitive to the vector length than the SX-4. We have added a fourth point (AT/=255,ArJ=360)) in Fig. 2, and we note that this point lies on the predicted line, which confirms the greater sensitivity of the SX-5 performance on the vector length. CO

o »

1.5

-

1 ' ': Q. i

•g

5, °-5 r

ID 200

300

400

500

600

700

800

Ni Figure 1. Execution time/(NIxNJxNK) in microsecond, on one SX-4 processor.

o

05

cr

31

0.4

CD W O

E •*-> c

o

Q.

u.a 0.2

•D 1—

O) 0)

E i—

0.1 0 200

300

400

500

600

700

800

Ni Figure 2. Execution time/(NIxNJ'xNK) in microsecond, on one SX-5 processor.

Figure 3 displays the execution time on a single SX-5 node for the problem P2 as a function of the number of processors, and for the three variants of the direct solver (MXMA, Strassen, Parity). We note that the use of the Strassen method decreases the execution time, and using the parity nearly cuts the execution time by two as expected. The execution time scales very well in the MXMA and Parity cases. We do not have a highly optimised code for the Strassen matrix product, but in principle the algorithm could be implemented using MXMA, and we would expect a gain of about 20% with respect to the basic MXMA version. Furthermore we could easily obtain the combined benefits of the Strassen and parity methods if needed.

I

o

a>

—•—MXMA

^ ^ ^

1

to

000

^ ^ 5 ^

-

- • — Parity - o — Strassen

: :

E ".*—> c o O 0) X LU

I

100

^-k i 10

Number of processors Figure 3. Execution time in millisecond, on one SX-5 node for problem P2.

32

In Fig. 4 we plot the execution time of the two well scalable variants of direct solver (MXMA, Parity) for problem P2 as we increase the number of processors using two SX-5 nodes this time. We note that the two variants continue to scale very well up to 20 processors, after that the global communications (4 transpositions) dominate the cost of the direct solver.

1

o

E +2

c _o

LU

1

1

1 1

^ \

•

^

1

^sJ

^-j\ 100

-

MXMA Parity

--

•w

3 O CD X

1

• *

W

E, 1000 i .•»> N .

1

;

% * . > j ^ - ^*

->•

10

Number of processors Figure 4. Execution time, in millisecond, on two SX-5 nodes for problem P2.

Table 2 shows the results, on one SX-5 node, of the MXMA variant of the direct solver when it is applied to solve the Helmholtz problem and the sixth-order (power=3) implicit horizontal diffusion problem for the grid P2. We note that the execution times for the two problems are comparable. The small difference comes essentially from the tridiagonal solve step in the direct solution algorithm. Table 2. Execution times of the Helmholtz and horizontal diffusion solvers on the P2 grid.

Number of processors 1 2 4 6 9 12

Helmholtz problem (msec) 6th-order diffusion (msec) 2617 2288 1164 1336 682 613 462 400 336 273 261 206

33

5.2

Performance of the iterative solution algorithm

Our implementation of PCG uses diagonal storage, which makes the matrix-vector products more efficient by taking advantage of vectorisation [9], but we do not use BLAS for either the dot product or the vector updates. In order to vectorise the preconditioning step, we perform the triangular solve by moving along hyperplanes [10]. In our experiments the right-hand-side is computed from a known random solution ranging from -1 to +1. The initial value for the preconditioned conjugate gradient iterations is the zero vector, and we iterate until the maximum difference between the exact solution and the PCG solution at each grid point is less than 10~3. Table 3 shows the results on a single SX-5 processor of applying PCG with the three preconditioners to the three test problems. We note that all these methods converge. ILU(O) an MILU(O) constitute a marked improvement over the SGS preconditioner, and they seem to exhibit some kind of robusmess as the grids are refined, this is probably due to the fact that the Helmholtz constants 77 are inversely proportional to the timestep squared, and the timestep varies linearly with the minimum grid size. Moreover MILU(O) converges nearly twice as faster as ILU(O), and solves the three test problems in a small (almost constant) number of iterations. Table 3. Number of iterations of PCG with the three preconditioners.

Problem PI P2 P3

PCG-SGS 151 185 201

PCG-ILU(O) 19 19 20

PCG-MILU(O) 8 8 9

Figure 5 shows the execution time per grid point, on a single SX-5 processor, as function of NI of the three variants of direct solver, and the preconditioned Conjugate Gradient solver with MILU(O). The PCG-MBLU(O) data points should be on a constant line according to the operation count, and this is roughly the case for the bigger problems. That it is not the case for PI reflects again the sensitivity of the SX-5 performance to the vector length, as the average vector length is smaller for that problem. We note that even though the PCG-MILU(O) requires fewer operations per grid point than the Parity variant of the direct solver, both have the same execution time per grid point for P2. This is due to the difference of vector performance between the two solvers.

34 O CD W

0.5

^

0.3

. ' ''

o o °-4 ~7 E

n

•

i

i

i

i

|

200

i

i

i

i

,

.

,

i

.

i

.

,

.

,

j

.

,

.

,

1 1 ^ " " " IJ ;

S<^Y **"• J — -i— —' 1 . ^ ^ — i

!

i , , , ,

.i

o

i

i
0.2

.J? 0.1 E LI

|

|

o Q.

i

—•—MXMA - • — Parity - n — Strassen --e--PCG(MILU(0))

300

400

500

i , . , , 600

i , , , , 700

800

Ni Figure 5: PCG-MILU(O) vs. three variants of the direct solver on a single SX-5 processor.

6

Conclusions

Three variants of the direct solution algorithm for elliptic boundary value problems on the sphere arising in the context of the GEM variable-resolution grid-point model have been presented. On the NEC SX-4/5 the parity variant is the fastest, as expected halving the execution time of the basic variant. This gives us efficient solvers for both the usual Helmholtz problem and the implicit high-order horizontal diffusion. These solvers have good vector and parallel performances. As a preliminary investigation of iterative solvers, we have studied the preconditioned conjugate gradient method with three closely related preconditioners: Symmetric Gauss-Seidel, ILU(0) and Modified ILU(0). A special effort was devoted to reach a reasonable vector performance. Of these three preconditioners MILU(0) is the most efficient. A comparison with the direct solver shows the potential of PCG-MILU(O) for large problems. Parallelisation of the iterative algorithm is under way, we are also investigating other types of preconditioners such as multigrid. Acknowledgements We thank Lise Rivard for her assistance in accessing the SX5.

35

References 1.

Cote J., S. Gravel, A. Methot, A. Patoine, M. Roch and A. Staniforth, 1998: The Operational CMC-MRB Global Environmental Multiscale (GEM) Model. Part I: Design Considerations and Formulation. Mon. Wea. Rev. 126, 13731395. 2. Qaddouri, A., J. Cote and M. Valin, 2000: A parallel direct 3D elliptic solver. In "High performance computing systems and applications" A. Pollard, D. J. K. Mewhort, D. F. Weaver (ed.). Kluwer Academic Publishers, 429-442. 3. Levesque J. M., J. W. Williamson, 1988: A Guidebook to FORTRAN on Supercomputers. Academic Press, San Diego, California, 218pp. 4. Temperton C , 1983: Journal ofComp. Physics 52, 1-23. 5. Lindzen R. S. and H. L. Kuo, 1969: A reliable method for the numerical integration of a large class of ordinary and partial differential equations. Mon. Wea. Rev. 97, 732-734. 6. Toviessi, J. P., A. Patoine, G. Lemay, M. Desgagne, M. Valin, A. Qaddouri, V. Lee and J. Cote, 2000: Parallel computing at the Canadian Meteorological Centre. Proceedings of the Ninth ECMWF Workshop, Use of High Performance Computing in Meteorology, Reading, UK, November 13-18, (this volume). 7. Douglas C. C , M. Heroux, G. Slishman and R. M. Smith 1994: GEMMW: A portable level 3 BLAS Winograd variant of Strassen's matrix-matrix multiply algorithm, Journal ofComp. Physics 110, 1-10. 8. Meijerink, J. and H. Van der Vorst, 1977: An iterative method for linear systems of which the coefficient matrix is a symmetric M-matrix. Mathematics of Computation 31,148-162. 9. Barrett R., M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine and H. Van der Vorst, 1994: Template for the solution of Linear Systems: Building Blocks for iterative methods, 2nd Edition. SIAM, 124pp. 10. Van der Vorst H., 1988: (M)ICCG for 2D problems on vector computers. In Supercomputing. A. Lichnewsky and C. Saquez (ed.). North-Holland.

36

IFS DEVELOPMENTS

DENT, HAMRUD, M O Z D Z Y N S K I , SALMOND, TEMPERTON ECMWF

Shinfield Park, Reading, UK

The Integrated Forecasting System has continued to develop technically, both from the point of view of the model and more particularly, within 4D-Var. (1)

4D-Var performance.

The current observed computational performance of a 4D-Var data assimilation cycle will be examined, looking at parallel efficiency in comparison to the forecast model (2) Parallel aspects of the model: The semi-Lagrangian communication is the most costly part of the message passing and impacts scalability with increasing numbers of MPI processes. This has been addressed by introducing an 'on demand' scheme which reduces the message volume to about 25% of the original. Some performance comparisons will be presented. (3) The adjoint of semi-Lagrangian advection - technical challenges Implementation of the adjoint of the SL advection technique requires a stencil of grid-points in the neighbourhood of each departure point to be updated. The memory access pattern of this procedure is pseudo-random, giving severe performance problems to vector and scalar architectures. Since this updating can be carried out by several MPI processes simultaneously, there is a significant challenge in carrying out these operations in a reproducible way (independent of processor configuration).

1

Introduction

The Integrated Forecasting System (IFS) provides the production software for the forecast model and the 4D-Var data assimilation system at ECMWF. The scientific descriptions have been presented earlier in this workshop. This paper will report on performance issues relating to the current computer platform (Fujitsu VPP5000) and also on developments carried out in order to address performance on alternative platforms such as those based on micro-processor SMPs (symmetric MultiProcessors).

37

2

4D-Var Timings

To set the scene, the basic performance characteristics of a single 4D-Var analysis cycle are: • • •

resolution is T511/T159 L60 (outer and inner loop spectral resolutions and vertical levels) production run utilizes 24 processors of VPP5000 taking 93 minutes wall clock time

We are interested in optimizing the performance both on the VPP5000 and also on possible future computing platforms that may be based on quite different architectures. A 4D-Var cycle consists of 7 executions of the IFS. •

• •

4 of these are so called "trajectory" runs which compute the differences between the high resolution model and the observations. Together, these constitute 33% of the run time. 2 executions are "minimisation" steps in which a cost function is minimized at the inner (lower) resolution. These constitute 53% of the run time The last execution is the computation of the forecast error estimate which takes 3%.

In order to make best use of today's production platform, the efficiencies with which these steps execute is of considerable interest. Some reasons for poor parallel scaling may be: • • • • • 2.1

Serial work (outside of the parallel executions) Imbalance in the distribution of observations across the parallel processes Vector efficiencies (in comparison to scalar platforms) Synchronous I/O Granularity of the "low-resolution" 4d-Var inner loop (T159) Serial Work

By examining the wall-clock time stamps in the 4D-Var job output, the following overheads are measured: • • •

Execution of Kshell scripts constitute 0.5% of the run time Startup of the executables amounts to 0.7% Time to start the parallel environments (of which there are 7) using MPIJNIT is 1.4%.

38

This cost is mostly associated with the MPI/2 layer, which has the potential to create processes dynamically. As IFS does not use this feature, it may be bypassed by, for example, direct calls to the vendor specific message-passing routines or by use of an MPI version 1 library. This reduces the startup cost to 0.2% 2.2

Observation Imbalances

In order to optimize observation processing, IFS allocates observations of a specific type in equal numbers across the available processors. However, for most observation types, the quantity of data within each observation varies unpredictably. For example, the number of reported levels within a radio-sonde balloon observation (TEMP) depends on the height reached by the balloon. Since it is the data quantity that determines the computational load, the logic will not produce a perfect load balance. Table 1 shows this for a sample analysis cycle.

OBSERVATION TYPE SYNOP AIREP SATOB DRIBU TEMP PILOT RADIANCES PAOB SCATT

ACTIVE OBSERVATION S 23479 22444 38727 1637 586 653 69653 157 6282

ACTIVE DATA 36035 57681 77454 2154 56571 29204 181354 157 25128

IMBALANCE ACROSS PES

PROCESSING COST

%

%

5.4 0.5 0.6 1.7 15.6 20.5 2.7 0 0.1

0.40 0.30 0.50 0.06 0.10 0.07 3.00 0.04 0.10

Table 1: quantities of observational data and their processing costs In this table, the term active is used to denote observations which have passed through a selection phase and which will therefore take part in the 4D-Var analysis minimisation. The table shows that, for TEMP observations, there is almost a factor of 100 between numbers of observations and data items, and that there is a processing load imbalance of more than 15%. However, this has a negligible effect on the overall performance efficiency because the total cost of processing TEMP data is so small (0.1%). On the other hand, the cost of processing RADIANCE data from satellite based instruments is much higher (3% of total). Fortunately, in this case, the measured load imbalance is much smaller (2.7%). Overall, we can deduce that inefficiencies introduced by poorly load balanced obervational processing is a very small effect.

39

2.3

Vector efficiencies

Some aspects of observation processing tend to be less well suited for vector oriented hardware. Table 2 compares vector efficiencies for the IFS model with those measured for the major components of 4D-Var that handle observation processing.

IFS model (T511L60) 4D-Var total Observations in first trajectory Observations in first minimization Observations in second trajectory Observations in second minimization Forecast error estimates

% VECTOR CODE 89 72 49 58 33 57 38

Table 2: vector efficiencies compared From these figures, it is clear that we are not making as good use of the vector hardware when processing observations in comparison to the high vector efficiency inside the model. While it may not be possible to improve this situation much, a scalar-based architecture is likely to perform relatively well on this type of code.

2.4

I/O

A variety of differing I/O methods is used within 4D-Var for differing data types. The majority is carried out using C I/O: • • • •

Input of initial fields (as in the forecast model) Output of fields through the post-processing mechanism (as in the model) Observation handling via an Observational DataBase (ODB) using C I/O Additional files for communication between 4D-Var job steps

The output of fields within the model is handled asynchronously and normally proceeds with minimum disruption to the forecast run time. However, in 4D-Var, the fields are output at the end of the execution when there is little opportunity to take advantage of asynchronous I/O. The more interesting component to examine is the I/O associated with observational processing through the ODB software since this is expected to rise sharply in the future as substantially more satellite-based data

40

becomes available. Table 3 gives an indication of the data volumes involved for a single 4D-Var cycle.

JOB STEP First trajectory (screening) Odbshuffle First minimization Second trajectory Second minimization Third trajectory Forecast error estimates

INPUT MBYTES 566 566 74 74 80 85 85

OUTPUT MBYTES 566 74 74 80 85 85 -

Table 3: ODB I/O quantities 2.5

Low resolution Inner loops

For computational economy, the minimisation steps in 4D-Var are executed at a lower horizontal resolution. Additionally, the first minimisation is carried out using only a very low cost physics package. On the other hand, within the adjoint code, the semi-Lagrangian computations are more expensive than in the forward model and the number of communications is higher. Overall, these differences mean that the computational granularity is smaller than for the forward model. From the vector length point of view, when run on 12 processors, the maximum vector length is 2977. When run on 24 processors, this reduces to 1489. Although not disastrously small, this leads to some loss of vector performance. 2.6

Overall performance

The parallel efficiency of 4D-Var in comparison to the T511 resolution forecast model is compared in figure 1. Model and analysis runs using 12 processors and 24 processors are compared. Comparison of the model running at the outer resolution of T511 and at the inner resolution of T159 clearly shows some loss. Overall, the 4D-Var minimization steps perform slightly better than the low resolution model. The trajectory steps exhibit the poorest performance due to observation I/O and processing costs.

41

• T511 full model • T159adiabatic model • minimisation • trajectory

Figure 1: model and 4D-Var parallel efficiency % (12 / 24 Pes)

Parallel efficiency of 4D-Var can also be illustrated (figure 2) by plotting throughput (in terms of analysis cycles per day) for differing machine partitions. Additionally, in order to quantify the cost of increasing numbers of observations from satellite data, an estimate has been made of the processing costs of projected radiance data quantities for the year 2002.

4?.

D12 Pes • 16 Pes • 20 Pes • 24 Pes • 32 Pes

0)

a, $ 5>> u

f today

2002

Figure 2: change of 4D-Var throughput for different machine partitions and with increasing numbers of observations

43

3

3.1

On-demand semi-Lagrangian Communication

Halo exchange for semi-Lagrangian Interpolation

In the MPI parallelisation of the IFS, the interpolation to determine quantities at the departure points in the semi-Lagrangian advection requires a relatively large halo width. The halo width depends on the maximum wind strength, the time step and the resolution. For example a halo width of 7 grid points is needed for T511L60. The ratio of the points in the fixed width halo to the number of core points increases with the number of MPI processes. This gives poor scalability particularly if the whole halo is communicated. The advantage of the on-demand scheme is that the number of points to be communicated depends on the strength of the wind locally.

Figure 3: semi-Lagrangian advection, showing departure point (d), arrival point (a) and stencil (denoted by circle) surrounding the departure point.

44

3.2

The effect of the semi-Lagrangian communications on scalability

The effect on the scalability of the original scheme where the whole halo was communicated can be seen as the number of MPI processes is increased. For example a run on the CRAY T3E using up to 1024 processors showed a deterioration of scalability as the number of processses increased above 512. Examination of the profile showed the main cause of this was the increasing cost of the semi-Lagrangian communications.

10000,

-T3E/1200 - Ideal Speedup 1515

100 128

—i

256

512

1024

Number of Processors Figure 4: IFS forecast model performance (T213L31/RAPS 4.0) on Cray T3E/1200 3.3

IFS implementation of 'on-demand' semi-Lagrangian communications

The 'on-demand' scheme first sends the complete halos of the 3 wind fields, then calculates the departure points and creates a mask containing departure points together with the stencil needed for the interpolation. A subsequent semiLagrangian communication first sends the mask and then communicates only the

45

actual points needed in the remaining 18 model fields by packing and unpacking the message buffers using the mask. Finally the semi-Lagrangian interpolation is done for these fields. The introduction of the on-demand scheme increased the number of messages by a factor of 3, but the total amount of data is reduced by 5 times. In the design of the 'on-demand' scheme it was found that a simple scheme gave the best performance. For example the mask used to decide which points to communicate was taken to be a 2-dimensional array with no level dependence. This meant that the points communicated were not reduced to a complete minimum as the wind is stronger at the higher levels and therefore more points are needed in the halo. However, the copy loops to create the buffer to be communicated were made much simpler. The original static scheme communicated the complete halos of 21 fields, then calculated the departure points followed by the semi-Lagrangian interpolation. The on-demand scheme was also implemented in the semi-Lagrangian communications in the tangent-linear and adjoint models, as used in 4D-Var.

3.4 3.4.1

Other 'on-demand' schemes UKMO scheme

This on-demand scheme is only applied in the East-West direction. The main reason for introducing the scheme is that, unlike the IFS which has a reduced grid near the North and South poles, the UKMO model has a latitude /longitude grid. This means that because of the convergence of the meridians at the poles there would be very large number of points in the semi-Lagrangian halos in the East-West direction near the poles.

3.4.2

NCAR CCM scheme

In the NCAR scheme the on-demand scheme communicates the points needed for the previous timestep with an extra margin for safety.

46 3.5 3.5.1

Tests of 'on-demand' semi-Lagrangian communications scheme Tests on VPP5000

Tests to show the benefit of the on-demand scheme were run on the Fujitsu VPP5000 at T511L60 with 16 processors. The amount of the semi-Lagrangian halo data communcated was reduced by 80%. The total time for these communications was reduced from 5.5% to 3.9% of the total run time. 3.5.2

Tests on CRAY T3E

The performance of the on-demand scheme at larger numbers of processors was demonstrated on the CRAY T3E. The time spent in semi-Lagrangian communications as a percentage of the total time reduced from 6.3% to 2.0% with 128 processors and from 7.2% to 3.0% with 512 processors.

4

The adjoint of the semi-Lagrangian advection - technical challenges

Until recently the minimisation steps used in IFS 4D-Var were based on an Eulerian advection scheme, which had the obvious disadvantage of requiring a relatively short timestep to satisfy the CFL criterion. In contrast, a semi-Langrangian scheme allows a longer (albeit more expensive) timestep to be used resulting in an overall saving in execution time as shown in Figure 5. In this figure we can see the SL scheme has a greater benefit in the 2nd minimisation where physical parametrization is enabled.

47

T106 Minimisation (EUL v S/L) lOAfk -

•o

1 S

low 1600 -j 14001200 1000 800 -

• I 1

600 400 200-

0 -

•1 •

^

L 1

m

IL

!

• EUL(tstep=7.5 min) • S/L(tstep=60 min)

1^B1 1 1• 1st Min.

2nd Min.

Figure 5: EUL v S/L minimisation

4.1

Vectorisation

The initial form of the adjoint S/L interpolation routines presented a technical challenge as they had to run with vectorisation disabled due to a dependency issue illustrated in Figure 6. In this figure S/L variable X is being updated by a 32 point stencil at the departure point, where index pair (N,L) denotes the grid point at the arrival point.

DO L=1,LEVS (60) !OCL SCALAR DO Nsl^NPROMA (2047) X(K(Nf1:32,L))=X(K(N,1:32,L))+.

Figure 6: Simplified extract of S/L interpolation routine A dependency exists, as it is very likely that some of the stencil points are the same between successive iterations of both the inner (NPROMA) or outer (LEYS) loop.

48

The solution to this dependency issue was to recognise where independence was present and vectorise over this, that is, • •

stencils points variables (e.g. U,V,W,T)

This resulted in average vector lengths for adjoint S/L interpolation of about 100 (not all interpolations are 32 point). A further refinement to this vectorisation approach was to determine globally the maximum vertical separation of departure points from arrival points (calculated during the trajectory calculation of the minimisation step). With this information it was then possible to perform stencil updates over levels by taking strides slightly greater than the above maximum vertical separation. The above vectorisation scheme resulted in vector lengths in excess of 2000, which was sufficient for good vector performance. For scalar machines it would probably be sufficient to keep the original loop structure (as shown in Figure 6) provided the inner (NPROMA) loop was not unrolled.

4.2

Reproducibility

For IFS, it is always required that it should produce identical results when rerun in an identical environment to earlier runs. At ECMWF we have gone further and required that all IFS configurations (forecast model, 3D-Var, 4D-Var) should also produce identical results when run on different numbers of processors. From this point, we shall assume reproducibility refers to the latter case. The adjoint of the S/L presented a 2nd challenge as the initial implementation produced slightly different results when run or a different numbers of processors or when grid-point space was partitioned differently on the same number of processors.

49

stencil points are read - forward model updated - adjoint

Figure 7: Stencil update The problem can be seen in Figure 7 where grid points either side of a processor boundary are shown. In this figure we have simplified the stencil to only 4 grid points surrounding some departure points. It is clear that to achieve reproducibility it is essential that stencil points are updated in a consistent order - as they would be on a single processor. One such ordering would be to iterate over arrival points for all levels (high to low) at one grid point before proceeding onto the next grid point, where grid points are ordered by longitude first (E to W) and then latitude (N to S). This is all fine until one considers that stencil updates can occur on different processors to the processor where the arrival points are located. An initial nonreproducible approach performed these non-local updates by accumulating this data in a halo (update space) and later sending this data to the processors owning the data. This approach is represented in Figure 8 by a set of overlapping disks where the compute (iteration) space is the same as the local space (grid points owned by a processor), these being contained within a larger update space.

50

X+Y+Z^ •

local space

•

compute space

•

update space

X+Z+Y ,

-^

^fl M

Perform adjoint interpolations

^|

Send halo data to local processors

/

Figure 8: Non-reproducible approach The strategy used to resolve the non-associative update was to enlarge the compute space to include all arrival points that could possibly update the local space. This in turn required a further halo for security (it was too difficult to determine a priori which arrival points correspond to stencil updates outside of the compute space). Note that once all interpolations have been performed all non-local data is thrown away - there is no need for extra communications to update other processors.

51

Mk• \ ^jyj ^

•

local space

•

compute space

•

update space

x

y

Figure 9: Initial reproducible approach (2 halos) Reproducibility was achieved, however at a cost - a typical 4D-Var experiment required about 40 percent more wall time than the same non-reproducible version.

Figure 10: Current reproducible approach

52

The solution to this cost issue was found by recognising that the minimisation process performs many iterations (about 50) which are identical with respect to the mapping of arrival points to departure stencil points (at each time step). Given this characteristic, we have reduced the size of the compute space (before the start of the 2nd iteration) to those arrival points that influence stencil points in our local space. With this optimisation, a typical 4D-Var experiment requires only 5 percent more wall time compared with the same experiment when run in non-reproducible mode.

5

Summary

4D-Var is now well established at ECMWF as the production data assimilation code. The current state of IFS (model and analysis) from a performance point of view can be summarised as follows: • • • • •

The model efficiency has been improved over many years OpenMP directives exist in the model code for the benefit of SMP platforms. They are currently being added to 4D-Var code. Alternative message passing modes such as "non-blocking" are being added where appropriate. 4D-Var is MUCH more complex than the forecast model. It has existed for a relatively short time. There is an ongoing effort to improve its efficiency. "On-demand" semi-Lagrangian communication is an example of this ongoing effort and is rather successful in its aim to improve parallel scalability by reducing halo communication.

The adjoint of the SL advection has posed some difficult problems, particularly for vector platforms. The requirement for reproducibility across varying machine partitions has imposed significant development costs, and so far, also run time costs. It is hoped that these costs can be further reduced in the future.

53

Performance of parallelized forecast and analysis models at JMA Yoshinori Oikawa Numerical Prediction Division, Forecast Department, Japan Meteorological Agency 1-3-4 Ote-machi,Chiyoda-ku,Tokyo, 100-8122,Japan E-mail: [email protected] Abstract We have developed parallel versions of JMA's Global Spectral Model(GSM), Regional Spectral Model(RSM), and Objective Analysis System, with the intention of implementing them on the distributed-memory parallel supercomputer which will replace our currently operational machine in March 2001. The forthcoming supercomputer is sure to bring us the possibility of further improvement of JMA's numerical weather forecasting system. Here are presented a brief description of our approaches to parallelizing NWP codes and some references to performances of them. 1. Introduction Public needs for more accurate, timely, and fine-grained weather forecast know no limit, for that can mitigate damage caused by severe weather as well as help our economic activities to be more efficient. A computer of much higher performance than now is prerequisite to meet those needs. The present mainframe at JMA, which has commenced operation in 1996, and since then produced several kinds of operational weather forecasts, will fall short of our demands before long. Single CPU performance has been boosted up to greatly an extent in the last decade and seems to have come near its ceiling. Under such circumstances it is the advent of a massively parallel supercomputer that has brought us the prospect for further amelioration of numerical weather prediction. We intend to renew our numerical analysis and prediction system(NAPS), including the mainframe HITACHI-S3800, in March 2001, in the expectation that our forthcoming supercomputer HITACHI-SR8000 should render us enough competent to produce weather forecast of still more accuracy than ever, and do the people a good service by giving more useful information for being well-prepared against weather disasters. With this aim in mind, we have expended a great deal of effort in reshaping our existing models into parallelized as well as developing such powerful new techniques as 3-D or 4-D variational data assimilation and ensemble forecasting. Currently JMA's short-range NWP system consists of Global Spectral Model and three limited area models(Regional Spectral Model, Meso-scale Model, Typhoon Model), and four kinds of Objective Analysis Systems to provide each forecast model with initial conditions. For the time being after the replacement of the supercomputer, the suite of these forecast models and associate analysis systems, with the forecast domain extended and resolution enhanced(Table 1), will have been in use as the routinely operational systems, until a nonhydrostatic limited-area model and 3-D or 4-D variational data assimilation system, now under development, will come to be serviceable. In this paper we describe a summary of

54 parallelization strategy and performance tests of GSM, RSM and the Objective Analysis on it.

Grid Size Number of Grids Vertical Levels Forecast Hours

MSM Near Future

RSM

GSM

Current

Near Future

Current

20km

20km

10km

257 x 217

325 x 257

TYM Current

Near Future

10km

40km

24km

129 x 129

361 x 289

163 x 163

271 x 271

0.5625 deg 640 x 320

Near Future 0.5625 deg 640 x 320

30

40

36

40

36

40

15

25

84

216

51

51

12

18

78

84

Current

Table 1. Current and near-future specifications of JMA's forecast models

CPU Architecture Memory Architecture Operation System Number of CPU Single CPU Performance Peak Performance

SR8000 RISC DISTRIBUTED UNIX 8x80 1.2GFLOPS

S3800 Vector SHARED VOS3 4 8GFLOPS

768GFLOPS

32GFLOPS

Table 2. Specifications of the coming supercomputer HITACHI-SR8000 compared with the currently operational HITACHI-S3800 2. Basic Strategy for High Performance Computing SR8000 offers three levels of methods of achieving high performance, that is, pseudovector processing, multi-thread parallelization, and inter-node parallelization. The troublesome details of the first and second levels we may leave to the compiler, but some directives to it for the optimization of do-loops too tangled to be analyzed without a programmer's awareness. As for the third level, we adopted Message Passing Interface(MPI) library as the approach with attention to its good portability and potential for high performance, rather than High Performance Fortran(HPF) that could spare us the trouble of parallel programming but remains to be improved in performance. 3. Parallelization of Global Spectral Model GSM will have several essential roles in our new forecasting system; medium-range deterministic forecasting, medium-range and long-range ensemble forecasting, as well as providing boundary conditions for Regional Spectral Model and Typhoon Model.

55 There are two computational phases in one time step in GSM, as shown in Fig.l. Gridpoint computations, including the evaluation of physical processes and calculations of nonlinear advection terms, are performed in the phase 1, where the sphere is partitioned into some latitude bands, cyclically assigned to each node lest remarkable load imbalance should arise(Fig.l(a)). Grid point variables on each latitude are transformed into Fourier coefficients using Fast Fourier Transform algorithm, followed by the rearrangement for the phase 2, where the forecast space is divided according to Fourier wavenumber and assigned swingingly to each node. Subsequent Legendre transforms generates the array of spectral coefficients in the shape of a triangle. In both phases a vertical column is retained in single node so that no other communication than the transition between two phases is required. Phase 1 : Grid Space l a t i t u d e - b a n d node number number

Figure 1 (a): Latitude bands assigned cyclically to each node. Phase 2 : S p e c t r a l

Space

F o u r i e r wavenumber n o d e number

Figure 1 (b): A triangular array of spectral coefficients assigned swingingly to each node

56 Here is a figure showing the necessary time of 24-hour forecasting of Global Spectral Model with 5,8,16 nodes. 4

3.5

3 Q. 3 TS 0/ E.5 Of

a. Vl

E

1.5

1 5

8

16

28

number of nodes

Figure 2: Performance of Global Spectral Model 4. Parallelization of limited area models JMA's limited area models are constructed after Tatsumi's spectral method[l], which made an epoch by demonstrating for the first time the feasibility of a limited-area spectral model the domain of which does not generally satisfy the wall boundary condition. Regional Spectral Model has been and will be used for 51-hour forecast covering the eastern Asia as well as providing Meso-scale Model with boundary conditions. Meso-scale Model, now in experimental service, will be used operationally for 18-hour forecast covering the area over and around Japan four times a day, mainly for the purpose of predicting mesobeta scale phenomena such as topography-induced heavy precipitation. Typhoon Model will be greatly enhanced in resolution and run four times a day when a tropical cyclone is found in the north-western Pacific Ocean. Three limited area models share the framework of dynamical calculations and physical processes. Here is described how to parallelize RSM. The actual field in RSM is decomposed into the orthogonal double Fourier series satisfying the wall boundary condition, and the 'pseudo-Fourier' series given by the outer model. The orthogonal double Fourier series is evaluated applying the Fast Fourier Transform(FFT) algorithm to the rows of grid-point values along X-direction followed by Y-direction, and after some spectral computations synthesized again into grid-point values mingled with the pseudo-Fourier series, applying backward FFT in the same way. A FFT of a row of grid-point values along X-directiion requires all of data within the row, leading to X-direction data dependency, and this is also the case with Y-direction. On the other hand, physical process computations on a grid point require the data on the neighboring grid points in the column including the grid point, leading to Z-direction data dependency. What offers the most challenging task in the context of parallelization of RSM is the implementation of these algorithms, which differ in data dependencies, upon the architecture of a distributed-memory parallel machine. The strategy we adopted is shown in Fig.3, in which the forecast domain of RSM is depicted to be a flat box. There are four algorithmic phases involved in one time step,

57 distinguished by the direction within which data dependency exists. In the phase 1, corresponding to physical process calculations, rows along X-direction and Y-direction are divided as evenly as possible, while grid points within a vertical column are assigned to one node. In the phase 2, corresponding to X-direction FFT, rows along Y-direction and vertical columns are divided, and in the phase 3, in which Y-direction FFT are performed, rows along X-direction and vertical columns are divided. In the phase 4, corresponding to spectral calculations including a semi-implicit time-integration scheme that entails a vertical data dependency, spectral coefficients are mapped in the same fashion as in the phase 1. This decomposition of the forecast domain allows computations between two subsequent transpositions to be done without extra communications. The assignment of pieces of the forecast domain to each node claims another attention. Several subsets of processor nodes can be defined at every transposition, for example, {0,1,2} make up one of those subsets at the transposition from phase 1 to phase 2 in Fig.3. This assignment reduces the frequency of communications as the transposition requires no other communication than what can be completed inside each subset coincidently.

Figure 3: Four computational phases involved in parallelized RSM

58 Easing I/O bottleneck is another matter as essential to the total performance as saving the expense of communication. It takes no little time to transfer as large data as giga bytes to external storage devices in the process of computation, and often brings about I/O bottleneck. We resorted to the expedient of having regular computing nodes attended by an extra node that devotes itself to output, as shown in Fig.4. The extra node gathers data from all of the computing nodes, and stores them to the disk while the latter can go on to the next computational stage without respect to the former. This expedient can practically get rid of the bottleneck of output, allowing of no dependence upon the system architecture. I/O node Waiting

computing nodes Computing (t=T-l) Computing (t=T) *

MP I

RECV*-

Storing

MP I SEND Computing (t=T+l)

4

Computing (t=T+2)

4

(t=T+3)

Computing

4 Wai :ing

(t=T+4)

Computing (t=T+5)

4

1

Computing (t=T+6)

4 Computing Figure 4: Behaviour of computing nodes and the I/O node

4

Fig.5 shows the necessary time and the speed-up for 18-hour forecasting of Meso-scale Model with 4 to 36 nodes accompanied by an I/O node on SR8000. Ideal

•••#•••

Figure 5: Performance of Meso-scale Model

59 5. Parallelization of Objective Analysis An objective analysis system comprises such components as decoding, quality control(QC), and optimum interpolation(OI). Decoding and QC, the costs of which are comparatively trivial, may be performed in single node, while 01, which equals a forecast model in requisite computational resources, is to be parallelized. OI has the following characteristics: •

D

The analysis at a grid point depends just upon the first-guess value at the point and quality-controlled observational data of its vicinity, but neither upon other grid points nor the order of operations. This allows of arbitrary decomposition of the analysis domain. Observational data are irregularly distributed over the analysis domain in general. This causes load imbalances among nodes.

We adopted the following tactics to make use of the arbitrarily of the domain decomposition and preclude load imbalances. 1 .One of nodes serves as the commander, others as the worker. 2.The analysis domain is devided into latitude bands. Rows included within a latitude band are a multiple of 8 in number for the efficiency of multi-thread parallelization. Those rows are assigned to each thread within single node. 3.Initially each worker is assigned one of latitude bands. 4.A worker sends a message to the commander when it accomplishes its assignment. 5.The commander orders the worker to take charge of another latitude band. Here we have a brief example of the MPI code and a summarized figure(Fig,6) to make clear the above-mentioned concept. jstride=8*m ! — the commander — if(myrank.eq.O) then ! receive the first accomplishment messages don=l,nnode call MPIJRECV(signal(n),l,MPI_iNTEGER,n,...) end do dojs=(nnode-l)*jstride+l,jmax,jstride ! latitude ! wait a message from any worker call MPI_WAITANY(nnode,req,n,...) ! assign another task call MPI_SEND(js,l,MPI_iNTEGER,n, ...) ! receive an accomplishment message call MPI_IRECV(signal(n),l,MPIJNTEGER,n,...) end do call MPI_WAITALL(nnode, ...) ! free the workers do n=l,nnode call MPI_SEND(SIG_RELEASE,l,MPI_INTEGER,n,...) end do

60 ! — workers — else js=(myrank-l)*jstride+l ! initial assignment do je=js+jstride-l do j=js,je ! performed in parallel by multi-threads within single node [OI SUBROUTINES] end do ! inform the commander of accomplishment call MPI_SEND(js,l,MPI_INTEGER,0,...) ! receive another assignment call MPI_RECV(js,l,MPI_INTEGER,0, ...) if(js.eq.SIG_RELEASE) exit end do end if commander node 0 waiting

computing node 1

computing node 2

computing node 3

J=l,E

message command!J=25) waiting

J=25,32

I message command!J=33) waiting

I waiting

J=33,40 message command!J=4l) J=41,4E

Figure 6. Behaviour of computing nodes and the commander node. Thus the workers are imposed one task after another and granted no recess, leading to the improvement of load balancing. Fig.5 shows the performance of Meso-scale optimum interpolation with 3,6,9,18 working nodes instructed by a commander node.

61

Figure 5: Performance of Meso-scale Analysis 6. Future Plans Within a few years after the replacement, we will get well-equipped with the following powerful analysis and forecasting systems on SR8000. • •

4-dimensional variational assimilation system to provide fine initial conditions for MSM, RSM, and GSM Non-hydrostatic Meso-scale forecasting model to produce reliable predictions of severe weather such as heavy precipitation early summer in Japan

7. References •

•

• •

D •

S.R.M.Barros, D.Dent, L.Isaksen, G.Robinson, G.Mozdzynski, and F.Wollenweber, 1995: The IFS model: A parallel production weather code, Parallel Computing 21(1995) 1621-1638 L.A.Boukas, N.Th.Mimikou, N.M.Missirlis, G.Kallos,1998: The regional weather forecasting system SKIRON: Parallel implementation of the Eta model, Proceeding of the 8th ECMWF Workshop on the use of parallel processors in Meteorology MPI Forum, 1995: The Message Passing Interface Standard, http://pahse.etl.go.ip/mpi/ C.Muroi,1998: Future requirement of computer power and parallelization of NWP at JMA, Proceeding of the 8th ECMWF Workshop on the use of parallel processors in Meteorology Numerical Prediction Division, JMA, 1997: Outline of the operational numerical weather prediction at the Japan Meteorological Agency Y.Tatsumi,1987: A spectral limited-area model with time-dependent boundary conditions and its application to a multi-level primitive equation model, J.Met.Soc. Japan,64,637-663

62

B U I L D I N G A SCALABLE PARALLEL A R C H I T E C T U R E FOR S P E C T R A L GCMS T. N. VENKATESH AND U. N. SINHA Flosolver Unit, National Aerospace Laboratories, Bangalore 560017, E-mail: tnvQflosolver.nal.res.in

INDIA

RAVI S. NANJUNDIAH and Oceanic Sciences, Indian Institute of Science, Bangalore 560012, INDIA E-mail: [email protected]

Centre for Atmospheric

Realization of the goal of tera-computing in meteorology, is possible only if the issue of the scalability of spectral GCMs on parallel computers is addressed. At the Flosolver Parallel Computing Group of the National Aerospace Laboratories, Bangalore, based on the experience in developing parallel computing platforms and in meteorological computing spanning more than a decade this problem of scalability in spectral computations was approached from a different angle. It was recognized that the bottlenecks in the scalability of spectral methods can not be alleviated by modifying the parallel algorithms alone and that customized hardware is needed. This lead to the concept, design and development of the FloSwitch which gathers data, processes completes the summation of spectral co-efficients over the latitudes and delivers only the relevant information to the processors. The initial results of the hardware tests of the switch have been highly encouraging.

1

Introduction

Numerical computing is very important for meteorology today. The availability of high performance computers and the success of numerical algorithms in solving certain classes of problems has made computing a viable option for research and also operational forecasts. With the current advances in computer hardware, the goal of tera-computing appears possible in the near future. To realize this however, the algorithmic issues, especially in meteorological computing have to be addressed. The lack of success in using MPPs in the past for meteorological problems have been analyzed by Swarztrauber *. Majority of the GCMs use spectral techniques due to the higher degree of accuracy and computational efficiency of this method. However the general experience has been that the scalability of the spectral models has been less than satisfactory especially on message-passing platforms. This is largely due to the large global nature of data exchange required in the transformation of grids from physical to spectral space. Most previous parallel implementations have concentrated on optimizing the code/algorithm for the architecture.

63

At the Flosolver Parallel Computing Group of the National Aerospace Laboratories, Bangalore, the policy has been to develop both hardware and software in a synergetic fashion 2 . Based on the experience in developing parallel computing platforms and in meteorological computing spanning more than a decade (since 1988 3 ' 4 ) , this problem of scalability in spectral computations has been approached from a different angle. In this paper we examine the algorithm/architecture issues relevant to the problem. We propose that designing and building specialized communication hardware is necessary for achieving this goal. 2

Computational flow of a spectral G C M

The general flow of a typical spectral GCM code is shown in Fig. 1. The linear terms in the prognostic equations for temperature, divergence, vorticity, surface pressure and specific humidity are computed in the spectral domain. The variables are then transformed into the grid space and the non-linear terms in the dynamics and the physical parameterizations computed. Efficient parallelization is made difficult due to the transformations between spectral and grid spaces.

3

Algorithm - Architecture Synthesis

It is well known that the algorithm must be in harmony with the architecture for maximum efficiency. Compilers can optimize the code to a certain extent. Usually the code is rewritten for a given architecture. On the other hand, we can ask the following question : Given the spectral algorithm, can the architecture be developed to suit the needs of this algorithm ? Improvements in scalability can be achieved by 1. Reduction of sequential portion of the code 5 . 2. Load balancing. 3. Reduction of communication time. We look at the communication aspect of the parallel architecture, to answer the above question. Since the main component of communication in a

64

Computation in Spectral space

Transform

Computation in Grid space

Transform Figure 1. Computational flow in a typical GCM

spectral GCM is the global sum (the "allreduce" call in MPI), we look at efficient strategies to achieve this. 4

Methods for Global operations

To illustrate the way in which global operations can be implemented, let us consider the global summation (Allreduce : S U M ) of arrays A0, A\, A2 and A3 (of length N) which are present on processors 0 to 3 respectively. At the the end of the Allreduce step, all the processors should contain the sum 5 = AQ + A\ + A2 + A3. (Fig. 2). Depending on the architecture, this could be implemented in different ways.

4-1

Gather, sum and broadcast

In this approach one processing element (PE) collects the arrays, calculates the sum and then broadcasts the result to the other PEs.

65

PEO

PE1

PE2

PE3

AO

Al

A2

A3

ALL REDUCE

Figure 2. ALLREDUCE : S = A0 + Ai + A2 + A3

PEO

PE1

PE2

PE3

AO

Al

A2

A3

Figure 3. ALLREDUCE : Gather - sum - broadcast

The total time (Ttotai) for this process is Ttotai = 2 * (np - 1) * tcp + (np - 1) * tsum where np is the number of processors tcp is the time required to communicate N words between two processors tsum is the time required to add two arrays of length N

66

4-2

Ring

summation

In this approach each PE communicates with the adjacent PE. This method is suitable for Bus architectures. The time required is Ttotai = (np - 1) * (tcp + tsum) + (np - 1) * t cp This is the same as that required for the previous approach. It is clear that in both the approaches the communication is sequential and not very efficient (scales as np). To reduce the time it is clear that the communications also should take place in parallel.

PEO

PE1

PE2

AO

Al

A2

PE3 A3

A0+A1 A0+A1+A2

S* s -*• Figure 4. ALLREDUCE : Ring

4-3

Binary tree summation

The Allreduce operation can be performed in less time if simultaneous communication is possible between the PEs. Simultaneous communication means

67

that exchanges between different pairs of PEs can take place at the same time. The binary tree summation is shown in Fig. 5.

PEO

PE1

PE2

PE3

AO

Al

A2

A3

A0+A1

A2+A3

s*s* Figure 5. ALLREDUCE : Binary Tree

The time required is Ttotal = l°g2 nP * ^cp + t sum ) + log2 np * tcp

The time required Ttotal increases as log2 np, which is much less than what is required for the ring and gather/sum/broadcast methods. However it should be noted that the hardware should support simultaneous communication. 5

Factors affecting scalability

From the preceding sections it might appear that the best approach is to use the binary tree summation with a very high speed communication switch. A closer look at the process reveals that there are stages where individual PEs are waiting for the intermediate sums, even with this approach. The largest factor is usually tcp which depends on the processor to switch communication

68

bandwidth and the switch bandwidth. Of these, the processor - switch bandwidth (which is of the order of 100 MBytes per sec for PCI) is usually lower and limits the speed. If the order of summation needs to be preserved, then binary summation cannot be used. 6

Case for customized hardware

Special hardware built to efficiently solve a certain class of problems is not a new phenomenon. For example, researchers in Japan have built the GRAPE machine 6 to solve problems in astronomy. Meteorological computing is a specialized activity and requires considerable computational resources. There is thus a case for the building of specific hardware suitable to the algorithms used. Efforts in this direction have been taken at N. A. L. which has lead to the concept, design and development of the FloSwitch. 7

FloSwitch : Concept

Let us take a look at Fig. 2 again. The processors require the sum S. In all the methods described in section 4, including the binary tree method, the PEs wait for the intermediate sums. The transfer of an array to a PE is only so that the summation can take place. If the summation could take place on the switch itself, the time to wait could be cut down. This can be accomplished by an intelligent switch (which we call the " FloSwitch" 7 ) which not only transfers data between the processors, but also does some amount of computation. This can be seen from Fig. 6 The sequence of steps for global summation is as follows. 1. All the PEs write the arrays to be added to buffers on the switch simultaneously. 2. The CPU on the switch adds the arrays and deposits the result on each of the P E buffers. 3. The PEs read the result from the the switch simultaneously. The time required is Ttotai = 2 * (iCp/2) + t*sum where t*sum is the time required to sum np arrays on the switch

69

PEO

PE1

PE2

AO

Al

A2

A3

s

s

s

s

PE3

Figure 6. Schematic of FloSwitch

t*um = (np — l)t* if the arrays are added in sequence. ttum = 1°S2 npt*s ^ t n e arrays are added in a binary tree manner A comparison between the above expression and that for the the binary tree summation reveals that the log2 np term multiplies only the tsum term. Since summation is generally much faster than the communication for modern processors, the total time would be much lower. Advantages of this approach are • It is scalable.

• Global communication is symmetric.

• Order of summation can be preserved. An estimate of the run times and parallel efficiencies for a typical T-170 computation is given in the table below Assumptions : Computation time /step Data transferred Processor - Switch B/w Memory B/w on FloSwitch

= 360 seconds = 26.57 MB = 60 MB/s = 300 MB/s

70

Bus Nprc 1 2 4 8 16 32 64 128

8

T 360.00 181.77 95.31 57.40 49.07 66.16 117.22 227.77

V

1.000 0.990 0.944 0.784 0.459 0.170 0.048 0.012

Switch (Binary Tree) T 360.00 181.77 93.54 50.31 29.59 20.11 16.25 15.21

FloSwitch V

1.000 0.990 0.962 0.894 0.761 0.560 0.346 0.185

T

T)

360.00 182.13 92.30 47.48 25.16 14.08 8.64 6.00

1.000 0.988 0.975 0.948 0.894 0.799 0.651 0.469

FloSwitch : Hardware implementation

Unlike most other switches which transfer data from one processor to another processor, the FloSwitch gathers data, processes it (in the present context completes the summation of spectral co-efficients over the latitudes) and delivers only relevant information to the processors. While the switch has been developed with meteorological computing in mind, any global logical reduction operation (OR, AND, MIN, MAX etc) across processors can be incorporated on the switch. This incorporation of global reduction operations on the switch reduces the data traffic across processors. The first version of the FloSwitch, designed and built is shown in figure 7. The communication between the processors and the switch is through the PCI bus. The PCI-DPM card (designed and developed by NAL) acts as the interface. The inter-process communication is carried out in the following steps. The processors transfer the data to the Dual Port Memory on the PCI-DPM card, and modify the control flags. The Intel 486 CPU on the FloSwitch checks the control flags and takes the appropriate action. For. e.g, for a SEND/RECEIVE operation, it just copies the data to the DPM of the receiving PE. Initial versions of the global operations like BROADCAST and ALLREDUCE have been successfully implemented on the switch. The tests of the hardware are currently going on. A version of the spectral General Circulation Model at T-170 resolution has been run on the hardware. Further optimization of the hardware and the system software is under progress.

71

Figure 7. Photograph of FloSwitch

9

Conclusions

Scalability of spectral codes on parallel computers is an area of ongoing research. Scalability is very important in the context of tera-computing. An analysis of the issues related to this reveals that the communication architecture is one of the key factors which limits the scalability. Special purpose hardware built using the concept of FloSwitch will improve the speedup achievable on such computations. The initial results of the hardware tests of the FloSwitch have been highly encouraging.

72

References 1. Swarztrauber P. N. 1996 MPP : What went wrong ? Can it be made right ? Proceedings of the seventh ECMWF Workshop on the Use of Parallel Processors in Meteorology November 2-6, 1996 2. U. N. Sinha, V. R. Sarasamma, S. Rajalakshmy, K. R. Subramanian, P. V. R. Bharadwaj, C. S. Chandrashekar, T. N. Venkatesh, R. Sunder, B. K. Basu, Sulochana Gadgil and A. Raju Monsoon forecasting on parallel computers Current Science, Vol. 67, No. 3, August 1994 3. U. N. Sinha and Ravi S. Nanjundiah A Decade of parallel meteorological computing on the Flosolver Proceedings, of the Seventh ECMWF Workshop on the use of parallel processing in meteorology , 1996 4. Ravi S. Nanjundiah and U. N. Sinha Impact of modern software engineering practices on the capabilities of an atmospheric general circulation model Current Science, Vol. 76, No. 8, April 1999 5. T. N. Venkatesh, Rajalakshmy Sivaramakrishnan, V. R. Sarasamma and U. N. Sinha Scalability of the parallel GCM-T80 code Current Science, Vol 75, No. 7, 709-712, 1998 6. D. Sugimoto, J. Makino, M. Taiji and T. Ebisuzaki GRAPE project for a dedicated Terra-flops Computer Proceedings of the First Aizu International Symposium on Parallel Algorithms/ Architecture Synthesis Fukushima, Japan March 15-17 1995 7. http://www.nal.res.in/pages/mk5.htm

73

SEMI-IMPLICIT S P E C T R A L E L E M E N T M E T H O D S FOR A T M O S P H E R I C G E N E R A L CIRCULATION MODELS R I C H A R D D . L O F T A N D S T E P H E N J. T H O M A S Scientific Computing Division National Center for Atmospheric Research Boulder, CO 80307, USA E-mail: [email protected], [email protected] Spectral elements combine the accuracy and exponential convergence of conventional spectral methods with the geometric flexibility of finite elements. Additionally, there are several apparent computational advantages to using spectral element methods on RISC microprocessors. In particular, the computations are naturally cache-blocked and derivatives may be computed using nearest neighbor communications. Thus, an explicit spectral element atmospheric model has demonstrated close to linear scaling on a variety of distributed memory computers including the IBM SP and Linux Clusters. Explicit formulations of PDE's arising in geophysical fluid dynamics, such as the primitive equations on the sphere, are time-step limited by the phase speed of gravity waves. Semi-implicit time integration schemes remove the stability restriction but require the solution of an elliptic BVP. By employing a weak formulation of the governing equations, it is possible to obtain a symmetric Helmholtz operator that permits the solution of the implicit problem using conjugate gradients. We find that a block-Jacobi preconditioned conjugate gradient solver accelerates the simulation rate of the semi-implicit relative to the explicit formulation for practical climate resolutions by about a factor of three.

1

Introduction

The purpose of this paper is to document our efforts at NCAR to construct a semi-implicit spectral element shallow water model that is more efficient than an explicit implementation. Traditionally, climate models have been based on the spectral transform method because the global spherical harmonic basis functions provide an isotropic representation on the sphere. In addition, it is trivial to implement semi-implicit time stepping schemes, as the spherical harmonics are eigenfunctions of the Laplacian on the sphere and the resulting Helmholtz problem is embarrassingly parallel in spectral space. Despite the lack of exploitable parallelism at relatively low climate resolutions, spectral models exhibit high performance on parallel shared and distributed-memory vector architectures. Achieving high simulation rates on microprocessor clusters at these resolutions has proven difficult due to the communication overhead required by data transpositions and the lack of cache data locality. However, some progress has been made in cache-blocking the spectral transform method, Rivier et al (2000).

74

In considering alternative numerical methods, spectral elements maintain the accuracy and exponential convergence rate exhibited by the spectral transform method. Spectral elements have proven to be effective in computational fluid dynamics applications (Ronquist 1988, Karniadakis and Sherwin 1999). More recently, they have been applied in geophysical fluid dynamics by Taylor et al (1997a) and Iskandarani et al (1995). Spectral elements offer several apparent computational advantages on RISC microprocessors. The computations are naturally cache-blocked and derivatives may be computed using local communication. An explicit spectral element model has demonstrated close to linear scaling on a variety of parallel machines, Taylor et al (1997b). However, an efficient semi-implicit formulation of a spectral element atmospheric model has been lacking. We have implemented semi-implicit and explicit multi-layer shallow-water models with periodic boundaries in cartesian coordinates. We demonstrate that the semi-implicit model is more efficient than the explicit over a useful range of climate resolutions. Efficiency in this case is measured in terms of simulated climate days per day of CPU time. Our eventual goal is to build a fully 3-D primitive equations model on the cubed-sphere and run the Held-Suarez (1994) climate tests to evaluate the semi-implicit formulation as a potential GCM dynamical core. A cubed-sphere grid tiled with quadrilateral elements results in a nearly equally spaced mesh which avoids pole problems. The transition from a 2-D periodic domain to the cubed-sphere represents a relatively small change numerically with the introduction of curvilinear coordinates, Rancic et al (1996). These include x-y rotations by the metric tensor and scaling by its determinant g — det(gl;>). 2

Spectral Elements

Spectral element methods are high-order weighted-residual techniques for the solution of partial differential equations that combine the geometric flexibility of /i-type finite elements with p-type pseudo-spectral methods, Karniadakis and Sherwin (1999). In the spectral element discretisation introduced by Patera (1984), the computational domain is partitioned into K subdomains (spectral elements) in which the dependent and independent variables are approximated by AT-th order tensor-product polynomial expansions within each element. A variational form of the equations is then obtained whereby inner products are directly evaluated using Gaussian quadrature rules. Exponential (spectral) convergence to the exact solution is achieved by increasing the degree N of the polynomial expansions while keeping the minimum element size fixed. The spectral element atmospheric model described in Taylor et al

75

(1997a) does not employ a weak variational formulation and so the equations are discretised on a single collocation grid. Alternatively, staggered velocity and pressure grids were adopted for the shallow water ocean model described in Iskandarani et al (1995). The major advantage of a staggered mesh in the context of a semi-implicit scheme is that the resulting Helmholtz operator is symmetric positive definite and thus a preconditioned conjugate gradient solver can be applied. 2.1

Shallow Water Equations

The shallow-water equations have been used as a vehicle for testing promising numerical methods for many years by the atmospheric modeling community. These equations contain the essential wave-propagation mechanisms found in more complete models. The governing equations for inviscid flow of a thin layer of fluid in two dimensions are the horizontal momentum and continuity equations for the wind u and geopotential height = gh. - ^ + ( u - V ) u + / k x u + V
(1)

^ + u - V < £ + 4>V-u = 0

(2)

Given the atmospheric basic state u — 0, v — 0 and = Q, linearization of the equations followed by a Von Neumann stability analysis reveals that solutions exist having phase speeds c = 0 and c = ±{o + f2/k2}1/2. The former represent slow movements of the atmosphere, related to the propagation of Rossby waves, whereas the latter are the fast-moving gravity wave oscillations. This implies that the maximum time step is limited by the speed of these fast modes. To overcome this limitation, Robert (1969) introduced a semi-implicit scheme in a primitive equations spectral model. A semi-implicit scheme applied to the shallow water equations combines an explicit leapfrog scheme for the diagonal advection terms with a Crank-Nicholson scheme for the off-diagonal Coriolis, gradient and divergence terms. 2.2

Variational Formulation

Consider the semi-implicit time-discretisation of the shallow water equations, written in terms of the differences 5 u = u™+1 - u™_1 and 5 = (f>n+1 - ™_1 5 u + At V 8 = 2At ( - V 0 " " 1 + f™ ) 1

5 + At<j>Q V • S u = 2At ( -
(3) (4)

76

o is the mean geopotential reference state and the tendencies fu and /^ contain nonlinear advection and Coriolis terms. The spectral element discretisation of (3) and (4) is based on a weak variational formulation of the equations over a domain ft. Find (Su, 8
(5) (6)

R u = 2 A i [ ( 0 " - 1 , V - w ) + (fu", w ) ] R(P=2At

V-u"-1) + (/^,g)]

[-4>0(q,

where X = Ti1 (fi) and M — £.2(fl) are the velocity and pressure approximation spaces with inner products f,g,v,w € £ 2 (ft), ( /, 3 ) = / /(x)s(x) dx, ./n

( v, w ) = / v(x) • w(x) dx. ./n

The weak variational formulation of the Stokes problem can lead to spurious 'pressure' modes when the Ladyzhenskaya-Babuska-Brezzi (LBB) inf-sup condition is violated. For spectral elements, solutions to this problem are summarized in Bernardi and Maday (1992). To avoid spurious modes, the discrete velocity Xh C X and geopotential Mh C M approximation spaces are chosen to be subspaces of polynomial degree iV and N — 2 over an element fix

xh = xnPN,K(n), PNMV)

Mh = MnPN-2,K(ty

= {f e £ 2 («): f\oK e PN(nK)}

For a staggered mesh, two integration rules are defined by taking the tensorproduct of Gauss and Gauss-Lobatto quadrature rules on each element. K

N

N

k=l

i=0

j=0

K

N-l

N-l

(/.ff)o = E E E *;=i »=i

/*(c<.Ci)5*(Ci,o)^^

j=i

where ( &, /9j ), i = Q,...N are the Gauss-Lobatto nodes and weights and ( Ci, ^i ), i = 1, • • • N — 1 are the Gauss nodes and weights on A = [ —1, 1 ].

77

Physical coordinates are mapped according to x € Qj =>• r € A x A. The discrete form of (5) - (6) can now be given as follows. ( 6 uh, w ) G L - At ( S 4>h, V • w ) G = R u ( & h, Q )G + A t o ( q, V • <5 uh )

G

= Rj,

To numerically implement these equations, a set of basis functions must be specified for Xh x M^. The velocity is expanded in terms of the iV-th degree Lagrangian interpolants hi defined in Ronquist (1988), N

N

Ui hi

u£(ri,r2) = ^2Y1

Mr2)

i ^

i=0 j = 0

and the geopotential is expanded using the (N — 2)-th degree interpolants hi N-1N-1

0*(n,r 2 ) = 5 3 Yl &J ~hi^ ~hi^ i=i

j=i

The test functions w € Xh are chosen to be unity at a single & and zero at all other Gauss-Lobatto points. Similarly, the q £ Mh are unity at a single £; and zero at all other Gauss points. The application of Gaussian quadrature rules results in diagonal mass matrices Bk = ( -±-±

j B B,

B = diag(Pi),

i =

0,...N

where L\ and L\ are the dimensions of element k. Given the quadrature rules and basis functions defined above, the bi-linear form d

rj

( q, V • u ft ) G = ^2 ( q, ~— I=I

)G

OXl

can be written in tensor-product form K

(q,V-uh)G

= Y^ {qk)T iPki «{ + Dk2 u: k=i

where

%)l)'®i>, »6, m = \-t-

u } = (-^ 3 D®/ x>;=

78

using the normalized derivative and interpolation matrices Dl3 -en

—

The spectral element discretisation of the advection operator (u • V) u in the momentum equation is represented by the square matrix V. In one space dimension N

[T>u]{ = m ^T Pi Dij UJ where Dy = /ij(&) is the matrix whose elements are derivatives of the GaussLobatto cardinal functions evaluated at Gauss-Lobatto points. In the momentum equation the geopotential is first interpolated to the velocity grid before application of the advection operator. In the spectral element method, C° continuity of the velocity is enforced at inter-element boundaries which share Gauss-Lobatto points and direct stiffness summation is applied to assemble the global matrices. The derivative matrix consists of the rectangular matrices D = (Di,D2)- B = (£?I,JB 2 ) and B are the global velocity and geopotential mass matrices. The assembled discrete shallow water equations are then B 8 u - At D T 8 4> = R u B 8 4> + At 4>0 D 5 u = R4

(7) (8)

The pressure is defined on the interior of an element and is 'communicated' between elements through the divergence in the continuity equation. An averaging procedure is required at element boundaries to enforce continuity, where velocity mass matrix elements in equation (7) are summed. 2.3

Helmholtz Problem

A Helmholtz problem for the geopotential perturbation is obtained by solving for the velocity difference 8 u 5 u = B " 1 ( A t D T S

(9)

and then applying back-substitution to obtain B 8 cj> + At 2 4>0 D B - 1 D T 8 <j> = R^ where R'=R^-

At 4>0 D B " 1 R u

(10)

79 Once the change in the geopotential 5 <j> is computed, the velocity difference S u is computed from (9). The Helmholtz operator H = B + Ai 2 <£0 D B - 1 D T is symmetric positive definite and thus preconditioned conjugate gradient iterative solvers can be applied. An effective preconditioner can be constructed by using local element direct solvers for the Helmholtz problem with zero Neumann pressure gradient boundary conditions. An LDLT decomposition of these local matrices is computed once and back-substitution proceeds during time-stepping. This block-Jacobi preconditioner is strictly local to an element and requires no communication. To prevent growth in the number of conjugate gradient iterations with the problem size, we have implemented the domain decomposition solver of Ronquist (1991) as an option. It is based on a deflation scheme that resembles a two-level multigrid algorithm. The geopotential (pressure) is split into coarse grid c and fine grid f components. The vector c contains the average pressure within each element whereas <j>f represents the pressure variation within each element. The coarse grid solution is mapped to the fine grid by injection (prolongation) of a constant value onto each pressure point within an element using the operator 7". The coarse grid solution is obtained from the fine grid by application of the averaging (restriction) operator IT. Equation (10) then becomes H(4>f

+

Ic)=g

where g is the discrete right-hand side vector. The coarse and fine grid Helmholtz operators are defined by Hc = IT HI and Hf = H — HIH~x IT H. The coarse and fine grid problems are therefore given by Hf
H'1 IT g,

Hccj>c = gc = IT'(g

- H
The deflation algorithm proceeds as follows. In a pre-processing step, the fine grid right-hand side gf is computed. The fine-grid component is then computed using a preconditioned conjugate gradient solver. Unlike the standard PCG algorithm, each fine grid iteration requires two matrix-vector multiplies and a coarse grid solve. Once the fine grid solution has converged, the coarse grid right-hand side gc is formed and the coarse grid component cj)c is computed in a post-processing step. Finally, the total pressure =
Computational

Complexity

The computational complexity of the semi-implicit scheme is crucial in determining whether or not it is competitive with an explicit time-stepping scheme. The ratio of semi-implicit to explicit time steps is ten to one (10:1) at T42 climate resolutions of approximately 300 km. The basic gradient and divergence

80

operators derived above require 0(KN3) floating point operations (flops) and therefore represent the cost of an explicit time step or the cost of computing the momentum and continuity right-hand sides. The cost-effectiveness of the semi-implicit scheme depends on the Helmholtz solver and this is a function of the number Na of conjugate gradient iterations (equivalently the convergence rate of the solver). The cost of a CG iterative solver can be broken down as follows. The core operations in a CG iteration are the daxpy and ddot inner product which both cost 0(KN2) flops. Application of the Helmholtz operator in the form of a matrix-vector product relies on the divergence and gradient operators and requires 0(KN3) flops. The local element preconditioned require back-substitution using dense matrices with a computational cost of 0(KN4). The coarse HC matrix is symmetric positive definite and sparse with dimensions K x K. A dense solver requires 0(K2) flops whereas exploiting sparsity reduces this to O(K). The coarse problem is a serial bottleneck because scaling the problem size implies K increases with the number of processors P. A parallel solver is required to prevent this cost from growing.

3

Parallel Implementation

We have implemented explicit and semi-implicit parallel computer codes to test their relative efficiencies. The target parallel architecture for these codes is a cluster of symmetric multiprocessors (SMP) nodes, such as the 540 processor IBM SP, or the 32 processor Compaq ES-40 cluster systems currently installed at NCAR. While the details differ, these systems share many of the same fundamental characteristics. Both systems employ 4 processor SMP nodes that access main memory through multiple levels of cache hierarchy. Both clusters are interconnected by a communications network which is roughly 30 times slower than the SMP local memory bandwidth as measured by the STREAMS benchmark. The parallel design of both computer codes is the so-called 'hybrid' MPI/OpenMP programming model. This model exploits thread level parallelism within an SMP node in conjunction with message passing calls between nodes. In theory, there are several advantages associated with the hybrid approach on SMP clusters. First, the hybrid method consolidates message traffic because fewer large messages will amortize latency overhead better than many small messages sent from multiple MPI processes per SMP node. Second, a single MPI process per node eliminates contention for the network interface card. This is frequently an advantage on SMP systems, since often the network interface card is a shared resource and communication performance can be hindered by contention of multiple MPI processes. Finally, hybrid provides the additional flexibility of efficiently exploiting large shared memory systems,

81

such as DSM's, by employing OpenMP threads, while maintaining the option to run in message passing only mode on single processor per node systems such as Linux clusters or Cray T3E's. At the same time, many of the advantages of the hybrid model can be realized only if the underlying thread implementation is light-weight relative to the MPI implementation and if adverse cache coherency effects between threads can be avoided by the implementation. Single processor optimizations focus on data re-use. Once loaded, the derivative matrices remain in cache since they are small. Loop unrolling over elements in the divergence, gradient and interpolation matrix-matrix multiplies exposes instruction level parallelism and allows the compiler to pipeline instructions through multiple functional units. These kernels achieve over 2.1 Gflops sustained performance (35% of peak) when threaded on a four processor IBM SP node. Unrolling is also applied in the solver prolongation and restriction operations. Significant performance gains are also realised in the preconditioner by unrolling the outer loop over elements and assisting the compiler in optimally scheduling stores. The inner loop is back-substitution where the LDLT factors are cache resident. This computation is embarrassingly parallel because the solvers are strictly local to an element. Several alternatives exist for implementing the coarse system solve. Solving the problem on each processor limits scalability since the problem size grows with K and hence P. Initially, we have experimented with both a parallel dense matrix multiply by H~l, and a parallel sparse Cholesky solver. The sparsity pattern of the Hc matrix and it's Cholesky factor Lc are displayed in Figs. 1 and 2. Shared-memory threads are coarse grained with OpenMP PARALLEL regions defined for operations on blocks of elements instead of low-level LOOP directives. This approach keeps threads running, eliminates start-up overhead and minimizes synchronization around MPI communication calls. Threads synchronize at a BARRIER before message-passing calls and perform parallel data copies into a shared communication buffer before the data is sent. Perhaps the most complex hybrid code is associated with the element edge averaging operation. Here threads apply the inverse mass matrix in parallel on the interior of each element before synchronizing to exchange boundary data. The threads average interior north-south edges in parallel, synchronize and then average interior east-west edges before exchanging subdomain boundaries. Hybrid programming choices are intimately connected to the 3-D implementation. Convergence rates of the CG solver will vary for different vertical layers due to the variation of the corresponding gravity wave phase velocities. The solver is load balanced by threading in the horizontal direction over elements instead of over layers and a mask is applied as layers converge so that threads only work on the remaining unconverged layers. Data for all layers is communicated in a

82

V \V

% .* . ''I:

:

» : \ . h «::. •••**!:. • •::. •::. •::. ' " I :. *s«. *:«. • ::. •::. •::.s

r

%. "h. ' fe.

' ^is.'%. \ . '

•X%."\. .

•

•

.'%. %. * *::, . •«

\ . " \

*»::, "V

30 nz = 576

40

*,

***

Figure 1: Sparsity pattern for the coarse Hc = LCL^

••

matrix.

::::::!:. ••••••••••

•• •• •• •• •• •• •• •• •• •• •••••••••• •••••••••• •••••••••••••••• ::::::::::.

MiUHih. :5

|::l:l:::. •• •• •• •• •• •• •• •• •• •• •• •• •• •• ••

.!:::::::::. > • • • • • • • • • • • • • • • • • • • • • • • •>••••••••! •••••••••••••••••••••••••• 30 nz = 1000

40

Figure 2: Sparsity pattern for the Cholesky factor

hc.

83

single message. We believe that data exchanges within each layer would limit scalability. Distributed-memory message passing is required in the element boundary averaging, vector gradient and divergence computations. These exchanges benefit from surface to volume effects much like finite difference methods. Messages are aggregated whenever possible over multiple layers. For example, the vector gradient and divergence can be combined when advancing the velocities to the next time level. Inner products in the conjugate gradient algorithm require a global reduction MPI.allreduce. In order to restrict the number of reductions to one per iteration (two simultaneous inner products), a latency tolerant version of the CG algorithm has been implemented, D'Azevedo et al. (1992). The cost of a global reduction is 0{U log2 P), where U is the messagepassing latency. This can seriously degrade scalability on machines with large latencies such as the IBM SP with i/ « 30/xsec. 4

P e r f o r m a n c e Studies

In order to compare the parallel semi-implicit and explicit models, both codes have been written so that they share computational kernels such as the interpolation, vector gradient, pressure gradient and divergence operators. The explicit model is based on the weak formulation of the shallow water equations and not the strong form employed by Taylor et al. (1997a). A cost function (CPU time) for the semi-implicit model depends on the number Nt of CG iterations, i.e. Csi

= Cadv + Nu

( CcG

+ CH<\> + CMZ )

where CQG represents the core cost of the conjugate gradient solver. Matrixvector multiplies cost CH4> a n d the preconditioner costs CMZ • For the deflation scheme, the cost per iteration would include the coarse system solve time CHC and two matrix-vector multiplies instead of one. The time to compute righthand sides and advance the velocities to the next time level is Cadv, which is roughly two times more expensive than an explicit time step. The semiimplicit model will be more efficient than the explicit model when CEI^E > Csi/dtsi- The above costs were measured on the NCAR IBM SP 2 with Power 3 processors and are reported in Table 1. The data collected are for a singlelayer semi-implicit model with block-Jacobi preconditioner and 8 x 8 elements per parallel subdomain. A P = 3 x 2 decomposition results in K — 384 elements with one MPI process per node and four OpenMP threads per node. The semi-implicit time step is dt — 1200s and the model was run for 72 steps or 24 hours. The time per step required was 3360/xs with Nu = 2 and the sustained performance per node was 665 Mflops.

84

Multi-layer results are displayed in Table 2, indicating that the sustained execution rate per node remains relatively constant as the number of layers is varied. For comparison, an explicit model run requires CE = 957/us per time step at the same resolution for 720 steps of dt = 120s. The sustained per node execution rate is 864 Mflops, which is 14% of the peak node performance. Multi-layer sustained performance for the explicit model is summarized in Table 3. Here again the per node sustained execution rate remains close to 1 Gflops or 17% of peak. A 'cube' grid nomenclature can be defined according Table 1: Semi-implicit model. IBM SP node performance.

time (is Mflops

Cgdv 1477 794

CQG

202 182

CH4>

CMZ

576 572

144 1135

Table 2: Multi-layer semi-implicit model. IBM SP node performance.

Layers

1 3 6 12 18 24 30

/zs/step 3360 7537 14663 26819 40647 56148 75697

Mflops

665 889 914 1000

990 955 886

s/day 0.242 0.542 1.056 1.933 2.927 4.043 5.450

Table 3: Multi-layer explicit model. IBM SP node performance.

Layers

1 3 6 12 18 24 30

/us/step

957 2521 4635 8693 14259 19428 26445

Mflops

864 984 1071 1142 1044 1022

939

s/day 0.689 1.815 4.966 6.259 10.267 13.988 19.041

to the number of degrees of freedom along a cube face or the domain boundary in the case of a 2-D periodic domain. Given N velocity points and Kx elements

85

in the x-direction, there are KX(N - 1) degrees of freedom. Therefore a C56 grid corresponds to Kx — 8 and N — 8. The execution time per model day for the explicit model divided by the semi-implicit model execution time provides an acceleration factor. The acceleration factors for different resolution problems run on the same P = 3 x 2 node configuration above are given in Fig. 3 for block-Jacobi preconditioning with and without deflation. The decrease in acceleration experienced at C224 is the result of an increase in the mean iteration count from Nit = 2 at C56 to Nit = 4 at C224. This plot indicates that the block-Jacobi preconditioner without deflation is more efficient for a broad range of climate resolutions despite a slight growth in the average number of conjugate gradient iterations as the resolution is increased. We next examine the overall integration rate as measured by simulated days per wall-clock day for two resolutions, C56L30 and C168L30 (30 layers). A multi-layer shallow-water provides a good estimate of the expected performance for the full 3-D primitive equations at reasonable climate resolutions. Fig. 4 contains plots of the number of simulated days of climate per wall-clock •day for both the semi-implicit and explicit models as the number of nodes (threads) is increased on the IBM SP. Fig. 5 compares the sustained total Gflops execution rates of the two models at these resolutions. The 96 thread result at C56L30 for the semi-implicit model suggests that 130 years per day of climate simulation is achievable with a scaling efficiency relative to 24 threads of 75%. These plots clearly demonstrate the advantage of the semi-implicit formulation relative to the explicit. The total sustained performance results at 96 nodes (384 threads) for the higher resolution C168L30 runs were 74 Gflops for the explicit model and 66 Gflops for the semi-implicit model. Finally, a scaled-speedup study has been performed to evaluate scalability of the IBM SP cluster. In this study the problem size is increased with the number of nodes thereby holding the subdomain size constant at 64 = 8 x 8 elements. The Courant stability limit for gravity waves is respected at higher resolutions by decreasing the explicit time step. Similarly, the semi-implicit time step is reduced according to the advective stability limit. Fig. 6 contains a plot of the boundary exchange communication time per step as the number of nodes are increased. The slowly growing exchange times indicate that the interconnection network bandwidth of the IBM SP is not scaling with machine size. This effectively limits the scalability of the algorithm. Fig. 6 also contains the semi-implicit solver core execution time as a function of log P. In this case, it is expected that the core time will grow as log2 P due to the global reduction required by the conjugate gradient solver. The solver core time is increasing faster than this due to limitations of the interconnect.

86 Acceleration Factor

' - « * - Deflation • o - PCO

\

"b

-

^ ^ 150

Resolution

Figure 3: Acceleration factor for block-Jacobi with and without deflation.

5

Conclusions

We have derived a semi-implicit time stepping scheme for a multi-layer shallow water model based on a spectral element space discretisation. The semiimplicit model, using a block-Jacobi preconditioned conjugate gradient solver, is accelerated by a factor of three relative to the explicit model for a range of relevant climate resolutions. The comparison is based on the number of simulated days per wall-clock day. Schwarz domain decomposition should be investigated as an alternative to deflation, which was not cost-effective at climate resolution. The parallel implementation of both models is hybrid MPI/OpenMP. A C58L30 integration rate of 130 years per day was observed for the multi-layer shallow-water equations and this result is encouraging for parallel climate simulations on SMP clusters. The C168L30 simulation scales out to 96 IBM SP nodes and achieves a total sustained performance of 74 Gflops. The scalability of the semi-implicit and explicit models seems limited only by the underlying system interconnect. Future work will include running the now standard shallow water test problems on the sphere, Williamson et al (1992). This will be followed by construction of a dynamical core based upon the 3-D primitive equations and evaluation of the Held-Suarez (1994) tests.

87 5

C56L30 Simulated Days per Day

x10*

y/

4.5 - e - Semi-Implicit •-*•- Explicit

3.5 3

.

-

•

2.5

-

2

-

1.b ^

,

—

•

^. '

1 0.5

-

x^'

10

20

30

40

70

50

SO

90

100

Threads

C168L30 Simulated Days per Day

150

200

250

Threads

Figure 4: Simulated days per day. Top: C56L30. Bottom: C168L30.

C56L30 Sustained Performance 22

x

20 - e - Semi-Implicit •-«•- Explicit

16

,'

16 in 14 o. _o 5 12

•

10 s'jr

10

20

30

40

50 Threads

70

60

90

100

C168L30 Sustained Performance

s - e - Semi-Implicit • - " - Explicit

y" y>

2 40

5

-

50

100

150

200 Threads

250

300

350

400

Figure 5: Total sustained Gflops. Top: C56L30. Bottom: C168L30.

89

Boundary Exchange Time

-e— Semi-Implicit • - * - Explicit

50

100

150

200

250

300

350

400

Threads

PCG Core Time

Figure 6: Top: Boundary exchange time. Bottom: CG solver core time.

90 Acknowledgments The authors would like to thank Einar Ronquist of the Norwegian University of Science and Technology for helpful discussions regarding the theory and implementation of spectral element methods. References 1. Bernardi, C. and Y. Maday, 1992: Approximations spectrales de problemes aux limites elliptiques. Mathematiques et Applications, vol. 10, Springer-Verlag, Paris, France, 242p. 2. D'Azevedo, E., V. Eijkhout, and C. Romine, 1992: Reducing communication costs in the conjugate gradient algorithm on distributed memory multiprocessors. LAPACK working note 56, University of Tennessee. 3. Held, I. H., and M. J. Suarez, 1994: A proposal for the intercomparison of the dynamical cores of atmospheric general circulation models. Bull. Amer. Met. Soc, 75, 1825-1830. 4. Iskandarani, M., D. B. Haidvogel, and J. P. Boyd, 1995: A staggered spectral element model with application to the oceanic shallow water equations. Int. J. Numer. Meth. Fluids, 20, 394-414. 5. Karniadakis, G. M., and S. J. Sherwin, 1999: Spectral/hp Element Methods for CFD. Oxford University Press, Oxford, England, 390p. 6. Patera, A. T., 1984: A spectral element method for fluid dynamics: Laminar flow in a channel expansion. J. Comp. Phys., 54, 468. 7. Rancic, M., R. J. Purser, and F. Mesinger, 1996: A global shallowwater model using an expanded spherical cube: Gnomic versus conformal coordinates. Q. J. R. Meteorol. Soc., 122, 959-982. 8. Rivier, L., L. M. Polvani, and R. Loft, 2000: An efficient spectral general circulation model for distributed memory computers. To appear. 9. Robert, A., 1969: The integration of a spectral model of the atmosphere by the implicit method. In Proceedings of WMO/IUGG Symposium on NWP, VII, pages 19-24, Tokyo, Japan, 1969. 10. Ronquist, E. M., 1988: Optimal Spectral Element Methods for the Unsteady Three Dimensional Navier Stokes Equations, Ph.D Thesis, Massachusetts Institute of Technology, 176p. 11. Ronquist, E. M., 1991: A domain decomposition method for elliptic boundary value problems: Application to unsteady incompressible fluid flow. Proceedings of the Fifth International Symposium on Domain Decomposition Methods for Partial Differential Equations, SIAM, Philadelphia, 545-557. 12. Taylor, M., J. Tribbia, and M. Iskandarani, 1997a: The spectral element

91

method for the shallow water equations on the sphere. J. Comp. Phys., 130,92-108. 13. Taylor, M., R. Loft, and J. Tribbia, 1997b: Performance of a spectral element atmospheric model (SEAM) on the HP Exemplar SPP2000. NCAR Technical Note 439+EDD. 14. Williamson, D. L., J. B. Drake, J. J. Hack, R. Jakob, and P. N. Swarztrauber, 1992: A standard test set for numerical approximations to the shallow water equations in spherical geometry. J. Comp. Phys., 102, 211-224.

92

Experiments with NCEP's Spectral Model Jean-Frangois Estrade, Yannick Tremolet, Joseph Sela December 14, 2000 Abstract The spectral model at The National Center for Environmental Prediction was generalized to scale from one processor to multiples of the number of levels. The model has also been 'threaded' for Open-MP. We present experiments with NCEP's operational IBM SP using similar resources in different mixes of MPI and Open-MP. We also present experiments with different platforms, assessing performance of vector machines vs. cache based machines ranging from Linux PCs, IBM SPs in different configurations, to Fujitsu's VPP5000. The scalability properties of the model on the various platforms are also discussed.

Introduction The National Center For Environmental Prediction has been running a global spectral model since 1980 on a variety of platforms. The code has continually evolved to reflect optimization features on the different computer architectures that were used in operations during these last twenty years. With the advent of parallel computing, the model was rewritten to conform to the distributed memory of the new machines. This code was developed and implemented on a T3D, T3E and then ported to the IBM-SP which is currently the operational machine at the U.S. National Weather Service. There were two fundamental issues during the development of the parallel code; parallelization of the algorithm and optimization for cache based computing. The parameterization of the physics and radiation processes were cloned from the Cray C90 code that was in operations just before the SP implementation. The data layout for parallel computations was completely redesigned. The grid point computations of dynamics, physics and radiation were originally designed for vector machines. In view of the cache's restrictive size on some parallel machines, the parameterization codes were devectorized for better performance. This issue will be discussed in more detail later in this paper. The parallel code was originally using SHMEM for communication but later switched to MPI for the IBM SP. Recently the OPEN-MP feature was incorporated in the code and results from tests using the same resources with and without OPEN-MP are studied in this paper.

93 In view of the shorter life time of present day computer systems, the issue of portability is gaining importance. To test portability, the model was first generalized to run on any number of PEs. The new parallelization permits execution of the model on a variety of machines, ranging from a LINUX based PC through Origin, SP or any MPI platform up to Fujitsu's VPP architectures. It must be emphasized that the effort invested in devectorization for optimal performance on our operational machine, will be shown to be counter productive when executing on a vector processor such as a VPP computer.

Parallelization and Data layout The model's data parallel distribution is as follows: Spherical Harmonic space We distinguish between three possibilities: If the number of processors is less than the number of vertical levels we distribute the levels among the processors. If the number of processors is equal to the number of vertical levels, we assign one level to each processor. In the 'massively parallel' case we assign the number of processors to be a multiple of the number of layers and we assign one level to each group of processors. The number of groups=number of PEs / number of levels. In addition we spread the spectrum among the groups when doing purely spectral computation such as diffusion, filtering and semi-implicit tasks. Fourier Space In this space we split the data among the levels and latitudes to facilitate efficient Fourier Transform computations with all longitudes present in processor for a given field. Gridpoint Space In this space, when we are "between Fourier Transforms", we further parallelize over longitudes since we need data for ALL levels for EACH point of the Gaussian grid on EACH processor. Communication In order to shuttle from one space to another, a set of encapsulated communication routines are used and the transition from MPI to other communication packages is thereby isolated from the computational part of the code. We have found that collective MPI routines perform more efficiently than point to point communication on the IBM SP.

94 O P E N - M P Threading The code contains MPI communication and OPEN-MP directives . OPEN-MP is used for threading FFTs and the gridpoint computations of dynamics physics and radiation. MPI is used whenever off node data is required. Given the memory per node on the SP, we are unable to execute a purely OPEN-MP code on one PE. (just for measurements of course, the computational power of one PE would not suffice for operational prediction). Invocation of Open_MP is achieved at execution time. The choice of the number of nodes, MPI tasks and the number of threads are only explicit in the execution script and no changes to source code are required when reconfiguration of the resources is desired. The compilation however requires the appropriate compiler and associated thread libraries.

Platforms and Tests Performance evaluation was executed in two resolutions. T6228, Triangular truncation at zonal wave 62 and 28 levels. T17042, the Operational truncation. Triangular truncation at zonal wave 170 and 42 levels. I B M - S P at N C E P The NCEP SP phase one system consists of 32 silver nodes and 384 winterhawk nodes. Each silver node has 2 GB of memory and 4 CPUs. Each winterhawk has 512 MB of memory and 2 CPUs. Winterhawk are faster than silvernodes, despite their slower clock (200MHz. vs. 333MHz.). The winterhawks deliver superior floating-point performance because of better pipelining in the CPUs and cache prefetch for large loops. Silver node CPUs are faster for character manipulations and scalar codes that do not optimize well. AIX operating system requires about 150 MB so only about 360 MB of user space on a winterhawk and 1.8 GB on a silver node are available. Since AIX is a virtual memory operating system, if the workload exceeds these values portions of its address space will page to disk. Only winterhawk nodes were used in tests presented in this report. Load distribution throughput and scheduling are important issues in an operational environment and the following configuration information seems pertinent. There are 4 classes of user access. A class is a group of nodes from which users can request job resources. Two of these classes are restricted for production use only and two are for general use. Additionally, two non-user classes are configured to support scheduling of threaded applications on the winterhawk nodes: Class 1: 17,4-way silver nodes supports batch and interactive POE jobs, up to 68 available scheduling slots. Class 2: 3,4-way silver nodes for Production - interactive or batch

95 Class dev: 384 2-way winterhawk nodes for Development in batch only, 2 tasks per node. Class. PROD: 384 2-way winterhawk nodes for Production in batch only, 2 tasks per node. In the PROD class, even when we are alone in the class, the timing results can vary from 9.6% for a T170L42 case with 42 processors, to 16% with 84 processors, 25% with 126 and 27% with 168 processors. This large variation required running the same cases several times in order to chose the minimum time for each set of experiments. The variation in timing results is larger when the dev queue is used. V P P 5 0 0 0 at Meteo-France The VPP5000 system consists of 32 vector nodes. Each node has 8 GB of memory and one CPU of 9.6 Gflops with 7.9 GB of memory available to the user. The nodes are linked by a crossbar with a bi-directional speed of 1.6 GB/s. This is an operational machine where the operational suite of Meteo-Prance runs four times a day. There is only one partition where development, research and operational system coexist. To manage all the different configurations, MeteoFrance has installed a system with several queues with several priorities which depend on memory use, the number of processors and the time requested. We were able to run alone only once and were able to test the model on 21 processors. If we compare the time of this test with the times measured when other jobs are running, the difference is lower than 1% ! A Linux P C This is a 450 Mhz PC with 384 MB of memory with Linux and 256 MB of swap space. We used Fortran 90 from Lahey/Fujitsu and ran alone in all experiments. We have also installed M P I C H on this machine for future experiments with PC clusters. At this time we are engaged with Fujitsu's PC cluster project and performance results will be published as they become available.

Performance and Scalability In order to make a three way comparison between the different machines it was necessary ,because of memory constrains, to run the T62L28 MPI version of the model. Since the only machine supporting O P E N _ M P that was available to us was the SP, we used the model's operational truncation of T17042. The timings of the different sections of the code in the three way comparison for the T17042 executions are shown in Table 1. In these runs the IBM-SP used 21 processors, the VPP5000 used eight and the PC one processor. The entries in Table 1 display the percentage time spent in the different section of the code.

96 Algorithm Transpose Semi-imp

% IBM/SP (T17042) 33.9 4.4

% VPP5000 (T17042) 11.4 2.6

% PC (T6228) 7.7 1.0

FFT For w. Leg In v. Leg Semi-imp Grid-dyn Physics Radiation

3.2 7.2 4.5 1.1 0.8 11.0 27.6

9.3 2.1 1.1 0.5 2.3 52.4 15.7

5.3 4.9 2.4 0.8 0.9 10.9 51.4

I/O Prints

3.4 0.7

0.9 0.6

1.1 0.5

We note that almost 40% of the time on the IBM-SP and 14 % of the time on the V P P are spent in communication. It should also be noted that these communication times are influenced by the relatively long time required to distribute the initial conditions which are not accounted for in this table. In addition, the time charged to communication, which is done in 32 bit precision, includes the 'in processor' reorganization of the data. We measured the relative times spent in this reorganization and using the T62L28 version found that on the VPP with only one processor the reorganization time is approximately 0.3 % of the total, while on the PC it is 8.7 %. This big difference is partly attributable to the vectorization advantage on the Fujitsu of this segment of the code. It is observed that on the vector machines more time is spent in the physics than in the radiation because the physics computations are poorly vectorized. For the scalar machines, the opposite is true. We note again that the original codes for the grid dynamics, physics and radiation were devectorized for the SP. Regarding the code's scalability we observe that on the Fujitsu VPP5000, the scalability is reasonable even though the code is poorly vectorized. The vectorization percentage was measured to be less than 30 %, and is not necessarily independent of the number of processors; depending on the number of processors, vector lengths can become too small for good vector speeds. If we run 'almost alone' using the production queue, the same observations can be made about the IBM-SP and the scalability remains reasonable up to 168 processors, the number used in operations.

97 VPP5000 runs with a T170L42 12 hours forecast

IBM-SP runs with a T170L42 12 hours forecast bMU-]

\ 11400-

^

\

8900

\ \

3900

\

I 1 s

\ 6400

\

690

£

\ 490

~~-

1400-

1 2

3

4 5 6 Processors

\ \

290

42

7

63

84 105 126 147 16 Processors

Open-MP versus MPI The Open-MP aspect of the code was tested on the IBM-SP. We ran tests using MPI alone and a combinations of MPI and OPEN-MP. On the IBMSP winterhawks phase one, there can be only two threads. These tests were executed in the prod class. We observe that MPI processes (dash line) are faster than the mixed OPEN-MP +MPI processes (full line). These performance results are probably due to the OPEN-MP overhead in spite the rather large grain computational tasks assigned to the Open-MP threads. We also note good scalability of the code with the use of Open-MP. Test with MPI alone (dash line) or MPI+OPEN-MP (full line) T170L42 runs 12 hours forecast 1280 -.

1080

\

\

\ \

880

680

\

\

\

\

480

- -. 280-

r-

,

,

""~-

T

1

126

147

1

__ 1

1

1

1

16

Processors

Scalar versus Vector If we compare the execution time of the VPP5000 and IBM-SP on a T170L42 case (12 hours forecast) on 21 processors, we get a ratio of 1.97. This ratio is rather disappointing for the VPP5000 but can be explained by the less than 30% vector nature of the code. In order to measure the vectorization advantage of the VPP5000 with respect to the SP we selected the simplest vectorizable part of the code, namely the grid

98 dynamics, for the adiabatic version of the model. In the adiabatic case the code is more vectorized (72%) but the subroutine which consumes most of the time is still the same. Using the profiling tools from Fujitsu, we observe that the second most time consuming subroutines is the dynamic grid computation. In this subroutine, we compute the contribution of one longitude point at each call. If we increase the computation to include several points for each call and if we increase the amount of vectorization inside this code, we increase the vectorization amount of the entire code to 75% and we execute in 805s instead of 907s on four processors (9% better). With the IBM-SP, the same test with no modification runs in 640s with 22 processors. After vectorizing the dynamics grid computations, the ratio between IBM and VPP is improved to around 4.3. If we run the new modified vector code on the IBM-SP and on the PC, we do not notice any degradation in performance except when we use the 2-D arrays with the second dimension set to one (only one longitude per call). It is possible that this problem arises from the prefetching mechanism of data on the SP. When the vectorized code is executed on the SP with several longitudes per call we get a slightly improved performance, contrary to the experience with the Cray T3E, this is most likely a result of the larger SP cache. Similar remarks apply to the execution and performance on the PC.

I.B.M Phase II Phase II consists of 2 systems: one for production, the other for development. Each system contains 276 Winterhawk-2 nodes: *256 are reserved for computation. * 20 nodes are reserved for filesystems, interactive access, networking, various system functions. At this writing only 128 nodes are available on one system. The Winterhawk-2 have a clock speed of 375 MHz ( 200 MHz for Winterhawk1) and an L2 cache size of 8 MBytes ( 4MBytes on Winterhawk-1). Each nodes contains 4 processors and 2 GBs of main memory (usable memory 1.8 GBytes). Theoretically, this system is 5 times faster than phase I. Comparison b e t w e e n phase I and phase II We compare T17042 12 hours forecasts on the two machines: On phase I with 21 processors requires 1606 sec. On phase II, this run requires 863 sec. So the ratio between the two machines is roughly: 1.86. We get exactly the same ratio with 42 processors. But with 84 processors, this ratios becomes 2.22. Given the variability in measurements of running times, we consider the ratio to be around 2.

99 T h r e a d i n g vs M P I The results on the next graph indicate improved scaling on phase II. We still find that the pure MPI version is more efficient than the thread + MPI version. Test with MPI and threads on IBM phase II system T17042 runs -12 hours forecast **" mpi [mpxtf) mpi + 2 threads (mpxH_r) mpi + 3 threads (mpxILr) mpi + 4 threads {mpxtt_r)

650- \ 550- \

350-

\

\

250150-1

1

1

s

\

\

**

\ 1

(

1

.

1

r^T

1

1

1

1

1 "i ' " i " " i

1

1

1

1

•

1

21 42 63 84 105126147 16B 189 210 231 252273294315336357378399420 441 462 483504

O n e M P I p r o c e s s p e r n o d e vs four M P I processes p e r n o d e The most efficient MPI version consists of one MPI processor per node. If we compare a run with one MPI process per node with a run with 4 MPI process per node, the former is approximately 1.6 time faster than the latter. This ratio is 1.1 if we compare one MPI process per node with 2 MPI process per node.

Conclusions The quantitative comparison of cache and vector architectures, especially in term of price vs. performance is not straight forward. It is clear from measurements that vector machines with well vectorized code will outperform scalar computers and will do so with fewer processors. Given the rather low percentage of peak performance on scalar machines, and the quite high sustained percentage of peak for vector computers, the cost effectiveness of the two architectures should be measured with fully operational codes using the guide lines of operational timeliness. Regarding the Open-MP implementation in an MPI application it would appear that the pure MPI version is still preferable. The finding that large cache machines can efficiently execute slab codes could be considered in planning and maintaining operational codes with long life expectancy. In conclusion we would also like to express our regret that there are so few vector MPPs available for use in Numerical Weather Prediction.

100 THE IMPLEMENTATION OF I/O SERVERS IN NCEP'S ETA MODEL ON THE IBM SP JAMES J. TUCCILLO International Business Machines Corporation, 415 Loyd Road, Peachtree City, GA 30269, USA E-mail: [email protected]

Asynchronous I/O servers have been recently introduced into the NCEP's Eta-coordinate model. These servers are additional MPI tasks responsible for handling preliminary post-processing calculations as well as performing the I/O of the files for post-processing and/or model restart. These activities can occur asynchronously with the MPI tasks performing the model integration and effectively reduce the model integration time.

1

Introduction

The NCEP Eta-coordinate model [1] is an operational, limited-area, short-range weather prediction model used by the US National Weather Service for numerical guidance over the North American continent. The important meteorological characteristics of the model are as follows: • • • • • • •

Eta vertical step-coordinate Arakawa E-grid for horizontal grid structure Hydrostatic or Non-hydrostatic options Comprehensive physical parameterization package Horizontal boundary conditions from NCEP Global Spectral Model One-way nesting option Initial conditions from an assimilation cycle The code characteristics are as follows:

• • • • • • •

Fortran90 with MPI-1 for message passing 2-D domain decomposition Supports 1-N MPI tasks where N is the product of any two integers Pure MPI or combined MPI/OpenMP MPI infrastructure is "hidden" Dummy MPI library provided for support on systems without an MPI library I/O and preliminary post-processing handled asynchronously through I/O servers It is the final code characteristic, the asynchronous I/O servers, that will be discussed in this paper.

2

Motivation for I/O Servers

With the original MPI implementation of the Eta model by Tom Black at NCEP, the global computational domain was two dimensionally decomposed over a number of MPI tasks that is the product of any two integers. Typically, the number of MPI tasks would be on the order of 10s to 100s and a square aspect ratio would be used (i.e. the number of tasks in the x and y direction would equal or nearly equal). For the purposes of post-processing and/or restarting, the model needs to output the state variables plus several diagnostic quantities every hour of simulated time. This was handled by having each MPI task write a separate file to a shared file system. Each file contained the state variables and diagnostic quantities for the MPI task that wrote the file. These files were subsequently read by another program (Quilt) that patched them together, performed some preliminary post-processing computations, and finally wrote out a single "restart" file that could be used for restarting the model or for post-processing. The net affect of this approach was that the model integration was delayed while each MPI task wrote its data to disk. In

101 addition, the data for restarting and post-processing was written to disk twice and read from disk once before a final "restart" file was generated. While this methodology was acceptable when the Eta model was first implemented operationally on the IBM SP system with a relatively low horizontal resolution of 32 kms, a different approach was needed for the anticipated increases in resolution to 22 kms (Fall of 2000) and 12 kms (Fall of 2001). The new approach was the introduction of asynchronous I/O servers to offload the I/O responsibility from the MPI tasks performing the model integration. In addition, I/O to disk could be significantly reduced by eliminating a write and read to disk during the Quilt processing.

3

Design of the I/O Servers

The I/O servers are essentially the incorporation of the Quilt functionality into the Eta model. Quilt, when run as a separate program, read the files written by the MPI tasks performing the model integration, "quilted" the data together, computed temperatures for model layers below ground in the Eta vertical coordinate via an elliptic equation for use in building down mean sea level pressure, and finally wrote out a single "restart" file. With the larger computational domain anticipated for the 22 km and 12 km implementations, the Quilt functionality was developed as a separate MPI code and incorporated into the Eta model. The tasks performing the model integration perform a non-blocking MPIJSEND to the I/O servers and then immediately return to performing the model integration. The net result is that the tasks performing the model integration are no longer delayed by writes to disk and the wall time is reduced. Essentially, the I/O is completely hidden as it is offloaded to additional MPI tasks (the I/O servers) whose sole responsibility is preliminary pre-processing and I/O. When the Eta model is started, additional MPI tasks are created above and beyond what is needed for the model integration. These additional MPI tasks are the I/O servers. There can be one or more groups of I/O servers and each group can have one of more I/O servers. The I/O servers that are part of a group work collectively on receiving the data from the tasks performing the model integration, perform the preliminary post-processing, and write the "restart" file to disk. Multiple output times can be processed at once if there are multiple groups. In other words, if the time to process an output time is greater than the time for a group of I/O servers to finish their work then multiple groups of servers can be initiated. Again, the idea is to completely hide the I/O time. Figure 1 show an example of the relationship between the tasks performing the model integration and two groups of servers with 2 servers each.

30

31

32

33

34

35

24

25

26

27

28

29

18

19

20

21

22

23

12

13

14

15

16

17

6

7

8

9

10

11

0

1

2

3

4

5

MPI_COMM_COMP MPI COMM TNTER A ( * l

Server„Group_0

Server„Group_1

1

0

MPI COMH_C OHP COMM_I NTER

0

MPI_COMM_COMP MPI_COMM_INTER

MP I_C0MM_W0RLD

Figure 1. An example of the relationship between the MPI tasks performing the model integration and two groups of I/O servers with two I/O servers each. In this example, the model integration uses 36 MPI tasks and there are two groups of I/O servers each with two I/O servers. With this arrangement, the model would be initiated with a total of 40 MPI tasks.

102 Each group of I/O servers will process every other output time. In other words, the tasks performing the model integration will round robin through the groups of I/O servers. The servers within a group work in parallel and perform their own MPI communication that is independent of the communication taking place between the tasks performing the model integration, between the tasks of other groups of I/O servers, and between the tasks performing the model integration and other groups of servers. This separation of the communication patterns is accomplished through the use of the intracommunicator and intercommunicators depicted in Figure 1. The intracommunicator ( MPI_COMM_COMP) is created by using MPI_COMM_SPLIT to split MPI_COMM_WORLD. The intercommunicators are created with MPI_INTERCOMM_CREATE and provides the mapping between the tasks performing the model integration and the I/O servers of the different I/O server groups. Since the I/O servers use a one-dimensional decomposition strategy, the communication between the tasks performing the model integration and the I/O servers maps a two-dimensional decomposition to a onedimensional decomposition. The tasks performing the model integration actually use an array of intercommunicators as they must be able to toggle between which group of servers they send data to. Each group of I/O servers needs a scalar intercommunicator since they only need to be able to map to the tasks performing the model integration.

4

Configuring the I/O Servers

How do you decide on how many groups of servers and how many servers per group are needed? Basically, the number of I/O servers per group is controlled by the size of the final "restart" file since the entire file is buffered by the I/O servers. On the SP at NCEP, each node contains 4 CPUs (4 MPI tasks are typically initiated per node) and 2 Gbytes of memory. Although the virtual operating system allows disk to be used as an extension of real memory via paging, it is best, from a performance point of view, to not use more than the amount of real memory per node. For the current operational 22 km configuration, each "restart" file is approximately 600 Mbytes so the memory of a single node is sufficient and 4 servers are used to consume the 4 CPUs on a node. For the 12 km implementation, the "restart" file will be in excess of 2 Gbytes so the memory of 2 nodes will be required - 8 I/O servers. In selecting the number of groups of I/O servers you need to know the time it takes a group of I/O servers to complete the processing of an output time plus the amount of wall time for the model to integrate to the next output time. At NCEP, the goal is to produce 1 simulated hour per minute in operations regardless of the resolution of the model. With an output frequency of 1 hour of simulated time, a group of I/O servers must be able to finish their processing in less than one minute otherwise a second (or third, or fourth) group of servers is needed. For the current 22 km operational configuration, one group of I/O servers is sufficient. For the 12 km configuration, 3 groups of I/O servers will be required. The number of I/O server groups and the number of I/O servers per group is specified at run time and the code monitors whether sufficient I/O server groups have been specified.

5

Performance Impact of the I/O Servers

The impact of the transfer of data from the tasks performing the model integration to the I/O servers has been examined. The tasks performing the model integration are able to perform MPIJSENDs to the I/O servers and immediately resume model integration without affecting the model integration time as determined by measuring model integration time with the I/O servers disabled. With a properly configured number of I/O server groups, the I/O is completely hidden from the model integration time with the exception of the output for the final simulated hour. In operations at NCEP, the Eta model post-processor is initiated as each "restart" file is written so the post-processing is completely overlapped with the model integration except for the last output time. 6

Acknowledgements

This work was based on many discussions with Tom Black, Geoff Dimego, Brent Gordon, and Dave MichaudofNCEP.

103 7

References 1. Black, Thomas L., 1994: The New NMC Mesoscale Eta Model: Description and Forecast Examples. Weather and Forecasting: Vol. 9, No.2, pp. 265-284.

104 IMPLEMENTATION OF A COMPLETE WEATHER FORECASTING SUITE ON PARAM 10 000 Sharad C.Purohit, Akshara Kaginalkar,. J. V. Ratnam, Janaki Raman, Manik Bali Center for Development for Advanced Computing Pune University Campus Ganesh Khind Pune-411007 E-mail: [email protected] The entire suite of weather forecasting, based on T80 global spectra] model, which consists of decoders, quality control, analysis and the forecasting programs have been ported onto PARAM 10000 super computer from CRAY -XMP. PARAM10000 is a distributed shard memory system scalable up to lOOGFlops. The most time consuming portions of the suite, the analysis code ( ssi80) and the forecasting code ( T80) were parallelized. Due to inherent dependencies in the ssi80 code, it was parallelized using compiler directives, which takes advantage of the shared memory architecture of the PARAM 10000 machine. The T80 forecasting code was parallelized using data decomposition method using Message Passing libraries. The analysis and the forecasting code show good speedups. The fault tolerance required for the forecast cycle was devised at the script level.

1

Introduction

The prediction of weather using numerical models involves many complicated steps. The models require initial conditions to be specified for predicting the weather. These initial conditions are prepared from the observations obtained from various observatories spread over the whole globe. The observations are to be passed through quality control checks for checking their quality and then interpolated to the model grid using different methods. The National Center for Medium Range weather Forecasting Center, New Delhi, issues a medium range weather forecast based on T80 global spectral model. The observations are received through GTS and decoded using ECMWF decoders. The data is checked for its quality using a number of quality control methods. It is then interpolated to the model grid using spectral statistical interpolation method (ssi80). The prepared initial conditions are used in the forecasting of weather for five days using global spectral model T80. The entire weather forecasting suite was ported onto PARAM 10000 supercomputer from CRAY-XMP. The porting involved rigorous checking of the results, optimization of the codes for best performance on PARAM 10000 and parallelization of the programs. It was found that the analysis code (ssi80) and the forecasting code (T80) were the most time consuming portions of the weather forecasting suite. An attempt was made to parallelize the codes to take advantage of the distributed-shared memory architecture of the PARAM 10000. The T80 code was parallelized by the data decomposition method using Message Passing Libraries ((MPI forum, 1995) and proprietary C-DAC MPI ( Mohanram et.al 1999). The ssi80 code, because of its inherent dependencies was parallelized using compiler directives and Pragmas. To make the whole weather forecasting suite fault tolerant to system failures, a script was developed to create a data image after the execution of each components of the suite. In this paper section 2, describes the architecture of PARAM 10000 supercomputer, followed by ssi80 and T80 model description (section 3). The parallelization strategy is discussed in section 4. Finally the results are discussed.

105 2 PARAM 10 000 PARAM10000 supercomputer was indigenously developed by Center for Development of Advanced Computing, C-DAC. It is a cluster of workstations based on UltraSparc family of microprocessors, which operate on Solaris 2.6. The workstations are configured as compute nodes, file servers, graphic nodes and Internet server node. PARAM10000 consists of thirty-six compute nodes. Each compute node is a high performance, shared memory, symmetric-multiprocessing UltraEnterprise 450 server from Sun Microsystems with four cpu's. The nodes are interconnected with Myrinet and FastEthernet. C-DAC has developed High Performance Computing and Communication (HPCC) software (Mohanram et. al, 1999) for PARAM 10000. This software architecture allows the ensemble of workstations to be used as independent workstations, cluster of workstations, or Massively Parallel Processors System connected through a scalable high bandwidth network. The HPCC suite provides many tools for developing parallel codes using message passing and for debugging them.

3 a Description of ssi80 model The Spectral Statistical Interpolation program ( Rizvi et. al 1995 ) is used for interpolation of the observaed data to the forecast model grid. The objective of this program is to minimize an objective function defined in terms of the deviations of the desired analysis from the guess field, which is taken as the six hour forecast and the observations weighted by the inverse of the forecast and the observation errors respectively. The objective function used by ssi is given by

J=l/2[ (x-xb)TB(x-xb) + [yobs -R(x)]To[yobs - R(x)]] Where x N component vector of analysis variables xb N compnonet variable of background variables ( forecast or guess variable ) yobs M component vector of observations B NxN forecast error covariance matrix O MxM observational error covariance matrix R Transformation operator ( nonlinear ) that converts the analysis variables to the observation type and location N Number of degrees of freedom in the analysis M Number of observations The data flow of the programme is given in fig 1. The ssi program calls three main routines The first set of routines read the input data ( decoded satellite data ). The second set of routines, are used for setting the right hand side of the optimization function. The third set of subroutines carry out the interpolation.

106

SSI ALGORITHM

c

READ INPUT I W A

3

SET RHS FOR SOI

READ THE SPECTRAL VARIABLES IAND CONVERT THEM TO GRID PERFORM VERTICAL ADVECTION AND TIME INTEGRATION IN GRID SPACE.

ADVECTION ALONG KIN GRID (PARALLELIZABLE LOOP) ADVECTION ALONG J

ADVECTION ALONG I

CONVERTS GRID VAR BACK TO SPEC.

CHECK STOPPING CRITERION

c

FINALISE

J

Figl. The ssi80 program was developed for a Cray vector processor. The porting of this code onto PARAM10000 involved rearranging some of the data structures and removal of asynchronous I/O, which was used for writing intermediate files. The performance bottlenecks of this programme were found to be in a) File I/O b) Subroutines which convert variables from spectral to grid space and back to spectral. For performance tunning, the intermediate files were handled using low level system I/O ( Solaris pread and pwrite. ). This modification gave a performance improvement of 25% in the total execution time. The rearrangement of the subroutines, which were tuned for the vector processor, involved the removal of unwanted test conditions and loop rearrnagements. This improved the timing of the sequential code by 10%.

107

Fig 2. 3b. Description of T80 global spectral model The global spectral model was developed at NCEP and was modified by National center for medium range weather forecasting. A detailed description of the model can be found in Purohit et.al, 1996. The model has eighteen levels in the vertical and has an horizontal resolution of 1.5° x 1.5°. The model can broadly be divided into three modules a) The physics module b) Dynamics module and c) the radiation module. These modules carry out the computations along the latitude loops, i.e. the variables along all the longitude and all levels are computed at a particular latitude at a particular time. The time step used in the model was fifteen minutes.

4a. Parallelization of ssiSO On analyzing the sequential ssi80 code, we found that the conversion of the spectral to grid domain and vice versa was carried out independent of levels. The model (fig 1.) requires about hundred iterations for the calculation of the final values. The iteration loop is found to be highly dependent and is not suitable for explicit parallelization using Message Passing across the nodes. However, to take advantage of the independency of the model along the levels, the explicit parallelization using the compiler directives and Pragmas was carried out within the node. This type of parallelization takes advantage of the SMP architecture of the node. The parallelization using the Pragmas and compiler directives required us to identify the loops, which can be parallelized. This type of parallelization requires the user to carry out the data and loop dependency analysis. On carrying out the loop analysis it was found that some subroutines required rearrangement to remove the dependency and become amenable to paralleization. The code after compilation was run on different number of processors of a node, by specifying the number of processors as an environmental value. 4 b. Parallelization of T80 The global spectral forecast model T80 was parallelized using data decomposition method (Purohit et.al 1996). The code was parallelized along the gaussian latitudes. Master- Worker paradigm was used for the parallelization. Initially the master program reads the input files and distributes the data to the worker programs. The workers along with the master program carry out the computations. The global spectral variables are updated, in both the master and worker program, every time step using the mpLallreduce call.

108 At the end of the program, the master program receives all the data required for writing the final output file and writes it. One day forecast timmings using public domain MPI and C-DAC MPI using active messaging 100 8888 90 » 80 I 70 H 60|0 5 0 -

D Public Domain MPI • C-DAC MPI

fl

i 40 f 30 u 8 20111 10 0

r>

1

-M

r^o

38

Ml _•

!_•

22

LJB

1 2 4 8 Number of processors of PARAM 10000

Fig. 3 5 Fault Tolerance

The hardware of PARAM 10000 has fault tolerance at node level. In case of some failure in the node, the node does a proper shut down without corrupting the data in the node. At the system level, fault tolerance is implemented to save rerunning of the codes from system failures. The script writes intermediate files, of input data, with appropriate time stamp before the execution of each module of the forecasting suite. In case of a system failure the user can run a restart script which will start execution from the point where the interrupt had taken place, reading the intermediate files. 6 Results The entire forecasting suite, which consists of about twenty-five programs, was ported on to PARAM 10000 from CRAY-XMP. The results of the individual programs were verified for correctness, by running the programs on both the systems simultaneously and verifying the output files. The performance of the entire forecasting suite was dependent on the performance of the analysis code and the forecasting code. Hence the analysis code and the forecast model were parallelized. Here, we present the results of the speedup of the analysis and the forecast model on varying number of processors The performance of the parallelized analysis code ssi80 and the forecasting code was measured for varying number of processors. The analysis code was compiled using the compiler directive -explicitpar. The stack for the subroutines, which had pragmas, was increased for the improving the performance of the code. The code was run on varying number of processors of the node, specifying the number as an environmental variable. Each node of PARAM 10000 has four processors. The speedup of the code is shown in fig 2.

109 From the figure it can be seen that the code has good speedup till three processors and the performance degrades after that. The good speedup can be attributed to even distribution of the work on all the processors and very a little overhead in creation of threads by the compiler.

The parallelized forecast model was run on varying number of processors for different net works. The code was run on Fast Ethernet and Myrinet networks. The results are shown in fig. 3. The proprietary C-MPI was used while the code was run on Myrinet. C-MPI was improved upon the public domain MPI (MPI forum, 1995). The point-to-point communication calls were optimized in C-MPI. From the fig it can be seen that the model has a good speedup till eight processors. The performance of the model degrades after eight processors. An analysis of the model, showed that the communication overheads dominate as the number of processors increase beyond eight processors.

Conclusions The entire forecasting suite was successfully ported and optimized on the PARAM 10000 supercomputer. The most time consuming programs of the suite, the analysis code and the forecast code, were parallelized. It is found that the parallelized analysis code shows good speedup for varying number of processors within a node. The good speedups are due to load balancing of the work and also due to very little overheads in the creation of the threads by the compiler. The forecast model is found to speedup till eight processors, after which the performance degradation was observed because of increased communication overheads. A script was developed to create fault tolerance in the execution of the weather forecasting suite.

References 1. Message Passing Interface Forum, MPI : A Message Passing Interface Standard, Version 1.1. , 1995, available at URL http://www.mcs.anl.gov/mpi 2. Mohanram, N., V.P. Bhatkar, R.K. Arora and S. SasiKumar, 1999, HPCC Software: A Scalable Parallel Programming Environment for Unix Cluster. In Proceedings of 6lh International Conference on Advanced Computing, edited by P.K. Sinha and C.R. Das. NewDelhi : Tata McGraw-Hill Publications, pp 265-273. 3. Rizvi, S.R.H. and D.F. Parrish, 1995, Documentation of the Spectral Statistical Interpolation Scheme, Technical Report 1/1995, NewDelhi: National Center for Medium Range Weather Forecasting, 38pp. 4. Purohit, S.C, P.S. Narayanan, T.V. Singh and A. Kaginalkar, 1996, Global Spectral Medium Range Weather Forecasting Model on PARAM, Supercomputer, 65: 27-36.

110

PARALLEL LOAD BALANCE SYSTEM OF REGIONAL MULTIPLE SCALE ADVANCED PREDICTION SYSTEM JIN ZHIYAN National Meteorological Center of China 46#Baishiqiao Rd., Beijing, 100081, P.R.China E-mail: jinzy @ rays. cma. gov. en We design a Parallel Load Balance System for our new meso-scale model R_MPS, it is the parallel processing driver layer of the model. The main functions include the automatic domain decomposition and dynamic load balancing. Non-rectangular shaped domain is used to archive better load balance. Dynamic load balance is implemented by global view of load balance and data re-mapping. Primary simulation result is presented

1

Introduction

This work belongs to the sub-project to generate a regional multiple scale nonhydrostatic model-Regional Multiple scale Advanced Prediction System. It is supported by the national 973 project. The model is under the development by the Numerical division of the National Meteorological Center of China. The model will be the production and research model of the center. It will be a completely redesign code, targeted for the 10-lkm grid-scale. The principal aim if this work is to generate a software named Parallel Load Balance System (PLBS) to take care all of the parallel processing of the model, separate all of the other scientific code from the parallel processing code, and make the model portable on different architecture machines. The performance of any parallel application depends on good load balance. For lowresolution model, the biggest load imbalance is in the radiation [1], which can be estimated at runtime. In a high-resolution model, microphysics can easy destroy the static load balance [2] and the load distribution cannot be estimated before the forecast. Fortunately, the load of each grid point does not vary too much and the load of the previous time step is a good prediction of the next time step [3]. Previous work [3][4][5] had studied some technique of load balance. In this paper, we use another approach to deal with the problem. Section 2 is the overview of the system. In section 3 and 4, we will compare some of the domain decomposition methods and the data layout of the sub domain. Section 5 will discuss the load balance strategy. Section 6 is the simulation results section 7 is the conclusion.

111

2

Overview of PLBS

PLBS works between the model and Message Passing Interface (MPI). It is be very portable to any machine that supports MPI with the architecture of shared or distributed memory parallel machines and distributed memory machine with SMP node, The software handles all of distributed parallel processing and I/O for the model. At present, parallel I/O has not been implemented in PLBS. For any I/O, one node read the input file from disk and distributes the proper data to other nodes, and collect the output date from all of the nodes and write to disk. Initial

Index mataince

Duta Distribution

Data exchange

-*

I Figure 1. Over view of PLBS

Figure 1 shows the overview of the parallel processing part of the PLBS. It is composed of six parts, initialization, partitioning, indexes maintenances, data distribution, data exchanges and load balancing. The Initialization is the first part that should called before any other parts of the PLBS. It Initialize the MPI, get the environment parameter, and get the options of the command line. It tells the PLBS the size of the domain of the model and how many processors are available for the model. Then the partition part calculate the domain decomposition with the methods specialize by the user. After that, the index maintain part establishes the local table of the neighbors of the grid points by reference in global index of the grid points, establishes the list for data exchange, input and output, etc. At the end of each time step, the load balancing module will timing the running time of each node and test the load is balanced or not. If certain condition is matched, It will lunch the partitioning module to decomposition the domain according to the new load distribution, and then test the load balance of the new decomposition is better or not

112

and redistribute the data and establish the new indexes table according to the new decomposition when the load balance is improved. It will discard the new decomposition if there is only a little or no improvement and continue the data exchange like any other time step. The user does not need to do any thing and the whole process is transparence to user of PLBS.

. . . . . 1~TZ 7 T T . . . . . .

. . . . .

. . . . . . . . . . . .

Figure 2. Four partitioning methods. Shade area is high load area

3

Domain Decomposition Methods

The model is intended to be run at the resolution of 10-1 km, the heavy load of the microphysics can easily destroy the static load balance and the performance will suffer a lot by load imbalance of the physics of the model. The method of domain decomposition is crucial to balance the load, It should be capable to distribute the load of the model as even as possible among processors on any given load distribution. Four methods have been tested, which were shown in Figure 2. In the first method, the processors are divided into Na by Nb groups in x and y directions, and the grid points in each direction are divided as even as possible by Na and Nb. This method is base the assumption that each grid point has roughly the same workload, which is true in many lower resolution limited area models. It has no capability to deal with the load imbalance of the physics in the model. The static load balance is poor when the number of processors does not matches the grid point very well. We use it as a base to evaluate other methods. In the second method, the whole load is divided as even as possible into Na columns and Nb rows. The area of the sub-domain can be

113

changed corroding to the load of each grid point, It keeps that each sub-domain has four neighbors and does not change them during the period of the calculation. The first step of the third method is the same as the second one, but in the second step instead of divide the whole load of domain, it divide the load of each column into Nb rows seperatly. Each sub-domain keeps in rectangle shape, the neighbors change with the changing of the load of each grid point. The fourth method is the same as the third one but introduce little steps when divide the columns and rows, which is quit similar to IFS [6]. The steps make the sub-domain irregular shape, with which the load can be distributed more even among processors. The numbers of rows in each column do not need to be the same in the third and fourth methods. This gives the advantage that the model can use any number of processors properly. Figure 2. Shows the four methods. We used the second order linear diffusion equation to test the performance of each method. In order to evaluate the capability of load balancing, we added a simulation of physics in the code. The workload of the grid points at the middle area is ten times higher than in other area. Figure 3 show the speedup of each method. Obviously the fourth method is the best, and we use it in PLBS.

100

1

2

4

6

16

32

64

Figure 3 Speedup of the linear diffusion equation of four methods

It is possible to use other domain decomposition method in PLBS. User can specify their own method in the system or just tells the system the sub-domain map. 4

Indexes Maintenance

A sub-domain is divided into three parts, the area that overlap with neighbors, called outer boundary, the area that need the data of outer boundary, called inner boundary,

114

the area that does not need any data of outer boundary, called inner area. We use one dimension array to hold all of the data which is shown in fig. 8. The index of local array has nothing to do with the grid position. A lookup table of the local index with the global index must be introduced, and also a neighbor table that tells where are the neighbors. The data exchange list of the overlap area and I/O list must also be maintained.

Figure 4 Data layout of a sub-domain 5

Dynamic Load Balance Strategy

The function of load imbalance is defined as:

T 7

,

-T

m

average

imbalance =

— T average

7 ^ is the maxim of the clock time among processors. Tav

e

is the average clock

time of the processors. we measured the load of every grid point and time each node every time step. At the end of each time step, before the normal date exchange of the overlapped area, the load imbalance module is called. We defined a threshold Kthreshold . If the function is less than the threshold, normal data exchange is performed. If the imbalance function is greater than the threshold, we start to re-partition the whole domain according to the new load distribution of the last time step. After the repartition, we test the load imbalance function again to make sure that the new partitioning is much better than the old one (but this seems can be removed for the new partition always much better than the old one). If the new partition is adopted, data re-mapping is performed to move the data from old buffer to new one. Each node scan their old sub-domain to make a list that where each grid point would be send to, and scan their new sub-domain to know where and how many grid points of their date should

115

be get from. Each node send the list and the data to the nodes which their old data need to go and receive form the nodes which their new data need to get from and copy the data from the receiving buffer to model buffer according to the list. If the data is on the same node, we just copy them form old buffer to new buffer. The index tables of the new decomposition should also be maintained. We found that the performance is very sensitive to the threshold and the adjustment is too often and the overhead is very high when the threshold is small, and sometimes the imbalance function was very high at random place due to the turbulence of the system for unknown reason. We changed the dynamic load balancing strategy a little bit, the re-partition would be called only when the load imbalance function is greater than the old one for certain times to avoid the unnecessary adjustment and control frequency of re-partitioning. The load of every grid point is only measured at the last time step before the re-partition.

6

Experiment result

Again, We use the second order linear diffusion equation to test the system. The domain is 103 by 103 square with 100 levels, we added the simulation of the physics in it that there is a heavy load area in the domain. The load of the area is 10 times higher than other area and let the area move form the north west to south east of the domain, which is shown in Figure 5.

Figure 5 A heavy load area goes through (he domain

The system is IBM SP, which has 10 nodes, and there are 8 processors share the 2 Gbytes memory in each node. In the experiment, each processor was used as a node. Figure 6 shows the result of the experiment.

116 18 1 16

-•-Ideal "•"Static

14

" * ^ l nproved

12

-*-Threshold(0. 1) -*-Threshold(0. 2)

-*-7rireshold(0. 3)

0 8

16 24 32 40 48 Nurrber of processors

56

64

Figure 6 Speedup of the different thresholds and improved strategy It is obvious that dynamic load balance improves the performance a lot than the static load balance scheme which is method one as we talked before, But the small threshold does not means higher performance for the overhead of re-partition, data re-mapping, indexes table maintenances, etc. The improved curve is the changed strategy that the re-partition is only after the imbalance function higher than threshold (10%) for 10 times, And Figure 7 shows the load imbalance function of it. We can see some turbulence in the figure.

M0k Figure 7 The variation of load imbalance function with time steps of the improved strategy.

117

The overhead is mainly dominated by the in the calculation of partition, calculation the data layout of the sub-domain, the indexes maintenance, and the establishment of new communication relationship. The overhead of the communication of data remapping is slightly higher than that of the ordinary data exchange. Fig. 12 shows the breakdown of the overhead. 0.12 0.1

|other overhead icormtini cation

0.08 0.06 0.04 0.02 0 r e-par t i t i on

exchange

Figure 8 The variation of load imbalance function with time steps of the improved strategy.

7

Conclusion

We have built the PLBS for our new model R-MAPS. Premier experiment shows that the irregular shaped sub-domain can balance the load better than rectangle one. Dynamic load balance is realized through repartitioning and data re-mapping. The overhead is mainly is the calculation of the partitioning, the data layout of the subdomain, the indexes maintenance, the establishment of new communication relationship, etc. The communication cost is not too high compared to normal data exchange of the overlapping area. Small threshold does not means high speedup. Choosing the dynamic load balance strategy carefully is important to the performance. References 1.

2.

Michalakes, J.G., Analysis of workload and load balancing issues in the NCAR Community Climate Model, Tech. Memo. ANL/MCS-TM-144, Argonne National Laboratory. J.P.Edwards, J.S.Snook, ZChrstidis, Use of parallel mesoscale model in support of the 1996 summer Olympic games, Making Its Mark, Proceedings of the Seventh Workshop on the Use of Parallel Processors in Meteorological, November 2-6,1996.

3.

4.

5.

6.

R.Ford, D.Snelling, A. Dickinson, Control Load Balance, Cache Use and Vector Length in the UM, Coming of Age, Proceedings of the Sixth Workshop on the Use of Parallel Processors in Meteorological, November 21-25,1994. HenDrik Elbern, Load Balancing of a Comprehensive Air Quality Model, Making Its Mark, Proceedings of the Seventh Workshop on the Use of Parallel Processors in Meteorological, November 2-6,1996. J.M.Bull, R.W.Ford, A.Dickinson, A Feedback Based Load Balance Algorithm for Physics Routines in NWP. Making Its Mark, Proceedings of the Seventh Workshop on the Use of Parallel Processors in Meteorological, November 26,1996. S.R.M.Barros, D.Dent, L.Isaksen, G.Robinson, The IFS model - Overview and Parallel Strategy Coming of Age, Proceedings of the Sixth Workshop on the Use of Parallel Processors in Meteorological, November 21-25,1994.

119

GRID COMPUTING FOR METEOROLOGY GEERD-R. HOFFMANN Deutscher Wetterdienst Kaiserleistr. 42, D-63067 Offenbach, Germany email: seerd-ruedieer. hoiTmann(Ssxtwd. de

ABSTRACT The development of the Internet and the advent of "portals" have led to an understanding of the network infrastructure as being equivalent to the electricity grid providing computing services instead of electricity. A number of projects have embarked on exploiting this new understanding, amongst them GLOBUS and UNICORE. The requirements for providing a "plug" to the computing grid and the underlying techniques are described. Furthermore, the necessary features of the applications using these services are considered. In the context of the EUROGRID project, funded by the EU, the planned meteorological application of a relocatable limited-area model is presented in detail. In particular, the services necessary for the application that have to be provided by the Internet are discussed. An outlook into the future of grid computing is given.

1. Introduction The exponential growth in the number of Internet connected systems in the last two decades has led to considerations of how the computer performance thus linked could be combined and used by applications. Following early tries in the late eighties (see e.g. [7]), some concerted efforts were started in the first half of the nineties (see e.g. [1] and [6]), mainly in the USA, to harness the total potential computing power of the Internet for specific users. Similar activities were started in Europe (see e.g. [3],[8]). However, only when I. Foster and his colleagues developed the idea of the Computing Grid in 1999 (see [5]), did the movement gain momentum and was supported by a number of projects bringing together users and computing service providers. At present, a variety of different international projects are under way to develop the technologies and applications for the Internet-wide exploitation of computing resources. 2. Definition and history of GRID computing The analogy used in [5] is that the Internet is seen as an electricity grid, but instead of delivering electricity to the users it provides computing resources. You just use the right "plug", and then you have access to a nearly limitless supply of computing units. Of course, a number of sometimes difficult requirements have to be met before this analogy can be turned into reality: • •

availability of a high speed network Internet a "plug" an interface to the computing services requested

120 • • •

an application a computing application that can be invoked via the Internet by authenticated users computing service provider an Internet service provider which offers computing resources for the application charges a billing unit for computing services

Assuming that all these preconditions are fulfilled, the structure of GRID computing can be depicted as:

Fig. 1: Structure of GRID computing Simplifying the structure to its basic components, the following, well known computing services all follow the GRID computing paradigm: • remote computing Required computing performance is not available locally. • metacomputing, co-operative computing Required computing performance is not available in one location. • application specific computing Required computing services are only available at specialised centres. GRID computing is a superset of remote computing techniques, some of which have been known for more than 30 years. However, GRID computing puts the emphasis on the generality of the approach and thus allows new insights into the topic.

121 3. Enabling technologies There are numerous projects, which have concentrated on solving parts of the problem posed by trying to put GRID computing into practice. In the following, only two such activities are briefly described, because they form the basis for the two European GRID projects currently funded by the EU, i.e. DATAGRID and EUROGRID. A more complete overview can be found in [2].

3.1. GLOBUS GLOBUS (see [4]) is a research and development project at Argonne National Laboratory that targets the software infrastructure required to develop usable highperformance wide area computing applications. Its development is focused on the Globus toolkit, a set of services designed to support the development of innovative and high-performance networking applications. These services make it possible, for example, for applications to locate suitable computers in a network and then apply them to a particular problem, or to organise communications effectively in tele-immersion systems. Globus services are used both to develop higher-level tools and directly in applications.

3.2. UNICORE UNICORE (Uniform Access to Computing Resources (see [3])) provides a computing portal for engineers and scientists to access supercomputer centres from anywhere on the Internet. Strong authentication is performed in a uniform and easy to use way. The differences between platforms are hidden from the user, thus creating a seamless interface for accessing supercomputers, compiling and running applications, and transferring input/output data. It is being developed by the UNICORE Plus project funded by the German BMBF. It offers the following •

Intuitive access to all functions through a graphical user interface

•

Platform independent creation of batch jobs

•

Full support for existing batch applications

•

Creation of interdependent jobs to be executed at different sites

•

Secure job submission to any participating UNICORE site

•

Sophisticated job control and monitoring

•

Transparent and secure file transfer between UNICORE sites and to the user's local system.

122 The structure of UNICORE is shown in Fig. 2 below. It shows the different levels of the implementation: •

the user workstation where jobs are put together as abstract job objects (AJO), AJOs are submitted to a UNICORF, server and the status of the jobs are monitored.

•

the UNICORE server where AJOs are converted into real jobs for the target system(s), site authentication for the user is carried out, the job-step dependencies are enforced and the jobs are submitted to the local batch subsystem(s) of the participating site(s).

•

the UNICORE site(s) which participate in the project.

All information across the Internet is secured by SSL, authentication is achieved by X.509 certificates. U 80 r W o r k s t a lio n U N I C O R E

U N IC O R

r [

G

N e t w o r k

Batch

•

a t e w

Job

G U I

. u n ic <

S e r v e r ay

S u p e r v i s o r

U N I C O R E

S o r v o r

1 y*-

S u h S y st

SITE I

SITE II

Fig. 2: Overview of UNICORE architecture

4. Applications of GRID computing Based on the technologies described above, an application to be used on the computing GRID has to satisfy also a number of requirements itself to become feasible. These are self-explanatory and stem from the nature of the underlying service. 4.1. Requirements The requirements include: • • •

user demand for individual solution well defined algorithm multiple requests possible

123

• • • •

reasonably small user input data set (e.g. parameter driven) output dependent on user input other input data well defined and easily available requirement for computing performance exceeds local resources

Applications meeting these requirements are, in principle, candidates for using the computing GRID. 4.2. Meteorological portal A suitable application within the meteorological field is the running of a relocatable limited-area forecast model on demand. Its parameter are: • Input • location of mid-point • size of area • resolution (vertical/ horizontal) • forecast length • Output • forecast as GRIBfields,graphics or text This application meets all the criteria outlined above in 4.1. Its possible structure is shown in Fig. 3. The user will specify the input parameters and the site of the supercomputer on which he owns the necessary resources. The application will ensure the correct installation of the model code on the chosen site, will provide the necessary model input data from DWD and initiate the model run in accordance with the user input data. When the forecast run is completed, the output of the run will be transmitted to the user. The implementation of this GRID application will use the UNICORE technology internally. The development of the forecasting system is part of the EUROGRID project. It is expected that the work will be completed by the end of 2003.

124

Inpul data & results

UNICORE UNICORE

Fig. 3: Structure of EUROGRID meteorological portal

4.3. GRID projects supported by the EU There are at present two European projects supported by the EU. 4.3.1. DATAGRID The objective of the project is to enable next generation scientific exploration which requires intensive computation and analysis of shared large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities. These requirements are seen to be emerging in many scientific disciplines, including physics, biology, and earth sciences. The project was proposed by CERN and five main partners. 4.3.2. EUROGRID The EUROGRID project is a shared cost Research and Technology Development project (RTD) granted by the European Commission (grant no. 1ST 20247). The grant period is Nov 1, 2000 until Oct 31, 2003. The objectives of the EUROGRID project are: • •

•

To establish a European GRID network of leading High Performance Computing centres from different European countries. To operate and support the EUROGRID software infrastructure. The EUROGRID software will use the existing Internet network and will offer seamless and secure access for the EUROGRID users. To develop important GRID software components and to integrate them into EUROGRID (fast file transfer, resource broker, interface for coupled applications

125

•

•

and interactive access). To demonstrate distributed simulation codes from different application areas (Biomolecular simulations. Weather prediction, Coupled CAE simulations. Structural analysis, Real-time data processing). To contribute to the international GRID development and to liase with the leading international GRID projects.

After project end, the EUROGRID software will be available as supported product. The partners are:

Fig.4: Partners of EUROGRID

126

5. Outlook The described meteorological portal may be used initially by other meteorological services to produce limited area forecasts on demand. Later on, the service may be offered to commercial weather forecast providers to complement their offerings to the public. Furthermore, it can be envisaged that the service may be provided by an ISP to deliver customised weather forecasts to its clients. This latter use, however, assumes that the weather forecast information can be tailored for general consumption. In the long term, it may be possible to invoke a request for a detailed weather forecast based on a high resolution, limited area model run by mobile telephone and to display the results on the screen of the device. 6. Acknowledgements The author would like to thank the UNICORE Plus project team, in particular Mrs. Romberg and Mr. Hoppe, and his EUROGRID collaborators for many valuable comments and discussions. 7. Literature 1. H. Casanova, J. Dongarra (1996): NetSolve: A Network Server for Solving Computational Science Programs. In: Proc. of Supercomputing '96, Pittsburgh. Department of Computer Science, University of Tenessee, Knoxville, 1996. 2. Computing Portals Project List, www.computingportals.org/projects/ 3. D. Erwin (2000): UNICORE, Uniformes Interface fur Computer-Ressourcen. Gemeinsamer AbschluBbericht des BMBF-Verbundprojektes 01 IR 703. ISBN 3-00006377-3. UNICORE Forum e.V. 2000. 4. I. Foster, C. Kesselman (1997): Globus: A Metacomputing Infrastructure Toolkit. Intl J. Supercomputer Applications, 11(2):115-128, 1997. 5. I. Foster, C. Kesselmann (1999): The Grid: Blueprint for a New Computing Infrastructure. I. Foster, C. Kesselmann (eds.). ISBN 1-55860-475-8. 1999. 6. A. S. Grimshaw, W. A. Wulf, J. C. French, A. C. Weaver, P. F. Reynolds Jr. (1994): Legion: The Next Logical Step Toward a Nationwide Virtual Computer. Uva CS Technical Report CS-94-21, June 8, 1994. 7. M. Litzkow, M. Livny, and M. Mutka (1988): Condor - A Hunter of Idle Workstations. Proceedings of the 8th International Conference of Distributed Computing Systems, pp. 104-111. June, 1988. 8. The Polder Metacomputing Initiative. 1997. www.science.uva.nl/projects/polder/

127 THE REQUIREMENTS FOR AN ACTIVE ARCHIVE AT THE MET OFFICE

MICHAEL CARTER Meteorological Office, H106, Hadley Centre, London Road, Bracknell, RG12 9ER, United Kingdom E-mail:

[email protected]

The Met Office is currently installing a new mass storage system, called MASS, to act as an active archive. That is, much of the data that will reside in MASS will be actively accessed and used. Data use is central to the business of many areas of the Met Office and, because the volumes are so large, we see rapid delivery of data to users from cheap media as a fundamental requirement of our infrastructure. We estimate that archive rates will be in excess of 80 Tbytes per annum during the next year and will increase to 500 Tbytes per annum, by 2005. At the end of this period, we expect to have almost 1 Pbyte of data in the archive and to be able to retrieve data from files that would, if retrieved as whole files, imply Terabytes of data restored from archive each day. In defining a requirement, the MASS team has relied heavily on experiences gained with current archive and access methods, particularly the systems developed for the Climate Research division of the Met Office, currently the holders of the largest data volumes. This paper describes the requirements of users and administrators that we have considered in the MASS project as well as the fundamental need to contain costs. It focuses on the difficulties in specifying a system that needs to balance conflicting requirements. In particular, an effective active archive must give the user good response (particularly for small requests), use its resources efficiently and continually manage the data within it. As user needs will change in time, the system needs to deliver all this in a flexible way as well as providing a very high degree of reliability and integrity.

1

Introduction

The Met Office is in the business of providing weather, climate and other related services many of which rely on storage of and access to data. Numerical models for climate prediction, weather forecasting and scientific research generate most of the data and, although the data is used actively for analysis, the volumes are too great to store or manage on magnetic disks. A managed data repository is central to the business of the Met Office. The Met Office is in the process of installing a new system to provide this functionality called MASS which is viewed as an active archive: that is, much of the data is write once, read many. The system will be based on the StorHouse product provided by FileTek. The most challenging aspect for MASS will be dealing with the data from the numerical models that form part of the Met Office's Unified Model system (UM). This is a multi-component model incorporating atmospheric, oceanic, biomass and ice processes. Experiments built on this system can take a long time to run (up to months) on an expensive super-computer system. Rerunning such models would reduce the available CPU resource, which, given that the problem is compute bound, is very unattractive. Hence, an economical way to store and manage large amounts of frequently accessed data is essential. Today, this means sequential media, such as tape cartridges. Another key aspect to data access at the Met Office is that, no matter how we lay data down, users will want to access it in another dimension. Combined with the fact that we have a significant number of restores for large amounts of data from any given experiment, data access from sequential media is always going to provide a challenge.

128 2

The Current Archive System

AML/2 Robdlo Library

Observations Inwards

Racking

Workstation LAN

T3E-B:640 PEs

Figure 1 Figure 1 shows the current archive system and main data flows. Data is generated primarily on the two T3E platforms. Other data sources are the observational data systems and the distributed workstation systems. The data management system runs on the IBM (OS390) platform which is connected to an AML/2 robotic library. Because of pressures on space in that robot, some of the data resides outside the robot in racking. For disaster recovery, some data is also replicated in off-site racking. T3E IBM

Climate Model

CRACER Archive Server

UM Archive Server Send trigger

N

etc. until trigger big

Write data to file

^>

m

1

File done, tell server

Pick up triggers and archive files

In O i"1-—J

Figure 2 In the Met Office's Climate Research Division, scientists have access to an archive and restore system called CRACER that was developed as an interim solution to archive problems before MASS.

129 Figure 2 shows the archive processes in the CRACER system. As a climate model runs it generates output files that are closed and refreshed (by a new file with a new name) periodically. Once a file is closed, a message is sent to a UM archive server dedicated to dealing with the output for that model run (one server for each run). This server immediately clears the data off the T3E and puts it into a data park on the IBM system. Once a number of such files has built up, the UM archive server sends a trigger file containing archive instructions over to the IBM. This is done when the space in the data park used by the run reaches a given size. The CRACER archive server on the IBM will pick up new trigger files and write the data to tape. This system allows the archive process to use resources more efficiently as it both reduces the need to mount cartridges and reduces the number of items in catalogues. Not shown in this diagram is a cross checking system where data that is sent for archive is listed and sent to another software system to ensure that it does end up on tape. Also, not apparent from Figure 1 is the fact that data from one experiment is always collocated onto one physical set of tapes, all but the last being fully utilised. This significantly improves retrieval efficiency, as most requests for data require very few tape mounts. The development of this system and the cross-checking done has significantly reduced the amount of time users need to spend managing the archiving of their data. It has proven to be very robust. Figure 3 details the restore system. This is even more complicated than the archive system. The user requests data, with optional processing, via the PARIAH command. Alternatively, they can use a metadatabase which can generate the call to PARIAH for them.

Figure 3 PARIAH generates a trigger file that is sent to the IBM defining the data required and the processing on that data. Often the processing is a filtering request allowing the user to get a selection of data from larger files. The CRACER restore server reads the trigger file and submits a process to restore the data from tape. Once restored, the process task is submitted and the result file left on disk on the IBM along with a pointer file that indicates the data is ready to be recovered back to the workstation system. Periodically a task on the workstation system is started that looks for these pointer files and, under instructions contained in them, restores the data to a specified location on the workstation. This system is designed to be robust to machine outages and network problems but is let down by the restore process, which sometimes results in errors of various kinds. Not shown here is that restores from tape are made to a disk pool only if the data is not already residing there. The disk pool acts as a cache and we currently get up to 60% cache hits from this system. This is largely as a result of users only asking for a subset of the data from any given file. Often, the same user may want to look at more data from the same file within a short period of time.

130 Also not show here is the collection of queues, limits and quota systems that are in place to try and reduce the impact of large requests on the performance of smaller requests. The CRACER system has various weaknesses that need to be addressed by its successor MASS: • There are limits in the namespace of underlying software that will imply reduced performance as the number of files grows. • The system, as can be seen, is very complicated. This has a maintenance overhead and implies less than ideal error handling. • Too much intelligence is built into the client-side software, for example quotas and metrics relating to available disk cache. The system delivers excellent collocation, but as cartridge capacity rises, this leads to poor cartridge utilisation and ultimately poor performance as data is forced outside the robot. 3

Estimated Requirements of MASS

Figure 4 Figure 4 shows the throughput estimates for the MASS system. Archive throughput rises from about 220Gbytes per day to 1.4 Tbytes per day at the end of the period. The unfiltered restore curve assumes that the user brings back data as whole files as laid down. This starts at about 400 Gbytes per day and rises to 3 Tbytes per day. Comparing this to the filtered restore curve, where the user just gets exactly the data they need, it is evident that there is a significant benefit in having filtering mechanisms within the MASS system. Here, we start the period with a throughput of just 30Gbytes per day and end with 220 Gbytes per day. These graphs are based on projections of affordable super-computer capacity including an upgrade in 2003, but no further upgrades are included.

131 •

Total Volume Held m MASS at year end (Tbytes)

2530 ;cao I5C0 1GC3

M a 2001

2002

2003

2004

2*05

2M(

2*07

20N

200»

2010

Figure 5 Figure 5 shows estimates of the data volumes that will be held in MASS. Here it is assumed that data in the current archive is transferred into MASS over a two-year period. An enhancement in super-computer capacity is assumed in 2003, but no enhancement after this. Towards the end of the period, the curve starts to flatten as data is deleted. This would not happen if further super-computer power was made available. 4

The Business Case for MASS

The business case for MASS storage is strongly linked to the business case for super-computing. Numerical weather and climate predictions are limited in quality by the availability and affordability of computing capacity. Climate prediction in particular produces large amounts of data with a significant shelf life (typically three to five years). Spending money on mass storage reduces funds available for supercomputing. However, not storing data implies an increased requirement for computing power to allow data to be regenerated. Within climate research, the business case for the MASS system was based on this balance between the cost of the storage system and the need to regenerate data if less data is stored. A key element of a storage system is its data management features. These are required to ensure that human resources and volume-related costs do not escalate because of the inability to safely delete data or move to higher capacity media. 5

Mass Requirements

There are too many requirements of a storage system to list here. In this section, I highlight requirements that are less obvious and/or more challenging to satisfy. 5. /

User Requirements of MASS

Users need a system that allows them a logical view of their data. They should not be concerned about where their data is held. However, for certain large restore tasks it is important that the user is given facilities that will allow them to automate those tasks in an efficient way. For example, within the Met Office's Climate Research Division we need to export large amounts of data from the archive to another site. The volumes are such that it cannot be held on local disk space and the data may be spread over many tape volumes. Because of the constraints of local disk space, such an exercise needs to be done in blocks of a manageable size. It is very important that such exercises are done efficiently so as not to tie up tape drives. Hence, splitting the task into blocks should ensure that loading of cartridges is minimised. This requires the MASS system to inform the user of an appropriate split of the data. Users need to be able to archive data from any platform and retrieve it to any platform. For certain types of data this includes data translation which needs to understand file formats unless machine independent representations are used.

132 Section 4 above illustrates the need for filtered retrieval. For many storage systems, such a facility would require large amounts of temporary storage space to allow filtering processes to take place. Users needs to be able to perform both synchronous and asynchronous retrievals. The former type is useful for automated processes that need to restore data and then use it. Asynchronous retrievals will often be more appropriate for very long retrievals when connections could be lost, such retrievals need to have a checkpoint-restart facility. It is important that the system has good error messages direct back to the user or batch process. This can prove difficult in disjointed client-server systems. A very robust archiving process is essential. Often, tasks will need to delete data once it has been sent to archive. A scheduling function is very important at the Met Office as there is such a wide range of job sizes. Any scheduling algorithm for a storage system, where there is high latency, needs to have the following aims: • to ensure restores are completed within defined targets for a given priority • to ensure high utilisation of available resources • to reduce thrashing and other inefficiencies to an acceptable level 5.2

Resource Management Requirements of MASS

To control costs, MASS needs to have an effective deletion management process. To avoid inefficiencies, users need to be confident when performing deletion management. Past experience has taught us that systems based on expiry dates require a lot of effort such as informing users of data about to be deleted and ensuring they reply and are not away. Without a safety net, users tend to specify very long expiry dates. A much better system is based on review dates. Here users know that no data will be deleted without their confirmation. They will be asked to review data periodically at review dates which are far more likely to be set realistically. However, such a system needs other mechanism to ensure that users are not holding on to excessive amounts of data. In MASS, we are considering quotas and metrics, such as how frequently users are accessing sets of data. Another important aspect of deletion management is providing the right tools to allow users to work with large numbers of files safely. Users are reluctant to delete data if they feel they cannot do so safely. As MASS will be used by many groups with different funding mechanisms within the Met Office, it is important that the system is able to provide proper accounting metrics. Both data held and throughput may prove important and the system needs to consider change of ownership for data. Performance monitoring tools are also seen as a benefit. They help ensure appropriate system tuning and that upgrades are both timely and well targeted at needs. 5.3

System Efficiency Requirements of MASS

A good active archive system needs to be able to balance various conflicting efficiency requirements. A key requirement for an active archive is high retrieval efficiency for high priority tasks. This is best met, for us, by data collocation. That is, data from one run of the UM is located physically close on tape media. This works because many requests are for sets of data from one experiment and thus the number of tape mounts is reduced. Collocation has the following interactions with other aspects of the system: • Archive efficiency is reduced as data from many different runs of the UM are active at any one time. As we write once and read many, this is a price that must be paid. • Media utilisation will be poor if the system can only achieve collocation by keeping data on one set of media for all time. Housekeeping exercises to consolidate data need to be both human and system efficient to offset this problem. • Deletion management is typically more efficient if data is collocated. We will tend to delete whole sets of data at once and, if collocated, this implies whole media (or large chunks) can be freed. Again, good housekeeping functionality will help make best use of the opportunity. Another key interaction that has already been discussed is the trade off between improving retrieval performance for high priority tasks and the effect that has on general system efficiency and, as a consequence, total throughput.

133 6

Closing Observations

As MASS is not yet installed, it is too early to make any firm conclusions. However, the following observations should be useful to the reader: • The Met Office needs an active archive to conduct its business. We call it MASS • MASS needs to be reasonably sophisticated and affordable. The sophistication is required to help keep costs under control by using resources efficiently. • Mass storage systems are complicated. We have tried to build our own in the past and this can work, but solutions will be more scalable if they are based on a flexible and adaptable foundation. Adaptability is important as, for many organisations, requirements will change in time, often in ways that are difficult to predict. • Of the systems looked at in the procurement exercise, there was no ideal product. A good deal of integration is required. • Mass storage systems are complicated. If your data and access patterns have structure, there are opportunities to improve performance and save costs by organising around the structure. However, this adds complexity. • It is our experience that understanding the requirement and communicating it to the suppliers is not a simple task. Being well organised about finding a common understanding of requirements and the challenges they bring can help this process.

134

INTELLIGENT SUPPORT FOR HIGH I/O REQUIREMENTS OF LEADING EDGE SCIENTIFIC CODES ON HIGH-END COMPUTING SYSTEMS - THE ESTEDI PROJECT -

KERSTIN KLEESE CLRC - Daresbury Laboratory, UK E-mail: [email protected] PETER BAUMANN Active Knowledge GmbH, Germany E-mail: [email protected]

The most valuable assets in every scientific community are the expert work force and the research results/data produced. The last decade has seen new experimental and computational techniques developing at an ever faster pace, encouraging the production of ever larger quantities of data in ever shorter time spans. Concurrently the traditional scientific working environment has changed beyond recognition. Today scientists can use a wide spectrum of experimental, computational and analytical facilities, often widely distributed over the UK and Europe. In this environment new challenges are posed for the Management of Data, foremost in the areas of data archival, organisation of data in a heterogeneous, distributed environment, data accessibility and data exploration. In mid 1998 CLRC - Daresbury Laboratory started a Data Management Project (DAMP) which is looking specifically at the requirements of computational scientists in a High Performance Computing Environment. As part of this work surveys have been carried out, workshops were organised (Data Management 2000) and a number of follow on projects and collaborations were initiated. The European ESTEDI collaboration is such a project, it focuses on a particular aspect of data management, the support of high I/O requirements of codes on high-end computing facilities. This paper will give a short introduction into the aims and realm of the DAMP project and reports on the work of the ESTEDI co-operation.

1

Data Management at CLRC

CCLRC - the Council for the Central Laboratory of the Research Councils1 is responsible for one of Europe's largest multidisciplinary research support organisations, the Central Laboratory of the Research Councils (CLRC). Its facilities and expertise support 1

See list of URLs at the end of the paper for references to institutions.

135

the work of more than 12000 scientists and engineers from around the world, both in universities and industry. Operating from three sites - Daresbury Laboratory in Cheshire, Rutherford Appleton Laboratory in Oxfordshire, and Chilbolton Observatory in Hampshire - CLRC supports research projects in a wide range of disciplines, and actively participates in collaborative research, development and technology transfer projects. The Laboratory has some 1700 staff and a turnover in excess of 100m pounds. To support leading edge research in physics, chemistry, biology, space science, engineering, materials science and environmental science CLRC operates distributed over its three sites a range of large research facilities such as: • National Computing Service • Data Centres (1 world, 3 national and several smaller collections) • ISIS - neutron spallation source • Particle Physics Research and Support • Space Science Unit (Satellite development, missions and exploration) • Synchrotron Radiation Source Keeping records, archiving results, granting access to and disseminating new findings, as well as providing backup capacities is one of CLRC's major tasks behind the scenes, because without adequate data archive and data exploration, historical and new data may well be lost for future generations. Consequently CLRC has been engaged in a wide variety of data storage/management projects over the years and provides a wide range of services to the community. In 1998 the Advanced Computing Group in the Computational Science and Engineering Department at CLRC - Daresbury Laboratory started the Data Management for Computational Science Project (DAMP). The project initially focused on data management for climate research, but as demands from other computational science disciplines became apparent the realm of the project was widened. The aim of the project is to analyse user data management requirements, in-situ solutions and state-of -the-art research and developments, disseminate our findings, and give advice wherever possible. We have published a 60 page survey report in November 1999 [4], which is available from the author, an updated version of the report is expected to be published towards the end of the year. At the beginning of the year we also organised a workshop and exhibition on Advanced Data Storage / Management Techniques for High Performance Computing, which with nearly 100 participants was well received [5]. A range of projects and collaborations have been initiated by the DAMP project, or are being partnered by DAMP:

136

1.1 DIRECT The Development of an Interdisciplinary Round Table for Emerging Computer Technologies (DIRECT) has three major working groups: Data Storage and Management (DSM), Data Inter-Operability (DIO) and Visualisation and Emerging Computing Technologies (VECT). 1.2 ESDANET The European Climate Data Network for Climate Model Output and Observations including the High Performance Mass Storage Environment (CLIDANET) is a collaboration of several high profile data and computing centres: DKRZ (D), UEA (UK), CCLRC (UK), FZI (D), PIK (D), CINECA (IT), CERFACS (F), CESCA (F), DMI (DK). The project plans to link existing mass storage data archives at European climate modelling centres and related observed data sources by a climate data network. 1.3

WOS

The Web Operating System collaboration aims to provide users with the possibility to request networked services without prior knowledge about the service and to have the request fulfilled according to the user's desired constraints/requirements. The management of data and information in this environment is of high importance. 1.4

HPC1 - Core Support

CLRC - Daresbury Laboratory CSE Department, the University of Edinburgh EPCC and the University of Manchester MRCCS are forming the UK High-End Computing (HEC) Collaboration to offer core support for HEC to the UK academic community. The UK HEC Collaboration's primary aim is to investigate new areas of computing and to inform and provide advice to the user community in developments of hardware and software, in emerging tools, in best practice code development and data management. In the realm of this work surveys and small test installations will be carried out and evaluated. 1.5 ESTEDI The European Spatio-Temporal Data Infrastructure for High Performance Computing (ESTEDI) is a collaboration between numerous well known research institutes in Europe, namely: Active Knowledge (D), FORWISS (D), University of Surrey (UK),

137

CLRC (UK), CINECA (IT), CSCS (CH), DKRZ (D), DLR (D), IHPC&DB (RU) and NUMECA International (B). 2

I/O Requirements of Scientific Codes on HPC Systems and available Support

One of the bottlenecks identified by the DAMP surveys are the high I/O requirements of some scientific codes, specifically in disciplines like climate research and quantum chromodynamics. Other disciplines like astronomy, engineering, biology and materials science are following closely. Find below a number of examples of typical output sizes of simulations in various scientific disciplines. In table 1 and 2 you will find examples of the average output size of climate modelling codes which originated from a paper by Alan O'Neill and Lois Steenman-Clark [9]: Table 1. Typical atmospheric experiment data sizes for 10 model year run with data output four times per model day.

Spatial Resolution climate seasonal climatology forecast

Data Sizes (GB) 36.5 74.8 98.1 324.1

Table 2. Typical ocean experiment data sizes for a 10 model year run with data output once per model day

Spatial Resolution

4°x4° (global) 1° x 1° (Atlantic) 1° x 1° (global)

Data Sizes (GB) 5.8 21.5 133.2

You would expect to run most climate codes for time spans of 100's of years, preferably closely coupling atmospheric and oceanographic simulations. In Astrophysics for example a 10243 particle N-body simulation generates 25 Gbytes each output time, the codes are used to study the evolution of the modelled systems, resulting in hundreds of Gbytes of output data. In biology currently most scientists when modelling proteins will aim for multiple simulations of the same system over periods of 5 - 10 ns, each of these runs will produce 7.5 GB of data usually you will try to run at least 10 related simulations, adding up to 75 GB of data for a small set. For the future it is envisaged that scientists would like to investigate the same protein under varying conditions, such studies are important to understand the for example the effect of mutants on protein stability and unfolding. In Quantum Chromodynamics the algorithms used are all based on estimating an integral using Monte Carlo simulation, they produce many samples of the integral (currently ~100 samples, an increase to over a 1000 is forecasted for the

138

next few years). The amount of data produced is dependent on the available computing and I/O capacities. Currently the UKQCD consortium holds 4.5 Terabytes in the archival at EPCC, but with the installation of their new system they expect to produce ~100 Terabytes over a three year period. Today's codes often also require access to large volumes of input data, either for the simulation itself or for the analysis of the results. In climate research single input files can be as big as 2.5 TB. In other disciplines it is more the quantity of files that need to be accessed at the same time, e.g. in materials chemistry where it would be desirable to run analysis across 10s of 10GB files. Similar requirements can be found in astrophysics, biology and quantum chemistry. 2,1

A vailable Support

The capabilities of components like processors, compilers and scientific libraries have improved dramatically over the past years. They are thereby promoting the faster production of more output in shorter time spans (TFlop systems produce TByte output), but disk I/O and data archival and data retrieval mechanisms have not kept pace with this development, leaving a growing gap between the amount of input and output data the codes require and the capabilities of the components that are supposed to feed them. "Currently the speed at which data can be moved in and out of secondary and tertiary storage systems is an order of magnitude less than the rate at which data can be processed by the CPU. High Performance computers can operate at speeds exceeding a trillion operations per second, but I/O operations run closer to 10 million bytes per second on state of the art disk." The National Computational Science Alliance, 1998 In some applications program I/O has become a real limiting factor to code performance. "The data volumes generated by climate models can be very large and are a problem to deal with especially when the models generating this data have been optimised to run quickly and efficiently on HPCplatforms." Alan O'Neill and Lois Steenman-Clark, UGAMP, 1999 "Even with multi-terabyte local disk sub-systems and multi-PetaByte archives, I/O can become a bottleneck in high performance computing." Jeanette Jenness, LLNL, ASCI-Project, 1998

139 Different types of bottleneck can occur for different models. Current some models need large input data sets to be present at program start and during the program run. It is usually time consuming to retrieve these data sets from on-site mass storage facilities or from other data centres and it is not always possible to predict the exact time such retrieval will require (e.g. due to network traffic or heavy use of the facility). In addition computing centres may charge for the amount of disk space that is used by the scientists, so intermediate storing of large data sets on local disc can be an expensive option. If local copies of the data sets are not available, they have to be retrieved at the start of the program run, expensive processors may be blocked for long periods, waiting for input. Some models require regular, time dependent input which is difficult to sustain on current architectures, similarly leading to a waste of resources. The execution of coupled models, communicating via files, can be stalled, if the disk I/O capabilities can not keep pace with the model speed. If disk space is limited, output files need to be transferred to alternative storage during the program run, but if these processes do not match the speed of the model or are not available, the execution time can rise tremendously or the program run can even fail. Traditionally support for program I/O was provided on three different levels: hardware, software and application. 2.2

Hardware I/O Support

The main limiting factor in satisfying the high I/O requirements of modelling programs on HPC platforms is the underlying hardware. "SRAM/DRAM access times have only dropped by a factor of six within the last 15 years " Greg Astfalk, HP, 1998 In contrast to this the speed of high-end processors has increased by a factor of ~80 in the same time span (based on the comparison of a Cray 1 and a NEC SX-5 processor). There is a growing gap between the processor capabilities and the actual ability of the disk sub-systems, fifteen years ago ~10 instructions where necessary to "hide' one memory access, today ~360 are required. For codes with high I/O requirements such as climate modelling codes, this gap represents a serious performance bottleneck. Today's systems are very unbalanced in their ability to produce I/O and to subsequently handle it. Leading hardware vendors like SGI, IBM and Fujitsu acknowledge the problem, but are not planning to focus significant research effort into this area. The main argument against it is that any major improvements would be too expensive or

140

even physically impossible at the moment. Instead some of them e.g. Fujitsu are developing software solutions for this problem (see section Software I/O Support). There are however two companies which have taken different approaches to deliver better balanced systems: Tera Computing and SRC. Tera Computing is following the philosophy 'if you can't beat it - hide it!', using cheap memory chips and hiding the long memory latency times with a thread based programming approach. SRC says that 'memory is too stupid' so they use cheap off the shelf processors in their systems and concentrate their main efforts on the development of fast memory and network solutions. Future benchmarking efforts have to show how well these approaches work. CLRC Daresbury Laboratory has set up a collaboration with Tera enabling it to test their 8 processor MTA system at the San Diego Super Computing Centre. Currently an ocean model and an engineering code are being tested. Tera has recently acquired Cray from SGI and it will be interesting to observe future developments in these product lines. 2.3

Software I/O Support

Most HPC systems today are quite unbalanced in their provision of processor speed and I/O supporting devices. "Performance bottlenecks: Memory bandwidth, long memory latency, poor cache behaviour and instruction level parallelism. " David Snelling, Fujitsu European Centre for Information Technology (FECIT), 1998 As it can not be expected that a significantly better balanced system will become widely available in the next few years (see section Hardware I/O Support), it is necessary to look for software solutions to bridge the existing gap between processor speed and for example memory access times. In this section techniques and products will be discussed which help to improve the systems I/O performance, but do not require any changes to the applications themselves. There are a number of helpful techniques available which can support faster program I/O on current HPC systems. The major areas these techniques are addressing are 'in-time' delivery of data at program start and faster access to the data via various mechanisms. Most advanced solutions are either vendor specific or site specific (incorporating proprietary software), but some products are portable and many of the ideas are transferable to other sites and scientific areas.

141 2.3.1

Pre-ordering

Some computing centres offer the possibility to pre-order data sets that are stored on-site (e.g. DKRZ), others actually demand pre-ordering to ensure fast execution of programs (e.g. Oregon State University). Both systems offer a fairly simple way to make sure that required input data are present on temporary secondary storage systems at program start and the program does not waste valuable execution time by waiting for a retrieval request. Techniques like these become more important with the introduction of usage based charging schemes (e.g. CSAR in the UK) where scientists have to pay for their storage space. As the fast disk space is a lot more expensive then for example tape storage, research groups which require large input or output data sets are more likely to use the cheap slow storage than the expensive fast one. Unfortunately by doing so they slow down their applications, and pay in a different way. 2.3.2

Selected Access

Until a few years ago Cray offered a Fields Data Base (FDB) which allowed storing and accessing data files based on their content, e.g. extracting just the salinity field. Both access on a field basis and storage on local disks granted fast access to the required data. The ECMWF had integrated FDB into its hierarchical storage management until the mid 90's. Data was either stored in the main tape archive (slow access) or in the FDB (most recent data, fast access. Data access was managed by the software package MARS [6]. MARS is the ECMWF's Meteorological Archival and Retrieval System an Application Programming Interface which provides high level access to data from the ECMWF and other archives. Data can be accessed as a single field or a collection of fields, for a particular region, date and time window. MARS itself also incorporates a small database holding the most frequently used fields as well as the most recently requested ones. MARS is able to locate the fastest accessible version of a particular data field within the local storage system (MARS, FDB or tape store). In recent years the ECMWF has further developed this range of software products to respond to technology changes and enhancements. The ECMWF today houses a variety of computing systems which made it necessary to adapt the products to a distributed computing environment. The FDB can now be accessed from different platforms via the network and no longer requires to run on the same system. MARS can locate and deliver data sets in the local distributed environment, but a client server version granting access from outside is also currently under development.

142

2.3.3

Fast File Systems

Fujitsu promises to deliver the best-balanced I/O performance on the market with its new VPP5000 range systems, which features a flexible and high performance file system (FPFS). The FPFS allows flexible I/O processing according to the I/O request length. Small requests will be transferred via the cache, large ones are transferred directly avoiding the system buffer. In addition parallel access is supported (e.g. multiple real volumes). The FPFS is coupled with the distributed parallel file system (DPFS) which divides I/O requests for externally stored data into multiple requests to PE's connected to the external storage device (IO-PEs). Taking advantage of the large memory capacity of the VPP5000 (up to 16 GB per processor) the memory resident file system (MRFS) improves access time to local data. Currently Fujitsu seems to be the only provider of MRFS. IBM has developed the General Parallel File System (GPFS) as one of the key system software components of the ASCI Pacific Blue Supercomputer, it allows highspeed parallel file access. Originally invented for multimedia applications it has been extended to support the requirement's of parallel applications. On ASCI Pacific Blue, GPFS enables parallel programs running on up to 488 nodes to read and write individual files at data rates of up to 1.2 GB/s to file systems of up to 15 TB. The product is available for RS/6000 SP systems of any size. 2.3.4

Program Output

There are also some vendor/system dependent solutions enabling faster I/O. Cray/SGI offers different possibilities to do sequential, unformatted I/O: • COS blocked (Fortran unformatted) • Asynchronous buffered (FIFO / assign) • Preallocation of files (setf) The introduction of preallocated files delivers the best performance. It is possible to further enhance performance by combining asynchronous buffered I/O and preallocation of files. 2.4

Application I/O Support

In contrast to the previous section ( Software I/O support) which discussed software I/O support methodologies which do not require changes to the application code, this section will discuss what changes could be made and how they might affect I/O performance. In principle there are two ways to improve I/O performance, one is the usage of efficient libraries, the other is the optimisation of the I/O pattern for the particular target

143

architecture. The latter is obviously application specific and will not be discussed in this report, but much work has been done on this by research groups, national HPC centres (e.g. UK's HPCI centres at Edinburgh and Daresbury) and vendors of high-end computing systems. 2.4.1

Vendor specific solutions

IBM have implemented asynchronous I/O as part of its AIX operating system, via special non-blocking read/write instructions which allow users e.g. to prefetch data. Thus it can improve I/O performance significantly. The Advanced Computing Technology Centre (ACTC) part of the IBM T.J.Watson Research Centre has developed a buffered Modular I/O system (MIO), it contains e.g. pattern analysis and prefetch tools, and can be called via special read/write statements. SGI/Cray uses a number of techniques to improve sequential I/O performance (see above). The fastest being the combination of preallocation of files and asynchronous buffered I/O. This can be further extended by preallocating files on different disks "user disk striping" to perform a user-driven parallel I/O. In addition programs with very high I/O requirements should use Cache aligned data buffers. The transfer length should be an exact multiple of the block size and the transfer address should also be an exact multiple of the block size. For parallel I/O, processors should send their I/O to well formed addresses that no other processor is sending to. All the above -mentioned techniques have to be implemented in the user application. 2.4.2

Portable solutions

MPI was the first portable programming library which targeted the issue of parallel I/O; the new MPI-2 version includes useful features to make the implementation of parallel I/O patterns easier for the user via MPI-IO. Unfortunately only very few vendors have developed implementations for this aspect of MPI-2. To our knowledge only Fujitsu offers a full MPI-2 version, although most vendors recognise MPI-IO and single sided communication to be a high priority. Tests at the ECMWF have shown that it is very useful indeed. 2.5

Summary

High I/O requirements are increasingly becoming a major limiting factor for the performance of a range of applications on High End computing systems, e.g. climate research, QCD and advanced visualisation. Although there are a range of supporting measures available the majority of them are custom made, discipline or vendor specific,

144

limiting thereby the accessibility of such measures for the wider community. The ESTEDI project aims to provide a European data infrastructure for high performance computing on a range of platforms and with interfaces to all major disciplines. 3

The ESTEDI Project

Satellites and other sensors, supercomputer simulations, and experiments in science and engineering all generate arrays of some dimensionality, spatial extent, and cell semantics. While such data differ in aspects such as data density (from dense 2- D images to highly sparse high-dimensional accelerator data) and data distribution, they usually share the property of extreme data volumes, both per data file and in quantity of data files. User access nowadays is accomplished in terms of files containing (part of) the information required, encoded in a sometimes more, sometimes less standardised data exchange format chosen from a rich variety of options. This implies several shortcomings. • access is done on an inappropriate semantic level. Applications accessing HPC data have to deal with directories, file names, and data formats instead of accessing multidimensional data in terms of, say, simulation space and time and other useroriented terms. • data access is inefficient. Data are stored according to their generation process, for example, in time slices. All access pertaining to different criteria, for example spatial co-ordinates, requires data-intensive extraction processes and, hence, suffers from severe performance penalties. • data retrieval is inefficient. Studies show that only a small fraction of the retrieved data is actually required by the simulation or visualisation (e.g. -10% of data in climate research analysis), the rest is only retrieved due to the inability to perform content based selective retrieval. • search across a multitude of data sets is hard to support. Evaluation of search criteria usually requires networks transfer of each candidate data set to the client, implying a prohibitively immense amount of data to be shipped. Hence, many interesting and important evaluations currently are impossible. • the much more efficient parallel program I/O is rarely used due to the problems arising in reassembling the various outputs at a later stage. • all the aforementioned efficiency problems are intensified as the user community grows, as obviously networks load grows linearly with the number of users. In summary, a major bottleneck today is fast, user-centric access to and evaluation of these so-called Multidimensional Discrete Data (MDD).

145

The recently set up European Spatio-Temporal Data Infrastructure for High Performance Computing (ESTEDI) is a collaboration between numerous well known research institutes in Europe, namely: Active Knowledge (D), FORWISS (D), University of Surrey (UK), CLRC (UK), CINECA (IT), DKRZ (D), DLR (D), IHPC&DB (RU) and NUMECA International (B). The project will establish a European standard for the storage and retrieval of multidimensional HPC data. It addresses a main technical obstacle, the delivery bottleneck of large volume HPC data to the users/applications, by augmenting high-volume data generators with flexible data management and extraction tools for spatio-temporal data. The multidimensional database system RasDaMan developed in EU Framework IV, will be enhanced with intelligent mass storage handling and optimised towards HPC. The project participants will operate the common platform and evaluate it in different HPC fields (Engineering, Biology, Astrophysics and Climate Research). The outcome will be a field-tested open prototype platform with flexible, standards-based - contents driven retrieval of multiterabyte data in heterogeneous networks. CLRC - Daresbury Laboratory focuses on the support of climate research applications within the realm of the project. We are working in collaboration with the UGAMP consortium and Manchester Computing to ensure that user requirements are met and availability to a wider community. 3.1

Project Scope

ESTEDI addresses the delivery bottleneck of large HPC results to the users by augmenting the high-volume data generators with a flexible data management and extraction tool for spatio-temporal data. The observation underlying this approach is that, whereas transfer of complete data sets to the client(s) is prohibitively time consuming, users actually do not always need the whole data set; in many cases they require either some subset (e.g., cut-outs in space and time), or some kind of summary data (such as thumbnails or statistical evaluations), or a combination thereof. Consequently, it is expected that an intelligent spatio-temporal database server can drastically reduce networks traffic and client processing load, leading to increased data availability. For the end user this ultimoately means improved quality of service in terms of performance and functionality. The project is organised as follows. Under guidance of ERCOFTAC (European Research Community on Flow, Turbulence and Combustion), represented by University of Surrey, a critical mass of large European HPC centres plus a CFD package vendor [numeca] will perform a thorough requirements elicitation. In close co-operation with these partners and based on the requirements, the database experts of FORWISS (financial/administrative project co-ordination) and Active Knowledge GmbH

146

(technical/scientific project management) will specify the common data management platform. Subsequent prototype implementation of this platform will rely on the multidimensional database system RasDaMan which will be optimised towards HPC by enhancing it with intelligent mass storage handling, parallel retrieval, and further relevant technologies. The architecture will be implemented and operated in various environments and for several important HPC applications, forming a common pilot plat-form thoroughly evaluated by end-users: • climate modelling by CLRC and DKRZ • cosmological simulation by CINECA • flow modelling of chemical reactors by CSCS • satellite image retrieval and information extraction by DLR • simulation of the dynamics of gene expression by IHPC&DB • computational fluid dynamics (CFD) post-processing by Numeca International All development will be in response to the user requirements crystallised by the User Interest Group (UIG) promoted by ERCOFTAC. Active promotion of the results, including regular meetings, will be instrumental to raise awareness and take-up among industry and academia, both in Europe and beyond. The project outcome will be twofold: (i) a fully published comprehensive specification for flexible DBMS-based retrieval on multi-Terabyte data tailored to the HPC field and (ii) an open prototype platform implementing this specification, evaluated under real-life conditions in key applications. 3.2 RasDaMan The goal of the RasDaMan DBMS is to provide database services on general MDD structures in a domain-independent way. To this end, RasDaMan offers an algebra-based query language which extends SQL with declarative MDD operators. Server-based query evaluation relies on algebraic optimization and a specialised array storage manager. Usually, research on array data DBMS's focuses on particular system components, such as multidimensional data storage [10] or data models [7,8]. RasDaMan, conversely, is a generic array DBMS, where generic means that functionality and architecture are not tied to some particular application area. The conceptual model of RasDaMan centres around the notion of an n-D array (in the programming language sense) which can be of any dimension, size, and array cell type (for the C++ binding, this means that valid C++ types and structures are admissible). Based on a specifically designed array algebra [1], the RasDaMan query language, RasQL, offers MDD primitives embedded in the standard SQL query

147

paradigm [2]. The expressiveness of RasQL enables a wide range of statistical, imaging, and OLAP operations. To give a flavour of the query language, we present a small example. From a set Climate Models of 4-D climate models (dimensions x, y, z, t), all those models are retrieved where average temperature in 1,000m over ground exceeds 5° C. From the results, only the layers from ground up to 1,000m are delivered. The corresponding RasQL query reads as follows: select cm[ *:*, *:*, *:1000, *:* ]

from where

ClimeSimulations as cm avg_cells( cm[ *:*, *:*, 1000, *:* ] ) > 5.0

Internally, RasDaMan employs a storage structure which is based on the subdivision of an MDD object into arbitrary tiles, i.e., possibly non-aligned sub-arrays, combined with a spatial index [3]. Optionally, tiles are compressed. In the course of ESTEDI, RasDaMan will be enhanced with intelligent mass storage handling and optimised towards HPC; among the research topics are complex imaging and statistical queries and their optimisation. 3.3

CLRC's contribution to ESTEDI

CLRC - Daresbury Laboratory is focussing on the support of climate research applications during this project, but aims to apply the technology to other scientific disciplines were appropriate. We are collaborating with the UK Global Atmospheric Modelling Project (UGAMP) group at the University of Reading, namely Lois Steenman-Clark and Paul Berrisford. The system itself will be installed and tested at the national super computing service at Manchester (CSAR). In a first step we have installed a climate modelling meta data system on the Oracle database at Manchester. After some deliberation we decided to use the CERA meta data system which has been developed by the German Climate Computing Centre (DKRZ) and the Potsdam Institute for Climate Research (PIK). The CERA meta data system will allow us to catalogue, search and retrieve private and public data files stored at Manchester based on their content. The system holds information not only information about the stored data, but also about the person who submitted the data, related projects and storage location (disk, tape etc.). The usage of the CERA model will increase the organisation and accessibility of the data (currently 4 TB) significantly. In parallel we are installing the RasDaMan database. This will use the CERA meta data system to locate requested files, but holds itself, additional information on the organisation of the contents of the data files (e.g. data format, properties, field length). This will allow RasDaMan to extract only the requested information from the file, therefore cutting down dramatically on the retrieval and transmission times.

148

For the UGAMP consortium we are planning to implement a range of services. We will develop an interface which will connect their analysis program to RasDaMan, allowing them to retrieve their required data from a running application. We are also planning to implement a parallel I/O interface which will allow to write certain fields in parallel into the database which will in turn reassemble the separate outputs into one data file. If time permits we will also try to implement on-the-fly data format conversion; converters from GriB to NetCDF and HDF are currently planned. We are in the implementation phase and expect a first prototype by mid 2001. 4

Summary

Our scientific research data is one of our most valuable assets, unfortunately much of it is inaccessible due to outdated hardware and lack of organisation. Databases could play a major role in changing the current situation helping to organise the data, making it accessible and prepare it for the application of state of the art exploration tools. However, we need to apply database technology that is well suited for the multidimensional nature of our scientific data. Standardisation and the usage of generic technology will help to make these tools easier to install, maintain and use, allowing the fast uptake by wide areas of the scientific community. It is important that all these developments are carried out in close collaboration with the scientific community to ensure that their requirements are met. The ESTEDI project will provide a major building block in this development by delivering a field-tested open platform with flexible, contents-driven retrieval of multiterabyte data in heterogeneous networks. References 1. P. Baumann: A Database Array Algebra for Spatio-Temporal Data and Beyond. Proc. Next Generation Information Technology and Systems NGITS '99, Zikhron Yaakov, Israel, 1999, pp. 76-93. 2. P. Baumann, P. Furtado, R. Ritsch, and N. Widmann: Geo/Environmental and Medical Data Management in the RasDaMan System. Proc. VLDB'97, Athens, Greece, 1997, pp. 548-552. 3. P. Furtado and P. Baumann: Storage of Multidimensional Arrays Based on Arbitrary Tiling. Proc. ICDE '99, Sydney, Australia 1999, pp. 480-489.

149 K. Kleese: Requirements for a Data Management Infrastructure to support UK High-End Computing, Technical Report, CLRC - Daresbury Laboratory (DL-TR99-004), UK, November 1999 K. Kleese: A National Data Management Centre, Data Management 2000, Proc. 1. Int'l Workshop on Advanced Data Storage/Management Techniques for High Performance Computing, Eds. Kerstin Kleese and Robert Allan, CLRC - Daresbury Laboratory, UK, May 2000 6. G. Konstandinidis and J. Hennessy: MARS - Meteorological Archival and Retrieval System User Guide, ECMWF Computer Bulletin B6.7/2, Reading, 1995 L. Libkin, R. Machlin, and L. Wong: A query language for multidimensional arrays: Design, implementation, and optimization techniques. Proc. ACM SIGMOD'96, Montreal, Canada, 1996, pp. 228 - 239. A.P. Marathe and K. Salem: Query Processing Techniques for Arrays. Proc. ACM SIGMOD '99, Philadelphia, USA, 1999, pp. 323-334. 9. A. O'Neill and Lois Steenman-Clark: Modelling Climate Variability on HPC Platforms, High Performance Computing, R.J.Allan, M.F.Guest, A.D.Simpson, D.S.Henty, D.A.Nicole (Eds.), Plenum Publishing Company Ltd., London, 1998 10. S. Sarawagi, M. Stonebraker: Efficient Organization of Large Multidimensional Arrays. Proc. ICDE'94, Houston, USA, 1994, pp. 328-336.

List of Relevant Links Active Knowledge CINECA CLRC

cscs

Data Management 2000 workshop DAMP DIRECT DKRZ DLR DMC ERCOFTAC ESTEDI IHPC&DB NUMECA RasDaMan UKHEC WOS

www.active-knowledge.de www.cineca.it www.clrc.ac.uk www.cscs.ch www.dl.ac.uk/TCSC/datamanagement/conf2.html www.cse.clrc.ac.uk/Activity/DAMP www.epcc.ed.ac.uk/DIRECT/ www.dkrz.de www.dfd.dlr.de www.cse.ac.uk/Activity/DMC imhefwww.epfl.ch/Imf7ERCOFTAC www.estedi.org www.csa.ru www.numeca.be www.rasdaman.com www.ukhec.ac.uk www.woscommunity.org

150

COUPLED MARINE ECOSYSTEM MODELLING ON HIGHPERFORMANCE COMPUTERS

M. ASHWORTH CLRC Daresbury Laboratory, E-mail:

Warrington WA4 4AD, UK

[email protected]

R. PROCTOR, J. T. HOLT Proudman Oceanographic

Laboratory, E-mail:

Bidston Observatory, Birkenhead

CH43 7RA, UK

[email protected]

J. I. ALLEN, J. C. BLACKFORD Plymouth Marine Laboratory,

Prospect Place, West Hoe, Plymouth PL1 3DH, UK E-mail: jia @pml. ac. uk

Simulation of the marine environment has become an important tool across a wide range of human activities, with applications in coastal engineering, offshore industries, fisheries management, marine pollution monitoring, weather forecasting and climate research to name but a few. Hydrodynamic models have been under development for many years and have reached a high level of sophistication. However, sustainable management of the ecological resources of coastal environments demands an ability to understand and predict the behaviour of the marine ecosystem. Thus, it is highly desirable to extend the capabilities of existing models to include chemical, bio-geo-chemical and biological processes within the marine ecosystem. This paper describes the development of a high-resolution three-dimensional coupled transport/ecosystem model of the Northwest European continental shelf. We demonstrate the ability of the model to predict primary production and show how it may be used for the study of basic climate change scenarios. The challenges of extending the model domain into the Atlantic deep water are discussed. The use of a hybrid vertical coordinate scheme is shown to give improved results. Optimization of the model for cache-based microprocessor systems is described and we present results for the performance achieved on up to 320 processors of the Cray T3E-1200. High performance levels, sufficient to sustain major scientific projects, have been maintained despite a five-fold increase in grid size, incorporation of the ecosystem sub-model and the adoption of highly sophisticated numerical methods for the hydrodynamics.

151

1

The POLCOMS shelf-wide model

The hydrodynamic model used in this study is the latest in a series of developments at the Proudman Oceanographic Laboratory (POL). The model solves the threedimensional Shallow Water form of the Navier-Stokes equations. The equations are written in spherical polar form with a sigma vertical co-ordinate transformation and solved by an explicit forward time-stepping finite-difference method on an Arakawa B-grid. The equations are split into barotropic (depth-mean) and baroclinic (depthfluctuating) components, enabling different time-steps to be used with typically a ratio of ten between the shorter barotropic and the baroclinic time-steps. The model is a development of that of James [9] and is capable of resolving ecologically important physical features, such as stratification, frontal systems and river plumes, through the use of a sophisticated front-preserving advection scheme, the Piece wise Parabolic Method (PPM). This method has been shown to have excellent structure preserving properties in the context of shelf sea modelling [10]. Vertical mixing is determined by a simple turbulence closure scheme based on Richardson-number stability. A sub-model is included for the simulation of suspended particulate matter (SPM) including transport, erosion, deposition and settling [6]. The computational domain covers the shelf seas surrounding the United Kingdom from 12°W to 13°E and from 48°N to 63°N with a resolution of about 12 km. The computational grid has 150 x 134 points in the horizontal and 20 vertical levels; a total of 402000 gridpoints. The typical depth over most of the domain is around 80 m, but the western extremity includes the shelf edge where depths increase rapidly to around 2000 m. Previous models include a fine resolution model of the southern North Sea [16], which was parallelized for execution on the Cray T3D [12]. This model was larger (226 x 346 x 10 = 781960 gridpoints) covering a smaller region at finer resolution. The achieved performance was 1.0 Gflop/sec on 128 Cray T3D processors, or around 5% of peak performance. 2

The ERSEM ecosystem model

The European Regional Seas Ecosystem Model (ERSEM) [4] was developed by a number of scientists at several institutes across Europe through projects under the MAST programme of the European Community. Many features and applications of the ERSEM model are described in a special issue of the Journal of Sea Research. ERSEM is a generic model that describes both the pelagic and benthic ecosystems and the coupling between them in terms of the significant bio-geo-chemical processes affecting the flow of carbon, nitrogen, phosphorus and silicon. It uses a 'functional group' approach to describe the ecosystem whereby biota are grouped together according to their trophic level and sub-divided according to size and

152

feeding method. The pelagic food web describes phytoplankton succession (in terms of diatoms, flagellates, picoplankton and inedible phytoplankton), the microbial loop (bacteria, heterotrophic flagellates and microzooplankton) and mesozooplankton (omnivores and carnivores) predation. The benthic sub-model contains a food web capable of describing nutrient and carbon cycling, bioturbation/bioirrigation and the vertical transport in sediment of particulate matter to the activity of benthic biota. The model has been successfully run in a wide variety of regimes and on a variety of spatial scales. All studies illustrate the importance of the ecological model being coupled with fine-scale horizontal and physical processes in order for the system to be adequately described. ERSEM was coupled with a simple 2D depth-averaged transport model in a modelling study of the Humber plume [1]. The study successfully simulated much of the behaviour of the plume ecosystem, with primary production controlled by limited solar radiation between March and October and by nutrient availability during the rest of the year. This allowed exploration of the causal linkages between land-derived nutrient inputs, the marine ecosystem and man's influence. 3

The coupled model

The models were coupled by installing the ERSEM code as a component within the main time stepping loop of the POL code. Coupling is unidirectional. ERSEM requires input describing the physical conditions (namely temperature, salinity and transport processes) from the hydrodynamic model, but there is no feedback in the opposite direction. The PPM advection subroutines from the POL code, which are used for temperature and salinity, were also used to perform advection of 36 ecological state variables. Both components take information from meteorological data (solar heat flux, wind speed and direction, cloud cover, etc.). 4

Coupled model validation

The coupled model has been run to simulate conditions in 1995 (driven by ECMWF meteorology, tides and daily mean river inflows of freshwater, nutrients and SPM) and validated against collected observations [2]. One output of the model is an estimate of total primary production (Fig. 1). Area-averaged modelled production compares well with estimates of observed production taken from the literature (not necessarily for 1995) (Table 1).

153

1

10*W

5*W

0.0 0.4 0.8 1.2 1.6 2.0 2.4 Total Annual Production(xl 00 g C m"2) Figure 1. Total Primary Production for 1995 (gC m"2)

154 Table 1. Primary Production, modelled and observed

Region Southern North Sea Northern North Sea English Coast Dogger Bank German Bight

Primary Production (gCm-2) 150 - 250 54 - 127 79 112 261

Reference Anon 1993 [14] [11]

Using this simulation as a baseline, the model has been used to predict estimates of the effect of climate change on plankton blooms on the NW European shelf. In these first calculations the predictions of the Hadley Centre HADCM2 climate model [8] for the mid-2 lsl century have been simplified to very basic climate change effects, i.e. an increase in air temperature of 2% and an increase in wind speed of 2%. The 1995 simulation was repeated with these simple meteorological changes and the magnitude and timing of the spring plankton bloom compared. Fig. 2 shows the effect of the changed climate relative to the baseline 1995 simulation. It can be seen that this simplified climate change can result in spatial difference with increases and decreases in the amplitude of the spring bloom of as much as 50%. The timing of the bloom is also affected, generally advancing the bloom in the south and delaying the bloom in the north, by up to 30 days, but the result is patchy. Although not yet quantified these changes seem to lie within the bounds of spring bloom variability reported over the past 30 years. Cllmalo Bosnmk) Fractional Ampl it u d a o l spring btoom

Climala Boonatto ShM In poak ol Eprino, bloom (day*)

Figure 2. Impact of simple climate change on amplitude and timing of the spring bloom

155

5

Extension of the Shelf Model into Atlantic deep water

The western boundary of the present Shelf model (at 12°W) intersects the shelf edge west of Ireland. This causes difficulties in modelling the flow along the shelf edge (e.g. [18]), effectively breaking up the 'continuous' shelf edge flow. Additionally, the present model area does not include the Rockall Trough west of Scotland, an area of active exploration drilling by the offshore industry (as is the Faeroe-Shetland Channel). To fully resolve the shelf edge flows and to forecast temperature and current profiles in these regions, the model area has been extended to include the whole of the Rockall Trough. The larger area Atlantic Margin model (Fig. 3) extends from 20°W to 13°E and from 40°N to 65°N and has a grid of 198 x 224 x 34 (1507968 gridpoints) at the same resolution. At present, boundary conditions are provided by the l/3 rd degree UK Meteorological Office Forecast Ocean Atmosphere Model (FOAM) in the form of 6 hourly 3-dimensional temperature and salinity (used in a relaxation condition) and depth-averaged currents and elevation (both used in a radiation condition). FOAM is a rigid lid model, without tides, so tidal forcing has to be added separately to the radiation condition. Also, in order to fully resolve the small scale current variability in both the Rockall Trough and FaeroeShetland Channel, models with grid resolution of the order of 2km (to 'resolve' the baroclinic Rossby radius of deformation, i.e. eddy-resolving) are required. Such models are being nested into the larger area Atlantic Margin model (also shown in Fig. 3). Extending the domain into significantly deeper water required a modification to the vertical coordinate system. The disadvantage of the simple sigma coordinate when the model domain includes deeper water is that the thickness of the model surface layer can be greater than the ocean surface mixed layer, giving poor results in deeper water. To overcome this a hybrid co-ordinate can be specified, the 'S' coordinate [17] which combines the terrain following co-ordinate at lower model levels with near-horizontal surface layers, giving improved resolution of the model surface layer in deeper water. At each gridpoint the relationship between S and a is defined as:

for htj>hc

(Tk=Sk

for fly

and

Where hy is the water depth at point (i,j), and hc is a specified constant depth. In water shallower than hc, the S co-ordinate is identical to a sigma co-ordinate. In water deeper than hc, the vertical spacing of levels is adjusted.

156

Figure 3. FOAM coupling and the extended Atlantic Margin model.

157

Sea Surface Temperature: Model versus AVHRR data at 57N 12W 20

20

15

15.

r

10

10

5t

St0 0

64 level s-giid

0 0 20

200

20

64 level sigmagnd 200

400

15 10 5 0 I0

32 level s-grid 200

400

20 =5

15 10

0 20

200

400

15

.-.

10

q

5 0

32 level sigmagnd

0

5r 16 level s-giid 200 Days

16 level sigmagnd

0 400

0

200

400

Figure 4. AVHRR SST (+) compared to model surface level SST in water 2000m depth for different numbers o f ' S ' and sigma levels

The transformation expression C is given by:

158

C(S

) = (l ''

g

) s i n h ^ ) | Btmh(0[Sk +0.5])-tanh(0.5g) sinh<9 2tanh(O.50)

Implementation of the model on the extended domain required the vertical resolution to be increased to n=34 levels (from 20 in the Shelf model). The parameters of the S-coordinate spacing were selected as 6 = 6 and B = 0.8 with hc - 150m. These values were chosen to allow compression in both surface and bottom layers with good resolution throughout the water column along the Atlantic margin where oil exploration is taking place. From Fig. 4 it can be seen that this number of 5 levels provides good agreement with AVHRR sea surface temperatures. Other changes to the model have been made. The turbulence closure scheme has been updated to a more sophisticated form based on the Mellor-Yamada "level 2.5" turbulent energy model [13], with algebraic mixing lengths as described by Galperin [5]. This scheme includes turbulent kinetic energy as a prognostic variable. This scheme gives better results than the Richardson number scheme, particularly in the depth and sharpness of the seasonal thermocline. Horizontal viscosity and diffusion are now calculated based on the Smagorinsky shear-dependent formulation described by Oey [15]. A description of the revised model can be found in [7]. 6

Performance

The model has been structured to allow execution on parallel systems, .using twodimensional horizontal partitioning of the domain and message passing between neighbouring sub-domains. The original code had been designed to run efficiently on vector computers such as the Cray-1, Cray X-MP, Cray Y-MP and the Cyber 205. In order to increase vector lengths for these systems the main data arrays were structured with the two horizontal dimensions collapsed into one array index. Thus for a grid of dimension 1 x m x n, arrays were declared as a(lm,n) where lm=l*n. For today's highly parallel systems, built out of cache-based microprocessors, a different strategy has been shown to be more appropriate [3]. With the arrays restructured to have three indices and reordered so that the smallest vertical dimension is first, to give a(n,l,m), and loop nests reordered with the innermost loop over the vertical dimension, there is the opportunity for much improved cache utilization. There are many processes taking place in the water column, such as convection and vertical diffusion, in which it is advantageous to keep data for the current water column in cache for re-use. For advection processes, retaining the water column for point i+1 in cache allows its re-use in the next iteration as point i and in the following one as point i-1.

159

Benchmark 24-hour runs of both the Shelf model and the Atlantic margin model have been made on the Cray T3E-1200E operated by CSAR at the University of Manchester, UK. Performance results for the two variants of the code first with arrays ordered as a(l,m,n) (ijk-ordering) and secondly with the restructuring to use a(n,l,m) arrays (kij-ordering) are shown in Fig. 5. Performance is measured in simulated model days per day of elapsed time. In addition to giving better perprocessor performance, the restructured code scales much better with the number of processors. This is indicative of another advantage of the restructured code. For parallel processing the arrays have additional halo data in the 1 and m dimensions, but not in n. With ijk-ordering boundary data in both dimensions are non-contiguous in memory. With kij-ordering contiguous n x 1 slabs are available for communications in the i-direction, and the memory structure of data to be sent in the j-direction is also simpler. Performance (model days/day) 1200

•'' ^•i\,

/ '

•

1000

•

A i"""

800 •

/ .-''

600

kr%

_ . ' • '

•

400

-'"''

X

200

1

1

1

1—

64

1

1

1

—

128

192

256

320

384

Number of processors

Figure 5. Performance of the Shelf model with ERSEM showing the original code with ijk ordering (black squares) and the optimized code with kij ordering (grey triangles). The dashed line shows perfect scaling from the 16-processor kij run.

160 Performance (model days/day) 1200

T

1000

1

1

I

1

4

r—T

1

^

^^^^ 0

T

0

^

•

64

—

•

128

^

^

192

^

^

256

^

^

^

320

Number of processors

Figure 6. Performance (without ERSEM) of the Shelf model (grey triangles) compared with that of the larger Atlantic Margin model (black squares). The dashed lines show perfect scaling from the respective 16-processorruns.

Performance of the pure hydrodynamic model, without the ERSEM biological sub-model, is typically increased by a factor of about 8-10. ERSEM requires considerable additional calculations in the water column (local to each processor). The large increase (from 4 to 40) in the number of variables to be advected using the PPM routines also adds considerably to the communications load. Timing of individual components of the model shows that, with ERSEM, 84% of the time is spent in the advection of the ecological state variables. The extension of the region and the increase in the number of vertical levels from the Shelf model to the Atlantic Margin model resulted in a factor of five increase in the number of gridpoints. Fig. 6 shows the performance of these two models (without ERSEM). Once again both models scale well with about a factor of 6.5 in absolute performance between the two. With a factor of five in the number of gridpoints, the additional physical processes described in section 5 account for the remaining 30%. Absolute performance of 800 model days per day on 320 Cray T3E processors is achieved. This compares favourably with the level of 584 model days per day achieved on 128 Cray T3D processors with the southern North Sea model [12], bearing in mind the more complex physics of the current model and the larger grid

161

size. At this level of performance, the project target of a ten-year simulation to investigate inter-annual variability can be achieved in a reasonable time. Using operation counts from the Cray pat performance analysis tool, we found that the maximum operations per second count for a 16-processor run was 82 Mflop/sec. The average over all processors was 56 Mflop/sec/PE, approximately seven times the T3D rate of 8 Mflop/sec/PE reported by Lockey et al, 1998. The code currently uses a simple partitioning algorithm, which takes no account of land points. This gives rise to a load imbalance across the processors and accounts for the average performance being only 68% of that of the fastest processor. Despite advances to a new generation of high-performance software and hardware technology, we note that, as in [12], the code is limited (primarily by memory bandwidth) to around 5% of the peak performance of the machine. 7

Conclusions

We have described the coupling of models from two different scientific disciplines: a physical hydrodynamic model and a marine ecosystem model. The coupled model has been used to perform simulations across the northwest European shelf seas. Predicted annual primary production compares well with estimates of observed production. A basic climate change study shows reasonable variations in the amplitude and timing of the spring bloom. In order better to model the flows along the shelf edge, the model domain has been extended into the Atlantic deep water. A hybrid vertical coordinate scheme is shown to give improved results, results that compare well with satellite data. The combination of a fine resolution grid, sophisticated numerical schemes and a complex sub-grid-scale ecological model present a serious computational challenge that can only be met with the use of state-of-the-art high-performance computer systems. The model has been structured to allow execution on parallel systems using two-dimensional horizontal partitioning of the domain and message passing between neighbouring sub-domains. Reordering of the dimensions of the main data arrays and of the loop nests has been shown to give a large increase in performance on the Cray T3E. High performance levels, sufficient to sustain major scientific projects, have been maintained despite a five-fold increase in grid size, incorporation of the ecosystem sub-model and the adoption of highly sophisticated numerical methods for the hydrodynamics. As the project continues, further work will explore more detailed simulations, further extend the geographical area, examine fine resolution local area models, improve the capabilities of the hydrodynamics and optimize the computational performance. Key performance issues are an improved partitioning algorithm, asynchronous message passing and input/output optimization.

1. Allen J. I., A modelling study of ecosystem dynamics and nutrient recycling in the Humber plume, UK. Journal of Sea Research 38 (1997) pp. 333-359. 2. Allen J. I., Blackford J. C , Holt J. T., Proctor R., Ashworth M. and Siddorn J., A highly spatially resolved ecosystem model for the North West European Continental Shelf. Sarsia (2001) in press. 3. Ashworth M., Optimisation for Vector and RISC Processors. In Towards Teracomputing: Proceedings of the Eighth ECMWF Workshop on the Use of Parallel Processors in Meteorology, ed. by W. Zwieflhofer and N. Kreitz (World Scientific, 1998) pp. 353-359. 4. Baretta J. W., Ebenhoeh W. and Ruardij P., The European Regional Seas Ecosystem Model; a complex marine ecosystem model. Netherlands Journal of Sea Research 33 (1995) pp. 233-246. 5. Galperin B., Kantha L. H., Hassid S. and Rossati A., A quasi-equilibrium turbulent energy model for geophysical flows. Journal of Atmospheric Sciences 45 (1988) pp. 55-62 6. Holt J. T. and James I. D. A simulation of the Southern North Sea in comparison with Measurements from the North Sea Project. Part 2: Suspended Particulate Matter. Continental Shelf Research 19 (1999) pp. 1617-1624. 7. Holt J. T. and James I. D., An s-coordinate density evolving model of the north west European continental shelf. Part 1: Model description and density structure. Journal of Geophysical Research (2001) in press. 8. Hulme M. and Jenkins G. J., Climate change scenarios for the UK: Scientific Report. UKCIP Technical Report No. 1, Climate Research Unit, Norwich (1998) 80pp. 9. James I. D., A front-resolving sigma co-ordinate sea model with a simple hybrid advection scheme. Applied Mathematical Modelling 10 (1986) pp. 87-92. 10. James I. D, Advection schemes for shelf sea models. Journal of Marine Systems 8 (1996) pp. 237-254. 11. Joint I. and Pomroy J., Phytoplankton biomass and production in the southern North Sea. Marine Ecology - Progress Series 99 (1993) pp. 169-182. 12. Lockey P., Proctor R. and James I. D., Parallel Implementation of a 3D Baroclinic Hydrodynamic model of the Southern North Sea. Physics and Chemistry of the Earth 23 (1998) pp. 511-515. 13. Mellor G. L. and Yamada T., A hierarchy of turbulence closure models for planetary boundarty layers. Journal of Atmospheric Sciences 31 (1974) pp. 1791-1806. 14. Moss A., Regional distribution of primary production in the North Sea simulated by a three-dimensional model, Journal of Marine Systems 16 (1998) pp. 151-170.

163

15. Oey, L.-Y., Subtidal energetics in the Faeroe-Shetland Channel: Coarse grid model experiments. Journal of Geophysical Research 103 C6 (1998) pp. 12689-12708. 16. Proctor R. and James I. D., A fine-resolution 3D model of the southern North Sea. Journal of Marine Systems 8 (1996) pp. 285-295. 17. Song Y. and Haidvogel D„ A semi-implicit ocean circulation model using a generalized topography following coordinate system. Journal of Computational Physics 115 (1994) pp. 228-244. 18. Souza, A. J., Simpson J. H., Harikrishnan M. and Malarkey J., A flow and seasonality in the Hebridean slope current. Oceanologica Acta 24(Supplement) (2001) S63-S76.

164

O P E N M P IN T H E PHYSICS PORTION OF T H E M E T OFFICE MODEL R. W. FORD Centre for Novel Computing, The University of Manchester, E-mail [email protected]

U.K.

P. M. BURTON Met Office, Bracknell, U.K. E-mail [email protected] This paper examines the suitability of using OpenMP to provide a high mance parallel implementation of the physics portion of the Met Office Two approaches are explored; a 'bottom up' incremental approach and a routine 'segmenting' approach. The latter is found to be more scalable on

perforModel. physics an SGI

Origin 2000.

1

Introduction

The Met Office's suite of modeling codes, known as the Unified Model (UM) 1 , has been in use for both operational Numerical Weather Prediction (NWP) and climate prediction since 1991. The UM is a large Fortran 77 and Fortran 90 code which includes atmospheric and oceanic prediction models (which may be coupled together), and data assimilation for both these models. The models can be run in many different configurations and resolutions. In contrast to many other forecasting centres, which use semi-implicit spectral models, the UM uses an explicit grid point formulation in both its global and regional configurations. The scientific formulation of the model can be split into two sections; the dynamics which keep the various fields in dynamical equilibrium, and the physics which parameterise the physical aspects of the atmosphere (for example solar heating and rainfall). The UM is currently run on an 880 processor Cray T3E. The next major release of the UM will incorporate a semi-lagrangian dynamics scheme. The work described here has used a 'stand-alone' version of this new scheme, termed New Dynamics (ND). The ND already includes a relatively efficient message passing option for use on traditional MPP's. However, as a number of current and next generation high performance computers will be implemented as clusters of symmetric multi-processors (SMP's) (with some allowing the possibility of a shared memory programming model across the whole machine), it would be

165

useful also to have an efficient shared memory parallel option. This paper examines the feasibility of this aim.

2

A i m and constraints

The primary aim of this work is to produce an efficient OpenMP 2 implementation of the physics portion of the ND code. The most scalable implementation (if it were possible) would clearly be one with a single, load balanced, parallel region for the whole of the physics code, with no synchronisation between threads. However, there are four (potential) constraints which could conflict with this aim. The first is for the OpenMP implementation to be flexible i.e: allow the number of processors to change as the program runs. This could be useful to help system throughput and/or to target additional parallelism where the code is more scalable. The second is for the implementation to be clean, thereby allowing it to be easily understood and maintained. The third is for the implementation to be portable; this is addressed by using OpenMP. The fourth is for the modifications to be relatively minor so that the existing (familiar) structure of the code is not drastically changed and that the modifications are achievable within the time constraints.

3

Physics code structure

The majority of the ND physics is derived from the UM physics, thus the main physics structure (splitting physics code into functional blocks, such as, long and short wave radiation, convection and precipitation) is still prevalent. The previous section noted that a fully asynchronous physics implementation would be efficient. However, there are three reasons why it is not possible to implement this in the ND. Firstly, in contrast to the current UM, the ND further splits the physics code into two separate routines (atmos^physicsl and atmos-physics2) which are separated by dynamics routines. Secondly, the ND utilises a staggered grid and interpolation is performed in the physics when required e.g: in the boundary layer. To achieve this requires communication and synchronisation between processors. Thirdly, in the stand alone version of the ND there may be diagnostic i/o between physics routine calls, which potentially requires data from all processors.

166

4

Parallelisation strategy

As fully asynchronous physics is not feasible, two alternative parallelisation approaches are examined in this paper. The first uses the traditional 'bottom up' approach in which each loop is independently parallelised (similar to the microtasking employed by the UM on the Cray YMP and Cray C90 1 ' 3 ). Enhancements to this approach are also employed which remove (a potentially large amount of) thread initialisation and synchronisation using functionality provided in OpenMP. In the second approach, each physics section is 'segmented' into a loop of separate independent portions of work (where possible). This loop is then parallelised. A load balancing technique is also used to improve the performance of this approach. Segmenting, will be described in Section 6.3. 5

Experimental method

The ND code was run on a 16 processor Origin 2000; each processor being a MIPS 195MHz R10000. The L2-cache was 4M and the total memory was 6144M. The compiler was MlPSpro f90 version 7.3.1.1m and the OS was IPJX release 6.5. All routines were compiled using -O optimisation. Runs were performed using the SGI 'miser' queuing system. Miser is similar to NQE except that, on the Origin2000, it guarantees exclusive access to processors. It does not, however, reserve any memory or interconnect bandwidth. Results are presented for one to eight processors. For each data point the fastest time from three runs was taken. Timings were taken by interfacing to the P A P P performance tuning library. A global climate run (resolution 96x73x19) was used as a test bed for parallelisation. This was chosen as it was small, so that, firstly, runs did not take too much time and secondly, scalability issues were highlighted at smaller numbers of processors. 6 6.1

Implementation and Results h-ppn

This section presents the implementation strategies and results for large scale precipitation (ls_ppn). This routine is not the most computationally costly but was chosen as it is relatively small, therefore it was easy to test different strategies and provided a difficult test due to the limited amount of work. As "See http://icl.cs.utk.edu/projects/papi

167

it is relatively small it is also possible to present and explain its structure in reasonable detail. ls_ppn has a relatively simple structure. For each atmospheric level, a gather of points is performed to avoid redundant computation, based on whether precipitation can occur at that point. Work is then performed on these gathered points. This work is partitioned using the (original UM) one dimensional vector form, whereas the full arrays are dimensioned and accessed using the (new ND) latitude and longitude form. The code structure is summarised below*: subroutine ls_ppn() call IspconO do .. end do ! lat, long do ! levs do . . end do ! lat, long (gather index) call ls_ppnc() end do end subroutine ls_ppnc() do . . end do ! points (gather) call lsp_ice() do .. end do ! points (scatter)

subroutine lsp_ice() do .. end do ! points do ! iters call qsat call qsat.vat do .. end do ! points end do do .. end do ! points end subroutine qsat_wat() do . . end do > points end

end subroutine q B a t O do .. end do ! points end

Two parallel approaches are presented here. The first is a bottom up and the second a segmenting approach. 6.2

Bottom up approach

In the bottom up approach, the loops outlined in the previous section are parallelised. Three implementations were examined using this approach. The first (base) implementation used the automatic parallelising compiler option (invoked with the -apo flag) on the target platform. This caused the code to slow down as the number of processors increased. The second implementation manually added directives to the parallelisable loops. In this case it was decided to parallelise over latitude and longitude (and gathered latitude and longitude), keeping the calculation over levels sequential. This decision was based on the fact that, in general, the physics code is independent in latitude and longitude and has many dependencies "For the purposes of the paper the reader does not need to know the function of these routines, just their structure.

168

over levels. Therefore, not only is it likely to be more efficient to parallelise the routine this way but it is also expected that other routines will be parallelised in the same manner, facilitating improved data locality. Furthermore, there axe very few levels compared with latitude and longitude points. The routine ls_ppn was parallelised in longitude (the outer dimension and iterator) for the latitude/longitude loop. It was not possible to parallelise both indices (without code modification) as OpenMP is limited to the parallelisation of a single iterator. The gather index calculation is not paxallelisable and was performed sequentially. The loops in the remaining routines perform work on the gathered data and are all singly nested and paxallelisable. To help reduce any thread startup overheads the whole routine was enclosed in a single OpenMP parallel region and loops were partitioned using the OpenMP DO directive, rather than using a parallel region for each loop with the combined parallel region and parallel loop construct (PARALLEL DO). Note, OpenMP allows the number of executing threads to vaxy between parallel regions, but within a parallel region the number is fixed. Even with this optimisation this approach caused the code to slow down as the number of processors increased. subroutine ls_ppn() call IspconO c$omp parallel c$omp do do .. end do ! lat,long c$omp end do nowait do ! levs c$omp barrier c$omp master do .. end do ! lat.long (gather index) c$omp end master c$omp barrier call ls_ppnc() end do c$omp end parallel end subroutine ls.ppncO c$omp do do .. end do ! points (gather) c$omp end do nowait call lsp_ice() c$omp do do .. end do ! points (scatter) c$omp end do nowait end

subroutine lsp_ice() c$omp do do .. end do ! points clomp end do nowait do ! iters call qsat_omp call qsat_wat_omp c$omp do do .. end do ! points c$omp end do nowait end do c$omp do do .. end do ! points c$omp end do nowait end subroutine qsat_wat_omp() c$omp do do .. end do ! points c$omp end do nowait end subroutine qsat_omp() c$omp do do .. end do ! points c$omp end do nowait end

The third implementation removed any unnecessary barriers from the second implementation. By default, parallel loops in OpenMP perform a barrier synchronisation on completion. This can be suppressed with an ad-

169

ditional OpenMP (NO WAIT) directive. The loops in ls.ppnc, lsp Jce, qsat and qsat-wat are all enclosed within a single parallel region and have the same bounds. All loops are therefore guaranteed to use the same number of threads and, if each thread computes the same portion of the iteration space for each loop, there should be good data locality between loops and, for a static schedule, no barriers will be required. This implementation, therefore, must be run with a static schedule and relies on the OpenMP compiler to use the same partition and thread ordering for each loop. This is not an unreasonable assumption, and works for this target machine, but is not guaranteed in the OpenMP specification. The code structure of this third implementation is summarised on the previous page. In the above implementation the parallel qsat and qsat-wat routines were copied and renamed, as qsat and qsat_wat are also called from convection. The performance results for this strategy are given in Figure 1. 6.3

Segmenting approach

The segmenting approach splits the latitude and longitude points into separate segments at the control level i.e: from the routine that calls ls_ppn. From this level, the gathering and scattering of points is independent as it is performed separately for each segment. The original code has a single subroutine call: call

ls_ppn(cf,....rows,row_length f ...)

The parallel code implements a parallel loop with the number of iterators equal to the number of threads. Each thread has its own segment calculated beforehand by the partition subroutine. In this case only the longitude is partitioned, however it is possible to partition latitude, or both latitude and longitude. An advantage of this approach is that fewer code changes are required c and these are kept at the control level. The code structure is: call partition(nthreads,loj,hij f.. .) c$omp parallel do private (...) do i=ltnthreads local_row_length=row_length local_rows=hij(i)-loj(i)+1 rl_idx=l r_idi=loj(i) call ls_ppn(cf(rl_idx,r_idx,l),. . .,rows,row_length,... k ,local_rows,local_row_length) end do c$omp end parallel do

c

Directives are considered as code changes here.

170

A potential problem with this approach is load imbalance. In the first instance the partition subroutine gives each processor an equal number of points to compute. However, if the separate gathers inside ls_ppn do not reduce the number of computed points by similar amounts, some processors will have (potentially significant) more work to do than others. The solution employed here was to use feedback guided dynamic loop scheduling4. This technique uses the time from the previous iteration to calculate an improved partition. This is possible as the routine is called many times (once each timestep) and the load varies slowly over timesteps. The result of using this strategy is shown in Figure 1.

Precipitation (bottom up) Precipitation (segments) Naive ideal -

Figure 1. Large Scale Precipitation performance

Despite the optimisations employed in the bottom up approach, it still scales poorly compared with the segmenting approach. This is probably due to the sequential index gather and associated synchronisation and communication. The conclusion from this result is that it is better to segment than to use a bottom up approach as, both better performance is obtained and less code change is required (with the added benefit that this occurs at the control level). Given this clear preference for segmenting, the strategy for parallelising the remaining code was to use this approach whenever possible. The rest

171

of this section presents the results of parallelising the remaining (computationally costly) physics routines. These are discussed in the order of their computational cost. 6.4

Long-wave Radiation

Long-wave radiation is already segmented in the same style as it was for the UM 3 . Although the callee references arrays in both latitude and longitude, inside the routine latitude and longitude points are treated as a single vector. 0.12 . -

Longwave radiation Naive ideal

0.1 -

0.08 -

J

0.06 -

0.04 -

0.02 -

0 I 1

' 2

'

'

1

'

'

1

3

4

5

6

7

8

Processore

Figure 2. Long-Wave Radiation performance

The results of this paxallelisation are given in Figure 2. In related work the super-linearity has been shown to be primarily due to better cache performance with smaller segment sizes and the performance on smaller numbers of processors can therefore be improved by increasing the number of segments per processor. 6.5

Short-wave Radiation

Short-wave radiation is called from the same routine as Long-wave radiation and has the same features and issues. The code was therefore parallelised in the same manner. Note, parallelisation gives an equal number of lit-points to each processor and is therefore load balanced. In contrast the MPI parallel

172

implementation gives an equal number of grid points to each processor and is therefore not load balanced. Load balance is easily achieved in the shared memory model as the associated communication is implicit and the cost of communication is small compared with computation.

Figure 3. Short-Wave Radiation performance

The results of this parallelisation are given in Figure 3. Again, the superlinearity has been shown to be primarily due to better cache performance with smaller segment sizes and the performance on smaller numbers of processors can be improved by increasing the number of segments per processor. 6.6

Convection

On examination of the convection code it was found that (in a similar way to the radiation routines) although the callee references arrays in both latitude and longitude, inside the routine latitude and longitude points are treated as a single vector. Unlike the radiation routines the segmented structure had been removed. However, the internal code still had the appropriate structure to allow segmenting i.e: the concept of local and global points. A loop was therefore added around the calling subroutine and the arrays indexed appropriately. This loop was then parallelised. The results of this parallelisation are given in Figure 4. Again, convection scales well using this strategy.

173

Convection Naive ideal

Figure 4. Convection performance

6.1

Boundary Layer

The Boundary layer code is large and complex. All of the code has been converted to use latitude and longitude indexing. The code performs computation on both data aligned to the pressure grid points and to the velocity grid points (these are staggered). Therefore, interpolation is performed which requires communication between processors. Due to its complex structure it was decided to parallelise the routine using an incremental (bottom up) approach. A number of code sections were identified within the Boundary layer subroutine. These were then parallelised using as few parallel regions and synchronisations as possible (following the strategy discussed in Section 6.2). The results of this parallelisation are shown in Figure 5. 6.8

Large Scale Cloud

The structure of Large scale cloud is similar to ls_ppn (see Section 6.1). For each level, a gather of points is performed, to avoid redundant computation, based on whether cloud can occur at that point. Work is then performed on these gathered points. All work on ungathered arrays is performed using latitude and longitude indexing.

174

Boundary layer Naive idea! -

Figure 5. Boundary Layer performance

,-:

Cloud (bottom up) — i — Cloud (segments) — x — Naive ideal

,je-""

,-*c" -

-'x'' .

-

*

'

,

"

•

*

"

"

-^ -

'

-

-',-'* ^---^^^^

-

4

5 Processors

Figure 6. Large Scale Cloud performance

In Figure 6 the results of both a bottom up approach (following the strat-

175

egy discussed in Section 6.2) and a segmented approach (incorporating the load balance technique mentioned in Section 6.3) are presented. These results supply further evidence that the segmenting approach is better than the bottom up approach.

7

Conclusions

Parallelisation at the segment level is more efficient than the bottom up approach (parallelising individual loops) and provides good scaling for the physics routines implemented using segmentation. Advantages of parallel segment approach are that it requires very little code modification and the modifications are implemented at the control level. Furthermore, the number of processors used for each parallelised routine can be varied at run time (if required). Therefore the segment implementation fulfills the constraints discussed in Section 2. Parallelising using a bottom up approach has produced disappointing results. Only one routine still uses this approach, and this was only chosen because of the complexity of this routine. OpenMP has proven to be very powerful and relatively simple to use. However, it is easy to make mistakes, and a simple method to check for correct results would be beneficial. The latitude/longitude indexing can cause problems when parallelising the code, and suffers from a limitation of OpenMP which only allows the paxalleUsation of a single iterator in a nested loop. The mixed use of different partitions can also cause additional unnecessary synchronisation. Future work will attempt to segment the boundary layer code and examine the possibility of merging smaller physics routines to further improve scalability. A more radical, and potentially more efficient, approach would be to implement fully asynchronous regions for each of the two main physics routines (atmos-physicsl and atmos-physics2). This approach would have to perform some redundant computation to avoid the synchronisation associated with interpolation and delay any i/o until after the routines have completed.

8

Acknowledgments

The authors would like to thank the Met Office for funding this work.

176

References 1. R. S. Bell A. Dickinson. The meteorological office unified model for data assimilation, climate modelling and nwp and its implementation on a cray y-mp. In Proc. 4th Workshop on the Use of Parallel Processors in Meteorology. World Scientific, 1990. 2. Openmp. http://www.openmp.org. 3. R. W. Ford D. F. Snelling A. Dickinson. Controlling load balance, cache use and vector length in the UM. In G. R. Hoffman N. Kreitz, editor, Proc. 6th Workshop on the Use of Parallel Processors in Meteorology, pages 195-205. World Scientific, 1994. 4. J. M. Bull R. W. Ford A. Dickinson. A feedback based load balance algorithm for physics routines in nwp. In G. R. Hoffman N. Kreitz, editor, Proc. 7th Workshop on the Use of Parallel Processors in Meteorology, pages 239-249. World Scientific, 1996.

177 C O N V E R T I N G T H E H A L O - U P D A T E S U B R O U T I N E IN T H E MET OFFICE UNIFIED MODEL TO CO-ARRAY FORTRAN

P A U L M. B U R T O N The Met Office, London

Road, Bracknell,

Berkshire

RG12 2SZ,

UK

BOB CARRUTHERS Cray UK Limited,

200 Brook Drive,

Green Park, UK

Reading,

Berkshire

RG2

6UB,

G R E G O R Y S. F I S C H E R , B R I A N H. J O H N S O N A N D R O B E R T W . N U M R I C H Cray Inc.,

1340 Mendota

Heights

Road, Mendota

Heights,

MN 55120

USA

This paper describes an experiment to convert the halo-update subroutine in the Met Office Unified Model to Co-Array Fortran. Changes are made incrementally to the existing code, which is written mainly in Fortran 77 with some Fortran 90, by inserting co-array syntax only locally into the one routine t h a t performs most of the communication in the code. The rest of the code is left untouched. Performance of the modified co-array version is compared with the original SHMEM version and with an MPI version. As expected, Co-Array Fortran is faster than MPI but slower than SHMEM. The CAF code itself is much simpler t h a n either of the other two versions.

1

Introduction

There are at least two ways of implementing the Single-Program-MultipleData (SPMD) programming model. The one most often used is based on libraries, for example, the Cray SHMEM Library 1, the MPI Library 2 , the Global Arrays Library 3 , or others. The popularity of this approach comes from the fact that libraries are easy to define and to develop and it is easy to give them the appearance of portability from one system to another. Optimizing and implementing these libraries on any specific system, however, is time consuming and requires many resources not only in development but also in testing and quality control. Another approach is based on extensions to languages, for example, CoArray Fortran (CAF) 4 , Unified Parallel C (UPC) 5 , an extension to Java called Titanium 6 , or others. This approach has not been as popular as the library approach because not many people have access to compiler systems to implement extensions, extensions are perceived to be nonportable, and the work required to shepherd a language extension through the standardization process is more than most mortals are willing to bear.

178 Nevertheless, the language-based approach is superior to the library-based approach. From what we have learned about parallel processing, on both shared memory and distributed memory machines, we are able to distill the essence of parallel programming into a few simple language extensions. The essence of parallel programming is decomposition of a problem into logical domains that make sense for the problem at hand. Once that very difficult work has been done, the programmer needs little more than a simple, intuitive way to point from one logical domain to another, a way to get data as needed or to give data as needed. Unlike libraries, compilers understand all the legal data structures defined by the language and can handle data transfers symmetrically, on both sides of an assignment statement, so that the programmer can write arbitrarily complicated data communication statements and can expect the compiler to know what to do. Rather than a library dictating the kind of communication patterns the programmer is allowed to use, the programmer dictates to the compiler what communication patterns to perform. The language-based approach is a liberating approach to parallel programming. In this paper, we illustrate how to apply the essence of parallel programming to a real application code, and we show how the simple, elegant syntax of Co-Array Fortran expresses that essence. 2

The S P M D Programming model

Many computational scientists use the SPMD programming model as an effective tool for writing parallel application codes. The popularity of this model derives from at least two sources. A well written SPMD program has the best chance of scaling to large numbers of processors. More importantly, the SPMD model closely resembles the way that physical problems and numerical methods naturally decompose into domains. These domains are logical domains that make sense for the problem at hand. They are not hardware domains for particular hardware. How logical domains are mapped onto hardware is implementation dependent. Ideally, the programmer should not be aware of this mapping. Problem decomposition is the hard work of program development. It is relatively independent of the underlying hardware. Once the programmer has accomplished the hard work, it should not be redone each time the hardware changes underneath the program. If it has been clearly expressed in a language that knows how to represent the SPMD model, it shouldn't need to be re-expressed for a new machine. If the SPMD model is expressed as language extensions, then the compiler, if it is doing its job, knows how to map the logical description of the

179

decomposition onto a specific piece of hardware in such a way to take advantage of its particular strengths. In many cases, the map will be one-to-one, the easiest map to implement, in which one physical processor takes responsibility for the work and ownership of the data for one and only one domain. But the one-to-one mapping is not the only one possible or even the only one desirable. One physical processor might be responsible for the work associated with more than one logical domain. This idea has, in the past, been referred to as assigning virtual processors to a single physical processor, and in some cases it might be an effective way to obtain a modest level of work balancing across domains. On the other hand, for a system with multiple physical processors sharing a single physical memory, multiple physical processors may be responsible for the work associated with a single domain. This model has recently received much attention as researchers investigate the efficacy of a mixed OpenMP 7 under MPI programming model. In general, combining all variations, multiple physical processors may be responsible for multiple logical domains. Whatever the underlying implementation of the SPMD model, the programmer should be able to express a logical decomposition of a physical or numerical problem and expect it to be mapped efficiently onto any specific piece of hardware, perhaps with the help of some directives to the compiler or to the run-time system. While the SPMD model fits the domain decomposition model well, it has some well-known difficulties, which make program development hard. An old legacy code with some history to it, written with little attention to modular design and with no attention to parallel implementation, may be difficult to decompose into the SPMD model without heroic effort. The decomposition may be relatively straightforward for regular problems with, perhaps artificial, constraints restricting the size and number of domains relative to the number of physical processors. But if the domains have irregular sizes and shapes, figuring out the multitude of end cases that occur when dimensions are not a multiple of the number of domains can be a difficult problem to solve. The problem becomes even more difficult if the original code uses statically defined data arrays and the programmer wishes to update the code to a more modern design using dynamically allocated memory. Some of these problems can be solved using the method outlined in this paper. The dynamic memory problem can be solved using Fortran 90 derived types where the actual data is contained in a pointer component of the derived type. One array at a time may be allocated as a pointer, which may then be passed into existing subroutines and used as a normal array within the subroutine. Or conversely, the pointer component may be associated with

180

existing arrays to define an alias for a subarray already defined. The difficulties dealing with end cases may be encapsulated into initialization routines that look like constructors in an object-oriented language 8 . These routines contain all the logic for decomposing global data structures into decomposed structures, which depend on the run-time parameters of the program. This logic is worked out once in the constructors and then used over and over again rather than being reinvented for each new problem.

3

The Co-Array Fortran Model

The Co-Array Fortran model expresses the SPMD model by defining a Fortran-like syntax that allows the programmer to point from one logical domain to another using the familiar rules of normal Fortran. Its interpretation of the SPMD model is that a single program and all its data objects are replicated a fixed number of times. Some of these data objects are marked as co-arrays by specifying a co-dimension in their declaration statements. For example, a declaration such as r e a l , dimension[*] : : x defines a scalar variable x replicated across all the replications of the program. In addition, it indicates to the compiler and the supporting run-time system that a map must be created so that address of the variable in every replicated program is known by all of the replicated programs. The simplest map is to make all the addresses the same. The simpler the map, the easier the implementation. The co-dimension contains other information. Just as a normal dimension statement specifies a convenient numbering convention for addresses within a single memory, so the co-dimension defines a convenient numbering convention for the replicated variable across programs. In this case, assuming the normal Fortran convention of numbering indices from one, the replicated programs are numbered from one. If the programmer would rather number them from zero, or from any other base, the usual Fortran convention applies by simply changing the base for the co-dimension. For example, the declaration statement, r e a l , dimension[0:*] : : x tells the compiler to number the replications starting from zero.

181

The upper limit of a final co-dimension is always indicated by an asterisk because the total number of replications is always known to be a fixed value, which may not be, and probably will not be, set until run-time. The compiler doesn't need to know this number. Multi-dimensional co-arrays are allowed and are useful for representing domains in more than one dimension. In the next section, we define a coarray with two co-dimensions, which represents a natural way to decompose fields for a weather model. 4

I n c r e m e n t a l U p g r a d e from F o r t r a n 77 t o C o - A r r a y F o r t r a n

How can we incorporate the co-array model into an existing Fortran 77 code without tearing the code apart from the beginning. Can we incrementally, on a subroutine by subroutine basis, add co-arrays to the code? A typical haloupdate routine, for example, exchanges the overlap regions of arbitrary field variables from one domain to its four neighbors in the north-south-east-west directions. This kind of communication pattern is very common and occurs in many other applications 9 . The size and shape of the field variable may be, and probably will be, different for each domain and, more importantly, will be located at a different virtual address in the memory of each physical processor. One copy of a program knows nothing about the state of memory in another processor. If a variable has been declared as a co-array, however, each copy of the program has a map that tells it how to find the address of that variable on any other processor. So the solution to our problem is to define a co-array alias for the field variables as they come into the update subroutine. This alias can be defined locally and can be allocated dynamically upon entry to the routine. To be specific, suppose the subroutine interface looks something like the following. interface subroutine swapBounds(f i e l d , I o n s , l a t s , l e v s ) i n t e g e r :: I o n s , l a t s , l e v s real,dimension(0:lons+i,0:lats+l,levs) :: field end subroutine swapBounds end i n t e r f a c e We define a co-array structure with a pointer component, which will be used to form an alias to the field variable.

182

type cafField real,pointer,dimension(:,:,:) integer :: size(3) end type cafField

::

field

This structure could, of course, hold more specific information about each kind of field. For our purposes, the structure mainly serves as an alias to fields so that one domain knows how to find the field on any other domain where the memory allocation may be completely different from the local memory allocation. If we were to convert the entire code to CAF, these structures would look more like objects in an object-oriented language. They would have procedures associated with them, although not methods as defined in an object-oriented language, which act like constructors, and all subroutines in the program would have generic interfaces that recognize named types as parameters. This approach using co-array syntax with the powerful features of Fortran 90 can alleviate much of the burdensome drudgery associated with writing SPMD programs. Declaring a co-array of this type, type(cafField),allocatable,dimension[:,:]

:: z

tells the compiler that a structure of this type will be allocated at run-time, one for each copy of the SPMD program, and that the co-dimension information will be specified at that time according to how the programmer wishes to relate the logical domains one to another. For example, if there are a total of n domains, the programmer may wish to factor the total such that n = npxnq. At run-time, the number of copies of the program is available from the intrinsic function n =num_images() from which the programmer is responsible for the factorization. Given the factorization, the allocation statement allocate(z[np,*]) results in each copy of the program allocating space for one structure in a special area of memory. Every copy must allocate the same structure of the same size in order to keep memory consistent across all the copies, and a barrier is enforced before any copy of the program may leave the allocate statement. The second thing that happens at the allocate statement is that the programmer has informed the compiler of the topology of the logical domains.

183 Just as with normal Fortran dimensions, only the first co-dimension is significant, the second one, implied by the asterisk, is known to be nq=n/np. The domains are then logically related to each other through these co-dimensions as shown in Figure 1. [np.nq]

North

[np,l]

[p+i,q] West

[p,q-i]

[1-1]

[p.q] [p-l,q] South

[p,q+i]

East [l,nq]

Figure 1. Correspondence between Logical Domains and Co-Dimensions

This three-dimensional problem is decomposed over the two longitudelatitude dimensions with a pencil of levels rising vertically above each domain. Two co-dimensions are a natural choice for representing this logical decomposition. If a processor owns a domain with logical co-index [p,q], then, if we adopt the convention that increasing the first dimension corresponds to motion from west to east, the processor's west neighbor has co-index [p-l,q] and its east neighbor is [p+l,q]. With the co-array structure allocated within the swapBounds () subroutine, a simple pointer assignment, z'/.field => f i e l d converts the Fortran 77 code to Co-Array Fortran. At the same time, recording the dimension information into the structure, z*/,size(l) = Ions; z%size(2) = l a t s ; z'/.size(3) = l e v s makes it available to other domains that may need it. This solves the problem of how one domain knows how to find data on remote memory stacks and to find the size and shape of that data. This problem is very difficult to solve in other SPMD programming models, but it is not a problem for CAF. Knowing how to find the information follows naturally from the design of the language. For example, to find the dimensions of a field on domain [r,s], one simply writes dims(:) = z [ r , s ] ' / , s i z e ( : ) The compiler generates code, appropriate for the target hardware, to read the

184

array component s i z e ( : ) of the structure z on the remote domain. In the same way, to obtain actual field data from a remote domain, one writes x = z[r,s],/.field(i,j,k) In this case, the compiler first reads the pointer component from the remote domain, which contains the address associated with the pointer on the remote domain, then uses that remote address to read the data from the heap on the remote domain. These deceptively simple assignment statements open all the power of CAF to the programmer. First, consider updating halo cell zero by reading all the latitudes from all the levels from the neighbor to the west. If we assume periodic boundary conditions, then the co-dimension index to the west is west = p - i ; if(west < 1) west = np For example, the halo-update from the west becomes a single line of code, field(0,1:lats,1:levs)=z[west,q]%field(lons,1:lats,i:levs) Similarly, the update from the east, where east = p+1 ; i f ( e a s t > np) east = 1 is a similar line of code, field(lons+l,l:lats,i:levs)=z[east,q]Afield(1,1:lats,l:levs:) In the north-south direction, the mapping is a bit more complicated since the domains must be folded across the poles. We do not give the specific mapping here, but point out that any mapping to another domain,be it simple and regular or complicated and irregular, may be represented as a function that returns the appropriate co-dimension. For example, north = neighbor('N')

might represent a function that accepts character input and returns the result of a complicated relationship between the current domain and some remote domain.

185

5

Experimental Results

We performed our experiments using the Met Office Unified Model (UM) 10 . The Met Office has developed this large Fortran77/Fortran90 code for all of its operational weather forecasting and climate research work. The UM is a grid point model, capable of running in both global and limited area configurations. The next major release of the UM will have a new semi-implicit non-hydrostatic dynamical formulation for the atmospheric model, incorporating a semi-lagrangian advection scheme. The code used in this paper is a stand-alone version of the model, which incorporates the new dynamical core. This was used for developing the scheme prior to insertion into the UM. The timing results are based around climate resolution (3.75 degree) runs. We performed two experiments, one to compare Co-Array Fortran with SHMEM and the other to compare it with MPI. The UM uses a lightweight interface library called GCOM u to allow interprocessor communication to be achieved by either MPI or SHMEM, depending on which version of the library is linked. Additionally, the halo exchange routine swapBounds ( ) exists in two versions, one using GCOM for portability and another using direct calls to SHMEM for maximum performance on the CRAY-T3E. We converted the swapBounds () subroutine from the SHMEM version to Co-Array Fortran, as outlined in Section 4, and then substituted it into the UM, this being the only change made to the code. This was then run with the two different versions of the GCOM library (MPI and SHMEM) to measure the performance. Table 1 shows that, with four domains in a 2 x 2 grid, the SHMEM version, times shown in column 2, runs about five percent faster than the pure MPI version, times shown in column 5. Although an improvement of five percent may not seem like much, because of Amdahl's Law this difference makes a big difference in scalability for large numbers of processors. This can be seen from the fact that at 32 domains the difference in times is already about fifteen per cent. The times for the SHMEM code with the CAF subroutine inserted are shown in column 3 of Table 1, and the times for the MPI version with CAF inserted are shown in column 4. Comparison of the results shows what one would expect: CAF is faster than MPI in all cases but slower than SHMEM. The explanation of the timing comparisons is clear. CAF is a very lightweight model. The compiler generates inline code to move data directly from one memory to another with no intermediate buffering and with very little overhead unlike the high overhead one encounters in something like the heavyweight MPI library, which strives more for portability than for performance. The SHMEM library, on the other hand, is a lightweight library, which

186

also moves data directly from memory to memory with very little overhead, but, in addition, its procedures have been hand optimized, sometimes even written in assembler, to obtain performance as close to peak hardware performance as possible. Hence, the compiler generated code produced for CAF is not as fast as SHMEM, but considering that the compiler alone generated the code with no hand optimization, the performance of the CAF version of the code is quite impressive. With more work to put more optimization into the compiler to recognize CAF syntax and semantics, there is no reason why the CAF version will not eventually equal the SHMEM version. Table 1. Total Time (s) Domains 2x2 2X4 2x8 4x4 4x8

6

SHMEM 191 95.0 49.8 50.0 27.3

SHMEM with CAF 198 99.0 52.2 53.7 29.8

M P I with CAF 201 100 52.7 54.4 31.6

MPI 205 105 55.5 55.9 32.4

Summary

This paper has outlined a way to convert the halo-update subroutine of the Met Office Unified Model to Co-Array Fortran. It demonstrated how to add CAF features incrementally so that all changes are local to one subroutine. Hence the bulk of the code remained untouched. It incidentally demonstrated the compatibility of three programming models, not a surprising result since they all assume the same underlying SPMD programming model. The communication patterns in this application are very regular, but the Co-Array Fortran model is not restricted to such patterns. In some cases, such as encountered in a semi-lagrangian advection scheme, it is necessary to gather a list of field elements from another domain using a line of code such as tmp(:)=z[myPal],/.field(i,list(:),k) Or it might be necessary to gather field elements from a list of different domains using a line such as tmp(:)=z[list(:)]y.field(i,j,k)

187

And at times it may be necessary to modify data from another domain before using it or storing it to local memory,

f i e l d ( i , j , k ) = f i e l d ( i , j , k ) + scale*z[list(:)]'/.field(i, j,k) No matter what is required, the programmer writes code that clearly describes the requirements of the application and the compiler maps that description onto the hardware using it in the most efficient way it can. Such a programming model allows the programmer to concentrate on solving a physical or numerical problem rather than on solving a computer science problem. In the end, the Co-Array Fortran code is simpler than either the SHMEM code or the MPI code. It is easier to read as well as easier to write, which means it is easier to modify and easier to maintain. In addition, its performance is as good as or better than the performance of library-based approaches, which have had much more optimization effort put into them. With a similar amount of work devoted to compiler development, the performance of the CAF model will approach or exceed that of any other model. References 1. Message passing toolkit. CRAY Online Software Publications, Manual 007-3687-002. 2. William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI, Portable Parallel Programming with the Message-Passing Interface. The MIT Press, 1994. 3. J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global Arrays: A nonuniform memory access programming model for high-performance computers. The Journal of Supercomputing, 10:197-220, 1996. 4. Robert W. Numrich and John K. Reid. Co-Array Fortran for parallel programming. ACM Fortran Forum, 17(2):1—31, 1998. http://www.coarray.org. 5. William W. Carlson, Jesse M. Draper, David E. Culler, Kathy Yelick, Eugene Brooks, and Karen Warren. Introduction to UPC and language specification. Technical Report CCS-TR-99-157, Center for Computing Sciences, 17100 Science Drive, Bowie, MD 20715, May 1999. http://www.super.org/upc/. 6. Katherine Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger, Susan Graham, David Gay, Phillip Colella, and Alexandar Aiken. Titanium: A high-performance Java dialect. Concurrency: Practice and Experience, 10:825-836, 1998.

188

7. OpenMP Fortran application program interface. http://www.openmp.org. 8. Eric T. Kapke, Andrew J. Schultz, and Robert W. Numrich. A parallel 'Class Library' for Co-Array Fortran. In Proceedings Fifth European SGI/Cray MPP Workshop, Bologna, Italy, September 9-10, 1999. 9. Robert W. Numrich, John Reid, and Kieun Kim. Writing a multigrid solver using Co-Array Fortran. In Bo Kagstrom, Jack Dongarra, Erik Elmroth, and Jerzy Wasniewski, editors, Applied Parallel Computing: Large Scale Scientific and Industrial Problems, pages 390-399. 4th International Workshop, PARA98, Umea, Sweden, June 1998, Springer, 1998. Lecture Notes in Computer Science 1541. 10. T. Davies, M.J.P. Cullen, M.H. Mawson, and A.J. Malcolm. A new dynamical formulation for the UK Meteorological Office Unified Model. In ECMWF Seminar Proceedings: Recent developments in numerical methods for atmospheric modelling, pages 202-225, 1999. 11. J. Amundsen and R. Skaalin. GC user's guide release 1.1. Technical report, SINTEF Applied Mathematical Report STF42 A96504, 1996.

189

PARALLEL ICE D Y N A M I C S IN A N O P E R A T I O N A L BALTIC SEA MODEL TOMAS WILHELMSSON Department of Numerical Analysis and Computer Science, Royal Institute of Technology, SE-100 44 Stockholm, Sweden E-mail: [email protected] HIROMB, a 3-dimensional baroclinic model of the North Sea and the Baltic Sea, has delivered operational forecasts to the SMHI since 1995. The model is now parallelized and runs operationally on a T3E, producing a 1 nautical mile forecast in the same time as a 3 nm forecast took on a C90. During the winter season, a large fraction of CPU time is spent on ice dynamics calculations. Ice dynamics include ice drift, freezing, melting, and changes in ice thickness and compactness. The equations are highly nonlinear and are solved with Newton iterations using a sequence of linearizations. A new equation system is factorized and solved in each iteration using a direct sparse solver. This paper focuses on the efforts involved in parallelizing the ice model.

1

Introduction

The Swedish Meterological and Hydrological Institute (SMHI) makes daily forecasts of currents, temperature, salinity, water level, and ice conditions in the Baltic Sea. These forecasts are based on data from a High Resolution Operational Model of the Baltic Sea (HIROMB). Within the HIROMB project, 1 the German Federal Maritime and Hydrographic Agency (BSH) and SMHI have developed an operational ocean model, which covers the North Sea and the Baltic Sea region with a horizontal resolution from 3 to 12 nautical miles (nm). This application has been parallelized 2 ' 3 ' 4 and ported from a CRAY C90 vector computer to the distributed memory parallel CRAY T3E-600. The memory and speed of the T3E allows the grid resolution to be refined to 1 nm, while keeping the execution time within limits. Figure 1 shows output from a 1 nm resolution forecast. 2

Model description

HIROMB gets its atmospheric forcing from the forecast model HIRLAM. 5 Input includes atmospheric pressure, wind velocity, wind direction, humidity and temperature, all at sea level, together with cloud coverage. Output includes sea level, currents, salinity, temperature, ice coverage, ice thickness and ice drift velocity. HIROMB is run once daily and uses the latest 48-hour

190 HIROMB (1 nm) Surface salinity and current TUE 26 SEP 2000 00Z +48

50

100

150

200

250

300

Figure 1. This is output from a 1 nautical mile resolution forecast centering on the South Baltic Sea. Arrows indicate surface current and colors show salinity. Blue is fresh water and red is salt water.

forecast from HIRLAM as input. There are plans to couple the models more tightly together in the future. The 1 nm grid (see Figure 2) covers the Baltic Sea, Belt Sea and Kattegat. Boundary values for the open western border at 10° E are provided by a coarser 3 nm grid which extends out to 6° E and includes Skagerrak. Boundary values for the 3 nm grid are provided by 12 nm resolution grid which covers the whole North Sea and Baltic Sea region. All interaction between the grids takes place at the western edge of the finer grid where values for flux, temperature, salinity, and ice properties are interpolated and exchanged. Water level at the 12 nm grid's open boundary is provided by a storm surge model covering the North Atlantic. Fresh water inflow is given at 70 major river outlets. In the

191

100

200

300 400

500 600 700

Figure 2. The 1 nm resolution grid is shown to the left. Colors indicate depth with maximum depth south of the Norwegian coast around 600 meters. The grid has 1,126,607 active points of which 144,449 are on the surface. The 3 n m grid, on the upper left, provides boundary values for the 1 nm grid and has 154,894 active grid points of which 19,073 are on the surface. T h e 12 nm grid, on the lower right, extends out to whole North Sea and has 9,240 active points and 2,171 on the surface. It provides boundary values for the 3 n m grid. The sea level at the open North Sea boundary of the 12 nm grid is given by a storm surge model.

vertical, there is a variable resolution starting at 4 m for the mixed layer and gradually increasing to 60 m for the deeper layers. The maximum number of layers is 24. In the future, we plan to replace the 3nm and 12 nm grids with one 6 nm grid, and also to double the vertical resolution. HIR.OMB is very similar to an ocean model described by Backhaus. 6,T Three different components may be identified in the HIROMB model: the baroclinic part, the barotropic part, and ice dynamics. In the baroclinic part, temperature and salinity are calculated for the whole sea at all depth levels. Explicit two-level time-stepping is used for horizontal diffusion and advection. Vertical exchange of momentum, salinity, and temperature is computed implicitly. In the barotropic part, a semi-implicit scheme is used for the vertically

192

Grid 12 nm 3nm 1 nm

Equations 1,012 16,109 144,073

Non-zeros 4,969 84,073 762,606

Table 1. Matrix sizes used in ice dynamics iterations from a May 3, 2000 forecast.

integrated flow, resulting in a system of linear equations (the Helmholtz equations) over the whole surface for water level changes. This system is sparse and slightly asymmetric, reflecting the 9-point stencil used to discretize the differential equations over the water surface. It is factorized with a direct solver once at the start of the simulation and then solved for a new right-hand side in each time step. During mid-winter ice dynamics dominates the total computation time, and its parallelization will be the focus of the rest of this paper. 3

Ice dynamics

Ice dynamics occur on a very slow time scale. It includes ice drift, freezing, melting, and changes in thickness and compactness. The ice cover is regarded as a granular rubble of ice floes and modeled as a 2D continuously deformable and compressible medium. The viscous-viscoplastic constitutive law is based on Hibler's model. 8 The dynamic state—constant viscosity or constant yield stress—may change locally at any time, creating a highly nonlinear equation system. The system is solved with Newton iterations using a sequence of linearizations of the constitutive law until quasi-static equilibrium is reached. In each iteration a new equation system is factorized and solved using a direct sparse solver. Convergence of the nonlinear system is achieved after at most a dozen iterations. The ice model is discussed in detail by Kleine and Sklyar. 9 Profiling shows that almost all computation time for ice dynamics is spent in the linear solver. The linearized ice dynamics equation system contains eight unknowns per ice covered grid point: two components of ice drift velocity and three components of strain rates and three components of stress. Hence, even with a small fraction of the surface frozen, the ice dynamics system may become much larger than the water level system which has only one equation per grid point. The system is unsymmetric, indefinite, with some diagonal elements being very small: max, |oy| > 10 6 |ajj|. It is also strongly ill-conditioned. The matrix from the 12 nm grid in Table 1 has condition number K 2 « 6.6 • 10 15

193

which is close to the machine precision limit. On parallel computers large linear systems are most commonly solved with iterative methods. But for ill-conditioned matrices an iterative solver needs a good preconditioner in order to have rapid and robust convergence. Kleine and Sklyar 9 found it difficult to apply common iterative solvers to the ice equation system and they instead turned to a direct solver. It was therefore decided to use a direct solver also for parallel HIROMB". Lack of diagonal-dominance, indefiniteness, and large condition number suggest that partial pivoting would be necessary to get an accurate solution with a direct solver. Serial HIROMB, however, successfully makes use of the YSMP 10 solver which does no pivoting. We have compared results with and without partial pivoting using the direct solver SuperLU. n Both the residuals and the solution differences indicate that pivoting is not needed for these matrices. Furthermore, as the linear system originates from Newton iterations of a nonlinear system, an exact linear solution is not necessary for the nonlinear iterations to converge. 4

Parallel Direct Sparse Matrix Solvers

At the start of the parallelization project, there were few direct sparse solvers available for distributed memory machines. At the time we only had access to a solver written by Bruce Herndon, then at Stanford University. 12 Like the YSMP solver it handles unsymmetric matrices, and it does not pivot for stability. Later the MUMPS 13 solver from the European PARASOL project 14 was added as an alternative. The solution of a linear system with a direct method can be divided into four steps: (i) Ordering where a fill-reducing reordering of the equations is computed. Multiple minimum degree (MMD) and nested dissection (ND) are two common algorithms for this, (H) Symbolic factorization which is solely based on the matrix' non-zero structure. (Hi) Numeric factorization where the actual LU factors are computed, (iv) Solving which involves forward and back substitution using the computed LU factors. In HIROMB's ice dynamics a sequence of matrices are factorized and solved until convergence in each time step. During the iterations the non-zero structure of the matrices stays the same, so the ordering and symbolic factorization steps need only to be done once per time step. Herndon's solver performs all four solution steps in parallel, whereas version 4.0.3 of the MUMPS "Some tests with Matlab's GMRES and incomplete factorization as preconditioner have been done with the systems in Table 1. At least 45% of the non-zero entries in L and U had to be retained for the algorithm to converge.

194

solver has only parallelized the two latter numerical factorization and solve stages. Both Herndon's solver and MUMPS use the multi-frontal method 15 to factorize the matrix. The multi-frontal method seeks to take a poorly structured sparse factorization and transforms it into a series of smaller dense factorizations. These dense eliminations exhibit a well understood structure and can be made to run well using techniques already developed for dense systems. In Herndon's solver, processors factorize their own local portions of the matrix independently and then cooperate with each other to factorize the shared portions of the matrix. The shared equations are arranged hierarchically into an elimination tree, where the number of participating processors is halved at each higher level in the hierarchy. Thus, if the fraction of shared equations is large, performance will suffer due to lack of parallelism. Due to this arrangement, the Herndon solver also requires the number of participating processors to be a power of two. The MUMPS solver is able to parallelize the upper levels of the elimination tree further by using standard dense parallel solvers to factorize the equations remaining near the root of the elimination tree.

5

Matrix decomposition

Since MUMPS does the equation reordering and symbolic factorization serially, the whole matrix is first moved to a master processor. During symbolic factorization an optimal load balanced matrix decomposition is computed. The matrix is then distributed onto the processors according to the computed decomposition for subsequent numerical factorizations. Herndon's solver accepts and uses the original matrix decomposition directly. No data redistribution is necessary before calling the solver. But performance will suffer if the initial matrix decomposition is not load balanced and suitable for factorization. As can be seen in Figure 3, HIROMB's ice distributions are in general very unbalanced. The performance of Herndon's solver for HIROMB's ice matrices may be substantially improved by first redistributing the matrix using ParMETIS. ParMETIS 16 is a parallel version of the popular METIS package for graph partitioning and fill-reducing matrix ordering. It can compute a matrix decomposition which is optimized for parallel factorization. The matrix is first decomposed using multilevel recursive bisection and then each partition is ordered locally with a multiple minimum degree (MMD) algorithm.

195

100

200

300

400

500

600

700

Figure 3. The picture on the left shows the ice distribution in the 1 nm grid on May 3, 2000. About 13% of the grid points (18,251 out of 144,449) were ice covered. The picture on the right shows an example of how this grid would be decomposed into 74 blocks and distributed onto 16 differently colored processors.

6

Solver performance

We have measured solver performance on a set of equation systems from a late spring forecast (May 3, 2000). On this date 13% of the l n m grid was ice covered as shown in Figure 3. Table 1 gives the matrix sizes. We limit the discussion below to the matrices of the two finer grids 3 nm and 1 nm. Although none of the solvers show good speedup for the 12 nm grid, the elapsed times are negligible in relation to the time for a whole time step. Times for solving ice matrices from the 3nm grid are shown in upper graph of Figure 4. The time to compute a fill-reducing equation ordering is included under the symbolic factorization heading. The graph shows that the MUMPS solver is the best alternative for all processor counts up to 16. For 32 processors and above Herndon's solver combined with ParMETIS redistribution gives the best performance. The lower graph of Figure 4 gives times to factorize and solve the matrix from the 1 nm grid. This matrix is 9 times larger than the 3 nm matrix and at least 8 processors were necessary to generate it due to memory constraints. Here, MUMPS is slowest and shows little speedup. Herndon's solver with ParMETIS gives by far the best performance in all measured cases.

196 Factorize and solve ice matrix once

4

8

16

Number of processors Factorize and solve ice matrix once

16 32 Number of processors Figure 4. The top graph gives measured time to factorize and solve an ice matrix from the 3 nm grid. For each number of processors a group of three bars is shown. T h e left bar gives times for the MUMPS solver, the middle bar times for Herndon's solver, and the right bar gives times for Herndon's solver with ParMETIS redistribution. Timings for Herndon's solver on 64 processors are missing. The lower graph gives measured time spent to factorize and solve t h e ice matrix from t h e 1 n m grid. At least 8 processors were necessary for this matrix due to memory constraints.

197

In this forecast, on average three iterations were necessary to reach convergence for the nonlinear system in each time step. So in order to get all time spent in the solver, the times for the numerical factorization and solve phases should be multiplied by three. However, doing this does not change the mutual relation in performance between the solvers. The differences in time and speedup between the solvers are clearly revealed in Figure 5. Here a time line for each processor's activity is shown using the VAMPIR tracing tool. MUMPS is slowest because it does not parallelize symbolic factorization which accounts for most of the elapsed time. MUMPS would be more suitable for applications where the matrix structure does not change, e.g. the Helmholtz equations for water level in HIROMB. Herndon's solver is faster because it parallelizes all solution phases. But it suffers from bad load balance as the matrix is only distributed over those processors that happen to have the ice covered grid points. But by redistributing the matrix with ParMETIS all processors can take part in the computation which substantially improves performance.

6.1

ParMETIS

optimizations

Initially ParMETIS had low speedup and long execution times. For the 1 nm matrix in Table 1 the call to ParMETIS took 5.9 seconds with 32 processors. The local ordering within ParMETIS is redundant as Herndon's solver has its own MMD ordering step. By removing local ordering from ParMETIS, the time was reduced to 5.0 seconds and another 0.7 seconds was also cut from the solver's own ordering time. HIROMB originally numbered the ice drift equations first, then the strain rate equations and finally the stress equations. This meant that equations referring to the same grid point would end up far away in the matrix. Renumbering the equations so that all equations belonging to a grid point are held together reduced the ParMETIS time by 3.8 seconds down to 1.2 seconds. Now the whole matrix redistribution, factorization and solve time, 4.7 seconds, is lower than the initial ParMETIS time. HIROMB's ice matrices may become arbitrarily small when the first ice appears in fall and the last ice melts in spring. Due to a bug in the current version 2.0 of ParMETIS, it fails to generate a decomposition when the matrix is very small. By solving matrices smaller than 500 equations serially without calling ParMETIS the problem is avoided.

198

Figure 5. VAMPIR traces of MUMPS (top), Herndon (middle) and Herndon with ParMETIS (bottom) when solving the 1 nm grid ice matrix from the May 3, 2000 dataset on 16 processors. Yellow respresents active time spent in the solver, green is ParMETIS time and brown is wasted waiting time due to load inbalance. Intialization shown in light blue is not part of the solver time. Compare with Figure 4, lower part.

199

7

Conclusion

The ice dynamics model used in HIROMB necessitates an efficient direct sparse solver in order to make operational l n m forecasts feasible. The performance achieved with Herndon's solver and ParMETIS was absolutely essential. SMHI's operational HIROMB forecasts today run on 33 processors of a T3E-600. The water model (batotropic and baroclinic) accounts for 12 seconds per time step. A full summer (ice free) 48-hour forecast is produced in 1.0 hours with a time step of 10 minutes. The May 3, 2000 forecast with 13% ice coverage (see Figure 3) would with the MUMPS solver take 6.8 hours. Using Herndon's solver with ParMETIS the forecast is computed in 1.8 hours. The difference between the solvers grows even larger with more ice coverage. Acknowledgments The parallelization of HIROMB was done together with Dr Josef Schiile at the Institute for Scientific Computing in Braunschweig, Germany. The author would like to thank Lennart Funkquist at SMHI and Dr Eckhard Kleine at BSH for their help with explaining the HIROMB model. Computer time was provided by the National Supercomputer Centre in Linkoping, Sweden. Financial support from the Parallel and Scientific Computing Institute (PSCI) is gratefully acknowledged. References 1. Lennart Funkquist and Eckhard Kleine. HIROMB, an introduction to an operational baroclinic model for the North Sea and Baltic Sea. Technical report, SMHI, Norrkoping, Sweden, 200X. In manuscript. 2. Tomas Wilhelmsson and Josef Schiile. Running an operational Baltic sea model on the T3E. In Proceedings of the Fifth European SGI/CRAY MPP Workshop, CINECA, Bologna, Italy, September 9-10 1999. URL: http://www.cineca.it/mpp-workshop/proceedings.htm. 3. Josef Schiile and Tomas Wilhelmsson. Parallelizing a high resolution operational ocean model. In P. Sloot, M. Bubak, A. Hoekstra, and B. Hertzberger, editors, High-Performance Computing and Networking, number 1593 in LNCS, pages 120-129, Heidelberg, 1999. Springer. 4. Tomas Wilhelmsson and Josef Schiile. Fortran memory management for parallelizing an operational ocean model. In Her-

200

5. 6.

7.

8. 9.

10.

11.

12.

13.

14.

15. 16.

mann Lederer and Friedrich Hertweck, editors, Proceedings of the Fourth European SGI/CRAY MPP Workshop, pages 115123, IPP, Garching, Germany, September 10-11 1998. URL: http://www.rzg.mpg.de/mpp-workshop/papers/ipp-report.html. Nils Gustafsson, editor. The HIRLAM 2 Final Report, HIRLAM Tech Rept. 9, Available from SMHI. S-60176 Norrkoping, Sweden, 1993. Jan 0 . Backhaus. A three-dimensional model for the simulation of shelf sea dynamics. Deutsche Hydrographische Zeitschrift, 38(4):165-187, 1985. Jan 0 . Backhaus. A semi-implicit scheme for the shallow water equations for application to sea shelf modelling. Continental Shelf Research, 2(4):243-254, 1983. W. D. Hibler III. Ice dynamics. In N. Untersteiner, editor, The Geophysics of Sea Ice, pages 577-640. Plenum Press, New York, 1986. Eckhard Kleine and Sergey Sklyar. Mathematical features of Hibler's model of large-scale sea-ice dynamics. Deutsche Hydrographische Zeitschrift, 47(3):179-230, 1995. S. C. Eisenstat, H. C. Elman, M. H. Schultz, and A. H. Sherman. The (new) Yale sparse matrix package. In G. Birkhoff and A. Schoenstadt, editors, Elliptic Problem Solvers II, pages 45-52. Academic Press, 1994. Xiaoye S. Li and James W. Demmel. A scalable sparse direct solver using static pivoting. In 9th SIAM Conference on Parallel Processing for Scientific Computing, 1999. Bruce P. Herndon. A Methodology for the Parallelization of PDE Solvers: Application to Semiconductor Device Physics. PhD thesis, Stanford University, January 1996. Patrick R. Amestoy, Ian S. Duff, and Jean-Yves L'Excellent. MUMPS multifrontal massively parallel solver version 2.0. Technical Report TR/PA/98/02, CERFACS, 1998. Patrick R. Amestoy, Iain S. Duff, Jean-Yves L'Excellent, and Petr Plechac. PARASOL An integrated programming environment for parallel sparse matrix solvers. Technical Report RAL-TR-98-039, Department of Computation and Information, Rutherford Appelton Laboratory, Oxon, UK, May 6 1998. I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, London, 1986. George Karypis and Vipin Kumar. A coarse-grain parallel formulation of multilevel k-way graph partitioning algorithm. In 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997.

201 PARALLEL COUPLING OF R E G I O N A L A T M O S P H E R E OCEAN MODELS

AND

STEPHAN FRICKENHAUS Alfred-Wegener-Institute Columbusstrasse, E-mail:

for Polar and Marine Research, 27568 Bremerhaven, Germany [email protected]

RENE REDLER AND PETER POST Institute for Algorithms and Scientific Computing, German National Research Center for Information Technology Schloss Birlinghoven, D-53754 Sankt Augustin, Germany In coupled models the performance of massively parallel model components strongly suffers from sequential coupling overhead. A coupling interface for parallel interpolation and parallel communication is urgently required to work out this performance dilemma. Performance measurements for a parallel coupling of parallel regional atmosphere and ocean models are presented for the CRAY-T3E-1200 using the coupling library MpCCI. The different rotated grids of the models MOM2 (ocean-seaice) and PARHAM (atmosphere) are configured for the arctic region. In particular, as underlying MPI-implementations CRAY-MPI and MetaMPI are compared in their performance for some relevant massive parallel configurations. It is demonstrated that an overhead of 10% for coupling, including interpolation and communication, can be achieved. Perspectives for a common coupling specification are given enabling the modeling community to easily exchange model components as well as coupling software, making model components reusable in other coupling projects and on next generation computing architectures. Future applications of parallel coupling software in parallel nesting and data assimilation are discussed.

1

Introduction

The climate modeling community produces a growing number of model components, e.g., for simulations of atmosphere, ocean and seaice. Currently, more and more model codes are parallelized for running on massively parallel computing hardware, driving numerical performance to an extreme, mostly with the help of domain decomposition and message passing techniques. From the undoubted need for investigation in coupled high performance models a new performance bottleneck appears from the necessary interpolation and communication of domain decomposed data between the model components 1 . In particular, scalability of coupled massively parallel models is strongly bound when using a sequential coupling scheme, i.e., gathering distributed data from processors computing the sending model component, interpolating, communicating and scattering data to processors computing the receiving

202

model component. The alternative to such an external coupling approach is the internal coupling approach: mixing the codes of model components to operate on the same spatial domains, for convenience with the same spatial resolution. Thus, internal coupling puts strong limits on the flexibility of the model components. In external coupling, the performance of the coupled model can be optimized by running model components in parallel, each on an optimal, load balancing number of processors. Furthermore, external coupling allows for an easy replacement of model components, at least, when a certain standard for coding the coupling is followed. To overcome the bottleneck of sequential coupling, a set of parallel coupling routines is required, capable of parallel interpolation of data between partly overlapping domains and of managing all required communication in parallel, e.g., by a message passing technique. As an implementation of such a functionality, the mesh based parallel code coupling interface MpCCI 3 ' 4 is used in the following. MpCCI can be considered as a medium level application programming interface, hiding the details of message passing and interpolation in a library, while offering a small set of subroutines and an extensive flexibility by the use of input configuration files. It is advantageous for the integration into a certain class of model codes, to encapsulate calls to the library in a high level interface, the model interface, allowing, for example, for an easy declaration of regular domain decomposed grids and for a simple call to a coupling routine. The details of the interface developed for the presented models are not subject to this paper. Instead, performance measurements for the specified arctic regional model in different massively parallel configurations and an outline for further applications as well as for standardization of model interfaces are presented. Concerning the programming effort of coupling it may be unpractical to mix model components, due to the fact that memory per processor is limited, used file unit numbers or naming of variables and/or common blocks may coincide. Making model components compatible to work in a single executable (SPMD) by using a common I/O-library and a common memory allocation scheme may be achievable for model codes of low complexity. However, such a procedure must be repeated for every new model component and also for model code updates; furthermore the reusability of coupled model components is better without references to special I/O-managing libraries and naming conventions. The approach to leave model component codes in separate binaries (MPMD, i.e., multiple Program Multiple Data) seems much more practical. However, on certain computing architectures this requires a metacomputing

203

library for message passing. For example, on a CRAY-T3E, using CRAYMPI, two different executables cannot be launched in one MPI-context; it is also not possible with CRAY-MPI or CRAY-shmem to establish message passing communication between separately launched groups of MPI-processes, i.e., between application teams. This is worked around with metacomputing MPIimplementations, such as metaMPI or PACX 2 . Furthermore, a metacomputing MPI allows for coupling model components across different computing architectures, even in different locations, provided that a high bandwidth low latency network connection is installed. In the following presentation of performance measurements the potentials of metaMPI and CRAY-MPI are investigated in detail. 2

M p C C I and the Model Interface for Domain Decomposed Data

MpCCI is designed as a library. It enables to loosely couple different (massively) parallel or sequential simulation codes. This software layer realizes the exchange of data which includes neighborhood search and interpolation between the different arbitrary grids of any two codes that take part in the coupled problem. In parallel applications the coupling interfaces of each code can be distributed among several processors. In this case communication between pairs of processes is only realized where data exchange is necessary due to the neighborhood relations of the individual grid points. In the codes themselves the communication to MpCCI is invoked by simple calls to the MpCCI library that syntactically follow the MPI nomenclature as closely as possible. On a lower level, and hidden from the user, message passing between each pair of codes in a coupled problem is performed by the subroutine calls that follow precisely the MPI standard. Therefore the underlying communication library can be a native MPI implementation (usually an optimized communication libray tuned to the hardware by the vendor), MPICH or any other library that obeys the MPI standard like, e.g., metaMPI. It must be noted, that for coupling of domain decomposed data by interpolation on the nodes, elements must be defined, spanning the processor boundaries of data domains. Otherwise, gridpoints of the receiving model lying between gridpoints of the domain boundaries of the sending model do not receive data. This also requires the introduction of ghostpoint data that must be updated before sending data. Such a functionality is easily implemented in the model interface to MpCCI. Furthermore, due to the rather simple, but very precise conservative in-

204

Figure 1. The different rotated grids of the arctic atmosphere model HIRHAM and the ocean-seaice model MOM.

terpolation of fluxes in MpCCI, the received fluxes show up artifical patterns with strong deviations from a smooth structure. These deviations must be smoothed out locally, either by calculation of local mean values, or by a more sophisticated local smoother, that may be based on an anisotropic diffusion operator. Such a smoother with local diffusion coefficients calculated from the interpolation error of a constant flux is currently under development. Alternatively, one might use the non-conservative interpolation also for the fluxes and rescale the received data such that global conservativity is restored. 3

Measuring parallel coupling performance

The j arctic ocean-seaice model MOM2 has 243x171 horizontal gridpoints on 30 levels. The \° atmosphere model HIRHAM works on 110x100 horizontal gridpoints on 19 levels. In figure 1 the rotated grids of the models are sketched over the arctic. The atmosphere model communicates 6 scalar fluxes and 2 scalar fields to the ocean-seaice model, making a total of 0.08 MW (1 Megaword = 8 Megabyte) coupling data on the sender site, and 0.33 MW for the receiver after interpolation. In the reverse direction 4 scalar fields are sent, summing up to 0.2 MW coupling data in total for the sender and 0.06 MW for the

205 Table 1. Performance measurements of MpCCI for the configuration of the coupled model of the arctic; bandwidth d a t a is given in Megawords per second [MW/s] (one word = one 8 byte double precision number), see text. OS: ocean send; OR: ocean receive; AR: atm. receive; AS: atm. send.

PEs ocn atm 20 80 30 110 1 100 PEs ocn atm 20 80 30 110 1 100

st dMPI MGPD/s] OS OR AR AS 1.15 1.28 1.97 0.20 1.06 0.92 1.62 0.16 0.29 0.37 0.066 5.53 std MPI/ metaM PI OS OR AR AS 14 20 8 151 24 37 27 135 0.7 1.0 0.5 61

metaMPI-local [MGPD/s] OS OR AR AS 0.092 0.057 0.026 0.013 0.044 0.025 0.006 0.012 0.387 0.342 0.135 0.091

receiver. Here gridpoints from non-overlapping domains were included in the counting. The lower block in table 1 shows the ratio of CRAY-MPI bandwidths over metaMPI bandwidths. Since the timed routines contain - besides the communication routines - MpCCI-implicit interpolation routines, the increase of bandwidth is not linearly dependent on the achievable increase in point-topoint bandwidth between processors of the two models when switching from metaMPI to CRAY-MPI. It is noteworthy at this point, that metaMPI has almost the same communication performance between the processors within the model components compared to CRAY-MPI, i.e., the performance of uncoupled models is unchanged. It is seen that in the case of coupling a single MOM-process (holding the full size arrays of boundary data) with 100 HIRHAM processes, the use of metaMPI has noteworthy influence only on the HIRHAM-to-MOM send bandwidth (61 times the CRAY-MPI bandwidth). In the setups with parallel MOM-coupling (upper two rows) the reduction of the bandwidth for HIRHAM-to-MOM send is also dominant. In figure 2 the timing results for a set of communication samples are displayed for 20 MOM processors coupled to 80 HIRHAM processors. The upper graph displays results from CRAY-MPI, the lower graph for metaMPI. The displayed samples are a sequence of 20 repeated patterns. The points in the patterns represent the timings of the individual processors. It is observed in the upper graph that the receiving of data in MOM

206

(MOM-RCV) takes the longest times (up to 0.225 seconds), while the corresponding send operation from parallel HIRHAM (PH-SND) is much faster (needs up to 0.05 seconds). In contrast, the communication times for the reverse direction are more balanced. In the lower graph, displaying the results for metaMPI-usage, communication times appear more balanced. The times of up to 4 seconds are a factor 18 above the corresponding CRAY-MPI measurements. In this massive parallel setup coupling communication times would almost dominate the elapsed time of model runs, since the pure computing time for a model time interval between two coupling communication calls (typically one hour model time) is on the same order of magnitude (data not shown). In figure 3 the timing results for communication are displayed for 30 MOM processors and 110 HIRHAM processors. Qualitatively the same behavior is seen as in figure 2. For the usage of CRAY-MPI (upper graph) comparable timings are measured. However, for metaMPI, the maximum times are 8 seconds for certain HIRHAM receive operations, which is a factor 2 longer than in figure 2, bottom graph. Clearly the ratio of communication times over computation times is even worse compared to the setup used for figure 2. Figure 4 depicts the timing results for coupling communication between one MOM processor and 100 HIRHAM processors. It is seen in the upper graph, that the MOM receive operations dominate the coupling communication times (about 0.85 seconds at maximum). This characteristic is also found for metaMPI-usage (lower graph). Interestingly, in this setup, also the coupling times are nearly unchanged. Furthermore, the four displayed operations are performed partially in parallel. The net time of 1.33 seconds used for one coupling communication call is also found for MetaMPI (data not shown). This corresponds well to the bandwidth ratios given in the lower block in table 1.

207

0,25 MOM-SND MOM-RCV PH-RCV PH-SND

0,2

u0,15

A •J :

••/%

-y\

"•

*

k v s W v ^

0,1

0,05

i.^ i*i, ***** * ^ ^ k 1 • * H» ^.-A^s ' ^ ^*V* .'"** ' *S samples

**r**-

•C'

o

-c

o

©

o

o

o o o- o . 0

o o o

o o o

0

o •o

0* 0

"O"

o o. o

o a o

o o o

'O' 0

o

o o o-

0 0

o-

"O-

o o

•o o

, Q

"0 0 X> ,

0

•a

o o o, o,

0 0 .0-

MOM-SND MOM-RCV PH-RCV PH-SND

Figure 2. Timing data measured under CRAY-MPI (top graph) and metaMPI (bottom graph), coupling 20 MOM processes with 80 HIRHAM processes

208

MOM-SND MOM-RCV PH-RCV PH-SND

0,25

0,2

%

• V- 4

:

•

¥

• ***/: **

\

-

•V; u0,i5 - .

• ;

.

• • • ^^^^^33K:=s&^.::tf^^t2£Ei&:£i&£^^Kfc^^

0,1

0,05

samples —. * *v»*^A^i?

^*-^

-*te

ifeil^fSftSslSSiS

jjf^f^w^imw"

*« * ^ * ^ t • * j i { * i ^ *r»

. / ,

_»*»«*+*,

*

.

.

*

.

' !*•

*r:;**';;^^*t?* t ::^::^**:^rt:*ft"*:***v?**<* , it**"***?*:t**:^"?«*: t i* A ***»« , i**^*^**^ . = , » * . ** . • « . * % * ; . . „ B- » . . * „ * * * „ « -*

\ *

*

'

T

sSp^Sf^®S^^%ftfii

. .'•

t

•

.

• • • •

MOM-SND MOM-RCV PH-RCV PH-SND

-

/

/

. !

*

**

. samples

Figure 3. Timing data measured under CRAY-MPI (top graph) and metaMPI (bottom graph), coupling 30 MOM processes with 110 HIRHAM processes

209

'

.

"

"

*

"

"

-

"

»

.

0.8

~ 0 . 6 --

o • •

MOM-SND MOM-RCV PH-RCV PH-SND

-

) KJ

^

'

w

*J

w

0.4

_

0.2

samples

1.5

a

O • •

MOM-SND MOM-RCV PH-RCV PH-SND

1 "f

iS-ft #

**4 fN?

*£!*

0-5"(iiWt|QioOiti^;-.aM«ii^io>iii^i,i»Vii^i^ita''
samples Figure 4. Timing data measured under CRAY-MPI (top) and metaMPI (bottom), coupling one MOM process, holding the full coupling boundary data arrays, with 110 HIRHAM processes

210

Using metaMPI has the obvious drawback that communication bandwidths between different processor groups, i.e., model components, are rather low, at least on the CRAY-T3E. This is due to the implementation of MetaMPI, using the low-bandwidth socket communication between application teams on the T3E. The socket communication maximum bandwidth of about 20 MB/s and its sequential character makes parallel communication patterns between processor groups strongly inefficient for massively parallel coupling. Thus, the sequentialization inherent in coupling of 100 HIRHAM processors with only one MOM processor is expected to have a comparable influence on inter-model bandwidth for both, metaMPI as well as CRAY-MPI. This is proved by the results of measurements given in table 1 and depicted in figure 4. The overhead of coupling is, in this case, about 10 % percent of the total elapsed time used for the simulation of one hour model interval, with a coupling frequency of 1/hr. It is concluded that a high performance parallel coupling strongly requires parallel high-bandwidth communication between the model components. Within the MPMD model, on other computing architectures than CRAY's T3E, this parallel communication may be easily achieved without a metacomputing MPI. Within a SPMD approach of mixing the codes, which requires much more work for restructuring complex codes, a fivefold coupling performance increase can be achieved. 4

Outline of a C o m m o n Coupling Specification — CCS

Having in mind the reusability and exchangeability of model components, it is attractive to think about a certain standardization of coupling interfaces for classes of models. Here not only the coding standard must be taken into account by defining subroutines for coupling, but also a proper definition of the physical quantities that a model component must send and receive. As an example, a number of defined fields/fluxes must be specified to be exported by ocean models to the coupling interface that is common to all currently used models or can be at least easily implemented into them. As the climate modeling community already operates with different types of couplers, e.g., the NCAR CSM flux coupler or the OASIS coupler, a common model interface to different coupling interfaces should be specified instead of selecting/developing a common coupling interface. This should allow for an easy exchange of the coupler as well as of the model components. However, it may become necessary to modify existing coupling interfaces to be compliant with the needs of the model interface.

211

In this sense a Common Coupling Specification is an intermediate specification of subroutines and data formats, independent from model as well as coupler implementation details. For these considerations it is completely irrelevant, whether the coupling interface is a coupling library or a separate coupling process.

5

Further application areas for parallel coupling libraries

Although coupling interfaces have until now widely been used in two dimensional coupling across the sea-atmosphere interface, further applications are easily conceivable. Here two perspectives are drawn. A common procedure in numerical weather prediction is to perform a sequence of simulations from the global scale down to regional scales. This implies an interpolation of boundary data as well as initial state data from large scale models to a hierarchy of nested models. The approach of storing this data on filesystems before being consumed by the more regional model simulations may in a parallel computation of the complete model hierarchy be replaced by the operation of a coupling interface. For reasons of reliability in an operational weather prediction environment nesting must be implemented on a flexible basis, allowing for both, I/O-based communication of data for sequential computation as well as message passing based communication for parallel computation. One advantage of the parallel nesting approach may be, that model components can run below their scaling limit, i.e., on a moderate number of processors, as results for the nested models are used (consumed) almost at the same time as they are produced. Thus, the investment of computer resources used to minimize the elapsed time of a prediction from sequential nesting may be used for model enhancements, e.g., higher spatial resolution or more complex numerical approximations to physical processes. In the framework of model nesting data assimilation is a central task, generating proper initial and boundary data for limited area models. Here one might think of coupling interfaces being used for the interpolation of simulated data onto a domain decomposed set of observations. This could allow for a load-balanced and scalable evaluation, for example in 4D-var, of the costfunction and its gradient. The mesh-based approach of MpCCI is very promising as it provides an efficient parallel interpolation on to a set of irregularly distributed points.

212

6

Conclusion

The performance dilemma of coupled parallel models can be overcome by the use of sophisticated parallel coupling interfaces. The profits of the external coupling approach in a MPMD model can fully be exploited only under an appropriate high performance implementation of the communication between processor groups. In general, a parallel communication facility is required for parallel coupling of massively parallel, domain decomposed models to be optimal. The foreseeable needs for more flexible coupled modeling environments, based on a whole variety of model components used in a model hierarchy, should be met by a community wide initiative defining an interface standard that specifies the fields and fluxes to be exchanged between certain model classes. As well, a rather limited set of subroutines is to be defined as the model interface to different coupling interfaces. Such a Common Coupling Specification together with a parallel coupling interface may also serve as a basis for further fields of application, such as parallel nesting and data assimilation. Furthermore, the transition of modeling environments to next generation computing architectures is much better preconditioned by standardization. Here, the activities of the MPI forum5 may serve as an example, having established successfully a specification for message passing and parallel I/O. Acknowledgments The performance measurements were part of the arctic model coupling project bvkpOl running at ZIB/Berlin using ZIB's T3E-1200. MetaMPI was kindly provided by Pallas GmbH, Briihl (Germany). References 1. S. Valcke, International Workshop on Technical Aspects of Future SeaIce-Ocean-Atmosphere-Biosphere Coupling, CERFACS, Technical Report TR/CMG 00-79, (2000). 2. Th. Eickermann, J. Henrichs, M. Resch, R. Stoy, and R. Volpel. Metacomputing in Gigabit Environments: Networks, Tools, and Applications, Parallel Computing 24, 1847-1872 (1998). 3. R. Ahrem, M.G. Hackenberg, P. Post, R. Redler, and J. Roggenbuck. Specification of MpCCI Version 1.0, GMD-SCAI, Sankt Augustin (2000). see also http://www.mpcci.org

213

4. R. Ahrem, P. Post, and K. Wolf. A Communication Library to Couple Simulation Codes on Distributed Systems for Multi-Physics Computations. PARALLEL COMPUTING, Fundamentals and Applications, Proceedings of the International Conference ParCo99. E.H. D'Hollander, G.R. Joubert, F.J. Peters, and H. Sips (eds.), Imperial College Press, 2000. 5. see http://www-unix.mcs.anl.gov/mpi/index.html

214

DYNAMIC LOAD BALANCING FOR ATMOSPHERIC MODELS G. KARAGIORGOS, N.M. MISSIRLIS1 AND F. TZAFERIS Department of Informatics, University of Athens, Panepistimiopolis 15784, Athens, Greece. ABSTRACT We consider the load balancing problem for a synchronous distributed processor network. The processor network is modeled by an undirected, connected graph G = (V,E) in which node v ; eV possesses a computational load w,. We want to determine a schedule in order to move load across edges so that the weight on each node is approximately equal. This problem models load balancing when we associate each node with a processor and each edge with a communication link of unbounded capacity between two processors. The schedules for the load balancing problem are iterative in nature and their theory closely resembles the theory of iterative methods for solving large, sparse, linear systems [17,20]. There are mainly two iterative load balancing algorithms: diffusion [5] and dimension exchange [5,19]. Diffusion algorithms assume that a processor exchanges load between neighbour processors simultaneously, whereas dimension exchange assumes that a processor exchanges load with a neighbour processor in each dimension at the time. In this paper we consider the application of accelerated techniques in order to increase the rate of convergence of the diffusive iterative load balancing algorithms. In particular, we compare the application of SemiIterative on the basic diffusion method combined with a minimum communication scheme. Keywords: Distributed load balancing, Diffusion algorithms, Synchronous distributed processor network, accelerated techniques.

1

Author for correspondence. Tel.: +(301) 7275100; Fax: +(301) 7275114; email: [email protected]

215

1. INTRODUCTION During the last decade a large number of atmospheric models have been parallelized resulting in a considerable reduction in computing time (see Proceedings of the ECMWF Workshops). Atmospheric models for climate and weather prediction use a three dimensional grid to simulate the behaviour of the atmosphere. The computations involved in such simulation models are of two types: "dynamics" and "physics". Dynamics computations simulate the fluid dynamics of the atmosphere and are carried out on the horizontal domain. Since these computations use explicit schemes to discretize the involved Partial Differential Equations they are inherently parallel. Alternatively, the physics computations simulate the physical processes such as clouds, moist convection, the planetary boundary layer and surface processes and are carried out on the vertical grid. The computations of a vertical column do not require any data from its neighbour columns and are implicit in nature (e.g. solution of a linear system). Thus, these computations are difficult to parallelize. From the above considerations it follows that domain decomposition techniques is best to be applied in the horizontal domain only. Domain decomposition techniques divide a computational grid domain into subdomains, each one to be assigned to a separate processor for carrying out the necessary computations for time integration. For grid point models the domain is subdivided into rectangular subdomains. As mentioned previously the column computations refer to the physical processes which can be subject to significant spatial and temporal variations in the computational load per grid point. As more sophisticated physics will be introduced in the future atmospheric models, these computational load imbalances will tend to govern the parallel performance. Furthermore, on a network of processors, the performance of each processor may differ. To achieve good performance on a parallel computer, it is essential to establish and maintain a balanced work load among the processors. To achieve the load balance, it is necessary to calculate the amount of load to be migrated from each processor to its neighbours. Then it is also necessary to migrate the load based on this calculation. In case of the atmospheric models the amount of work on each processor is proportional to the number of grid points on the processor. In [14] a load balancing scheme was proposed which consisted of an exchange of every other point in a latitude, decomposing that row in such a way that, after the exchange, each processor has almost the same number of day and night points. This and other similar

216

strategies for load balancing were studied in [9]. In [7] four grid partitioning techniques were studied in order to increase the parallel performance of an Air Quality Model (Fig. 1). Furthermore, in [1] a grid partitioning technique, similar to that of [7], where the mesh is like a brick wall and each subdomain is a rectangle, is studied for the RAMS model. When the computational load for each processor is the same, then [4] proves that the ratio communication over computation is minimized when the rectangles in Fig. la are squares.

Figure 1. Grid partitioning for load balancing All the above load balancing approaches fall into the category of direct algorithms where decisions for load balancing are made by a centralized manner requiring global communication. Alternatively, the nearest neighbour load balancing algorithms are a class of methods in which processors make decisions based on local information in a decentralized manner and manage workload migrations with their neighbour processors. By restricting the data movement to neighbouring processors, the processor graph will remain the same. This is important because, if there is no such restriction, then after a few dynamic load balancing steps all processors may

217

well share boundaries with each other. Finally, it is hoped that the algorithm will balance the load with minimal data movement between processors, because communication is expensive compared with computation. 1.1

THE LOAD BALANCING PROBLEM

We consider the following abstract distributed load balancing problem. We are given an arbitrary, undirected, connected graph G = (V,E) in which node V, e V contains a number w of current work load. The goal is to determine a schedule to move an amount of work load across edges so that finally, the weight on each node is (approximately) equal. Communication between non-adjacent nodes is not allowed. We assume that the situation is fixed, i.e. no load is generated or consumed during the balancing process, and the graph G does not change. This problem describes load balancing in synchronous distributed processor networks and parallel machines when we associate a node with a processor, an edge with a communication link of unbounded capacity between two processors, and the weight is infinitely divisible, independent tasks. It also models load balancing in parallel adaptive finite element/difference simulations where a domain, discretized using a grid, is partitioned into subdomains and the computation proceeds on elements/points in each subdomain independently. Here we associated node with a grid subdomain, an edge with the geometric adjacently between two subdomains, and the current work load with grid elements/points in each subdomain. As the computation proceeds the grid refines/coarsens depending on the physics computational load and the size of the subregions has to be balanced. Because elements/points have to reside in their geometric adjacency, they can only be moved between adjacent grid subdomains i.e. via edges of the graph (migration), by effectively shifting the boundaries to achieve a balanced load. Fig. 1 shows different shifting of the boundaries [7]. The quality of a balancing algorithm can be measured in terms of number of iterations it requires to reach a balanced state and in terms of the amount of load moved over the edge of the graph. Recently, diffusive algorithms gained some new attention [8,11]. The original algorithm described by Cybenko [5] and, independently, by Boillat [3] lacks in performance because of its very slow convergence to the balanced state. Most of the existing iterative dynamic load balancing algorithms [6,11,15,18] involve two steps:

218

• "flow" calculation: finding out the amount of load to be migrated between neighbouring processors, such that a uniform load distribution will be achieved when the migration is carried out to satisfy the "flow". • "migration": deciding which particular tasks are to be migrated, and migrating these tasks to the appropriate neighbouring processors. Diffusion type algorithms [3,5,16] are some of the most popular ones for flow calculations, although there are a number of other algorithms [10,11,19]. In practice, the diffusion iteration is used as preprocessing just to determine the balancing flow. The real movement of load is performed in a second phase [6,11,15,18]. In this paper we consider the application of Semi-Iterative techniques in order to increase the rate of convergence of the diffusive iterative load balancing algorithms. The paper is organised as follows. Section 2 presents some basic definitions and notations. Section 3 adapts Semi-Iterative techniques to diffusion load balancing algorithms. These techniques increase the rate of convergence of the basic iterative scheme by an order of magnitude. Section 4 compares theoreticaly the rate of convergence of the Diffusion method and its accelerated version for the mesh network graph. Section 5 presents the optimal communication algorithm for the diffusion method. Section 6 includes the numerical results and conclusions. 2. BASIC DEFINITIONS AND NOTATIONS Let G = (V,E) be a connected, undirected graph with IVI nodes and IEI edges. Let II(. G 1R be the load of node v; e V and « e i " be the vector of load values. Let n be the iteration index, n = 0,1,2,... and «f B) (l
219

where M, called the diffusion matrix, is given by T>}

if processors i and j are directly connected.

(M\ l
0

otherwise.

where T, are real parameters. With this formulation, the features of diffusive load balancing are fully captured by the iterative process governed by the diffusion matrix M . M is hence referred to as the characteristic matrix of the diffusion algorithm. It can be verified that M is a nonnegative, symmetric and doubly stochastic matrix. In light of these properties of M , the convergence of (1) was proved by Cybenko [5] and Boillat [3], simultaneously. Let ju. (1< j'
r

-=xk

and the corresponding minimum value of y(M) is given by

(4>

220

P(L)-l P(L) + l

7 o (A/):

(5)

where P{L)-.

(6)

K

which is the P-condition number of L. Note that if P(L)»1, then the rate of convergence of the DF method is given by 2 R(M) = -logY0(M) = (7) P{L) which implies that the rate of convergence of the DF method is a decreasing function of P(L). In the sequel we will express the optimum values of the parameters involved in each considered iterative scheme, using the second minimum and maximum eigenvalues &2,A>n, respectively of the Laplacian matrix of G. The determination of these eigenvalues, for the processor graphs we consider, is presented in [19]. 3. SEMI-ITERATIVE DIFFUSION METHOD (SI-DF) We now consider iteratives schemes for further accelerating the convergence of (1). It is known [8,17,20], that the convergence of (1) can be greatly accelerated if one uses the Semi-Iterative scheme

« (n+1) =p„ + . [pMuw+(i

- p)«< •>]+(i - Pn+l y-1'

(8)

with (9)

2-(p+a) Pi=l, Pi

(

jy

(10)

,n = 2,3,..

(11)

\ (i pn+i

where

=

2 I- T*P.

v

'

221

< x = - ^ - .

(12)

and where /z(. are the eigenvalues of M. The iterative scheme (8) is independent of p . Indeed, because of (3) a = l-T^,
(13) (14) d5)

with P(L)-1 P(L) + 1 ° respectively. Note that (15) is independent of p . To study the effectiveness of the DF method we write (1) in the form „<"> = P_(M)w(0) (17) where Pn{M) is a certain polynomial in M (which is related to Chebyshev polynomials). It can be shown [17,20] that1 5 />

M

2rn,2

(18>

( n ( ))=Y77

where

1 + Vl-cr 2

VTO + 1

In addition, for P ( L ) » 1 , we have r = l — ,

4

thus the asymptotic

VPW average rate of convergence for the Semi-Iterative DF (SI-DF) method is given by

*.{W»-±**r.-jL..

1

5(.) denotes the spectral radius.

(20)

222

From (7) and (20) the following relationship holds between the reciprocal rates of convergence2 * < . ( , . ( * , ) . « .

(21)

Therefore, the use of Semi-Iterative techniques results in an order of magnitude improvement in the reciprocal rate of convergence and in turn in the number of iterations.

4. THE MESH In this section we compare theoreticaly the rate of convergence of the DF method for the mesh network of processors. Our comparison is based on P(L) the P-condition number of the Laplacian matrix. Note that, because of (7), the rate of convergence of DF is invertialy proportional to P{L). The eigenvalues for the Laplacian matrix of a p1 x p2 mesh, where /?,, p2 are given by [19] Xj=2 V c o s f c ^ -cos . P2

where jx =0,1,2,...,/?,-! and j2 = 0 , 1 , 2 , . . . , p 2 - \ . Hence f

X,=2

, A =4 1-cos

1-cos

I

and (

1-cos

(nip-X)\\ 2{p-\f

i_

(

p

1-cos V

where p = max {/?,, p2}. In this case

IL*W) =

RR(.)-.

R(.)

(22) {P-V>2

223

Therefore, for large values of p, the rate of convergence of DF for the mesh topology is invertialy propotional to p .

5. THE DIFFUSION ALGORITHM In this section we present the DF algorithm which employes a minimum communication scheme [12,13]. 1. Initialization T0 = 2/(A ! + \ I ) , < 0 ) = 0 , u<0) = initial load, / e V . 2. While (Stopping criterion not satisfied) { for each node i e V , {

djM>=djk)+T,ri» for each j neighbour to / compute

} } N.ode Selection: Fc>r each node i e V { for each j neighbour to i compute w = x0(di-dj) if (w < 0) node j sends <w>to node i else node i sends <w> to node j } } Notice that this algorithm computes the load for each node as well as the load to be transferred. Fig. 2 illustrates the results of the Diffusion Algorithm applied to a mesh network.

224

0

©

0

0

0

0

0

1

1

©

0

100

100

100

100

0"^ °

0

0 -47 |

®-—0

©

0

0

-47

0 16

0-^7*0—TT*

°(P4-4-4 ®

©

©

©

a) Initial Load

b) Load Transfer 7

1*

0

17

l ,-v

0

18

1 7

^

18

^.

0

18

111

@—®——®-—0 id

17

l?|

18

171

is]

lsl

171

0 0

0

<£>

0 0

c) Load is balanced

0 0

-20

225

Figure 2. Results of the Diffusion Algorithms for the graph in (a). The initial load is indicated with the numbers near the circles (circled numbers indicate the numbering of the processors) in (a). Graph (b) shows the load transfer. Arrows indicate the direction of the load if the number is positive, otherwise the load must follow the opposite direction. Graph (c) shows the (balanced) loads after the load transfer. 6. NUMERICAL EXPERIMENTS - CONCLUSIONS In this section we describe our experimental simulations and state our conclusions. Our purpose is to compare the application of accelerated techniques. In particular, we consider the Semi-Iterative techniques applied to the DF method. For the 2D-Mesh graph we study the behaviour of our iterative scheme for a different number of processors ranging from 16 up to 16.384 (128x128). The initial work load is generated as a uniformly random distribution. The iterative schemes were compared by the number of iterations they required to converge to the same criterion. The convergence v\ 2 criterion was ^ ( w - " ' - ^ " ' ) <£, where e = 0.001. For all cases we used 1=1

the optimum values for the parameters involved. These values were obtained via the second minimum and maximum eigenvalues of the Laplacian matrix, which for our examples were derived in Section 4. In Section 5 we improve the classical DF algorithm by computing the load for each processor and the minimum load transfer simultaneously. Table 1 presents the number of iterations of DF and SI-DF methods to converge.

#of procs. DF SI-DF

4x4 46 23

8x8 190 36

16x16 636 71

2D-Mesh 32x32 3234 153

64x64 * 327

128x128 * 693

256x256 * *

Table 1. Number of iterations for the two algorithms, * indicates no convergence after 5-103 iterations.

226

From Table 1 we see that the number of iterations for DF and its accelerated counterpart SI-DF behaves approximately as p2 and p, respectively. Therefore, SI-DF improves the rate of convergence of DF by an order of magnitude.

Acknowledgement We would like to thank 1) the National and Kapodistrian University of Athens and 2) The General Secretary of Research and Industry for supporting our research. REFERENCES [1]

[2] [3] [4]

[5] [6]

[7]

Barros S.R.M., Towards the RAMS-FINER Parallel model: Load Balancing aspects, in Proceedings of the Eighth ECMWF Workshop on the use of Parallel Processors in Meteorology, Towards Teracomputing, eds. W. Zwieflhofer and R. Kreitz, ECMWF, Reading, UK, World Scientific, pp. 396-405, 1999. Biggs N., Algebraic Graph Theory, Cambridge University Press, Cambridge 1974. Boillat J.E., Load balancing and poisson equation in a graph, Concurrency: Practice and Experience 2 (1990), 289-313. Boukas L.A, Mimikou N.Th, Missirlis N.M and Kallos G., The RegionalWeather Forecasting system SKIRON: PARALLEL Implementation of the Eta model, Proceedings of the Eighth ECMWF Workshop on the use of Parallel Processors in Meteorology, Towards Teracomputing, eds. W. Zwieflhofer and R. Kreitz,ECMWF, Reading, UK, World Scientific, pp. 369-389, 1999. Cybenko G., Dynamic load balancing for distributed memory multi-processors, J. Parallel Distrib. Comp. 7 (1989), 279-301. Diekmann R., Frommer A., Monien B., Efficient schemes for nearest neighbour load balancing, Parallel Computing 25 (1999) 789-812. Elbern H., Load Balancing of a Comprehensive Air Quality Model, in Proceedings of the Seventh Workshop on the use of Parallel Processors in Meteorology, ECMWF workshop on Making its

227

[8]

[9]

[10] [11] [12] [13] [14]

[15]

[16]

[17] [18]

[19]

[20]

Mark, Hoffman, G. and Kreitz, N. eds., World Scientific, Singapore, 429-444, 1996. Evans D.J., Missirlis N.M., On the acceleration of the Preconditioned Simultaneous Displacement method, MACS, 23, No, 2, 191-198, 1981. Foster I. And Toonen B., Load Balancing Algorithms for the NCAR Community Climate Model, Technical Report ANL/MCS-TM-190, Argonne National Laboratory, Argonne, Illinois, 1994. Horton G., A multi-level diffusion method for dynamic load balancing, Parallel Computing 9 (1993), 209-218. Hu Y.F, Blake R.J., An improved diffusion algorithm for dynamic load balancing, Parallel Computing 25 (1999), 417-444. Karagiorgos G., Missirlis N.M., Accelerated Iterative Load Balancing Methods, (in preparation). Karagiorgos G., Missirlis N.M., Accelerated Diffusion Algorithms for Dynamic Load Balancing , (submitted 2000). Michalakes J. and Nanjundiah R.S., Computational Load in Model Physics of the Parallel NCAR Community Climate Model, Technical Report ANL/MCS-TM-186, Argonne National Laboratory, Argonne, Illinois, 1994. Schloegel K., Karypis G., Kumar V., Parallel multilevel diffusion schemes for repartitioning of adaptive meshes, in Proc.of the Europar 1997, Springer LNCS, 1997. Song J., Partially asynchronous and iterative algorithm for distributed load balancing, Parallel Computing 20 (1994), 853868. Varga R., Matrix iterative analysis, Prentice-Hall, Englewood Cliffs, NJ, 1962. Walshaw C , Cross M., Everett M., Dynamic load balancing for parallel adaptive unstructured meshes, in Proc.of the 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997. Xu C.Z. and Lau F.C.M., Load balancing in parallel computers: Theory and Practice, Kluwer Academic Publishers, Dordrecht, 1997. Young D.M., Iterative solution of large linear systems, Academic Press, New York, 1971.

228

HPC IN SWITZERLAND: NEW DEVELOPMENTS IN NUMERICAL WEATHER PREDICTION

M. BALLABIO, A. MANGILI, G. CORTI, D. MARIC Swiss Center for Scientific Computing (CSCS) 6928 Manno, Switzerland J-M. BETTEMS, E. ZALA, G. DE MORSIER, J. QUIBY Swiss Federal Office for Meteorology and Climatology (MeteoSwiss) 8044 Zurich, Switzerland

In the frame of the new Consortium for Small-scale Modelling (COSMO), the National Weather Service of Switzerland has started a close collaboration with the Swiss Center for Scientific Computing (CSCS). The new COSMO model, called the Local Model (LM), has been installed on the NEC SX-5 of the CSCS where it will run operationally and will be used as a tool for research. This article covers the most significant aspects of the LM porting and optimization on the NEC SX-5 supercomputer

1. Introduction Not only the general public but more and more professionals working in various fields are asking for more detailed weather forecasts in time and space. A new aspect is the increasing demand of forecast weather data for follow-up models. In Switzerland, we deliver today model data for follow-up models in the fields of: - hydrology: for the forecast of the water level of large rivers - air pollution: for the computation of trajectories and pollutant dispersion - civil aviation: for landing management - avalanche forecasting: for warnings. These requirements can only be satisfied with sophisticated, very high resolution meso-scale numerical weather prediction models. On the other hand, the operational constraint requires that the model results have to be available rapidly, otherwise they are useless. This last requirement can only be met if this type of models are computed on very powerful computers, on so-called supercomputers.

229

Figure 1: The integration domain of the new Local Model

2. The new model In the frame of COSMO (Consortium for Small-scale Modelling), the National Weather Services of Italy, Greece and Switzerland have developped under the lead of the National Weather Service of Germany a new non-hydrostatic meso-scale model called the Local Model (LM). This model has already been presented in this serie of proceedings (Schattler and Krenzien, 1997) and we ask the interested reader to refer to this article or to the scientific description of the model (G. Doms and U. Schattler, 1999). The model version installed in Switzerland presents today the following configuration: 385x325 points in the horizontal plane and 45 layers in the vertical. The corresponding integration domain is given in Figure 1. The set of the non-linear, coupled partial differential equations is solved by the socalled split-explicit method. The time step for the computation of the tendencies due to the terms responsible for the compression (or sound) waves is 10 seconds and the main time step is 40 seconds.

230

=

=

=

FDDI Ethernet ATM/STM-1

wooocscfrceco

•—"^^~

ATM/STM-4

Figure 2: The hardware chart of the Swiss Center for Scientific Computing

3. The CSCS The CSCS is the major computing centre of Switzerland for the needs of science and technology. It is used primarily by the universities and research centres, but also by governmental agencies and industry. The CSCS is located in Manno, in the italian speaking part of Switzerland. Its main computer is a NEC SX-5 parallel vector supercomputer. It has ten processors each clocked at 4 ns giving an aggregate peak performance of 80 Gflop/s and it has 64 GBytes of shared SDRAM memory with 12.6 GBytes/s I/O bandwidth. A general overview of CSCS hardware, including archiving and communication, is given in Figure 2.

231

4. The optimization of the LM code for the NEC SX-5 The porting and optimization work of the LM code has been performed on a grid size of 325x325x40 points, which was the initial computational domain. The optimization procedure has allowed to increase the total number of grid points without running out of the required elapsed time window of 90 minutes for the production runs. One aspect of the applied optimization strategy that has been particularly considered is the reduction of possible side-effects of code optimization for non-vector architectures. Thus, even though it is sometimes a big challenge to combine a good cross-platform code portability and highest possible sustained performances, our work was clearly focused to gain the highest efficiency improvement for vector-type computers by keeping on the other hand a minor impact for non vector systems. 4.1 General Optimization Guidelines The final goal of the optimization phase should be to deliver a minimally modified code, without any readability problem, still reflecting the original implementation scheme. In general, we tried to achieve the maximum code efficiency by minimizing the changes in the original code. As a matter of fact, simply by using optimal compiler switches and features, it has been possible to achieve interesting improvements, without changing any line of the source code. Improvements have been mostly achieved by making use of following techniques: - usage of special compiler features - reengineering of existing algorithms - inlining of frequently called and "cheap" routines - implementation of efficient gather/scatter mechanism - implementation of efficient search loops. For the code optimization and performance measurement process, it is much easier and preferable to rely on the usage of simple and easy-to-use basic tools rather than on existing and more sophisticated ones as, typically, GUI based analysis tools. By using basic code profiling utilities (i.e. -p option and UNIX prof command) and own-implemented timing and performance monitoring utilities, the whole code optimization process has been quite fast and successful.

232

Figure 3: The difference between the old (left panel) and the new (right panel) vertical schemes

4.2 Optimization of the vertical interpolation scheme The routines src_output.p_int and src_output.z_int are making use of the following interpolation scheme, essentially based on the interpolation routines utilities.tautsp and src_output.spline: DO j = jmin, jmax DO i = imin, imax ! setup needed values of pexp, fexp, ... for point (i,j) [ ... ] I interpolate over vertical levels CALL tautsp(pexp,fexp,kint,gamma,s,break,coef,nldim,ierr) CALL spline(break,coef,nldim,pfIs,fpfIs,namelist%kepin) ENDDO ! vertical interpolation ENDDO ! slicewise iteration With this approach (see Fig. 3, left) the total number of function calls is twice the number of domain grid points, i.e. 2 x (imax-imin+1) x (jmax-jmin+1), and it might lead to a significant performance degradation, in particular if the resident time in tautsp and spline is small. Between the two called routines, tautsp is the most time consuming one, and in particular it delivers a poor performance due to a non-vectorizable implementation. The major optimization step has been to implement a vectorizable tautsp (called tautsp2D) which could be able to "simultaneously" interpolate over multiple points of the grid.

233

The following solution, that doesn't require any significant increase in memory, makes use of the optimized implementation and works "slice-by-slice": DO j = jmin, jmax DO i = imin, imax !setup needed values for points (i,j) ENDDO ! perform vertical interpolation over slice j CALL tautsp2D(pexp_vec, fexp_vec, ..., imin, imax, 1, 1) CALL spline2D (break_vec, coef_vec, ..., j, j) ENDDO Tautsp2D offers also the possibility to work on a full 2D domain: DO j = jmin, jmax DO i = imin, imax !setup needed values for points (i,j) [...] ENDDO ENDDO ! perform vertical interpolation over whole domain CALL tautsp2D(pexp_vec, fexp_vec, ..., imin, imax, jmin, jmax) CALL spline2D (break_vec, coef_vec, ..., jmin, jmax) This solution which minimizes the total amount of routine calls is however not preferable since it requires much more resources in terms of work space requirements than the previous one. 4.3 Efficient gather/scatter of vector elements One of the most time consuming routine of LM (subroutine meteo_utilities.satad) uses several times the following structure: DO j = jlo, jup DO i = ilo, iup IF (zdqd (i, j) < 0.0_ireals ) THEN ztgO(i, j) = ... END IF ENDDO ENDDO

234

Since computations are done only for elements subject to a specific condition, a sequence of similar structures are generating cumulated overheads coming from the high number of masked operations that this code requires. On the SX-5 such operations are performed via masked instructions, which, despite being vectorizable, are anyway slower than non-masked ones. The proposed new scheme has an initialization phase where the indices i and j of the needed elements are stored in two work arrays: ! index compression, initialization nsat = 0 DO j = jlo, jup DO i = ilo, iup IF (zdqd (i, j) < 0.0_ireals ) THEN nsat = nsat + 1 wrk_i(nsat) = i wrk_j(nsat) = j ENDIF ENDDO ENDDO

Then computations are only done on the required elements: ! usage DO indx = 1, nsat i = wrk_i(indx) ! get the i-index j = wrk_j(indx) ! get the j-index ! no further changes in the code ztgO(i, j) = ... ENDDO

Similar techniques showed a quite significant improvement in both LM-forecast and LM-assimilation modes. Also, not less important, this scheme is very efficient even on non-vector type machines.

235

4.4 Optimized search loops Within the LM there is a typical usage and implementation of search loops inside a 2D domain based on the following structure: loopj: DO j = jstartpar, jendpar loopi: DO i = istartpar, iendpar IF ( (qc(i,j,k,nu) > 0.0) ) THEN lcpexist = .TRUE. EXIT loopj ! not vectorizable END IF ENDDO loopi ENDDO loopj

This solution is quite optimal for scalar processors, but on the other hand the EXIT statement in the inner do-loop is killing the performance and the vectorization in case of vector processors. The revised algorithm that is significantly improving the performance on vector processors is obtained by extracting the EXIT statement from the inner loop and using a partial sum approach in order to check the event occurrence: ic_ = 0 loopj: DO j = jstartpar, jendpar loopi: DO i = istartpar, iendpar IF ( (qc(i,j,k,nu) > 0.0) ) THEN ic_ = ic_ + 1 END IF ENDDO loopi ! vectorizable IF ( ic_ > 0 ) THEN lcpexist = .TRUE. EXIT loopj END IF ENDDO loopj

This solution might only marginally affect the performance on scalar processors, being the partial sum is anyway reasonably short. In some cases an even better performance on non-vector processors could be obtained.

236

4.5 Aggressive compiler options The following aggressive NEC SX-5 f90 compiler options have been used: - pvctl noverrchk: value range of arguments in the vectorized portion are not checked - pvctl novlchk: the loop-length checking codes are not generated. In the Table 1 the impact of these options is demonstrated; a 20% reduction of the CPU time used by these routines has been achieved. Table 1: Impact of compiler options Standard options

Aggressive options

Routine % CPU time

Seconds

% CPU time

Seconds

ep_ddpwr

8.9

42.20

7.2

31.97

ew_dsqrt

5.9

28.03

2.9

12.76

ew_dexp

4.1

19.26

3.3

14.63

src_radiation.inv_th

a

Total

4.1

19.57

6.2

27.53

23.0

109.06

19.6

86.89

a. in the aggressive case sqrt has been inlined.

These options are suggested only when the code correctness has been sufficiently tested. 4.6 Inlining The inlining has been done manually, but the selection of candidates was based on the diagnostics given by the profiling command (prof-p). In the Table 2 the impact of this technique is demonstrated; a 30% reduction of the CPU time used by the considered routines has been achieved. Table 2: Impact of routine inlining No inlining

With inlining

Routine % Time

Seconds

src_radiation.coe_th

2.9

16.59

src_radiation.inv_th

2.1

12.08

Total

5.0

28.67

% Time

Seconds

# Calls

1248000

0.2

0.94

31200

31200

3.4

18.90

31200

3.6

19.84

# Calls

237

5. Results The programme LM is the model itself. It can be used in two different modes: besides the forecast mode there is an assimilation mode used to create initial conditions for the forecast. In this later case the weather observations are used to modify the solution given by the model, by locally forcing its undisturbed solution towards the observations. This is the so-called "nudging" or "Newtonian relaxation" technique. An other important element of the operational suite has also been optimized, the programme GME2LM. This programme interpolates the GME data fields - GME is the name of the global model of the German Weather Service - onto the LM grid for the determination of the lateral boundary conditions. In the three following tables a summary of the achieved performance gain is displayed. Table 3: Performances of the programme GME2LM (domain 325x325x40, 1 hour interpolation) Number of CPUs

1

Elapsed time [sec]

Performance [Mflop/sec]

original

modified

original

modified

775

207

90

440 (+390%)

Table 4: Performances of the programme LM in assimilation mode (domain 325x325x40, 35 sec. time step, 1 hour assimilation) Number of CPUs

1

Elapsed time [sec]

Performance [Mflop/sec]

original

modified

original

modified

1612

772

733

1470 (+200%)

Table 5: Performances of the programme LM in forecast mode (domain 325x325x40,40 sec. time step, 4 hour forecast) Number of CPUs

Elapsed time [sec]

Performance [Mflop/sec]

original

modified

original

modified

1496

1572

2327 (+48%)

1

2285

2

1201

784

3005

4463 (+48%)

4

629

439

5731

8060 (+40%)

6

491

295

7435

11949 (+60%)

238

Recently, an additional improvement of 10% has been obtained by vectorization of the packing/unpacking of the GRIB code (I/O facility, fpackl and fcrackl routines). At the end of the pre-operational phase (end of March 2001), on a grid of 385x325x45 points, the LM is running at 19 Gflop/s on 8 CPUs; it takes about 50 minutes time to calculate a 48 hours forecast.

6. Conclusion This collaboration between, on one side, a National Weather Service which wants to deliver the best possible short-range weather forecast for the meso-scale and, on the other side, a centre of excellence for scientific computing very well equipped and organised for HPC and networking, has been extremely efficient. This collaboration also opens for a weather service quite exiting new possibilities regarding very high resolution numerical weather prediction, allowing explicit weather simulation in alpine valleys. This is specially important for an alpine country like Switzerland. For the CSCS, the installation and operation of a full NWP suite has been a very challenging task. The time constrains inherent to operational meteorology has posed difficult technical and managerial problems for the CSCS which have requested unconventional and clever solutions.

Acknowledgements It has been a pleasure to work with an extremely well written code like the LM. Our best compliments to our colleagues of the Research Department of the German Weather Service, specially to Ulrich Schattler, whose brilliant work has helped us quite a lot in significantly reducing the porting efforts. Thanks are also due to ECMWF colleagues, particularly to Geir Austad and John Hennessy for their help in the installation of several MARS modules and to Vesa Karhila for the porting of Metview. Finally this work has only been made possible thanks to the backing of the Directorate of MeteoSwiss who has supported and encouraged this CSCS/MeteoSwiss collaboration from the beginning and has, by doing so, promoting MeteoSwiss into the difficult but promising field of the very high resolution, non-hydrostatic numerical weather prediction.

239 References 1. U. Schattler and E. Krenzien Model development for Parallel Computers at DWD In Marking its Marks, Proceedings of the Seventh ECMWF Workshop on the Use of Parallel Processors in Meteorology. World Scientific, pp 83-100, 1997. 2. G. Doms and U. Schattler The Nonhydrostadc Limited Area Model LM (Lokal-Modell) of DWD Scientific Documentation, Part I. Deutscher Wetterdienst, 63004 Offenbach, Germany. December 1999.

240

THE ROLE OF ADVANCED COMPUTING IN FUTURE WEATHER PREDICTION ALEXANDER E. MACDONALD

NOAA Research, Forecast Systems Laboratory, Boulder, Colorado

80305

U.S.A.

ABSTRACT Advanced computing is one of several key technologies identified by Forecast Systems Laboratory as crucial for advancement of weather and climate prediction. Historically, while skill scores for 500-mb wind and heights have improved significantly, skill for precipitation prediction has improved much more slowly. A discussion of selected state-of-the-art mesoscale models is presented. It is argued that today's mesoscale models, which have much more sophisticated moisture physics and surface physics, will evolve to improve both shortrange precipitation prediction and regional climate prediction. Extrapolation of current trends in processing power will allow these improvements, with classical computers leading to Petaflop performance by the 2020s, and classical computers with molecular circuits reaching Exaflop speeds by the 2040s. Quantum computing may be applicable to weather and climate prediction by the middle of the century.

1

Introduction

The Forecast Systems Laboratory (FSL) is one of 12 research laboratories of the National Oceanic and Atmospheric Administration (NOAA). It has the mission of improving weather services by the transfer of science and technology into operations. FSL has identified four key technologies needed to improve weather prediction: observing systems, advanced computing, numerical modeling and assimilation, and information systems. In this paper the emphasis is on the future of advanced computing, and how these advances will be applied to two important problems, short-range weather prediction, especially precipitation and long-term (decadal to centennial) prediction of regional climate change. FSL's program for advanced computing is described by Govett et al. . The strong storm that affected Europe at the end of October 2000, shown in Figure 1, is an example of the importance of short-range weather prediction. Mesoscale weather, such as high winds and heavy precipitation, has great economic impacts. Although most of the energy of atmospheric systems is in the larger scales, this storm had high winds in a relatively small area (hundreds of kilometers). In fact, most hazardous weather occurs in similar sized areas, making prediction of small-scale intense events important. The great increase in computing required to

241

resolve the dynamics and physics of these small-scale phenomena will be a driver for increases in supercomputer speeds for the foreseeable future. Pacing the advance of numerical weather prediction (NWP) over the last half century has been the growth of supercomputing. Improved models have been the result of higher resolution, better model physics, and better assimilation systems. In recent years, the use of ensemble predictions has proved useful. Each of these aspects of NWP requires substantial increases in computation as more sophisticated capabilities are implemented. The combination of all of them has resulted in a slow but definite improvement of skill while the computations used

Figure. 1. Storm over Europe, 30 October 2000, as seen from the SEAWIFS satellite (Obtained from NASA).

242

have increased rapidly. In the next section, the current state of mesoscale weather prediction using gridded nonhydrostatic models will be discussed. It is likely that models of this type will play an important role in NWP over the coming decades. In section 3, a simple extrapolation for supercomputing speeds through the next 50 years is presented and discussed. Finally, the implications of the extrapolated growth rate in computing on the science of weather and climate prediction for the next 50 years are discussed. 2

Current Mesoscale Models

In this section three current mesoscale models are discussed: the NCAR/Penn State MM5 model2'3, the QNH model4,5'6, and the WRF model7. The discussion establishes how resolution and physics will cause associated computational requirements to evolve during the coming decades. It is assumed that the long term trend of development of more sophisticated physics (e.g. microphysics, turbulent parameterization, radiation etc.) and increasing resolution will result in improvements in the realism of weather and climate models, as it has in the past. This is not to say that such approaches are always the wisest use of computing resources - the best numerical prediction results from a trade-off of computing resources among assimilation, model resolution, sophistication of physics packages and multiple runs using different initial conditions (data ensembles) and models (model ensembles). FSL has extensive experience with the MM5 model. This model is widely used in the community, and has been used in many meteorological studies. It has a wide variety of physics packages, which can be "plugged" into the model by developers. For example, a microphysics package developed by Dudhia 3 has five moisture variables (vapor, rain, snow, cloud ice, and cloud water) and associated equations for phase change and advection. Packages such as this should be compared to the simpler moisture physics of 20 years ago that had only one or two microphysics parameters. It is easy to envisage future models that compute droplet size distributions explicitly, as well as other more sophisticated approaches. Similarly, radiation packages that are one dimensional and treat the entire shortwave spectrum with a single set of equations are evolving to systems that are three dimensional and treat many bands in the shortwave. Better radiation calculations in the longwave can be achieved by higher and higher spectral resolution. It is argued here that the increase of physics sophistication will continue for the foreseeable future. As an example of the role that increasing resolution can play, consider the prediction of a severe thunderstorm event run by FSL using MM5. The event was a line of severe thunderstorms that moved through southeast Nebraska during the night of 27 and 28 July 1996. Experience has shown that for horizontal model resolutions greater than a few kilometers the convection becomes unrealistic. In this case, MM5 was run at 3.3-km horizontal resolution for 18 hours, beginning at

243

15 UTC (mid morning) on 27 July. The model captured the growth and mature stage of the squall line, with strong surface winds (greater than 40 m/s) evident in both the model and in observational reports in late evening. Figure 2 shows a threedimensional view of the squall line as portrayed by the model nine hours into the forecast. The light colored volume depicts total mixing ratios of cloud water, cloud ice, or precipitation in excess of 0.2 g/kg. This was a good forecast both in the behavior of the model (the squall line had an appearance in radar and satellite similar to that found in the model) and in its detailed predictions (i.e., the strong winds). It should be noted that while this particular prediction was good, many similar model integrations have produced realistic looking storms that were quite different from what actually occurred.

PSL MM5 Forecast - 3.3 km dx nest Initialized: 15Z 27 July 1996 Forecast valid: 0OZ 28 July 1996

Figure 2. MM5 model run at 3.3-km resolution using FSL's initialization from the Local Analysis and Prediction System. The model forecast captured the evolution of a line of thunderstorms that (both in the model and in reality) generated high winds in southeastern Nebraska in late evening (about 06 UTC on 28 July).

244

The Quasi-Nonhydrostatic (QNH) model developed by the author and collaborators4,5,6 can be used to examine the future evolution of computing requirements for models that are explicit in three dimensions. This model is based on the approaches outlined by Browning and Kreiss 8 and is of interest here because it allows an easy extrapolation to determine computing requirements for high resolutions. QNH is a finite-difference model, with fourth-order differencing in space and time. The model is currently run at 20-km horizontal resolution, with 56 levels in the vertical. It was designed for parallelization by spatial decomposition in the horizontal, and vectorization in the vertical (since it uses explicit differencing in the vertical direction). The model was designed for high efficiency on parallel computers as discussed by Baillie et al. 9 . QNH runs on FSL's Jet computer, using 25 (Alpha 667) processors to make a forecast in 5% of real time. This is a sustained rate of calculations of 5 Gigaflops. An extrapolation from a regional to a global domain gives 2,000 processors needed for a sustained rate of 400 Gigaflops. Since the model scales almost linearly with the number of processors, such an increase would seem feasible. A global 5 km model, with the same vertical resolution and a time step one-quarter as long would require approximately 25 Teraflops of sustained computing. Extrapolation of current trends shows that top end supercomputers could be reaching these rates between 2010 and 2015. This particular extrapolation is unrealistic in that it assumes that the vertical resolution, physics, assimilation and ensembles associated with future models would remain unchanged. In the next section realistic assumptions are made about these factors to produce an estimate of future operational model resolutions. The third model that will be discussed briefly is the Weather Research and Forecast (WRF) model. Details of the model are presented elsewhere in this volume by Michalakes et al.10. This model is being designed for horizontal grids from 1 to 10 km. In particular, it should be useful for predictions run at resolutions of 1 or 2 km that explicitly predict convection. Its goal of becoming a community model should generate a growing number of physics packages, increasing in complexity with time. 3

Supercomputing Trends

Since the inception of numerical weather prediction in the 1950s, the computers that have been used for operational predictions have grown in processing power at approximately 40% per year. This increase is related to, but distinct from, the famous Moore's Law, that noted the doubling of the number of transistors in a microprocessor every 18 months. Moore himself described the growth of computing power at such rates as a self-fulfilling prophecy. Will the 40% growth rate of supercomputer speeds continue indefinitely? A reasonable argument can be made that fundamental limits on computing technology

245 may slow this pace. For example, as the size of circuitry decreases, the processes used to create electronic components bump up against limits such as the wavelength of light used in chip lithography. However, the 50 years of increases in computing speed were not due to extension of a single technology. As vacuum tubes proved too slow and unreliable in the 1960s, the technology changed to transistors. When individual transistors proved lacking, the technology of choice became integrated circuits of greater and greater complexity. More recently, when the rapid increase of speed of a single CPU became untenable, supercomputing moved to parallelism to continue the increases of total processing power. Here it is assumed that the established growth rate of supercomputing will continue at 40% per year, with suggestions of some of the technologies that may allow this increase through the coming half century.

Global Weather Models 10Exaflops

7,

1

Convection

1 Exaflop

6km

1

10 fetaflops

S3

lOOTerallops

Actu

Model Calct

lOOPBtaflops

12 km

1 Petaflop 25 km

50 km 10Teraflops

1 Teraflop 2000

1 100 km 2010

2020

Classical Computing Macroscopic Circuits

2030

2040

Classical Computing Molecular Circuits

2050

Quantum Computing?

Figure 3. Extrapolated growth of computing power at 40% per year, starting at 1 Teraflop in the year 2000. The speed of a supercomputer available at a large center is shown on the left ordinate. The resolution of an ensemble of global models is indicated on the right ordinate. Possible technologies associated with the computers are shown on the abscissa.

246

Figure 3 presents an extrapolation of supercomputing speeds, starting at 1 Teraflop (peak speed rather than sustained) and assuming a growth rate of 40% per year for the next 50 years. As shown on the left axis, this brings peak speeds of about 1 Petaflop by the year 2020, and 1 Exaflop by about 2040. In this calculation I took a conservative approach (consistent with the last several decades) concerning the application of additional computing resources to weather prediction. Specifically, it was assumed that each doubling of horizontal resolution will be accompanied by an increase in vertical resolution and more computationally intensive assimilation (e.g. four-dimensional variational analysis), as well as the running of ensembles. Furthermore, associated models such as those for oceans, land surface and ice surfaces will become more sophisticated and have many more levels during the coming decades. Based on recent history in the United States, it is estimated that a doubling of model resolution requires a 30-fold increase in computations. As seen on the right axis, this shows a rather conservative increase in model resolution, 25 km by 2020, and global resolutions of a few kilometers by 2045. If this projection is reasonable, it means that our current mesoscale models are already being tested at resolutions that we will see in global models in the coming decades. This could be important in the design of future composite observing systems, which uses simulations to determine the effects of proposed future observational subsystems, because the models that will be operational are similar to those being used in the simulations. Along the abscissa of Figure 3 are technologies that supercomputers may use during the coming half century. Current supercomputers use macroscopic circuits; a memory cell, even for the extremely small dimensions on the drawing board, has a very large number of electrons associated with a given state. For the next couple of decades improvements in the use of macroscopic circuits and increasing use of parallelism should allow Moore's Law and the increase of supercomputer speed to continue. However, there is already work on the use of individual molecules as electronic devices. For example, DiVentra et al.11'12 discuss the use of benzene molecules as the main component of a circuit that could lead to a 10,000-fold increase in the density of transistors in a given space. These nanodevices could be used to build computers that operate on classical principles, as distinct from quantum computers. Quantum computers are profoundly different from classical computers13. They use strange concepts of quantum mechanics such as superposition and entanglement. Their use in weather and climate prediction is probably decades away, if ever. Only a limited number of algorithms have been identified as feasible on a quantum computer (e.g. factoring prime numbers), but in these cases it may be possible to solve problems on a quantum computer that could not be solved on a classical computer in the age of the universe. It is only speculation, but the nature of quantum computing may make it the ultimate ensemble technique. More

247

specifically, the state of the global atmosphere may be represented on a set of qubits (the quantum computer analogue of the classical computer bit) and operated on to predict a very large ensemble of future atmospheric states. Ideas such as this may be feasible by mid-century. 4

Applications of Supercomputing in Atmospheric Science

Though the speed of computers has increased very rapidly in the past 50 years, the improvement of weather predictions, as measured objectively, has improved rather slowly. Some measures of skill, such as for the prediction of 25 mm of precipitation in a 24-hour period, have improved very little in 40 years. If the extrapolation of computing power shown in Figure 3 is correct, what might we expect in the way of improved weather and climate prediction during the next 50 years? It is reasonable to divide expected improvements in weather prediction into two time and space domains, global and regional. For this discussion, take the regional scale to be on the order of a domain 10,000 km on a side, or 100 million square kilometers. The longest prediction time scale for regional models is about three days, and for global models about two weeks. Improvements for both regional and global predictions will require better model physics formulations, faster computers, and better observations. The most rapid progress will be possible if significant improvements are made in all three areas. As discussed below, it is likely that the observational data input to the models will improve significantly in the coming decades, enhancing the utility of the rapid growth in available computing power discussed in section 3. A number of exciting new observational technologies are emerging for global weather prediction. A new generation of polar orbiting satellites is planned, with improved microwave and other sensors. The COSMIC14 system, consisting of 6 small satellites using Global Positioning System occultation to retrieve soundings of temperature and moisture, will be tested in 2004 and 2005. This system should deliver 3,000 soundings a day, evenly distributed over the globe. A satellite lidar system is being studied that could deliver thousands of wind observations each day. The global fleet of commercial aircraft is being used more and more effectively to deliver observations of wind, temperature and (recently) moisture at flight level and during take-off and landing18. A new generation of global in situ sensors, consisting of balloons and autonomous aircraft, is under discussion.15 In summary, all the ingredients needed to improve global models will be available in the coming decades. The potential for improved observing also exists for regional models. New generations of geostationary satellite sounders will use infrared interferometers to measure upwelling radiation in thousands of spectral channels. These promise effective vertical resolutions of about 1-km (in clear air), a big improvement over today's sounders that deliver 2- to 3-km resolution. Ground-based wind and

248

temperature profilers have been demonstrated for many years, and are ready for operational use on a continental scale. Recently, the possibility of using slant delays from a network of surface stations to GPS satellites has been discussed17. Such a system could conceivably deliver accurate and detailed three-dimensional moisture fields every hour on a continental scale. In the United States, the number of commercial aircraft observations has been steadily increasing, with more soundings and the addition of humidity sensors18 planned. Thus, the incredible computing resources needed for regional models running at resolutions of one or two kilometers can be complemented by observations of important fields with small-scale variability, especially water vapor. In summary, the steady 40% increase in supercomputer speeds should be complemented by better observational systems, better assimilation systems, and better numerical models. This will result in improved global models, and should open an era when short range (e.g. to 48 hours) skill in prediction of precipitation could increase as much as 25 % per decade for the next several decades. This would also imply that the accuracy of short-range prediction of severe weather such as wind and heavy snowstorms should improve commensurately. Finally, the fabulous increase in supercomputing is already being applied to understanding the great problem of the 21 st century - the modification of the global atmosphere and ocean by the anthropogenic load of carbon dioxide and other greenhouse gases. Today's long-term climate models, which are used to run global integrations for periods of hundreds of years, are necessarily low in resolution and limited in the sophistication of their physical parameterizations. Current climate models have much in common with the weather models of several decades ago. Similarly, existing mesoscale weather models should be similar to the climate models of 20 to 30 years from now. This suggests that the investment of computing resources to improve weather models should lead the way to the climate models needed to understand and predict scenarios of 21 st century anthropogenic change. Very difficult regional climate issues, such as the potential melting of the Arctic ice cap and the prospect of much drier summers in the grain belts of North America and Eurasia, can only be addressed by sophisticated and computationally intensive models including better moisture parameterizations and better surface processes. Increases in supercomputing are thus central to the understanding needed for sustainable habitation of planet earth. 5

References 1.

M. W. Govett, D. S. Schaffer, T. Henderson, L. B. Hart, J. P. Edwards, C. S. Lior, and T.-L. Lee, SMS: A Directive-Based Parallelization Approach for Shared Distributed Memory High Performance Computers, (this volume).

249 2.

G. A. Grell, J. Dudhia, and D. R. Stauffer, A description of the fifthgeneration Penn State/NCAR Mesoscale Model (MM5). NCAR Tech. Note, NCAR.TN-398+STR, 122 pp, 1994. [Available from NCAR Publications, P.O. Box 3000, Boulder, CO 80307.]

3.

J. Dudhia, A non-hydrostatic version of the Penn State/NCAR Mesoscale Model: Validation tests and simulation of an Atlantic cyclone cold front, Mon. Wea. Rev., 121, 1493-1513 (1993).

4.

A. E. MacDonald, J. L. Lee, and S. Sun, QNH: Design and Test of a Quasi-Nonhydrostatic Model for Mesoscale Weather Prediction, Mon. Wea. Rev., 128,1016-1036 (1999).

5.

A. E. MacDonald, J. L. Lee, and Y. Xie, The Use of Quasi-Nonhydrostatic Models for Mesoscale Weather prediction, J. Atmos. Sci., 57, 2493-2517 (1999).

6.

J. L. Lee, and A. E. MacDonald, QNH: Mesoscale Bounded Derivative Initialization and Winter Storm Test over Complex Terrain, Mon. Wea. Rev.,128, 1037-1051 (1999).

7.

J. Michalakes, J. Dudhia, D. Gill, J. Klemp, and W. Skamarock, Design of a next-generation regional weather research and forecast model: Towards Terecomputing, World Scientific, 117-124(1998).

8.

G. L. Browning, and H. O. Kreiss, Scaling and computation of smooth atmospheric motions, Teluis, 38A, 295-313 (1986).

9.

C. F. Baillie, A. E. MacDonald, and J. L. Lee, Eurosim '96 - HPCN Challenges in Telecomp and Telecom, Parallel Simulation of Complex Systems and Large-Scale Applications. QNH: a numerical weather prediction model developed for massively parallel processors, 399-405, eds. L. Dekker, W. Smit, and J. C. Zuidervaart, (Elsevier North-Holland, 1996).

10. J. Michalakes, S. Chen, J. Dudhia, L. Hart, J. Klemp, J. Middlecoff, and W. Skamarock, Development Of A Next-Generation Regional WEATHER Research And Forecast Model, National Science Foundation Cooperative Agreement, ATM-9732665. 11. M. Di Ventra, S. T. Pantelides, and N. D. Lang, First-principles calculation of transport properties of a modecular device, Phys. Rev. Letters, 84, 979982 (2000).

250

12. M. Di Ventra, S. T. Pantelides, and N. D. Lang, The benzene molecule as a molecular resonant-tunneling transistor, Applied Physics Letters, 76, 34483450, (2000). 13. G. J. Milburn, The Feynman Processor — Quantum Entanglement and the Computing Revolution, ed. P. Davis (Perseus Books, 1998). 14. L. C. Lee, C. Rocken, R. Kursinski, eds. COSMIC - Applications of Constellation Observing System for Meteorology, Ionosphere & Climate, (Springer, 2001). 15. C. M. I. R. Girz, and A. E. MacDonald, Global Air-ocean In-situ System (GAINS), Proceedings 14th ESA Symposium on European Rocket and Balloon Programs and Related Research, European Space Agency (1999). 16. T. L. Smith, and S. G. Benjamin, Impact of network profiler data on a 3-h data assimilation system, Bull. Amer. Meteor. Soc, 74, 801-807 (1993). 17. A. E. MacDonald, and Y. Xie, On the use of slant observations from GPS to diagnose three-dimensional water vapor using 3DVAR, Fourth Symposium on Integrated Observing Systems, Long Beach, CA, American Meteorological Society, 62-73 (2000). 18. R. J. Fleming, Water vapor measurements from commercial aircraft: Progress and plans, Fourth Symposium On Intergrated Observing Systems, Amer. Meteor. Soc, 30-33 (2000).

251

THE SCALABLE MODELING SYSTEM: A HIGH-LEVEL ALTERNATIVE TO MPI M. GOVETT, J. MIDDLECOFF, L. HART, T. HENDERSON*, AND D.SCHAFFER* NOAA/OAR/Forecast Systems Laboratory 325 Broadway, Boulder, Colorado 80305-3328 USA Email: [email protected]

*[In collaboration with the Cooperative Institute for Research in the Atmosphere, Colorado State University, Ft. Collins, Colorado 80523 USA] A directive-based parallelization tool called the Scalable Modeling System (SMS) is described. The user inserts directives in the form of comments into existing Fortran code. SMS translates the code and directives into a parallel version that runs efficiently on both shared and distributed memory high-performance computing platforms. SMS provides tools to support partial parallelization and debugging that significantly decreases code parallelization time. The performance of an SMS parallelized version of the Eta model is compared to the operational version running at the National Centers for Environmental Prediction (NCEP).

1 Introduction Both hardware and software of high-end supercomputers have evolved significantly in the last decade. Computers quickly become obsolete; typically a new generation is introduced every two to four years. New systems utilize the latest advancements in computer architecture and hardware technology. Massively Parallel Processing (MPP) computers now comprise a wide range and class of systems including fully distributed systems, fully shared memory systems called Symmetric MultiProcessors (SMPs) containing up to 256 or more CPU's, and a new class of hybrid systems that connect multiple SMPs using some form of high speed network. Commodity-based systems have emerged as an attractive alternative to proprietary systems due to their superior price performance and to the increasing adoption of hardware and software standards by the industry. Programming on these diverse systems offer many performance benefits and programming challenges. The primary mission of the National Oceanic and Atmospheric Administration's (NOAA's) Forecast Systems Laboratory (FSL) is to transfer atmospheric science technologies to operational agencies within NOAA, such as the National Weather Service, and to others outside the agency. Recognizing the importance of MPP technologies, FSL has been using these systems to run weather and climate models since 1990. In 1992 FSL used a 208 node Intel Paragon to produce weather

252 forecasts in real-time using a 60km version of the Rapid Update Cycle (RUC) model. This was the first time anyone had produced operational forecasts in real time using a MPP class system. Since then, FSL has parallelized several weather and ocean models including the Global Forecast System (GFS) and the Typhoon Forecast System (TFS) for the Central Weather Bureau in Taiwan [15], the Rutgers University Regional Ocean Modeling System (ROMS) [8], the National Centers for Environmental Prediction (NCEP) 32 km Eta model [17], the high resolution limited area Quasi Non-hydrostatic (QNH) model [16], and FSL's 40 km Rapid Update Cycle (RUC) model currently running operationally at NCEP [2]. Central to FSL's success with MPPs has been the development of the Scalable Modeling System (SMS). SMS is directive-based parallelization tool that translates Fortran code into a parallel version that runs efficiently on both shared and distributed memory systems. SMS was designed to reduce the effort and time required to parallelize models targeted for MPPs, provide good performance, and allow models to be ported between systems without code change. Further, directive-based SMS parallelization requires no changes to the original serial code. The rest of this paper describes SMS in more detail. Section 2 introduces several approaches to code parallelization, followed by an overview of SMS in Section 3. Section 4 describes the flexibility and simplicity of code parallelization using SMS and explains how this tool has significantly decreased code parallelization time. Section 5 describes several performance optimizations available in SMS and compares the performance of NCEP's operational Eta code with the SMS parallelized Eta. Finally, Section 6 concludes and highlights some additional work that is planned.

2 Approaches to Parallelization In the past decade, several distinct approaches have been used to parallelize serial codes. Directive-based Micro-tasking - This approach was used by companies such as Cray and SGI to support loop level shared memory parallelization. A standard for such a set of directives called OpenMP, has recently become accepted in the community. OpenMP can be used to quickly produce parallel code, with minimal impact on the serial version. However, OpenMP does not work for distributed memory architectures. Message Passing Libraries - Message-passing libraries such as Message Passing Interface (MPI), represents an approach suitable for shared or distributed memory architectures. Although the scalability of parallel codes using these libraries can be quite good, the MPI libraries are relatively low-level and can require the modeler to

253 expend a significant amount of effort to parallelize their code. Further, the resulting code may differ substantially from the original serial version; code restructuring is often desirable or necessary. One notable example of this strategy is the Weather Research and Forecast (WRF) model which was designed to limit the impact of parallelization and parallel code maintenance by confining MPI-based communications calls into a minimal set of model routines called the mediation layer [20]. Parallelizing Compilers - These solutions offer the ability to automatically produce a parallel code that is portable to shared and distributed memory machines. The compiler does the dependence analysis and offers the user directives and/or language extensions that reduce the development time and the impact on the serial code. The most notable example of a parallelizing compiler is High Performance Fortran (HPF). In some cases the resulting parallel code is quite efficient [23], but there are also deficiencies in this approach. Compilers are often forced to make conservative assumptions about data dependence relationships, which impact performance [13]. In addition, weak compiler implementations by some vendors result in widely varying performance across systems [4, 21]. Interactive Parallelization Tools - One interactive parallelization tool, called the Parallelization Agent, automates the tedious and time-consuming tasks while requiring the user to provide the high-level algorithmic details [14]. Another tool, called the Computer-Aided Parallelization Tool (CAPTools), attempts a comprehensive dependence analysis [13]. This tool is highly interactive, querying the user for both high level information (decomposition strategy) and lower level details such as loop dependencies and ranges that variables can take. While these tools offers the possibility of a quality parallel solution in a fraction of the time required to analyze dependencies and generate code by hand, Umitations exist in their ability to offer efficient code parallelization of NWP codes that contain more advanced features (e.g. nesting, spectral transformations, and Fortran 90 constructs). Library-Based Tools - Library-based tools, such as the Runtime System Library (RSL) [18] and FSL's Nearest Neighbor Tool (NNT) [22], are built on top of the lower level libraries and serve to relieve the programmer of handling many of the details of message passing programming. Performance optimizations can be added to these libraries that target specific machine architectures. Unlike computer-aided parallelization tools such as CAPTools, however, the user is still required to do all dependence analysis by hand. In simplifying the parallel code, these high level libraries also reduce the impact to the original serial version. Parallelization is still time consuming and invasive, since code must be inserted by hand and multiple versions must be maintained. Source translation tools have been developed to help modify these codes

254 automatically. One such tool, the Fortran Loop and Index Converter (FLIC), generates calls to the RSL library based on command line arguments that identify decomposed arrays and loops needing transformations [19]. While useful, this tool has limited capabilities. For example, it was not designed to handle multiple data decompositions, interprocessor communications, or nested models. Another tool of this type, and the topic of this paper, is a directive-based source translation tool that is a new addition to SMS called the Parallel Pre-Processor (PPP). The programmer inserts the directives (as comments) directly into the Fortran serial code. PPP then translates the directives and serial code into a parallel version that runs on shared and distributed memory machines. Since the programmer adds only comments to the code, there is no impact to the serial version. Further, SMS hides enough of the details of parallelism to significantly reduce the coding and testing time compared to an MPI-based solution.

3 Overview of SMS SMS consists of two layers built on top of the Message Passing Interface (MPI) software. The highest layer is a component called the PPP, which is a Fortran code analysis and translation tool built using the Eli compiler construction software [7]. PPP analysis ensures consistency between the serial code and the user-inserted SMS parallelization directives. After analysis, PPP translates the directives and serial code into a parallel version of the code. In addition to loop translations, array re-declarations, and other code modifications, the parallel version contains PPP generated calls to SMS library-based routines in the Nearest Neighbor Tool (NNT), Scalable Spectral Tool (SST) and Scalable Runtime System (SRS) shown in the Figure 1. NNT is a set of high-level library routines that address parallel coding issues such as data decomposition, halo region updates and loop translations [22]. SRS provides support for input and output of decomposed data [9]. SST is a set of library routines that support parallelization of spectral atmospheric models. These libraries rely on MPI routines to implement the lowest layered functionality required.

255

THE SCALABLE MODELING SYSTEM (SMS)

Figure 1. Functional diagram of the layers of SMS that are built on top of MPI

Early versions of SMS did not contain the highest level PPP layer. Instead, model parallelization was accomplished by inserting NNT, SST and SRS library calls directly into the parallel code. While a number of models were successfully parallelized using this method, the serial and parallel versions of the code were distinctly different and had to be maintained separately [3,1,10]. Conversely, directive-based parallelization permits the modeler to maintain a single source code capable of running on a serial or parallel system. Modelers are able to test new ideas on their desktop, yet can easily generate parallel code using PPP when faster runs on an MPP are desired. Figure 2 illustrates code parallelization using SMS directives and PPP to generate the parallel code.

256

Code Parallelization using SMS

?;\ >r*> n»

vSerial Executable

4S allel icle

Parallel • ^ Executable

Figure 2. SMS directives are added to the original serial code during code parallelization. The SMS serial code can then be run serially as before, or parallelized using PPP to generate an MPPready parallel code.

To simplify the user's interface to parallelization, the number of directives available in the SMS toolkit is minimized. Currently 20 SMS directives are available to handle parallel operations including data decomposition, communication, local and global address translation, I/O, spectral transformations and nesting [5]. Further, when PPP translates the code into its parallel form, it changes only those lines of code that must be modified; the rest of the serial code including comments and white space remain untouched. Another advantage of this approach is that directives serve to abstract the lower level details of parallelization that are required to accomplish complicated operations including interprocess communication, process synchronization, and parallel I/O. An illustration of an SMS abstraction is the use of a high-level data structure, called a decomposition handle, which defines a template that describes how data will be distributed among the processors. Two SMS directives are

257 required to declare and initialize the user-specified data decomposition structure {csms$declare_decomp and csms$create_decomp). A layout directive (csms$distribute) is then used to associate arrays with this data decomposition. Once data layout has been defined, the user does not need to be concerned with how data are distributed to the processors or how data will be communicated - SMS handles these low-level details automatically. SMS retains all information necessary to access, communicate, input and output decomposed and nondecomposed arrays through the use of the user-specified decomposition handle. For example, to update the halo (ghost) region of arrays x and y between neighboring processors, the user is only required to insert csms$exchange( x, y ) into the serial code at the appropriate place. SMS automatically generates code to store information about each variable to be exchanged (global sizes, halo thickness, decomposition type, data type), and then perform the communications necessary to update the halo points of each process. Using the information contained in the decomposition handle, SMS determines how much of the halo region each process must be exchanged, where the information must go, and where it should be stored. Process synchronization is also handled by SMS for these communication operations. Using this encapsulation strategy other communication operations, including reductions (csms$reduce), transferring data between decompositions (csms$transfer), and the gather and scatter of decomposed data (csms$serial) between global and decomposed arrays are easily handled at the directive level. Further, input and output of data to or from disk require no SMS directives or any special treatment by the user. Figure 3 shows an example of an SMS program in which the decomposition handle my_dh is declared (line 3) and then referenced by directive (csms$distribute: lines 5, 9) to associate the first array dimension with the first dimension of the decomposition for the arrays x and y. Once the data layout has been specified via directive, SMS handles all the details required for halo updates (csms$exchange: line 19), reductions (csms$reduce: line 27), and I/O operations (no directives required). Once SMS understands how arrays are decomposed, parallelization becomes primarily an issue of where in the code the user wishes to perform communications and not how data will be moved to accomplish these operations. The user is still required to determine by dependence analysis where communication is required in

258

their code, but a single directive is generally all that is required once this information is known. Further information about the use of SMS directives is available in the SMS User's Guide [11].

Code with SMS Directives 1: 2: 3:

p r o g r a m DYNAMIC_MEMDRY_EXAMPLE p a r a m e t e r ( I M = 15) CSMS$nEJCIARE_DBCOMP ( n y „ d h >

5: 6: 7: 8: 9:

CSMS$DIS13lIBUrE(nv_c3h / 1 ) real, allocatable real, allocatable real xsum C£MS$niSTOIBOTE END

10 :

CSHS$CREAXE_DECCMP

(ny_dh,

BEGIN :: x ( : ) :: y(:)

,

<2>)

11 12 13 14

: : : :

15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

:

CSMH$PARALLEL(ny_dh,

: : :

d o 1 0 0 i = 3 , 13 y(i) = x(i) - x(i-l) - x(i+l) 100 continue CSMS$EXCHRNGE ( y ) d o 200 i = 3 , 13 x(i) = y(i) + y(i-l) + y(i+l) 200 continue xsum = 0.0 d o 300 i = 1, 15 xsum = xsum + x ( i ) 300 continue CSMS$REDUCE(xsura, SUM) CSHS$PARALLEL END p r i n t *,'xsum = ',xsum end

: : : : : : :

: :

allocate(x(im)) allocate(y(im)) o p e n {10, f i l e = ' x _ i n . d a t ' , r e a d (10) x
form='unformatted')

BEGIN - x(i-2)

-

x(i+2)

+ y{i-2)

+

y(i+2)

Figure 3. SMS directives are used to map sub-sections of the arrays x and y to the decomposition given by "my_dh". Each process executes on its portion of these decomposed arrays in the parallel region given by csms$parallel.

Alternatively, when an operation such as a halo update is done with MPI, either each variable is exchanged separately, or in some cases, multiple arrays can be exchanged at the same time using an MPI-derived type or common block. In addition, the programmer must determine its neighbors and decide if communication is required. While not a difficult operation, it can be a tedious and time-consuming endeavor. One example of this complexity can be found in the Eta

259 code where a key communications routine containing over 100 lines of code was replaced with a single exchange directive during SMS parallelization. 4 Code Parallelization using SMS The parallelization of codes targeted for MPPs can be a difficult and timeconsuming process. The objective in developing SMS was to design a tool that is easy to learn and use, and to provide support for operations that simplify and speedup code parallelization. This section highlights some of the features of SMS that have been developed to achieve these goals. 4.1

Code Generation and Run-Time Options

SMS control over the generation and execution of code can be divided into three areas: parallelization directives, command line options, and run-time environment variables. SMS directives, discussed in Section 3, are the most obvious way to control when, where and how code parallelization should be done. SMS also provides the user with command line options to modify code translation. User access to parallel code generation using PPP is provided through a script that runs a series of executables to transform the serial code. Several command line options are available in this script that affect parallel code generation including type promotion (eg. — r8), retain translated code as comments, and a verbose level to warn of inconsistencies encountered during translation. Users can also control the run-time behavior of SMS parallel code using environment variables. Environment variables are used to control when sections of PPP- translated user code will be executed. For example, conditional execution of generated code is used to verify the correctness of a parallelization where global sums are required, and for debugging purposes. This allows users to debug and verify parallelization without requiring that code be re-generated after correctness of results is established (discussed below). Environment variables are also used to control the run-time behavior of SMS to: configure the layout of processors to the problem domain, designate the number of processors used to output decomposed arrays to disk, determine the type of input/output files that will be read/written (MPI-l/O, Native I/O, parallel file output, etc.), and tune model performance.

260 4.2

Advanced Parallelization Support

There are three phases to any code parallelization effort: code analysis, code parallelization, and debugging. Code analysis generally involves finding data dependencies that exist in the code, and based on this information, determining a data decomposition and parallelization strategy. SMS does not currently offer user support for code analysis; however, plans to provide this capability will be discussed in Section 6. Code Parallelization - SMS provides support for simplifying code parallelization. Recognizing that code parallelization becomes simpler to test and debug when it can be done in a step-wise fashion, the user can insert directives to control when sections of code will be executed serially (csms$serial). Serial regions are implemented by gathering all decomposed arrays, executing the code segment on a single node, then scattering the results back to each processors sub-region as illustrated in Figure 4. In this example, the routine n o t _ p a r a l l e l executes on a single node referencing global arrays that have been gathered by the appropriate SMS routines. While the extra communications required to implement gather or scatter operations will slow performance, this directive permits users to test the correctness of parallelization during intermediate steps. Once assured of correct results, the user can remove these serial regions and further parallelize their code. This directive has also been useful in handling sections of code where no SMS support for parallelization is currently available such as NetCDF I/O. Further, if adequate performance is attained, some sections of code can be left unparallelized. Debugging - Once SMS directives have been added to the serial source, the parallel code must be run to verify the correctness of parallelization. To ensure correctness, output files should be examined to verify that the results are exactly the same for serial and parallel runs of the code. Since summation is not associative, reductions may not lead to exactly the same results on different numbers of processors. To alleviate this inconsistency, SMS provides a bit-wise exact reduction capability which performs exactly the same arithmetic operations that would be executed in the serial program. This capability is particularly useful when the reduction variables feed back into model fields that are output or compared. Bit-wise exact results also permit the user to verify results exactly against the serial version and ensure the accuracy and correctness of the parallelization effort. Building on the bit-wise exact reduction capability, two SMS directives have been developed to support debugging that have significantly streamlined model parallelization, reduced debugging time, and simplified code maintenance. The first directive, csms$check_halo, permits the user to verify that halo region values are up to date. Using this directive, the halo region values of each user-specified array is

261 compared with their corresponding interior points on the neighboring process. If these values differ, SMS will output the differences and exit. This information helps determine where an exchange or halo update may be required to ensure correctness.

Incremental Parallelization <^S y?9& "local"

"global"

CALL NOT^PARALLEL(...)

w

"global"

SMS Directive:

"local"

CSMS$SERIAL

Figure 4. An illustration of SMS support for incremental parallelization. Prior to execution of the serial region of code, decomposed arrays are gathered into global arrays, referenced by the serial section of code, and then results are scattered back out to the processors at the end of the serial region.

The second debug directive, csms$compare_var, provides the ability to compare array values for model runs using different numbers of processors. For example, the programmer can specify a comparison of the array "x", for a single processor run and for multiple processors by inserting the directive: csms$compare_var ( x ) in the code and then entering appropriate command line arguments to request concurrent execution of the code. The command:

smsRun

1 mycode

2 mycode

262 will run concurrent images of the executable mycode for 1 and 2 processors. Wherever csms$compare_var directives appear in the code, user-specified arrays will be compared. If differences are found SMS will display the name of the variable (x for example), the array location (e.g. the i, j , k index) and the corresponding values from each run, and then terminate execution. The ability to compare intermediate model values anywhere in the code has proven to be a powerful debugging tool during code parallelization. The effort required to debug and test a recent code parallelization was reduced from an estimated eight weeks down to two simply because the programmer did not have to spend inordinate amounts of time determining where the parallelization mistakes were made. Additionally, this directive has proven to be a useful way to ensure that model upgrades continue to produce the correct results. For example, after making changes to serial code the modeler executes the debug sections of code (generated by csms$compare_var), controlled through a command line option, in order to verify that the intermediate results are still correct. By allowing the programmer to test parallelization in this way, code maintenance becomes much simpler for everyone.

5 Performance and Portability As stated in the introduction, SMS has been used to successfully parallelize a number of mesoscale and global forecast models. These models have demonstrated good performance and scaling on a variety of computing platforms including IBM SP, Intel Paragon, Cray T3E, SGI Origin, and Alpha-Linux clusters. This section details some of the portability and performance optimizations available with SMS and then highlights some results of a recent comparison for the operational Eta model. 5.7 SMS Optimizations Model performance can vary significantly depending on the hardware and architecture of the target system and the run-time characteristics of the code. Architectural differences affecting performance include processor speed, the access times and size of each type of memory (register, cache, main memory), bandwidth of the communication pathways, and speed of peripherals such as disks [12]. Issues that affect model performance include the compiler implementation, size and frequency of I/O operations, frequency and type of interprocessor communications, and data locality. SMS has been designed so that models can be ported between

263 systems without code change, to both run efficiently across shared and distributed memory systems and to provide options that tune the model for the best performance. Portability has become increasingly important both because high-end computer system hardware changes frequently and because codes are often shared between researchers who run their models on different systems. To ensure portability across shared and distributed memory systems, SMS assumes that memory is distributed; no processor can address memory belonging to another processor. Despite the assumption that memory is distributed, the performance on shared memory architectures is good due to efficient implementations of MPI on these systems. Also, when an SMS parallelized model runs successfully on one system, it can easily be ported and run on another computing platform. For example, it took only two hours to port the ROMS model, parallelized for Alpha-Linux, to an SGI Origin system. SMS provides several techniques to optimize models for high performance. One is to make architecture-specific optimizations in the lower layer of SMS. During a recent FSL procurement, one vendor replaced the MPI implementation of key SMS routines with the vendor's native communications package to improve performance. Since these changes were made at a lower layer of SMS, no changes to the model codes were necessary. SMS also supports other performance optimizations of interprocessor communications including array aggregation and halo region computations. Array aggregation permits multiple model variables to be combined into a single communications call to reduce message-passing latency. SMS also allows the user, via directive, to perform computations in the halo region in order to reduce communication. Further details regarding these communication optimizations are discussed in the SMS Users Guide [11] and overview paper [6]. Performance optimizations have also been built into SMS I/O operations. By default, all I/O is handled by a single processor. Input data are read by this node and then scattered to the other processors. Similarly, decomposed output data are gathered by a single process and then written asynchronously. Since atmospheric models typically output forecasts several times during a model run, these operations can significantly affect the overall performance and should be done efficiently. To improve performance, several options can be specified at run-time via environment variable. One option, illustrated in Figure 5, allows the user to dedicate multiple output processors to gather and output these data asynchronously. This allows compute operations to continue at the same time data are written to

264

disk. The use of multiple output processors has been shown to improve model performance by up to 25% [ 10].

Computational

Cache

Figure 5. An illustration of SMS output when cache processes and a server process are used. SMS output operations pass data from the computational domain to the cache processes. Data are re-ordered on the cache processes before being passed through the server process to disk.

Another output option allows the user to specify that no gathering of decomposed arrays be done; instead each processor writes out its section of the arrays to disk in separate files. This option allows users to take advantage of high-performance parallel I/O available on some systems including the IBM SP2. After output cycles are complete, post-processing routines can be run as a separate operation to reassemble the array fragments. 5.2 Eta Model Parallelization As a high-level software tool, SMS requires extra computations to maintain data structures that encapsulate low-level MPI functionality that could lead to potential performance degradation. While a number of performance studies have been done using SMS in recent years, no study has been done to measure the cost of the SMS overhead. To measure this impact, a performance comparison was done between the hand-coded MPI based version of the Eta model running operationally at NCEP,

265 and the same Eta model parallelized using SMS. The MPI Eta model was considered a good candidate for fair comparison since it is an operational model and has been optimized for high performance on the IBM SP2. Performance optimizations of NCEP's Eta model include the use of IBM's parallel I/O capability which offers fast asynchronous output of intermediate results during the course of a model run. To accomplish parallelization, the MPI Eta code was reverse engineered to return the code to its original serial form. This code was then parallelized using SMS. Code changes included restoring the original global loop bounds found in the serial code, removing MPI-based communications routines, and restoring array declarations. Fewer than 200 directives were added to the 19,000 line Eta model during SMS parallelization. To ensure correctness of parallelization, generated output files were bit-wise exact compared for both serial and parallel runs. Table 1: Eta model performance for MPI-Eta and SMS-Eta run on NCEP's IBM SP-2. Times are for a two-hour model run.

Processors 24 32 48 88

Time 78 59 45 27

Speedup 1.00 1.32 1.73 2.88

Efficiency 1.00 0.99 0.87 0.79

After parallelization was complete, performance studies were done to compare SMS Eta to the handed-coded MPI Eta. In these tests, identical run-times were measured on 88 processors of NCEP's IBM SP2 for a two hour model run. Further tests on FSL's Alpha Linux cluster, shown in Table 1, illustrate good performance and scaling. Additional analysis of these results is planned and will be forthcoming. However, these results demonstrate that SMS can be used to speed and simplify parallelization, improve code readability, and allow the user to maintain a single source, without incurring significant performance overhead. 6 Conclusion and Future Work A directive-based approach to parallelization (SMS) has been developed that can be used for both shared and distributed memory platforms. This method provides general, high level, comment-based directives that allow complete retention of the serial code. The code is portable to a variety of hardware platforms. This parallelization approach can be used to develop portable parallel code on multiple platforms and achieve good performance.

266 As we continue to parallelize more atmospheric and ocean models additional features are being added to SMS to enhance its usefulness. Parallelization of these models for MPPs has driven the development of SMS for the last ten years. Based on this experience, we have developed a tool that significantly decreases the time required to parallelize models. Further, we offer a simple, flexible user interface, provide tools that permit partial parallelization, simplify debugging and can verify the correctness of the model results exactly. In addition, our experience in working with a variety of computing platforms has allowed us to develop a tool that provides flexible high-performance portable solutions that are competitive with hand-coded vendor specific solutions. We have also demonstrated in the parallelization of NCEP's Eta model that the SMS solution performs as well as the MPI based operational version of the code. 6.1 Future Work SMS currently supports the analysis and translation of Fortran 77 with added support for some commonly used Fortran 90 constructs such as allocatable arrays, limited module support, and array syntax. However, full support is planned for all of the Fortran 90 language including array sections, derived types, and modules. Another upgrade will enable the PPP translator to generate OpenMP code. Further, for state-the-of-art machines that consist of clusters of SMPs, a parallel code that implements tasking "within the box" using OpenMP and message passing "between the boxes" using MPI may be optimal. The PPP translator could be designed to generate both message passing and micro tasking parallel code. We would also like to reduce the dependence analysis and code modification time (insertion of directives) required to parallelize a model. Development has begun on a tool, called mitogen, to analyze the user code and automatically insert SMS directives into the serial code. A typical model (20-30K source lines) parallelized using SMS requires the insertion of about 200 directives into the code. Autogen could automatically generate the two most common SMS directives (csms$parallel and csms$distribute) that account for roughly half of the directives users must add to the serial code. As the analysis capabilities of this tool grow, we expect to further reduce the number of directives that must be inserted by the user. However, one limitation of autogen is that it does not provide interprocedural analysis of the code. Therefore, we would like to combine SMS code translation capabilities with a semi-automatic dependence analysis tool. This tool would automatically insert SMS directives into the serial code, from which a parallel version could be generated using PPP in order to further simplify parallelization.

267 References 1. Baillie, C , MacDonald A.E. and Lee J.L., QNH: A numerical Weather Prediction Model developed for MPPs. International Conference HPCN Challenges in Telecomp and Telecom: Parallel Simulation of Complex Systems and Large Scale Applications. Delft, The Netherlands (1996). 2. Benjamin, S., Brown J., Brundage K., Kim D., Schwartz B., Smirnova T., and Smith T., The Operational RUC-2, 16th Conference on Weather Analysis and Forecasting, AMS, Phoenix, (1998) pp.249-252. 3. Edwards, J., Snook J., and Christidis Z., Forecasting for the 1996 Summer Olympic Games with the NNT-RAMS Parallel Model, 13th International Information and Interactive Systems for Meteorology, Oceanography and Hydrology, Long Beach, CA., American Meteorological Society, (1997) pp.1921. 4. Frumkin, M., Jin H., and Yan J., Implementation of NAS Parallel Benchmarks in High Performance FORTRAN, NAS Technical Report NAS-98-009, NASA Ames Research Center, Moffett Field, CA (1998). 5. Govett, M., Edwards J., Hart L., Henderson T., and Schaffer T., SMS Reference Manual, http://www-ad.fsl.noaa.gov/ac/SMS ReferenceGuide.pdf (1999). 6. Govett, M., Edwards J., Hart L., Henderson T., and Schaffer D., SMS: A Directive-Based Parallelization Approach for Shared and Distributed Memory High Performance Computers, http://wwvvad.fsl.noaa.gov/ac/SMS Overview.pdf (2001). 7. Gray, R., Heuring, V., Levi, S., Sloane, A., and Waite W., Eli, A Flexible Compiler Construction System., Communications of the ACM 35 (1992) pp.121-131. 8. Haidvogel, D.B., Arango H.G., Hedstrom K., Beckman A., Malanotte-Rizzoli P., and Shchepetkin A.F., Model Evaluation Experiments in the North Atlantic Basin: Simulations in Nonlinear Terrain-Following Coordinates, Dyn. Atmos. Oceans 32 (2000) pp.239-281. 9. Hart, L., Henderson T., and Rodriguez B., An MPI Based Scalable Runtime System: I/O Support for a Grid Library, http://wwwad.fsl.noaa.gov/ac/hartLocal/io.html (1995). 10. Henderson, T., Baillie C , Benjamin S., Black T., Bleck R., Carr G., Hart L., Govett M., Marroquin A., Middlecoff J., and Rodriguez B., Progress Toward Demonstrating Operational Capability of Massively Parallel Processors at Forecast Systems Laboratory, Proceedings of the Sixth ECMWF Workshop on the Use of Parallel Processors in Meteorology, European Centre for Medium Range Weather Forecasts, Reading, England (1994). 11. Henderson, T, Schaffer D., Govett M., and Hart L., SMS User's Guide, http://www-ad.fsl.noaa.gov/ac/SMS UsersGuide.pdf (2001).

268 12. Hwang, K., Advanced Computer Architecture: Parallelism, Scalability, and Programmability, McGraw Hill, Inc, (1993) pp.157-256. 13. Ierotheou, C.S., Johnson S.P., Cross M., and Leggett P.F., Computer aided parallelization tools (CAPTools) - Conceptual Overview and Performance on the Parallelization of Structured Mesh Codes, Parallel Computing 22 (1996) pp.163-195. 14. Kothari, S., and Kim Y., Parallel Agent for Atmospheric Models, Proceedings of the Symposium on Regional Weather Prediction on Parallel Computing Environments, (1997) pp.287-294. 15. Liou, C.S., Chen J., Terng C , Wang R, Fong C , Rosmond T., Kuo H., Shiao C , and Cheng M., The Second-Generation Global Forecast System at the Central Weather Bureau in Taiwan, Weather and Forecasting 12, pp.653-663 (1997). 16. MacDonald, A.E., Lee J.L., and Xie Y., QNH: Design and Test of a Quasi Nonhydrostatic Model for Mesoscale Weather Prediction. Monthly Weather Review 128 (2000) pp.1016-1036. 17. Mesinger, F., The Eta Regional Model and its Performance at the U.S. National Centers for Environmental Prediction. International Workshop on Limited-area and Variable Resolution Models, Beijing, China, 23-28 October 1995; WMO, Geneva, PWPR Rep. Ser. No. 7 WMO/TD 699 (1995) pp.42-51. 18. Michalakes, J., RSL: A Parallel Runtime System Library for Regular Grid Finite Difference Models using Multiple Nests, Tech. Rep. ANL/MCS-TM-197, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois (1994). 19. Michalakes, J., FLIC: A Translator for Same-Source Parallel Implementation of Regular Grid Applications, Tech. Rep. ANUMCS-TM-223, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois (1997). 20. Michalakes, J., Dudhia J., Gill D., Klemp J. and Shamarock W., Design of a Next Generation Regional Weather Research and Forecast Model, Proceedings of the Eighth ECMWF Workshop on the Use of Parallel Processors in Meteorology, European Centre for Medium Range Weather Forecasts, Reading, England (1998). 21. Ngo, T., Snyder L., and Chamberlain B., Portable Performance of Data Parallel Languages, Supercomputing 97 Conference, San Jose, CA (1997). 22. Rodriguez, B., Hart L., and Henderson T., Parallelizing Operation Weather Forecast Models for Portable and Fast Execution, Journal of Parallel and Distributed Computing 37 (1996) pp. 159-170. 23. High Performance Fortran Applications (HPFA): A collection of applications developed and maintained by the Northeast Parallel Architectures Center (NPAC) under the auspices of the Center for Research in Parallel Computation (CRPC) http://www.npac.s\T.edu/hpfa/bibl.html(1999).

269

DEVELOPMENT OF A NEXT-GENERATION REGIONAL WEATHER RESEARCH AND FORECAST MODEL* J. MICHALAKES, 2 S. CHEN, 1 J. DUDHIA, 1 L. HART, 3 J. KLEMP, 1 J. MIDDLECOFF, 3 W. SKAMAROCK 1 'Mesoscale and Microscale Meteorology Division, National Center for Atmospheric Research, Boulder, Colorado 80307 U.S.A. 2

Mathematics and Computer Science Division, Argonne National Laboratory, Illinois 60439 U.S.A., [email protected], +1 303 497-8199

Chicago,

3

NOAA Forecast Systems Laboratory, Boulder, Colorado 80303 U.S.A.

The Weather Research and Forecast (WRF) project is a multi-institutional effort to develop an advanced mesoscale forecast and data assimilation system that is accurate, efficient, and scalable across a range of scales and over a host of computer platforms. The first release, WRF 1.0, was November 30, 2000, with operational deployment targeted for the 2004-05 time frame. This paper provides an overview of the project and current status of the WRF development effort in the areas of numerics and physics, software and data architecture, and single-source parallelism and performance portability.

1

Introduction

The WRF project is developing a next-generation mesoscale forecast model and assimilation system that will advance both the understanding and the prediction of mesoscale precipitation systems and will promote closer ties between the research and operational forecasting communities. The model will incorporate advanced numerics and data assimilation techniques, a multiple relocatable nesting capability, and improved physics, particularly for treatment of convection and mesoscale precipitation. It is intended for a wide range of applications, from idealized research to operational forecasting, with priority emphasis on horizontal grids of 1-10 kilometers. A prototype

This work was supported under National Science Foundation Cooperative Agreement ATM-9732665.

270

has been released and is being supported as a community model.1 Based on its merits, it will be a candidate to replace existing forecast models such as the PSU7NCAR Mesoscale Model (MM5), the ETA modal at NCEP, and the RUC system at FSL. The first release of the model, WRF 1.0, was November 30, 2000. This paper reports on progress since our first ECMWF workshop paper on the WRF design two years ago [3]. Section 2 provides an overview of the WRF project and model, Section 3 the software design and implementation, and Section 4 preliminary performance results for the WRF 1.0 code. 2

WRF Overview

A large number of organizations are participating in the WRF project. The principal organizations are the Mesoscale and Microscale Meteorology Division of the National Center for Atmospheric Research (NCAR/MMM), the National Centers for Environmental Prediction (NOAA/NCEP), the Forecast Systems Laboratory (NOAA/FSL), the University of Oklahoma Center for the Analysis and Prediction of Storms (CAPS), and the U.S. Air Force Weather Agency (AFWA). Additional participants include the Geophysical Fluid Dynamics Laboratory (NOAA/GFDL), the National Severe Storms Laboratory (NOAA/NSSL), the Atmospheric Sciences Division of the NASA Goddard Space Flight Center, the U.S. Naval Research Laboratory Marine Meteorology Division, the U.S. Environmental Protection Conservative variables Inviscid, 2-D equations in Cartesian coordinates

U = pu,

du

D

V = pv,

ee

_,

— + yRx fir = dt dx dW _ d® + yRn hep = dt dz

due

ee

W = pw,

& = p0

duu dwu dx dUw

dz dWw

dx

dz

me

+

dt Pressure terms directly related to &.

+ = pQ dx dz dp dU dW . — H + = 0 dt dx dz

yRnV®

= CP®VTV

=Vp

Figure 1 Flux form equations in height coordinates.

1

http://www.wrf-model.org

271

Hydrostatic pressure coordinate: Conservative variables:

U = [iu, dU dt

Inviscid, 2-D equations without rotation:

Tj = {it-Kt)l fi, W' = fiw,

dp dx SW

d<j>

0 = /id,

Q = fit)

dp d
dUu dx

dClu drj

dp\

dUw

CQH>

(

dt

It

fi-ns-nt

y dtj) dx d@ dUO dC10 + = fjQ + dt dx drj du dU XI _C + = 0 + dt dx d
dr;

(Re)' {Paa)

Figure 2 Flux form equations in mass coordinates.

Agency Atmospheric Modeling Division, and a larger number of university researchers. The project is organized under a WRF Oversight Board, who appoint and work with a WRF Coordinator (Klemp) and a WRF Science Board. Five WRF development teams — Numerics and Software, Data Assimilation, Analysis and Validation, Community Involvement, and Operational Implementation — are further divided into a number of working groups, which include Dynamic Model Numerics; Software Architecture, Standards, and Implementation; Analysis and Visualization; and Data Handling and Archiving. The WRF development timeline is in two main phases: full implementation as a research model, to be completed by the end of 2002 (with the exception of 4DVAR, which is planned for 2003), and then full implementation as an operational forecasting system at NCEP and AFWA, to be largely completed by the end of 2004 but with additional effort for 4DVAR implementation and diagnosis and operational performance refinement stretching into the 2005-08 time frame. The WRF design allows for multiple dynamical cores. The dynamical core in this first release uses Eulerian finite-differencing to integrate the fully compressible nonhydrostatic equations in height coordinates (Figure 1) in scalar-conserving flux form using a time-split small step for acoustic modes. A mass-coordinate prototype is also being implemented (Figure 2). Large time steps utilize a third-order Runge-Kutta technique, and second- to sixth-order advection operators can be specified. The horizontal staggering is an Arakawa-C grid. Future releases may include an implicit option to time splitting for the Eulerian dynamical cores. A semi-implicit semiLagrangian prototype is also under development at NCEP [2,5]; this will use an Arakawa-A grid and allow WRF and the new global model under development at NCEP to operate as unified models.

272

WRF physics is intended to be plug compatible [1] to allow easy exchange of physics packages with other models. In actual fact, the physics packages that are included in this first release of WRF are from other modeling systems, adapted to operate within the WRF software framework. At least one representative package for each of the physical processes necessary for real-data mesoscale forecasts are included: • • • • •

Longwave radiation: RRTM Shortwave radiation: NASA/GSFC, MM5 (Dudhia) Cumulus: Kain-Fritsch, Betts-Miller-Janjic Explicit microphysics: Kessler, Lin et al, NCEP 3-class (Hong) PBL: MRF, MM5 (Slab)

Work is continuing to add additional parameterizations ~ notably in the area of land surface processes, for which a new working group is being added to the WRF development effort. Parameterizations and packages of concern for atmospheric chemistry and regional climate are also being considered. The WRF 1.0 distribution includes idealized cases for testing: two-dimensional hill case, baroclinic wave, squall line, and supercell thunderstorm. Model initialization for real data is handled by the WRF Standard Initialization (SI) package, developed largely at NOAA/FSL. The SI package is also distributed with the model. WRF is packageindependent by design; datasets are currently stored in NetCDF format, and other formats such as GriB and HDF will be supported. Effort is under way within the WRF collaboration to design and implement a three-dimensional variational assimilation (3DVAR) system to initialize WRF in the 2001-02 timeframe; this will be followed by a full four-dimensional variational (4DVAR) system in the 2004-05 time frame. The current release of WRF 1.0 supports a single domain. The WRF model will support nesting, with 2-way interacting nests that are dynamically instantiable and that may move and overlap. Control of nesting will be via a priori scripting. Eventually, adaptive mesh refinement — that is, nesting based on evolving conditions within a running simulation — will be supported. 3

Software Architecture

The WRF prototype employs a layered software architecture that promotes modularity, portability, and software reuse. Information hiding and abstraction are employed so that parallelism, data management, and other issues are dealt with at specific levels of the hierarchy and transparent to other layers. The WRF software architecture (Figure 3) consists of three distinct model layers: a solver layer that is usually written by scientists, a driver layer that is responsible for allocating and deallocating space and controlling the integration sequence and I/O, and a mediation layer that communicates between the driver and model. In this manner, the user code is isolated from the concerns of parallelism.

273

Driver

Driver Layer

Mediation Layer

Config Inquiry

Solve

DM comm OMP u

Model Layer

Config Module

I/O API

WRF Tile-callable Subroutines

or, op " Z in • •Ji vt

Package Independent

Data formats Package Parallel I/O Dependent

21 — v_

External Packages Figure 3 WRF software architecture schematic The WRF prototype uses modern programming language features that have been standardized in Fortran90: modules, derived data-types, dynamic memory allocation, recursion, long variable names, and free format. Array syntax is avoided for performance reasons. A central object of a nested model is a domain, represented as an instance of a Fortran90 derived data type. The memory for fields within a domain are sized and allocated dynamically, facilitating run-time resizing of domain data structures to accommodate load balancing and dynamic nesting schemes. External packages, such as MPI, OpenMP, single-sided message passing, and higher-level libraries for parallelization, data formatting, and I/O, will be employed as required by platform or application; however, these are considered external to the design, and their interfaces are being encapsulated within WRF-specific/packageindependent Application Program Interfaces (API). For example, details of parallel I/O and data formatting are encapsulated within a WRF I/O API. A detailed specification of the WRF I/O API is distributed with the model so that implementers at particular sites can use an existing implementation or develop their own to accommodate site-specific requirements. A flexible approach for parallelism is achieved through a two-level decomposition in which the model domain may be subdivided into patches that are assigned to distributed memory nodes, and then may be further subdivided into tiles that are allocated to shared-memory processors within a node. This approach addresses all current models for parallelism — single-processor, shared-memory, distributed-memory, and hybrid — and also provides adaptivity with respect to processor type: tiles may be

274

sized and shaped for cache-blocking or to preserve maximum vector length. Model layer subroutines are required to be tile -callable, that is, callable for an arbitrarily sized and shaped subdomain. All data must be passed through the argument list (state data) or defined locally within the subroutine. No COMMON or USE-associated state data is allowed. Domain, memory, and run dimensions for the subroutine are passed separately and unambiguously as the last eighteen integer arguments. Thus, the WRF software architecture and two-level decomposition strategy provides a flexible, modular approach to performance portability across a range of different platforms, as well as promoting software reuse. It will facilitate use of other framework packages at the WRF driver layer as well as the reverse, the integration of other models at the model layer within the WRF framework. A related project is under way at NCEP to adapt the nonhydrostatic Eta model to this framework as a proof of concept and to gauge the effectiveness of the design. As with any large code development project, software management in the WRF project is a concern. For source code revision control, the project is relying on the CVS package. In addition, a rudimentary computer-aided software engineering (CASE) tool called a Registry has been developed for use within the WRF project. The Registry comprises a database of tables pertaining to the state data structures in WRF and their associated attributes: type, dimensionality, number of time levels, staggering, interprocessor communication (for distributed memory parallelism), association with specific physics packages, and attributes pertaining to their inclusion within WRF initial, restart, and history data sets. The Registry, currently implemented as a text file and Perl scripts, is invoked at compile time to automatically generate interfaces between the driver and model layers in the software architecture, calls to allocate state fields within the derived data type representing WRF domains, communicators for various halo exchanges used in the different dynamical cores, and calls to the WRF I/O API for initial, restart, and history I/O. The Registry mechanism has proved invaluable in the WRF software development to date, by allowing developers to add or modify state information for the model by modifying a single file, the database of Registry tables. 4

Preliminary Results

Initial testing of the WRF prototype has focused on idealized cases such as simple flow over terrain, channel models, squall lines, and idealized supercell simulations. Subsequent testing has involved real data cases and comparison with output from existing models such as MM5. An automated real-time forecasting system similar to the MM5 real-time system at NCAR1 is under construction and will provide a means for real-time testing and verification of the WRF system over extended periods.

' http://rain.mmm.ucar.edu

275

400 350 0 300

1 250 |

200

3 150 .§ 100 10

50 0

0

1

1

1

1

1

1

1

1

1

8

16

24

32

40

48

56

64

72

processors Figure 4 WRF performance compared with MM5 Benchmarking to evaluate and improve computational performance is also under way. The model has run in shared-memory, distributed-memory, and hybrid parallel modes on the IBM SP, Compaq ES40, Alpha and PC Linux clusters, and other systems. WRF is neutral with respect to packages and will use a number of different communication layers for message passing and multithreading. The current prototype uses MPI message passing through the RSL library [4] and OpenMP. Typically, straight distributed-memory parallelism (multiple patches with one tile per patch) has been the fastest, most scalable mode of operation to date, suggesting that additional effort may be required to realize the full potential from the sharedmemory implementations of WRF. Using straight distributed-memory parallelism and benchmarking an idealized baroclinic wave case for which a floating-point operations count is known (measured by using the Perfex tool on the SGI Origin), WRF ran at 467 Mflop/s on 4 processors and 6,032 Mflop/s on 64 processors of the IBM SP Winterhawk-II system at NCAR.1 This is approximately 81 percent efficient relative to 4 processors (6032/467/16). On an SGI Origin2000 with 400 MHz processors, performance was 2497 Mflop/s on 16 processors and 8,914 Mflop/s on 64 processors, or 89 percent efficient relative to 16 processors. A comparison of WRF performance with performance of an existing model, MM5, is shown in Figure 4. This shows both MM5 and WRF running a standard MM5 benchmark case, the AFWA T3A (Europe) scenario2: 36 km resolution on a http://www.scd.ucar.edu/main/computers.html 2

http://www.mmm.ucar.edu/mm5/mpp/helpdesk

276

136 by 112 by 33 grid. The WRF is using comparable to or more sophisticated physical options than the MM5. A comparable operations count for WRF on this scenario is not known, so instead time-to-solution is compared. In terms of timeper-timestep, the WRF simulation is considerably more costly, however, the 2 timelevel Runge-Kutta solver in WRF allows for a considerably longer time step: 200 seconds versus 81 seconds for MM5. Thus, the time-to-solution performance for WRF is slightly better than MM5 and should improve with tuning and optimization. 5

Summary

With the release of WRF 1.0, an important milestone has been reached in the effort to develop an advanced mesoscale forecast and data-assimilation system designed to provide good performance over a range of diverse parallel computer architectures. Ongoing work involves design and implementation of nesting, expanding physics and dynamical options, development of a parallel 3DVAR system based on the same WRF software architecture, performance optimization, and testing and verification over a range of applications including research and operational forecasting. References 1. Kalnay E., M. Kanamitsu, J. Pfaendtner, J. Sela, M. Suarez, J. Stackpole, J. Tuccillo, L. Umscheid, and D. Williamson: "Rules for interchange of physical parameterization," Bull. Amer. Meteor. Soc. 70 (1989) 620-622. 2. Leslie, L. M., and R. J. Purser: "Three-dimensional mass-conserving semiLagrangian scheme employing forward trajectories," Mon. Wea. Rev., 123 (1995)2551-2566. 3. Michalakes, J., J. Dudhia, D. Gill, J. Klemp, and W. Skamarock: "Design of a next-generation regional weather research and forecast model," Towards Teracomputing, World Scientific, River Edge, New Jersey (1999), pp. 117-124. 4. Michalakes, J.: RSL: A parallel runtime system library for regional atmospheric models with nesting, in Structured Adaptive Mesh Refinement (SAMR) Grid Methods, IMA Volumes in Mathematics and Its Applications (117), Springer, New York, 2000, pp. 59-74. 5. Purser, R. J., and L. M. Leslie: "An efficient semi-Lagrangian scheme using third-order semi-implicit time integration and forward trajectories." Mon. Wea. Rev. 122 (1994) 745-756.

277 PARALLEL NUMERICAL KERNELS FOR CLIMATE MODELS

V. BALAJI SGI/GFDL Princeton University PO Box 308, Princeton NJ 08542, USA Climate models solve the initial value problem of integrating forward in time the state of the components of the planetary climate system. The underlying dynamics is the solution of the non-linear Navier-Stokes equation on a sphere. While the dynamics itself is the same for a wide variety of problems, resolutions and lengths of integration vary over several orders of magnitude of time and space scales. Efficient integration for different problems require different representations of the basic numerical kernels, which may also be a function of the underlying computer architecture on which the simulations are done. Modern languages such as Fortran 90 and C + + offer the possibility of abstract representations of the the basic dynamical operators. These abstractions offer a large measure of flexibility in the dynamical operator code, without requiring large-scale rewriting for different problem sizes and architectures. The cost of this abstraction is a function of the maturity of the compiler as well as the language design.

1

Introduction

Numerical climate and weather models today operate over a wide range of time and space scales. Current resolutions of weather forecasting models approach 50 km for global models, and 10 km for mesoscale models. Ocean models range from 0(100) km for climatic studies to the eddy-resolving models in coastal basins. Non-hydrostatic atmospheric models at kilometre-scale resolutions are used in small domains to study cloud and severe storm dynamics, and in much larger domains spanning 0(1000) km to study the processes underlying large-scale convectively-driven systems such as mid-latitude cyclones and the ITCZ. Depending on the problem under consideration, we may choose to use spectral or grid-point methods; hydrostatic or non-hydrostatic primitive equations; a variety of physical processes may be resolved by the numerics, or remain unresolved (parameterized); the research may be focused on understanding chaotic low-order dynamics of a simplified coupled system, or on a comprehensive accounting of all contributing climate system components; and in all of these, the underlying dynamics remains the solution of the same non-linear Navier-Stokes equation applied to the complex fluids that constitute the planetary climate system. In climate research, with the increased emphasis on detailed representation of individual physical processes governing the climate, the construction of a model has come to require large teams working in concert, with individual

278

sub-groups each specializing in a different component of the climate system, such as the ocean circulation, the biosphere, land hydrology, radiative transfer and chemistry, and so on. The development of model code now requires teams to be able to contribute components to an overall coupled system, with no single kernel of researchers mastering the whole. This may be called the distributed development model, in contrast with the monolithic small-team model of earlier decades. These developments entail a change in the programming paradigm used in the construction of complex earth systems models. The approach is to build code out of independent modular components, which can be assembled by either choosing a configuration of components suitable to the scientific task at hand, or else easily extended to such a configuration. The code must thus embody the principles of modularity, flexibility and extensibility. The current trend in model development is along these lines, with systematic efforts under way in Europe and the U.S to develop shared infrastructure for earth systems models. It is envisaged that the models developed on this shared infrastructure will go to meet a variety of needs: they will work on different available computer architectures at different levels of complexity, with the same model code using one set of components on a university researcher's desktop, and with a different choice of subsystems, running comprehensive assessments of climate evolution at large supercomputing sites using the best assembly of climate component models available at the moment. The shared infrastructure currently in development concentrates on the underlying "plumbing" for coupled earth systems models, building the layers necessary for efficient parallel computation and data transfer between model components on independent grids. The next stage will involve building a layer of configurable numerical kernels on top of this layer, and this paper suggests a possible mechanism for the numerical kernel layer. The Fortran-90 language 1 offers a reasonable compromise between the need to develop high-performance kernels for the numerical algorithms underlying non-linear flow in complex fluids, and the high-level structure needed to harness component models of climate subsystems developed by independent groups of researchers. In this paper, I demonstrate the construction of parallel numerical kernels in F90 in the context of the GFDL Flexible Modeling System (FMS) a . The structure of the paper is as follows. Sec. 2is a brief description of FMS. Sec. 3 describes the MPP modules, a modular parallel computing infrastructure underlying FMS, and extensively used in the approach outlined in this paper. In Sec. 4, central to this paper, this shared °http://www.gfdl.gov/~fms

279

software infrastructure is taken to the next stage: it is shown how configurable numerics can be built, and extended to include numerical algorithms suitable to the problem at hand. A shallow water model is used here as a pedagogical example. The code used in this section is released under the GNU public license (GPL) and is available for download 6 . The section concludes with a discussion of the strengths and limitations of this approach. Sec. 5 summarizes the findings of the paper, and suggests ways forward. 2

FMS: t h e GFDL Flexible Modeling System

The Geophysical Fluid Dynamics Laboratory (NOAA/GFDL) undertook a technology modernization program beginning in the late 1990s. The principal aim was to prepare an orderly transition from vector to parallel computing. Simultaneously, the opportunity presented itself for a software modernization effort, the result of which is the GFDL Flexible Modeling System (FMS). The FMS constitutes a shared software infrastructure for the construction of climate models and model components for vector and parallel computers. It forms the basis of current and future coupled modeling at GFDL. It has been recently benchmarked on a wide variety of high-end computing systems, and runs in production on three very different architectures: parallel vector (PVP), distributed massively-parallel (MPP) and distributed shared-memory (DSM)C, as well as on scalar microprocessors. Models in production within FMS include a hydrostatic spectral atmosphere, a hydrostatic grid-point atmosphere, an ocean model (MOM), and land and sea ice models. In development are a non-hydrostatic atmospheric model, an Arakawa C-grid2 version of the hydrostatic grid-point atmospheric model, an isopycnal coordinate ocean model, and an ocean data assimilation system. The shared software for FMS includes at the lowest level a parallel framework for handling distribution of work among multiple processors, described in Sec. 3. Upon this are built the exchange grid software layer for conservative data exchange between independent model grids, and a layer for parallel I/O. Further layers of software include a diagnostics manager for creating runtime diagnostic datasets in a variety of file formats, a time manager, general utilities for file-handling and error-handling, and a uniform interface to scientific software libraries providing methods such as FFTs. Interchangeable components are designed to present a uniform interface, so that for instance, behind an ocean "model" interface in FMS may lie a full-fledged ocean model, a few

c

http://www.gfdl.noaa.gov/~vb/kernels.html Also known as cache-coherent non-uniform memory access (ccNUMA) architecture.

280

lines of code representing a mixed layer, or merely a routine that reads in an appropriate dataset, without requiring other component models to be aware which of these has been chosen in a particular model configuration. Physics routines constructed for FMS adhere to the ID column physics specification** providing a uniform physics interface. Coupled climate models in FMS are built as a single executable calling subroutines for component models for the atmosphere, ocean and so on. Component models may be on independent logically rectangular (though possibly physically curvilinear) grids, linked by the exchange grid, and making maximal use of the shared software layers. 3

The M P P modules

The parallel framework for modeling in FMS is provided by a layer of software called the MPP modules 6 . Low-level communication. This layer is intended to protect the code from the communication APIs (MPI, SHMEM, shared-memory maps), offering low-overhead protocols to a few communication operations, sufficient for most purposes. It is designed to be extensible to machines which oblige the user to use hybrid parallelism semantics. The domain class library. This module provides a class-^ to define domain decompositions and updates. It provides methods for performing halo updates on gridpoint models, and data transposes for spectral models or any other purpose (e.g FFTs). It is currently restricted to logically rectilinear grids (which category includes non-standard grids such as the bipolar grid 3 and cubed sphere 4 ). This layer is described in some detail below. Parallel I / O . The parallel I/O module provides a simple interface for reading and writing distributed data. It is designed for performance on parallel writes, which are far more frequent in these models. Merely by setting the appropriate flags when opening a file, users can choose between different I/O modes, including sequential or direct access, multi-threaded d

http://www.gfdl.gov/~fms/gfdl/f90-physics-spec.ps http://www.gfdl.noaa.gov/~vb/mpp.html •'The class library terminology is usually associated with object-oriented languages ( C + + , Java). It is a well-kept secret that F90 modules allow one to build class libraries, having many of the useful features, but few of the current performance disadvantages of OO languages. See POOMA (http://www.mcs.lanl.gov/pooma) for a C + + approach to builing numerical kernels. e

281

or single-threaded I/O 9 , writing a single file or multiple files for later assembly. It currently supports netCDF and raw unformatted data, but is designed to be extensible to other formats. The domain class is briefly described here, as it is used in what follows. The basic element in the hierarchy is an axis specification: t y p e , p u b l i c : : domain_axis_spec i n t e g e r :: begin, end, s i z e , max_size l o g i c a l :: i s _ g l o b a l end type domain_axis_spec The axis specification merely marks the beginning and end of a contiguous range in index space, plus for convenience some additional redundant information that can be computed from the range. Using this, we construct a derived type called domainlD that can be used to specify a ID domain decomposition: t y p e , p u b l i c : : domainlD type(domain_ajcis_spec) : : compute, d a t a , g l o b a l i n t e g e r :: pe type(domainlD), p o i n t e r : : p r e v , next end type domainlD The domain decomposition consists of three axis specifications that contain all the information about the grid topology necessary for communication operations on distributed arrays. The compute domain specifies the range of indices that will be computed on a particular processing element (PE). The data domain specifies the range of indices that are necessary for these computations, i.e by including a halo region sufficiently wide to support the numerics. The global domain specifies the global extent of the array that has been distributed across processors. In addition the domainlD type associates a processor ID with each domain, and maintains a linked list of neighbours in each direction. It is now simple to construct higher-order domain decompositions, of which the 2D decomposition is the most common. This is specified by a derived type called domain2D, which is constructed using orthogonal domainlD objects: t y p e , p u b l i c : : domain2D type(domainlD) : : x, y i n t e g e r : : pe type(domain2D), pointer :: west, east, south, north 'Single-threaded I / O on multiple processors involves having one processor gather the global d a t a for writing, or scatter the d a t a after reading.

282

( Le ie)

Ii s . i s )

(1,1) Figure 1. T h e donain2D type, showing globed, d a t a and compute domains.

end type domain2D

Many of the methods associated with a domain2D can be assembled from operations on its domainlD elements. The additional information here is to maintain a list of neighbours along two axes, and to assign a PE to each 2D domain. The domain2D object is shown here in Fig. 1, for a global domain ( l : n x , i : n y ) distributed across processors and yielding a compute domain (is:ie,js:je). There are two basic operations on the domain class. One is to set up a domain decomposition based on the model grid topology. The second is a high-level communication operation to fill in non-local data points which lie in compute domains on other PEs, and must therefore be updated after a compute cycle (e.g halo update). The simplest version of the first call is given a global grid and the required halo sizes in x and y, and returns an array of domain2D objects with the required decomposition: type(domain2D) : : domain(0:npes-l) c a l l mpp_define_domains( ( / l , n i , i , n y / ) , domain, xhalo=2, yhalo=2 )

283

This requests a decomposition of the global domain on npes PEs, with a halo size of 2. The layout across PEs and domain extents here are internally chosen, though optional arguments allow the user to take control of these if desired. In addition, there are optional flags for the user to pass in information about the grid topology, e.g periodic boundaries in x or y, or folds (as in cylindrical or bipolar grids). After setting up a domain decomposition, we may proceed with a computation. All data is allocated on the data domain, of which only the compute domain subset is locally updated. At the end of a compute cycle, typically a timestep, we make a call to update the non-local data (halo region): r e a l , a l l o c a t a b l e : : f( : , : ) c a l l mpp_define_domains( ( / l , n x , l , n y / ) , domain, xhalo=2, yhalo=2 ) c a l l mpp_get_compute_domain( domain, i s , i e , j s , j e ) c a l l mpp_get_data_domain( domain, i s d , i e d , j s d , jed ) allocate( f(isd:ied,jsd:jed) ) do t = s t a r t , e n d !time loop do J = Js»je do i = i s . i e f(i,j) = ... {computational kernel end do end do c a l l mpp_update_domains( f, domain ) end do All the topology and connectivity information is stored in domain, allowing the halo update to proceed. This data encapsulation proves to be a powerful mechanism to simplify high-level code. The syntax remains identical for arrays of any type, kind or rank; for open or periodic boundary conditions; for simple or exotic logically rectangular grids; and on a variety of parallel computing architectures. It is a testament to the success of the layered and encapsulated interface design that there are very few circumstances under which developers of parallel algorithms in FMS have needed direct recourse to the underlying communication protocols. In the next section we demonstrate the construction of a layer of numerical kernels on top of the domain class library. 4

Parallel numerical kernels in F90

The shallow water model represents the dynamics of small displacements of the surface of a fluid layer of constant density in a gravitational field. Given a rotating fluid of height H, the 2D dynamics of a small displacement f} <^H

284

may be written as: at]

H du dt

i/V-u+i/V2i/

(la)

jVr/ + / k x u + i / V 2 u + F

(lb)

where u is the horizontal velocity, g the acceleration due to gravity, / the Coriolis parameter, v the coefficient of viscosity, and F an external forcing. I have chosen to discretize it below using a forward-backward timestep, and an explicit treatment of rotation, as follows: Vn+l ~ rf At u " + 1 - u" At

H{V • u ) n + i / V V g{Vr])n+1 + / k x u n + i/V 2 u" + F "

(2a) (2b)

The superscript refers to the time level. The spatial discretization has been left unspecified because the numerical kernels constructed will be able to support multiple grid stencils. Observant readers will note that the treatment of the Coriolis term is formally unstable; the unconditionally stable representation introduces some additional considerations that are discussed below in Sec. 4g. The shallow water model is a standard pedagogical tool used in NWP 5 . It is also of significance in real models: each layer of a mass-based vertical coordinate system may be thought of as being a shallow water layer. In addition, it is customary in ocean models to separate out the 2D barotropic dynamics of the ocean, which is formally a shallow water model, from the 3D baroclinic dynamics of the interior, whose intrinsic timescales are much longer. Parallel barotropic solvers tend to be latency-bound and limit the scalability of ocean codes 6 , and are thus a potential client for efficient numerical kernels developed for a shallow water model. 4a

A class hierarchy for the shallow water model

The key to constructing class libraries that are intuitive and easy to use is to define objects as close as is reasonable to the elements of the system being modeled. Here the objects being modeled are vector and scalar fields, and the operations are arithmetic, rotational and differential operations on fields. These will thus become the objects and methods of our class. Differential

285

operations on scalar and vector fields will require a representation of a grid, with an associated metric tensor. Finally, for parallel codes, the grid will need to built on a layer supporting domain decomposition. Thus the class hierarchy being developed is: The domain class described above, representing index-space topologies of domains distributed across parallel processing elements; A grid class superposing a physical grid on the domain class, supplying a grid metric tensor and differentiation coefficients; A field class of 2D scalar and vector fields supporting arithmetic, rotational, and differential operations on the grid. The shallow-water model is 2D, and therefore tends to be latency-bound, since a 2D halo region produces a short communication byte-stream. We use the active domain axis specification for latency-hiding. 4b

Active domains.

A new axis specification that has recently been added to the domain class is the active domain axis specification. This can be used to maintain the state of the data domain, and can provide a simple and powerful mechanism of trading communication for computation on parallel clusters with high-latency interconnect fabrics, or, more generally, on high-latency code segments. The idea is to use wide halos and compute as much locally in the halo as the numerics permit. These points are shared by more than one PE, and thus are redundantly computed. The redundant computations permit one to perform halo updates less often. In each computational cycle, the size of the active domain is reduced until no further computations are possible without a halo update. Only then is the halo update performed. An example is shown below in Sec. 4f. Fig. 2 shows successive stages of a computational cycle requiring halo updates on every fourth timestep. 4c

Scalar and vector fields

The scalar field is constructed as follows: type, public :: scalar2D r e a l , pointer :: d a t a ( : , : ) integer :: i s , i e , j s , j e end type scalax2D

286

i ie)

1 is.is)

fl.l) Figure 2. The domain2D type, showing global, d a t a and compute domains as thick lines. Successive stages of a diminishing active domain are shown as thin lines.

The data element of the scalar2D type contains the field values. The Fortran standard requires arrays within derived types to have the pointer attribute. (This will be revised in Fortran-2000, which will permit allocatable arrays within types). The active domain described in Sec. 4b maintains the state of the data domain, and sets the limits of the domain that contain valid data. We use this information to limit the array sections used for computations, and to determine when a halo update is required. There are pros and cons to the use of arrays with a pointer attribute: • Pointer arrays can be pointed to a persistent private workspace, called the user stack. This has two advantages: — It saves the overhead of a system call to allocate memory from the heap; — It avoids the possibility of memory leaks that can be caused by incautious use of the a l l o c a t e statement on pointer arrays:

287

r e a l , pointer :: a ( : ) r e a l , target :: b(100) a l l o c a t e ( a(100) ) a => b At this point, the original allocation of 100 words has no handle, and cannot be deallocated by the user, nor can the compiler determine whether or not some array is pointing to this location. This is a memory leak that can grow without limit, if it occurs in a routine that is repeatedly called. • It may not be possible for an optimizing compiler to determine if two pointer arrays are or are not aliased to each other. Non-standard compiler directives (e.g !dir$ IVDEP on compilers supporting Cray directive syntax) generally permit the user to assist the compiler in this regard. • The F90 standard requires the assignment operator (=) applied to pointer components to redirect them, rather than copy them. This can yield counter-intuitive behaviour: type(scalar2D) :: a, b a = b b'/.data = . . . This will change the values of a'/.data, which may run against expectations. (Changes to the RHS of an assignment after the assignment do not affect the LHS for ordinary variables). A further complication is that the future revision of the standard, which permits allocatable array components, will restore the expected behaviour. In consideration of the last issue, we overload the assignment interface to restore the expected behaviour when the RHS is of type(scalar2D). We also use the assignment interface to make it possible to assign simple arrays or scalars to a scalar field (e.g a=0): interface assignment(=) module procedure copy_scalar2D_to_scalar2D module procedure assign_0D_to_scalar2D module procedure assign_2D_to_scalar2D end interface i.e, have overlapping memory locations.

288

In all versions of these procedures, we first allocate space for the LHS data from the user stack. In the first instance of the overloaded assignment, the RHS is also a scalar field, and the values in its data element are copied into the LHS data, and the active domain on the RHS is applied on the result. In the other two instances, the RHS is a real array or scalar, which is used to fill the d a t a element of the LHS: type(scalar2D) : : a, b real : : c real : : d ( : , : ) a = b a = c a = d The different possibilities are illustrated above. In the assignment of b and d, an error results if the arrays on the RHS do not conform to a'/.data. The horizontal vector field is constructed as a pair of scalar fields: type, public :: hvector2D type(scalar2D) : : x, y integer :: i s , i e , j s , j e end type hvector2D and its assignment operations are inherited from its scalar components: subroutine copy_hvector2D_to_hvector2D( a, b ) type(hvector2D), intent(inout) :: a type(hvector2D), i n t e n t ( i n ) :: b a'/.x = b'/.x a'/.y = b'/.y return end subroutine copy_hvector2D_to_hvector2D 4d

Arithmetic operators

Consider the addition operation applied to scalar fields: type(scalar2D) : : a, b, c a = b + c This operation requires the arithmetic operation aftdata = bXdata + cftdata, subject to limits on the active domain. The active domain that results is the intersection of the two active domains on the LHS: interface operator(+) module procedure add_scalar2D module procedure add_hvector2D

289 end interface function add_scalar2D( a, b ) type(scalar2D) :: add_scalar2D type(scalar2D), intent(in) :: a, b add_scalar2D'/.data => work2D(: ,: ,nbuf2) ! assign from user stack add_scalar2D'/,is = maxCaV.is.b'/iis) add_scalar2D'/£ie = m i n ( a % i e , b % i e ) add_scalar2D'/,js = maxCa'/.js.b'/.js) add_scalar2Dy,je = minCa'/.je.b'/.je) do j = add_scalar2Dy,js,add_scalar2D'/,je do i = add_scalar2D'/,is,add_scalar2Dy,ie add_scalar2Dy.data(i,j) = a'/.data(i, j) + by,data(i,j) end do end do return end f u n c t i o n a d d _ s c a l a r 2 D For multiplication, one of the multiplicands is a scalar or 2D array. T h e vector field is able to inherit its arithmetic operations as a combination of operations on its components, just as was done for the assignment interface in Sec. 4c. 4e

Rotational

operators

T h e c o m p u t a t i o n of the Coriolis t e r m requires the rotation of a vector field about the vertical. We can define an operator k c r o s s which takes a single vector field argument and returns a rotated vector field. This is very striaghtforward on grid stencils where the vector components are defined at the same point, as in the Arakawa A and B grids 2 , b u t requires averaging for the C-grid, where vector components are not collocated. T h e function k d o t c u r l , taking a vector field operand and returning the kcomponent of vorticity as a scalar field, m a y additionally be defined if needed. 4f

Differential

operators

T h e differential operators include gradient, which takes a scalar field operand and returns a vector field; divergence, taking a vector field operand and returning a scalar field, and laplacian, defined for b o t h vector and scalar fields. I illustrate here the construction of differential operators on a B-grid. T h e B-grid T - and U-cells are shown in Fig. 3:

290

Figure 3. T- and U-cells on the Arakawa B-grid.

The discrete representation of gradient and divergence operators using forward differences on the B-grid is: (3a) (3b) e

(VT)y = Sy(T )

(3c)

The code for the gradient operator is shown below: function g r a d ( s c a l a r ) type(hvector2D) : : grad t y p e ( s c a l a r 2 D ) , i n t e n t ( i n o u t ) :: s c a l a r c a l l mpp_get_compute_domain( domain, i s , i e , j s , j e ) c a l l mpp_get_data_domain( domain, i s d , i e d , j s d , jed ) grad'/.x => work2D( : , : .nbufx) grad*/.y => work2D( : , : ,nbufy) i f ( scalar'/.ie.LE.ie .OR. scalar*/.je.LE.je )then c a l l mpp_update_domains( s c a l a r ^ d a t a , domain, EAST+NORTH ) s c a l a r X i e = ied scalar*/, j e = j e d end if grad'/.is = scalar'/.is; grad'/.ie = scalar'/.ie - 1 grad*/.js = scalar*/.js; grad'/.je = scalar'/.je - i do j = grad*/.js,grad*/.je do i = grad'/.is,grad'/.ie

291 tmpl = scalar'/,data(i+i,j+1) - scalar'/.data(i, j) tmp2 = scalar*/,data(i+l, j) - scalar*/,data(i,j+1) work2D(i,j,nbufx) = gradx(i,j)*( tmpl + tmp2 ) work2D(i,j.nbufy) = grady(i,j)*( tmpl - tmp2 ) end do end do

There are certain aspects of this code fragment worth highlighting: • The differencing on the B-grid requires the values at ( i + 1 , j+1) in order to compute ( i , j ) , thus losing one point in the eastern and northern halos on each compute cycle. A halo update is performed when there are insufficient active points available to fill the compute domain of the resulting vector field. We may avoid updating the western and southern halos, which are never used in this computation. The halo size must be at least 1 for this numerical kernel, but can be set much higher at initialization in the mpp_define_domains call (Sec. 3). For a halo size of n, halo updates are required only once every n timesteps. This illustrates the sorts of parallel optimizations yielded by the construction of modular numerical kernels. • The loop itself has a complicated form, particular to the B-grid stencil shown. Other loop optimizations may be applied here without compromising the higher-level code. Operators for other stencils may be overloaded here as well under the generic interface grad, again preserving higher-level code structure. • The halo updates have been illustrated here as blocking calls, which complete upon return from the call. For a fully non-blocking halo update, the call would be made at the end of a compute cycle, and the usual waitfor-completion call (e.g MPI_WAIT) called at the top of the next cycle. 4g

High-level formulation of a shallow water model

Using the constructs developed here, a basic shallow water model may be written in standard Fortran-90 as follows: program shallow_water use f i e l d s type(scalar2D) : : e t a type(hvector2D) : : u, f o r c i n g do t = s t a r t , e n d !time loop e t a = e t a + d t * ( -h*div(u) + n u * l a p l ( e t a ) )

292

u = u + dt*( - g * g r a d ( e t a ) + f*kcross(u) + nu*lapl(u) + f o r c i n g ) end do end program shallow_water This code' possesses several of the features we seek to build into our codes: portability, scalability, ease of use, modularity, flexibility and extensibility: • The code has been validated on Cray PVP and MPP systems, SGI ccNUMA, and a Beowulf cluster using the PGF90 compiler. On the Cray and SGI compilers, the abstraction penalty without using special directives is estimated at about 20%. Some causes for the abstraction penalty are noted below. • For a typical model grid of 200x200 points, it scales to 80% on 64 PEs with a halo size of 1. It is memory-scaling except for the halo region. • The parallel calls are built into the kernels, and are called only at need. Both blocking and non-blocking halo updates are supported. In addition, the frequency of halo updates can be reduced on a high-latency interconnect at the cost of additional computation, merely by setting wider halos at initialization, and no other code changes. • The grid stencils, and the ordering of indices, etc., can be changed within the kernels without affecting the calling routines. This is done by overloading kernels for different grid stencils under the same generic interface. In addition this would permit the use of different grid stencils in different parts of the code. However, there are some basic limits to this flexibility, noted below. Some difficulties with this approach may also be noted. One is that arithmetic operators constructed as shown here do not perform certain optimizations that would be done for equivalent array operations, illustrated by these examples: a = a + b a = b + c + d In the first example, one memory stream may be saved by knowing that the LHS array also appears on the RHS. Since a function call is used to perform the addition here, an extra memory stream is used. In the second example, the loop construct would chain the two additions together in a single loop. Here the functions are called pairwise. While it is possible in principle to 'Available for download from http://www.gfdl.noaa.gov/~vb/kernels.html

293 chain function calls, it is unlikely that any compiler in practice performs this level of interprocedural analysis. These are contributors to the abstraction penalty noted above. A second limitation of this approach affects the flexibility in the choice of grid stencil for a given high-level casting of the equations. This is due to the fact that the time discretization must be explicitly maintained in highlevel code while we seek to construct numerical kernels for spatial operators. This is best illustrated by modifying Eq. 2 to use an implicit treatment of the Coriolis force: r]n+l - rf At u"+1 - u" At

F(V • u)"

(4a) /u"

g(Vr,)n+1 + A x [

+1

+u"\

^ — J + F"

(4b)

Using a grid stencil where vector components are collocated, it is easy to rearrange terms in the u equation to bring u " + 1 to the LHS. The timestepping for u then takes place in two steps: f 2 = f/2 u = u + dt*( -g*grad(eta) + f2*kcross(u) + nu*lapl(u) + forcing ) u = ( u + dt*f2*kcross(u) ) / ( 1 + f2**2 ) However, on a C-grid, since the vector components are not collocated, the implicit Coriolis term requires a matrix inversion7. Particularly vexing in the current context is that the high-level structure of the equations depends on the spatial grid stencil. The high-level form of the equations may not in all instances be able to be made independent of the grid numerics. 5

Discussion

Earth systems models are run over a wide range of time and space scales, in many configurations, serving a wide variety of needs, while sharing the same underlying Navier-Stokes dynamics applied to complex fluids. As model complexity grows, the need for a flexible, modular and extensible high-level structure to support a distributed software development model has become apparent. First attempts at standardization for parallel computing architecture resulted in standards for low-level communication APIs (MPI, OpenMP). These have the disadvantage of implying a particular kind of underlying hardware (distributed or shared memory) and add greatly to code complexity for

294

portable software that attempts to encompass multiple parallelism semantics. Efforts are currently underway to create a software infrastructure for the climate modeling community that meets these needs with high-performance code running on a wide variety of parallel computing architectures, concealing the communication APIs from high-level code. A prototype for this would be the domain class library described in Sec. 3. The class library approach is a powerful means of encapsulating model structure in an intuitive manner, so that the data structures being used are objects conceptually close to the entities being modeled. In this paper I have demonstrated the extension of this approach to create a class of modular, configurable parallel numerical kernels using Fortran-90 suitable for use in earth systems models. The language offers a reasonable compromise between high-performance numerical kernels and the high-level structure required for a distributed development model. This approach is being experimentally applied to production models, particularly in high-latency code segments where configurable halos could yield performance enhancements. It could yield additional advantages in models where we may need to choose different grid stencils for different research problems using the same model code. The parallel numerical kernels developed here have been validated on a small subset of current compilers and architectures. As they use many advanced language features, it is likely that performance on different platforms will be uneven. This is something that the developers of shared software infrastructure will have to come to terms with. A key recommendation flowing from the conclusions of this paper would be an earnest effort on our part to involve the compiler and language standards communities in these efforts. This involves closer partnership, rather than an antagonistic relationship, with the scalable computing industry, and participation of our community in the evolution of language standards. Acknowledgments This work is a part of the research and development undertaken by the FMS Development Team at NOAA/GFDL to build a software infrastructure for modeling. Bill Long of Cray Inc. provided valuable assistance and advice in elucidating issues of F90 standard compliance, implementation and optimization in an actual compiler (CF90). Thanks to Jeff Anderson, Ron Pacanowski and Bob Hallberg of GFDL for close examination and critique of this work at each stage of development, including the final manuscript.

295

References 1. Michael Metcalf and John Reid. Fortran 90/95 Explained. Oxford University Press, 2nd edition, 1999. 2. A. Arakawa. Computational design for long-term numerical integration of the equations of atmospheric motion. J. Comp. Phys., 1:119-143, 1966. 3. Ross J. Murray. Explicit generation of orthogonal grids for ocean models. J. Comp. Phys., 126:251 - 273, 1996. 4. M. Rancic, R.J. Purser, and F. Mesinger. A global shallow-water model using an expanded spherical cube: Gnomonic versus conformal coordinates. Quart. J. Roy. Meteor. Soc., 122:959 - 982, 1996. 5. G. J. Haltiner and R. T. Williams. Numerical Prediction and Dynamic Meteorology. John Wiley and Sons, New York, 1980. 6. Stephen M. Grimes, Ronald C. Pacanowski, Martin Schmidt, and V. Balaji. Tracer conservation with an explicit free surface method for zcoordinate ocean models. MWR, 129:1081-1098, 2001. 7. D.E. Dietrich, M.G. Marietta, and P.J. Roache. An ocean modeling system with turbulent boundary layers and topography, part 1: numerical description. Int. J. Numer. Methods Fluids, 7:833-855, 1987.

296 U S I N G A C C U R A T E ARITHMETICS TO I M P R O V E N U M E R I C A L R E P R O D U C I B I L I T Y A N D STABILITY IN PARALLEL APPLICATIONS YUN HE AND CHRIS H.Q. DING NERSC

Division, Lawrence Berkeley National Laboratory Berkeley, CA 94720, USA E-mail: [email protected], [email protected]

Numerical reproducibility and stability of large scale scientific simulations, especially climate modeling, on distributed memory parallel computers are becoming critical issues. In particular, global summation of distributed arrays is most susceptible to rounding errors, and their propagation and accumulation cause uncertainty in final simulation results. We analyzed several accurate summation methods and found that two methods are particularly effective to improve (ensure) reproducibility and stability: Kahan's self-compensated summation and Bailey's double-double precision summation. We provide an MPI operator MPI_SUMDD to work with MPI collective operations to ensure a scalable implementation on large number of processors. The final methods are particularly simple to adopt in practical codes: not only global summations, but also vector-vector dot products and matrix-vector or matrix-matrix operations.

Keywords: Reproducibility, Climate Models, Double-Double Precision Arithmetic, Self-Compensated Summation, Distributed Memory Architecture. 1

Introduction and Motivation

One of the pressing issues for large scale numerical simulations on high performance distributed memory computers is numerical reproducibility and stability. Due to finite precisions in computer arithmetics, different ordering of computations will lead to slightly different results, the so called rounding errors. As simulation systems become larger, more data variables are involved. As simulation time becomes longer, teraflops of calculations are involved. All these indicate that accumulated rounding errors could be substantial. One manifestation of this is that final computational results on different computers, and on the same computer but with different number of processors will differ, even though calculations are carried out using double precision (64-bit). It is important to distinguish between the high precision required in internal consistent numerical calculations and the expected accuracy of the final results. In climate model simulations, for example, the initial conditions and boundary forcings can seldom be measured more accurately than a few per-

297 cent. T h u s in most situations, we only require 2 decimal digits accuracy in final results. But this does not imply t h a t 2 decimal digits accuracy arithmetic (or 6-7 bits mantissa plus exponents) can be employed during the internal intermediate calculations. In fact, double precision arithmetic is usually required. In this paper, we are concerned with the numerical reproducibility and stability of the final computational results, and how the improvements in the internal computations could affect the final outcome. Numerical reproducibility and stability are particularly i m p o r t a n t in long term (years to decades and century long) simulations of global climate models. It is known t h a t there are multiple stable regions in phase space (see Figure 1) t h a t the climate system could be attracted to. However, due to the inherent chaotic nature of the numerical algorithms involved, it is feared t h a t slight changes during calculations could bring the system from one regime to another. One common but expensive solution is to run the same model simulation with identical initial conditions with small perturbation at the level far below the observational accuracy for several times, and take the ensemble average. In a scenario climate experiment, first a controlled run of model simulation with standard parameters is produced. Then a sensitivity run of model simulation with slightly changed parameters, say CO2 concentration, is produced. T h e difference between these two runs is computed and the particular effect, say the average surface temperature, is discerned. W i t h o u t numerical stability, the difference between the controlled run and the sensitivity run could be due to the difference in different regimes, not due to the fine difference in input parameters which is the focus of the scenario study. On distributed memory parallel computer ( M P P ) platforms, an additional issue is the reproducibility and stability with regard to number of processors utilized in the simulations. One simulation run on 32 processors should produce "same" or "very close" results as the same simulation run on 64 processors. Sometimes, different number of processors could be involved during different periods of a long t e r m (10 simulated years or more) simulation due to computer resources availability, further demanding reproducibility of the simulation. Numerical reproducibility of the final results are determined by the internal computations during intermediate steps. Obviously, there are a large number of areas involved in the numerical rounding error propagation. In clim a t e models, one major part is the dynamics, i.e., finite difference solution to the primitive equation. On M P P , stencil updates proceed almost identically as in sequential implementation except the u p d a t e order changes slightly. If the first order time stepping scheme is used, the order remains unchanged.

298

oA CI

C2

oB

Oc

Figure 1. An illustrative diagram of several stable regimes centered around A, B, and C in the multi-dimensional phase space at a simulated time far beyond initial period (spin-up). By reproducibility and stability, we mean that results of the same simulation running on different computers or on same computer but with different number of processors remain close inside one regime, say C l and C2. (The "exact" or "absolute" reproducibility, i.e., identical numerical results on different computers, or on different number of processors, or even on different compiler versions or optimization levels, is impossible in our view, and is not our goal.) In contrast, non-reproducibility indicates the simulation results change from one regime to another regime on different computers.

In barotropic component of an ocean model [6], [19], [20], [24], an elliptic equation is solved using a global linear equation via a conjugate gradient (CG) method. In atmospheric d a t a assimilation [5], [7], a correlation equation is also solved via a conjugate gradient m e t h o d . T h e key parameters a,/3 calculated through the global s u m m a t i o n (dot product between global vectors) appear to be sensitive to the different s u m m a t i o n orders on different number of processors. A slight change on the least significant bits will accumulate quickly into the significant bits in a, (3 over several iterations, leading to a different C G trajectory in the multi-dimensional space, and a slightly different solution (see, e.g., [11], [22]). Although the self-correcting nature of the conjugate gradient method ensures t h a t these different solutions on different number of processors are equally correct with regard to solving the linear equation, this inherent difference generating nature could be significant. It is feared t h a t the multiplicative effects on the small differences in each t i m e step will eventually lead the system into another stable regime (say from C to A in Figure 1), instead of merely wandering around the stable regime C (shifting from C l to C2). Another example is the spectral transforms used in m a n y atmospheric models [8], [13], where global summations are very sensitive to the s u m m a -

299 tion order. (In this case, there is no self-correcting mechanism as in the CG m e t h o d mentioned above.) There are many other parts in climate models where s u m m a t i o n of global arrays is used. In general, experience in distributed memory parallel computing indicates global s u m m a t i o n appears to be the most sensitive with regard to rounding error, a fact also known in numerical analysis [15], [21]. 1.1

Current

Approach

T h e commonly used m e t h o d to resolve the global s u m m a t i o n related reproducibility problem is to use serialized implementation, e.g., in some ocean models [6],[12]. In this scheme, elements of distributed array are sent to one designated processor, often processor 0, and are s u m m e d up in a fixed order on the designated processor. T h e result is then broadcasted back to all the relevant processors. This scheme guarantees the reproducibility but at the extra costs of communication. T h e designated processor has to receive P — 1 messages, each from other P — 1 processors, in a time linear to P. Clearly this scheme does not scale well to large number of processors. In addition, this serialization causes unwanted complexity in the implementation for a simple s u m m a t i o n . For large arrays, it is more reasonable t h a t each processor adds the subsection of the array in a fixed order and only sends its local s u m m a t i o n to the designated processor. Although the local orders are the same for each processor, the global order of s u m m a t i o n will be different when different number of processors are used, thus the final global s u m m a t i o n result will not be reproducible. Finally, even though the global s u m m a t i o n result is reproducible, the result could often be inaccurate since double precision arithmetic is usually not adequate for the d a t a with large cancellation. T h e accumulation of these rounding errors could be severe for long time simulations during repeated arithmetic operations. This affects numerical stability for certain applications. 1.2

New

Approach

Here we look into a new approach t h a t b o t h guarantees (or substantially improves) reproducibility and achieves scalability and efficiency on distributed platforms. T h e basic idea is t h a t non-reproducibility is directly caused by the rounding errors involved in the intermediate arithmetic operations. Therefore instead of fixing the s u m m a t i o n order, if we could reduce rounding errors very significantly (sometimes eliminate) with more accurate arithmetic, we could

300 achieve reproducibility also. On a computer with infinitely accurate arithmetics, there is no rounding error. A simple and trivial solution is to use higher precision, say 128-bit precision arithmetics in the relevant d a t a arrays. However, most computer platforms we know do not support 128-bit precision arithmetic. T h e only exception is Cray P V P (SV1, C90) line of computers, where the 128-bit precision is supported and implemented in software, resulting in huge (factor of 10 or more) performance degradation. T h e real challenge here is to find a simple and practical method t h a t can effectively improve the numerical reproducibility and stability. In this paper, we examined several accurate s u m m a t i o n methods, in particular, the fixed-point and the multi-precision arithmetics, the selfcompensated s u m m a t i o n (SCS) and doubly-compensated s u m m a t i o n (DCS) methods, and the double-double precision arithmetics. We first examined these methods in sequential computer environment, i.e., on a single processor. We found two effective methods: the self-compensated s u m m a t i o n and double-double precision s u m m a t i o n . Then these promsing methods are examined in the parallel environment. We provide necessary functions to work together with M P I communication library to utilize these accurate methods in parallel environment in a scalable way. We also note t h a t the local s u m m a tion results from self-compensated s u m m a t i o n and double-double precision s u m m a t i o n can be summed accross processors in a simple and unified way with the M P L S U M D D operator we provided. All the major problems associated with the serialized implementation, non-scalability and implementation complexity, are therefore resolved in our new approach. Furthermore, the final results in the new approach will be more accurate due to accurate arithmetics, compared to the current serialized approach. This improves numerical stability and could be of critical importance for some applications. In summary, our m a i n concern in this paper is to find simple and practical methods to improve the numerical reproducibility between runs on different number of processors on the same distributed memory computer, and also between runs on different computers. We approach this task by using more accurate (than standard double precision) methods in some of the key steps in parallel applications. Although the overall final computation results will be undoubtedly more accurate, t h a t is an added benefit, but is not our primary goal.

301 2

Sea Surface Height D a t a

Sea surface height (SSH) is the first serious difficulty we encountered in an ocean circulation model development [6]. T h e variable stores the average sea surface height from the model simulation, which can later be compared with satellite data. Repeating the same simulation on different number of processors leads to different results. This difficulty, along with issues mentioned above, motivated this work. This simple problem serves as a good example for the main ideas of global s u m m a t i o n in parallel environment. All m e t h o d s are tested against this SSH d a t a in this paper on Cray T 3 E at N E R S C (We also used synthetic d a t a in this work, and the results are very similar). T h e SSH variable is a two-dimensional sea surface volume (integrated sea surface area times sea surface height) distributed among multiple processors. At each time step, the global s u m m a t i o n of the sea surface volume of each model grid is needed in order to calculate the average sea surface height. T h e absolute value of the d a t a itself is very large (in the order of 10 1 0 to 10 1 5 ), with different signs, while the result of the global s u m m a t i o n is only of order of 1. Running the model in double precision with different number of processors generate very different global summations, ranging from -100 to 100, making the simulation results totally meaningless. We saved the SSH d a t a from after one-day simulation as our test data. T h e 2D array is dimensioned as s s h ( 1 2 0 , 6 4 ) , with a total of 7680 double precision (64-bit) numbers. Other sizes and simulation time will not affect the conclusions reported here (Both the codes used here and the SSH d a t a can be downloaded from our website [14]). For the 2D SSH array, the most n a t u r a l way of global s u m m a t i o n is to use the following simple codes: do j = 1, 64 ! index f o r l a t i t u d e do i = 1, 120 ! index f o r l o n g i t u d e sum = sum + s s h ( i , j ) end do end do

Code ( 1 )

We call this order the "longitude first" order, the result of s u m m a t i o n is 34.4 (see Table 1). If we exchange the do i and do j lines, so t h a t elements with different j index (while i index remain fixed) are s u m m e d up first (latitude first), we get 0.67, a totally different result! Sometimes in practice, we need to sum up the array in reverse order, such as do i = 120, 1, - 1 , denoted as "reverse longitude first", the result would change again, t o 32.3. These results are listed in Table 1. Clearly the results are s u m m a t i o n order dependent on the sequential computer. In fact, we do not know what is

302 Table 1. Results of the summation in sifferent natural orders with different methods in double precision

Order Longitude First Reverse Longitude First Latitude First Reverse Latitude First Longitude First SCS Longitude First DCS Latitude First SCS Latitude First DCS

Result 34.414768218994141 32.302734375 0.67326545715332031 0.734375 0.3823695182800293 0.3882288932800293 0.37443733215332031 0.32560920715332031

the exact correct result, until later with other methods. On a distributed memory computer platform, the 2D SSH array is decomposed into many subdomains among multiple processors. On different number of processors, the order of summation is not guaranteed, and the results are not reproducible! The origin of the rounding error is due to finite representation of a floating point number. A simple example explains the idea best. In double precision, the following Fortran statement S = 1.25 x 1020 + 555.55 - 1.25 x 1020 will get S = 0.0, instead of S = 555.55. The reason is that when compared to 1.25 x 10 20 , 555.55 is negligibly small, or non-representable. In hardware, 555.55 is simply right-shifted out of CPU registers when the first addition is performed. So the intermediate result of the first addition is 1.25 x 10 20 , which is then cancelled exactly in the subtraction step. The sea surface height data is one of the worst cases and therefore serves as a good clear test case. For most variables, this dependency on summary ordering is less pronounced. However, this error could propagate to higher digits in the following iterations. Our goal is to find an accurate summation scheme that minimizes this rounding error. 3

Fixed-Point Arithmetic

The first method we investigate is a fixed point summation without loss of precision. It is a simple method and can be easily implemented (codes could be downloaded from our web site [14]). we wrote a code to first convert double precision floating point numbers of a global array into an array of integers, a db2int() function. Depending upon the dynamical range (maximum and

303

minimum) and precision required, the integer representation chooses a proper fixed point (a scale factor) and one or a few integers to represent each floating point numbers. These integers are then summed up using standard integer arithmetic, (and sum across the multiple processors using MPLREDUCE with MPIJNTEGER data type) and are finally converted back to double precision numbers, rounding off all lower bits which are non-representable in double precision, by using the int2db() function. This method is applied to the 3 number addition example and the correct result S=555.55 is obtained. We applied this method to the SSH data discussed above, and the summation result is J2 ssh(i, j) = 0.35798583924770355

(1)

This result remains the same upon changing the summation orders, convincing us that it is the exact result. (The double-double precision discussed later gives the same result.) This method requires the users to know the dynamical range before calling the d b 2 i n t ( ) conversion routine. A simple way is to find the maximum and minimum magnitudes of the array. Based on this information, an initialization routine will properly determine the scale factor (the fixed point) and the number of integers (32 bits or 64 bits) required to represent the floating point numbers. In large simulation codes, however, finding the maximum and minimum in a large array distributed over multiple processors before each array summation is quite inconvenient in many situations. For this reason, we do not recommend this method. We strive to find methods which are simple to adapt in practical large scale coding. 4

Multi-Precision Arithmetic

The above fixed-point arithmetic is a simple example of a class of multiprecision arithmetic software packages which carry out numerical calculations at a pre-specified precision [15]. Brent's BMP [3] is the first complete Fortran package on the multi-precision arithmetics offering a complete set of arithmetic operations as well as the evaluation of some constants and special functions. Bailey's MPFUN [2] is a more sophisticated and more efficiently implemented complete package. He used this package to compute 7r to 29 million decimal digits accuracy! Several other packages are also available (see [15] for more discussions).

304 We looked into b o t h B M P and M P F U N multi-precision packages and decided not to investigate t h e m further because of the nontrivial practical coding efforts involved to adopt t h e m in large simulation codes. At the same time we learned of the error-compensated s u m m a t i o n methods and the doubledouble precision arithmetics which are much more easier to adopt. T h e y seem to offer the practical solution we are looking for. 5

Self-Compensated Summation Methods

K a h a n [16] in 1965 suggested a simple, but very effective method to deal with this problem. T h e idea is to estimate the roundoff error and store in an error buffer; this error is then added back in the next addition (For consistency, from now on subtraction will be called as addition, since subtraction = addition of a negative number). T h e computer implementation is as follows: sum = a + b e r r o r = b + ( a - sum) Using this self-compensated s u m m a t i o n m e t h o d in the previous example, the first addition would give 1.25 x 10 2 0 and 555.55 as the sum and e r r o r . T h e pair (sum, e r r o r ) is returned as a complex number from the function S C S ( a . b ) [See Appendix A]. In the next addition, the e r r o r is first "compensated" or added back to one of the addends (numbers to be added); the same addition and error estimation are then repeated. Suppose we need to calculate d = a + b + c. T h e following two function calls give the final results: (sum, e r r o r ) = SCS(a, b) Code ( 2 ) ( s u m l , e r r o r l ) = SCS(sum, c + e r r o r ) T h e final result is d = suml with error = e r r o r l . T h e results of the error compensated s u m m a t i o n is always a (sum, e r r o r ) pair. Priest [23] further improved the error-compensating method by repeated applications of SCS. Note t h a t in the second SCS in Code (2) for addition of c, the addition of e r r o r to c could have substantial precision loss. T h e doublycompensated s u m m a t i o n is to use SCS again on this c + e r r o r to compensate the error for any possible loss of precision. This further application of SCS could be implemented as three SCS calls; the calculation o f d = a + b + c is done by the following codes: (sum, e r r o r ) = SCS ( a , b) ( s u m l , e r r o r l ) = SCS ( e r r o r , c) (sum2, e r r o r 2 ) = SCS (sum, suml) Code ( 3 ) (sum3, e r r o r 3 ) = SCS (sum2, e r r o r l + e r r o r 2 ) Here e r r o r from the first SCS is added back to c using SCS, which produces a second level error e r r o r l (called a second level error because it comes from

305 an addition involving the e r r o r ). T h e third line in Code (3) performs the same function as the second line in Code (2), which is essentially sum + c with another first level e r r o r 2 generated. These two errors are b o t h compensated in the fourth line of Code (3). T h e final result is d = sum3 with error = e r r o r 3 . See Appendix B for the Fortran implementation. This doubly-compensated s u m m a t i o n m e t h o d works the best when the original array is sorted in the decreasing magnitude order beforehand. This concludes our intuitive discussions on the error-compensated methods. Good theoretical analyses about rounding error bounds of the m e t h o d s could be found in [10], [15], [18]. We applied both SCS and DCS to the 2D sea surface height data, using the straight s u m m a t i o n in Code (2). T h e results are listed in Table 1, "longitude first SCS", etc. Now all 4 orders give results with agreement on the first significant decimal digit; in fact they are quite close to the correct result in E q . ( l ) . This indicates very substantial improvements in rounding error reductions during the summations. We also note t h a t DCS does not seem to outperform the much simpler SCS.

5.1

Sensitivity

Regarding to Summation

Order

To further study the sensitivity regarding to different s u m m a t i o n orders and to see further difference in performance between SCS and DCS, we applied a number of different sorting orders on the SSH data. For simplicity, SSH d a t a is treated as a simple I D array of 7680 double precision numbers. This I D array is easily sorted in 6 different ways as explained in Table 2. They are s u m m e d up from the left to right. For an array of all positive numbers, the increasing order would be the best, since the smaller numbers are added first, and they would accumulate large enough not to be rounded off when added with big numbers. But for arrays with mixed positive and negative numbers, the decreasing magnitude order would be the best, especially when the absolute value of the s u m m a t i o n is much smaller t h a n the addends due to the cancellation of most addends. We also tried several other orders. T h e s u m m a t i o n results are shown in Table 3. From these results we observe t h a t (1) Using ordinary double precision s u m m a t i o n , different sorted orders always lead to different results ranging from -73.6 to 34.4 , similar to those listed in Table 1. (2) In all 6 sorted orders (excluding Order 1 which is same as the Longitude First order in Table 1), DCS gives the identical results as SCS does. They are always between 0.358 and 0.359, very close to the real result of 0.358 (less t h a n 0.4% error). This indicates t h a t sorting helps reduce the rounding error in SCS and D C S ,

306 Table 2. Seven different orders of the array tested for summation

Order 1 2 3 4 5 6 7

Sequential Order no sort increasing order decreasing order increasing magnitude order decreasing magnitude order positives reverse from order 2 negatives reverse from order 2

Example -9.9 5.5 2.2-3.3-6.6 0 1.1 -9.9-6.6-3.3 0 1.1 2.2 5.5 5.5 2.2 1.7 0-3.3-6.6-9.9 0 1.1 2.2-3.3 5.5-6.6-9.9 -9.9-6.6 5.5-3.3 2.2 1.10 -9.9-6.6-3.3 0 5.5 2.2 1.1 -3.3-6.6-9.9 0 1.1 2.2 5.5

Table 3. Results of the summation in different sorting orders with different methods in double precision [fixed point arithmetic and double-double precision arithmetics always give the same and correct answer as in Eq.(l)]

Order 1 2 3 4 5 6 7

Double Precision 34.414768218994141 -70.640625 -73.015625 13.859375 14.318659648299217 -36.254243895411491 -66.640625

SCS 0.3823695182800293 0.359375 0.359375 0.359375 0.35798583924770355 0.35812444984912872 0.359375

DCS 0.3882288932800293 0.359375 0.359375 0.359375 0.35798583924770355 0.35812444984912872 0.359375

but not much in the double precision arithmetic. (3) Among the 6 different sorting orders, SCS or DCS with decreasing magnitude order produces the correct answer of Eq.(l). Intuitively in this order, the numbers with similar magnitude but oppositive signs are added together first, and then the result is added to the accumulation without much loss of precision. This order is the recommended order in summation of large arrays [15], [23]. From these observations and results in Table 1, we conclude first that SCS and DCS always give results very close to the correct result (better results when the array is sorted), and show dramatic improvements over the straight double precision summation. Secondly, the differences between DCS and SCS are also very small: when array is sorted, there is no difference (Table 3); when array is unsorted, there is small difference (Table 1), but no indication that DCS does better than SCS. Therefore, from now on, we will concentrate on the simpler SCS. So far, we have studied compensated methods that require no extra stor-

307 age space (except 1 to 3 temporary buffer spaces) at the cost of extra additions. Although no exact tight error bounds can be given, they improve the accuracy very substantially. Another practical approach is to do truly higher precision arithmetic using existing double precision d a t a representation and C P U arithmetic units.

6

Double-Double Precision Arithmetic

In double-double precision arithmetics, each floating point number is represented by two double-precision numbers. In fact, the bits in the second double precision number are extended mantissa of the first double precision number. The dynamic range of the double-double precision number remains the same as the double precision number. As an analogy to the self-compensated summ a t i o n discussed above, one m a y think t h a t the first double precision number as the s u m m a t i o n result and the second double precision number as the estim a t e d rounding error. A suite of Fortran 90 compatible codes are developed by Bailey [1] by using K n u t h ' s [17] trick. We used this double-double precision codes in the sea surface height d a t a s u m m a t i o n . It always gives the same correct result of E q . ( l ) , irrespective of the 9 different s u m m a t i o n orders listed in Tables 1 and 3. This indicates the extra precisions (about 106 bits) in the double-double representation is highly effective in reducing the roundoff errors during the s u m m a t i o n . T h e double-double arithmetic employs slightly more arithmetic steps. T h e number of additions increases much more t h a n in the self-compensated m e t h o d . For a single addition, double-double arithmetic requires 11 double precision additions vs. 4 for the SCS. However, on cache-based processor architectures, the increased C P U arithmetic does not slow down calculation as much as the increased memory access required since we use two double precision numbers for one single double-double number. In practice, on wide range of processors, the double-double arithmetic only slows down by a factor of 2 while getting effective double-double precision. T h e double-double arithmetics has also increased m e m o r y requirements, which is doubled. However, this is not a real problem in today's computers when D R A M is very plentiful. Furthermore, one can implement the basic arithmetic codes in a way t h a t also requires no increase in memory requirement, similar t o the self-compensated methods.

308 7

S u m m a t i o n Across Distributed M e m o r y Processors

Among the above methods we have investigated so far, fixed-point/multiprecision and double-double arithmetics always give the same and correct result. However, as pointed out earlier, the fixed-point arithmetic requires the dynamical range of the d a t a array before the s u m m a t i o n which is not practical. T h e true multi-precision packages require too much effort to adopt regarding the conversion of the d a t a format. T h e self-compensated s u m m a t i o n emerges as a very simple and effective m e t h o d as it gives results very close to the exact one in the worst case of SSH d a t a in 9 totally different s u m m a t i o n orders (Tables 1 and 3). T h e doubly-compensated s u m m a t i o n does not show significant improvement in accuracy over SCS b u t more arithmetic operations are involved. Doubledouble arithmetic is easy to implement. Thus, we will concentrate on only two s u m m a t i o n methods in the distributed systems, i.e., the SCS m e t h o d and double-double precision arithmetics. DCS m e t h o d can be adopted in same way as SCS. In distributed m e m o r y environment, each processor will first do the summ a t i o n on a subsection of the global array covered by the processor. One m a y use the self-compensated method or the double-double precision arithmetics. T h e results of this local s u m m a t i o n are two double precision numbers: (sum, e r r o r ) in the SCS and ( d o u b l e , d o u b l e ) in double-double arithmetic. In both cases, the result can be represented as a single complex number. T h e issue here is how to sum up the pair of numbers on each processor in a consistent way to achieve high accuracy. We investigated a number of different approaches and found a unified and consistent approach.

7.1

Double-Double

Precision

Arithmetic

Consider the double-double arithmetic case first. If we simply use M P I to sum up the first double number on all processors together, and the second double numbers on all processors, we can use MPI.REDUCE ( l o c a l _ s u m , g l o b a l . s u m , 1, MPI_C0MPLEX, MPI.SUM, . . . ) Here the complex variable l o c a l - s u m contains the local s u m s and g l o b a l _ s u m is the result of the global s u m m a t i o n . Note we use M P L C O M P L E X as the d a t a type. During this procedure, the first double and second double numbers on different processors are s u m m e d up separately, using the usual double precision arithmetic, not the double-double arithmetic. Therefore the final results of first and second double numbers are not consistent and precision is lost during this procedure. Applying this to the SSH data, the final s u m m a t i o n

309 Table 4. Results from self-compensated summation and double-double precision summation with different number of processors in natural order as in code (1). Here P means number of processors, MPI.SUM means to add sums and errors from self-compensated summation or first and second double numbers from double-double precision summation separately with double precision; MPLSUMDD means to add local summations in double-double precision with a unified MPI operator across processors.

p 1 2 4 8 16 32 64

P 1 2 4 8 16 32 64

SCS MPIJSUM 0.3882288932800293 0.3745570182800293 0.3804163932800293 1.9995570182800293 1.2169642448425293 2.1164088249206543 2.2617335319519043

MPLSUMDD 0.3882288932800293 0.3745570182800293 0.3804163932800293 0.3745570182800293 0.3575892448425293 0.3351588249206543 0.3706812858581543

Double-Double MPLSUMDD MPI.SUM 0.35798583924770355 0.35798583924770355 0.35798583924770355 0.35798583924770355 0.35798583924770355 0.35798583924770355 0.35798583924770355 1.4829858392477036 -0.78263916075229645 0.35798583924770355 -0.54826416075229645 0.35798583924770355 0.35798583924770355 2.5547937005758286

result changes on different processors, and is not exactly correct (see column M P L S U M for SCS in Table 4). A consistent way to sum up double-double numbers across multiple processors is to implement the double-double addition as an M P I operator, in the same functionality as M P L S U M . This can be done by using the MPI operator creation subroutine M P L O P X R E A T E to create an operator M P L S U M D D for the double-double summation. An implementation of M P L S U M D D creation is given in the Appendix C. W i t h the operator M P L S U M D D , the consistent double-double arithmetic can be done in the same way as the normal complex double precision MPI.REDUCE ( l o c a l . s u m , global_sum, 1, MPI_COMPLEX, MPI_SUMDD, . . . ) W i t h this approach, we always get the exact result of E q . ( l ) on different number of processors, as expected (see column M P L S U M D D for double.double in

310 Table 4). 1.2

Global Summation

Using Self-Compensated Method

Separating global summation of an array into local summation followed by summation across processors requires a slight change of the basic procedure of the self-compensated summation. Here on, say, 4 processors we have 4 (sum, e r r o r ) pairs. In principle, we can just add up 4 sums and add up 4 e r r o r s back for the final results. We can use collective function MPLREDUCE with double precision operator MPLSUM to sum all sums and e r r o r s separately; and then add the sum of sums and the sum of errors. In this way, the results are unpredictable (see column MPLSUM for double_double in Table 4). From our experience in the sequential environment, we need to sort the sums in decreasing magnitude order before summation. In order to sort the local sum and e r r o r across processors, we need to gather them to one processor, and apply the same sorting and SCS summation algorithm. The test results with different number of processors are all 0.35798583924770355, which is correct. However, as discussed in the introduction, this serialized communication method will not scale well to large number of processors. Fortunately, we found that the MPLSUMDD operator discussed above can be "directly" applied to the results of SCS method. Therefore the local summation results of SCS on different processors can be added up using MPLREDUCE with the operator MPLSUMDD. It will carry out the doubledouble precision arithmetic during the intermediate steps with the tree algorithms typically employed in MPLREDUCE. This is sufficient to produce the exactly correct result in a scalable way. The reason that the (sum, e r r o r ) pairs, produced by the SCS (and DCS) method, can be used with MPLSUMDD operator is that the e r r o r in the (sum, e r r o r ) pair is exactly the non-representable portion of the double precision number sum [In other words, suppose on an imagined computer, the mantissa has 106 bits (which is effectively the precision offered by the doubledouble representation) to represent the exact summation result. On the real computer in IEEE format, a double precision number has mantissa of 53 bits. e r r o r would be the portion represented by those lower 53 bits beyond the upper 53 bits representable in the IEEE double precision number]. By design, this non-representable portion is exactly the second double precision number in the double-double number representation. Therefore, (sum, e r r o r ) is already in the correct double-double representation. This reveals the intimate relationship between SCS method and double-double arithmetic. This relation between SCS and double-double arithmetic could also be

311 used in situations other t h a n s u m m a t i o n across processors. Suppose we have several subroutines or modules each producing a (sum, e r r o r ) pair by employing SCS. Then these SCS results can be directly s u m m e d up using the double-double arithmetic without loss of precision. This technique will be very useful for maintaining modularity of large simulation codes. Note t h a t since DCS has the same representation of the final results as the SCS [(sum3, e r r o r 3 ) in Code (3) vs. (suml, e r r o r l ) in Code (2)], this relationship holds for DCS too. In particular, one may use DCS on local s u m m a t i o n , and then double-double arithmetic on global s u m m a t i o n . In summary, we found t h a t SCS m e t h o d can be effectively combined with double-double precision arithmetics to achieve accurate s u m m a t i o n on a distributed global array. 8

Beyond Global Summation

In this paper, we illustrated the new approach with a simple SSH global summation example. We also studied the dot product between two global vectors to see the effects on orthogonalization between vectors; and the new method substantially improves the orthogonality. T h e dot product code is a simple modification of the global s u m m a t i o n codes, but with a multiplication routine to perform multiplication of two double precision numbers in double-double precision accuracy and o u t p u t the result in double-double precision. This code together with those listed in the Appendix are available for download from our web site [14]. Matrix-vector and m a t r i x - m a t r i x operations are repeated applications of vector-vector dot product operations, and thus can be easily implemented with the above codes, without much memory overhead. Also we note t h a t there is an effort to develop XBLAS[4], extended precision BLAS using the double-double precision arithmetics. W i t h this suite of efficient codes, many intermediate results during matrix-vector operations can be promoted to double-double precision, therefore further improve the reproducibility and stability. 9

Practical Implications

Our m a i n results are (a) SCS and double-double arithmetic are very effective; (b) T h e results of SCS are directly in double-double representation; (c) A simple M P L S U M D D can carry out double-double arithmetic consistently during the s u m m a t i o n across processors. Together with our experiences in this work and other large scale ocean and atmospheric model codes, we recommend the

312

following steps in adopting the accurate arithmetics in real application codes to improve numerical reproducibility and stability: (1) Select the parts or modules where global summation plays important roles, such as the conjugate gradient solution of the barotropic equations or the spectral transforms. The very simple self-compensated summation can be easily adopted. The double-double arithmetic can also be easily adopted to guarantee the summation accuracy. The results of different modules are summed up using double-double arithmetic. Results on different processors are summed up using MPLREDUCE with MPI.SUMDD. (2) Carry out test runs of the resulting codes on different computer platforms and on different number of processors to see if the numerical stability, especially reproducibility are improved. If the improvements are small in some measure, it might indicate that more parts of the codes should be considered for modification to adopt the SCS. If the improvements are substantial in some measure, either more parts of the codes should be considered for adoption, or the double-double arithmetic should be adopted on the selected codes for further accuracy enhancements. Adoption of double-double arithmetics generally requires more memory and slight changes of codes. One technique we found useful is to think of both the (sum, error) in SCS and double-double number as a complex number (see our implementations listed in the Appendix). This is conceptually more consistent than simply using two double numbers in a row, and makes the modifications a little easier. As long as addition and subtraction are concerned, they can be carried out simply as complex numbers. Care must be taken, however, when multiplications and divisions are involved, which are fortunately not the major problems here. Although we studied effects on summation due to sorted orders, and found sorting array does reduce rounding errors in SCS and DCS, we do not see it as a solution to the reproducibility problem: (1) sorting an array is much slower compared to summation; (2) sorting a distributed array across processors is even more time consuming and unnecessarily complicated. As explained, we believe the SCS and double-double arithmetics could effectively resolve the problem. 10

Concluding Remarks

We studied the numerical reproducibility and stability issue on scientific applications, especially in climate simulations, on parallel distributed memory computers. We focus on the dominant issue that affects the reproducibility and stability: the global summation of distributed data arrays. We investi-

313 gated accurate s u m m a t i o n methods and found two particular effective methods which can also be easily implemented and scale well to large number of processors. Our m a i n point here is t h a t numerical reproducibility and stability can be improved substantially by using high accuracy in some of the key steps in parallel computations. As a result, the final results will be slightly more accurate; This is a welcome benefit, but of secondary consideration. In fact, we do not propose to use higher precision on all parts of a simulation codes, unless it is shown to be absolutely necessary. As we pointed out in Section 6, for a single addition, double-double arithmetic involves 11 double precision arithmetics vs. 4 for SCS vs. 1 for normal double precision s u m m a t i o n . T h e run timings for these methods are not in the same ratio. For a particular scientific application such as climate simulation, global s u m m a t i o n usually takes a very small part (less t h a n 1%), though critical, of the overall codes. Adopting accurate arithmetic m e t h o d s will almost not affect the total timing. To our knowledge, this work is the first systematic a t t e m p t to address the reproducibility from accurate arithmetics approach. Here we used the SSH d a t a as a validation test of the improvement techniques. How well these techniques perform in practical codes when the variables affect dynamics? Are they adequate? All these require further investigations. We plan to examine the spectral transforms in the C C M atmosphere model [8], [13] and conjugate gradient solver in the P O P ocean model [24]. Global s u m m a t i o n is just an example we used in this work. High accurate arithmetics should be used in wherever reproducibility is an issue. Acknowledgment s We t h a n k David Sarafini for pointing to us K a h a n ' s methods, David Bailey for the double-double arithmetic codes, Sherry Li for discussions on extended precision arithmetics, and Horst Simon for support. We also thank those who participated in the discussions on the reproducibility issue at climate modeling conference [25] and workshop [26]; those discussions motivated this work. This work is supported by the Office of Biological and Environmental Research, Climate Change Prediction Program, and by the Office of C o m p u t a tional and Technology Research, Division of Mathematical, Information, and C o m p u t a t i o n a l Sciences, of the U.S. Department of Energy under contract number DE-AC03-76SF00098. This research uses resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy.

314

Appendix Note: These codes are written on Cray T3E, where r e a l is of 64-bit precision by default. On many other computers, r e a l is of 32-bit by default and one needs to replace r e a l by r e a l * 8 , complex by complex*16, etc., in the implementation. A. Self-Compensated Summation Method C This function r e t u r n s the (sum,error) p a i r as a complex number. complex function SCS_sum ( a r r a y , n) r e a l array(n) complex cc, SCS cc = cmplx ( 0 . 0 , 0.0) do i= 1, n cc = SCS ( r e a l ( c c ) , imag(cc) + a r r a y ( i ) ) enddo SCS_sum = cc end complex function SCS (a, b) r e a l a, b , sum sum = a + b SCS = cmplx (sum, b - (sum - a)) end

B. Doubly- Compenstated Summation Method C This function r e t u r n s the (sum,error) p a i r as a complex number. complex function DCS_sum (array, n) r e a l array(n) complex cc, SCS, c l , c2 cc = cmplx ( a r r a y ( l ) , 0) do i= 2, n cl = SCS (imag(cc), a r r a y ( i ) ) c2 = SCS ( r e a l ( c c ) , r e a l ( c i ) ) cc = SCS ( r e a l ( c 2 ) , imag(cl) + imag(c2)) enddo DCS_sum = cc end

315 C. Double-Double

Precision

Summation

C This code c a l c u l a t e s t h e summation of an a r r a y of r e a l numbers C d i s t r i b u t e d on m u l t i p l e p r o c e s s o r s u s i n g d o u b l e - d o u b l e p r e c i s i o n . include 'mpif.h' real array(n) i n t e g e r myPE, t o t P E s , stat(MPI_STATUS_SIZE), i e r r i n t e g e r MPI_SUMDD, i t y p e e x t e r n a l DDPDD complex local_sum, global_sum c a l l MPI_INIT(ierr) call MPI_COMM_RANK( MPI_COMM_WORLD, myPE, ierr ) call MPI_COMM_SIZE( MPI_COMM_WORLD, totPEs, ierr ) C operator MPI_SUMDD is created based on an external function DDPDD. call MPI_OP_CREATE(DDPDD, .TRUE., MPI.SUMDD, ierr) C assume array(n) is the local part of a global distributed array. local_sum = cmplx (0.0,0.0) do i = 1, n call DDPDD (cmplx(array(i), 0.0), local_sum, 1, itype) enddo C add all local_sums on each PE to PE0 with MPI_SUMDD. C global_sum is a complex number, represents final (sum, error). call HPI_REDUCE (local_sum, global.sum, 1, MPI_C0MPLEX, & MPI.SUMDD, 0, MPI_C0MM_WQRLD, ierr) call MPI_FINALIZE(ierr) end C C

c

Modification of original codes written by David H. Bailey. This subroutine computes ddb(i) = dda(i) + ddb(i) subroutine DDPDD (dda, ddb, len, itype) implicit none real*8 e, tl, t2 integer i, len, itype complex*16 dda(len), ddb(len) do i = 1, len Compute dda + ddb using Knuth's trick. tl = real(dda(i)) + real(ddb(i)) e = tl - real(dda(i)) t2 = ((real(ddb(i)) - e) + (real(dda(i)) - (tl - e))) & +imag(dda(i)) + imag(ddb(i))

316 c

The r e s u l t i s t l + t 2 , a f t e r n o r m a l i z a t i o n . d d b ( i ) = cmplx ( t l + t 2 , t 2 - ( ( t l + t 2 ) - t l ) enddo end

)

References [1] D. H. Bailey. A Fortran-90 Suite of Double-Double Precision Programs. See web page at h t t p : / / w w w . n e r s c . g o v / ~ d h b / m p d i s t / m p d i s t . h t m l . [2] D. H. Bailey. Multiprecision Translation and Execution of Fortran Programs. ACM Transactions on Mathematical Software, 19(3):288—319, September 1993. [3] R. P. Brent. A Fortran Multiple Precision Arithmetic Package. ACM Transactions on Mathematical Software, 4:57-70, 1978. [4] J. Demmel, X. Li, D. Bailey, M. Martin, J. Iskandar, and A. Kapur. A Reference Implementation for Extended and Mixed Precision BLAS. In Preparation. [5] C. H. Q. Ding and R. D. Ferraro. A Parallel Climate D a t a Assimilation Package. SIAM News, pages 1-12, November 1996. [6] C. H. Q. Ding and Y. He. D a t a Organization and I / O in an Ocean Circulation Model. In Proceedings of Supercomputing'99, November 1999. Also LBL report number LBNL-43384, May 1999. [7] C. H. Q. Ding, P. Lyster, J. Larson, J. Guo, and A. da Silva. Atmospheric D a t a Assimilation on Distributed Parallel Supercomputers. Lecture Notes in Computer Science, 1401:115-124. Ed. P. Sloot et al., Springer, April 1998. [8] J. Drake, I. Foster, J. Michalakes, B. Toonen, and P. Worley. Design and Performance of a Scalable Parallel C o m m u n i t y Climate Model. Parallel Computing (PCCM2), 21:1571, 1995. [9] G. Fox, M. Johnson, G. Lyzenga, S. O t t o , J. Salmon, and D. Walker. Solving Problems on Concurrent Processors. Vol 1. Prentice Hall, Englewood Cliffs, New Jersey, 1988. [10] D. Goldberg. W h a t Every Computer Scientist Should Know About Floating-Point Arithmetic. ACM Computing Surveys, March 1991. [11] A. Greenbaum. Iterative Methods for Solvong Linear Systems. Frontiers in Applied Mathematics. Vol 17. SIAM. Philadelphia, PA, 1997. [12] S. M. Grimes, R. C. Pacanowski, M. Schmidt, and V. Balaji. T h e Explicit Free Surface Method in the G F D L Modular Ocean Model. Submitted to Monthly Weather Review, 1999. [13] J. J. Hack, J. M. Rosinski, D. L. Williamson, B. A. Boville, and J. E.

317

Truesdale. Computational Design of NCAR Community Climate Model. Parallel Computing, 21:1545, 1995. Y. He and C. H. Q. Ding. Numerical Reproducibility and Stability/NERSC Homepage. See web page at http://www.nersc.gov/research /SCG/ocean/NRS. N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM Press, Philadelphia, PA, 1996. W. Kahan. Further Remarks on Reducing Truncation Errors. Comm. ACM, page 40, 1965. D. E. Knuth. The Art of Computer Programming. Vol 2, Chap4, Arithmetic. Addison-Wesley Press, Reading, MA, 1969. D. Moore. Class Notes for CAAM 420: Introduction to Computational Science. Rice University, Spring 1999. See web page at http://www. owlnet.rice.edu/~caam420/Outline.html. The NCAR Ocean Model User's Guide. Ver 1.4. See web page at http://www.cgd.ucar.edu/csm/models/ocn-ncom/UserGuidel_4.html, 1998. R. C. Pacanowski and S. M. Grimes. MOM 3.0 Manual. GFDL Ocean Circulation Group, Geophysical Fluid Dynamics Laboratory, Princeton, NJ, September 1999. B. N. Parlett. The Symmetric Eigenvalue Problem. Classics in Applied Mathematics, 20. SIAM, Philadelphia, PA, 1997. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in Fortran: the Art of Scientific Computing. 2nd Edition. Cambridge University Press, Cambridge, UK, 1992. D. M. Priest. Algorithms for Arbitrary Precision Floating Point Arithmetic. On Properties of Floating Point Arithmetics: Numerical Stability and the Cost of Accurate Computations. Ph.D. Thesis. Mathematics Dept. University of California, Berkeley, 1992. R. D. Smith, J. K. Dukowicz, and R. C. Malone. Parallel Ocean General Circulation Modeling. Physica, D60:38, 1992. See web page at h t t p : / / www. acl .lanl.gov/climate/models/pop. Second International Workshop for Software Engineering and Code Design for Parallel Meteorological and Oceanographic Applications. Scottsdale, AZ, June 1998. Workshop on Numerical Benchmarks for Climate/Ocean/Weather Modeling Community. Boulder, CO, June 1999.

318

Parallelization of a GCM using a Hybrid Approach on the IBM SP2 STEVEN COCKE Department of Meteorology/COAPS, Florida State University Tallahassee, Fl 32306-304, USA

ZAPHIRIS CHRISTIDIS IBM T. J. Watson Research Center Yorktown Heights, New York 10598, USA Abstract Recently Florida State University has acquired a large IBM SP2 distributed SMP cluster. The cluster currently has 42 nodes, with each node having 4 CPUs. Previously our GCM has been parallelized using either MPI or OpenMP. In order to maximize performance of our GCM on this new machine, we employ a hybrid approach, where MPI is used across the nodes, and OpenMP is used within the node. Our preliminary finding is that the hybrid approach scales better than the pure MPI approach.

1. Introduction Florida State University (FSU) has recently established a School for Computational Science and Information Technology and has begun a major undertaking to upgrade its computing facilities. A decision was made to purchase a large IBM SP cluster, to be installed in two phases. The first phase, now complete, includes a cluster of 42 nodes. Each node is a so-called "Winterhawk II" node consisting of 4 375 MHz Power3 CPUS, with 2 GB RAM and 36 GB scratch disk space. The second phase, due to be complete by the end of 2001, will bring the newer Power4 architecture. The exact configuration is yet to be determined, but the goal is to have at least 768 CPUs. Previously, the FSU GCM has been parallelized to run on an older IBM SP2, with 1 CPU per node, using MPI and on an SGI Origin 2000 using OpenMP 1 . We have developed a substantially improved GCM, which employs both MPI and OpenMP approaches, as well as a hybrid MPI/OpenMP approach. In Section 2 we give a brief description of the model. We discuss the parallelization approaches in Section 3. Performance results are given in Section 4 and some concluding remarks are given in section 5.

319

2. Model Description 2.1. The GCM The GCM is a global hydrostatic primitive equation model, and is designed for short range forecasting as well as climate simulations. The prognostic variables are vorticity, divergence, virtual temperature, moisture and log surface pressure. The model uses the spectral technique in the horizontal direction, and second order finite difference in the vertical. The wave number truncation used ranges from T42 to T126 for climate simulations and from T126 to T170 or higher for short and medium range forecasting. A a coordinate is used in the vertical. The model physics include long and shortwave radiation, boundary layer processes, large scale precipitation, shallow and deep cumulus convection. The GCM physics routines have been developed to adhere to the plug-compatible guidelines developed by the U.S. Modeling Infrastructure Working Group. As a result, the model has available to it a large array of selectable physical parameterizations. Currently six cumulus parameterizations have been implemented, along with three radiation packages and a couple of boundary layer schemes. Most of the NCAR CCM 3.6 atmospheric physics has been incorporated as an option in the model. In addition to the current simple land surface scheme, two detailed land biosphere schemes are in the process of being implemented: BATS and SSiB. We have coupled the model to three ocean models: a version of the Max Planck HOPE model, the MICOM and HYCOM ocean models.

2.2 Benchmark

Configuration

The benchmark GCM is a T126 resolution model (384 by 192 grid points) with 27 levels in the vertical. The physical parameterizations used are the CCM 3.6 atmospheric physics. The land surface scheme is a simple land surface scheme with 3 soil temperature layers, prescribed (climatological) albedo and ground wetness. The sea surface temperatures are also prescribed.

3. Parallelization Approach We briefly describe the computational the organization of the model. Initially, the spectral prognostic variables and their derivatives are transformed to grid, one latitude at a time. One routine computes the nonlinear dynamical tendencies for each latitude, and another routine computes the physical tendencies. After the tendencies have been computed, they are spectrally analyzed, one latitude band at a time. These tendencies are accumulated from one band to the next. The spectral and grid point calculations are thus done in one large loop across the latitudes. Once the spectral tendencies have been summed up, they are then used in a semi-implicit algorithm to obtain the prognostic spectral variables at the next time step. The process is then repeated until the forecast is complete.

320 Our parallelization approach uses a simple one dimensional partitioning of the latitudes. Each partition is distributed across the nodes. The partitions are further repartitioned across CPUs within the node. Each subpartition may computed by an MPI task or OpenMP thread. For MPI, a reduction is done by using the MPI_Allreduce library function. For the OpenMP approach, a simple summation loop is used for reduction. Serial "overhead" includes solving the semi-implicit algorithm and array initialization. Load balancing is achieved by distributing the latitudes evenly geographically across each subpartition. In other words, each partition (CPU) has latitudes that are evenly distributed from North to South Pole. This is done by using a mapping function which maps physical latitudes onto logical latitudes prior to partitioning. In the next Section we will compare the pure MPI approach, where we use MPI both within the nodes as well as across the nodes, to the hybrid approach, where MPI is used across the node but OpenMP is used within the nodes.

4. Performance Issues and Results We show in Figure 1 the average time (wall clock) to complete a single time step for the pure MPI approach and the hybrid MPI/OpenMP approach. We averaged over a large number of time steps, but we excluded from our timings those time steps in which I/O was done, since I/O performance was heavily dependent on system load due to other users and processes. Shown are configurations ranging from 2 to 24 nodes, and from 1 to 4 CPUs per node (for a range of 2 to 48 total CPUs). Note that not all possible configurations were run. It can readily be seen that the hybrid approach scales better, particularly for larger number of CPUs per node. For the pure MPI approach, increasing the number of CPUs per node from 3 to 4 does not result in decreased wall clock time. For 24 nodes, going from 1 to 2 CPUs per node also leads to increased wall clock time for the pure MPI approach. In contrast, for the hybrid approach, wall clock times always decrease with increasing number of CPUs per node. A more direct comparison of the two approaches can be seen in Figure 2. In all cases, the hybrid approach outperforms the pure MPI approach. There are several factors that can be identified which affect the performance (and scaling) of the GCM. Among them are: load balance, memory bus bandwidth (and cache), inter- and intra-node communication, and other serial "overhead". We discuss some of these factors below: Load Balance. As mentioned earlier, load balancing is achieved by distributing the grid computations so that each CPU has latitudes that are evenly distributed geographically from the North Pole to the South Pole. Tropical latitudes are typically more expensive to compute due to the more frequent calls to the cumulus parameterization scheme. For a relatively modest number of latitudes per processor, this load balancing scheme is effective. For a 24 CPU run (8 latitudes per CPU), for example, we find that the difference between the maximum and minimum computation time per latitude loop is on the order of 1 to 2 percent. When the number of latitudes per

321 Time to Complete One Time Step MPIOnly

I

Time to Complete One Time Step MPI &OpenMP

I

Number of Nodes

Figure 1: Timings for (top) MPI-only versus (bottom) Hybrid parallelization approach

322

Time to Complete One Time Step MPI vs. Hybrid MPI/OpenMP

• a

MPI only Hybrid MPI/OpenMP

! 4

6

8

12

16

24

Number of Nodes using 2 CPUs per Node

Time to Complete One Time Step MPI vs. Hybrid MPI/OpenMP

• a

MPI only Hybrid MPI/OpenMP

! 4

6

8

12

16

24

Number of Nodes using 4 CPUs per Node

Figure 2: Comparison of MPI-only versus Hybrid parallelization approach for (top) 2 CPUs per node and (bottom) 4 CPUs per node.

323 CPU approaches unity, however, we expect load balancing to be a more serious issue. Memory Bus Bandwidth Limitations. On a shared memory architecture, memory bus contention can have a significant performance penalty, if all the CPUs are accessing the memory at the same time. To estimate this performance degradation, we measured the time to compute one latitude loop using differing number of CPUs per node. In Figure 3 we show the latitude loop time for 2,4,6 and 8 CPUs, using one CPU per node ("distributed" approach) as well as 2-4 CPUs per node on 2 nodes ("SMP" approach). For an equal number of CPUs, it can be seen that using fewer CPUs per node results in better performance than using fewer nodes with more CPUs per node. In particular, for the 8 CPU configuration, it can be seen that using 4 CPUs per node (with 2 nodes) results in a 20 to 30 percent performance penalty compared to using 1 CPU per node (with 8 nodes). These timings, which do not include communication, were nearly the same for both the pure MPI and hybrid approaches. Since the latitude loop timings include any OpenMP overhead (used in the hybrid approach), we find that this overhead is negligible compared to the overall latitude computations. Communication and Reduction. After the latitude loop calculations are done, a reduction is used to accumulate all the spectral tendencies, which are then used in the semi-implicit time integration algorithm. For the MPI approach, this reduction is done using the MPI_Allreduce library function, and this is where virtually all the communication takes place in the model, save for some less frequent model diagnostic model output. For the hybrid approach, intranode reduction is done by a simple summation loop. We show the reduction timings in Figure 4. For one CPU per node, the MPI reduction scales very well. However, at 4 CPUs per node, the reduction exhibited substantially poorer performance. With 4 nodes (and 4 CPUs per node), the performance was especially poor. The internode communication with MPI performs very well across the high performance SP switch. However, intranode communication was problematic with the MPI library used here. For the hybrid approach, the reduction time depends on summation time after the OpenMP thread calls within the nodes plus the MPI reduction time across the nodes (equivalent to that for 1 CPU per node). The OpenMP reduction time, shown in Figure 4 (bottom), was substantially less than that for intranode MPI, and scales linearly with the number of CPUs as expected. We note that the MPI library did not use shared memory. A shared memory MPI library is now available on the IBM SP2, and we expect these reduction times to be reduced dramatically.

5. Conclusions and Future Work We have effectively parallelized our new GCM using pure MPI, OpenMP, and hybrid parallelization approaches. We have found that the hybrid approach scales best, with decreasing wall clock times up to the largest number of CPUs used. The pure MPI approach exhibited performance-limiting bottlenecks, particularly for configurations which involved 3-4 CPUs per node. The primary bottleneck is due to the

324

Time to Complete Latitude Loop Distributed vs. SMP performance

i

I

2 nodes, 1-4 CPUs per node 2-8 nodes, 1 CPU per node

i i 4

6

Number of CPUs

Figure 3: Distributed versus SMP

325

MPI Reduction Time Per time step

a

>.

1.5

i -

6

8

12

Number of Nodes

SMP (OpenMP) Reduction Time Per Time Step

|

2 nodes, 1-4 CPUs per node

-

0.15

0.1

-

0.05

1

1

1

Number of CPUs

Figure 4: Reduction times for (top) MPI and (bottom) OpenMP.

326 reduction step, in particular the intranode MPI communication. With a shared memory MPI, we will likely find that this reduction time will be substantially reduced, and the advantage of the hybrid approach to the pure MPI approach diminished. Memory bus contention was found to be a performance bottleneck for both approaches. This could be substantially improved by using cache blocking to more effectively utilize the cache. The older FSU GCM was optimized heavily to make better use of cache, but the physics routines used here were primarily developed on vector machines. Scalability is also limited by some serial code in the model that has not been optimized. Previously the serial portions of the code took a negligible fraction of the time on older machines with fewer CPUs. But with a large number of CPUs, even initializing arrays becomes a sizable serial overhead. Load balancing did not appear to be a significant problem. However, for larger number of CPUs than used here, we will need to revisit this and other issues, such as using a transpose Fourier-Legendre Transform 2 .

Acknowledgments We would like to thank the FSU School of Computational Science and Information Technology (CSIT) for allowing us to use the supercomputing facility. We would also like to thank J. Curtis Knox for his invaluable help in system administration.

6. References 1. Cocke, Steven and Zaphiris Christidis: Parallelization of the FSU Global and Regional Spectral Models on Shared and Distributed Memory Architectures, Proceedings from the Eighth Workshop on the Use of Parallel Processors in Meteorology, November, 1998, ECMWF, Reading, U.K. 2. Foster, I.T., and Worley, P.H., Parallel Algorithms For the Spectral Transform Method, Tech. Report ORNL/TM-12507, Oak Ridge National Laboratory, Oak Ridge, TN, USA, April 1994.

327

Developments in High Performance Computing at Fleet Numerical Meteorology and Oceanography Center Kenneth D. Pollak and R. Michael Clancy Models and Data Department Fleet Numerical Meteorology and Oceanography Center 7 Grace Hopper Ave., Stop 1 Monterey, CA 93943-5501 USA ABSTRACT The Navy's Fleet Numerical Meteorology and Oceanography Center (FNMOC) in Monterey, California, is in the process of replacing its two Cray C90 computers with a new system. A phased approached over the next five years will bring the most current technology to bear on FNMOC's responsibilities for global and regional coupled air-sea modeling. To accomplish this, an SGI Origin 2000 was purchased in 1999 to serve as a platform for porting the existing FNMOC vector codes to a scalable architecture. The Origin 2000 will be supplemented in the fall of 2000 with an Origin 3000 system. This follow-on system will achieve Initial Operational Capability in early 2001 and allow removal of the C90s. Another upgrade in the fall of 2001 will bring about Full Operational Capability and achieve the first performance level goal of 100 GFLOPS sustained on major FNMOC applications. Future performance targets for the project are 200 GFLOPS sustained in 2003 and 400 GFLOPS sustained in 2005.

328

1.

Introduction

The U.S. Navy's Fleet Numerical Meteorology and Oceanography Center (FNMOC), located in Monterey, CA, is the lead activity within the U.S. Department of Defense (DoD) for numerical weather prediction and coupled air-sea modeling. FNMOC fulfills this role through means of a suite of sophisticated global and regional meteorological and oceanographic (METOC) models, extending from the top of the atmosphere to the bottom of the ocean, which are supported by one of the world's most complete real-time METOC data bases. Fleet Numerical operates around-the-clock, 365 days per year and distributes METOC products to military and civilian users around the world, both ashore and afloat, through a variety of means. High performance computing is a key ingredient in FNMOC's operation. In this paper we review plans and progress for upgrading high performance computing capabilities at FNMOC, and summarize the associated METOC modeling plan. 2.

Background

Two Cray C90s comprise the current operational high performance computing platforms at FNMOC. Programmatically, these machines and their software are referred to as the Primary Oceanographic Prediction System - 2 (POPS2). Planning for replacement of POPS2 with a modern scaleable system began in 1997 with the development of the Systems Integration Master Plan for the POPS2 Upgrade (POPS2-U). Fundamental requirements for the Full Operational Capability (FOC) implementation of POPS2-U, to be achieved in 2001, included: •

100 GFLOPS sustained on major FNMOC models

329

• • • • •

128 GBytes of memory 5 TBytes of disk storage 160 TBytes of near-line storage 1 TByte of data throughput per 12-hour watch no single point of failure

with support for a Multi-Level Security (MLS) system identified as highly desirable. In addition, future performance upgrade requirements for the system of 200 GFLOPS sustained in 2003 and 400 GFLOPS sustained in 2005 were also specified. The vendor solicitation for the POPS2-U contract was initiated in February of 1999. A benchmark suite, consisting of the Navy Operational Global Atmospheric Prediction System (NOGAPS; Hogan and Rosmond, 1991), the Navy Atmospheric Variational Data Assimilation System (NAVDAS; Daley and Barker, 2000), and the Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS; Hodur, 1997), was released to vendors in the spring of 1999. Vendor proposals and benchmark results were evaluated during the summer of 1999, and contract award to SGI was announced in September of 1999. 3.

POPS2-U Hardware Plan

Under the terms of the POPS2-U contract, SGI delivered a 128 processor Origin 2000 with 128 GBytes of memory to FNMOC in October of 1999. This platform was intended mainly as a transitional system. It allowed completion of the porting of FNMOC's vector codes to MPI, optimization of these codes for the SGI architecture, implementation of the required data management functionality, testing of job scheduling and software configuration management mechanisms, and the construction of a "pseudo ops run" to simulate operations at the point of Initial Operational Capability (IOC). The availability of this machine at FNMOC proved to be invaluable for the smooth transition of operations from the old Cray vector architecture to the new SGI NUMA architecture.

330

In September of 2000, FNMOC announced the purchase of SGI Origin 3000 systems as the follow-on operational POPS2-U platforms. The Origin 3000 boasts the largest single-kernel, sharedmemory image available, with up to 512 processors and a TByte of memory. Innovative "snap-together" modularity, based on seven types of hardware "bricks", provides unprecedented flexibility for system configuration and allows clustering to tens of thousands of CPUs. See www.sgi.com for more information about the Origin 3000 line.

jwderhorn (840 Drivsa J490 Drives

Figure 1. High-level view of the FNMOC systems configuration at POPS2-U Initial Operational Capability (IOC) in February 2001.

331

The FNMOC IOC system, to be delivered in November 2000, is a 128 processor Origin 3000 with 128 GBytes of memory. This system will run the SGI Trusted IRIX (TRIX) operating system in support of MLS to meet FNMOC's unique security requirements, and incorporate SGI Storage Area Network (SAN) technology. The IOC configuration, shown in Figure 1, will also incorporate the 128 processor Origin 2000 as an operational platform. This configuration will allow hosting of the full FNMOC ops run on the SGI machines and removal of the Cray C90s at IOC in February of 2001. The POPS2-U FOC configuration, to be achieved in September of 2001, is shown in Figure 2. This configuration reflects the addition of a 512 processor Origin 3000 with 512 GBytes of memory to achieve the FOC requirement of 100 GFLOPS sustained on major FNMOC models. The 128 processor Origin 3000 is also retained at FOC. The two Cray J90s, which are a holdover from the Cray C90 configuration, are replaced at FOC with two 12 processor Origin 3000s (see Figures 1 and 2). Upgrades in the form of additional and/or faster processors will be made to the POPS2-U system to achieve 200 GFLOPS sustained in 2003 and 400 GFLOPS sustained in 2005.

332

Figure 2: High-level view of the FNMOC systems configuration at P0PS2-U Full Operational Capability (FOC) in September 2001. 4.

POPS2-U Modeling Plan

The atmospheric Multi-Variate Optimum Interpolation (MVOI) system (Barker, 1992) will be transitioned to become operational on POPS2-U at IOC. However, MVOI will be progressively replaced by the Navy Atmospheric Variational Data Assimilation System (NAVDAS) described by Daley and Barker (2000). By POPS2-U FOC, NAVDAS will become the essential atmospheric data assimilation component for the FNMOC air-sea modeling system. Thus, it will directly support NOGAPS and CO AMPS, and indirectly

333

the related ocean models discussed below. NAVDAS, based on the 3D-VAR technique, will set the stage for advancement to a 4D-VAR system at some point beyond POPS2-U FOC. The Optimum Thermal Interpolation System (OTIS) ocean thermal analysis (see Cummings et al, 1997) will be transitioned to full operations on POPS2-U at IOC for both global and regional applications. By POPS2-U FOC, OTIS will be replaced with the Ocean Multi-Variate Optimum Interpolation System (Ocn_MVOI) of Cummings and Phoebus (2001). OcnJVIVOI, which represents a merger of the Modular Ocean Data Assimilation System (MODAS) and COAMPS Ocean Data Analysis (CODA) technologies (see Cummings et ah, 1997), will perform all global and regional ocean thermal, wave and sea-ice analyses on POPS2-U. Thus, Ocn_MVOI will provide the crucial ocean data assimilation capability required by the atmospheric and oceanographic models that comprise the core of the FNMOC air-sea modeling system, and represent a key enabling technology for this system. Beyond POPS2-U FOC, Ocn_MVOI will be upgraded to a variational system and integrated with NAVDAS to provide a complete coupled air-sea data assimilation capability. NOGAPS (Hogan and Rosmond, 1991), and the associated Ensemble Forecast System (EFS; Pauley etai, 1996), will be transitioned to full operations on POPS2-U at IOC. By FOC, NOGAPS resolution will be increased to about 50 km grid spacing in the horizontal with 48 levels in the vertical, and its MVOI analysis will be replaced with NAVDAS. Beyond FOC, NOGAPS resolution will increase to about 27 km with 60 levels and the model will evolve progressively into a fully two-way interactive global air-sea system via coupling with the ocean models described below. As always, various other incremental improvements will be made to progressively improve the skill of the model. COAMPS (Hodur, 1997) will be transitioned to full operations on POPS2-U at IOC. Additional and expanded COAMPS areas will be implemented, and a 6-hour update cycle will be activated for all areas.

334

By FOC, the COAMPS MVOI analysis will be replaced with NAVDAS. Beyond FOC, the ocean models described below will be progressively integrated into COAMPS with closer and closer coupling to achieve a fully two-way interactive regional air-sea prediction system. Other incremental improvements will be made to increase the skill of the model. The GFDN tropical cyclone model (Kurihara et ah, 1995) will be transitioned to full operations on POPS2-U at IOC. Beyond POPS2U FOC, GFDN will be coupled with an underlying ocean model to provide for two-way air-sea interaction. The Thermodynamic Ocean Prediction System (TOPS) mixed layer model (Clancy and Pollak, 1983) will be transitioned to full operations on POPS2-U at IOC for both global and regional applications. By POPS2-U FOC, or shortly thereafter, the global TOPS will be replaced by a global eddy-resolving implementation of the Parallel Ocean Prediction (POP) model (Dukowicz and Smith, 1994), which will be the Navy's first operational implementation of a global full-physics ocean circulation model. POP will couple closely with both NOGAPS and Ocn_MVOI. In addition, it will provide lateral boundary conditions for regional implementations of the Navy Coastal Ocean Model (NCOM; Martin, 2000), that will, in turn, couple closely with regional implementations of COAMPS and OcnJVIVOI. Beyond FOC, regional TOPS will be replaced by regional NCOM, and POP may be replaced with a global implementation of NCOM. The Third-Generation Wave Model (WAM; see WAMDI Group, 1988) will be replaced at POPS2-U IOC with the WaveWatch III model (Tolman, 1999) for both global and regional implementations. The global WaveWatch will be coupled closely with NOGAPS and provide lateral boundary conditions for the regional WaveWatch runs. Regional WaveWatch will be implemented for each COAMPS area and coupled closely with COAMPS. Beyond POPS2-U FOC, the resolution of both the global and regional implementations of

335

Wave Watch will be increased and a data assimilation capability will be added via Ocn_MVOI. The Polar Ice Prediction System (PIPS) sea ice model (Cheng and Preller, 1992) will be transitioned to full operations on POPS2-U at IOC. Shortly after POPS2-U FOC, an upgraded version of PIPS will be implemented that will couple with NOGAPS, POP and Ocn_MVOI. The NOGAPS/PIPS coupling will account for the important two-way interactions that occur between the atmosphere and the underlying sea-ice field. 5.

Summary

FNMOC is replacing its aging Cray C90 computers with SGI Origin 3000 systems. Initial Operational Capability on the new systems and retirement of the C90s is on schedule for February 2001. Follow-on performance upgrades and technology refreshment of the Origin 3000 systems will take place at regular intervals over the next five years. This will enable higher resolution and more physically complete and accurate air-sea models, leading ultimately to improved products for customers. 6.

Acknowledgments

The authors and the Fleet Numerical Meteorology and Oceanography Center are supported through the operational Naval Meteorology and Oceanography Program under the sponsorship of the Commander, Naval Meteorology and Oceanography Command, Stennis Space Center, Mississippi, USA. 7.

References

Barker, E.H., 1992: Design of the Navy's Multivariate Optimum Interpolation analysis system. Weather and Forecasting, 7, 220-231.

336

Cheng, A. and R.H. Preller, 1992: An ice-ocean coupled model. Geophysical Research Letters, 19, 901-904. Clancy, R.M. and K.D. Pollak, 1983: A real-time synoptic ocean thermal analysis/forecast system. Progress in Oceanography, 12, 383-424. Cummings, J.A., C. Szczechowski, and M. Carnes, 1997: Global and regional ocean thermal analysis systems. Marine Technology Society Journal, 31, 63-75. Cummings, J.A. and P. Phoebus, 2001: Description of the threedimensional ocean multivariate data assimilation system. Technical Report, Marine Meteorology Division, Naval Research Laboratory, Monterey, CA 93943-5502, (in preparation). Daley, R. and E. Barker, 2000: NAVDAS Source Book 2000. Technical Report NRL/PU/7530-00-418, Marine Meteorology Division, Naval Research Laboratory, Monterey, CA 93943, 153 pp. Dukowicz, J. K. and R. D. Smith, 1994: Implicit free-surface method for the Bryan-Cox-Semtner ocean model. Journal of Geophysical Research, 99,7991-8014. Hodur, R., 1997: The Naval Research Laboratory's Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS). Monthly Weather Review, 125, 1414-1430. Hogan, T. and R. Rosmond, 1991: The description of the Navy Operational Global Atmospheric Prediction System's spectral forecast model. Monthly Weather Review, 119, 1786-1815. Kurihara, Y., M.A. Bender, R.E. Tuleya, and R.J. Ross, 1995: Improvements in the GFDL hurricane prediction system. Monthly Weather Review, 118, 2186-2198.

337

Martin, P. J., 2000: A description of the Navy Coastal Ocean Model Version 1.0. Technical Report, NRL/FR/7322-00--9962, Oceanography Division, Naval Research Laboratory, Stennis Space Center, MS 39529,40 pp. Pauley, R., M.A. Rennick, and S. Swadley, 1996: Ensemble forecast product development at Fleet Numerical Meteorology and Oceanography Center. Tech Note, Models Department, FNMOC, Monterey, CA 93943-5501. Tolman, H.L., 1999: User manual and system documentation of WAVEWATCH-III version 1.18. Technical Note No. 166, Ocean Modeling Branch, National Centers for Environmental Prediction, National Weather Service, National Oceanic and Atmospheric Administration, Camp Springs, MD 20746, 110 pp. WAMDI Group, 1988: The WAM model - A third-generation ocean wave prediction model. Journal of Physical Oceanography, 18, 1775-1810.

338

THE COMPUTATIONAL PERFORMANCE OF THE NCEP SEASONAL FORECAST MODEL ON FUJITSU VPP5000 AT ECMWF HANN-MING H E N R Y JUANG Climate Prediction Center, National Centers for Environmental 5200 Auth Road, Camp Springs, MD 20746, USA E-mail: [email protected]

Prediction

MASAO KANAMITSU Climate Prediction Center, National Centers for Environmental 5200 Auth Road, Camp Springs, MD 20746, USA E-mail: [email protected]

Prediction

The US National Centers for Environmental Prediction (NCEP) Seasonal Forecast Model (SFM) has been designed with multi-threads and message passing interface (MPI) capabilities, which can run on heterogeneous machines of single-node or multi-node with single-processor or multi-processor. For flexibility, 1-D and 2-D decompositions are implemented for MPI. With 1-D decomposition, the program can utilize any number of processors or nodes, while with 2-D decomposition, it can maximize the use of given number of processors. Two resolutions, T62 and T320 with 1-D and 2-D decompositions, are used to examine the performance of the model on Fujitsu VPP5000 at the European Center for Medium-Range Weather Forecasts (ECMWF). The results indicate the model codes are 96% to 99% parallel. The range of percentage depends on the resolution and choice of decomposition. One significant result, which is different from the results performed on U3MSP, is that the 1-D decomposition performs better than the 2-D decomposition. In some cases, performance of 1-D decomposition is found to be even better than that of 2-D decomposition with larger number of node. The 1 -D decomposition contains longer length of array than the 2-D decomposition. In scalar machine, such as IBM-SP, the length of array plays no role in the performance, but in the vector machine, such as VPP5000, the longer the array length, the higher the performance. This is consistent with the finding that the 1-D decomposition is more efficient than the 2-D decomposition on VPP500.

1

Introduction

The improvements of computational power are benefit to numerical modeling of all fields, meteorology is not an exception. Numerical weather forecasts and climate predictions require super-computers to perform large number of integration in a limited time. During the last decades, super-computer has evolved from singleprocessor scalar machine, vector machine, multi-processor machine, to multi-node multi-processor machine. In order to take full advantage of the latest supercomputers, most of the application programs need to be modified. For the vector machine, designing the innermost loop to assure independence of array is important.

339 While for the multi-processing machine, the designing of the outer-most loop to concurrently use multiple processors is essential. And lastly for multi-nodes cluster machines, the decomposition of the entire numerical model to multi-section of data is the basic techniques. At the National Centers for Environmental Prediction (NCEP), the Seasonal Forecast Model (SFM) is used to make operational seasonal prediction up to 7 months. The forecast system is consisted of ocean model, atmospheric model and data assimilation. Currently, two-tier approach is used to couple these two models. The atmospheric model prediction is based on 20-member ensemble forecasts, with resolution of T42L28, once in a month, and accompanied by 10-member 20-year hindcasts of the same month. The amount of integration corresponds to about 7-year integration per day. In order to increase the resolution and update the physics of the model, it is most critical to improve of the numerical performance. Furthermore, in order to take full advantage of the available computer resources within and outside NCEP, incorporating the portability and flexibility of the model to run on heterogeneous machines is considered to be essential. In this report, the design of hybrid portable and flexible SFM will be illustrated in section 2. The dependence of performance on decomposition method on scalar and vector cluster machines will be presented in section 3. The discussion and future concerns will be given in the section 4. 2 2. /

The NCEP Seasonal Forecast Model (SFM) The element of SFM

The NCEP seasonal forecast model (SFM) is evolved from the operational CRAY version of the NCEP medium range forecast (MRF) [1] model. SFM is a global spectral model with hydrostatic primitive equation on sigma coordinate. It uses spherical harmonic function, with Fourier series in east-west direction and associated Legender function in north-south direction, for representation of fields in horizontal, and finite difference representation in vertical. Semi-implicit scheme for gravity wave term with leap-frog scheme are used for time integration. The model physics includes short- and long-wave radiation with diagnostic cloud, non-local PBL physics with surface and soil layer physics, gravity-wave drag, shallow and deep convection with grid-scale precipitation, and a simple hydrology over land. 2.2

The hybrid-code with MP1 implementation

The design of the CRAY version code is based on the vectorization and multitasking. In order for portability, the original coding structure is kept and the new

340

options are added, so that the code can run efficiently on old computer systems as well as on new ones. The optimization for vector processor is kept and the multitasking is modified to run on a multi-thread OpenMP environment. Since there are many differences in coding for a variety of machine architectures, it is necessary to use a tool to manage options. The hybrid-code was adopted for this purpose, and the C language pre-processing is used to manage the options. The code includes Cpreprocessing variables and the model is preprocessed to run on machines with single- or multi-cpus and single- or multi-node. This method has an advantage of minimizing the influence on its performance, while it may make the reading of the original coding a little difficult. The OpenMP is adopted for multi-threads environment, and the Message Passing Interface (MPI) [2] is added for running on multi node cluster computers. For the decomposition of the model for multi-node machines, a minimization of the code modification is one of the major concerns. For spectral computation, the computation requires entire grids in some particular dimension, and the decomposition with area-overlay is not practical. Therefore, the transpose method is used. Besides input and output, all transposes are required only for spectral transformation. The good load-balance is established by using local symmetric distribution of grids. The reproducibility of computation with an arbitrary number of processors is also taken into account. 2.3

The comparison between 1-D and 2-D decompositions

The 1-D and 2-D decompositions are built in to the SFM. The 1-D can be used with any number of processor, but up to a number of the smallest dimension among all directions and all spaces. The 2-D can be used up to a number of the product of two smallest dimensions in all directions and all spaces, but it cannot be used with prime number of processors. There are two types of decomposition in the 1-D, one is to slice in east-west direction and the other is to slice in north-south direction, both without decomposition i vertical direction. On the other hand, there are three types of decomposition in the 2-D, one is to slice in east-west and north-south directions, another is to slice in east-west and vertical directions, the last is to slice in northsouth and vertical directions. The transpose is required while the computational dependency changes from one direction to another. Table 1 shows a comparison of various elements between the 1-D and 2-D decomposition when transforming from spectral to grid-point space. This example is from the model with spectral truncation of T62 and 28 vertical layers running on 12processor computer. For 1-D, it requires only one transpose from Legendre transform array configuration to the Fourier transform configuration. For 2-D decomposition, it requires three transposes. First one is before Legendre transform that has a north-south dependency. Another one is before Fourier transform that has east-west dependency. And the last is after the Fourier transform to transpose to

341

vertical column configuration. For 1-D decomposition, all the 12 processors communicate each other and each processor sent out 11/12 of data. For 2-D, communication takes place only within a subgroup of processors and therefore less data will be sent out by each processor. For the first and the third subgroups, communication takes place among 3 processors, each processors sending out 2/3 data. The second subgroup communicates among 4 processors, each processor sending out 3/4 data. The maximum number of processors used in 1-D is 31, and that in 2-D is 868. Since 1-D decomposition is sliced only in one direction, the computational direction contains longer length of array than that of 2-D. In the gridpoint space, 1-D has array length of 192, but 2-D has only 64. Table 1. Comparison between 1-D and 2-D decompositions in case of T62L28 layers with 12 processors. From Spectral to 1-D 2-D Grid-point space (1x12) (3x4) Transpose calls 1 3 Communication 12 3 4 3 Data sent out 11/12 2/3 3 / 4 2 / 3 Max # of processor 31 31x28=868 Any number of processor? Yes Non-prime number Length of array 192 64 It is an advantage to incorporate 1-D and 2-D decompositions in the model since each of the decompositions has its own advantages and disadvantages as discussed above. We will demonstrate this in the next section. 3 3.1

The performance on VPP5000 The consideration of porting NCEP SFM to VPP5000

The hybrid SFM code has been tested over many different machines, including Cray J90, Cray T90, DEC, SGI, LINUX, T3E, IBM-SP and SGI origin2000, but has not been tested in any vector cluster machine, such as VPP5000 due to its unavailability in United States. With mutual agreement between ECMWF and NCEP, the SFM has been ported and tested on the VPP5000. The portability of the SFM was found to be high and only minor modifications were needed to run SFM on VPP5000. Since this project is primarily to compare the performance of the model on different machines with minimum of effort, no effort to optimize the code for VPP5000 was made. Note that the code is not optimized for IBM-SP since it still retains the vectorization as noted earlier. The compile options are also simple to only consider 64-bit computation and linking with MPI system

342

library. The portable FFT (Fast Fourier Transform) is used instead of the VPP5000 system FFT. Two model resolutions, T62L28 and T320L42, are used with different number of processors (NPES). T62L28 is the planned SFM resolution to be implemented into operation in Spring of 2001, and its performance on IBM-SP is already available. The T320L42 is used to test the performance of 1-D verses 2-D decompositions, because of its much longer array length and also it was used as an operational model at ECMWF. 3.2

The results with T62L28

The results from T62L28 are summarized in Table 2. 16 different numbers of processors with two kinds of decomposition comprise 25 experiments (note that prime number of processors cannot be used for 2-D decomposition). The time spent in seconds for both 1-D and 2-D shown in the third column. The 1-D runs show the computation speeds up as the number of NPES increases except for the case of NPES=14. The same speed up is obtained for 2-D decomposition except NPES=16. We don't know the reason for the exceptions. The comparison of 1-D and 2-D for the same number of NPES clearly showed that 1-D decomposition is faster than 2-D in all the ranges of NPES. Table 2. Wall-time spend for one-day forecast of NCEP SFM with T62L28 on VPP5000 NPES COLUMN x ROW Seconds 2-D 1-D 2-D 1 230.

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

1x2 1x3 2x2

1x4

68.3

1x5 2x3

1x6

47.9

1x7 2x4 3x3 2x5 3x4 2x7 3x5 4x4

1x8 1x9 1x10 lxll 1x12 1x13 1x14 1x15 1x16

37.3 35.5 32.3 29.7 24.6 24.1 24.6

1-D 110. 75.1 62.1 50.0 42.3 36.9 33.3 30.3 29.1 28.2 25.3 22.6 24.4 22.3 20.8

343 SPEEDUP

Figure 1. The speedup is obtained from NCEP SFM on the VPP5000 at ECMWF with resolution of T62L28.

SPEEDUP

x mu-sp

/-

T62L28-1D

+

IBM-SP TB2L28-2D(2X»)

•

IBM-SP T82L28-2D(3X*)

IS

/&- 14

A®'-

A IBM-SP T62L2B-2D(4X»)

- 12

^»^-

Oj/X

10

_ -

/ ^

7 *//^

/ \

1 1

i

i

i

i

i

i

10

i

i

12

i

i

14

i

8

i

16

Figure 2. The speedup is obtained from NCEP SFM on the IBM SP at NCEP with resolution of T62L28.

344

The speedup by the increase in numbers of processors can be compared with a theoretical speedup, which is defined as: S = (s + p ) / ( s + c + p / n )

(1)

where s is the time-spent for sequential portion of the code, p is the time-spent for the parallel portion of the code by running with one processor, c is the wall-time spent for communication, and n is the total number of processors used. Note that c will be a function of n and the way of decomposition. Fig. 1 shows the speedup (S) on VPP5000 with T62L28 resolution. The theoretical curves for 100% and 97% parallelization are plotted. It shows that the results from 1-D decomposition are very closely following the 97%-parallelized theoretical results. From the figure, it is clear that 2-D results, with 2x*, 3x*, and 4x*, are slower than 1-D for all cases, even for large number of NPES. Note that in some cases, 2-D decomposition is slower than 1-D decomposition run with two less NPES (compare triangle mark at NPES=16 and cross mark at NPES=14 that shows 1-D is faster with NPES=14 than 2-D with NPES=16). Apparently, this phenomenon is not found on the IBM-SP scalar machine as shown in Fig. 2. On the scalar machine, 1-D is not always faster than 2-D. Note also that about 99% of the computational time are in parallel portion of the code, compared to about 97% in the VPP5000. 3.3

The results from T320L42

From the previous section, it is found that 1-D is faster than 2-D for all cases on VPP5000, but not on IBM-SP. Most likely reason for this different behavior is that VPP5000 is a vector machine. 1-D decomposition creates longer array as discussed in Table 1 compared to 2-D decomposition. Therefore, the difference may become more significant if the model dimension is increased and if the vector processor efficiency does not fall off significantly for longer vector length. To examine this, the resolution of T320L42 is select to repeat the test. Because of the large memory requirement for the high resolution, the model can not be run with small number of processors, only NPES=12, 14, 16 and 18 are selected in this comparison (see Table 3). Figure 3 displays the speedup curve for this case. In plotting, s+c, and p in Eq. (1) is estimated from the speedup obtained from the experiment. From Table 3 and Fig. 3, it is clear that 1-D is still faster than 2-D decompositions. And 2x* is faster than 3x*, and 3x* is faster than 4x* in 2-D decompositions for all different NPES. It can be concluded that the longer the array length the shorter the time-spend for one day integration, because longer array is advantageous for VPP5000 vector machine, at least for the length of the array of the order of a few thousand.

345

NPES 12 14 16 18

Table 3. Wall-time spend for one-day forecast of NCEP SFM with T320L42 on VPP5000 COLUMN x R O W Length of array in x Seconds 1x12 1944 3808. 2x6 972 4544. 3x4 648 5024. 1x14 1944 3337. 2x7 972 3900. 1x16 1944 2914. 2x8 972 3434. 4x4 486 4278. 1x18 1944 2629. 2x9 972 3065. 3x6 3394. 648 SPEEDUP IB 18 14 12 10

a 8 4 2 2

4

6

IT

'16

-

1 2 ' 14

16

18

Figure 3. The speedup is obtained from NCEP SFM on the VPP5000 at ECMWF with resolution of T320L42. 4

Discussion and future concern

The NCEP SFM has been ported to VPP5000 at ECMWF. The VPP5000 compiler is found to be the fastest as compared to most of the system used by the authors. The turnaround of batch job is also quite reasonable. Without any

346

optimization for VPP5000, the SFM was about 50% vectorized and 97% parallelized with T62L28 resolution and 99% parallelized with T320L42, both with 1-D decomposition. This implies that poor vectorization is one of the weaknesses of this portable hybrid SFM code. As compared to the scalar machine such as the IBM-SP, VPP5000 is considerably faster and requires less processor resources to accomplish the same speed. For example, if 30 sec is a maximum wall-time to make one-day forecast (for seasonal prediction) with T62L28 resolution, it requires only 10 processors on VPP5000. In comparison, it requires 64 processors on IBM SP. The most significant difference found in this experiment between the scalar and the vector machines is the performance difference between 1-D and 2-D decomposition. The experiments with T62L28 and T320L42 on VPP5000 showed that the 1-D decomposition is always faster than 2-D. For the 2-D decomposition, smaller number of slices of array is faster than the larger number of slices. This all leads to the conclusion that the longer the array length (with less slices) the better the performance of the VPP5000 vector processors. In contrast, this is not the case in the scalar machine such as IBM SP. We conclude that 1-D decomposition is more advantageous than 2-D decomposition for VPP5000. For the high resolution, with T320L42, the 1-D decomposition still performed better than 2-D, indicating that the vector processor performance still increases with the array length of a few thousand. It is not clear how generally this conclusion applies. One possibility is that somewhat poorly vectorized SFM may exaggerate the advantage of the vector processor. If the SFM was fully vectorized and made the length of the vectors as long as possible, the results may not be as clear as this experiments. Another lesson we learned was that the T320L42 with 18 processors on VPP5000 requires 2629 sec of wall-time for making one-day forecast. This is apparently not fast enough for practical use. The major reason for this low performance is the very small time steps required due to the use of the leap frog scheme in advection. The use of semi-Lagrangian scheme is essential to make the model more practical in higher resolution configuration. 5

Acknowledgements

We thank David Burridge and Anthony Hollingsworth for their support on this inter-center activity, Umberto Modigliani and Dominique Lucas for their assistance in porting the model to VPP, and David Dent for his constructive suggestions to examine the model performance. One of the authors would like to thank Dr. Yoshihiro Kosaka of Fujitsu European Center for Information Technology who made this activity between NCEP and ECMWF possible.

347

References 1. 2.

Kanamitsu M., and co-authors. Recent changes implemented into the global forecast system at NMC. Weather and Forecasting, 6 (1991) pp. 425-435. MPI Forum. MPI: A message passing interface standard, April 1994

348

PANEL EXPERIENCE ON USING HIGH PERFORMANCE COMPUTING IN METEOROLOGY SUMMARY OF THE DISCUSSION Pam Prior European Centre for Medium-Range Weather Forecasts Shinfield Park, Reading/Berks., RG2 9AX, United Kingdom

1.

Introduction

Brief, controversial presentations were invited to open the discussions. The following statements were made, expanded upon and discussed, arousing widely varying degrees of controversy: •

OpenMP is a waste of time.

•

Can users afford to write beautiful code?

•

Coding for vectors is the best way of coding for scalar processors.

•

Weather forecasters do not need as much power as they claim.

•

There is a trend towards SMPs. Is this what users want?

•

Users should not have to concern themselves with optimisation.

2.

OpenMP is a waste of time

All the evidence to date shows that MPI-only applications perform better than those with a combination of MPI and OpenMP. Nevertheless, the speaker claimed that there were circumstances in which OpenMP could be advantageous. For instance, since OpenMP requires little tailoring, it can be used to prototype parallel applications and experiment with different parallelisation strategies. Its memory usage is very economical. In a comparison run on a J90, an experiment took 109 Mwords using macrotasking, 216 MW using MPI and 83 MW with OpenMP; the actual memory savings to be expected always depend on the specific application and

349 system used. OpenMP also appears to perform better than MPI in a heavily loaded system; since it can take advantage of dynamic scheduling of processors. Finally, the speaker was concerned that MPI (IFS) applications might not be scalable to large numbers of nodes because of the increasing cost of communication. There was some support for this view but a suggestion that, in order for Amdahl's law not to adversely affect OpenMP applications, it might be necessary to limit the number of processors per node to a maximum of eight, unless the application is extremely well parallelised. The ease of use of OpenMP is a danger, as directives can be inserted into code anywhere, even at low levels, with no consideration for the consequential overheads. There is a general belief in the theoretical potential advantages of a hybrid MPI/OpenMP model, primarily because of the memory bandwidth available within each SMP node. None of the sites represented at the workshop have, as yet, however, found a code which runs better with OpenMP in practice. 3.

Can users afford to write beautiful code?

The speaker, being an operational manager, was (violently) opposed to "beautiful" features in code which might degrade its performance. A conflict of interest between operational and research users quickly became apparent. While operational use requires optimal performance, research users, who are likely to share code with a large number of colleagues of varying degrees of computational skill, recognise that ease of use must take priority over absolute performance efficiency. At least interfaces to the outside world must be "elegant" (readable), even if they hide lower levels of "ugly" code. Several attendees pointed out that elegant code did not necessarily degrade performance; conversely ugly code is not inevitably efficient. It was, however, generally accepted that the responsibility of cleaning up old ugly routines, which nevertheless worked, was not to be undertaken lightly. A vendor representative stated that it was up to the users to discover the best ways of exploiting the systems available. This stance was largely accepted by user participants, who acknowledged that the market sector they represent was not large enough to warrant vendor investment in specialist solutions. Another vendor representative proposed that users commission compiler writers to develop nonstandard requirements and pay for the service. Users, however, believed that the availability of standard software was a greater priority. Ideally, there should a possibility of direct dialogue between compiler writers, language creators and users. In reality, the process is so time-consuming that few can afford such involvement.

350

4.

Coding for vectors is the best way of coding for scalar processors

This was a most uncontroversial statement, as there was universal agreement that currently, while the possibility of new vector machines cannot be definitely excluded, the risk in completely rewriting code to run optimally on scalar machines is too great. There was also general agreement that the performance of vector code on scalar machines was unlikely to be extremely efficient, although it does vary, depending upon the particular architecture of the scalar machine. The majority of the poor performance in an application is usually caused by a very small proportion of the code and it was generally agreed that the optimisation of those kernels, which are easily identifiable, was necessary and would be worth the time and effort expended. There are certain techniques which are not only good for both architectures, such as stride-one data access patterns and ensuring inner loops are unit stride, but critical for some scalar/cache architectures. There was universal agreement that code should be kept flexible, to keep the choice of architectures open. 5.

Weather forecasters do not need as much power as they claim

The speaker proposed that customers tended to use peak performance requirements as an indirect measure of throughput and that the peak performance was not actually needed. Once this restraint is removed, the use of clusters of microprocessor-based machines becomes viable. He expressed his belief that the HPC industry must reorganise itself collaboratively to develop open source standards and solve issues such as job scheduling and storage management. Similarly, development of different versions of UNIX should be discontinued and all manufacturers' resources combined into developing Linux. He maintained that collaboration had already been shown to succeed with such Internet developments as DNS. Each centre could employ computer experts to experiment with and collaboratively develop the open source material. There was very strong refutation of the claim that peak performance is not required; one centre requires 50 % of performance for its major operational application and the rest of the machine is taken up with time-critical tasks. It was universally agreed that support for job scheduling and partitioning was poor and that it was still not possible to manage a mix of operational and research work efficiently with ease. The speaker maintained that in reality budgetary limits were the major restricting factor on performance and microprocessor clusters could provide cost effective power. A user suggested that in the higher performance ranges two nTflops machines were unlikely to be less expensive than one 2nTflops machine. Another user pointed out that research in an operational centre was closely tied to its

351

operational activities: testing minor variants of the operational system or something that is well established as the next system to be implemented; there is little or no scope for "blue skies" research or research determined by the nature of the computing resources available. 6.

There is a trend towards clusters of SMPs. Is this what users want?

The first statement made was that users want an easy to use platform and are not concerned about the underlying architecture. Two vendor representatives agreed that economics determine the architectures available. Since clustered SMPs could be sold to a wide range of customers, using the same processors and technology for groups from two to hundreds, this was currently the market driven, preferred architecture. Users agreed that systems currently available are varieties of clustered SMPs, most handling vectors, some with caches, some with pseudo-vectors and some with real vector registers. Even the Fujitsu VPP could be considered an SMP, albeit a very tightly controlled one. A VPP user explained that the major advantage of the VPP5000 was that users did not have to be concerned with programming within an individual node, which was handled by compilers. The management of multiple nodes, shared file systems and distributed memory remains a major problem. It was suggested that operating system independent middleware would be the solution. A user replied that since manufacturers were unlikely to develop operating system independent software, a single programming model would suffice. However, it is impossible to define a definitive programming model, so it was proposed that the creation of standards, with high user participation and much interdisciplinary communication was the best compromise solution. However, the development of sound standards requires more time than manufacturers or users can afford. Again, the development of open-source middleware by Internet collaboration was proposed as the future solution. 7.

Users should not have to concern themselves with optimisation

The speaker claimed that there are a number of inherent optimisation features in the Fortran language which are not being exploited by current compilers, so users are having to compensate by taking account of the machines' architecture when writing their code. He believed that users should have the opportunity for direct discussions with compiler writers, to explain the programming structures they use and allow them to be taken into consideration in the design of compilers. There was much support for this viewpoint. One user believed that many features, such as pointers, had been included in Fortran 90 to make it look like an object-orientated language, but Fortran is not suited to this model. Such unnecessary inclusions have made Fortran 90 difficult to work with, so it is neglected by the Fortran community, resulting in no feedback to compiler writers and, therefore, no improvements. If

352

users want Fortran to survive, they must use it, promote it and press for the creation of a clean interface to Java and C++, to allow users to understand what an object is in an object-orientated language and obtain that object from Fortran. Eventually, the methods associated with those objects could perhaps be written in Fortran. The speaker was, however, concerned that the Fortran 2000 standard is already completely defined, very much as an object-orientated language, and there is no further room for discussion. Another user pointed out the difficulty in becoming involved, as standards committees could require very long-term involvement, which few users could afford. A previous speaker noted that with his proposed open-source Internet development methods, those who proposed the standards would also be those who wrote the compilers, which should be the most efficient way of ensuring the practicability of the standards defined.

353 LIST OF PARTICIPANTS

Dr. Mikko Alestalo

Finnish Meteorological Institute, P.O.Box 503, FIN-00101 Helsinki, Finland [email protected]

Dr. Mike Ashworth

CLRC Daresbury Laboratory, Warrington, WA4 4AD, United Kingdom [email protected]

Mr. Ken Atkinson

Compaq, Skippetts House, Skippetts Lane West, Basingstoke, RG21 3AT, United Kingdom [email protected]

Mr. Venkatramani Balaji

SGI/GFDL, Princeton University, P.O.Box 308, Princeton, NJ 08542 USA [email protected]

Mr. Alan Baldock

Hitachi Europe Ltd., Whitebrook Park, Lower Cookham Road, Maidenhead/Berks., SL6 8YA, United Kingdom [email protected]

Mr. Ramesh Balgovind

HPCCC, Bureau of Meteorology, GPO Box 1289K, Melbourne, Vic. 3001, Australia [email protected]

Dr. Manik Bali

Center for Development of Advanced Computing, Pune University Campus, Ganeshkind Road, Pune 411-007, India [email protected]

Mr. Mauro Ballabio

Swiss Center for Scientific Computing (CSCS), Via Cantonale, Galleria 2, CH-6928 Manne, Switzerland [email protected]

354 Dr. Peter Baumann

Active Knowledge GmbH Munich, Kirchenstr. 88, D-81675 Munich, Germany [email protected]

Mrs. Sylvia Baylis

Meteorological Office, London Road, Bracknell/Berks., RG12 2SZ, United Kingdom [email protected]

Mr. Brian Bence

BB Associates, 130 Blackmoor Wood, Ascot/Berks., United Kingdom [email protected]

Dr. Dag Bj0rge

Norwegian Meteorological Institute, P.O.Box 43, Blindern, N-0313 Oslo, Norway [email protected]

Mr. David Blaskovich

IBM, Weather & Environmental Marketing, Scientific & Technical Computing, 519 Crocker Avenue, Pacific Grove, CA 93950, USA [email protected]

Mr. Jan Boerhout

Sun Microsystems, Doornenburg 30, 2402 KD Alphen a/d Rijn, The Netherlands [email protected]

Mr. Reinhard Budich

Max Planck Institute for Meteorology, Bundesstr. 55, D20146 Hamburg, Germany [email protected]

Dr. Philip Bull

Sun Microsystems UK Ltd., Regis House, 45 King William Street, London, EC4R, United Kingdom [email protected]

Mr. Paul Burton

Meteorological Office, London Road, Bracknell/Berks., RG12 2SZ, United Kingdom [email protected]

355 Dr. Ilene Carpenter

SGI, 655 Lone Oak Drive, Eagan, MN 55121, USA [email protected]

Mr. Bob Carruthers

Cray UK Limited, 200 Brook Drive, Green Park, Reading/Berks., RG2 6UB, United Kingdom [email protected]

Mr. Michael Carter

Meteorological Office, HI06, Hadley Centre, London Road, Bracknell/Berks., RG12 9ER, United Kingdom [email protected]

Dr. Paolo Cesolari

Aeroporto di Centocelle "F. BARACCA", Ufficio Generale per la Meteorologia, 1 c Ufficio 3 A Sezione Via Papiria, 365,1-00175 Rome, Italy [email protected]

Dr. Zaphiris Christidis

IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598, USA zaphiri @us.ibm.com

Dr. Il-Ung Chung

Korea Research and Development Information Center (KORDIC), P.O.Box 122, Yusong, Taejon 305-600, Korea [email protected]

Dr. Steven Cocke

Florida State University, Dept. of Meteorology, Tallahassee, FL 32306, USA [email protected]

Dr. Michael Czajkowski

Deutscher Wetterdienst, Kaiserleistr. 42, D-63067 Offenbach, Germany [email protected]

356

Mr. Nigel Dear

Unitree Software, 50 Gunthorpe Road, Marlow/Bucks., SL7 1UH, United Kingdom [email protected]

Dr. Alan Dickinson

Meteorological Office, London Road, Bracknell/Berks., RG12 2SZ,United Kingdom [email protected]

Dr. Thomas Diehl

German Climate Computing Centre (DKRZ), Bundesstr. 55, D-20146 Hamburg, Germany [email protected]

Mr. Ben Edgington

Hitachi Europe Ltd., Whitebrook Park, Lower Cookham Road, Maidenhead, SL6 8YA, United Kingdom [email protected]

Mr. Kalle Eerola

Finnish Meteorological Institute, P.O.Box 503, FIN-00101 Helsinki, Finland [email protected]

Mr. Jean-Francois Estrade

EMC/NCEP/NWS/NOAA, 5200 Auth Rd., Camp Springs, MD 20746, USA [email protected]

Mr. Torgny Faxen

National Supercomputer Centre, Linkoping University, S-581 83 Linkoping, Sweden [email protected]

Dr. Rupert Ford

Centre for Novel Computing, Dept. of Computer Science, The University of Manchester, Manchester Ml3 9PL, United Kingdom [email protected]

357 Mr. Tom Formhals

Sun Microsystems, 7900 Westpark Drive, Suite Al 10, McLean, VA 22102, USA [email protected]

Dr. Stephan Frickenhaus

Alfred-Wegener-Institute for Polar and Marine Research, Burgermeister-Smidt-Str. 20, CC5, D-27568 Bremerhaven, Germany [email protected]

Mr. Jose Garcia-Moya Zapata

Instituto Nacional de Meteorologia, Camino de las Moreras, S/N, E-28040 Madrid, Spain [email protected]

Mr. Lawrence Gilbert

Cray UK Limited, 200 Brook Drive, Green Park, Reading/Berks., RG2 6UB, United Kingdom [email protected]

Mr. Philippe Gire

NEC ESS, Immeuble Le Saturne, 3 pare Ariane, F-78284 Guyancourt Cedex, France [email protected]

Mr. Jean Gonnord

CEA/DAM, B.P 12, F-91680 Bruyere le Chatel, France [email protected]

Mr. Mark Govett

NOAA/Forecast Systems Lab., 325 Broadway, R/FS5, Boulder, CO 80305-3328, USA [email protected]

Dr. Volker Giilzow

German Climate Computing Centre (DKRZ), Bundesstr. 55, D-20146 Hamburg, Germany [email protected]

358 Mr. John Hague

AS&T, IBM Bedfont Lakes, Feltham, Middx., TW14 8HB, United Kingdom [email protected]

Mr. Paul Halton

Met Eireann, Glasnevin Hill, Dublin 9, Ireland [email protected]

Dr. Steven Hammond

NCAR, 1850 Table Mesa Dr., Boulder, CO 80303, USA [email protected]

Dr. Robert Harkness

NCAR, 1850 Table Mesa Dr., Boulder, CO 80303, USA harkness @ ucar. edu

Mr. Leslie Hart

NOAA/Forecast Systems Lab., 325 Broadway, R/FS5, Boulder, CO 80305-3328, USA [email protected]

Mr. John Harvey

IBM, Temple Way, Bristol, BS2 OJG, United Kingdom [email protected]

Mr. Detlef Hauffe

Potsdam Institut f. Klimafolgenforschung e.V., P.O.Box 601203, D-14412 Potsdam, Germany [email protected]

Ms. Yun (Helen) He

Lawrence Berkeley Lab., 1 Cyclotron Road, MS 50F, Berkeley, CA 94720, USA [email protected]

Dr. George Heburn

Naval Research Lab., NRL 7320, Stennis Space Center, MS 39525, USA [email protected]

359 Dr. Ryutaro Himeno

Advanced Computing Center, RIKEN (The Institute of Physical and Chemical Research), 2-1, Hirosawa, Wako-shi, Saitama, 351-0198 Japan [email protected]

Mr. Yamada Hiroshi

Japan Atomic Energy Research Institute (JAERI), Sumitomo Hamamatsucho, BIdg. 4F, 1-1816 Hamamatsucho Minato-ku, Tokyo 105-0013, Japan yamadah @jamstec. go.jp

Dr. Richard Hodur

Naval Research Laboratory, 7 Grace Hopper Ave., Monterey, CA 93943-5502, USA [email protected]

Prof. Geerd-R. Hoffmann

Deutscher Wetterdienst, Kaiserleistr. 42, D-63067 Offenbach, Germany geerd-ruediger.hoffmann@ dwd.de

Dr. Teddy Holt

Naval Research Laboratory, 7 Grace Hopper Ave., Monterey, CA 93943-5502, USA [email protected]

Mr. Gerhard Holzner

NEC Deutschland GmbH, ESS, Prinzenallee 11, D-40549 Diisseldorf, Germany

Mr. Takahiro Inoue

Research Organization for Information Science & Technology, 1-18-16, Hamamatsucho, Minato-ku, Tokyo,105-0013,Japan [email protected]

Ms. Caroline Isaac

IBM UK Ltd., 21 NE, 1 New Square, Bedfont Lakes, Feltham, Middx., TW14 8HB, United Kingdom [email protected]

360

Mr. Yoshihiko Ito

IBM Japan, (HZD-47) 19-21 Nihonbashi, Hakozaki-cho, Chuoku, Tokyo 103-8510, Japan [email protected]

Dr. Bogumil Jakubiak

ICM - Warsaw University, Pawinskiego 5A, 02-106 Warsaw, Poland [email protected]

Mr. Zhiyan Jin

National Meteorological Center of China, 46 Baishiqiao Rd., Beijing, 100081, China [email protected]

Mr. Bruce Jones

NEC (UK) Ltd., Pinewood, Chineham Business Park, Basingstoke, RG24 8AL, United Kingdom [email protected]

Mr. Hirotumi Jyounishi

Hitachi Ltd., 810 Shimoimaizumi, Epina-shi, Kanagawa-ken, 2430435 Japan [email protected]

Mr. Tuomo Kauranne

Oy Arbonaut Ltd., Torikatu 21 C, FIN-80100 Joensuu, Finland [email protected]

Mr. Al Kellie

National Center for Atmospheric Research, P.O.Box 3000, Boulder, CO 80307-3000, USA [email protected]

Mr. Luis Kornblueh

Max-Planck-Institute for Meteorology, Bundesstr. 55, D20146 Hamburg, Germany kornblueh @dkrz.de

Dr. Yoshihiro Kosaka

FECIT, 2 Longwalk Road, Stockley Park, Uxbridge, Middlesex, UB11 1AB, United Kingdom [email protected]

361 Dr. Elisabeth Krenzien

Deutscher Wetterdienst, Kaiserleistr. 42, D-63067 Offenbach, Germany [email protected]

Mr. Kolja Kuse

NEC Deutschland GmbH, ESS, Prinzenallee 11, D-40549 Dusseldorf, Germany [email protected]

Dr. Timothy Lanfear

Hitachi Europe GmbH, High Perform. Computer Group, Technopark IV, Lohstr. 28, D85445 Schwarg-Oberding, Germany [email protected]

Dr. Leif Laursen

Danish Meteorological Institute, Lyngbyvej 100, DK-2100 Copenhagen 0 , Denmark

Mr. Christopher Lazou

HiPerCom Consultants Ltd., 10 Western Road, London N2 9HX, United Kingdom [email protected]

Mr. Young-Tae Lee

Korea Meteorological Administration (KMA), 460-18, Shindaebang-dong, Tongjak-gu, Seoul 156-720, Rep. of Korea

Mr. John Levesque

IBM /ACTC, 134 Kitchawan Rd., MS-17-212A, Yorktown Heights, N.Y. 10598, USA

Mr. Robert Levy

1 Rue de Montenotte, 75017 Paris, France [email protected]

Dr. Richard Loft

National Center for Atmosph. Research, 1850 Table Mesa Dr., Boulder, CO 80303, USA [email protected]

362

Mr. Wai Man Ma

Hong Kong Observatory, 134A Nathan Road, Tsim Sha Tsui, Kowloon, Hong Kong, China [email protected]

Dr. Alexander MacDonald

NOAA/Forecast Systems Laboratory, 325 Broadway, Boulder, CO 80305, USA [email protected]

Dr. Paolo Malfetti

CINECA, Via Magnanelli 6/3,140033 Casalecchio di Reno (Bologna), Italy [email protected]

Mr. Angelo Mangili

Swiss Center for Scientific Computing (CSCS), Via Cantonale, Galleria 2, 6928 Manno, Switzerland mangili @cscs.ch

Mr. Matthew Manoussakis

HNMS, El Venizelou 14, Hellinikon, Athens, Greece [email protected]

Dr. D. Marie

ETH-CSCS, CH-6928 Manno, Switzerland [email protected]

Dr. Michael McAtee

The Aerospace Corporation, 106 Peacekeeper Dr., Ste 2N3, Offutt AFB, NE 68113-4039, USA [email protected]

Mr. A. Mechentel

IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598, USA

Mr. John Michalakes

Argonne National Laboratory/NCAR, 3450 Mitchell Lane, Boulder, CO 80301, USA [email protected]

Mr. David Micklethwaite

Cray Australia, Suite 1, Level 2, 1 -7 Franklin Street, Manuka, ACT 2603, Australia [email protected]

363 Prof. Nikolaos Missirlis

University of Athens, Department of Informatics, Panepistimiopolis 157 84, Athens, Greece [email protected]

Dr. Kristian Mogensen

Danish Meteorological Institute, Lyngbyvej 100, DK-2100 Copenhagen 0 , Denmark [email protected]

Mr. Eduardo Monreal

Institute Nacional de Meteorologia, c/Camino de las Moreras, S/N, E-28071 Madrid, Spain [email protected]

Dr. Ravi Nanjundiah

Indian Institute of Science, CAOS, Bangalore 560012, India [email protected]

Mr. David Norton

Mission Critical Linux, 2351 Wagon Train Trail, Southlake Tahoe, CA 96150, USA [email protected]

Dr. Robert Numrich

Cray Inc., 1340 Mendota Heights Road, Mendota Heights, MN 55120, USA [email protected]

Mr. Enda O'Brien

Compaq Computer Corp., Ballybrit Business Park, Galway, Ireland [email protected]

Mr. Yoshinori Oikawa

Japan Meteorological Agency (JMA), Numerical Prediction Division, Forecast Dept., 1-3-4 Otemachi, Chiyoda-ku, Tokyo, 100-8122, Japan [email protected]

364 Prof. Matthew O'Keefe

Dept. of Electrical & Computer Engineering, Parallel Computer Systems Laboratory, University of Minnesota, 200 Union St. S.E., Minneapolis, MN 55455, USA [email protected]

Mr. Mike O'Neill

Fujitsu Systems (Europe) Limited,), 2 Longwalk Road, Stockley Park, Uxbridge, Middlesex, UB11 1AB, United Kingdom meon@[email protected]

Mr. Satoru Okura

Japan Atomic Energy Research Institute (JAERI), Sumitomo Seimei, Bldg. 10F, 1-18-16, Hamamatsu-cho, Minato-ku, Tokyo 105-0013, Japan [email protected]

Mr. Bernie O'Lear

National Center for Atmospheric Research, 1850 Table Mesa Drive, Boulder, CO 80305, USA [email protected]

Mr. Vic Oppermann

NEC Deutschland GmbH, ESS, Prinzenallee 11, D-40549 Diisseldorf, Germany

Mr. Kiyoshi Otsuka

Japan Atomic Energy Research Institute (JAERI), Sumitomo Hamamatsucho, Bldg. 4F, 1-1816 Hamamatsucho Minato-ku, Tokyo 105-0013, Japan [email protected]

Mr. Jairo Panetta

INPE/CPTEC, C.P. 01, 12630000 Cachoeira Paulista, SP, Brazil [email protected]

Mr. Claude Paquette

Cray Canada Corp., 1540 Royal Orchard Drive, Cumberland, Ontario, Canada K4C 1A9 [email protected]

365 Dr. Denis Paradis

METEO-FRANCE, 42 Av. Coriolis, F-31057 Toulouse Cedex, France [email protected]

Mr. Niels J0rgen Pedersen

Danish Meteorological Institute, Lyngbyvej 100, DK-2100 Copenhagen 0 , Denmark [email protected]

Mr. Simon Pellerin

Meteorological Service of Canada, 2121 Trans Canada Highway N., Dorval, QC, Canada H9P 1J3 [email protected]

Mr. Kim Petersen

SGI, Hovedgaden 45IB, DK2640 Hedehusene, Denmark [email protected]

Mr. Mike Pettipher

Manchester Computing, University of Manchester, Oxford Road, Manchester M13 9PL, United Kingdom [email protected]

Mr. Hans Plum

Pallas GmbH, Hermuhlheimer Str. 10, D-50321 Briihl, Germany

Mr. Ken Pollak

FNMOC, 7 Grace Hopper Ave., Stop 1, Monterey, CA 939435501, USA [email protected]

Dr. Abdessamad Qaddouri

Meteorological Service of Canada, 2121 Trans Canada Highway, Dorval, QC, Canada H9- 1J3 abdessamad.qaddouri @ec.gc.ca

Mr. Jean Quiby

MeteoSwiss, Krahbiihlstr. 58, CH-8044 Zurich, Switzerland jean, quiby @meteoschweiz.ch

366 Dr. Ben Ralston

IBM UK Ltd., St. Johns March, Mortimer, RG7 3RD, United Kingdom [email protected]

Mr. Deuk-Kyun Rha

Korea Meteorological Administration (KMA), 460-18, Shindaebang-dong, Tongjak-gu, Seoul 156-720, Rep. of Korea [email protected]

Dr. Thomas Rosmond

Naval Research Laboratory, 7 Grace Hopper Ave., Monterey, CA 93943-5502, USA [email protected]

Dr. Ulrich Schattler

Deutscher Wetterdienst, Postfach 100465, D-63004 Offenbach, Germany [email protected]

Mr. Daniel Schaffer

NOAA/Forecast Systems Laboratory, 325 Broadway, R/FS5, Boulder, CO 80305-3328, USA schaffer® fsl.noaa.gov

Dr. Jerome Schmidt

Naval Research Laboratory, 7 Grace Hopper Ave., Monterey, CA 93943-5502, USA [email protected]

Dr. Joseph Sela

NOAA/NCEP, 5200 Auth Rd., Camp Springs, MD 20746, USA [email protected]

Mr. Wolfgang Sell

German Climate Computing Centre (DKRZ), Bundesstr. 55, D-20146 Hamburg, Germany [email protected]

Mr. Eric Sevault

METEO-FRANCE, SCEM/PREVI/COMPAS, 42 Av. Coriolis, F-31057 Toulouse, France [email protected]

367 Mr. Shuwei Shi

National Meteorological Center of China, 46 Baishiqiao Rd., Beijing, 100081, China

Mr. Satoru Shingu

Japan Atomic Energy Research Institute (JAERI), Sumitomo Hamamatsu-cho Building 10F, 118-16 Hamamatsu-cho, Minatoku, Tokyo 105-0013, Japan [email protected]

Mr. Peter Silva

Meteorological Service of Canada, 2121 Trans Canada Highway, Dorval, QC, Canada H9P 1J3 [email protected]

Mrs. Angele Simard

Meteorological Service of Canada, 2121 Trans Canada Highway, Dorval, QC, Canada H9P 1J3 [email protected]

Dr. Burton Smith

Cray Inc., USA

Dr. David Snelling

Fujitsu European Centre for Information Technology Ltd. (FECIT), 2 Longwalk Road, Stockley Park, Uxbridge, Middlesex, UB11 1AB, United Kingdom [email protected]

Mr. Karl Solchenbach

Pallas GmbH, Hermuhlheimer Str. 10, D-50321 Briihl, Germany [email protected]

Mr. Junqiang Song

National Meteorological Center of China, 46 Baishiqiao Rd., Beijing, 100081, China

Dr. Lois Steenman-Clark

University of Reading, Dept. of Meteorology, Earley Gate, P.O.Box 243, Whiteknights, Rading, RG6 6BB, United Kingdom [email protected]

368 Mr. Bo Strandberg

Swedish Meteorological & Hydrological Institute, Folkborgsvagen 1, S-60176 Norrkoping, Sweden [email protected]

Mr. Masaharu Sudoh

NEC Deutschland GmbH, Prinzenallee 11, D-40549 Diisseldorf, Germany

Capt. Robert Swanson, PhD

Hq Air Force Weather Agency, 106 Peacekeeper Dr. Ste 2N3, Offutt AFB, NE 68113-4039, USA [email protected]

Mr. David Tanqueray

Cray UK Limited, 200 Brook Drive, Green Park, Reading/Berks., RG2 6UB, United Kingdom [email protected]

Dr. John Taylor

QUADRICS, One Bridewell Street, Bristol, BSl 2AA, United Kingdom [email protected]

Mr. Peter Thomas

Fujitsu Systems (Europe) Limited,), 2 Longwalk Road, Stockley Park, Uxbridge, Middlesex, UB 11 1AB, United Kingdom [email protected]

Dr. Stephen Thomas

Scientific Computing Division, National Center for Atmosph. Research, 1850 Table Mesa Drive, Boulder, CO 80303, USA [email protected]

Mr. Yasumasa Tomisawa

Fujitsu Systems Europe, 2 Longwalk Road, Stockley Park, Uxbridge, Middlesex, UB 11 1AB, United Kingdom [email protected]

369 Mr. Joseph-Pierre Toviessi

Canadian Meteorological Centre, 2121 Trans Canada Highway, N. Service Road, 4lh Floor, Dorval, QC, Canada H9P1J3 [email protected]

Mr. Eckhard Tschirschnitz

NEC Deutschland GmbH, ESS, Prinzenalleell,D-40549 Diisseldorf, Germany [email protected]

Mr. Jim Tuccillo

IBM Global Government, 415 Loyd Road, Peachtree City, GA 30269, USA [email protected]

Mr. Michel Valin

Environment Canada, 2121 N. Trans Canada Highway #500, Dorval, Qc, Canada H9P1J3 [email protected]

Mr. Roger Vanlierde

Royal Meteorological Institute, Ringlaan 3, B-l 180 Brussels, Belgium [email protected]

Mr. Gerrit van der Velde

NEC Deutschland GmbH, ESS, Prinzenallee 11, D-40549 Diisseldorf, Germany [email protected]

Mr. Tirupathur Venkatesh

National Aerospace Laboratories, Flosolver Unit, Kodihalli, Bangalore 560017, India [email protected]

Mr. Reiner Vogelsang

Silicon Graphics GmbH, Am Hochacker 3, D-85630 Grasbrunn, Germany [email protected]

Mr. Tadashi Watanabe

NEC, 5-7-1, Shiba, Minato-ku, Tokyo 108-8001, Japan [email protected]

370 Dr. William Webster

NASA/Goddard Space Flight Center, Code 931, Build. 28, Rm. S243, Greenbelt, MD 20771, USA William.P. Webster. 1@ gsfc.nasa.gov

Mr. Will Weir

IBM, Bedfont Lakes, Feltham, Middx.,TW14 8HB, United Kingdom [email protected]

Dr. Gunter Wihl

Zentralanstalt f. Meteorologie und Geodynamik, Hohe Warte 38, A1191 Vienna, Austria gunter. [email protected]. at

Mr. Tomas Wilhelmsson

Department of Numerical Analysis and Computer Science, Royal Institute of Technology, S100 44 Stockholm, Sweden [email protected]

Mr. Michael Woodacre

SGI, The Coach House, Seagry Road, Sutton Benger, SN15 4RX, United Kingdom Woodacre @ sgi .com

Mr. Haruo Yonezawa

Hitachi, Shinsuna Plaza 6-27, Shinsuna 1, Koto-ku, Tokyo, Japan [email protected]

Mr. Kazuo Yoshida

NASD A, World Trade Center Bid. 27 Fl., 2-4-1 Hamamatsucho, Minato-ku, Tokyo 105-8060, Japan [email protected]

Mrs. Mayumi Yoshioka

Research Organization for Information Science & Technology, 1-18-16, Hamamatsucho, Minato-ku, Tokyo, 105-0013, Japan [email protected]

371 Mr. Emanuele Zala

MeteoSwiss, Krahbiihlstr. 58, CH-8044 Zurich, Switzerland [email protected]

ECM WF: David Burridge Erik Andersson Anton Beljaars Jens Daabeck Matteo Dell'Acqua David Dent Richard Fisker Mats Hamrud John Hennessy Claus Hilberg Anthony Hollingsworth Mariano Hortal Lars Isaksen Peter Janssen Norbert Kreitz Francois Lalaurette Dominique Marbouty Martin Miller Umberto Modigliani George Mozdzynski Tim Palmer Pam Prior Deborah Salmond Adrian Simmons Neil Storer Jean-Noel Thepaut Yannick Tremolet Saki Uppala Walter Zwieflhofer

Director Head, Data Assimilation Section Head, Physical Aspects Section Head, Graphics Section Head, Netw.& Comp.Secur. Sect. Numerical Aspects Section Head, Servers & Desktops Sect. Numerical Aspects Section Head, Met. Apps Section Head, Comp. Operations Section Deputy Dir./Head, Research Dept Head, Numerical Aspects Section Satellite Data Section Head, Ocean Waves Section User Support Section Head, Met. Operations Section Head, Operations Department Head, Model Division Head, User Support Section Systems Software Section Head, Diagn. & Pred. Res.Section User Support Section Numerical Aspects Section Head, Data Division Head, Systems Software Section Head, Satellite Data Section Data Assimilation Section ERA Project Leader Head, Computer Division

www. worldscientific. com 4819 he

Use of High Performance Computing in Meteorology

Read more

The Use of High Persormance Computing in Meteorology

Read more

Use of High Performance Computing in Meteorology: Proceedings of the Twelfth ECMWF Workshop

Read more

High performance computing

Read more

High Performance Computing

Read more

High Performance Computing

Read more

High Performance Computing for Dummies

Read more

Tools for High Performance Computing

Read more

Grid computing: the new frontier of high performance computing

Read more

High Performance Computing on Vector Systems: Proceedings of the High Performance Computing Center Stuttgart, March 2005

Read more

High Performance Computing in Science and Engineering ' 06: Transactions of the High Performance Computing Center, Stuttgart (HLRS) 2006

Read more

High performance computing in science and engineering '08 transactions of the High Performance Computing Center, Stuttgart (HLRS) 2008

Read more

High Performance Computing in Science and Engineering ' 07: Transactions of the High Performance Computing Center, Stuttgart (HLRS) 2007

Read more

High Performance Computing in Science and Engineering '10: Transactions of the High Performance Computing Center, Stuttgart (HLRS) 2010

Read more

High Performance Computing in Science and Engineering '11: Transactions of the High Performance Computing Center, Stuttgart (HLRS) 2011

Read more

High-Performance Scientific Computing: Algorithms and Applications

Read more

Introduction to High Performance Scientific Computing

Read more

High-Performance Scientific Computing: Algorithms and Applications

Read more

Introduction to High Performance Scientific Computing

Read more

Introduction to High Performance Scientific Computing

Read more

High-Performance Scientific Computing: Algorithms and Applications

Read more

Introduction to High Performance Scientific Computing

Read more

High-Performance Computing : Paradigm and Infrastructure

Read more

New Developments in High Temperature Superconductivity

Read more

New developments in high temperature superconductivity

Read more

High Performance Computing on Vector Systems 2006: Proceedings of the High Performance Computing Center Stuttgart, March 2006

Read more

High Performance Computing on Vector Systems 2006: Proceedings of the High Performance Computing Center Stuttgart, March 2006

Read more

New Developments in High Temperature Superconductivity

Read more

Mesoscale meteorology in midlatitudes

Read more

Heavy Minerals in Use, Volume 58 (Developments in Sedimentology)

Read more

Recommend Documents

Use of High Performance Computing in Meteorology

This page intentionally left blank Proceedings of the Eleventh ECMWF Workshop on the Use of High performance computi...

The Use of High Persormance Computing in Meteorology

Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology DEVELOPMENTS IN TERACOM...

Use of High Performance Computing in Meteorology: Proceedings of the Twelfth ECMWF Workshop

Use of High Performance Computing in Meteorology This page intentionally left blank Proceedings of the Twelfth ECMW...

High performance computing

Advances in COMPUTERS VOLUME 72 This page intentionally left blank Advances in COMPUTERS High Performance Computi...

High Performance Computing

High Performance Computing By: Charles Severance High Performance Computing By: Charles Severance Online: < http://...

High Performance Computing

High Performance Computing for Dummies

g Easier! Making Everythin ™ ial Edition Sun and AMD Spec e c n a m r o f r e High P Computing Learn to: • Pick out...

Tools for High Performance Computing

Tools for High Performance Computing Michael Resch · Rainer Keller · Valentin Himmler · Bettina Krammer · Alexander S...

Grid computing: the new frontier of high performance computing

GRID COMPUTING The New Frontier of High Performance Computing ADVANCES IN PARALLEL COMPUTING VOLUME 14 Series Editor...

High Performance Computing on Vector Systems: Proceedings of the High Performance Computing Center Stuttgart, March 2005

Resch · Bönisch · Benkert · Furui · Seo · Bez (Eds.) High Performance Computing on Vector Systems Michael Resch · Thom...