Proceedings of the Eleventh ECMWF Workshop on the
Use of High Performance Computing in Meteorology
Reading, UK, 25-29 October 2004
Editors
Walter Zwieflhofer
George Mozdzynski
European Centre for Medium-Range Weather Forecasts, UK
World Scientific NEW JERSEY LONDON SINGAPORE BEIJING SHANGHAI HONG KONG TAIPEI CHENNAI
Published by
World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
USE OF HIGH PERFORMANCE COMPUTING IN METEOROLOGY Proceedings of the 11th ECMWF Workshop Copyright © 2005 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-256-354-7
Printed in Singapore by Mainland Press
PREFACE
The eleventh workshop in the series on the "Use of High Performance Computing in Meteorology" was held in October 2004 at the European Centre for Medium-Range Weather Forecasts. The workshop brought together talks mainly from meteorological scientists, computer scientists and computer manufacturers, with the purpose of sharing their experience and stimulating discussion. It is clear from this workshop that numerical weather prediction (NWP) and climate science continue to demand the fastest computers available today. For NWP in particular this 'need for speed' is mainly a consequence of the increasing resolution of the weather models used, the increasing complexity of the physics, and the growth in the amount of satellite data available for data assimilation. Today, the fastest commercially available computers have thousands of scalar processors or hundreds of vector processors, achieving tens of teraflops of sustained performance for the largest systems. These systems are mainly programmed using Fortran 90/95 and C++ with MPI for interprocessor communication and, in some cases, OpenMP for intra-node shared memory programming. Can this programming model continue to be used to run our community models to achieve orders of magnitude increases in performance? Or do we now need to look to new languages and/or hardware features to address the known problem areas of space/power and memory latency? Chip roadmaps show increasing numbers of cores per chip with modest increases in clock frequency. The implication of this is that our applications need to scale on increasing numbers of CPUs, with greater dependence on fast switch networks. Will our applications rise to this challenge? During the week of this workshop a number of talks considered these issues, while others covered grid computing, interoperability, the use of Linux clusters, parallel algorithms, and updates from meteorological organisations. The papers in these proceedings present the state of the art in the use of parallel processors in the fields of meteorology, climatology and oceanography.
George Mozdzynski
Walter Zwieflhofer
CONTENTS
Preface
V
Early Experiences with the New IBM P690+ at ECMWF Deborah Salmond, Sami Saarinen
1
Creating Science Driven System Architectures for Large Scale Science William T.C. Kramer
13
Programming Models and Languages for High-productivity Computing Systems Hans P. Zima
25
Operation Status of the Earth Simulator Atsuya Uno
36
Non-hydrostatic Atmospheric GCM Development and Its Computational Performance Keiko Takahashi, Xindong Peng, Kenji Komine, Mitsuru Ohduira, Koji Goto, Masayuki Yamada, Fuchigami Hiromitsu, Takeshi Sugimura
50
PDAF - The Parallel Data Assimilation Framework: Experiences with Kalman Filtering L. Nerger, W. Hiller, J. Schröter
63
Optimal Approximation of Kalman Filtering with Temporally Local 4D-Var in Operational Weather Forecasting H. Auvinen, H. Haario, T. Kauranne
84
Intel Architecture Based High-performance Computing Technologies Herbert Cornelius
100
Distributed Data Management at DKRZ Wolfgang Sell
108
Supercomputing Upgrade at the Australian Bureau of Meteorology I. Bermous, M. Naughton, W. Bourke
131
4D-Var: Optimisation and Performance on the NEC SX-6 Stephen Oxley
143
The Weather Research and Forecast Model: Software Architecture and Performance J. Michalakes, J. Dudhia, D. Gill, T. Henderson, J. Klemp, W. Skamarock, W. Wang
156
Establishment of an Efficient Managing System for NWP Operation in CMA Jiangkai Hu, Wenhai Shen
169
The Next-generation Supercomputer and NWP System of the JMA Masami Narita
178
The Grid: An IT Infrastructure for NOAA in the 21st Century Mark W. Govett, Mike Doney, Paul Hyder
187
Integrating Distributed Climate Data Resources: The NERC DataGrid A. Woolf, B. Lawrence, R. Lowry, K. Kleese van Dam, R. Cramer, M. Gutierrez, S. Kondapalli, S. Latham, K. O'Neill, A. Stephens
215
Task Geometry for Commodity Linux Clusters and Grids: A Solution for Topology-aware Load Balancing of Synchronously Coupled, Asymmetric Atmospheric Models I. Lumb, B. McMillan, M. Page, G. Carr
234
Porting and Performance of the Community Climate System Model (CCSM3) on the Cray X1 G.R. Carr Jr., I.L. Carpenter, M.J. Cordery, J.B. Drake, M.W. Ham, F.M. Hoffman, P.H. Worley
259
A Uniform Memory Model for Distributed Data Objects on Parallel Architectures V. Balaji, Robert W. Numrich
272
Panel Experience on Using High Performance Computing in Meteorology - Summary of the Discussion George Mozdzynski
295
List of Participants
299
EARLY EXPERIENCES WITH THE NEW IBM P690+ AT ECMWF
DEBORAH SALMOND & SAMI SAARINEN
ECMWF
This paper describes the early experiences with the large IBM p690+ systems at ECMWF. Results are presented for the IFS (Integrated Forecasting System), which has been parallelized with MPI message passing and OpenMP to obtain good scalability on large numbers of processors. A new profiler called Dr.Hook, which has been developed to give detailed performance information for IFS, is described. Finally, the code optimizations for IFS on the IBM p690+ are briefly outlined.
1. Introduction
This paper describes the performance of the IFS on the IBM p690+ systems at ECMWF. The IFS has been designed and optimized to perform efficiently on vector or scalar processors, and parallelized for shared and distributed memory using a hybrid of MPI and OpenMP to scale well up to O(1000) processors. The performance of the IFS is measured and compared on high performance computer systems using the RAPS (Real Applications for Parallel Systems) benchmark.
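As a concrete but deliberately generic illustration of the hybrid programming model described above - this is a sketch, not IFS code, and all names in it are invented for the example - each MPI task runs the same program on its own share of the data while OpenMP threads share the work of the loops inside the task:

      PROGRAM HYBRID_SKETCH
        USE MPI
        IMPLICIT NONE
        INTEGER :: IERR, IRANK, ISIZE, J
        REAL(8) :: Z(1000)
        CALL MPI_INIT(IERR)
        CALL MPI_COMM_RANK(MPI_COMM_WORLD, IRANK, IERR)
        CALL MPI_COMM_SIZE(MPI_COMM_WORLD, ISIZE, IERR)
        ! Each MPI task owns its own array Z (distributed memory);
        ! the OpenMP threads of the task share the loop iterations (shared memory).
        !$OMP PARALLEL DO
        DO J = 1, 1000
          Z(J) = DBLE(J + IRANK)
        END DO
        !$OMP END PARALLEL DO
        IF (IRANK == 0) PRINT *, 'MPI tasks:', ISIZE, ' first element:', Z(1)
        CALL MPI_FINALIZE(IERR)
      END PROGRAM HYBRID_SKETCH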
1.1. High Performance Computer Systems for Numerical Weather Prediction at ECMWF
In 2004 ECMWF installed two IBM p690+ clusters known as hpcc and hpcd. Each cluster is configured with 68 shared memory nodes, each node having 32 IBM Power4+ processors with a peak performance of 7.6 Gflops per processor. These replaced two IBM p690 clusters known as hpca and hpcb, which were configured with 120 nodes, each node having 8 IBM Power4 processors with a peak performance of 5.2 Gflops per processor. Figure 1 shows the key features of these IBM systems for IFS. The main improvement from hpca/hpcb to hpcc/hpcd was the introduction of the new Federation switch. This significantly reduced the time spent in message passing communications.
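As a rough cross-check of these figures (an aside, not a number quoted in the paper): 68 nodes x 32 processors x 7.6 Gflops gives about 16.5 Tflops of peak performance per p690+ cluster, compared with 120 x 8 x 5.2 Gflops, or roughly 5 Tflops, for each of the p690 clusters they replaced.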
4. Dr.Hook
Dr.Hook is an instrumentation library which was initially written to catch runtime errors in IFS and give a calling tree at the time of failure. Dr.Hook also gathers profile information for each instrumented subroutine or called sub-tree. This profile information can be in terms of wall-clock time, CPU time, Mflops, MIPS and memory usage. Dr.Hook can also be used to create watch points and so detect when an array gets altered. Dr.Hook can be called from Fortran90 or C. The basic feature is to keep track of the current calling tree. This is done for each MPI task and OpenMP thread, and on error it tries to print the calling trees. The system's own traceback can also be printed. The Dr.Hook library is portable, allowing profiles on different computer systems to be compared, and has very low overhead (~1% on IBM). Figure 10 shows how a program is instrumented with Dr.Hook. Calls to DR_HOOK must be inserted at the beginning and just before each return from the subroutine.

      SUBROUTINE SUB
      USE YOMHOOK, ONLY : LHOOK, DR_HOOK
      IMPLICIT NONE
      REAL(8) ZHOOK_HANDLE   ! Must be a local (stack) variable

      !--- The very first statement in the subroutine
      IF (LHOOK) CALL DR_HOOK('SUB',0,ZHOOK_HANDLE)

      !--- Body of the routine goes here

      !--- Just before RETURNing from the subroutine
      IF (LHOOK) CALL DR_HOOK('SUB',1,ZHOOK_HANDLE)

      END SUBROUTINE SUB

Figure 10. How to instrument a Fortran90 program with Dr.Hook.
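A small observation on the pattern above (an inference from the code shown, not an explicit statement in the paper): because every instrumentation call is guarded by IF (LHOOK), the profiling and tracing can be switched off at run time by setting this logical to false, leaving only a negligible test in each instrumented routine.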
Figure 11 shows an example of a trace-back obtained after a failure - in this case due to the CPU limit being exceeded. The IBM system trace-back is also shown. The IBM trace-back will usually give the line number of the failure, but in some cases, for example when the CPU limit is exceeded, no useful information is given. Figure 12 shows an example of a profile obtained from IFS for a T511 forecast run on hpca - the communications routines are highlighted in red.
Figure 11. Example of Dr.Hook trace-back.
[Figure 12 is a table, not reproduced legibly here: a Dr.Hook profile listing, for each of the most expensive routines of the run, the percentage of time (self), the cumulative and self wall-clock times (sec), the total time (sec), the number of calls, MIPS, MFlops and the percentage of divides.]
Figure 12. Example of Dr.Hook profile for a T511 forecast.
Figure 13 shows an example of a Dr.Hook memory profile. This keeps track of how much data is allocated and de-allocated by each instrumented subroutine and its child subroutines. Also a total for the heap and stack is recorded.
[Figure 13 is a table, not reproduced legibly here: a Dr.Hook memory profile listing, per instrumented routine, the amounts of memory allocated and de-allocated, together with heap and stack totals.]
Figure 13. Example of Dr.Hook memory profile.
Figure 14 shows extracts from Dr.Hook profiles for an IFS T511 forecast on two different systems - the Cray X1 and the IBM p690+. The performance of the different subroutines on the scalar and vector systems can easily be compared.
[Figure 14 is a table, not reproduced legibly here: it compares MFlops per CPU for individual IFS routines on the IBM p690+ (7.6 Gflops peak per processor) and the Cray X1 (3.2 Gflops peak); the routines listed include CUADJTQ, CLOUDSC, LAITQM, LASCAW and VDFEXCU.]
Figure 14. Example of using Dr.Hook to compare different systems for a T511 forecast.
5. Optimisation of IFS
Today's Fortran compilers do most of the basic code tuning. This depends on the optimization level that has been chosen. At ECMWF we use the '-O3 -qstrict' flags with the IBM Fortran compiler. The '-qstrict' flag ensures that operations are performed in a strictly defined order. With higher optimization levels we found that we did not get bit-reproducible results when NPROMA was changed. The '-qstrict' flag does, however, prevent certain optimizations, such as replacing several divides by the same quantity with one reciprocal and corresponding multiplies. The main optimization activities done for IFS on the IBM p690 system have been as follows:
- Add timings (at subroutine level and also for specific sections), using Dr.Hook.
- MPI optimizations (remove buffered mpi_send and remove overlap of communications and CPU).
- Add more OpenMP parallel regions.
- 'By hand' optimization of divides - necessary because of '-qstrict' (a small sketch of this follows the list).
- Use IBM 'vector' functions.
- Remove copies and zeroing of arrays where possible.
- Optimize data access for cache - group different uses of the same array together.
- Remove allocation of arrays in low-level routines.
- Basic Fortran optimizations, e.g. ensuring inner loops are over the first array index.
The IBM has scalar processors, but as we have worked on optimization we have tried to keep a code that performs well on both scalar and vector processors. This has meant that in a few places we have had to put in alternative code for the vector and scalar versions.
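The divide optimization mentioned in the list can be illustrated with a minimal, self-contained Fortran sketch; the routine and variable names are invented for the example and are not taken from the IFS source. Because '-qstrict' stops the compiler from replacing several divides by the same quantity with one reciprocal, the transformation is written out by hand:

      SUBROUTINE DIVOPT(KLEN, PX, PY, PD, PA, PB)
        IMPLICIT NONE
        INTEGER, INTENT(IN)  :: KLEN
        REAL(8), INTENT(IN)  :: PX(KLEN), PY(KLEN), PD(KLEN)
        REAL(8), INTENT(OUT) :: PA(KLEN), PB(KLEN)
        INTEGER :: JL
        REAL(8) :: ZRD
        DO JL = 1, KLEN
          ZRD    = 1.0D0 / PD(JL)    ! one divide ...
          PA(JL) = PX(JL) * ZRD      ! ... replaces what would otherwise be
          PB(JL) = PY(JL) * ZRD      ! two divides per loop iteration
        ENDDO
      END SUBROUTINE DIVOPT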
References
1. D. Dent and G. Mozdzynski: 'ECMWF operational forecasting on a distributed memory platform', Proceedings of the 7th ECMWF Workshop on the Use of Parallel Processors in Meteorology, World Scientific, pp 36-51, (1996).
2. D. Dent, M. Hamrud, G. Mozdzynski, D. Salmond and C. Temperton: 'IFS Developments', Proceedings of the 9th ECMWF Workshop on the Use of Parallel Processors in Meteorology, World Scientific, pp 36-52, (2000).
3. M. Hamrud, S. Saarinen and D. Salmond: 'Implementation of IFS on a Highly Parallel Scalar System', Proceedings of the 10th ECMWF Workshop on the Use of Parallel Processors in Meteorology, World Scientific, pp 74-87, (2002).
CREATING SCIENCE DRIVEN SYSTEM ARCHITECTURES FOR LARGE SCALE SCIENCE
WILLIAM T.C. KRAMER
Lawrence Berkeley National Laboratory, MS 50B-4230, One Cyclotron Road, Berkeley, California 94720, (510) 486-7577
[email protected] Application scientists have been frustrated by a trend of stagnating application performance, despite dramatic increases in claimed peak performance of highperformance computing systems. This trend is sometimes referred to as the “divergence problem” and often has been assumed that the ever-increasing gap between theoretical peak and sustained performance was unavoidable. However, recent results from the Earth Simulator (ES) in Japan clearly demonstrate that a close collaboration with a vendor to develop a science-driven architectural solution can produce a system that achieves a significant fraction of peak performance for critical scientific applications. This papa discusses the issues contributing to divergence problem, suggests a new approach to address the problem and documents some early successes for this approach.
1. The “Divergence Problem”
Application scientists have been frustrated by a trend of stagnating application performance, despite dramatic increases in the claimed peak performance of high-performance computing systems. This trend is sometimes referred to as the "divergence problem" [1], and it has often been assumed that the ever-increasing gap between theoretical peak and sustained performance was unavoidable. Other people have described this situation as the memory wall. Figure 1 shows this divergence, comparing peak theoretical performance and the measured (sustained) performance for an application workload. Due to business pressures, most computational system components are designed for applications other than high performance computing (e.g. web servers, desktop applications, databases, etc.). In the early 1990s, in an attempt to increase price performance for HPC applications, many organizations in the US government, most notably the High Performance Computing and Communications (HPCC) Initiative, encouraged the use of COTS (Commodity Off The Shelf) components for large scale computing. Very effective systems like the Cray T3E and the IBM SP were created during this effort.
1.1. Applications Have Diverse Algorithmic Needs
Tables 1 and 2 show that the science disciplines require the use of many different algorithmic methods. It is very challenging for architectures to address all these algorithmic areas at the same time. Data parallel algorithms perform well on systems with high memory bandwidth such as vector or superscalar systems.
Table 1. Significant scientific breakthroughs are awaiting increases in computational performance. For each science area the table gives the goals, the computational methods used, and the breakthrough target at 50 Tflop/s sustained.
Nanoscience. Goal: simulate the synthesis and predict the properties of multi-component nanosystems. Methods: quantum molecular dynamics, quantum Monte Carlo, iterative eigensolvers, dense linear algebra, parallel 3D FFTs. Breakthrough target: simulate nanostructures with hundreds to thousands of atoms, as well as transport and optical properties and other parameters.
Combustion. Goal: predict combustion processes to provide efficient, clean and sustainable energy. Methods: explicit and implicit finite difference, zero-dimensional physics, adaptive mesh refinement, Lagrangian particle methods. Breakthrough target: simulate laboratory-scale flames with high-fidelity representations of the governing physical processes.
Fusion. Goal: understand high-energy density plasmas and develop an integrated simulation of a fusion reactor. Methods: multi-physics and multi-scale methods, particle methods, regular and irregular access, nonlinear solvers, adaptive mesh refinement. Breakthrough target: simulate the ITER reactor.
Climate. Goal: accurately detect and attribute climate change, predict future climate and engineer mitigation strategies. Methods: finite difference methods, FFTs, regular and irregular access, simulation ensembles. Breakthrough target: perform a full ocean/atmosphere climate model with 0.125 degree spacing, with an ensemble of 8-10 runs.
Astrophysics. Goal: use simulation and analysis of observational data to determine the origin, evolution and fate of the universe, the nature of matter and energy, and galaxy and stellar evolution. Methods: multi-physics and multi-scale methods, dense linear algebra, parallel 3D FFTs, spherical transforms, particle methods, adaptive mesh refinement. Breakthrough target: simulate the explosion of a supernova with a full 3D model.
Irregular control flow methods require excellent scalar performance, and spectral and other applications require high bisection bandwidth and low latency for interconnects. Yet large scale computing requires architectures capable of achieving high performance across the spectrum of state-of-the-art applications. Hence, for HPC, balanced system architectures are extremely important.
Table 2. State-of-the-art computational science requires increasingly diverse and complex algorithms. [The original table is a checkmark matrix, not legible in this reproduction: for each science area - nanoscience, combustion, fusion, climate and astrophysics - it marks which of the following are required: multi-physics and multi-scale methods, dense linear algebra, FFTs, particle methods, adaptive mesh refinement (AMR), data parallelism, and irregular control flow.]
Some systems and sites focus on particular application areas. For example, a country may have one or several sites focused on weather modeling and/or climate research. Other focused systems may be for aeronautical applications, for military design, for automotive design or for drug design. Because a more focused set of applications run at these sites, they are more concerned about supporting a smaller set of methods. However, many sites, such as the National Energy Research Scientific Computing (NERSC) Facility, the flagship computing facility for the US Department of Energy Office of Science, support a wide range of research disciplines. Figure 1 shows the FY03 breakdown of usage by scientific discipline at NERSC. Clearly, combining Tables 1 and 2 with Figure 1 indicates that only a balanced system will operate well across all disciplines. Sites such as NERSC are important bellwethers for large-scale system design because of their workload diversity. In other words, if a system performs well at a site with a diverse workload, it will perform well at many sites that have a more focused workload.
2. The Science Driven System Architecture Approach
Recent results from the Earth Simulator (ES) in Japan clearly demonstrate that a close collaboration with a vendor to develop a science-driven architectural solution can produce a system that achieves a significant fraction of peak performance for critical scientific applications. The key to the ES success was
Computing Revitalization Task Force (HECRTF) Workshop [3], the Federal Plan for High-End Computing [4] and the DOE SCaLeS Workshop [5], and has been demonstrated successfully by the Earth Simulator, the initial Berkeley Lab/IBM Blue Planet effort, and the Sandia/Cray Red Storm effort [6]. The HECRTF report states: "We must develop a government-wide, coordinated method for influencing vendors. The HEC influence on COTS components is small, but it can be maximized by engaging vendors on approaches and ideas five years or more before commercial products are created. Given these time scales, the engagement must also be focused and sustained. We recommend that academic and government HEC groups collect and prioritize a list of requested HEC-specific changes for COTS components, focusing on an achievable set."
The SDSA is in part an implementation of this recommendation. The SDSA process requires close relationships with vendors who have resources and an R&D track record and a national collaboration of laboratories, computing facilities, universities, and computational scientists.
3. Goals of the Science Driven System Architecture Process
Even for applications that have high computational intensity, scalar performance is increasingly important. Many of the largest scale problems cannot use dense methods because of N^3 memory scaling that drives applications to use sparse and adaptive computational methods with irregular control flow. Specialized architectures typically are optimized for a small subset of problems and do not provide effective performance across the broad range of methods discussed above [7]. Additionally, one of the main drivers for larger, more accurate simulations is to incorporate complex microphysics into the inner loops of computations. This is demonstrated by the following quote from a NERSC computational scientist: "It would be a major step backward to acquire a new platform that could reach the 100 Tflop level for only a few applications that had 'clean' microphysics. Increasingly realistic models usually mean increasingly complex microphysics."
Thus, the goal for SDSA is to create systems that provide excellent sustained price performance for the broadest range of large scale applications. At the same time, SDSA should allow applications that do well on specialized architectures such as vector systems to achieve performance relatively near optimal on an SDSA system. In order to be cost effective, the SDSA systems will most likely leverage COTS technology while adding a limited amount of focused technology accelerators.
In order to achieve this goal, the entire large-scale community must engage in technology discussions that influence technology development to implement the science-driven architecture development. This requires serious dialogue with vendor partners who have the capability to develop and integrate technology. To facilitate this dialogue, large-scale facilities must establish close connections and strategic collaborations with computer science programs and facilities funded by federal agencies and with universities. There need to be strong, collaborative investigations between scientists and computer vendors on science-driven architectures in order to establish the path to continued improvement in application performance. In order to do this, staff and users at large scale facilities and the computational science research community must represent their scientific application communities to vendors in the Science-Driven Computer Architecture process. The long-term objective is to integrate the lessons of large scale systems, such as the Blue Gene/L and the DARPA HPCS experiments, with other commodity technologies into hybrid systems for petascale computing. There are significant contributions that large-scale facilities can provide to this effort. They include:
- Workload analysis, which is difficult to do for diverse science but critically important to provide data to computer designers.
- Explicit scaling analysis, including performance collection, system modeling and performance evaluation.
- Samples of advanced algorithms, particularly those that may be in wide use five or ten years in the future, providing a basis for the design of systems that require long lead times to develop.
- Computational science success stories; these are important because they provide motivation to system developers and stakeholder funders.
- Numerical and system libraries such as SuperLU, ScaLAPACK, MPI-2 and parallel netCDF, as well as applications, tools and even languages that quickly make use of advanced features.
3.1. An Example - Sparse Matrix Calculations
Sparse matrix calculations are becoming more common, driven by the growing memory needs of dense methods, particularly as simulations move from one or two dimensions to three dimensional analysis. Many applications that have in the past used, or currently use, dense linear algebra will have to use sparse matrix methods in the future. Current cache based computing systems are designed to handle applications that have excellent temporal and spatial memory locality but do not operate efficiently for sparse cases.
Consider the following simple problem, y = A*x, where A is a two-dimensional array that stores matrix elements in uncompressed form. Nrow and Ncol are the number of rows and columns in the array, respectively, and s is the interim sum of the multiplication. The equation above can be carried out with the following pseudo-code.

      do j=1,Nrow
        s=0
        do i=1,Ncol
          s = s+A(j,i)*x(i)
        end do
        y(j) = s
      end do
This method works well when the elements in A have good spatial locality, which is when the matrix is dense. However, if the matrix is sparse, then many of the values are zero and do not need to be computed. Figure 4 shows the impact of sparsity in the calculation. Because memory is loaded into cache according to cache lines (typically eight to sixteen 64-bit words), as the matrix gets sparser, contiguous memory locations no longer hold meaningful data. It soon evolves into an application using only a small fraction of the data moved to cache - in the extreme, using only one word per cache line. In many 3D simulations of scale, at least one dimension uses a sparse approach. In this case, to optimize computational effort and to conserve memory as problems grow, sparse methods are used. The pseudo-code for a sparse matrix implementation would look as follows. A is a row compressed sparse matrix: row_start() gives the starting index of each row and col_ind() gives the column index for each stored element. A() is a single dimension array that stores the sparse matrix elements in a concentrated form. Nrow is the number of rows in the array, and s is the interim sum of the row multiplication. The equation above can be carried out with the following pseudo-code.

      do j=1,Nrow
        jj = row_start(j)
        row_length = row_start(j+1) - jj
        s = 0
        do i=1,row_length
          s = s + A(jj+i)*x(col_ind(jj+i))
        end do
        y(j) = s
      end do
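To make the sparse case concrete, the following is a self-contained Fortran version of the row-compressed matrix-vector product sketched in the pseudo-code above; it is an illustrative rendering with one-based Fortran indexing, not code from the paper.

      SUBROUTINE CRS_MATVEC(NROW, ROW_START, COL_IND, A, X, Y)
        IMPLICIT NONE
        INTEGER, INTENT(IN)  :: NROW
        INTEGER, INTENT(IN)  :: ROW_START(NROW+1)   ! start of each row in A/COL_IND
        INTEGER, INTENT(IN)  :: COL_IND(*)          ! column index of each stored element
        REAL(8), INTENT(IN)  :: A(*)                ! nonzero elements, stored row by row
        REAL(8), INTENT(IN)  :: X(*)
        REAL(8), INTENT(OUT) :: Y(NROW)
        INTEGER :: J, I, JJ, ROW_LENGTH
        REAL(8) :: S
        DO J = 1, NROW
          JJ = ROW_START(J)
          ROW_LENGTH = ROW_START(J+1) - JJ
          S = 0.0D0
          DO I = 1, ROW_LENGTH
            ! the gather through COL_IND makes the accesses to X non-contiguous
            S = S + A(JJ+I-1) * X(COL_IND(JJ+I-1))
          END DO
          Y(J) = S
        END DO
      END SUBROUTINE CRS_MATVEC

Note that the indirect access to x through col_ind is exactly why cache-based systems use only a fraction of each cache line they load in this kind of computation, which is the inefficiency discussed above.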
specialized "accelerator" hardware to commodity CPUs that would increase efficiency for applications that can typically do well on vector architectures. 2. There is a possibility to add a "pre-load" feature that takes user input into account to provide more flexibility than pre-fetch engines. Other ideas are also being explored.
4. Early Results for Science Driven Architecture
Realizing that effective large-scale system performance cannot be achieved without a sustained focus on application-specific architectural development, NERSC and IBM have led a collaboration since 2002 that involves extensive interactions between domain scientists, mathematicians and computer experts, as well as leading members of IBM's R&D and product development teams. The goal of this effort is to adjust IBM's architectural roadmap to improve system balance and to add key architectural features that address the requirements of demanding leadership-class applications - ultimately leading to a sustained Petaflop/s system for scientific discovery. The first products of this multi-year effort have been a redesigned Power5-based HPC system known as Blue Planet and a set of proposed architectural extensions referred to as ViVA (Virtual Vector Architecture). This collaboration has already had a dramatic impact on the architectural design of the ASCI Purple system [9] and the Power5-based SP configuration. As indicated in the following quote from the "A National Facility for Advanced Computational Science: A Sustainable Path to Scientific Discovery" technical report, Dona Crawford, the Associate Laboratory Director for Computation at Lawrence Livermore National Laboratory, wrote: "The Blue Planet node conceived by NERSC and IBM [...] features high internal bandwidth essential for successful scientific computing. LLNL elected early in 2004 to modify its contract with IBM to use this node as the building block of its 100 TF Purple system. This represents a singular benefit to LLNL and the ASC program, and LLNL is indebted to LBNL for this effort."
The Blue Planet node is an 8 way SMP that is balanced with regard to memory bandwidth. That is, if each processor operates at its full potential memory bandwidth, the system can keep all processors supplied with data at the sum of the individual rate. If more processors are added, the node memory bandwidth would be oversubscribed. The implication of this change meant that the interconnection architecture had to scale to more than was originally planned, yet the overall sustained system performance improved. It is important to point out that this approach has the potential for additional improvements, not just with IBM but with other providers. The Blue Planet design is incorporated into the new generation of IBM Power microprocessors that are the building blocks of future configurations. These
processors break the memory bandwidth bottleneck, reversing the recent trend towards architectures poorly balanced for scientific computations. The Blue Planet design improved the original power roadmap in several key respects: dramatically improved memory bandwidth; 70% reduction in memory latency; eight-fold improvement in interconnect bandwidth per processor; and ViVA Virtual Processor extensions, which allow all eight processors within a node to be effectively utilized as a single virtual processor. 5. Summary
This paper explains a new approach to making COTS-based large scale systems more effective for scientific applications. It documents some of the problems with the existing approach, including the need for systems to be balanced and to perform well on multiple computational methods. The paper proposes a more interactive design and development process where computational science experts interact directly with system designers to produce better, more productive systems. The goals for SDSA systems include excellent price-performance, consistent performance across a range of applications, and pointed hardware improvements that add high value. The SDSA process has already resulted in improvements to IBM systems that are available now.
Acknowledgments
This work is supported by the Office of Computational and Technology Research, Division of Mathematical, Information and Computational Sciences of the U.S. Department of Energy, under contract DE-AC03-76SF00098. In addition, this work draws on many related efforts at NERSC and throughout the HPC community, whom we acknowledge for their efforts to improve the impact of scientific computing.
References
1. C. William McCurdy, Rick Stevens, Horst Simon, et al., "Creating Science-Driven Computer Architecture: A New Path to Scientific Leadership," Lawrence Berkeley National Laboratory report LBNL/PUB-5483 (2002), http://www.nersc.gov/news/ArchDevProposal.5.01.pdf.
2. Gordon E. Moore, "Cramming more components onto integrated circuits," Electronics, Volume 38, Number 8, April 19, 1965. Available at ftp://download.intel.com/research/silicon/moorespaper.pdf.
3. Daniel A. Reed, ed., "Workshop on the Roadmap for the Revitalization of High-End Computing," June 16-18, 2003 (Washington, D.C.: Computing Research Association).
4. "Federal Plan for High-End Computing: Report of the High-End Computing Revitalization Task Force (HECRTF)" (Arlington, VA: National Coordination Office for Information Technology Research and Development, May 10, 2004), http://www.house.gov/science/hearings/full04/may13/hecrtf.pdf.
5. Phillip Colella, Thom H. Dunning, Jr., William D. Gropp, and David E. Keyes, eds., "A Science-Based Case for Large-Scale Simulation" (Washington, D.C.: DOE Office of Science, July 30, 2003).
6. "Red Storm System Raises Bar on Supercomputer Scalability" (Seattle: Cray Inc., 2003), http://www.cray.com/company/RedStorm-flyer.pdf.
7. Leonid Oliker, Andrew Canning, Jonathan Carter, John Shalf, and Stephane Ethier, "Scientific Computations on Modern Parallel Vector Systems" (draft), in SC2004 High Performance Computing, Networking and Storage Conference, Pittsburgh, PA, Nov 6-12, 2004. Available at http://www.nersc.gov/news/reports/vectorbench.pdf.
8. Horst Simon et al., "A National Facility for Advanced Computational Science: A Sustainable Path to Scientific Discovery," LBNL Report #5500, April 2, 2004, http://www-library.lbl.gov/docs/PUB/5500/PDF/PUB-5500.pdf.
9. "Facts on ASCI Purple," Lawrence Livermore National Laboratory report UCRL-TB-150327 (2002); http://www.sandia.gov/supercomp/sc2002/flyers/SC02ASCIPurplev4.pdf.
PROGRAMMING MODELS AND LANGUAGES FOR HIGH-PRODUCTIVITY COMPUTING SYSTEMS*
HANS P. ZIMA
JPL, California Institute of Technology, Pasadena, CA 91109, USA
and Institute of Scientific Computing, University of Vienna, Austria
E-mail: [email protected]
High performance computing (HPC) provides the superior computational capability required for dramatic advances in key areas of science and engineering such as DNA analysis, drug design, or structural engineering. Over the past decade, progress in this area has been threatened by technology problems that pose serious challenges for continued advances in this field. One of the most important problems has been the lack of adequate language and tool support for programming HPC architectures. In today’s dominating programming paradigm users are forced to adopt a low level programming style similar to assembly language if they want to fully exploit the capabilities of parallel machines. This leads to high cost for software production and error-prone programs that are difficult to write, reuse, and maintain. This paper discusses the option of providing a high-level programming interface for HPC architectures. We summarize the state of the art, describe new challenges posed by emerging peta-scale systems, and outline features of the Chapel language developed in the DARPA-funded Cascade project.
1. Introduction
Programming languages determine the level of abstraction at which a user interacts with a machine, playing a dual role both as a notation that guides thought and as a notation providing directions for program execution on an abstract machine. A major goal of high-level programming language design has traditionally been enhancing human productivity by raising the level of abstraction, thus reducing the gap between the problem domain and the level at which algorithms are formulated. But this comes at a cost: the success
* This paper is based upon work supported by the Defense Advanced Research Projects Agency under its Contract No. NBCH3039003. The research described in this paper was partially carried out at the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration.
of a language depends on acceptable performance of the generated target programs, a fact that has influenced high-level language design since the very beginning a. The necessity to find a viable compromise between the dual goals of high level language abstraction and target code performance is a key issue for the design of any programming language. For HPC systems this is particularly critical due to the necessity of achieving the best possible target code performance for the strategically important applications for which these architectures are being designed: the overarching goal is to make scientists and engineers more productive by increasing programming language usability and time-to-solution, without sacrificing performance. Within the sequential programming paradigm, a steady and gradual evolution from assembly language to higher level languages took place, triggered by the initial success of FORTRAN, COBOL and Algol in the 1960s. Together with the development of techniques for their automatic translation into machine language, this brought about a fundamental change: although the user lost direct control over the generation of machine code, the progress in building optimizing compilers led to wide-spread acceptance of the new, high-level programming model whose advantages of increased reliability and programmer productivity outweighed any performance penalties. Unfortunately, no such development happened for parallel architectures. Exploiting current HPC architectures, most of which are custom MPPs or clusters built from off-the-shelf commodity components, has proven to be difficult, leading to a situation where users are forced to adopt a low-level programming paradigm based upon a standard sequential programming language (typically Fortran or C/C++), augmented with message passing constructs. In this model, the user deals with all aspects of the distribution of data and work to the processors, and controls the program’s execution by explicitly inserting message passing operations. Such a programming style is error-prone and results in high costs for software production. Moreover, even if a message passing standard such as MPI is used, the portability of the resulting programs is limited since the characteristics of the target architectures may require extensive restructuring of the code.
a John Backus (1957): "...It was our belief that if FORTRAN ... were to translate any reasonable "scientific" source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger ..."
In this paper, we address the issue of providing a high-level programming interface for HPC systems. After discussing previous work in Section 2 we outline the requirements posed by emerging peta-scale systems in Section 3. This is followed by a description of some key features of the language Chapel in Section 4, which focuses on the support provided for multithreading and locality management. Section 5 concludes the paper.
2. Languages for Scientific Parallel Programming
With the emergence of distributed-memory machines in the 1980s the issue of a suitable programming paradigm for these architectures, in particular for controlling the tradeoff between locality and parallelism, became important. The earliest (and still dominant) approach is represented by the fragmented programming model: data structures and program state are partitioned into segments explicitly associated with regions of physical memory that are local to a processor (or an SMP); control structures have to be partitioned correspondingly. Accessing non-local state is expensive. The overall responsibility for the management of data, work, and communication is with the programmer. The most popular versions of this explicitly parallel approach today use a combination of C, C++, or Fortran with MPI. It soon became clear that a higher-level approach to parallel programming was desirable, based on data-parallel languages and the Single-Program-Multiple-Data (SPMD) paradigm, with a conceptually single thread of control coupled with user-specified annotations for data distribution, alignment, and data/thread affinity. For such languages, many low-level details can be left to the compiler and runtime system. High Performance Fortran (HPF) became the trademark for a class of languages and related compilation and runtime system efforts that span more than a decade. Some of the key developments leading to HPF included the Kali language and compiler [1], the SUPERB restructuring system [2], and the Fortran D [3] and Vienna Fortran [4] languages. HPF-1 [5], completed in 1993, was soon recognized as being too constrained by relying exclusively on regular data distributions, resulting in performance drawbacks for important classes of applications. HPF-2 [6] and HPF+ [7] extended the power of the distribution mechanism to accommodate dynamic and irregular applications; HPF+ took the additional step of providing high-level support for communication management. A related approach, based on the Japanese HPF version JA-HPF, reached in 2002 a performance of 12.5 Teraflops for
a plasma code on the Earth Simulator [9]. As the interest of the user community in HPF waned due to shortcomings of the language and weak compiler support, a number of languages rose in its place, commonly referred to as partitioned global address space languages. The best-known examples are Co-Array Fortran [10], Unified Parallel C [11], and Titanium [12]. These languages are similar due to their support for regular data distributions which are operated on in SPMD style; they have the advantage of being easier to compile than HPF, but achieve this by shifting some of that burden back to programmers by requiring them to return to the fragmented programming model. In summary, these languages offer a compromise between MPI and HPF in the tradeoff between programmability and performance. Other language developments include OpenMP [13] and ZPL [14]. OpenMP is one of the few current parallel programming models providing a non-fragmented, global view of programming; however, its execution model assumes a uniform shared address space, which severely limits its scalability. ZPL supports parallel computation via user-defined index sets called regions [14], which may be multidimensional, sparse, or strided and are used both to declare distributed arrays and to operate on them in parallel.
3. New Challenges and Requirements
A new generation of HPC architectures is currently being designed in the USA and Japan, with the goal of reaching Petaflops performance by 2010. The rising scale and complexity of these emerging systems, together with large-scale applications, pose major new challenges for programming languages and environments. These include the massive parallelism of future systems consisting of tens of thousands of processing and memory components, the extreme non-uniformities in latencies and bandwidths across the machine, and the increase in size and complexity of HPCS applications, as researchers develop full-system simulations encompassing multiple scientific disciplines. Such applications will be built from components written in different languages and using different programming paradigms. These and related challenges imply a set of key requirements that new languages must address:
(1) Multithreading: Languages must provide explicit as well as implicit mechanisms for creating and managing systems of parallel threads that exploit the massive parallelism provided by future architectures.
(2) Locality-Aware Computation: Future architectures are expected to provide hardware-supported global addressing; however, the discrepancies between latencies and bandwidths across different memory levels will make locality-aware computation a necessary requirement for achieving optimum target code performance.
(3) Programming-in-the-Large: Emerging advanced application systems will be characterized by a dramatic increase in size and complexity. It is imperative that HPC languages support component-based development allowing the seamless interaction of different languages, programming paradigms, and scientific disciplines in a single application.
(4) Portability and Legacy Code Integration: Taking into account the huge investment in applications, their long lifetime, and the frequent change in the supporting hardware platforms makes the automatic or semi-automatic porting and integration of legacy codes a high-priority requirement. Under the necessary constraint of retaining high performance this is a demanding task requiring a sophisticated approach and intelligent tools.
There are a number of additional key requirements, which are beyond the scope of this paper. They include fault tolerance and robustness and a range of intelligent tools supporting debugging, testing, static analysis, model checking, and dynamic monitoring.
4. Chapel
4.1. Overview
In this section we discuss some design aspects of Chapel, a new language developed in the Cascade project led by Cray Inc. Cascade is a project in the DARPA-funded High Productivity Computing Systems (HPCS) program. Chapel pushes the state-of-the-art in programming for HPC systems by focusing on productivity. In particular Chapel combines the goal of highest possible object code performance with that of programmability by supporting a high level interface resulting in shorter time-to-solution and reduced application development cost. The design of Chapel is guided by four key areas of programming language technology: multithreading, locality-awareness in the sense of HPF, object-orientation, and generic programming supported by type inference and cloning. The language provides a global name space.
In the following we will outline those features of Chapel that are related to the control of parallelism and distributions. A more complete discussion of Chapel can be found in [15].
4.2. Domains, Index Sets, and Arrays
The key feature of Chapel for the control of parallelism and locality is the domain. Domains are first-class entities whose primary component is a named index set that can be combined with a distribution and associated with a set of arrays. Domains provide a basis for the construction of sequential and parallel iterators. Index sets in Chapel are far more general than the Cartesian products of integer intervals used in conventional languages; they include enumerations, sets of object references associated with a class, and sparse subsets. This leads to a powerful generalization of the standard array concept. Distributions map indices to locales, which are the virtual units of locality provided to executions of a Chapel program. The language provides a powerful facility for specifying user-defined distributions in an object-oriented framework. In addition, alignment between different data structures and affinity between data and threads can be expressed at a high level of abstraction. Arrays associated with a domain inherit the domain's index set and distribution. The index set determines how many component variables are allocated for the array, and the distribution controls the allocation of these variables in virtual memory. The example below illustrates some of these basic relationships: D is a one-dimensional arithmetic domain with index set {1, ..., n} and a cyclic distribution; A1 and A2 are two arrays associated with D.

      const D: domain(1) = [1..n] dist(cyclic);
      var   A1: array [D] of float;
      var   A2: array [D] of integer;
      ...
      forall i in D on D.locale(i) { A1(i) = ... }
Both arrays have n component variables and are distributed in a roundrobin manner to the locales assigned to the program execution. The forall loop is a parallel loop with an iteration range specified by the index set of
31
D. Iteration i is executed on the locale to which element i of that index set is mapped. 4.3. Specification of Distributions in Chapel
Standard distributions such as block or cyclic have been used in many highlevel language extensions. Most of these languages rely on a limited set of built-in distributions, which are based on arrays with Cartesian product index sets. Chapel extends existing approaches for the specification and management of data distributions in major ways: 0
0
- Distributions in Chapel support the general approach to domains and their index sets in the language. Thus, distributions can not only be applied to arithmetic index sets but also, for example, to indices that are sets of object references associated with a class.
- Domains can be dynamic and allow their index set to change in a program-controlled way, reflecting structural changes in associated data structures. Distributions can be adjusted as well in an incremental manner.
- Chapel provides a powerful object-oriented mechanism for the specification of user-defined distribution classes. An instance of such a class, created in the context of a domain, results in a distribution for the index set of the domain. As a consequence, Chapel does not rely on a pre-defined set of built-in distributions. Each distribution required in a Chapel program is created via the user-defined specification capability.
- In addition to defining a mapping from domain indices to locales, the distribution writer can also specify the layout of data within locales. This additional degree of control can be used to further enhance performance by exploiting knowledge about data and their access patterns.
4.3.1. User-Defined Distributions: An Example User-defined distributions are specified in the framework of a class hierarchy, which is rooted in a common abstract class, denoted by distribution. All specific distributions are defined as subclasses of this class, which provides a general framework for constructing distributions. Here we illustrate the
basic idea underlying the user-defined distribution mechanism by showing how to implement a cyclic distribution.

      class cyclic {
        implements distribution;
        blocksize: integer;

        /* The constructor arguments sd and td respectively specify the index set
           to be distributed and the index set of the target locales */
        constructor create(sd: domain, td: domain, c: integer) {
          this.SetDomain(sd);
          this.SetTarget(td);
          if (c<=0) then ERROR("Illegal block size") endif;
          blocksize = c;
          ...
        }

        /* The function Map specifies the mapping from indices to target locales
           represented by the array Locales */
        function Map(i: index): Locale {
          return (Locales(mod(floor(i/blocksize), GetTarget().extent)));
        }
      }
4.3.2. User-Defined Distributions: The General Approach There are no "standard" distributions in Chapel: every distribution is defined using the standard mechanism for user-defined distributions. While the definition of standard distributions provides the compiler with enough information to produce target code with performance equivalent to that of built-in constructs, the mechanism is completely general and allows the definition of arbitrary mappings. A distribution can be dynamically adapted to the requirements of changing data structures such as needed for dealing with unstructured grids. Chapel offers additional options, including the replication of data structures across a set of locales, the suppression of distribution for one or more dimensions of a multi-dimensional array, and random distributions realized by the system.
Until now we have only discussed the mapping of indices to locales. Chapel also allows the specification of the layout of data structures within locales. For standard distributions this may be of little interest - for example, the set of all components of a block-distributed array mapped to a single locale will usually be stored in a contiguous memory area, and such representations can be generated by the compiler without the necessity of user intervention. However, for more sophisticated distributions this facility provides an important structuring tool. Since this is still an area of active research, we limit ourselves to sketching the main idea. The specification machinery is expected to be sophisticated enough to allow for the user-defined specification of distributed sparse data sets. This means that the user will specify (1) the particular sparse representation (such as CRS with the data, column, and row vectors for a two-dimensional matrix), (2) the distribution of the index set, and (3) the associated access functions. The task of the compiler consists then of transforming the algorithm - which is based on the notation for dense arrays - in such a way that it exploits the particular properties of the representation.
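As a concrete illustration of the CRS layout mentioned in (1) above (an example constructed for this text, not taken from the paper): a 4 x 4 matrix with nonzeros 5 and 1 in row one (columns 1 and 4), 3 in row two (column 2), 2 and 4 in row three (columns 3 and 4), and 7 in row four (column 4) is stored as data = (5, 1, 3, 2, 4, 7), column = (1, 4, 2, 3, 4, 4) and row = (1, 3, 4, 6, 7), where the row vector gives the position in data at which each row starts. A user-defined distribution in the sense described above would then specify how these three vectors, and the index set they encode, are partitioned across locales, together with the access functions needed by the compiler.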
5. Conclusion
This paper outlined the state-of-the-art in programming today's HPC architectures and provided an overview of the challenges and requirements for high-level programming languages. A number of key features of the Chapel programming language were discussed in this context. The problem of developing efficient high-productivity languages for HPC architectures is an open research issue. Another general-purpose language developed in the HPCS program is IBM's X10 language [16]. In contrast to Chapel, which is based on a new object-oriented model, the foundation of the X10 model is Java, with places (as entities supporting locality management), asynchronous activities, and Fortran-type arrays added, and Java threads and arrays removed. Similar to Chapel, X10 provides strong high-level support for multithreading and locality. A distinctive feature of X10 is its embedding in an integrated programming environment based on the Eclipse platform. This environment provides interoperability with the standard languages Java, C/C++, and Fortran, and supports continuous program optimization. Many other related efforts are underway, spanning a broad range of activities including problem-solving environments, domain-specific languages, and related compilation and tool efforts. We expect that the next five years
will result in significant progress in this area, providing the user of HPC architectures with a more mature, robust, and efficient programming interface than existing today.
Acknowledgment
The research described in this paper was performed in cooperation with David Callahan and Brad Chamberlain (Cray Inc.), Roxana Diaconescu (California Institute of Technology), and Mark James (JPL, California Institute of Technology).
References
1. P. Mehrotra and J. V. Rosendale, "Programming distributed memory architectures using Kali," in Advances in Languages and Compilers for Parallel Computing. MIT Press, 1991.
2. H. Zima, H. Bast, and M. Gerndt, "SUPERB: A tool for semi-automatic MIMD/SIMD parallelization," Parallel Computing, vol. 6, pp. 1-18, 1986.
3. G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. W. Tseng, and M. Y. Wu, "Fortran D language specification," Houston, TX, Tech. Rep. CRPC-TR90079, December 1990.
4. B. M. Chapman, P. Mehrotra, and H. P. Zima, "Programming in Vienna Fortran," Scientific Programming, vol. 1, no. 1, pp. 31-50, Fall 1992.
5. High Performance Fortran Forum, "High Performance Fortran language specification," Scientific Programming, vol. 2, no. 1-2, pp. 1-170, Spring-Summer 1993.
6. High Performance Fortran Forum, "High Performance Fortran language specification, version 2.0," Rice University, Technical Report, 1997.
7. B. Chapman, H. Zima, and P. Mehrotra, "Extending HPF for advanced data-parallel applications," IEEE Parallel and Distributed Technology: Systems and Applications, vol. 2, no. 3, pp. 59-70, Fall 1994.
8. P. Mehrotra, J. V. Rosendale, and H. Zima, "High Performance Fortran: History, status and future," Parallel Computing, vol. 24, no. 3, pp. 325-354, May 1998.
9. H. Sakagami, H. Murai, and M. Yokokawa, "14.9 TFLOPS three-dimensional fluid simulation for fusion science with HPF on the Earth Simulator," in Proceedings of the 2002 ACM/IEEE Conference on Supercomputing. IEEE Computer Society Press, 2002, pp. 1-14.
10. R. W. Numrich and J. K. Reid, "Co-Array Fortran for parallel programming," Rutherford Appleton Laboratory, Oxon, UK, Tech. Rep. RAL-TR-1998-060, August 1998.
11. T. A. El-Ghazawi, W. W. Carlson, and J. M. Draper, UPC Language Specification, October 2003.
12. K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A
35
13.
14.
15.
16.
high-performance Java dialect,” in ACM 1998 Workshop on Java for HighPerformance Network Computing, ACM, Ed. New York, NY 10036, USA: ACM Press, 1998. L. Dagum and R. Menon, “OpenMP: an industry-standard API for sharedmemory programming,” IEEE Computational Science and Engineering, vol. 5, no. 1, pp. 46-55, January-March 1998. B. L. Chamberlain, E. C. Lewis, C. Lin, and L. Snyder, “Regions: An abstraction for expressing array computation,” in A CM/SIGAPL International Conference on Array Programming Languages, August 1999, pp. 41-49. D. Callahan, B. Chamberlain, and H. Zima, “The Cascade High Productivity Language,” in Proc.9th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2OO4), Santa Fe, New Mexico, April 2004. K.Ebcioglu, V.Sarkar, and V.Saraswat, “X10: Programming for Hierarchical Parallelism and Non-Uniform Data Access,” in Proc.LaR 2004 Workshop, OOPSLA 2004, 2004.
OPERATION STATUS OF THE EARTH SIMULATOR
ATSUYA UNO
Earth Simulator Center/JAMSTEC, 3173-25 Showa-machi, Kanazawa-ku, Yokohama, Kanagawa 236-0001, Japan
E-mail: uno@es.jamstec.go.jp
As many large-scale distributed parallel programs are run efficiently on the Earth Simulator (ES), I/O throughput between the tape library system and the work disk of the PNs became the bottleneck for simulations. In 2003 we therefore introduced NQSII and the Mass Data Processing System (MDPS) to address this problem and to improve system utilization and maintainability. In this paper, the new system (a scheduling algorithm and MDPS) and the recent program execution status of the ES are described.
1. Introduction

The Earth Simulator (ES) was developed to promote research and development for a clear understanding and more precise prediction of global changes of the earth system. The ES is a MIMD-type, distributed-memory parallel system consisting of 640 processor nodes (PNs) connected via a fast 640x640 single-stage crossbar network. The theoretical peak performance is 40 TFLOPS. We have been operating the ES for more than two years. As many large-scale parallel programs have been executed on the ES, I/O throughput between the tape library system and the work disk of the PNs became the bottleneck. Therefore, we decided to install the Mass Data Processing System (MDPS) and Network Queuing System II (NQSII) to resolve this problem. MDPS is a hierarchical storage management system and has improved the overall performance of I/O on the ES. We have also further developed the scheduler of NQSII to improve system utilization. In this paper, an overview of MDPS, NQSII and the customized scheduler is given, together with an update on the program execution status of the ES.
(3) Request Escalation
The scheduler checks the estimated start time of each request periodically. If the scheduler finds an earlier possible start time than the currently estimated start time, it allocates the new start time to the request. This process is called "Request Escalation." At this point, if nodes already allocated to the request are allocated to that request again, the scheduler does not repeat the stage-in process for those nodes, and starts the stage-in process only for the nodes newly allocated to the request. This process is called "Partial stage-in."
(4) Request Execution
When the estimated start time has arrived, the scheduler executes the request. If the stage-in process has not finished by the estimated start time, the scheduler cancels this estimated start time and reschedules the request after the stage-in process is completed. When the request execution finishes or the declared elapsed time is exceeded, the scheduler terminates the request and starts the stage-out process.
(5) Stage-out
After the request execution, the scheduler starts the stage-out process. This process is also executed by the FSPs without using the APs of the PNs. The priority of the stage-out process is lower than that of the stage-in process, as the stage-in process needs to be finished as fast as possible to meet the estimated start time of the request. (A schematic sketch of this scheduling logic is given after this list.)
The L-system is basically a single batch queue environment; however, it can also support a number of batch queues at the same time, and the queue configuration can be changed without stopping the ES system. NQSII can manage each batch queue as an independent queue, and can increase or decrease the nodes of any queue at any time.
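The escalation and staging logic of items (3)-(5) can be condensed into a short sketch. The Python fragment below is only an illustration of the decision flow described above; the class, function and field names are hypothetical and do not correspond to the actual NQSII/scheduler implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    estimated_start: float                           # currently scheduled start time
    allocated_nodes: set = field(default_factory=set)
    staged_nodes: set = field(default_factory=set)   # nodes whose stage-in is done

def escalate(request, earlier_start, new_nodes):
    """Request Escalation with Partial stage-in (item (3))."""
    if earlier_start < request.estimated_start:
        request.estimated_start = earlier_start
        # Partial stage-in: nodes already allocated to this request keep
        # their staged files; only newly allocated nodes are staged in.
        newly_allocated = set(new_nodes) - request.allocated_nodes
        request.allocated_nodes = set(new_nodes)
        start_stage_in(request, newly_allocated)

def try_execute(request, now):
    """Request Execution and Stage-out (items (4) and (5))."""
    if now < request.estimated_start:
        return "waiting"
    if request.staged_nodes >= request.allocated_nodes:   # stage-in complete?
        run(request)                   # execute on the allocated nodes
        start_stage_out(request)       # stage-out has lower priority than stage-in
        return "running"
    return "reschedule"                # cancel this start time, reschedule later

# Placeholders standing in for the real staging and execution operations.
def start_stage_in(request, nodes): request.staged_nodes |= set(nodes)
def start_stage_out(request): pass
def run(request): pass
```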
4. Program execution status

The program execution status from July 2002 to July 2004 is outlined in this section.

4.1. The number and size of requests
Figure 8 and Figure 9 show the number of executed requests and the execution time in the L-system, respectively. The execution time is a measure
mance). In 2003, a project in solid earth physics, "A 14.6 billion degrees of freedom, 5 teraflops, 2.5 terabyte earthquake simulation on the Earth Simulator," won the Gordon Bell Award for Peak Performance6. This project used 1,944 processors of the ES to simulate seismic wave propagation resulting from a large earthquake, and performed at 5 TFLOPS (33% of the peak performance). Moreover, in 2004, "A 15.2 TFlops Simulation of Geodynamo on the Earth Simulator" was awarded the Gordon Bell Award for Peak Performance7. This is a new geodynamo simulation code that combines the finite difference method with the recently proposed spherical overset grid called the Yin-Yang grid; it performed at 15.2 TFLOPS (46% of the peak performance) on 4,096 processors.

5. Summary
In this paper we described recent developments to the ES: the installation of the Mass Data Processing System (MDPS) and Network Queuing System II (NQSII). The recent execution status of the ES was also shown. MDPS has enhanced the throughput between the work disks of the PNs and the CTL (M-Disk), and MDPS and NQSII together have improved both usability for users and system utilization on the ES. As a result of this installation, both the aggregate execution time of user requests and the volume of staged files have increased. The ES executes about 8,000 requests of varying sizes every month, while achieving a high ratio of node utilization. The applications executed on the ES have achieved high performance, on average about 30% of the peak performance. Some of the top performers were honored with the Gordon Bell Award in 2002, 2003 and 2004. We believe that our first goal, improving usability and system utilization, has been achieved. Future work includes plans to achieve an even higher ratio of node utilization through improvements to the scheduling algorithm and better analysis of the characteristics of users' requests.
References
1. S. Habata, M. Yokokawa and S. Kitawaki: The Earth Simulator System, NEC Research & Development, Vol. 44, No. 1, pp. 21-26 (2003).
2. E. Miyazaki and K. Yabuki: Integrated Operating Multi-Node System for SX-Series and Enhanced Resource Scheduler, NEC Research & Development, Vol. 44, No. 1, pp. 37-42 (2003).
3. S. Shingu, H. Takahara, H. Fuchigami, M. Yamada, Y. Tsuda, W. Ohfuchi, Y. Sasaki, K. Kobayashi, T. Hagiwara, S. Habata, M. Yokokawa, H. Itoh and K. Otsuka: A 26.58 Tflops Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator, Proceedings of the ACM/IEEE SC2002 conference (2002).
4. H. Sakagami, H. Murai, Y. Seo and M. Yokokawa: 14.9 TFLOPS Three-dimensional Fluid Simulation for Fusion Science with HPF on the Earth Simulator, Proceedings of the ACM/IEEE SC2002 conference (2002).
5. M. Yokokawa, K. Itakura, A. Uno, T. Ishihara and Y. Kaneda: 16.4-Tflops Direct Numerical Simulation of Turbulence by a Fourier Spectral Method on the Earth Simulator, Proceedings of the ACM/IEEE SC2002 conference (2002).
6. D. Komatitsch, S. Tsuboi, C. Ji and J. Tromp: A 14.6 billion degrees of freedom, 5 teraflops, 2.5 terabyte earthquake simulation on the Earth Simulator, Proceedings of the ACM/IEEE SC2003 conference (2003).
7. A. Kageyama, M. Kameyama, S. Fujihara, M. Yoshida, M. Hyodo and Y. Tsuda: A 15.2 TFlops Simulation of Geodynamo on the Earth Simulator, Proceedings of the ACM/IEEE SC2004 conference (2004).
NON-HYDROSTATIC ATMOSPHERIC GCM DEVELOPMENT AND ITS COMPUTATIONAL PERFORMANCE
KEIKO TAKAHASHI
The Earth Simulator Center, JAMSTEC, 3173-25 Showa-machi, Kanazawa-ku, Yokohama 236-0001, Japan
XINDONG PENG
The Earth Simulator Center, JAMSTEC, 3173-25 Showa-machi, Kanazawa-ku, Yokohama 236-0001, Japan
KENJI KOMINE
The Earth Simulator Center, JAMSTEC, 3173-25 Showa-machi, Kanazawa-ku, Yokohama 236-0001, Japan
MITSURU OHDAIRA
The Earth Simulator Center, JAMSTEC, 3173-25 Showa-machi, Kanazawa-ku, Yokohama 236-0001, Japan
KOJI GOTO
NEC Corporation, 1-10 Nisshin-cho, Fuchu-shi, Tokyo 183-5801, Japan
MASAYUKI YAMADA
NEC Informatec Systems Ltd., 3-2-1 Sakato, Takatsu-ku, Kawasaki-shi, Kanagawa 213-0012, Japan
FUCHIGAMI HIROMITSU
NEC Informatec Systems Ltd., 3-2-1 Sakato, Takatsu-ku, Kawasaki-shi, Kanagawa 213-0012, Japan
TAKESHI SUGIMURA
Graduate School of Science, Nagoya University, Furou-cho, Chikusa-ku, Nagoya 464-8602, Japan
In the Earth Simulator Center, we have started developing a global/regional non-hydrostatic coupled ocean-atmosphere-land simulation code with high computational performance for the purpose of research into multi-scale interactions on the global sphere. This approach might provide useful tools for researching global climate and regional weather/climate simultaneously, including extremes such as typhoons, tornadoes, concentrated downpours and downbursts. In this paper, we focus on the atmospheric component of our coupled code and present an outline of the model configuration and preliminary validation results. Each of the components of our non-hydrostatic coupled code is designed and built from scratch in the Earth Simulator Center. It is characterized by the grid system, discretization and interpolation schemes described in section 2. In section 3, we present validation results on a conservation issue for an overset grid system on the sphere. Further results from benchmark test experiments and typhoon track prediction trials are also presented in section 3. High-performance computation on the Earth Simulator using these components is presented in section 4. We conclude with a brief summary and describe near-future work in section 5.
2. Model Description

2.1. Governing Equations
In the atmospheric component of our coupled simulation code, the dynamical core comprises the non-hydrostatic, fully compressible three-dimensional Navier-Stokes equations with rotational effects (1), the continuity equation (2), the pressure equation (3) and the equation of state (4), defined as follows. The forecast variables are the perturbation density, the three components of momentum, and the perturbation pressure.

\[ \frac{\partial(\rho\mathbf{v})}{\partial t} + \nabla p' + \rho'\mathbf{g} = -\nabla\cdot(\rho\mathbf{v}\mathbf{v}) + 2\rho\,\mathbf{v}\times\mathbf{f} + \mu\left(\nabla^{2}\mathbf{v} + \tfrac{1}{3}\nabla(\nabla\cdot\mathbf{v})\right) \quad (1) \]

\[ \frac{\partial\rho'}{\partial t} + \nabla\cdot(\rho\mathbf{v}) = 0 \quad (2) \]

\[ \frac{\partial p'}{\partial t} + \rho g w + \mathbf{v}\cdot\nabla p' + \gamma p\,\nabla\cdot\mathbf{v} = (\gamma-1)Q \quad (3) \]

\[ p = \rho R T \quad (4) \]

where Q collects the diabatic heating and dissipation source terms.
Any value in the boundary region of each grid panel is computed using an interpolation scheme based on 5th-order Lagrange polynomials. A boundary point value is calculated from neighbouring points in the other panel. For numerical stabilization, 2nd-order horizontal/vertical numerical viscosity and divergence damping are adopted. Details will be given in [3][4]. An outline of the other schemes used in the developed coupled code is presented in Table 1.
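As an illustration of the boundary interpolation just described, the following Python sketch evaluates a one-dimensional fifth-order (six-point) Lagrange interpolant. It is only a schematic stand-in for the model's actual overset-grid interpolation on the sphere; the stencil choice and function names are ours, not the model's.

```python
import numpy as np

def lagrange_interpolate(x_stencil, f_stencil, x):
    """Evaluate the Lagrange interpolating polynomial through the points
    (x_stencil[j], f_stencil[j]) at location x.  With six stencil points
    this corresponds to 5th-order interpolation of a panel boundary value."""
    result = 0.0
    for j, xj in enumerate(x_stencil):
        # Lagrange basis polynomial L_j(x) = prod_{m != j} (x - x_m)/(x_j - x_m)
        basis = 1.0
        for m, xm in enumerate(x_stencil):
            if m != j:
                basis *= (x - xm) / (xj - xm)
        result += f_stencil[j] * basis
    return result

# Example: interpolate sin(x) at a boundary point from six neighbouring
# points of the other panel (here just a uniform 1D stencil).
x_stencil = np.linspace(0.0, 0.5, 6)
f_stencil = np.sin(x_stencil)
print(lagrange_interpolate(x_stencil, f_stencil, 0.27), np.sin(0.27))
```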
3. Model Verifications

3.1. Mass conservation

The conservation problem is one of the basic issues of climate models. In particular, the interface in overset grid systems such as the Yin-Yang grid system should be designed to conserve physical values. We have introduced a mass imbalance correction scheme [5] to conserve the global total mass. A schematic image of the method at the boundary interface of each panel is shown in Figure 3. A sufficient condition for conserving the global total mass is that the budget of fluxes across the boundary interface is balanced, as shown in the following equation,

\[ \int_{\Gamma} f_{\mathrm{Yin}}\, d\Gamma = \int_{\Gamma} f_{\mathrm{Yang}}\, d\Gamma , \]
where Γ is the boundary interface. The flux f_Yin should transfer the total of the fluxes from the Yin-system to the Yang-system through the boundary interface Γ. F_EF is considered as a flux corresponding to (S_ABEF/S_ABCD) times the total budget of a cell ABCD, which is calculated by (f_1 + f_2 + f_3 + f_4) on the Yang-system. Therefore, F_EF through the boundary on the Yin-system is obtained from the fluxes on the Yang-system such that

\[ F_{EF} = \frac{S_{ABEF}}{S_{ABCD}}\,(f_1 + f_2 + f_3 + f_4). \]
A flux F_EH through a boundary circle EH on the Yin-system is calculated correspondingly from the fluxes on the Yang-system.
Figure 8. (a) An x-z cross section of vertical velocity (m/s) for the mountain of 1000 m height with terrain-following vertical coordinate at 30 days. (b) Horizontal divergence at 10 km height (middle levels) at 30 days.

Figure 9. Zonal mean wind U distribution (m/s). The mean was calculated over a 1000-day integration. (a) Results by Held and Suarez. (b) Simulation results.
4. Preliminary Results for Forecasting

4.1. Global cloud water distribution with the developed coupled atmosphere-ocean-land simulation code

A global integration has been performed for 48 hours with a 5.5 km horizontal resolution and 32 vertical levels as a preliminary validation of the physical performance of the atmospheric component of the developed coupled simulation code. The initial data were GPV data at 00Z 08 August 2003 provided by the Japan Meteorological Business Support Center. Since cloud microphysics has been introduced in the atmospheric component, the precipitation in the equatorial area in Figure 10 is produced by the cloud microphysics. Further detailed research on cloud characteristics due to cloud microphysics will be required in our near-future work.
PDAF - THE PARALLEL DATA ASSIMILATION FRAMEWORK: EXPERIENCES WITH KALMAN FILTERING
L. NERGER, W. HILLER AND J. SCHROTER
Alfred Wegener Institute for Polar and Marine Research, P.O. Box 12 01 61, D-27515 Bremerhaven, Germany
E-mail: [email protected]

Data assimilation with advanced algorithms based on the Kalman filter and large-scale numerical models is computationally extremely demanding. This motivates the parallelization of the assimilation problem. As another issue, the implementation of a data assimilation system on the basis of existing numerical models is complicated by the fact that these models are typically not prepared to be used with data assimilation algorithms. To facilitate the implementation of parallel data assimilation systems, the parallel data assimilation framework PDAF has been developed. PDAF allows an existing numerical model to be combined with data assimilation algorithms, like statistical filters, with minimal changes to the model code. Furthermore, PDAF supports the efficient use of parallel computers by creating a parallel data assimilation system. Here the structure and abilities of PDAF are discussed. In addition, the application of filter algorithms based on the Kalman filter is discussed. Data assimilation experiments show an excellent parallel performance of PDAF.
1. Introduction
Advanced data assimilation algorithms for filtering and smoothing applications with state-of-the-art large-scale geophysical models are of increasing interest. The applied algorithms are typically based on the Kalman filter9. Their aim is to estimate the state of a geophysical system (atmosphere, ocean, ...) on the basis of a numerical model and measurements by combining both sources of information. In addition, the algorithms provide an estimate of the error in the computed state estimate. There is already a large variety of data assimilation (DA) algorithms based on the Kalman filter. Several of them can be classified as Error Subspace Kalman Filters (ESKF)12, due to their representation of the estimation error in a low-dimensional sub-space of the true error space. Examples of such
algorithms are the Ensemble Kalman Filter (EnKF)6, the SEEK filter15, and the SEIK filter14. These algorithms show several advantages in comparison to the variational DA algorithms 3D-Var and 4D-Var16, which are widely used, e.g., in weather-forecasting applications. The variational algorithms do not provide a direct error estimate, which, in contrast, is inherent to the ESKF algorithms. In addition, the 3D/4D-Var techniques require the implementation of an adjoint model code5, which can be a very difficult task for realistic numerical models of the atmosphere and/or ocean. From the computational point of view, another advantage of the ESKF algorithms is their high scalability on parallel computers. While the 3D/4D-Var algorithms are of a serial nature due to alternating integrations of the numerical (forward) model and its adjoint, the integration of an ensemble of model states in the ESKFs makes these algorithms highly scalable. Due to the differences between ESKF algorithms and the variational 3D/4D-Var methods, the ESKFs have the potential to provide better state estimates, combined with error estimates, in a smaller amount of time when they are applied in a highly parallel manner. This can significantly reduce the required computing time, since the computational cost of advanced DA algorithms is several times higher than the cost of pure forecasts. The implementation of DA systems on the basis of filter algorithms is typically complicated by the fact that the numerical models have not been developed with data assimilation applications in mind. This is partly also true for the existing DA systems which use variational techniques, since their code structure often does not allow a simple transformation into a filtering system. To utilize the parallelism in the ensemble integration, the parallelization of the models has to be changed to allow multiple concurrent model tasks. In addition, the filter algorithm itself has to be parallelized to exploit its high scalability. To facilitate the implementation of DA applications, the interface systems SESAM17 and PALM13 have been developed. These systems are based on a logical separation between the filter and model parts of the DA problem. An interface structure for the transfer of data between the filter and the model is defined, and the filter part is implemented such that it is independent of the model code itself. SESAM uses Unix shell scripts which control the execution of separate program executables, like the numerical model and the program which computes, e.g., the assimilation of the measurements in the analysis phase of the filter algorithm. Data transfers between the programs are performed using disk files. The structure of SESAM allows the model to be used for DA without changes to the source code. The problem
of data transfers between the model and filter parts of the data assimilation system is shifted to the issue of consistent handling of disk files. Since SESAM is based on shell scripts, it does not support multiple model tasks. PALM uses program subroutines which have to be instrumented with meta information for the PALM system. The DA program is assembled using the prepared subroutines and a library of driver and algebraic routines supplied by PALM. The driver supports the concurrent integration of multiple model tasks. For the implementation of a DA application, PALM requires the algorithm to be assembled from separate subroutines. Hence, also the model itself has to be implemented as a subroutine, since the main program is provided by the PALM driver. In addition, the control of the filtering program will emanate from the driver routine of PALM. With the introduction of the Parallel Data Assimilation Framework PDAF we follow a different approach: for the creation of a data assimilation system, filter algorithms are attached to the model with minimal changes to the model source code itself. The parts of the filter problem which are model-dependent or refer to the measurements which are assimilated are organized as separate subroutines. These have to be implemented by the user of the framework. The data assimilation system is controlled by the user-written routines. Thus, the driver functionality remains in the model part of the program. Data transfers between the core part of PDAF and the user-supplied routines are conducted solely via a defined interface. Accordingly, the user-written routines can be implemented analogously to the model code, i.e. by sharing Fortran common blocks or modules. This simplifies the implementation of the user-supplied routines for users who know the particularities of their model. PDAF provides some of the most widely used filter algorithms, like the Ensemble Kalman Filter and the SEEK filter, which are fully implemented and optimized in the core part of PDAF. In general, also variational DA algorithms can be implemented within PDAF. However, in the current version of PDAF only filter algorithms are provided. In the following, ESKF algorithms are reviewed in section 2 to discuss the structure and issues of these algorithms. Then, the PDAF system for the application of ESKF algorithms is presented in section 3. In section 4 experiments are discussed which show the parallel performance of PDAF and the EnKF, SEEK, and SEIK filter algorithms.
2. Error Subspace Kalman Filters
Error Subspace Kalman Filters are a class of advanced Kalman filter algorithms which are intended for data assimilation with large-scale nonlinear numerical models. Here, the concepts of ESKF algorithms are reviewed. A more detailed and mathematical description of three algorithms, the Ensemble Kalman filter and the SEEK and SEIK filters, together with a comprehensive review of the Kalman filter, has been given by Nerger et al.12. The Kalman filter (KF)9 is based on the theory of statistical estimation. Accordingly, the data assimilation problem is formulated as an estimation problem in terms of an estimate of the system state and a corresponding covariance matrix which describes the error in the state estimate. Since the error estimate is based on a covariance matrix, it is implicitly assumed that the errors are well described by Gaussian distributions. The virtue of the KF lies in the evolution of both the state estimate and the state covariance matrix according to the model dynamics. For linear dynamics, the KF can thus provide the optimal estimate if the system state and the error are normally distributed. The optimality is defined by the estimation errors, which are of minimum variance and maximum likelihood. For nonlinear dynamics the extended Kalman filter (EKF), see Jazwinski8, which is a first-order extension of the KF to nonlinear problems, can be applied. However, the estimate will be sub-optimal, since the assumption of Gaussian distributions is not conserved by nonlinear dynamics. Another practical issue is the huge computational cost of the KF and EKF algorithms. Both require the explicit allocation of the state covariance matrix in memory. Furthermore, the evolution of the covariance matrix requires a number of model integrations which is twice as large as the dimension of the discretized model state. For today's models of the atmosphere or the ocean, the state dimension is of order 10^6 to 10^8. Thus, both the storage and the evolution of the covariance matrix are not possible with current high-performance computers. To reduce the memory and computing-time requirements of the filter algorithm, several approximations and variants of the EKF have been developed over the last two decades in the atmospheric and oceanographic communities. A promising approach is given by the class of ESKF algorithms12. This class includes the popular ensemble Kalman filters (see Evensen7 for a review), but also algorithms which use a low-rank approximation of the covariance matrix, like the SEEK filter15. In addition, the SEIK filter14, which combines the ensemble and low-rank concepts, belongs to this class.
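To make the storage argument concrete, the following numpy sketch contrasts the explicit covariance matrix of the KF/EKF with the factorized, low-rank ensemble representation used by ESKF algorithms. It illustrates the memory argument only and is not code from PDAF; the dimensions are chosen arbitrarily (real atmosphere/ocean models have state dimensions of 10^6 to 10^8).

```python
import numpy as np

n = 200_000   # state dimension (kept small here so the sketch actually runs)
N = 60        # ensemble size / rank of the approximated covariance matrix

# Full KF/EKF: the state covariance matrix P is n x n and must be stored
# explicitly -- infeasible for realistic state dimensions.
full_cov_values = n * n

# ESKF: P is kept in decomposed form, P ~ X' X'^T / (N-1), where X' (n x N)
# holds the ensemble perturbations.  Only X' is ever stored.
ensemble = np.random.randn(n, N)                        # ensemble of model states
perturbations = ensemble - ensemble.mean(axis=1, keepdims=True)
low_rank_values = perturbations.size                    # n * N values only

print(f"explicit covariance: {full_cov_values:.2e} values")
print(f"low-rank factor:     {low_rank_values:.2e} values")

# Covariance-vector products needed in the analysis step are formed without
# ever building P:  P v = X' (X'^T v) / (N - 1)
v = np.random.randn(n)
Pv = perturbations @ (perturbations.T @ v) / (N - 1)
```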
The concept of the ESKF algorithms is based on a low-rank approximation of the state covariance matrix. Mathematically, the ESKFs approximate the error space of the full EKF, which is described by the state covariance matrix, by a low-dimensional sub-space. The approximated covariance matrix is treated in decomposed form. This reduces the memory requirements. The evolution of the covariance matrix now only requires a number of model integrations which is equal to the rank of the approximated covariance matrix. Numbers as small as 7 have been reported for the rank in oceanographic applications3. Hence, the ESKF algorithms can strongly reduce the computation time for the DA in comparison to the full EKF. Furthermore, the ESKF algorithms allow for a better consideration of model nonlinearities. While the EKF applies a linearized model to evolve the covariance matrix, the Ensemble Kalman filter and the SEIK filter sample the covariance matrix together with the state estimate by an ensemble of model states which is integrated by the nonlinear model. The minimum size of the ensemble equals the rank plus one. The nonlinear ensemble integration can provide more realistic estimates of the state and the covariance matrix. In this case, the error sub-space estimated by the filter will no longer be a sub-space of the full error space estimated by the EKF, but a better estimate of the true error space of the estimation problem. For the implementation of algorithms based on the Kalman filter, it is important to note that these methods are sequential filter algorithms, i.e. the algorithms can be separated into different phases which are executed sequentially. Figure 1 exemplifies the flow of a generic ESKF algorithm which relies on an ensemble representation of the covariance matrix. First, in the initialization phase, the initial state estimate and the corresponding covariance matrix are sampled by an ensemble of model states. Then, in the forecast phase, each ensemble state is evolved with the nonlinear numerical model. At times when measurements are available, the analysis step is performed: the analysis equations of the KF are applied to update the ensemble mean or all ensemble states together with the error estimate on the basis of the measurements and the evolved ensemble. Some algorithms, like the SEIK filter, finally apply a re-initialization step. Here the evolved ensemble of model states is transformed to represent the updated error estimate given by the decomposed covariance matrix. Subsequently, the algorithm steps back to perform the next forecast phase. All phases of the filter algorithms can be parallelized. For the Ensemble Kalman filter this has been discussed in detail11. The parallelization of the SEEK and SEIK algorithms has also been studied and compared to that of
Figure 1. Generic flow diagram for error subspace Kalman filter algorithms using an ensemble of states:
- Initialization: sample the initial state estimate and the corresponding covariance matrix by an ensemble of model states.
- Forecast: evolve each state of the ensemble with the nonlinear numerical model.
- Analysis: apply the analysis equations of the Kalman filter to the ensemble mean or all ensemble states; the state covariance matrix is given by the ensemble statistics.
- Re-initialization: transform the state ensemble such that it exactly represents the updated error statistics.
- Repeat from the forecast phase until the assimilation is completed.
the Ensemble Kalman filter1’. There are basically two strategies which rely on the distribution of the matrix holding the ensemble states in its columns. First, it is possible to distribute the matrix column-wise over several p r e cesses. Thus, each process will hold a sub-ensemble of full model states. This strategy is independent from the model itself, as in the filter algorithm always full states are considered. The second strategy is to distribute the matrix row-wise. In this case, each processor will hold a full-ensemble of partial model states. This distribution can correspond to a decomposition of the model domain. It is thus the obvious parallelization strategy when the numerical model itself is parallelized using domain-decomposition. According to the distribution of the ensemble matrix, all other operations of the filter algorithm, in particular those in the analysis and reinitialization phases, are parallelized. The achievable parallel performance of the two parallelization strategies will be discussed in section 4. 3. The Parallel Data Assimilation Framework PDAF
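The two distributions of the ensemble matrix can be illustrated with a small numpy example. This is only a schematic sketch of how the n x N matrix of ensemble states is split; the actual implementation distributes the data with MPI rather than slicing a single in-memory array.

```python
import numpy as np

n, N = 12, 4        # state dimension and ensemble size (tiny, for illustration)
nprocs = 2          # number of processes
X = np.arange(n * N, dtype=float).reshape(n, N)   # ensemble matrix, states in columns

# Mode-decomposition: split column-wise; each process holds a
# sub-ensemble of FULL model states (independent of the model grid).
mode_parts = np.array_split(X, nprocs, axis=1)

# Domain-decomposition: split row-wise; each process holds the FULL
# ensemble of PARTIAL model states, matching a decomposition of the
# model domain used by the parallel model itself.
domain_parts = np.array_split(X, nprocs, axis=0)

for p in range(nprocs):
    print(f"process {p}: mode part {mode_parts[p].shape}, "
          f"domain part {domain_parts[p].shape}")
```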
3. The Parallel Data Assimilation Framework PDAF

The parallelization of ESKF algorithms alleviates the computational cost of storing and integrating an ensemble of model states and of the analysis
and re-initialization phases. However, the parallel implementation of an ESKF algorithm within an existing numerical model is a non-trivial task. The complex physical models are usually not prepared for the application of filter algorithms. Furthermore, while the core routines of filter algorithms can be parallelized independently from the model, the concurrent computation of multiple model integrations requires changes in the parallelization of the model itself. To simplify the implementation of parallel DA systems, the Parallel Data Assimilation Framework PDAF has been developed. PDAF contains fully implemented and optimized ESKF algorithms and defines an application program interface (API) which permits the filter algorithms to be combined with a numerical model. Only minimal changes in the model source code are required for the implementation. The core routines of PDAF are completely independent from the model and require no modifications during the implementation of a DA system. The API permits easy switching between different filter algorithms or sets of measurements. Parts of the data assimilation program which are specific to the model or refer to measurements are held in separate subroutines. These have to be implemented by the user of the framework such that they can be called by the filter routines via the API. PDAF is implemented in Fortran95 using the MPI standard for parallelization. In general, and next to the distinction of mode-decomposed and domain-decomposed filter algorithms, two parallel configurations are possible for PDAF. First, the filter analysis and re-initialization phases can be executed by processes which also contribute to the model integrations. Second, processes which are separate from the processes computing the model integrations could perform the filter analysis and re-initialization phases. We will focus here on the first parallel configuration. However, the interface structure of PDAF is essentially equal for both configurations. The difference is mainly that for the first configuration, direct subroutine calls to the filter routines are possible, while in the second configuration MPI communications will be necessary to connect the model integrations with the filter routines.
3.1. General Considerations

The development of PDAF is based on the following considerations:

- The numerical model is independent from the routines of the filter algorithms. The model source code should be changed as little as possible when combining the filters with the model.
Figure 2. Logical structure of the framework: the model part (initialization, time stepper, post-processing), the filter part (initialization, analysis, re-initialization) and the measurements are connected via state vectors, observation vectors, the measurement operator and the observation errors, together with the time-stepping information (time, nsteps).
- The filter source code is independent from the model. It solely operates on model state vectors, not on the physical fields of the model.
- The information on measurements is independent from the filter and depends only on the information about the model grid. The model does not need information about the observations. To compute the filter analysis step, the filter routines require information on the measurements. This is, e.g., the vector of measurements or the measurement operator, i.e. the operation which computes the observations that would be measured from the model state. Since this operation requires information on the spatial structure and the physical meaning of the elements of the state vector, it depends on the definition of the state vector and on the model grid. However, the filter algorithms only require the action of the measurement operator on some state vector, which can be implemented as a subroutine (a schematic sketch is given after this list).
- The framework can be logically partitioned into three parts, as sketched in figure 2. The transfer of information between the model and the filter, as well as between the filter and the observations, is performed via the API.
- PDAF has to allow for the execution of multiple concurrent model evolutions, each of which can be parallelized. Both the parallelization of the model and the number of parallel model tasks have to be specified by the user.
- Like the model, the filter routines can be executed in parallel. The filter routines can either be executed by (some of) the processes used for the model integrations or by a set of processes which is disjoint from the set of model processes. The exact specification of processes has to be configured by the user.
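The following sketch illustrates the role of a user-supplied measurement operator acting on a state vector, as described in the list above. The routine names, the observation indices and the simple point-selection operator are illustrative assumptions of ours; they are not PDAF's actual interface.

```python
import numpy as np

def measurement_operator(state, obs_indices):
    """User-supplied operation H(x): map a model state vector to the
    observed quantities.  Here H simply selects grid points that are
    directly observed; in general it can include interpolation or
    derived quantities, using knowledge of the model grid."""
    return state[obs_indices]

def innovation(state, observations, obs_indices):
    """Observation-minus-model residual d = y - H(x), the quantity the
    filter analysis needs from the measurement side."""
    return observations - measurement_operator(state, obs_indices)

# Toy example: a state of 100 values, 10 of which are observed with noise.
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
obs_idx = np.arange(0, 100, 10)
y = x[obs_idx] + 0.1 * rng.standard_normal(obs_idx.size)
print(innovation(x, y, obs_idx))
```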
To combine a filter algorithm with a numerical model in order to obtain
a data assimilation program, we consider the 'typical' structure of a model
which computes the time evolution of several physical fields. The 'typical' structure is depicted in figure 3a. In the initialization phase of the program, the grid for the computations is generated and the physical fields are initialized. Subsequently, the model fields are integrated for nsteps time steps. The integration takes into account boundary conditions as well as external forcing fields, like wind fields over the ocean. Having completed the evolution, some post-processing operations can be performed. The following discussion of PDAF focuses on the configuration in which the filter routines are executed by some of the model processes. For this case, the structure of the DA program with attached filter is shown in figure 3b. To initialize the filter framework, a routine pdaf_filter_init is added to the initialization part of the program. Here the arrays required for the filter, like the ensemble matrix, are allocated. Subsequently, the state estimate and approximated covariance matrix are initialized and the state ensemble is generated in a user-supplied routine called by pdaf_filter_init. The major logical change when combining a filter algorithm with the model source code is that a sequence of independent evolutions has to be computed. This can be achieved by enclosing the time stepping loop in an unconditioned outer loop which is controlled by the filter algorithm. Before the time stepping loop of the model is entered, the subroutine pdaf_get_state is called. This routine provides a model state from the state ensemble together with the number of time steps (nsteps) to be computed. To enable the consistent application of time-dependent forcing in the model, pdaf_get_state also provides the current model time. In addition, a flag doexit is initialized which is used as an exit condition for the unconditioned outer loop. For the forecast, the user has to ensure that the model integrations are really independent. Any reused fields must be newly initialized, for example. After the time stepping loop a call to the subroutine pdaf_put_state is inserted into the model source code. In this routine, the evolved model fields are stored back as a column of the ensemble state matrix of the filter. If the ensemble forecast has not yet finished, no further operations are performed in the routine pdaf_put_state. If, however, all model states of the current forecast phase have been evolved, pdaf_put_state executes the analysis and re-initialization phases of the chosen filter algorithm. With this structure of PDAF, the logic to perform the ensemble integration is contained in the routines pdaf_get_state and pdaf_put_state. Furthermore, the full filter algorithms are hidden within these routines. The user is required to ensure independent model integrations. In addition, some routines have to be supplied by the user which provide operations related to
Figure 3. Flow diagrams: a) Sketch of the typical structure of a model performing the time evolution of some physical fields. b) Structure of the data assimilation configuration of the model with attached filter. Added subroutine calls and control structures are shaded in gray.
the model fields or the measurements, as will be discussed below. For the parallelization of the DA system, the routine pdaf_init_comm is added to the program. Details of this routine are discussed in section 3.3.
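The calling structure described above and in Figure 3b can be summarized in a small, self-contained Python toy. The routine names (filter_init, get_state, put_state, init_comm) follow the text, but everything else — the argument lists, the stand-in model and the stand-in framework — is a hypothetical sketch of the control flow, not the actual Fortran interface of PDAF.

```python
import numpy as np

class ToyModel:
    """Stand-in for the numerical model (single field, trivial dynamics)."""
    def initialize(self):       self.field = np.zeros(10)
    def set_state(self, x, t):  self.field, self.time = x.copy(), t   # 'distribute_state'
    def step(self):             self.field += 0.01; self.time += 1.0
    def get_state(self):        return self.field                     # 'collect_state'
    def post_process(self):     print("done, mean =", self.field.mean())

class ToyPDAF:
    """Stand-in for the framework: cycles through an ensemble, no real analysis."""
    def __init__(self, ens_size=3): self.N, self.k, self.ens = ens_size, 0, []
    def init_comm(self):            pass   # would set up COMM_MODEL/COMM_FILTER/COMM_COUPLE
    def filter_init(self, user):    self.ens = [np.random.randn(10) for _ in range(self.N)]
    def get_state(self):
        done = self.k >= self.N
        state = None if done else self.ens[self.k]
        return state, 5, float(self.k), done      # state, nsteps, time, doexit
    def put_state(self, state, user):
        self.ens[self.k] = state; self.k += 1
        if self.k == self.N: pass                 # here: analysis + re-initialization

def run_assimilation(model, pdaf, user_routines=None):
    model.initialize()                            # generate mesh, initialize fields
    pdaf.init_comm()
    pdaf.filter_init(user_routines)
    while True:                                   # unconditioned outer loop, filter-controlled
        state, nsteps, time, doexit = pdaf.get_state()
        if doexit:
            break
        model.set_state(state, time)
        for _ in range(nsteps):                   # ordinary model time stepping
            model.step()
        pdaf.put_state(model.get_state(), user_routines)
    model.post_process()

run_assimilation(ToyModel(), ToyPDAF())
```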
3.2. Interface Structure

The three subroutines pdaf_filter_init, pdaf_get_state, and pdaf_put_state provide a defined interface to the filter algorithms. In addition, the user-supplied routines, like the observation-related subroutines, are called using the API defined by PDAF. This structure allows the core part of PDAF to be kept independent from the model, such that PDAF can be compiled independently from the model code. Thus, the API allows PDAF to be used with models implemented in Fortran as well as in C. The user-supplied routines should be programmed similarly to the model code. This simplifies the inclusion of model-related variables, like information on the grid or the model fields. For the case that the filter routines are executed by processes which also compute model integrations, the filter algorithms are split into parts which are contained in the three routines pdaf_filter_init, pdaf_get_state, and pdaf_put_state. However, all variables which are required to specify the filter are contained in the interface to pdaf_filter_init. Here also the parallelization information, like communicators, is handed over to PDAF. The three routines also require the specification of the names of user-supplied routines in the API. This allows the user to choose arbitrary names for these routines, since the internal calls are independent from the real subroutine names. Furthermore, it allows easy switching, e.g., between different observational data sets. The specifications for each particular set of measurements can be contained in separate routines. Next to the user-supplied routines required for the analysis phase of a filter algorithm, a pre/poststep routine has to be supplied. This routine is called at the beginning of the DA problem, before each analysis phase, and after the re-initialization phase. This routine allows the user to conduct a further examination of the estimates of state and covariance matrix. Also corrections to the model state can be performed, which might be necessary after the statistical update of states by the filter algorithm1. The filter algorithms operate on abstract one-dimensional state vectors which are contained in the array of ensemble states. The transfer of state information between this ensemble matrix and the model fields is performed by other user-supplied routines. These are defined in the calls to the routines pdaf_get_state and pdaf_put_state. Before the integration of a state an operation distribute_state is required. This initializes the model fields from a state vector and performs other re-initializations of the model, if required. After the integration, the state vector is updated from the evolved
Figure 4. Two-level parallelism of PDAF: the model and filter parts of the program are parallelized. In addition, several model tasks can be executed concurrently.
model fields by an operation collect_state. Depending on the parallelization of the model and on whether a mode- or a domain-decomposed filter algorithm is used, some MPI communication can be necessary in these operations. An operation similar to collect_state is also required in the user-supplied routine which is called by pdaf_filter_init to initialize the ensemble matrix.

3.3. Parallelization Aspects
PDAF supports a two-level parallelism, which is depicted in figure 4. Each model task and the filter routines themselves can be executed by multiple processes. In addition, multiple model tasks can be executed concurrently. The communication within the filter routines and the control of multiple model tasks are handled within PDAF. Thus, only a possible communication for the initialization of the model fields before a model integration has to be added, as described above. If the filter routines are executed by some of the processes which perform the model integrations, the four routines of PDAF which are called from the model code are always called by all processes. The decision which processes execute, e.g., the filter analysis phase is made within these routines. For the parallelization of the DA system, a change to the model source code concerning the configuration of MPI communicators is required. For MPI-parallelized models there is typically a single model task which often operates in the global MPI communicator MPI_COMM_WORLD. To
allow for multiple concurrent model tasks, the global communicator has to be replaced by a communicator of disjoint process sets in which each of the model tasks operates. Thus, a communicator COMM_MODEL has to be defined. In the model source code, the reference to MPI_COMM_WORLD has to be replaced by COMM_MODEL. Besides the communicator for the model tasks, a communicator COMM_FILTER has to be defined for the processes which execute the filter routines. Finally, a communicator COMM_COUPLE is required, which defines the processes used for data transfers between the filter and model parts of the DA system. These communicators are defined in the routine pdaf_init_comm, which is inserted into the original model code as depicted in figure 3b. Since the process configuration will depend on the parallelization of the model, PDAF can only provide a template for pdaf_init_comm. This template will be usable in general, but without providing optimal parallel performance. Thus, for optimal performance, the routine should be adapted to the particular problem, e.g. to support particular process topologies. The communication is determined by the three MPI communicators COMM_MODEL, COMM_FILTER, and COMM_COUPLE. Figure 5 exemplifies a possible configuration of the communicators for a mode-decomposed filter. In the example the program is executed by a total of 8 processes. These are distributed into four parallel model tasks, each executed by two processes in the context of COMM_MODEL. The filter routines are executed by two processes. These are the processes of rank 0 and 4 in the context of MPI_COMM_WORLD. In the context of COMM_MODEL the filter processes have rank 0. Each filter process is coupled to two model tasks. Thus, there are two disjoint process sets in COMM_COUPLE, each consisting of two processes. With this configuration, the filter initialization in pdaf_filter_init divides the ensemble matrix into two sub-ensembles which are stored on the two filter processes. To utilize all four model tasks, each filter process again distributes its sub-ensemble at the beginning of each forecast phase to the two model tasks which are coupled to it by COMM_COUPLE. After the forecast phase, the ensemble members are gathered again on the two filter processes. A simpler parallelization variant would be to execute the filter routines by a single process of each model task. This variant would also involve a smaller amount of communication.
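One possible way to build the three communicators for the eight-process example of Figure 5 is sketched below with mpi4py. The colour choices reproduce the configuration described in the text (four model tasks of two processes; filter processes of world ranks 0 and 4); the particular grouping chosen for COMM_COUPLE is our own assumption, and the whole fragment is an illustration rather than the template routine shipped with PDAF.

```python
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()                      # intended to be run with 8 processes

n_tasks, task_size = 4, 2                    # four model tasks of two processes each
task_id = rank // task_size
task_lead = (rank % task_size == 0)          # rank 0 within each model task

# COMM_MODEL: disjoint communicators, one per model task; inside the model
# code this communicator replaces MPI_COMM_WORLD.
comm_model = world.Split(color=task_id, key=rank)

# COMM_FILTER: the two filter processes (world ranks 0 and 4 in the example).
filter_color = 0 if rank in (0, 4) else MPI.UNDEFINED
comm_filter = world.Split(color=filter_color, key=rank)

# COMM_COUPLE: one plausible choice couples each filter process with the lead
# process of the second model task it serves, giving two disjoint two-process
# groups {0, 2} and {4, 6}.
couple_color = task_id // 2 if task_lead else MPI.UNDEFINED
comm_couple = world.Split(color=couple_color, key=rank)

print(f"world rank {rank}: model task {task_id}, "
      f"filter member: {comm_filter != MPI.COMM_NULL}, "
      f"couple member: {comm_couple != MPI.COMM_NULL}")
```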
If a domain-decomposition is used for the parallelization of the model and the filter parts of the program, a different configuration of the processes is used. Considered is again the situation where the filter and the model use the same domain-decomposition. Figure 6 shows a possible process configuration. Here the program is executed by six processes in total. These are distributed over two model tasks, each consisting of three processes. The filter routines are executed by all processes of one of the model tasks. This allows for the direct initialization of the model fields in this task. The second model task is connected to the filter via COMM_COUPLE. With domain-decomposition, the initialization of the sub-states is performed in the user-supplied initialization routine called by pdaf_filter_init. The filter operates on the whole ensemble of local sub-states. To use multiple model tasks, the ensemble is distributed into sub-ensembles. These are sent to the model tasks via COMM_COUPLE. The simplest process configuration with domain-decomposition would be to use a single model task only. If the filter and the model use the same domain-decomposition, this configuration avoids any communication between the filter and the model.

Figure 5. Example communicator configuration for mode-decomposition of the ensemble matrix.

Figure 6. Example communicator configuration for the case of domain-decomposed states. The filter is executed by some of the model processes.

3.4. Filter algorithms
In the current version PDAF contains implementations of the Ensemble Kalman Filter (EnKF)6, the SEIK filter14, and the SEEK filter15. Also some variants of the filters are available, e.g. a fixed-basis SEEK filter. This
formulation keeps the error directions, described by the singular vectors of the state covariance matrix, constant while changing the weight for these directions during the analysis phase of the filter algorithm. The fixed-basis SEEK filter allows for a much faster assimilation process, since only a single model integration is required. However, this variant is expected to provide inferior state and error estimates, see e.g. Brasseur et al.2. The clear separation between the model, measurement, and filter parts of PDAF facilitates the development and implementation of additional filter algorithms within PDAF. The filter implementation is independent from further development of the model. Each implemented filter algorithm can be called from the model code via the uniform API. If a new filter algorithm requires new measurement-related routines, their names can be added to the call to pdaf_put_state to maintain the flexibility of user-defined subroutine names. To avoid the API of pdaf_put_state becoming too long, separate routines for the filters have been implemented. These filter-specific routines only require the specification of the routines used in the particular algorithm.
4. Parallel Performance of PDAF and ESKF Algorithms

To exemplify the parallel performance of PDAF, the framework was tested with the finite element ocean model FEOM4. To allow for a large number of experiments, a small model configuration of a rectangular box of 31 by 31 grid points with 11 layers was used. In this model configuration, synthetic observations of the full sea surface height field are assimilated every 2.5 days for a period of 40 days. Using the horizontal velocity components, the temperature field, and the sea surface height as state variables, the state vector has a size of 32674. In each analysis phase, a measurement vector of size 961 is assimilated. The experiments have been performed on a Sun Fire 6800 with 24 processors. Details on these experiments and a comparison of the filtering performances of different filter algorithms have been given by Nerger et al.11. Figure 7 shows the speedup of the total computing time as a function of the number of parallel model tasks. Results for the EnKF, SEEK, and SEIK filters are shown for an ensemble of 36 members. Each model task is executed by a single process. Thus, no MPI communication within the operations distribute_state and collect_state is required. The speedup is equal for all three filters and slightly sub-optimal, with a parallel efficiency of about 85% for 18 model tasks. This sub-optimality is due to the
Figure 7. Speedup as a function of the number of parallel model tasks for the framework with a filter process on each model task. For 18 parallel model tasks a parallel efficiency of 85% is obtained.
iterative solver applied in the inverse time stepping of FEOM. Since the number of iterations can vary in each time step for different model tasks, a small de-synchronization occurs causing the speedup to be sub-optimal. The computing time for the filter analysis and re-initialization phases are negligible in these experiments. For the EnKF, which is the most costly algorithm in the experiments, the filter analysis requires less than 0.1% of the total computing time in the serial experiment. There are practical situations in which the computing time for the filter analysis and re-initialization phases can become significant, e.g., if measurements are assimilated very frequently. In this case, the parallel efficiency of these parts of the algorithms will determine the speedup of the DA program. Figure 8 shows the computing time and speedup of the EnKF, SEEK, and SEIK algorithms for PDAF with mode-decomposition (upper panels) and domain-decomposition (lower panels) for ensembles with 60 members. For each filter algorithm and number of processes 20 experiments are performed. The figure shows mean values and error bars of one standard deviation. The SEEK and SEIK algorithms are much faster than the EnKF algorithm. This difference will diminish for increasing ensemble sizes, showing the difference of the algorithmic formulation of the EnKF compared to the SEEK and SEIK filters. The speedup of the EnKF stagnates at a value of about 1.2 for both the mode-decomposition and domain-decomposition variants. For increasing
Figure 8. Execution time and speedup for the filter update phases (analysis and re-initialization) as a function of the number of processes. The assimilations have been performed with an ensemble size of 60. The upper panels show the results for the mode-decomposed variant of PDAF, while the lower panels are for the domain-decomposition variant. Displayed are mean values and standard deviations over 20 experiments for each combination of filter algorithm and number of processes. The mode-decomposed algorithms require a large amount of communication, which leads to a small speedup. For domain-decomposition an ideal speedup is obtained with the SEEK and SEIK filters.
ensemble sizes, the speedup will grow (not shown), but will remain far from optimal. This small speedup is particular to the EnKF algorithm and the possibilities of its parallelization. The operations in the EnKF algorithm are global over the ensemble and over the observation space. Due to this, a large amount of communication is required with mode-decomposition as well as domain-decomposition. In addition, several parts of the algorithm show a small parallel efficiency which could only be avoided by rearranging larger arrays over all processes which would itself require costly communications. A possibility to increase the speedup of the EnKF algorithm would
be to introduce a localization to the analysis phase, see e.g. Keppenne and Rienecker10. This would effectively amount to a reduction of the dimension of the measurement vector and hence to a smaller amount of MPI communication. However, a localization would involve an approximation of the analysis phase. The speedup of the SEEK and SEIK filters depends strongly on the chosen parallelization strategy. For mode-decomposition only a small speedup is obtained. It will increase slightly for larger ensemble sizes. The small speedup is mainly due to the large amount of communication required in the mode-decomposed algorithms. For increasing ensemble size, the time for communications decreases relative to the time of computations. The difference in the speedups of the SEEK and SEIK algorithms is caused by algorithmic differences which lead to smaller communication requirements in the SEEK filter compared to the SEIK filter. For domain-decomposition the SEEK and SEIK filters show an ideal speedup. In this case only a very small amount of data is communicated. This is due to the fact that both the SEEK and SEIK algorithms are global over the ensemble, i.e. over the error space, but not global over the model space. Thus, it is optimal to distribute data as full ensembles of sub-states, as is done in the domain-decomposition variant of the filter algorithms. For larger ensemble sizes, the parallel efficiency will decrease slightly. This effect is caused by some non-parallelized operations whose complexity scales with the cube of the ensemble size. However, the influence of these operations will be negligible in practical situations with a much larger state dimension than in the experiments performed here.
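The reason the domain-decomposed SEEK/SEIK analyses communicate so little can be illustrated with a toy mpi4py fragment: all ensemble-space quantities are small N x N matrices that can be summed over domains with a single Allreduce, while every state-space operation stays local. The sketch mirrors only the structure of this argument; the weight matrix is a placeholder, not the SEEK or SEIK analysis equations as implemented in PDAF.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
n_local, N = 1000, 60                      # local sub-state size, ensemble size

# Each process holds the full ensemble of its local sub-states.
rng = np.random.default_rng(comm.Get_rank())
X_local = rng.standard_normal((n_local, N))
Xp_local = X_local - X_local.mean(axis=1, keepdims=True)   # local perturbations

# Ensemble-space Gram matrix: local contributions are N x N and are summed
# over all domains -- the only communication is this small Allreduce.
G_local = Xp_local.T @ Xp_local
G = np.empty_like(G_local)
comm.Allreduce(G_local, G, op=MPI.SUM)

# Any update expressed as ensemble-space weights W (N x N) is then applied
# to the local sub-states without further communication.
W = np.linalg.inv(np.eye(N) + G / (N - 1))     # placeholder weight matrix
X_local_updated = X_local.mean(axis=1, keepdims=True) + Xp_local @ W
print(comm.Get_rank(), X_local_updated.shape)
```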
5. Summary
The parallel data assimilation framework PDAF has been introduced for data assimilation with advanced algorithms based on the Kalman filter. To motivate the organization of the framework, the concept and structure of error subspace Kalman filters have been reviewed. These algorithms represent a wide range of filter algorithms with promising properties for their application to large-scale nonlinear numerical models. The filters show a sequential structure consisting of alternating forecast and analysis phases. In addition, some algorithms perform a re-initialization phase. The filter algorithms allow for a clear separation between the core part of a filter, the model used to integrate different model states, and the measurements to be assimilated during the analysis phase. In addition, the
algorithms share a natural parallelism in the forecast phase caused by the independent integration of an ensemble of model states. These properties motivate the development of a data assimilation framework which facilitates the implementation of a data assimilation system on the basis of numerical models which are not designed for data assimilation. PDAF defines an application program interface used to call the framework routines from within the model code. In addition, an interface is defined for several user-supplied routines which, e.g., include operations dependent on the measurements. These routines are called by the core routines of PDAF. The framework contains a number of filter algorithms which are fully parallelized and optimized. In the current version of PDAF these are the popular Ensemble Kalman Filter, the SEEK filter, and the SEIK filter. Two parallelization variants are supported which rely on different distributions of the matrix of ensemble states. The variants are the columnwise mode-decomposition and the row-wise domain-decomposition. The domain-decomposition variant is the obvious strategy if the model itself is parallelized using a domain-decomposition of the model grid. The framework permits a further development and implementation of DA algorithms independently from the development of the numerical models. Switching between the filter algorithms is possible by the specification of a single variable. In general, also alternative methods like the variational 4D-Var (adjoint) techniques are possible within the framework, though this has not yet been implemented. Numerical experiments with an idealized configuration of the finite element ocean model FEOM have been performed to examine the parallel efficiency of the full framework as well as the analysis and re-initialization parts of the filter algorithms. An excellent speedup is obtained for the full framework. With regard to the analysis and re-initialization phases of the filters, strong differences between the Ensemble Kalman filter and the SEEK and SEIK filters have been found. The EnKF shows only a small speedup for both decomposition variants. The SEEK and SEIK filters show a small speedup for mode-decomposition but exhibit an ideal speedup for the variant using domain-decomposition with small ensemble sizes. It is planned to make PDAF publicly available in the future to facilitate the implementation of data assimilation systems. Furthermore, a simplified sharing of filter algorithms will be possible on this unified implementation basis. Currently the framework undergoes a beta-testing phase with a limited number of users and models.
Acknowledgments
We are grateful to Stephan Frickenhaus, Sergey Danilov and Gennady Kivm a n for instructive discussions and comments on the structure and parallelization of PDAF. Meike Best is thanked for a careful proofreading. References 1. J.-M. Brankart, C.-E. Testut, P. Brasseur, and J. Verron. Implementation of a multivariate data assimilation scheme for isopycnic coordinate ocean models: Application to a 1993-1996 hindcast of the North Atlantic ocean circulation. J. Geophys. Res.. 108(C3):3074, 2003. doi:10.1029/2001JC001198. 2. P. Brasseur, J. Ballabrera-Poy, and J. Verron. Assimilation of altimetric data in the mid-latitude oceans using the singular evolutive extended Kalman filter with an eddy-resolving, primitive equation model. J. Mar. Syst., 22:269-294, 1999. 3. K. Brusdal, J. M. Brankart, G. Halberstadt, G. Evensen, P. Brasseur, P. J. van Leeuwen, E. Dombrowsky, and J. Verron. A demonstration of ensemble based assimilation methods with a layered OGCM from the perspective of operational ocean forecasting systems. J. Mar. Syst., 40-41:253-289, 2003. 4. S. Danilov, G. Kivman, and J. Schroter. Evaluation of an eddy-permitting finiteelement ocean model in the North Atlantic. Ocean Modelling, 2005. in press. 5 . F.-X. Le Dimet and 0. Talagrand. Variational algorithms for analysis and assimilation of meteorological observations: Theoretical aspects. Tellus, 38A:97-110, 1986. 6. G. Evensen. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Car10 methods to forecast error statistics. J. Geophys. Res., 99(C5):10143-10162, 1994. 7. G. Evensen. The Ensemble Kalman Filter: Theoretical formulation and practical implementation. Ocean Dynamics, 53:343-367, 2003. 8. A. H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, New York, 1970. 9. R. E. Kalman. A new approach to linear filtering and prediction problems. %ns. ASME, J. Basic Eng., 82:35-45, 1960. 10. C. L. Keppenne and M. M. Rienecker. Initial testing of a massively parallel Ensemble Kalman Filter with the Poseidon isopycnal ocean circulation model. Mon. Wea. Rev., 130:2951-2965, 2002. 11. L. Nerger. Parallel Filter Algorithms for Data Assimilation in Oceanography. Number 487 in Reports on Polar and Marine Research. Alfred Wegener Institute for Polar and Marine Reasearch, Bremerhaven, Germany, 2004. PhD Thesis, University of Bremen. 12. L. Nerger, W. Hiller, and J. Schroter. A comparison of error subspace Kalman filters. Tellus A , 2005. in press. 13. Project PALM. Projet d’assimilation par logiciel multi-methodes. URL http ://www .cerf acs .f r/-palm/.
14. D. T. Pham, J. Verron, and L. Gourdeau. Singular evolutive Kalman filters for data assimilation in oceanography. C. R. Acad. Sci., Ser. II, 326(4):255-260, 1998.
15. D. T. Pham, Jacques Verron, and Marie Christine Roubaud. A singular evolutive extended Kalman filter for data assimilation in oceanography. J. Mar. Syst., 16:323-340, 1998.
16. F. Rabier, H. Jarvinen, E. Klinker, J.-F. Mahfouf, and A. Simmons. The ECMWF operational implementation of four-dimensional variational assimilation. I: Experimental results with simplified physics. Quart. J. Roy. Meteor. Soc., 126:1143-1170, 2000.
17. SESAM. An integrated system of sequential assimilation modules. URL http://me01715.hmg.inpg.fr/Web/Assimilation/SESAM/.
OPTIMAL APPROXIMATION OF KALMAN FILTERING WITH TEMPORALLY LOCAL 4D-VAR IN OPERATIONAL WEATHER FORECASTING
H. AUVINEN
University of Helsinki, Department of Mathematics and Statistics, P.O. Box 68 (Gustaf Hallstromin katu 2b), FI-00014 University of Helsinki, Finland
E-mail: [email protected]

H. HAARIO AND T. KAURANNE
Lappeenranta University of Technology, Department of Information Technology, P.O. Box 20, FI-53851 Lappeenranta, Finland
E-mail: [email protected], tuomo.kauranne@lut.fi

The adoption of the Extended Kalman Filter (EKF) for data assimilation in operational weather forecasting would provide estimates of prediction error covariance and make it possible to take model error into account in the assimilation process. Unfortunately, full-blown Kalman filtering is not feasible even on the fastest parallel supercomputers. We propose a novel approximation to EKF, called the Variational Kalman Filter, or VKF. In VKF, a low rank approximation to the prediction error covariance matrix is generated by a very short, temporally local four-dimensional variational assimilation cycle, at a low computational cost. VKF provides a locally optimal approximation to the error covariance matrix and also allows model error to be incorporated in the assimilation process. Initial numerical tests with VKF applied to an advection-reaction-dispersion equation are reported.
1. Data assimilation in operational weather forecasting

The principal component in Numerical Weather Prediction (NWP) is the primitive-equations based atmospheric model that produces the forecast. But no causal time dependent partial differential equation admits a unique solution without initial conditions. Data assimilation is that other half of numerical weather prediction that produces these initial conditions on the basis of atmospheric measurements. The first algorithms used for data assimilation were based on optimal statistical interpolation and used strictly spatial structure functions to
disseminate the information contained in measurements. It was recognized in the 1980's that structure functions should be flow, and therefore time, dependent. This objective was the most important motivation for introducing four-dimensional data assimilation, or 4D-Var, into operational weather forecasting in the 1980's and 1990's; see e.g. Le Dimet and Talagrand, Derber, and Rabier et al. [12]. However, any purely spatial snapshot of the atmosphere is an idealization that is never attained in practice. Weather observations are carried out mostly asynchronously and their target, the atmosphere, is constantly in chaotic motion. This means that data assimilation, just like the weather forecast itself, is an inherently time dependent process, not a single snapshot. The theoretical framework used in defining 4D-Var was optimal control theory. The part of it that was applied in the 1980's was identification theory. Optimal identification produces an optimal estimate of a snapshot state of the system. Since the 1990's it has become increasingly evident that data assimilation should rather be seen as a true optimal control problem. The objective is not to optimally identify a snapshot state, but rather to minimize the growth of forecast errors. This observation has turned the attention of the numerical weather prediction community from 4D-Var towards Kalman filtering.

2. General assimilation problem
The formulation of the general data assimilation problem for discrete time steps t = 1, 2, 3, ..., n contains an evolution or prediction equation and an observation equation:

x(t) = M_t(x(t - 1)) + E_t,    (1)
y(t) = K_t(x(t)) + e_t,        (2)
where M_t is a model of the evolution operator, x(t) is the state of the process at time t and E_t is a vector-valued stochastic process. In NWP, E_t includes model error and other sources of prediction error, such as the discretization error of atmospheric states in the chosen discrete basis. The second equation connects the state estimate x(t) to the measurements y(t), with their associated observation errors e_t, by way of an observation operator K_t. In NWP, K_t maps the state of the atmosphere to quantities that can be directly measured, i.e. weather observations. Both M_t and K_t are generally nonlinear. We shall denote nonlinear operators
by italicized capital letters below, and their local linearizations by the corresponding capital letters in boldface.
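As a concrete illustration of equations (1) and (2), the following Python sketch generates a "truth" trajectory and noisy observations from toy nonlinear operators. The operators and noise levels are invented for illustration and are not those of any operational system.

    import numpy as np

    rng = np.random.default_rng(1)

    def model_step(x):
        # Toy nonlinear evolution operator M_t.
        return x + 0.05 * np.sin(x)

    def obs_operator(x):
        # Toy nonlinear observation operator K_t: observe the square of every second state value.
        return x[::2] ** 2

    def simulate(x0, n_steps, q_std=0.01, r_std=0.1):
        # Twin-experiment generator following equations (1)-(2).
        xs, ys = [], []
        x = x0.copy()
        for _ in range(n_steps):
            x = model_step(x) + rng.normal(0.0, q_std, size=x.shape)          # E_t
            y = obs_operator(x) + rng.normal(0.0, r_std, size=x[::2].shape)   # e_t
            xs.append(x.copy())
            ys.append(y)
        return np.array(xs), np.array(ys)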
3. The four-dimensional variational assimilation method

The purpose of the four-dimensional variational data assimilation method (4D-Var), see e.g. Bannister for an introduction, is to estimate the initial conditions of a high-dimensional model of a dynamical system. This estimation is done by optimizing the fit between a forecast of the evolution model and a set of observations which the model is supposed to predict. This task is usually formulated as a minimization problem of adjusting the free parameters of the model until the model best predicts the observations. The name of the method refers to the use of model initial conditions as the control variable for predicting observations carried out in three spatial dimensions over a given time interval. 4D-Var is a generalization of the less expensive 3D-Var method, and it also supersedes earlier statistical interpolation methods, which do not take model dynamics into account. For operational weather forecasting, often featuring atmospheric models with up to 10^8 spatial degrees of freedom, the computational cost of the 4D-Var method is reasonable. The method also allows a large number of atmospheric observations of many different types to be included in the assimilation process. Because of these features, many operational weather services are using the 4D-Var method for estimating the initial conditions of their numerical weather prediction model.
3.1. The cost function

The goal of assimilation is to find those initial conditions x_0 of a fixed evolution model that minimize a cost or penalty function J(x_0). This cost function is defined to be a global measure of the misfit between the observation sets and a model forecast over the assimilation period. Since the number of degrees of freedom of the model is far greater than the number of reliable and independent observations, a priori information is needed in order to initialize the model. We also note that observations are not evenly spread over the atmosphere, and therefore interpolation across data voids is out of the question. In the 4D-Var data assimilation method, the necessary a priori information is taken from a previous model forecast, which is valid at time t = 0 and called the background state x_b. Because the forecast model is deterministic, it is sufficient to calculate the cost function with respect to x_b at the initial time t = 0 only.
A common formulation of the cost function J(x_0) simultaneously penalizes a misfit between the model initial state and the background, namely J_b, and a misfit between the synthetic observation values derived from the model state and the corresponding real observations, J_o. Typically both of these contributions are present in the cost function formula:

J(x_0) = J_b + J_o
       = (1/2) (x_b - x_0)^T S_a^{-1} (x_b - x_0)
       + Σ_{t=0}^{T} (y(t) - K_t(M_t(x_0)))^T S_e^{-1} (y(t) - K_t(M_t(x_0))),    (3)
where S_a is the error covariance of the a priori information, S_e is the covariance of the observation errors and T is the length of the assimilation period. In 4D-Var, S_e is often assumed time invariant, although this is not strictly required by the formulation above. This formulation of the cost function assumes that the errors of observations and background are unbiased and Gaussian. Operational implementations of 4D-Var also assume that the model is perfect over the assimilation period T. This leads to the so-called strong constraint formalism of data assimilation. A more recent weak constraint formulation of 4D-Var is able to relax this condition and to use the limited memory of atmospheric motion to arrive at the same analysis as an indefinitely run Kalman smoother, see Fisher.
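A minimal Python sketch of the strong-constraint cost function (3) is given below. The model and observation operators, the inverse covariances and the observation list are generic placeholders supplied by the caller; this is an illustration of the formula, not an operational 4D-Var code.

    import numpy as np

    def fourdvar_cost(x0, xb, Sa_inv, Se_inv, obs, obs_op, model_step):
        # obs is a list of (t, y_t) pairs; model_step advances the state by one time step.
        d = x0 - xb
        jb = 0.5 * d @ Sa_inv @ d                    # background term J_b
        obs_by_t = dict(obs)
        T = max(obs_by_t) if obs_by_t else 0
        jo, x = 0.0, x0.copy()
        for t in range(T + 1):
            if t > 0:
                x = model_step(x)                    # perfect-model (strong constraint) assumption
            if t in obs_by_t:
                r = obs_by_t[t] - obs_op(x)
                jo += r @ Se_inv @ r                 # observation term J_o
        return jb + jo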
4. The Basic Linear Kalman Filter

The basic Kalman filter, see e.g. Rodgers [13], operates on a linear version of the general assimilation problem, equations (1) and (2):
x(t) = M_t x(t - 1) + E_t,    (6)
y(t) = K_t x(t) + e_t,        (7)

where M_t is the linear evolution operator and similarly K_t is the linear observation operator. The Kalman filter operates sequentially in time. Let x_est(t - 1) be an estimate of the state x(t - 1) and S_est(t - 1) be the corresponding error covariance matrix of the estimate. At time t the evolution operator is used to produce an a priori estimate x_a(t) and its covariance S_a(t):

x_a(t) = M_t x_est(t - 1),                    (8)
S_a(t) = M_t S_est(t - 1) M_t^T + S_Et,       (9)
where S_Et is the covariance of the prediction error E_t. The next step is to combine x_a(t) with the observations y(t) made at time t to construct an updated estimate of the state and its covariance:
G_t = S_a(t) K_t^T (K_t S_a(t) K_t^T + S_et)^{-1},    (10)
x_est(t) = x_a(t) + G_t (y(t) - K_t x_a(t)),           (11)
S_est(t) = S_a(t) - G_t K_t S_a(t),                    (12)
where G_t is the Kalman gain matrix, which is functionally identical to the maximum a posteriori estimator.

5. The Extended Kalman Filter

In the more general case, when the evolution model and/or the observation model is non-linear, the Extended Kalman Filter (EKF), see e.g. Rodgers [13], is required. The filter uses the full non-linear evolution model equation (1) to produce an a priori estimate:
x_a(t) = M_t(x_est(t - 1)).    (13)
In order to obtain the corresponding covariance S_a(t) of the a priori information, the prediction model is linearized about x_est(t - 1):
M_t = ∂M_t(x)/∂x |_{x = x_est(t-1)},          (14)
S_a(t) = M_t S_est(t - 1) M_t^T + S_Et.        (15)
In operational weather forecasting the linearization M_t of the prediction model is often available because 4D-Var also requires it. Among other user groups of EKF, the linearization in equation (14) is produced by taking finite differences between evolved small state vector perturbations, which is a computationally expensive operation for large models. The observation operator is linearized at the time of the observation about the a priori estimate x_a(t) in order to obtain K_t, which is then used for calculating the gain matrix:
K_t = ∂K_t(x)/∂x |_{x = x_a(t)},                        (16)
G_t = S_a(t) K_t^T (K_t S_a(t) K_t^T + S_et)^{-1}.       (17)
After this, the full non-linear observation operator is used to update x_a(t), and this is then used to produce a current state estimate and the corresponding error estimate:

x_est(t) = x_a(t) + G_t (y(t) - K_t(x_a(t))),    (18)
S_est(t) = S_a(t) - G_t K_t S_a(t).              (19)

If the linearization of the observation operator at x_a(t) is not good enough to construct x_est(t), it is necessary to carry out some iterations of the last four equations. In this case the most recent estimate x_est(t) should be used for evaluating K_t(x_est(t)), which is then compared with the observation y(t) in equation (18).
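The following Python sketch collects equations (13)-(19) into one EKF forecast/analysis cycle. The Jacobian routines stand in for the tangent linear model and linearized observation operator; everything here is a small dense-matrix illustration, not an operational implementation.

    import numpy as np

    def ekf_step(x_est, S_est, y, model_step, model_jacobian, obs_op, obs_jacobian, S_E, S_e):
        # Forecast: propagate the state with the nonlinear model, the covariance with its linearization.
        x_a = model_step(x_est)                               # (13)
        M = model_jacobian(x_est)                             # (14)
        S_a = M @ S_est @ M.T + S_E                           # (15)
        # Analysis: linearize the observation operator about the a priori state.
        K = obs_jacobian(x_a)                                 # (16)
        G = S_a @ K.T @ np.linalg.inv(K @ S_a @ K.T + S_e)    # (17)
        x_new = x_a + G @ (y - obs_op(x_a))                   # (18)
        S_new = S_a - G @ K @ S_a                             # (19)
        return x_new, S_new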
6. Computationally feasible approximations to Kalman filtering

The biggest drawback of the EKF algorithm is its enormous computational cost. The formulation of the algorithm calls for tangent linear integrations of all the columns of the matrix S_est(t - 1) at step (15) in order to forecast the analysis error covariance matrix S_a(t). As there are up to 10^8 spatial degrees of freedom in a weather model, S_est(t) is a matrix with 10^8 columns and 10^16 elements. It will not be possible to compute such matrices in the foreseeable future. It has therefore been necessary to develop approximate algorithms to emulate the behaviour of full Kalman filtering. The two foremost candidates in this effort have been the Reduced Rank Kalman Filter (RRKF), see Fisher and Andersson, and the related SEEK algorithm, see Nerger [11]; and the Ensemble Kalman Filter (EnKF), see Evensen [3]. The benefits of both of these methods are combined in the SEIK algorithm, see Nerger [11]. In RRKF, a low rank update of the prediction error covariance matrix is determined on the basis of a singular vector analysis of the associated 4D-Var assimilation. Dominant Hessian singular vectors can be expected to represent the directions of the fastest growing atmospheric modes, as demonstrated by Ehrendorfer and Beck [2]. In order to identify a bundle of dominant singular vectors for RRKF, the Lanczos algorithm is run with the tangent linear model, linearized around an analysis produced by 4D-Var, over some suitable time interval. A small number of singular vectors corresponding to the largest singular values are chosen as the subspace along which the error covariance matrix is updated. In practice, this is done by projecting the spectral control variables of the model onto this subspace every time step and computing the covariance matrix of these projections in this low dimensional subspace only.
As reported by Isaksen, Fisher and Andersson [7], problems have been encountered in containing the true model error covariance within any such predetermined subspace. Covariance tends to "leak" away over time, irrespective of how the subspace has been chosen. In particular, there are problems in associating the fastest growing modes over 48 hours with analysis error. Yet the analysis is the only route by which 4D-Var data assimilation can influence the ensuing forecast. It may well be that the reasons for these problems are very fundamental to fluid dynamics. After all, fluid flow is a chaotic phenomenon and no proof of continuous dependence of fluid motion on its initial state is known for three dimensional flows. It seems plausible that there are no long-living low dimensional invariant subspaces in fluid flow, and therefore the covariance of the fluid state cannot be bound to any fixed low dimensional subspace of the state space for any long period of time. Yet we clearly witness temporally local predictability of atmospheric motion in our daily weather forecasts. Also, as Ehrendorfer and Beck [2] have shown with a quasi-geostrophic model, the Extended Kalman Filter does capture the error covariance structure of a quasi-geostrophic model if the subspace chosen is big enough, such as two thirds of the total dimension of the model. In Ensemble Kalman Filtering, see Evensen [3], the idea of an explicit matrix algorithm for modelling prediction error covariance is dropped altogether. Instead, a bundle of forecasts is produced from slightly perturbed initial conditions, analogously to Monte Carlo methods. The cross-correlations between the different forecasts are then calculated to form a low rank approximation to the forecast error covariance at any time step. The advantage of EnKF is that it can use the full non-linear model at every step and is not dependent on any particular linearization of it. However, it is well known from stochastic process theory that Monte Carlo type approximations converge slowly. There are ways to choose the ensemble of initial conditions in EnKF judiciously, so as to obtain faster convergence to the true covariance matrix, see Ehrendorfer and Beck [2]. But such selection methods are generally quite analogous to RRKF. In particular, Hessian singular vectors play a significant role in them, too. This means that such accelerated EnKF methods may fall prey to the very same pitfalls in representing the true error covariance as RRKF.
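The low-rank character of the EnKF covariance estimate follows directly from how it is formed, as the short sketch below makes explicit. This is a generic sample-covariance computation, not any particular operational EnKF.

    import numpy as np

    def ensemble_covariance(ens):
        # Sample estimate of the forecast error covariance from an ensemble of
        # perturbed forecasts stored as the columns of ens (state dim x ensemble size).
        m = ens.shape[1]
        anomalies = ens - ens.mean(axis=1, keepdims=True)
        return anomalies @ anomalies.T / (m - 1)

    # The estimate is low rank by construction: its rank is at most m - 1, so for
    # an ensemble much smaller than the state dimension it spans only a small
    # subspace of the full state space.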
7. Variational Kalman Filter
In this section we introduce the Variational Kalman Filter (VKF) method. The idea of the VKF method is to replace the most time consuming part of the Extended Kalman Filter (EKF) method, the updating of the prediction error covariance matrix S_est(t - 1) in equation (15) in order to obtain S_a(t), by a low rank update that only requires a few dozen model integrations, instead of 10^8. Moreover, even these few extra integrations are obtained without extra cost from an accompanying, temporally local 4D-Var run. The main difference compared to the EKF method is that the covariance of the a priori information S_a(t) is estimated by a low rank update to an earlier covariance S_a(t - t_w). Typically, this local assimilation window t_w is just a few integration time steps long and may even correspond to just the time interval between two successive sets of observations. The vectors forming the low rank update used by the VKF method are collected from the search directions of a temporally local 4D-Var run. These search directions span the same subspace as the dominant Hessian singular vectors in a corresponding Lanczos algorithm because of a similarity between the Lanczos method and Quasi-Newton minimization with the BFGS update, see Kauranne [8]. Just like 4D-Var, VKF does not require any finite difference computations to produce an estimate of the error covariance matrix when an adjoint to the linearized forecast model is available. Instead, it relies on a key property of all quasi-Newton methods that renders all successive pairs of minimization residuals to be differences of two successive cost function gradients. The Variational Kalman Filter can be viewed as a predictor-corrector scheme in the Cartesian sum of the state and covariance spaces. In the scheme, a temporally local 4D-Var over the assimilation window t_w is used as a predictor to predict both the state and the error covariance matrix. Dominant Hessian singular vectors at time t - t_w are approximated by the search directions at that time. These search directions are evolved over the assimilation period of t_w time steps to the present time, to be then corrected with a full EKF step where the prior error covariance matrix is this evolved low rank update obtained from the predictor step, together with those possibly retained and evolved from earlier assimilation windows. This process allows the rank of the error covariance approximation to be regulated at will. The choice of the rank of the low-rank approximation to the error covariance matrix in VKF maps directly to the length of the memory of the VKF method of
past dominant singular vectors.
7.1. Kalman formulation of VKF

The VKF method uses the full non-linear prediction equation (1) to construct an a priori estimate from the previous state estimate:
x_a(t) = M_t(x_est(t - 1)).    (20)
At every t_w steps of the VKF method we perform a 4D-Var optimization with assimilation window width t_w. During the optimization process we update the approximation of the previous covariance S_a(t - t_w). These updates are based on the search directions of the 4D-Var optimization, see Kauranne. In practice the updates are carried out by using the L-BFGS formula in order to maintain an upper limit on the rank of the Hessian approximation used in the BFGS Quasi-Newton method. After a 4D-Var optimization we have an updated covariance S_a(t - t_w), which is then used to produce:

S_a(t) = S_a(t - t_w) + S_Et.    (21)
The presence of the prediction error term S_Et in the update formula means that model error is taken into account automatically in the VKF algorithm in the same manner as in EKF. S_Et is an estimate of the prediction error covariance and should reflect our confidence in the model we use. It can be a time, state and flow dependent matrix, but it does not have to reflect the dynamics of the atmosphere, nor the presence of observations, since these effects are taken into account by the S_a(t - t_w) term. A simple candidate for S_Et is the diagonal variance matrix of the model state variables. The matrix S_Et could be determined empirically by statistically comparing the error covariances of short window forecasts with respect to the corresponding analyses over time and a range of atmospheric conditions. As in the EKF method, the observation operator is linearized at the measurement time about the a priori estimate in order to obtain K_t, which is then used to calculate the gain matrix:
K_t = ∂K_t(x)/∂x |_{x = x_a(t)},                        (22)
G_t = S_a(t) K_t^T (K_t S_a(t) K_t^T + S_et)^{-1}.       (23)
The full nonlinear observation model is used for observation comparison
and this is used for calculation of the new state estimate:

x_est(t) = x_a(t) + G_t (y(t) - K_t(x_a(t))),    (24)
S_est(t) = S_a(t) - G_t K_t S_a(t).              (25)
Similarly to the EKF method, it may be necessary to carry out some iterations of the last four equations, see Sec. 5.

7.2. Cost function formulation of VKF during the local 4D-Var step
During the local 4D-Var optimization of the VKF method, the following cost function is minimized:

J(x_{t_0}) = Σ_{t=t_0}^{t_0+L} [ (x_t - x_t^a)^T (S_a(t_0))^{-1} (x_t - x_t^a)     (26)
           + α (x_t - x_t^a)^T (S_Et)^{-1} (x_t - x_t^a)                           (27)
           + τ_t (y_t - K_t(x_t))^T (S_et)^{-1} (y_t - K_t(x_t)) ],                (28)
where x_t = M_t(x_{t_0}) is determined by the nonlinear evolution operator M, K_t is the nonlinear observation operator and τ_t is an indicator function indicating the presence of observations y_t at time t. All dynamic quantities in the cost function above are time and flow dependent. The stochastic error processes are normally assumed to be Markovian, and their statistics therefore stationary. But the formulation does not necessitate this, if there were information available to justify making them time or flow dependent. The coefficient α is a Tikhonov-like regularization parameter in covariance space that makes VKF behave conservatively when updating S_a(t_0) in the direction of S_Et. Its effect on the formulation of VKF is analogous to the impact of the underrelaxation factor used in the Levenberg-Marquardt method for solving least squares problems. α is chosen to be inversely proportional to the ratio of analysis error variance to model error variance. Since the level of both errors is generally unknown, calibration runs are used to determine an appropriate value of α. At each iteration of the optimization method the search directions and the gradients of J are stored, and the approximation of S_a(t_0) is updated as the optimization method updates its own approximation of the Hessian ∇∇J. Dominant Hessian singular vectors are an optimal low-rank approximation to the error covariance matrix in the least squares sense by the properties of the singular value decomposition. The BFGS Quasi-Newton
method, on the other hand, shares the Krylov space optimality of the Conjugate Gradient method in a local linearization of a smooth nonlinear equation that renders the Hessian locally constant. As discussed in Kauranne [9], these properties are inherited by VKF through its use of L-BFGS in its 4D-Var predictor. VKF can therefore be characterized as the locally Krylov optimal least squares approximation to EKF in the Cartesian sum of state and prediction error covariance spaces.
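To show how quasi-Newton pairs yield the covariance information VKF exploits, the sketch below assembles a dense BFGS approximation of the inverse Hessian from the (step, gradient-difference) pairs stored during a minimization. In VKF this object plays the role of the low-rank covariance update; an operational code would keep only the limited-memory (L-BFGS) representation rather than a dense matrix, so this is an illustration of the principle only.

    import numpy as np

    def bfgs_inverse_hessian(pairs, n, h0_scale=1.0):
        # pairs holds (s_k = x_{k+1} - x_k, y_k = grad J_{k+1} - grad J_k) from the minimization.
        H = h0_scale * np.eye(n)
        for s, y in pairs:
            ys = y @ s
            if ys <= 1e-12:
                continue                              # skip pairs violating the curvature condition
            rho = 1.0 / ys
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)    # standard BFGS inverse-Hessian update
        return H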
8. Computational results

The results presented in this section were generated using an assimilation system for the following partial differential equation:

∂c(x)/∂t = -V ∂c(x)/∂x + E ∂²c(x)/∂x² + Reac(x),    (29)
where V is the advection velocity, E is the dispersion coefficient and Reac(x) is a reaction rate which is a spatially varying function. In our tests the reaction rate was set to a step function with zero global average. The equation has been dimensioned so that it roughly corresponds to the progress of zonal low wavenumber atmospheric waves at mid-latitudes. The observations are generated by a pattern resembling the narrow swath of a polar orbiting satellite that traverses each latitude 15 times a day. The reaction rate is set to correspond to a light-sensitive, and thereby diurnally varying, chemical reaction. Total concentration of the chemical c is conserved. The initial condition is a constant concentration that settles down to a reaction rate dependent V-shape after a while. Model error is incorporated by specifying an incorrect reaction rate in the model, when compared to the reaction rate that the observations were generated with. The equation (29) can be spatially discretized to arrive at a set of coupled ordinary differential equations:
dc_i/dt = -V (c_i - c_{i-1})/h + E (c_{i-1} - 2 c_i + c_{i+1})/h² + Reac(i),    (30)
where h is the distance between grid points i = 1, ..., ns. The number of grid points is controlled by ns and the domain is set to be cyclic, so that c_{-3} = c_{ns-2}, c_{-2} = c_{ns-1}, c_{-1} = c_{ns}, c_{ns+1} = c_1, c_{ns+2} = c_2, c_{ns+3} = c_3. The following list holds the parameter values which were used in the simulation of the system: the number of grid points ns = 64, wind velocity V = 50, dispersion coefficient E = 5e+7, h = 1/2.8e+7, observation noise S_et = 0.5 I and prediction error covariance S_Et = 0.5 I. The time integration
method used to solve the system is a low order predictor corrector scheme with an Euler predictor and a leapfrog type corrector. The reaction rate Reac(i) in equation (30) used for the data simulation was defined by the following formula:
Reac(i, t) =  0,       if i ∈ [0, ..., ns/4 - 1]
              3e-4,    if i ∈ [ns/4, ..., ns/2 - 1]
              -3e-4,   if i ∈ [ns/2, ..., 3ns/4]
              0,       if i ∈ [3ns/4 + 1, ..., ns]      (31)
During the assimilation process the reaction rate Reac(i) used in the evolution model was set to be:
Reac(i, t) =  0,       if i ∈ [0, ..., ns/4 - 1]
              2e-4,    if i ∈ [ns/4, ..., ns/2 - 1]
              -2e-4,   if i ∈ [ns/2, ..., 3ns/4]
              0,       if i ∈ [3ns/4 + 1, ..., ns]      (32)
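A compact Python sketch of the discretized model (30) with the reaction rates (31)-(32) is given below. The dispersion coefficient E, the grid spacing h and the time step are left as parameters here, and since only "an Euler predictor and a leapfrog-type corrector" is specified, the exact corrector shown is an assumption made for illustration.

    import numpy as np

    ns, V = 64, 50.0    # number of grid points and advection velocity, as listed above

    def reaction(amplitude):
        # Step-function reaction rate following the index ranges of (31)-(32):
        # amplitude = 3e-4 reproduces (31), amplitude = 2e-4 the model value of (32).
        r = np.zeros(ns)
        r[ns // 4 : ns // 2] = amplitude
        r[ns // 2 : 3 * ns // 4 + 1] = -amplitude
        return r

    def rhs(c, E, h, reac):
        # Right-hand side of the spatially discretized equation (30) on a cyclic domain.
        cm1, cp1 = np.roll(c, 1), np.roll(c, -1)    # c_{i-1}, c_{i+1} with cyclic wrap-around
        return -V * (c - cm1) / h + E * (cm1 - 2.0 * c + cp1) / h**2 + reac

    def step(c_prev, c_curr, dt, E, h, reac):
        # Euler predictor followed by a leapfrog-type corrector (assumed form).
        c_pred = c_curr + dt * rhs(c_curr, E, h, reac)
        return c_prev + 2.0 * dt * rhs(0.5 * (c_curr + c_pred), E, h, reac)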
The difference between equations (31) and (32) causes the evolution model to be non-perfect. Observation sets of true states were made 15 times during 24 hours. Each observation set contains 10 observed points with systematically varying locations, so that the observation set is moving eastwards and observations are made from the same location again after 24 hours, since the domain is cyclic. Gaussian noise at the ten per cent level was added to the observations. The system was assimilated with 4D-Var, EKF and VKF. Figure 1 illustrates the approximated covariance matrix S_a(t) of the VKF method and the corresponding covariance matrix S_a(t) of the EKF method after a burn-in period. Since the observations do not cover all spatial grid points at one time, the covariance of the a priori estimate changes rapidly in time as the observation set moves eastward. The shape of the approximated covariance S_a(t) of the VKF method is able to follow the shape of the covariance S_a(t) of the EKF method, even if it has a noisy overall appearance. Similarly to the EKF method, the VKF method also gives the evolved error covariance matrix S_est of the estimate. Figure 2 compares these matrices of both methods at one time step. Again the shape and the magnitude of these matrices are similar, which shows that the VKF method is capable of following the evolution of the error covariance matrix, in spite of being noisy. Figure 3 compares assimilation results with the different methods. Since the evolution model is incorrect, the estimate of 4D-Var cannot accurately follow the true state. In practice the estimates of the VKF and EKF methods
ECMWF operational implementation of four dimensional variational assimilation. Part I: Experimental results with simplified physics. Q. J. R. Meteorol. Soc. 126, 1143-1170 (2000).
13. C.D. Rodgers: Inverse Methods for Atmospheric Sounding: Theory and Practice, World Scientific, London, (2000).
INTEL® ARCHITECTURE BASED HIGH-PERFORMANCE COMPUTING TECHNOLOGIES

HERBERT CORNELIUS
Advanced Computing Center EMEA, Intel GmbH, Dornacher Str. 1, Feldkirchen, 85622, Germany

An increasing number of systems and solutions in the High-Performance Computing (HPC) market segment are moving from traditional proprietary (vector) supercomputing and RISC/UNIX machines towards utilizing industry standard building block technologies to leverage volume economics, such as the Intel® Architecture. This trend continues to enjoy increased performance at decreasing costs with an ongoing injection of new hardware and software technologies. At the same time the application areas of HPC are widening, spanning from science, research and academia over engineering to commercial business computing deployments. Hence, the customer focus is evolving from a purely technological point of view to a more solution oriented approach to increase business value and productivity. This move takes place in both the scientific and industry market segments, where HPC is being used as a means to achieve and implement better solutions. Intel® Architecture based HPC systems and solutions are playing a vital role in accelerating and improving science and research capabilities and capacity, business benefits and competitive advantage.
1. Introduction
The biggest, toughest computing challenges in the world are tackled - and very often solved - through high performance computing (HPC), in former times also called Supercomputing. Such diverse and life-essential research areas as meteorological modeling, automotive crash test simulations, human genome mapping, and nuclear blast modeling are all part of HPC. Solutions built on open standards-based Intel® platforms provide supercomputing capabilities and capacities at significant cost savings for cutting-edge scientific research, industry, and enterprise HPC applications. Recent industry standard technical computing benchmarks show Intel® Itanium® 2-based platforms are more cost-effective than RISC-based systems. Intel-based building blocks - architectures, processors, platforms, interconnects, software, and services - form comprehensive HPC solutions that optimize performance, network management, security, and reliability. And as the industry's leading supplier of technology building blocks for HPC solutions, Intel
is at the center of a worldwide community of equipment manufacturers, software developers, system integrators, and service providers that build best-in-class solutions on open standards-based architecture. Using networked, commercial-off-the-shelf technologies, Intel Architecture (IA) based HPC clusters deliver superior performance and business value for technical computing. Intel products extend the notion of these open standards-based cluster platforms to grids, enabling coordination among any connected computing devices - potentially anywhere in the world. Intel offers a full suite of building blocks for these solutions, including a complete range of platform options for high-performance clusters and grids, with 64-bit and 32-bit architectures, dual- and multi-processor node configurations, and floating-point execution capabilities for the most demanding applications. Intel architecture was at the foundation of the earliest high-performance computing clusters and now, more than a decade later, Intel is the industry's leading supplier of standards-based building blocks for HPC solutions. Intel products extend the notion of these open standards-based cluster platforms to grids, enabling computing devices of all kinds to talk to each other, creating a virtual worldwide network of grid computing power.

1.1. Productivity and Price/Performance
To illustrate the changes in HPC/Supercomputing, we just need to look at the TOP500 list. Started in 1993, this list is published twice a year and summarizes the 500 most powerful computer systems in the world as measured with the Linpack benchmark. One of the most visible trends from this list is the increasing use of industry standard building blocks such as Intel Architecture processors and platforms to gain more performance at decreasing costs. Over the last 4 years the number of IA based HPC systems among the TOP500 listed machines grew from 6 (1.2%) to 320 (64%), with the Intel® Xeon™ processor as the most used processor architecture and the Intel® Itanium® Architecture being the fastest growing processor architecture in the TOP500 list. Figure 1 shows the growth of IA in the TOP500 list over time. Beside the performance aspect, even more important is the development of the costs of such powerful HPC systems. In [1] it is noted that the Earth Simulator with 35.86 TFLOPS performance had estimated costs of $350-500M, while the current BlueGene/L machine with 70.72 TFLOPS costs $100M and the NASA Columbia system with 51.87 TFLOPS only $50M. If we look at the price/performance ratios (lower is better) of those three systems, we get 0.96 for the Intel Itanium based Columbia system, 1.41 for the IBM BlueGene/L machine and 9.76 (at best) for the Earth Simulator.
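These ratios are simply the quoted system cost divided by sustained Linpack performance, in $M per TFLOPS; the short snippet below reproduces the numbers from the figures given above (using the best-case $350M estimate for the Earth Simulator).

    # Price/performance in $M per sustained TFLOPS (lower is better).
    systems = {
        "Earth Simulator": (350.0, 35.86),
        "IBM BlueGene/L": (100.0, 70.72),
        "NASA Columbia": (50.0, 51.87),
    }
    for name, (cost_musd, tflops) in systems.items():
        print(f"{name}: {cost_musd / tflops:.2f} $M/TFLOPS")
    # Earth Simulator: 9.76, BlueGene/L: 1.41, Columbia: 0.96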
1.3. HPC Solutions Building Blocks

Based on its manufacturing and production capabilities, Intel is continuously pushing the process envelope with new technologies to drive performance and volume economics. Today 90nm process technology is in full production with 300mm wafers, and the next generation 65nm process technology has already passed the prototype phase and is being integrated into production planned for 2005. On the transistor design side itself, Intel has developed the so-called "Terahertz" and "Tri-Gate" transistors for future generations of processors with higher performance, denser packaging and lower power consumption. One of the key technology areas for the future is also new materials; in this area Intel has developed a new "high-k gate dielectric" material to allow much smaller processes with much lower power dissipation. For further reading, more details can be found in the references. On the processor side both architecture lines, the Intel® Itanium® processor and the Intel® Xeon™ processor, are being enhanced and extended, and both product lines will quickly move to dual-core and eventually to multi-core implementations. Hence, this will continue to drive performance up and costs down using industry standard building blocks. In addition, new technologies and functionalities for security, manageability and virtualization will be introduced to enable more capabilities, more performance and more flexibility for end-users beyond processor speed. As an example, the next generation Itanium processor (codename Montecito) planned for 2005 will utilize 1.7 billion transistors in a single chip (at 90nm process) and feature, besides higher frequencies, larger caches and lower power consumption, also dual-core and multi-threading capabilities. Using IA processors, a wide range of HPC systems from many vendors are available, each having its own value-add for HPC solutions based on Scale-Up and Scale-Out system architectures. In the area of interconnects, I/O and networking, several industry standards have evolved and new technologies are being introduced. One of the key enabling technologies is the new PCI Express industry standard as local I/O attach-point. InfiniBand as a new industry standard system interconnect technology nicely complements PCI Express with low latency, high bandwidth and additional advanced capabilities such as scalability and quality-of-service. Figure 3 illustrates the use of different I/O technologies for different needs and infrastructures.
1.5. Advanced Computing Center
Increasing workloads are creating an insatiable demand for computing power across academia, government and industry. While proprietary supercomputers can solve complex logistical problems, they are often custom-built systems that are not cost-effective to replicate for smaller enterprises. For capabilities such as data and compute-intensive technical and commercial applications over massive data sets to be made widely available, innovative hardware and software architectures are needed to meet the ever-increasing computing demands of the high-end technical computing market. The Intel® Advanced Computing Center (Intel® ACC) addresses this requirement by working to bring high-performance computing (HPC) capabilities to enterprise systems. It is focused on accelerating innovation in mainstream, or volume, computer technologies by working with leading commercial and academic partners to develop technologies once limited to proprietary high-performance computing systems, bringing High-Performance Computing to the mainstream. To accelerate the intersection between HPC and mainstream enterprise computing technologies, Intel ACC is working to drive technologies once limited to proprietary HPC systems into standard building-block hardware and software. By accelerating technology innovation, Intel ACC is promoting the use of advanced computing in the enterprise to achieve a qualitative change in the way engineers work. Ultimately, these advances will provide enterprises with supercomputing power that has the ease and flexibility of desktop computing. More information on Intel ACC can be found at the URL listed in the references.
1.6. Summary
The economics of High-Performance Computing have changed. HPC solutions must track Moore's law to be viable and leverage industry standard building blocks for volume economics. Intel is playing a key role in accelerating HPC solutions for science, industry and business with open commercial off-the-shelf technology leadership and by working with the industry and end-users. This will lead the way to PFLOPS computing solutions within this decade, to be ready for the "Era of Tera" of computing.
References
1. IEEE Spectrum, February 2005, MT, 11-12 (2005)
2. David J. Kuck, Productivity in High Performance Computing. The International Journal of High Performance Computing Applications, Vol. 18, No. 4, Winter 2004, 489-504 (2004)

URLs:
1. www.top500.org
2. www.infinibandta.org
3. www.intel.com/research/silicon
4. www.intel.com/technology
5. www.pcisig.com
6. www.intel.com/software/products
7. www.intel.com/technology/systems/acc
8. www.intel.com/technology/computing/archinnov/teraera
DISTRIBUTED DATA MANAGEMENT AT DKRZ
WOLFGANG SELL
Deutsches Klimarechenzentrum GmbH (DKRZ), Bundesstraße 55, D-20146 Hamburg, Germany
Earth System Modeling is both compute and data intensive. HPC Systems for Earth System Modeling consist of Compute- and Data-Servers as well as additional servers for visualization and other purposes, which are linked together to provide an efficient, easy to use workflow environment. Huge amounts of data are exchanged between the Data-Server and its clients. Coupling the different servers of an HPC System via Shared File Systems is an effective solution for the data transfer problem, which has been successfully implemented at DKRZ. All Data and Scalar Compute Services run on IA64, Linux-based computer systems which use the Shared File System for data exchange. Lessons learnt from this implementation will be reported.
1. Introduction
Deutsches Klimarechenzentrum GmbH (DKRZ), or translated into English the German Climate Computing Center, is a national facility to provide computing, data handling and visualization resources on the highest affordable level to support state of the art earth system modeling activities of the German research community. DKRZ is organized as a private, non-profit limited company whose shareholders are the Max-Planck-Society, the University of Hamburg and two National Labs strongly engaged in Climate Research. The investments for the big computer systems are funded by the German Federal Ministry for Education and Research (Bundesministerium für Bildung und Wissenschaft, BMBW); the operational costs are covered by the shareholders. 50 % of DKRZ's resources are used by researchers of the shareholder institutions, the remaining 50 % are open to any German Earth System research group. Grants are assigned by a Scientific Steering Committee (SSC) in periodic intervals. Usage quotas depend on the scientific merit of the project as determined by the SSC. DKRZ started its operation on January 1st, 1988, originating from a computer center that had provided HPC services since 1985 for research groups in Northern Germany. Based on 15 years of experience and pitfalls, DKRZ decided about 2000 to change its emphasis in HPC system architecture from
compute-centric to data-centric for the next procurement cycle. During the period from 1995 to 1998 it became evident that a high compute power level is of little value to customers unless it is balanced by appropriate data handling resources.
2. A HPC System Architecture for Earth System Modeling

2.1. Rationale for a data-centric approach

HPC Systems used in Earth System Modeling consist mainly of Compute-, Data- and Visualization-Servers as well as of related network components. The connecting element between the different servers is normally a LAN, through which data is exchanged between the different servers and also with the rest of the world under control of the TCP/IP protocol suite. The effective data transfer rate through LANs increased continuously over time, similar to the compute power of Compute-Servers or the storage capacity of Data-Servers. About 1980 the 10 Mbps Ethernet was used as the fast LAN in computer centers, which was improved via 100 Mbps up to the 1 Gbps Ethernet around 2000, and now even 10 Gbps is feasible as a standard component of LANs. While the transfer rate of a single Ethernet connection increased by a factor of a hundred over 20 years, the compute power of the individual servers to be connected increased by a factor of 10,000 over the same period. During the decades between 1980 and 2000, besides Ethernet, other network technologies like FDDI, ATM or HiPPI were used in LANs for the coupling of servers. However, none of these technologies could cope with the continuously improved Ethernet in the LAN realm. Even recently developed technologies like InfiniBand are not likely to become a threat to 10 Gbps Ethernet. LANs are mainly used for data exchange between server groups of a whole computer center or between the nodes of a server complex. In the high end realm of HPC Systems the coupling of the nodes is accomplished by proprietary networks, since the technologies developed for LANs cannot efficiently satisfy the very high demands for the coupling of HPC nodes. The needed effective transfer rates between the nodes are too high to be satisfied by LAN connections and the associated protocols. In general the largest amounts of data in Earth System Modeling are exchanged between the nodes of the Compute-Server. The next highest level is used for the data exchange between Compute- and Data-Server. It is obvious that LAN technology is insufficient for the coupling of nodes of a HPC Compute-Server; however, LANs are well suited for the coupling of different servers and
The Data-Server complex consists of GFS-, HSM- and DBMS-Servers, as well as a Backup-Server and an ftp-Server. All these services, except Backup and ftp, run on TX7 nodes. All these servers run under the Linux operating system. Backup- and ftp-Service are provided on a SMP, code-named AzusA by NEC, which has an Itanium 1 processor and is the predecessor system of the TX7. Most of the TX7 systems are 16-way SMPs, some are partitioned into 2 partitions with 8 CPUs each. Further deviations from the standard configuration as used in the HLRE are described in the relevant section of this paper. As pointed out above, TX7 computer systems are also used for data pre- and post-processing of the simulation runs, in general a more scalar oriented type of computing which can be done more cost effectively on scalar SMPs than on the SX6 PVPs. These TX7 computers are also hooked to the GFS managed disk pool. Additionally GFS client software is available for other computer systems like a SGI Altix graphics system or Sun Microsystems E-Series SMPs, which can be hooked by means of this software to the GFS controlled disk pool. One TX7 node with 12 CPUs and 24 GB main memory is used exclusively for test purposes. The test system, in fig. 7 named DS Test, is split up into three partitions such that failures that occur in one partition do not affect tests that run in another partition. It is very important to have this test facility available in such a complex environment as the HLRE to check out new software versions or features before introducing them into production. The box names UCFM, UDSN etc. in fig. 7 are only indicative; the system is used for all different kinds of check-outs that can be run on a TX7. In this test environment it was also proven in co-operation with the NEC Software Division that a SGI Altix system could run a GFS client successfully and efficiently. The data transfer rate that was achieved with a single FC connection was the same as that seen on the TX7 boxes. The main part of DKRZ's data inventory is kept on tapes, which are stored in 4 STK Powderhorn tape silos. Each silo provides about 5000 to 6000 slots for tape cartridges, each of which can store about 200 GB of uncompressed data, resulting in a total archive capacity of over 4 PB. Two more silos will be added by end of 2004. Together with a switch to new 9940C cartridge technology in the 2005/2006 timeframe this will enhance the archive capacity to more than 16 PB, which will be sufficient until the next procurement cycle for the HLRE2, probably to be installed in the year 2007. Data is cached for processing from the tape archive to the GFS disk pool with a total capacity of about 63 TB. Data is also temporarily stored and processed on node local disk space at the CS- and DS-Servers. The total capacity of this local disk space is over 50 TB, most of which is attached to the DS-
Servers. As is pointed out later, it is highly desirable that the amount of local disk space is significantly reduced in the HLRE2 HPC System. Disks and tapes are connected to the computers by a 1 Gbps FC-fabric. The computers themselves are interconnected via Gigabit-Ethernet. The numbers of connections between the different components/complexes are shown in figure 7.

5. Implementation of Data-Services at DKRZ

The Global File System GFS from NEC, the HSM DiskXtender from EMC/Legato, previously known under the name of Unitree, and the DBMS Oracle are the central components for the management of the huge amounts of data at DKRZ. The data inventory kept at DKRZ originates mostly from simulations that were run on the SX6-PVP-cluster at DKRZ. Data import from external groups or from other institutions as well as data generated on the Scalar Compute-Servers and by visualization services make up only a small amount of the total capacity.

5.1. The Global File System GFS
The purpose of GFS is to keep data that has to be processed on different computer platforms in a central place, in a disk pool where the data can be accessed at the record level. In this way we avoid keeping unnecessary copies on multiple platforms as local files, since access to GFS resident data is nearly as performant as access to local files as long as the blocksize exceeds 64 kB [4]. Additionally the central storage makes copy processes obsolete and thus keeps the LAN free for tasks which cannot be handled by a shared file system. Another important aspect of a common file system that is shared between CS and DS is that the server function of the GFS can be transferred to the DS. The CS needs to take care of the client function only. For a CS that is made up of a clustered PVP-system, most of the data has, for ease of use and system administration, to be kept in file systems that can be accessed from all CS nodes. The CS then has to provide both server and client functionality unless a shared file system exists between the different computer platforms. It takes a lot of resources of the CS to provide both GFS server and client functionality. If a shared file system for both CS and DS exists, the costly server function can be transferred to the IA64/Linux SMP and can be managed there in a much more cost efficient manner. The total GFS resident data inventory is distributed over several or many file systems, which are served by different GFS-Server platforms. By distributing the total amount of GFS-resident data on various file systems a de facto unlimited scalability with respect to the GFS-service can be achieved. As soon as a GFS-
Server is overloaded, one either moves part of the file systems that are serviced by the overloaded system onto a newly provided GFS-Server, which can be done nearly seamlessly for the users, or, in case the system serves only one file system anymore and still is overloaded, the file system has to be split up into several new ones and these new file systems are then distributed across enough newly provided GFS-Server platforms. In the second case the actions taken are unfortunately not transparent for the users, and if the related GFS file systems are subject to data migration a lot of administrative work has to be done in the respective HSM. While scalability with respect to GFS-server functionality can be achieved technically without problems, this is not the case for the GFS-client functionality provided on systems that run under Linux 2.4. For such systems a limitation of 128 LUNs applies which can be addressed in an attached disk pool. This limitation is a severe restriction for the day-to-day operations in a computer center. For example, with 140 GB disk drive technology there is no way to provide a disk pool of 100 or more TB of capacity in GFS client mode to a single system. This limitation will be practically removed under the Linux 2.6 kernel, however. For GFS file systems which are not coupled by means of a migration component to a HSM, NEC offers a fail-over functionality for GFS. In case the active GFS-Server fails, the service will be taken over seamlessly by a server on a different platform. Because of the high visibility that is related to undisturbed access to large shared file systems in computer center operations, this is an important functional enhancement. If one node, or in larger systems even several nodes, of a Compute-Server go down, this does not have as much impact on the usability of the total system as the loss of access to one or more shared file systems. In case of the loss of nodes only a few users are affected with their interrupted jobs, and these jobs can be restarted soon on other available nodes. However, if access to parts of a GFS fails then many users and all related client systems are affected. Data will become available only after the failure is fixed and the related servers resume the service properly, unless a backup copy of the affected files is readily available and can be reloaded into a properly functioning accessible file system.

5.2. The HSM DiskXtender/Unitree

During execution, Earth System Modeling application programs write or read their data to/from disk resident files. Access to data that is stored on other storage media like e.g. magnetic tapes incurs much latency and too low transfer rates. As the total data inventory is much too large to be stored on disk, most of
the data is exported by means of a Hierarchical Storage Manager (HSM) into a tape archive, and data will be retrieved automatically to disk whenever it is needed. After many years of discussion among experts from large computer centers the architecture of such a HSM was described under the title "Mass Storage System Reference Model Version 4" in a report that was issued in May 1990 by the IEEE Technical Committee on Mass Storage Systems and Technology [5]. Later this approach was developed further and materialized as IEEE standard 1244.1-2000 about the architecture of Media Management Systems [6]. The most important aspects of the architecture as described in the 1990 report are the separation of control and data path and the clear structuring of the modules into distinct specific subtasks. By this, vendors were encouraged to implement HSM components and systems which are mutually compatible and can be combined into integrated storage systems. Due to its clear structure the 1990 approach further laid the basis to build scalable HSMs. The basic modules of a HSM and their interaction as described in the Mass Storage System Reference Model Version 4 are shown in figure 8.
Figure 8. IEEE Mass Storage System Reference Model Version 4
HSM systems from different vendors have been available for a long time. In the mid-1980s CRAY Research Inc. offered the Data Migration Facility under UNICOS, already at that time a versatile migration tool. Because of the higher flexibility from an operation point of view, DKRZ decided in the beginning of the 1990s in favor of Unitree, an HSM that was originally developed under the auspices of Dick Watson at LLNL and then handed over to DISCOS for commercial distribution. Over time Unitree had many owners until it came to Unitree Software Inc., where it was developed into a highly scalable HSM under the leadership of Mark Saake. It received many functional enhancements which are most important for the day-to-day operations in big data centers. This significant development was the basis for DKRZ's decision to take Unitree, resp. DiskXtender as it is now called by its latest owner Legato/EMC, as HSM in the HLRE procurement. The central modules of Unitree that are used at DKRZ in the HLRE are
- Unitree Central File Manager UCFM, the master module that controls the whole HSM and coordinates the work with the other independent modules
- Unitree Distributed Storage Node UDSN, a distributed networked mover component which contributes to the high scalability of Unitree
- Unitree Virtual Disk Manager UVDM, a distributed networked migration component for external file systems
- Unitree Database NearLine UDNL, a component needed to couple an HSM to a DBMS in order to migrate and stage back DBMS data files
The mapping between Unitree and DiskXtender product names is given in [7]. An extensive documentation can be found in [8]. The UVDM monitors file systems from the GFS pool for the purpose of migration or staging-back and initiates these activities. In order to avoid overloading UVDM, this component can be distributed across several platforms. If needed, data is moved by request of UVDM out of the GFS disk space into a disk cache that is owned by Unitree, and from this Unitree disk cache the data is further migrated by a Unitree-internal migration component onto tapes. The dedicated Unitree disk cache is a relict from times when Unitree was built to run on a variety of operating systems; for reasons of portability of the Unitree software the HSM had its proprietary file system in order to avoid modification of the underlying operating system. Since source code of the various operating systems is in general not available for commercial adaptation to third party software needs, the variant to use a proprietary file system was a smart way out. This argument no longer holds for Linux, and the dedicated disk cache is a completely obsolete relict that occupies significant disk space for
copies that are not needed. In the case of the HLRE, 17 TB of Unitree internal disk cache are occupied for a GFS disk space of 70 TB. Today there are other HSM systems with nearly the same functionality as Unitree that do not need their private dedicated internal disk cache. In cooperation with an HSM that runs a migration component for GFS, data access into the GFS pool works, from a user's point of view, highly performantly at the record level by means of the standard Unix system calls. Because of the integrated migration component, data residency is functionally transparent for the end-user; only the response time to the data depends on the data residency. Access to data that has to be staged back from magnetic tape obviously takes much longer than access to data that still resides in the GFS disk pool. The GFS disk pool of the HLRE is divided into two main groups: file systems that are subject to migration and shared file systems that are not migrated. Both of these groups are further subdivided with respect to the retention period of the files. The characteristics of all file systems which can be accessed from the SX6-Compute-Server by means of Unix system calls are depicted in figure 9. Since the second quarter of 2004, access to all file systems which are subject to migration was completely liberated, with the consequence that both the number of files archived as well as the total amount of data in the archives increased drastically. In order to stop the unrestrained growth, the free access to create archived data will be restricted step by step until the growth rate has decreased to a level that is in balance with the budget for tape media. A stepwise approach is taken in order to keep usability of the HLRE as user friendly as possible. As an initial step, for each user the number of additional files which can be archived after October 1st, 2004, is subject to a quota.
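For readers unfamiliar with how such migration components operate, the following Python sketch shows a generic watermark-driven migration pass of the kind a component like UVDM performs on a monitored file system. It is an illustration only and not the Unitree/DiskXtender interface; archive_copy and purge are hypothetical callbacks standing in for the site-specific tape-archive and block-release operations.

    import os
    import shutil

    def run_migration(fs_root, high_water, low_water, archive_copy, purge):
        # Trigger migration only once the used fraction of the pool exceeds the high-water mark.
        usage = shutil.disk_usage(fs_root)
        if usage.used / usage.total < high_water:
            return
        # Migrate the least recently accessed files first.
        candidates = sorted(
            (os.path.join(d, f) for d, _, files in os.walk(fs_root) for f in files),
            key=os.path.getatime)
        for path in candidates:
            archive_copy(path)      # write a copy to the tape archive
            purge(path)             # release the disk blocks, keeping only a metadata stub
            usage = shutil.disk_usage(fs_root)
            if usage.used / usage.total <= low_water:
                break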
also be migrated to tape. A specially designed cooperation between HSM and DBMS takes care that the end-user functionally experiences transparent data access irrespective of the actual storage location of the data. For Earth System data that is not kept in database tables, the catalogue has a pointer to the corresponding file. In this way data can be accessed via this pointer link without explicit specification of the proper file name or residency. As early as 1994 DKRZ started a project funded by the German Federal Ministry for Research and Technology to study the feasibility of semantic data management. The CERA data model [9] was developed to meet the needs of Earth System data management and was implemented on the Oracle database system. The decision for Oracle as DBMS was taken in the very beginning and conveyed into the HLRE procurement, since Oracle is believed to guarantee in the long term that the product is well maintained, continuously improved and kept at the technological front without losing operational stability, and provided with an extensive set of tools that eases operations for semantic data handling. Among other topics it was technically important for DKRZ that the Oracle DBMS provided a partitioning option, the possibility to set database files in read-only mode, to switch on a no-logging option and to process binary data directly as well as BLOBs. More properties of the Oracle DBMS are described in [11]. An important component for the operational framework of semantic data handling is a software component that allows automatic entry of the large amount of primary data created in simulation runs into the database system, in accordance with guidelines specified by the creator of the primary data. Because of the amount of data generated on the HLRE, a manual entry process as used in early stages was no longer practical. Since June 2004 this coupling and entry software is fully operational and the database archive since then grows according to the plans set out in the HLRE procurement: about one third of the bulk data is stored in the climate database as semantically accessible time series. The amount of data that enters the database can be seen clearly in figure 10. The ease of use in bringing data into the climate database as well as retrieving information is reflected in the intense use of the climate database both by external and local users. More information on access to and use of the climate database can be found in [10]. The DBMS was the final standard data application which was ported from a Sun Microsystems platform to a TX7 platform during the stepped upgrade of the HLRE. DKRZ could choose between a powerful monolithic SMP with up to 32 CPUs or a "cluster" of TX7 partitions with 4 CPUs each. DKRZ opted for the initially more difficult way to go with the partitions because in the long range it will be the more stable system (in case of a failure only DBMS operations in
System research together with a shared file system which can be accessed by all attached component systems form an excellent platform for a scalable total HPC system to run elaborate compute- and data-intensive applications of Earth System Modeling. The overall size of the HPC system as well as of its subsystems can be easily adapted to changing application profiles. As important as the total system architecture for efficient usability is the scalability of the different main applications on the respective servers. The shared file system then ensures that data exchange between the various server complexes will not become a bottleneck and that the storage and network resources are used efficiently, as multiple copies of files become obsolete and unnecessary data transfers are avoided. Operational experience with the HLRE makes it likely that the successor generation of the HLRE can be implemented easily in the same system architecture. However, the data archiving policy has to be changed drastically compared to what was accepted until now in HLRE operations; otherwise tape media budgets will become exorbitantly high. For the successor system HLRE2 much more attention will have to be paid to redundancy and resilience of the total system to provide an acceptable level of usability. In particular, shared file system environments have to work with nearly no interruption. Fail-over software will be needed for GFS-type environments that are coupled to data migration facilities and an HSM. Semantic data processing will increase simply because of the sheer amount of data that will be kept in the archives.
Disclaimer
Throughout this paper names like Legato, Oracle or Unitree are used freely without marking them explicitly as trademarks or registered trademarks. These and other such names may be the property of their respective owners. Their appearance in this paper does not imply that these names can be freely used in other places.
Acknowledgments
A feasibility study for semantic data handling was sponsored by the German Federal Ministry for Research and Technology in the early 1990s. The investment for the HLRE was funded by the German Federal Ministry for Education and Research by a contract covering the period 2001 to 2005. In the planning of the HLRE architecture persons from many groups were engaged. The most prominent staff groups were from DKRZ, NEC Japan and
Unitree Inc. Very valuable contributions were made by Hartmut Fichtel, Mark Saake and Tadashi Watanabe during intensive discussions. Tadashi Watanabe took responsibility to provide a GFS version that allowed coupling of a SUPER-UX system and a Linux system, which are both built by NEC. DKRZ is also grateful to Al Kellie from NCAR and to Walter Zwieflhofer from ECMWF, who encouraged the implementation of the presented system architecture during the HLRE review process. Without the support of these two gentlemen the project would not have started. Last but not least DKRZ has to thank DKRZ and NEC staff for their effort during the installation period of the HLRE to make the project successful. Again Hartmut Fichtel and Yuich Kojima need special mentioning because of their strong engagement and leadership in the HLRE implementation during the years 2002 to 2004. Most of the figures were used previously in presentations and were prepared originally by Hartmut Fichtel.
References
1. K. Kitagawa et al., NEC Res. & Develop., Vol. 44, No. 1, 2003, pp. 2-7.
2. T. Senta et al., NEC Res. & Develop., Vol. 44, No. 1, 2003, pp. 8-12.
3. K. Miyoshi et al., NEC Res. & Develop., Vol. 44, No. 1, 2003, pp. 53-59.
4. A. Ohtani et al., NEC Res. & Develop., Vol. 44, No. 1, 2003, pp. 85-90.
5. S. Coleman and S. Miller, editors, "Mass Storage System Reference Model, Version 4", Report of the IEEE Technical Committee on Mass Storage Systems and Technology, May 1990, URL: www.ssswg.org/public_documents/MSSRM/ref.model.ver4.pdf
6. 1244.1-2000 IEEE Standard for Media Management Systems (MMS) Architecture.
7. URL for Unitree-DiskXtender Product Name Changes: http://www.legato.com/support/names.cfm
8. URL for Documentation of DiskXtender for Unix/Linux: http://web1.legato.com/cgi-bin/catalog?s=Releases&lev1=6-0#6-0
9. M. Lautenschlager, F. Toussaint, H. Themann and M. Reinke, 1998: The CERA-2 Data Model, Technical Report No. 15, DKRZ, Hamburg.
10. The CERA database at DKRZ provided by Model & Data, URL: www.mad.zmaw.de
11. DBMS Oracle Database 10g, URL: www.oracle.com/technology/products/database/oracle10g/index.html
SUPERCOMPUTING UPGRADE AT THE AUSTRALIAN BUREAU OF METEOROLOGY
I. BERMOUS, M. NAUGHTON, W. BOURKE
Bureau of Meteorology Research Centre, GPO Box 1289K, Melbourne, VIC 3001, Australia
The Australian Bureau of Meteorology has recently made a major upgrade to its supercomputer and its Central Computer Facilities (CCF), located in the Bureau's new headquarters in Melbourne. This paper describes the overall CCF with particular emphasis on the new NEC SX-6 supercomputing facilities. The porting experiences in migrating from the previously installed SX-5 facilities to the multi-node SX-6 environment supported by NEC IA-64 TX7 file servers and the associated global file system are described. The system usage and performance of several major Bureau operational applications on the SX-6 are presented along with a discussion of planned upgrades of these applications.
1 Introduction
We would like to start with a brief account of the recent history of high performance computing at the Bureau. As a result of collaboration between the Australian Bureau of Meteorology (BoM) and the Commonwealth Scientific and Industrial Research Organisation (CSIRO), the High Performance Computing and Communication Centre (HPCCC) was established in 1997 as a joint HPC facility to support both organizations. As detailed in Table 1, NEC SX supercomputers have been utilised, with an NEC SX-4/16 in 1997, an SX-5 in 2000 and now more recently an SX-6 facility.
Table 1. Recent BoM/CSIRO HPCCC systems history
  NEC SX-4/32      1998 - 2000    BoM/CSIRO
  NEC SX-5/32M2    from 2000      BoM/CSIRO
In the latest contract the two organizations utilise separate partitions. The Computing and Communication Facility (CCF) located on the second floor of the previous Bureau headquarters had no space or structural capacity to support installation of new equipment. This problem was resolved when the BoM moved to a new building at 700 Collins Street in August 2004, and the CCF is now located in the new BoM headquarters, which incorporates an extensive purpose-built computer hall. The first 18 SX-6 nodes were installed in December 2003 in the new building and became available for use in March 2004.
2 Phases of Current Contract with NEC
The current 4 year contract with NEC includes 2 phases, with the second phase covering most of the period. The system configurations for each stage are described in Table 2 and Table 3.
Table 2: SX-6 multi-node PVP
                   Apr 2004         2005 - Apr 2008
  Nodes            18 (13 BoM)      28 (23 BoM)
  CPUs             144              224
  Peak             1,152 Gflops     1,792 Gflops
Table 3: TX7 IA64 GFS server
                   Apr 2004         2005 - Apr 2008
  Memory           24 GB            32 GB
  Disk capacity    14 TB            22 TB
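For reference, the peak figures in Table 2 are consistent with the nominal 8 Gflops per SX-6 CPU (the multiplication is ours, not taken from the original table):

   144 CPUs x 8 Gflops = 1,152 Gflops        224 CPUs x 8 Gflops = 1,792 Gflops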
The overall increase in the Bureau's capacity with the new SX-6 system in comparison to the previous single node SX-5 facility is:
- 11.5 times increase in peak CPU capacity
- 13 times increase in memory
- 14 times increase in disk capacity
- performance 5%-30% faster per processor for our major applications.
Some performance degradation for multithreaded runs or multiple single CPU jobs within a node has been found, as discussed in Section 7.1.
3 Operational Suite of Numerical Weather Analysis and Prediction Systems
The Bureau of Meteorology operational suite of numerical weather analysis and prediction systems for global and regional scales includes:
Short Range Forecasting (runs twice a day, 29 levels for all models)
- LAPS 0.375° Australian Region data assimilation and 3 day prediction system
- TLAPS 0.375° Tropical Australasian Region 3 day prediction system
- LAPS 0.125° Australian Region 2 day prediction system
- TCLAPS 0.15° 3 day tropical cyclone prediction system
- MESOLAPS 0.05° mesoscale 2 day prediction system for the Victoria/Tasmania, Sydney and South East Queensland domains
Medium Range Forecasting (runs twice a day)
- GASP (T239L29, 75 km) 2 x 6-hour assimilation analysis and 10 day forecast
These results were obtained by NEC with highly scalable MPI implementations provided by the BoM. Our HPC system configuration includes: NEC SX-6 nodes connected via the IXS high speed internode crossbar switches, two NEC TX7 IA-64 systems running the GFS global file system, a research and development HP-UX 9000 platform, a SamFS data server on Sun Solaris with 8 CPUs, and dual MARS/TSM meteorological data servers on IBM p690 with 8 CPUs.
5 Global File System
The Global File System (GFS) is a very important component of our system; it performs direct data transfer between a GFS client and a disk array device via fibre channel. GFS is a network type file system that speeds up I/O operations. The GFS server and client are on the TX7 IA-64 Linux front end server, and a GFS client is also on each SX-6 node. NFS is used for I/O requests up to 64 KB regardless of file size, and GFS is used for I/O requests over 64 KB regardless of file size. Our plan for the future is to use GFS with a client on HP; NEC support for GFS clients for IBM and Sun has also been included in our contract. We would like to emphasise the critical role of the TX7s in our facility. In our system with 28 SX-6 nodes the consequences of failure of one or two of the compute nodes for a period of several hours are not so severe. However, failure of the TX7s or GFS affects the whole facility and so must be kept to an absolute minimum. This is designed for by means of high base reliability of the TX7 machines and duplication/fail-over functionality.
5.1 Pros and Cons of GFS
The main advantages of GFS are:
- visibility of the same file system from all SX-6 nodes and the TX7
- high performance data transfer which is near local disc speed for large files
- fail-over capability between two TX7 servers
- it is based on SGI's Open Source XFS for Linux.
Drawbacks with the current version of GFS are:
- small block I/O uses slow NFS
- there is no prioritisation of GFS I/O
- there is no caching of GFS file systems compared with SX-6 file systems
- GFS I/O bottleneck: we have found that heavy I/O may cause significant performance degradation for applications running at the same time on other nodes and using the same file system. One possible solution is use of a local file system or MFF (Memory File Facility) for
heavy I/O, with access to GFS only at the start and end of a job, but in this case a job cannot be migrated to other nodes if it is checkpointed.
5.2 GFS Usage Tips
From our early experience we can make the following conclusions in relation to effective use of GFS:
- it is very important to set large enough buffer sizes for I/O (see the sketch after this list)
- for direct access I/O use the FHSDIR option for optimal buffering of non-sequential records
- large data transfers should ideally be done in batch jobs executed on the TX7s
- some operations such as rcp's and file manipulations require execution on the TX7s. These are conveniently done from within SX-6 batch jobs using a do-tx7 utility script, which executes a Unix command in the same directory using rsh on the TX7 remote host
- likewise, for interactive development work on the TX7s, the do-sx6 utility script is useful as well.
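The following minimal sketch illustrates why the first two tips matter, given the 64 KB request threshold described in Section 5; the file name, unit number and record sizes are our own illustrative choices, not taken from the operational suite:

   program gfs_write_sketch
   implicit none
   integer, parameter :: nrec = 1024, reclen = 256   ! 256 x 8 bytes = 2 KB per record
   real(kind=8) :: field(reclen, nrec)
   integer :: i                                      ! index for the record-by-record variant

   call random_number(field)
   open(unit=11, file='gfs_scratch.dat', form='unformatted', action='write')

   ! record-by-record writes: each ~2 KB request falls below the 64 KB
   ! threshold and is routed through slow NFS
   !   do i = 1, nrec
   !      write(11) field(:, i)
   !   end do

   ! aggregated write: a single ~2 MB request stays above the threshold
   ! and is routed through GFS
   write(11) field

   close(11)
   end program gfs_write_sketch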
5.3 Example of Local File System Impact
The following is an example of the GFS impact on program performance. The GASP Ensemble Prediction System (EPS) runs with 32 members executed in parallel using 4 batch jobs of 8 members in background, with each member using 1 CPU. These 4 packs are executed in parallel using 4 nodes. With heavy I/O all to the same storage device, it takes >40 min for 4 packs with I/O on GFS, and this time may vary substantially depending on competing GFS traffic to the same device; the same task requires 15-16 min with I/O onto a cached local file system using the $LOCALDIR environment variable. In this case the time for copying files from GFS to a local file system and back at the beginning and end is negligible in comparison with the I/O overhead caused by directly using GFS during the calculation. Further investigation is continuing with NEC cooperation to achieve near local file system performance with I/O direct to GFS.
6 Stages in SX-6 Installation
The stages in the SX-6 installation included 3-4 months for porting our applications on two nodes at an off-site SX-6 system starting in September 2003, then 3 months during which we had no access while NEC installed and tested the full SX-6 system at the new BoM headquarters. From late March to
late May 2004 we had two months of parallel testing to implement operations before the SX-5’s were decommissioned. This was an extremely tight time frame, especially since it coincided with the move to the new building. As the SX-6 has a similar vector architecture to the SX-5, the single node porting process was mostly straightforward. At the initial porting stage binary files from the SX-5 were used. GFS worked smoothly and only changes in scripts were required to replace NQS options by NQSII with changes to the file system references.
7 Porting Experience from SX-5 to SX-6
In migrating to the SX-6, the then operational applications on the SX-5 were ported; these applications included only some of the efficiencies implemented by NEC on the benchmark codes. One interesting and important issue for our applications is that the SX-6 has half the memory bandwidth per CPU in comparison to the SX-5 (SX-5 and SX-6 node comparison characteristics are given in Table 4), and this may cause significant performance degradation for multithreaded runs; additionally, heavy load within a node may inhibit performance of other applications running on the same node at the same time.
Table 4: SX-5 and SX-6 node comparison
                             SX-5           SX-6
  Peak Performance           128 GFLOPS     64 GFLOPS
  Main Memory Unit           128 GB         64 GB
  Memory Bandwidth per CPU   64 GB/sec      32 GB/sec
  I/O Bandwidth              12.8 GB/sec    8 GB/sec
This issue was recognised very early in use of the SX-5, but on the SX-6 it has a more visible and severe impact. We have identified that indirect addressing with its associated memory access concentration contributes to such performance degradation. A simple example of memory access concentration can be illustrated by the following vectorised Fortran90 loop:
a(:) = table(ind(:))
where size(table) << size(a) and size(table) << 256.
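One workaround commonly used on SX-type vector systems for this pattern - sketched below with our own array names and parameters, not those of the operational code - is to gather from several padded copies of the small table, so that concurrent gathers are spread over more memory banks:

   program gather_spread
   implicit none
   integer, parameter :: n = 100000, ntab = 200, ndup = 8, pad = 1
   real    :: a(n), table(ntab)
   real    :: table_dup(ntab+pad, ndup)   ! padded copies land in different banks
   integer :: ind(n), i, k

   call random_number(table)
   do i = 1, n
      ind(i) = 1 + mod(i, ntab)           ! illustrative index pattern
   end do

   do k = 1, ndup                          ! replicate the small table once
      table_dup(1:ntab, k) = table(1:ntab)
   end do

   ! original form:   a(:) = table(ind(:))   ! all gathers hit the same few banks
   ! modified form: consecutive iterations gather from different padded copies
   do i = 1, n
      a(i) = table_dup(ind(i), 1 + mod(i-1, ndup))
   end do

   print *, sum(a)
   end program gather_spread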
7.1 Performance Tuning Results for GenSI ASSIM
Slow multithreaded performance as described above in the operational assimilation code prompted us to tune the program, and a substantial
improvement was achieved. For an ASSIM program already tuned on the SX-5 for the currently operational T239L29 resolution, it was found when using a higher resolution such as T359L50 in multithreading mode that the application may suffer a major increase in CPU time, regardless of whether the application is executed on a single dedicated node or together with other jobs on the same node at the same time. So the ASSIM performance was analysed in detail again on the SX-6. The comparison of SX-6 CPU times for the same executable for ASSIM using T359L50 resolution showed, in comparison with the SX-5, a ~4% gain on 1 CPU, but a ~15% loss on 4 CPUs. A number of optimisation improvements, in particular elimination of indirect addressing and memory access concentration, led to a reduction in multithreaded memory contention overhead. An indirect effect of the performance improvement was seen from the elimination of memory contention; i.e. the performance improvement for one specific subroutine may improve the performance of other subroutines in multithreading by up to 10%. Some inefficiency was also found with the NEC Fortran90 compiler optimisation. After tuning, the performance on the SX-6 has been improved with a speed up of 44% for 1 CPU and 50% for 4 CPUs.
7.2 Memory Contention in GASP EPS
As an example of the memory contention issue found for the operational GASP EPS executed on a dedicated node, we show in Fig. 2 the CPU time increase when the number of single CPU parallel forecasts on a dedicated node is increased from 1 to 8. The CPU time increases by ~35% overall; CPU time increases almost linearly by ~26 sec for each subsequent forecast, and the increase is wholly due to vector time. We found that these CPU times are reproduced whether the forecasts are run as batch single CPU parallel jobs or via an MPI wrapper.
the BoM; this research usage will expand further on the BoM 23 node facility. Operational usage will be considerably more extensive in 2005 with scheduled upgrades to the major operational applications.
9 Data Archiving at BoM
The main archiving systems used in the BoM are dual MARS/TSM servers on the IBM p690 with 8 CPUs and SamFS on a Sun with 8 CPUs. MARS is the ECMWF Meteorological Archival and Retrieval System, which has been implemented in the Bureau experimentally from 1998 and in a quasi-operational mode in 2004 with the IBM p690 facilities. The current archiving rate is ~5 GB/day or 0.15 TB/month for the MARS servers and 300-350 GB/day or 9-10 TB/month for SamFS. The current data volume archived on MARS is 6 TB and on SamFS is 185 TB.
10 Resolutions in the Future
Table 5 compares current operational resolutions with planned resolution upgrades with estimations of the approximate increase in CPU time required. Based on the -10 times upgrade factor from SX-5 to SX-6 these CPU times should translate into comparable or faster wall times for all these systems, with some capacity still available. Table 5: Current operational resolutions and plans for the future. Current Resolution GASP GASP EPS LAPS LAPS EPS TCLAF'S MESOLAPS POAMA Seasonal Forecasting
T239L29,75 km 33 members T119L19,15Okm 0.375", 29 levels 24 members OS", 29 levels 0.15",29 levels 0.05", 29 levels T47L17.375 km
Future 2005-2007 T359L50,50km 50 members T159L29,llO km 0.25", 50 levels 50 members OS", 50 levels 0.10", 50 levels 0.05", 50 levels T63L50,280 km
UpgradeCPU Time Factor -7 -6-7
1
-8 4 -9 -2 -6
11 Conclusions
The recently acquired SX-6 systems are delivering a major HPC capacity and capability increase to the BoM. The transfer from SX-5 to SX-6 was successfully made without major problems, although in a rather short time frame.
Operational jobs are mostly faster than on the SX-5 (when measured per CPU). The TX7/SX-6/GFS combination provides a seamless environment for research and development. System reliability is very high so far. The main challenges for the future are:
- maintaining good system utilisation as resolutions and the workload of the SX-6 facility are increased
- refined job scheduling with increased demand
- resolving/avoiding performance issues that can arise from memory contention.
12 Acknowledgements
We would like to acknowledge our BMRC colleague Peter Steinle, HPCCC manager Phillip Tannenbaum and HPCCC colleagues, and NEC/A applications support staff for comments and advice during the preparation of this paper. We thank NEC for agreeing to the inclusion of the benchmark performance figures here.
4D-VAR: OPTIMISATION AND PERFORMANCE ON THE NEC SX-6
DR S. OXLEY
Met Office,
FitzRoy Road, Exeter, EX1 3PB, UK. E-mail: stephen.oxley@metoffice.gov.uk
The Met Office 4D-Var scheme went operational in October 2004. In order to produce a timely analysis, 6 nodes are used compared to the 1 node that was previously required for 3D-Var. Even with this 6-fold increase in resources, 4D-Var takes 4 times longer to complete. Optimising 4D-Var for the SX-6 has been, and continues to be, a high priority. Porting the code from the Cray T3Es is described, along with optimisations introduced up to October 2004, such as techniques for increasing vector lengths and reducing communication overheads. Impacts of optimisations are presented, along with investigations into the scalability of the code on a per routine basis.
1. Introduction
This paper presents the optimisation and performance of the Met Office's new data assimilation scheme, 4D-Var, which has been used to produce operational forecasts since 5th October 2004. Work on porting 4D-Var to the SX-6, and preliminary tuning, first commenced in March 2002 during our supercomputer procurement [1]. Recent optimisation work has been a combined effort between Met Office staff and NEC consultants, the goal being to get 4D-Var running as fast and as accurately as possible on 6 nodes of our 30 node SX-6 system (a).
(a) 34 SX-6 nodes and 15 SX-8 nodes from February 05.
Section 2 describes how our 4D-Var scheme differs from our previous 3D-Var scheme and outlines the operational implementation as of October 2004. Section 3 summarises the SX-6 architecture. Section 4 examines what code changes were required to get the code up and running on the SX-6, followed by Section 5 which discusses the optimisations applied in tuning the code to the SX-6 architecture. Section 6 examines the effect
of PE domain decomposition on the execution time, and finally Section 7 demonstrates how 4D-Var scales from 1 to 6 nodes.
2. 4D-Var
The Met Office 4D-Var data assimilation scheme is essentially the same as our 3D-Var scheme [2] but extended by the use of a Perturbation Forecast (PF) model and its adjoint to allow the time distribution of observations throughout the assimilation window to be taken fully into account [3]. This additional work accounts for practically all of the run-time in 4D-Var and is the reason why our 4D-Var data assimilation scheme is an order of magnitude more computationally demanding than our 3D-Var scheme. 3D-Var was developed from the outset to include a development path allowing the scheme to be extended to 4D-Var when computing resources permitted. The horizontal resolution of 4D-Var is currently half of that used in the operational global forecast model - described in the Progress Report on the Global Data Processing System 2004 (b) - due to computational timing constraints. The resolution used in 4D-Var is thus currently N108L38 - 216 grid-points East-West x 163 grid-points North-South x 38 vertical levels. The PF model operates on this same regular longitude-latitude grid-point model and is based on the full non-linear forecast model but simplified to provide fast linear calculations of small increments for fitting observations; in particular the PF model omits most physics schemes. Assimilation is performed over a 6 hour window centred on the analysis hour, and since a time-step of 20 minutes is used, this results in 18 Linearisation States (LS) being fed from the full non-linear forecast model to 4D-Var, reconfigured to half horizontal resolution and centred about the analysis hour (c).
(b) Published annually in the WMO Technical Progress Reports available at: http://www.wmo.int/web/www/DPS/Annual-Tech-Progress/
(c) From February 2005 we have been using double the full forecast time-step: 40 minutes, resulting in 9 LS states.
3D-Var minimises the difference between the observations and the background field in around 40-60 iterations - depending on the number of observations processed and the desired tolerance. In 4D-Var each of these iterations contains a forecast, using the PF model, from the start of the time window to the end. The adjoint of the PF model runs from the end of the time window to the beginning and serves to propagate information from observations back to the initial time, which can then be fed into the outer minimisation. At each time-step during the PF model and its adjoint, a generalised Helmholtz equation is solved for the pressure increment in around 50 or so iterations - which again depends on a tolerance criterion: it is these routines in 4D-Var where most of the computational time is spent.
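A rough count based only on the figures just quoted (the arithmetic is ours) illustrates the cost: each outer iteration requires one PF forecast and one adjoint sweep of 18 time-steps each, with around 50 Helmholtz iterations per step, so a single analysis performs of the order of

   (40 to 60) x (18 + 18) x 50 ≈ 70,000 to 110,000 Helmholtz iterations.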
3. The NEC SX-6
The details of our supercomputer system in Exeter were presented by Burton [1] and are summarised below. The basic building block of the SX-6 system is the NEC designed vector processing unit. This unit, which is integrated on a single CMOS chip, consists of four vector pipelines: floating point add/shift, floating point multiply, logical operator and floating point divide. Each pipeline actually consists of eight parallel pipes, each with a vector length of 32 words, which gives a total vector length per pipeline of 256 64-bit words. The processor operates at 500 MHz, with both the add/shift and multiply pipes generating eight results per cycle (from the eight parallel pipes). This results in a theoretical peak performance of 8 GFlops per PE (d). As with any PE, the main obstacle to approaching peak performance is wasting cycles while waiting for data to arrive at the vector pipelines. However, to alleviate this, the SX-6 PE has an exceptional 32 GB/s of bandwidth to main memory and, supported by 8 x 256 word vector registers and a 64 x 256 word vector data store, it is possible to achieve significant fractions of peak performance on real codes. An SX-6 node consists of 8 processors sharing 32 GB of main memory. The nodes are connected together using a high bandwidth, low latency crossbar switch called the IXS. Although communication within a node is possible by OpenMP, we have chosen to perform all communications between processors, both within a node and between nodes, using MPI.
(d) The peak performance of 8 GFlops only takes into account the multiply and add pipelines. Since there is also a divide pipeline, which operates in parallel to these, it is possible to actually exceed this performance if a suitable mix of instructions exists within a loop.
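The quoted peak follows directly from the clock rate and the two floating point arithmetic pipes (our arithmetic):

   500 MHz x (8 add + 8 multiply results per cycle) = 8 x 10^9 floating point operations per second = 8 GFlops per PE.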
4. Porting Data Assimilation Codes to the SX-6
Data assimilation at the Met Office is split up into two distinct tasks. The first is OPS, which stands for Observation Processing System, and this task's function is to process recent observations into a form suitable to be assimilated into the full forecast model by VAR.
Our data assimilation codes are mostly written in Fortran 90 with a small amount of C for efficient I/O and system calls, e.g. PE and memory usage monitoring. Our codes previously compiled and ran on two platforms - HP and Cray T3E - so were already fairly portable; porting codes with today's modern compilers is less effort than it used to be, since standards are more closely adhered to; it is the tuning to a particular architecture where most of the work is required. Below are some of the changes that were required to port our codes to the SX-6:
4.1. Non-Standard/Ambiguous use of Fortran
(1) Illegal usage of structures (OPS) - Ensure that structures are allocated before initialising them with data; so long as unallocated, but initialised, structures were subsequently never used, the HP and Cray systems were more forgiving in this respect.
(2) Unformatted internal writes - Replace unformatted writes to character strings with explicit formatting, e.g. write(String,*) J becomes write(String,'(i8)') J.
(3) Expand out implied-do loop internal writes (VAR) - Ensure that character manipulation internal writes do not have compact ambiguous loops. Expanding such loops was a simple workaround and made the code easier to read (a sketch of this change follows the list).
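A minimal illustration of change (3); the variable names and format are ours, not taken from VAR:

   program expand_write
   implicit none
   character(len=60) :: line
   integer :: k

   ! compact form: an implied-do internal write
   write(line, '(10i6)') (k, k = 1, 10)

   ! expanded form of the same write: one explicit write per field,
   ! which is unambiguous and easier to read
   do k = 1, 10
      write(line(6*k-5:6*k), '(i6)') k
   end do

   print *, trim(line)
   end program expand_write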
4.2. Functionality Changes
(1) Replacement of NAG libraries - The code inherently becomes more portable by using the free, and widely available, Linear Algebra package.
(2) Reduce the size of MPI ID tags and terminate correctly - The size of MPI tags used on the Cray T3E was larger than the standard allows - 32768. Also, the executables did not exit cleanly on the SX-6 until a call to terminate MPI was added immediately before exiting.
(3) Allow more LS states than PEs (VAR) - 4D-Var was written with the assumption that it would always be run with at least as many PEs as LS states. This was a reasonable assumption on the unshared memory of the T3E, but not so on the SX-6, where even running on 1 PE is possible. The code was adapted to remove this restriction whilst still retaining parallel reading of the files (a sketch of the idea follows this list).
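The following sketch shows the idea behind changes (2) and (3); the round-robin assignment, the constant n_ls and the print statement are our own illustrative choices, not code from VAR:

   program ls_roundrobin
   implicit none
   include 'mpif.h'
   integer, parameter :: n_ls = 18          ! linearisation states in the window
   integer :: my_pe, npes, ierr, ls

   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, my_pe, ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)

   ! round-robin assignment: works for any npes >= 1, so having more LS
   ! states than PEs is no longer a problem, while reads remain parallel
   do ls = 1, n_ls
      if (mod(ls - 1, npes) == my_pe) then
         print *, 'PE', my_pe, 'reads LS state', ls
      end if
   end do

   ! terminate MPI explicitly so that the executable exits cleanly
   call MPI_FINALIZE(ierr)
   end program ls_roundrobin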
4.3. C-Fortran Interface
(1) Ensure correctly declared sizes between routines - Code changes were required in several routines to ensure that the interfaces between C and Fortran calls were compatible; and reasonably obscure compiler options (-dw -Wf,”-i8” -Wf,”-A idbl4”) were required to prevent the -ew (enable word) option from promoting integers declared with KIND to 64 bit.
4.4. Optimisation Level
Increase to the highest level that still gives correct results - For 4D-Var the results were deemed "correct" if they lay within the bounds of variability induced on the T3E by changing the number of PEs, domain decomposition, and/or optimisation level. Address why routines give incorrect results at high optimisation levels - Routines that are not time critical are left with the lower optimisation level, usually -Cvsafe (vector safe). Time critical routines were examined and altered until they gave correct results at the highest (-Chopt), or second highest (-Cvopt), optimisation level. Having ported our codes to the SX-6, very little was required to port the codes to run on our new Linux desktop - now proving to be a useful development platform. Apart from several 64 bit (e), little endian I/O issues, all code changes required were related to the Linux Intel Fortran compiler being stricter in its interpretation of the Fortran 90 standard, which is beneficial in maintaining the overall portability of the codes.
5. Optimisation Techniques and Impacts
On the SX-6 it is most important to ensure that:
(1) Loops vectorise well
(2) Memory bank conflicts are reduced
(3) Load imbalance is reduced
(4) I/O is efficient
(e) VAR is compiled at 64 bit even on native 32 bit architectures, as we are more interested in computational accuracy than computational performance on these development platforms.
Point 1 is clearly the most important on vector architectures as poorly vectorised loops can run slower than the equivalent well vectorised loops by more than an order of magnitude. Point 2 relates to the physical limit of how fast a particular bank of memory can be accessed. Techniques for accessing different banks in subsequent memory references are employed to reduce this latency. Points 3 and 4 apply to all parallel architectures. Load imbalance is an important factor in scaling code to many PEs and nodes, as are the reduction in average vector lengths and the increased communication between PEs. I/O in 4D-Var takes less than 3% of the total run-time and hence is relatively unimportant compared to within the full forecast model and, to a lesser extent, OPS, where an order of magnitude more I/O is performed. Below are descriptions of the most popular optimisations made between the ported version of VAR without any code optimisations - version 20.0 - and the version of VAR with which 4D-Var was first used to produce operational forecasts - version 20.5 - in October 2004. The number in brackets is the number of routines in the top 10 of 4D-Var version 20.0 to which the optimisation was applied. This is followed by an estimate of the impact on total run-time for that type of optimisation.
(1) Allow any summation order in MPI (5: High impact) - This optimisation relaxes the requirement for bitwise reproducibility of results when using a different number of PEs or PE domain decomposition, e.g. during reductions the summation order had been maintained by passing the result of each summation to the next PE, which adds to this result in succession. This leads to code that does not perform as well as when all PEs do their summations in parallel with each other. The PF model in 4D-Var inherited this option from the full forecast model, where bit-reproducibility is required for climate forecasts. Since maintaining bit-reproducibility in the adjoint of the PF model is much more difficult to achieve, full bit-reproducibility in 4D-Var has not been attempted.
(2) Loop splitting to remove dependence of calculations (5: High impact) - A common technique that allows loops to vectorise is to promote scalars to arrays and pre-calculate their values in preceding loops. Anything that allows time critical loops to vectorise well is a valid solution (a sketch of this transformation follows the list).
(3) Change loop nest order to aid vectorisation (5: Medium impact) - When a 3 level, or higher, nested loop cannot be completely collapsed, the compiler makes an educated guess on which subset of loop indices to vectorise over. If the compiler does not make the best choice - perhaps the necessary information is not available at compile time - then changing the order by hand can force the longest loops to be vectorised.
(4) Collapse loops by hand to increase vector lengths (3: Medium impact) - Modifications are sometimes necessary so that the compiler knows that it can safely collapse the loop; alternatively the loop can be collapsed by hand. This sometimes means altering the code to address arrays out of bounds on an index, but within bounds in the whole array, e.g. accessing an (i=1..10, j=1..10) array as (i=1..100, 1).
(5) Loop merging (3: Low impact) - The T3E preferred some loops to be split and unrolled. It is best to merge these on the SX-6.
(6) Use additional temporary arrays to avoid bank conflicts (2: High impact) - Arrange for additional temporary arrays to be stored in memory such that bank conflicts do not occur with the access pattern used within the loops. As with point 2, it looks like more work is done, and the code strays from its scientific meaning, but the performance gain is worth the (small) loss in clarity.
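A generic sketch of optimisation (2); the arrays, the recurrence and the sizes are invented for illustration and are not taken from VAR:

   program loop_split_sketch
   implicit none
   integer, parameter :: n = 100000
   real :: a(n), c(n), d(n), s_arr(n)
   integer :: i

   call random_number(a)
   call random_number(c)
   call random_number(d)

   ! before (inhibits vectorisation): the scalar s carries a running sum
   ! from one iteration to the next
   !   s = 0.0
   !   do i = 1, n
   !      s = s + a(i)
   !      c(i) = c(i) + s * d(i)
   !   end do

   ! after: the scalar is promoted to an array and pre-calculated in a
   ! preceding loop; the time critical second loop then vectorises fully
   s_arr(1) = a(1)
   do i = 2, n
      s_arr(i) = s_arr(i-1) + a(i)
   end do
   do i = 1, n
      c(i) = c(i) + s_arr(i) * d(i)
   end do

   print *, c(n)
   end program loop_split_sketch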
A breakdown of these optimisations, on a per routine basis, can be found in the Appendix. Figure 1 compares the execution times of the unoptimised 4D-Var code - the left-hand columns - with the optimised code - the right-hand columns - using a configuration which was very similar to that used operationally: 6 nodes; N108L38; 20 minute time-step; 2x21 PE domain decomposition (f). Run-times were extracted using the flow tracing utility, ftrace, provided by NEC, which can add up to an additional 10% compared to uninstrumented run-times. Although these timings were not obtained on a truly dedicated system, runs were performed in near empty production queues. To be confident that a run was not disturbed, a common technique is to repeat the run several times and compare tracing outputs: undisturbed runs will show practically identical ftrace times. It is unlikely that disturbed runs will show identical execution times for all routines; they are likely to be disturbed by differing degrees and at different times during execution.
(f) We have found that it is advantageous to use only 7 PEs, of the available 8, when running on more than 1 node. This leaves 1 PE free for the MPI daemons, system, and other high priority tasks.
Figure 1. Demonstrating the impact on timings once 4D-Var has been tuned to the SX-6 (top 10 routines in 4D-Var 20.0, 2x21 PE domain decomposition).
Ftrace collects together the total execution time spent by each PE in each routine. In Figure 1 the height of the black bars for each routine indicates the PE with the fastest execution time. The height of the white bars indicates the slowest PE, and the height of the grey bars indicates the average PE execution time. The difference between the slowest and fastest times is the load imbalance, and the position of the grey bar in relation to the other two indicates the distribution of the load across the PEs. Routine 1 initially took 1100 seconds of the 3045 seconds total execution time. This is because the average vector length in that routine was initially 2. Following a complete overhaul, this increased to 170 - 256 is the maximum on the SX-6 - and the time taken in that routine reduced by an order of magnitude. The only routine not to have improved significantly is routine 9. This well balanced routine only had the loop collapsing optimisation applied to it - see the Appendix - which, although increasing the average vector length from 109 to 147 and increasing the average GFlops per PE from 1.49 to 1.67, didn't have that much overall impact on the execution time. One reason is that the total time spent in bank conflicts for all PEs rose from 67.8 seconds to 68.9 seconds.
6. Effect of PE Domain Decomposition
On vector machines it is sensible to use a PE domain decomposition that maximises vector lengths. For 4D-Var this means that splitting into full rows is best, such that each PE has all longitude grid-points assigned to it over several grid-points of latitude. The concept of halos complicates matters slightly in the sense that the total amount of communication between PEs is minimised for square PE domain decompositions, and, in the current formulation, halos cannot extend out to non-neighbouring PEs - a situation that will occur given enough PEs and a halo size greater than 1. For the current N108L38 resolution, with a time-step of 20 minutes, we require a halo size of 3 grid-points. Since there are 163 grid-points in the North-South direction we can have a maximum number of 54 PEs in that direction. Fortunately this is larger than the number of PEs that have been assigned to us in the operational schedule. In theory, then, our best choice should be to use 1x42 on 6 nodes. This is not the case - it turns out that, currently, a PE domain decomposition of 2x21 gives the best performance on 6 nodes, as shown in Figure 2.
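The figure of 54 follows from the halo width (our arithmetic): since a halo may not reach beyond the neighbouring PE, each PE must hold at least 3 rows, giving at most

   163 / 3 = 54 (rounded down)

PEs in the North-South direction.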
Figure 2. Demonstrating the dependence of run-times on PE domain decomposition (top 10 routines in 4D-Var 20.5, 2x21).
In Figure 2 routines 1 and 3 are the backward (adjoint) and forward (PF model) forecast halo updates respectively. These routines do essentially nothing but communication, and, due to MPI synchronisations, load imbalance shows up here - making them look more expensive than they actually are. On first inspection routine 1 appears to perform additional computation or communication at 1x42, but this is in fact due to the severe load imbalance in routine 4 - a computationally intensive adjoint routine. The reason for the large load imbalance in this adjoint routine is around 100 seconds of bank conflicts occurring at the polar PEs. The cause of the bank conflicts is unknown; since they also occur, but to a lesser extent, at lower 1xX PE domain decompositions, the reason is under priority investigation. The large load imbalance in routine 7, seen at both PE decompositions, is due to much more work being performed in the polar regions. Whether some of this work can be distributed to other PEs is also under investigation. Routine 8 is significantly faster at 1x42 due to communication not being required in this routine at this domain decomposition, along with less load imbalance and longer vector lengths. A few other routines in the top 20 also spike at 1x42 and these too are being investigated.
7. Code Scalability
Figure 3 shows how the execution time of 4D-Var scales from 1 to 6 nodes for the optimum PE configurations on each number of nodes.
Figure 3. Scalability of 4D-Var 20.5 at N108L38.
The actual scaling curve is projected to flatten out at a speedup of around 4 times that of running on 1 node; higher resolutions are expected to flatten out at higher speedup values - lower resolutions at lower speedups. In general, this poor scaling is a combination of reduced vector lengths, as the area assigned to each PE is less; increased communication time, since there are more PEs to communicate between and the total amount of data held in halos is greater; and a greater load imbalance, since when using many PEs the difference in the number of rows assigned to a PE is, relatively speaking, larger, e.g. at 1x28 each PE is assigned 5 or 6 rows; at 1x8 this is 20 or 21 rows. The comparison of speedups on a per routine basis between the optimum PE configuration of 4D-Var on 1 node, 1x8, and the optimum configuration of 4D-Var on 6 nodes, 2x21, is shown in Figure 4.
Figure 4. Comparison of the scalability of 4D-Var 20.5 on a per routine basis (top 10 routines in 4D-Var 20.5, 2x21).
In Figure 4 the run-times of the 1 node job have been divided by 6 so that direct comparison with the 6 node times is possible - the closer the left and right columns, the better the scaling. Routines 1 and 3 do purely communication, handling several terabytes of data each on 6 nodes. As already mentioned, load imbalance in other routines shows up in these due to the presence of communication barriers. The average vector lengths in routines 2, 5 and 7 have halved in going to 6 nodes, from around 200 to 100, primarily because of the East-West row
split. Routine 4 appears to super-scale, and strictly speaking it does, but this is in fact due to the large bank conflicts that occur when using 1 PE in the East-West direction, as examined in Section 6. At 1x8, routines 8 and 9 perform no communication, hence their poor scaling to 2x21, where East-West communication is present. The average vector lengths are also reduced from 253 to 166. Routines 6 and 10 scale reasonably well.
8. Summary
The Met Office 4D-Var scheme has been successfully ported and tuned to an NEC SX-6 supercomputer, enabling assimilation of observations into the operational forecast in around a third of the time that would have been needed with the untuned code - 15 minutes as opposed to 50 minutes. Comparing performance at different PE domain decompositions has highlighted areas of the code where further tuning is expected to improve scalability at the current operational grid resolution.
Acknowledgments
The author would like to thank David Dent for his valuable optimisation work carried out over the past two years, and Gerrit van der Velde for the initial porting and optimisation of 4D-Var during the procurement of the SX-6.
Appendix
Table 1 shows which routines the optimisations described in Section 5 have been applied to. Of note is that around half of the top ten are adjoint routines (specified by _adj) accompanied by their corresponding PF model counterparts.
Table 1. Optimisations applied to the most expensive routines in 4D-Var at the 2x21 PE domain decomposition.
  Rank in 20.0   Rank in 20.5   Routine name
  1              4              Cubic_lagrange_adj
  2              2              Gcr_elliptic_operator_adj2
  3              1              Swap_bounds_adj
  4              5              Gcr_coefficient
  5              3              Swap_bounds
  6              10             Gcr_elliptic_operator
  7              9              Mpp_trisolve_exec
  8              15             Gcr_precon_adi_exec_trisolve
  9              6              Vert_weights
  10             11             Gcr_precon_adi_exec_tri_adj2
  11             8              Interpolation
  12             37             Var_decompose_to_levels
  13             7              Ritchie
References
1. P. Burton, Vector Returns: A new Supercomputer for the Met Office. Realising Teracomputing: Proceedings of the Tenth ECMWF Workshop on the Use of High Performance Computing in Meteorology, 19-28 (2002).
2. Lorenc, A. C., S. P. Ballard, R. S. Bell, N. B. Ingleby, P. L. F. Andrews, D. M. Barker, J. R. Bray, A. M. Clayton, T. Dalby, D. Li, T. J. Payne and F. W. Saunders, The Met. Office Global 3-Dimensional Variational Data Assimilation Scheme, Quart. J. Roy. Met. Soc., 126, 2991-3012 (2000).
3. F. Rawlins et al., The Met Office Global 4-Dimensional Variational Data Assimilation Scheme, in preparation.
THE WEATHER RESEARCH AND FORECAST MODEL: SOFTWARE ARCHITECTURE AND PERFORMANCE
J. MICHALAKES, J. DUDHIA, D. GILL, T. HENDERSON, J. KLEMP, W. SKAMAROCK, W. WANG
Mesoscale and Microscale Meteorology Division, National Center for Atmospheric Research, Boulder, Colorado 80307, U.S.A.
The first non-beta release of the Weather Research and Forecast (WRF) modeling system in May 2004 represented a key milestone in the effort to design and implement a fully functioning, next-generation modeling system for the atmospheric research and operational NWP user communities. With efficiency, portability, maintainability, and extensibility as bedrock requirements, the WRF software framework has allowed incremental and reasonably rapid development while maintaining overall consistency and adherence to the architecture and its interfaces. The WRF 2.0 release supports the full range of functionality envisioned for the model including efficient scalable performance on a range of high-performance computing platforms, multiple dynamic cores and physics options, low-overhead two-way interactive nesting, moving nests, model coupling, and interoperability with other common model infrastructure efforts such as ESMF.
1. Introduction
The WRF project has developed a next-generation mesoscale forecast model and assimilation system to advance both the understanding and the prediction of mesoscale precipitation systems and to promote closer ties between the research and operational forecasting communities. With the release of WRF version 2.0 to the community in May of 2004, the wide dissemination of the WRF modeling system to a large number of users and its application in a variety of areas including storm-scale research and prediction, air-quality modeling, wildfire simulation, hurricane and tropical storm prediction, regional climate, and operational numerical weather prediction are well underway. The number of registered downloads exceeded 2,500 at the end of 2004. 173 participants from 93 institutions in 20 countries attended the annual WRF Users Workshop in June 2004 at NCAR and heard 28 scientific presentations involving work being 156
into operations at NCEP, AFWA, and at the U.S. Navy through Operational Testbed Centers being established at the respective centers. The WRF system, illustrated in Figure 1, consists of the WRF model itself, preprocessors for producing initial and lateral boundary conditions for idealized, real-data, and one-way nested forecasts, postprocessors for analysis and visualization, and a three-dimensional variational data assimilation (3DVAR) program. With the exception of the standard initialization (SI) program, each of the preprocessors and 3DVAR are parallel programs implemented using the WRF Advanced Software Framework (ASF). Data streams between the programs are input and output through the ASF's I/O and Model Coupling API. The WRF Model (large box in figure) contains two dynamical cores, providing additional flexibility across institutions and applications. The NCAR-developed Advanced Research WRF (ARW; originally the Eulerian Mass, or "EM" core) uses a time-split high-order Runge-Kutta method to integrate a conservative formulation of the compressible non-hydrostatic equations [16]. ARW is supported to the research community as WRF Version 2 and is undergoing operational implementation at the U.S. Air Force Weather Agency. NOAA/NCEP's operational implementation of WRF is using dynamics adapted to the WRF ASF from the Non-hydrostatic Mesoscale Model (NMM) [3][8][9][15]. The WRF ASF implements the WRF software architecture [11] and is the basis on which the WRF model and 3DVAR systems have been developed. It features a modular, hierarchical organization of the software that insulates scientific code from parallelism and other architecture-, implementation-, and installation-specific concerns. This design has also been crucial for managing the complexity of a single-source-code model for a range of users, applications, and platforms. This paper describes the implementation and performance of WRF software, including new features provided in WRF 2.0: two-way interacting and moving nests, support for model coupling, and interoperability with emerging community modeling infrastructure such as the Earth System Modeling Framework.
2. WRF Advanced Software Framework
The WRF ASF comprises a number of separable layers and supporting components: the Driver Layer, Mediation Layer, Model Layer, a metaprogramming utility called the Registry, and application program interfaces (APIs) to external packages for interprocessor communication, data formats, and I/O. The benefits of the WRF ASF are facilitation of rapid development, ease of extension, leverage of development effort by the WRF community at large,
software reuse, and ready adaptation to community model infrastructure such as ESMF. The Driver layer handles run-time allocation and parallel decomposition of model domain data structures; organization, management, interaction, and control over nested domains, including the main time loop in the model; high-level interfaces to I/O operations on model domains; and the interface to other components when WRF is part of a larger coupled system of applications. Within the driver, each domain is represented abstractly as a single object: a Fortran90 derived data type containing the dynamically allocated state data with pointers to other domains in the nest hierarchy. Nesting is represented as a tree of domains rooted at the top-level (most coarse resolution) domain. Each model time step involves a recursive depth-first traversal over this tree, advancing each node and its children forward to the next model time. Forcing, feedback, and nest movement are also handled in the Driver. The Mediation Layer encompasses one time-step of a particular dynamical core on a single model domain. The solve routine for the dynamical core contains the complete set of calls to Model Layer routines as well as invocation of interprocessor communication (halo updates, parallel transposes, etc.) and multithreading. The current WRF implementation uses the RSL communication library [12] that, in turn, uses the Message Passing Interface (MPI) communication package. Shared-memory parallelism over tiles - a second level of domain decomposition within distributed memory patches - is also specified in the solve routines using OpenMP. The Model Layer comprises the actual computational routines that make up the model: advection, diffusion, physical parameterizations, and so forth. Model layer subroutines are called through a standard Model Layer Interface: all state data is passed as arguments, along with the starting and ending indices in each of the three grid dimensions for the tile that is being computed. Model layer subroutines may not include I/O, stop statements, multi-threading, or interprocessor communication, ensuring that they may be executed coherently for any tile-decomposition or order of execution over tiles. The Model Layer Interface is a contract between the ASF and the programmer/scientist working at the Model Layer. Adherence to the interface ensures that a Model Layer package incorporated into WRF will work on any parallel computer the framework itself is ported to. Model layer routines that have data dependencies rely on the mediation layer to perform the necessary interprocessor communication prior to their being called. The programmer describes the communication type and pattern by adding an entry to the Registry and then inserts a notation to perform the communication at the appropriate location in the solve routine. The Registry is a concise database of information about WRF data structures and a mechanism for automatically generating large sections of WRF code from the notations in the database. The Registry data base is a collection of tables that
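A minimal sketch of a routine conforming to the Model Layer Interface described above; the routine name, the placeholder update and the argument names (domain, memory and tile index triplets) are illustrative, not taken from the WRF source:

   subroutine sketch_advect( q, u, dt,                       &
                             ids, ide, kds, kde, jds, jde,   &
                             ims, ime, kms, kme, jms, jme,   &
                             its, ite, kts, kte, jts, jte )
      implicit none
      ! ids:jde - full domain extent; ims:jme - memory (patch) extent;
      ! its:jte - the tile this call must compute
      integer, intent(in) :: ids, ide, kds, kde, jds, jde
      integer, intent(in) :: ims, ime, kms, kme, jms, jme
      integer, intent(in) :: its, ite, kts, kte, jts, jte
      real,    intent(in) :: dt
      real,    intent(in),    dimension(ims:ime, kms:kme, jms:jme) :: u
      real,    intent(inout), dimension(ims:ime, kms:kme, jms:jme) :: q
      integer :: i, k, j
      ! all state arrives through the argument list; the routine performs no
      ! I/O, communication or thread management and touches only its tile
      do j = jts, jte
         do k = kts, kte
            do i = its, ite
               q(i,k,j) = q(i,k,j) - dt * u(i,k,j)
            end do
         end do
      end do
   end subroutine sketch_advect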
3. Nesting and Moving Nests
Nesting is a form of mesh refinement that allows costly higher resolution computation to be focused over a region of interest. WRF 2.0 includes support for one-way and two-way interacting nested domains. Nests in WRF are non-rotated and aligned so that parent mesh points are coincident with a point on the underlying nest, which eliminates the need for more complicated generalized regridding calculations. Nest configurations are specified at run-time through the namelist. The WRF ASF supports creating and removing nests at any time during the simulation, but the WRF model is currently constrained to starting nests at the beginning if runs require input of nest-resolution terrain or other lower boundary data; this limitation will be addressed in the near future. Nests may be telescoped (nests within nests) to an arbitrary level of horizontal refinement. Vertical refinement is not yet implemented. Refinement ratios are whole integers, typically 1:3. A prototype implementation of moving nests was released in version 2.0.3. This version was used for a 4 km moving nest simulation of Hurricane Ivan (Sept. 2004). An animation is viewable on-line (http://www.mmm.ucar.edu/wrf/WG2/wrf_moving_nest.gif). Efficient and scalable implementation of nesting is a key concern. All domains in a nested simulation are decomposed over the same set of processes and nested domains run synchronously with the parent. Exchanging forcing and feedback information requires communication to scatter and gather data across processes every parent time step. In addition, interpolation of parent domain data to nest points is load imbalanced because it only occurs over regions of the domain shared by both parent and nest. This is partially alleviated by first rearranging the parent domain data to the processes storing the corresponding nest domain points, allowing interpolation to be performed locally and over a larger number of processes. Figure 2 shows parent domain data on the processes overlaying the nested boundary (including "a" and "b" for the northwest nest corner) being communicated to the processes that compute the nest boundary (including process "c"; recall both domains are decomposed over the same set of processes). After rearrangement of the parent-domain data, the parent-to-nest grid interpolation is performed locally on the nest-boundary processes. Nesting overhead has been measured by running equally dimensioned parent and nest domains as two-way interacting domains and then separately as standalone single domain runs. Overhead for nesting is between 5 and 8 percent, depending on the number of processes, and well within the target of 15 percent overhead observed in the parallel MM5 model. Most of the overhead appears related to the cost of the interpolation, which uses a relatively expensive non-linear algorithm. The approach for moving nests is the same as two-way nesting, with some additional logic added to the framework and the model code:
1. Determine whether it is time for a move and, if so, the direction and distance of the move.
2. Adjust the relationship between points on the nest and corresponding points on the parent domain.
3. Shift data in the 2- and 3-dimensional state arrays of the nest in the opposite direction of the nest movement.
4. Initialize the leading edge of the nest as it moves into a new position relative to the parent domain.
Additional work is needed in step 1, above, to incorporate an automatic feature-following nest movement mechanism and in step 4 to allow run-time ingest of nest-resolution lower boundary data such as topography and land use on the leading edge of the moved nest. Lastly, the issue of moving coupling to an external model such as an ocean model will be addressed.
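The recursive advance/force/feedback cycle over the nest tree described in Sections 2 and 3 can be sketched as follows; the type, routine names and the single-step refinement are our own simplifications, not WRF source code:

   module nest_sketch
      implicit none
      type :: domain
         integer :: id = 0
         type(domain), pointer :: children(:) => null()
      end type domain
   contains
      recursive subroutine advance(grid)
         ! one parent time step: advance this domain, then force, advance and
         ! feed back each child (depth-first over the nest tree); the 3:1 time
         ! refinement of real nests is collapsed into one child step for brevity
         type(domain), intent(inout) :: grid
         integer :: k
         print *, 'advancing domain', grid%id
         if (associated(grid%children)) then
            do k = 1, size(grid%children)
               ! parent -> nest boundary forcing would be applied here
               call advance(grid%children(k))
               ! nest -> parent feedback would be applied here
            end do
         end if
      end subroutine advance
   end module nest_sketch

   program drive_nests
      use nest_sketch
      implicit none
      type(domain) :: top
      top%id = 1
      allocate(top%children(1))
      top%children(1)%id = 2         ! one telescoped nest under the top domain
      call advance(top)
   end program drive_nests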
4. I/O and Model Coupling
The I/O and Model Coupling API within the WRF ASF provides a uniform, package-independent interface between the WRF model and external packages for I/O and data formatting. Implementations of the API for NetCDF, Parallel HDF5, native-binary, and GRIB1 I/O are run-time assignable to the framework's I/O streams. The WRF I/O and Model Coupling API also supports model coupling, an idea well developed under [5][6] and in the PRISM coupling framework [7]. "Coupling as I/O" is attractive in that it allows encapsulation of details of component data exchange within a model's control structures and interfaces that already exist for I/O. It requires little if any modification to the models themselves, it is readily and efficiently adaptable to different forms of coupling (sequential or concurrent), it can switch transparently (from the application's point of view) between on-line and off-line modes of coupling, and it is naturally suited to distributed computing environments such as Grid computing. Two model coupling implementations of the WRF I/O and Model Coupling API have been developed: the Model Coupling Toolkit (MCT) [10] is the basis for the Community Climate System Model (CCSM) coupler; the Model Coupling Environment Library (MCEL) [2] is a CORBA-based client-server coupling framework. The MCT implementation of the WRF I/O and Model Coupling API supports regular, scheduled exchanges of boundary conditions for tightly to moderately coupled interactions between WRF and the Regional Ocean Modeling System
Muqun Yang at NCSA contributed the HDF5 implementation of the WRF I/O API. Todd Hutchinson of WSI Inc. contributed the GRIB1 implementation.
WRF is being adapted to support top-level interface requirements to interoperate as an ESMF coupled component but will also continue to interoperate through I/O-like coupling mechanisms through the WRF I/O and Model Coupling API. Implementing a form of ESMF coupling that presents itself through the WRF I/O and Model Coupling API is also being explored.
5. Performance
Key goals for the WRF software are portability and efficiency over shared, distributed-memory, and hybrid parallel architectures and over both vector and scalar processor types. The WRF ASF supports a two-level decomposition strategy that first decomposes each model domain over distributed-memory patches and then, within each patch, over shared-memory tiles. The framework and the Model Layer Interface allow the Driver layer to decompose domains over arbitrarily shaped and sized rectangular patches and tiles, giving maximum flexibility for structuring the computation as efficiently as possible. Towards the goal of WRF performance-portability, routine benchmarking on a variety of target computer platforms has been ongoing. Figure 5 represents a snapshot of WRF performance at the end of 2004; the latest results are maintained and routinely updated on the web (http://www.mmm.ucar.edu/wrf/bench). The test case in the figure is a 48-hour, 12 km resolution case over the Continental U.S. (CONUS) domain. The computational cost for this domain is about 22 billion floating point operations per average time step (72 seconds). Performance is defined as model speed, ignoring I/O and initialization cost, directly measured as the average cost per time step over a representative period of model integration, and is presented both as a normalized floating-point rate and as simulation speed. These are equivalent measures of speed, but floating-point rate expresses speed as a measure of efficiency relative to the theoretical peak capability while simulation speed, the ratio of model time simulated to actual time, is more relevant as a measure of actual time-to-solution. The purpose of this WRF benchmark is to demonstrate computational performance and scaling of the WRF model on target architectures. The benchmarks are intended to provide a means for comparing the performance of different architectures and for comparing WRF computational performance and scaling with other similar models. In light of the continuing evolution and increasing diversity of high-performance computing hardware it is important to define what is being counted as a process. For this benchmark, a parallel process is the finest-grained sequence of instructions and associated state that produces a
² http://www.mmm.ucar.edu/wrf/bench
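To make the two performance measures concrete, the short sketch below (not part of the WRF distribution) converts a measured average wall-clock time per step into a sustained floating-point rate and a simulation speed, using the 22 billion operations and 72-second time step quoted above for the CONUS case; the 4-second timing is an invented example input.

    # A small sketch (not WRF code) relating the two performance measures discussed
    # above. The operation count per step and the model time step are the values
    # quoted for the 12-km CONUS case; the measured wall-clock time is a made-up input.
    FLOPS_PER_STEP = 22e9     # floating-point operations per average time step (quoted)
    MODEL_DT = 72.0           # seconds of simulated time per time step (quoted)

    def wrf_performance(avg_wallclock_per_step):
        gflops = FLOPS_PER_STEP / avg_wallclock_per_step / 1e9   # sustained Gflop/s
        sim_speed = MODEL_DT / avg_wallclock_per_step            # simulated time / wall time
        return gflops, sim_speed

    # e.g. a run averaging 4 seconds of wall clock per model step:
    gflops, sim_speed = wrf_performance(4.0)
    print(f"sustained {gflops:.1f} Gflop/s, simulation speed {sim_speed:.1f}x real time")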
institutions, and applications. Portability and efficiency over a full range of high-performance computing systems have been a key objective. Finally, the modular design of the WRF ASF and its interfaces facilitates model coupling and the integration of WRF into community model infrastructure efforts such as PRISM and ESMF.
Acknowledgements
M. Yang (NCSA), T. Black, N. Surgi, and S. G. Gopalakrishnan (NOAA NCEP), D. Schaffer, J. Middlecoff, G. Grell (NOAA FSL), M. Bettencourt (AFRL), Shuyi Chen and David Nolan (U. Miami), Alan Wallcraft (NRL), Chris Moore (NOAA PMEL), JIN Zhiyan (CMA), D. Barker, A. Bourgeois, C. Deluca, R. Loft (NCAR), J. Wegiel (AFWA), T. Hutchinson (WSI), R. Jacob and J. Larson (ANL).
References
1. Allard, R., C. Barron, C. A. Blain, P. Hogan, T. Keen, L. Smedstad, A. Wallcraft, C. Berger, S. Howington, J. Smith, R. Signell, M. Bettencourt, and M. Cobb, "High Fidelity Simulation of Littoral Environments," in proceedings of UGC 2002, June 2002.
2. Bettencourt, M. T., Distributed Model Coupling Framework, in proceedings of HPDC-11, July 2002. (http://www.extreme.indiana.edu/)
3. Black, T., E. Rogers, Z. Janjic, H. Chuang, and G. DiMego, Forecast guidance from NCEP's high resolution nonhydrostatic mesoscale model. Preprints, 15th Conference on Numerical Weather Prediction, San Antonio, TX, Amer. Meteor. Soc., J23-J24, 2002.
4. Coats, C. J., J. N. McHenry, A. Lario-Gibbs, and C. D. Peters-Lidard, 1998: MCPL(): A drop-in MM5-V2 module suitable for coupling MM5 to parallel environmental models; with lessons learned for the design of the weather research and forecasting (WRF) model. The Eighth PSU/NCAR Mesoscale Model Users' Workshop, MMM Division, NCAR, Boulder, CO, 117-120.
5. Coats, C. J., A. Trayanov, J. N. McHenry, A. Xiu, A. Gibbs-Lario, and C. D. Peters-Lidard, 1999: An Extension of the EDSS/Models-3 I/O API for Coupling Concurrent Environmental Models, with Applications to Air Quality and Hydrology. Preprints, 15th IIPS Conference, Amer. Meteor. Soc., Dallas, TX, January 10-15, 1999.
6. Fiedler, B., J. Rafal, and C. Hudgin, F90tohtml tool and documentation. http://mensch.org/f90tohtml
7. Guilyardi, E., R. Budich, G. Brasseur, and G. Komen, 2002: PRISM System Specification Handbook V.1. PRISM Report Series No. 1, 239 pp.
8. Janjic, Z. I., J. P. Gerrity, Jr. and S. Nickovic, An Alternative Approach to Nonhydrostatic Modeling. Monthly Weather Review, 129, 1164-1178, 2001.
9. Janjic, Z. I., A Nonhydrostatic Model Based on a New Approach. Meteorology and Atmospheric Physics (in print), 2002.
10. Larson, J. W., R. L. Jacob, I. Foster, and J. Guo, "The Model Coupling Toolkit," 2001, Proc. 2001 Int'l Conf. on Computational Science.
11. Michalakes, J., J. Dudhia, D. Gill, J. Klemp, and W. Skamarock, "Design of a next-generation regional weather research and forecast model," Towards Teracomputing, World Scientific, River Edge, New Jersey (1999), pp. 117-124.
12. Michalakes, J.: RSL: A parallel runtime system library for regional atmospheric models with nesting, in Structured Adaptive Mesh Refinement (SAMR) Grid Methods, IMA Volumes in Mathematics and Its Applications (117), Springer, New York, 2000, pp. 59-74.
13. Michalakes, J., M. Bettencourt, D. Schaffer, J. Klemp, R. Jacob, and J. Wegiel, Infrastructure Development for Regional Coupled Modeling Environments Parts I and II, Final Project Reports to the Programming Environment and Training Program of the U.S. Dept. of Defense High Performance Computing Modernization Office, Contract No. N62306-01-D-7110/CLIN4, 2003 and 2004. See also: http://www.mmm.ucar.edu/wrf/WG2/software_2.0/io.pdf
14. Nolan, D. S. and M. T. Montgomery, 2002: "Nonhydrostatic, three-dimensional perturbations to balanced, hurricane-like vortices. Part I: linearized formulation, stability, and evolution", J. Atmos. Sci., 59, 2989-3020.
15. Pyle, M. E., Z. Janjic, T. Black, and B. Ferrier: An overview of real time WRF testing at NCEP, in proceedings of the MM5/WRF Users Workshop, NCAR, June 22-25, 2004.
16. Wicker, L. J., and W. C. Skamarock, 2002: Time splitting methods for elastic models using forward time schemes. Mon. Wea. Rev., 130, 2088-2097.
ESTABLISHMENT OF AN EFFICIENT MANAGING SYSTEM FOR NWP OPERATION IN CMA
JIANGKAI HU, WENHAI SHEN
National Meteorological Centre, China Meteorological Administration
Abstract
With the application of NWP developing over recent years in China, the scale of the operational suite has increased. An efficient managing system is important for the stability of the operational suite, and it has become necessary to re-engineer the business processes of NWP operations. Given the distribution and diversification of NWP applications, the challenge for CMA is how best to manage the operational system within the available facilities. Some methods and projects are being put into practice. In this paper, our approach to standardizing the running interface and visualizing the running processes with SMS is shown, and an ongoing project, "National Meteorological Data Access and Retrieval System", a general way for NWP data access in China, is described.
1. Introduction
The application of NWP has made big progress in the past eight years in China, mainly for two reasons: one is the improvement in accuracy of NWP forecasts, and the other is that more and more forecasts required by society could not be provided on time without the support of NWP products. There is no doubt that such products have played an important role in operational weather forecasting. Table 1 shows a comparison of NWP forecasts in 1996 and in 2004. The increased number of forecasts in operation, the limited computing resources, and the different types of supercomputers and storage facilities make the management of operational NWP systems more complex today than in the past. In addition, there are frequent changes to NWP systems, including model development, updates to data assimilation, support for new observation types, special product requirements and so on. In the CMA, we do not have a parallel suite or a good test-bed suite due to limitations in computing resources. Modifications of the major parts are often well tested, but changes to the minor parts are not always tested well and may result in human errors in some NWP systems. Also, scheduled maintenance often takes second place to NWP system operation. All of the above issues make it difficult for new staff to operate the whole NWP system. The author has many years of experience in maintaining the CMA NWP operational systems. Today, the team is focussed on improving the management and efficiency of the NWP operational system.
1996
Forecast                              Period
Global (T106L19)                      6 hours
Regional (Hlafs05)                    6 hours
Typhoon (Hlafs)                       12 hours

2004
Forecast                              Period
Global (T213L31)                      6 hours
Regional (Hlafs025)                   6 hours
Regional (Grapes)(xp)                 6 hours
Typhoon1 (Hlafs)                      12 hours
Typhoon2 (T213L31)                    12 hours
Meso-Scale                            6 hours
Very Short-Range (WRF)(xp)            3 hours
Medium-Range Ensemble (T106L19)       Daily
Air pollution forecast                Daily
Ultraviolet radiation forecast        Daily
Sandstorm forecast                    Daily
Potential fire index forecast         Daily
Emergency response modeling system    On demand

Table 1. NWP forecasts comparison in 1996 and 2004
The challenge is how to set up a stable system that meets the operational schedule for forecasts and dissemination of products within the available facilities at CMA. In the following sections we show some methods put into practice and some ideas on standardizing the running interface, visualizing the running processes with SMS, and some background on the project "National Meteorological Data Access and Retrieval System".
2. Standardizing the Running Interface
In CMA, the staff that maintain the NWP operational systems work at three levels. The operators are at the first level and their job is to monitor the operational consoles 24 hours a day. If an abnormal message from the NWP operational system appears, they call the supervisors, who are at the second
level, to investigate the problem. Researchers are called if the problem is identified as a failure in one of the model applications. In practice, only the researchers are qualified in meteorology. In some cases, particularly when a new system is implemented, researchers are also contacted frequently because the NWP system supervisors are not sufficiently experienced with the new system. Given the expansion of the NWP operational systems in recent years, this has become more of an issue. If such upgrades to the operational suite could be made using a standard running interface, then supervisors could operate it without needing to know what the components actually do and could simply restart the right step. Some ideas were considered: having a unique ID for the operational account on all computers; separating constant data from other data in different directories; and standardizing the components for running. The NWP operational system is distributed over different computers in CMA. Having a unique account ID on these systems is more convenient for supervisors and avoids them needing to remember different accounts and passwords. The need to have the same file organization on different systems is also important. When we organize so many files, including source codes, libraries, executables, scripts and other generated data, we should separate the data according to how often it changes. Inconstant data mainly includes the intermediate data and the products; this data is used for a short period of time or is updated frequently. Relatively constant data includes the source code, libraries, executables, scripts, some configuration files, and climate data; this data is relatively constant or seldom modified once the system is stable. It is probably a good idea to try and separate these two kinds of data. This approach is not usually followed by the research groups. However, it is clearly an advantage when doing backups that only the relatively constant data needs to be backed up, reducing the volume of data written. All the models of the NWP system are categorized into two categories: global models and regional models. Figure 3 shows the hierarchy of the file system.
$BASE
|-- NWP_GMFS (NWP_RMFS)
    |-- T213 (hlafs/mm5/grapes)
        |-- condat
        |-- $component1
        |-- ...
        |-- $componentn
        |-- bin
        |-- script
        |-- source
            |-- Makefile
            |-- $component1 ... $componentn
                |-- codes
                |   \-- $original
                |-- libs

$WORKDIR
|-- NWP_GMFS_DATA (NWP_RMFS_DATA)
    |-- T213 (hlafs/mm5/grapes)
        |-- $component1
        |-- $component2
        |-- $componentn
        +-- log

Figure 3. File system hierarchy
The above schematic shows the layout of the file organization for the operational system at CMA. All the models are integrated into a unified interface. All the data generated at run time are separated into another directory or another file system. Having a common data interface helps with data processes such as cleanup and archiving, and such a structure also helps avoid human errors. In the above file hierarchy at CMA the components are observation processing, data analysis, forecasting, post-processing, GRIB field file generation, and product generation. The directory "bin" is for executable files. The directory "script" is for run scripts. In the "source" directory, the whole system is compiled with a Makefile, which is very useful for porting. To accommodate the habits of researchers, the original structure of the source codes is kept under the "codes" directory.
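Because everything generated at run time lives under a single $WORKDIR tree, housekeeping can be written once for all models. The sketch below is a hypothetical helper, not CMA's operational script: it removes generated files older than a given age while never touching the relatively constant data under $BASE.

    # Hypothetical cleanup helper (not the CMA operational script): all generated
    # data sit under one $WORKDIR tree, so old intermediate files and products can
    # be removed in a single pass without touching the constant data under $BASE.
    import os
    import time

    def cleanup_workdir(workdir, max_age_days=3):
        cutoff = time.time() - max_age_days * 86400
        removed = 0
        for root, _dirs, files in os.walk(workdir):
            for name in files:
                path = os.path.join(root, name)
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed += 1
        return removed

    # e.g. keep only the last three days of generated data for the global suite:
    # cleanup_workdir("/workdir/NWP_GMFS_DATA/T213", max_age_days=3)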
Other rules about file naming are also applied. For example, the names of executable files start with a prefix formed from the abbreviation of the component and an underscore; the middle part represents the function of the executable; the suffix can be ".x", ".exe", etc. In the file naming rules, the encoded information includes the model name, the component name, the function name, and the date-time. Some earlier conventions are also taken into account. If new models are introduced according to these rules, then the supervisors of the operational system can typically support them without any special training: they do not need to know the details of a new model, they just run and monitor it. With a standard running interface, this becomes easy to do.
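As an illustration of such a convention, the sketch below assembles executable and product names mechanically from the model, component, function and date-time; the exact field order, abbreviations and separators shown here are assumptions for illustration, not the CMA standard itself.

    # Illustrative only: the field order, abbreviations and separators are assumed,
    # not the exact CMA rules. The point is that every executable and product name
    # can be generated (and parsed) mechanically from a few well-defined fields.
    from datetime import datetime

    def executable_name(component, function, suffix=".exe"):
        # e.g. component "da" (data analysis) and function "qc" give "da_qc.exe"
        return f"{component}_{function}{suffix}"

    def product_name(model, component, function, valid_time):
        stamp = valid_time.strftime("%Y%m%d%H")
        return f"{model}_{component}_{function}_{stamp}.grib"

    print(executable_name("da", "qc"))                                      # da_qc.exe
    print(product_name("t213", "post", "sfc", datetime(2004, 10, 25, 12)))  # t213_post_sfc_2004102512.grib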
3. Visualizing the Running Processes by SMS
Visualization tools are very useful for monitoring NWP forecast suites as they run. With a graphical interface, operators and supervisors can immediately see the status of the forecasts, and supervisors can more easily locate problems when they occur and take remedial action. The ECMWF SMS (Supervisor Monitor Scheduler) meets these requirements. Late last year, we obtained the SMS software and started a project to use it. At first, we will integrate the NWP operational system with SMS step by step; after this we plan to do some more advanced development using SMS. Up to now, we have set up a parallel suite using SMS for a global model (T213). The suite will be put into operation when it is sufficiently stable. The XCDP GUI hides the details of the SMS system and provides a good running interface; it allows supervisors to take some actions without knowing too many details. The important thing to consider is how to make the system easier to operate through SMS. Some parts of SMS might be developed further. Regarding security, authorization should be extended to the suites. Mostly, what we are considering is setting up an alarm system, including voice functions and automatic telephone calls. Connected to XCDP, this system could sound an audible alarm when the system breaks (a failed node turns red in XCDP), and a supervisor would be called automatically if a serious incident has happened. Figure 4 shows a demo of the operational suites.
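A minimal sketch of the planned alarm escalation is shown below; the status query and the notification calls are hypothetical stand-ins rather than SMS or XCDP interfaces.

    # Minimal sketch of the alarm idea: poll the suite status, raise an audible
    # alarm on any failed task, and call a supervisor if the failure persists.
    # get_failed_tasks(), sound_alarm() and call_supervisor() are hypothetical
    # stand-ins, not SMS or XCDP interfaces.
    import time

    def get_failed_tasks():
        return []                                    # stand-in: query the scheduler here

    def sound_alarm(task):
        print(f"ALARM: task {task} failed")          # stand-in for an audible alarm

    def call_supervisor(task):
        print(f"Calling supervisor about {task}")    # stand-in for an automatic phone call

    def watch(poll_seconds=60, escalate_after=3):
        failed_polls = {}                            # task -> consecutive polls in a failed state
        while True:
            for task in get_failed_tasks():
                failed_polls[task] = failed_polls.get(task, 0) + 1
                sound_alarm(task)
                if failed_polls[task] == escalate_after:
                    call_supervisor(task)
            time.sleep(poll_seconds)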
SMS provides functions for scheduling tasks, so a reasonable logistics between tasks is pivotal for efficiency. Petri nets have proven to be useful in the context of logistics and process control, thanks to their graphical and precise nature, their firm mathematical foundation and the abundance of analysis methods [1]. Clearly, the graphical nature of Petri nets is a very important feature in the context of business process redesign. Research on Petri nets addresses the issue of flexibility; many extensions have been proposed to facilitate the modeling of complex systems. Typical extensions are the addition of 'colour', 'time' and 'hierarchy'. It is interesting to investigate the application of Petri nets to the analysis of an NWP operational system.
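As a toy example of that idea, the sketch below encodes a linear suite (observation processing, analysis, forecast, product generation) as a Petri net with places, transitions and a simple firing rule; the task names are illustrative and are not the CMA suite definition.

    # A toy Petri net: places hold tokens, a transition may fire when all of its
    # input places hold a token, and firing moves tokens from inputs to outputs.
    # The task names below are illustrative, not the CMA suite definition.
    class PetriNet:
        def __init__(self, marking):
            self.marking = dict(marking)             # place -> number of tokens
            self.transitions = {}                    # name -> (input places, output places)

        def add_transition(self, name, inputs, outputs):
            self.transitions[name] = (inputs, outputs)

        def enabled(self, name):
            inputs, _ = self.transitions[name]
            return all(self.marking.get(p, 0) > 0 for p in inputs)

        def fire(self, name):
            if not self.enabled(name):
                raise RuntimeError(f"transition {name} is not enabled")
            inputs, outputs = self.transitions[name]
            for p in inputs:
                self.marking[p] -= 1
            for p in outputs:
                self.marking[p] = self.marking.get(p, 0) + 1

    net = PetriNet({"obs_arrived": 1})
    net.add_transition("obs_processing", ["obs_arrived"], ["obs_ready"])
    net.add_transition("analysis", ["obs_ready"], ["analysis_done"])
    net.add_transition("forecast", ["analysis_done"], ["forecast_done"])
    net.add_transition("products", ["forecast_done"], ["products_done"])

    for t in ["obs_processing", "analysis", "forecast", "products"]:
        net.fire(t)                                  # each step fires only when its input is ready
    print(net.marking["products_done"])              # 1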
4. Project “National Meteorological Data Access and Retrieval System”
Last year, a project, "National Meteorological Data Access and Retrieval System", was started to provide a standard user interface to mass-capacity storage and a better platform for users to share all kinds of data. Figure 5 shows the connections in CMA. There are two SANs (Storage Area Networks) designed for the NWP operational suites. One is dedicated to the supercomputers to support the running systems: a large volume of intermediate NWP data is deposited in the parallel file system through a SAN switch, and this data can be accessed only by the supercomputer. The other SAN is designed for user access. Large amounts of meteorological data will be included, such as observations, service products, some data from the GTS, and output of the NWP operational system. In the operational system today, the primary data formats maintained in CMA include standard single-field GRIB files distributed by wide area network, facsimile graphs distributed by electrograph, images published on the web, and MICAPS (a visualization tool for weather forecasting) format files for the CMO (Central Meteorological Office). Generating large amounts of data in various formats for users is time-consuming work, some of the data are used infrequently, and the products cannot cover the model output entirely. For these reasons, the "product making" component of the operational system should be moved to the user end. With the support of the project, the data access interface between users and the operational system becomes clearer. The NWP operational system retrieves real-time observations from the storage pools, moves the data to the parallel file system, then starts the subsequent components and deposits the temporary data there; at the end the output is
Figure 6. The reference model
Reference
[1] W. M. P. van der Aalst, "The Application of Petri Nets to Workflow Management," The Journal of Circuits, Systems and Computers, 1998.
THE NEXT-GENERATION SUPERCOMPUTER AND NWP SYSTEM OF THE JMA
MASAMI NARITA
Numerical Prediction Division, Japan Meteorological Agency, 1-3-4 Ote-machi, Chiyoda-ku, Tokyo, 100-8122, Japan
For operational numerical weather prediction, the Japan Meteorological Agency (JMA) has been running analysis and forecast models on the HITACHI SR8000 model E1 since March 2001. In the next-generation supercomputer procurement announced in April 2004, the HITACHI SR11000 was judged to be the best offer. The JMA plans to improve its analysis and forecast models on the new computer.
1. Introduction The Japan Meteorological Agency (JMA) is the national organization for Japanese operational weather service, and as such is responsible for contributing to the improvement of public welfare including natural disaster prevention and mitigation, safety of transportation, prosperity of industries, and international cooperation activities. The major activities of the JMA are:
- To issue warnings, advisories and forecasts in the short range, one-week range and long range.
- To deal with global environmental issues such as global warming and ozone depletion.
- To provide information on earthquake and volcanic activities.
Since more comprehensive meteorological information is increasingly required for the development of the socio-economic system, the JMA has been making every effort to improve the forecasts, to facilitate climate-related activities, and to enhance the tsunami and earthquake prediction systems. To issue accurate meteorological information quickly, numerical weather prediction by massive computation on a high-performance supercomputer plays an important part. The JMA runs several analysis and forecast models operationally and makes efforts to improve their accuracy and computational efficiency.
2. JMA computers
2.1. History and current supercomputer: SR8000 model E1
The JMA installed its first-generation computer, an IBM 704, for operational NWP in March 1959. Figure 1 shows the history of computers used for operational NWP at the JMA in terms of peak performance. In March 2001, the JMA installed a distributed-memory parallel processing computer, the HITACHI SR8000 model E1, as its seventh-generation computer. This machine is configured with 80 nodes, and each node has eight RISC processors with a peak performance of 1.2 GFLOPS per processor. Processors in a single node can be operated simultaneously and share the main memory. The peak performance of all 80 nodes of the SR8000 model E1 reaches 768 GFLOPS. These nodes are divided into three subsystems: 73 nodes for operational NWP and development work, 5 nodes for other operations, and 2 nodes for system management. Each subsystem can be operated independently of the others.
Figure 1. The history of computers used for operational NWP at the JMA in terms of peak performance.
2.2. Supercomputer procurement in 2004
The JMA is planning to replace its supercomputer between March 2005 and March 2006. The contract period for the next-generation supercomputer is from April 2006 to March 2011, and the procurement was announced in April 2004. The benchmark tests prepared by the JMA consist of eight parts:
1. Global forecast by the JMA global spectral model (GSM) with a resolution of TL959 L40,
2. Meso-scale forecast by the JMA non-hydrostatic meso-scale model (MSM) with grid points of 721 x 577 x 50,
3. Meso-scale analysis by the JMA four-dimensional variational assimilation system based on the 10-km hydrostatic MSM with grid points of 361 x 289 x 40,
4. Very short-range forecast (VSRF) of precipitation based on kinematics,
5. Compilation,
6. Disk I/O,
7. Task generation,
8. File transfer through the network.
The candidates for the procurement were allowed to optimize the codes of the GSM, the non-hydrostatic MSM, the meso-scale analysis, and the VSRF for their own computers. In June 2004, the offer from HITACHI was judged to be the best, and the JMA decided to install the HITACHI SR11000 as its next-generation supercomputer.
2.3. Next-generation supercomputer: HITACHI SR11000 model J1 and "model J1 follow-on"
The next-generation supercomputer of the JMA consists of three subsystems: subsystem 1 is going to be installed in March 2005, and subsystems 2 and 3 in March 2006. Subsystem 1 is configured with a 50-node SR11000 model J1, each node having 16 IBM POWER5 processors with a frequency of 1.9 GHz. The peak performance of the total 50 nodes of the SR11000 model J1 reaches 6.08 TFLOPS. Subsystems 2 and 3 are the same in composition, and each subsystem is configured with an 80-node SR11000 "model J1 follow-on", each node having 16 POWER5+ processors with a frequency of 2.1 GHz. The peak performance of the total 80 nodes of the SR11000 "model J1 follow-on" reaches 10.75 TFLOPS. Figure 2 shows a schematic view of one node configuration of the SR11000.
Figure 2. A schematic view of one node configuration of the HITACHI SR11000.
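The quoted peak figures follow directly from the node counts and per-processor peaks. The quick check below reproduces them, assuming four floating-point operations per clock cycle for the POWER5 and POWER5+ figures (an assumption; the per-processor peak for the SR8000 is given explicitly in the text).

    # Quick arithmetic check of the quoted peak performance figures. The SR8000
    # per-processor peak (1.2 GFLOPS) is given in the text; for the SR11000 nodes,
    # four floating-point operations per clock cycle is assumed, which reproduces
    # the quoted totals.
    def peak_tflops(nodes, procs_per_node, gflops_per_proc):
        return nodes * procs_per_node * gflops_per_proc / 1000.0

    print(peak_tflops(80, 8, 1.2))        # SR8000 model E1: 0.768 TFLOPS (768 GFLOPS)
    print(peak_tflops(50, 16, 1.9 * 4))   # SR11000 model J1, subsystem 1: 6.08 TFLOPS
    print(peak_tflops(80, 16, 2.1 * 4))   # "model J1 follow-on" subsystem: 10.752 TFLOPS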
3. JMA NWP system
The operational NWP at the JMA started in March 1959 after a few years of preparation. Because the targets of operational NWP have spread widely in space and time scales, the JMA operates a suite of analysis and forecast models for the very short-range prediction of meso-scale disturbances, the short-range and medium-range prediction of synoptic-scale disturbances, and the long-range prediction of large-scale disturbances.
3.1. Operational suite on the current supercomputer
The operational suites of analysis and forecast models at the JMA are shown in Tables 1 and 2, respectively.
Table 1. Operational suite: Analysis.
Name                  Analysis time        Analysis scheme
Global Analysis       00, 06, 12, 18 UTC   3D-Var
Regional Analysis     00, 06, 12, 18 UTC   4D-Var
Meso-scale Analysis   00, 06, 12, 18 UTC   4D-Var
Typhoon Analysis      06, 18 UTC           3D-Var

Table 2. Operational suite: Forecast.

Name                                      Model                                      Forecast span                      Operation interval
El Nino Forecast                          Atmosphere: T42L21, Ocean: 144 x 106 L20   1.5 years                          1/2 month
Seasonal Ensemble                         GSM: T63L40 M31                            4 or 7 months                      1 month
One-Month Ensemble                        GSM: T106L40 M26                           1 month                            7 days
Medium-Range Ensemble                     GSM: T106L40 M25                           9 days                             Daily
Global Forecast                           GSM: T213L40                               4 days (00 UTC), 9 days (12 UTC)   12 hours
Typhoon Forecast                          TYM: 24 km L25                             84 hours                           6 hours
Regional Forecast                         RSM: 20 km L40                             51 hours                           12 hours
Meso-scale Forecast                       Non-hydrostatic MSM: 10 km L40             18 hours                           6 hours
Very Short-Range Precipitation Forecast   Kinematics: 2.5 km                         6 hours                            30 minutes
The current global analysis is based on a three-dimensional variational method (3D-Var), but will be replaced by a four-dimensional variational method (4D-Var) in FY 2004. At the same time as the implementation of the 4D-Var global analysis, a semi-Lagrangian advection scheme will be incorporated into the global spectral model (GSM)*. In the 4D-Var assimilation system, the tangent-linear and adjoint codes of deep convection have been revised, and the latest observation operators used in the 3D-Var system have been implemented. In the semi-Lagrangian advection scheme for the global forecast model, the computation of the advection terms is split into the horizontal and vertical directions and both terms are computed separately. The flux in the vertical direction is evaluated with a one-dimensional conservative semi-Lagrangian scheme, while the horizontal advection is calculated with a conventional non-conservative two-dimensional semi-Lagrangian scheme. The operational run of the JMA non-hydrostatic meso-scale model (MSM) was started on September 1, 2004 in place of the hydrostatic MSM. The new operational model is based on the Meteorological Research Institute (MRI) and Numerical Prediction Division (NPD) of JMA unified non-hydrostatic model (MRI/NPD NHM), with certain modifications made for efficient computation and improvement of quantitative precipitation prediction. The area of the operational meso-scale forecast covers Japan and its surrounding area with grid points of 361 x 289 at a horizontal resolution of 10 km on a Lambert conformal projection map (Figure 3). A terrain-following coordinate is employed in the vertical direction, and the top of the model atmosphere is located at 22 km altitude with 40 vertical layers. The non-hydrostatic MSM is initialized with the 4D-Var analysis based on the hydrostatic MSM and integrated up to 18 hours four times a day. Because the computation of the operational meso-scale forecast must be completed within 90 minutes of the initial time, the dynamics and physics of the non-hydrostatic MSM are optimized for computational efficiency. To shorten the computational time of the operational runs, a longer time step is required, provided computational instability does not occur; for this purpose, the split-explicit scheme has been improved to allow a 40-second time step at 10-km horizontal resolution. In addition, some of the microphysical processes are simplified to shorten the computational time without deteriorating the forecast characteristics, and a Lagrangian treatment of the falling of raindrops and graupel is combined with the microphysics to avoid breaking the Courant-Friedrichs-Lewy condition.
* The 4D-Var assimilation system of the global analysis and the semi-Lagrangian advection scheme of the global forecast were implemented in the operational NWP on February 17, 2005.
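To illustrate the kind of scheme referred to above, the sketch below performs one non-conservative semi-Lagrangian advection step in one dimension on a periodic grid: each grid point traces back along the wind to its departure point and the field is interpolated there. It is a conceptual toy, not the JMA formulation, and it uses simple linear interpolation and a single backward trajectory step.

    # A toy, non-conservative 1-D semi-Lagrangian advection step (not the JMA code):
    # each arrival grid point is traced back along the wind to its departure point,
    # and the advected field is linearly interpolated there. Because the departure
    # point may lie several cells away, the scheme remains stable for Courant
    # numbers larger than one, which is what permits the longer time steps.
    import numpy as np

    def semi_lagrangian_step(q, u, dx, dt):
        n = q.size
        x = np.arange(n) * dx
        x_dep = np.mod(x - u * dt, n * dx)           # departure points, periodic wrap
        i0 = np.floor(x_dep / dx).astype(int) % n    # grid point left of each departure point
        i1 = (i0 + 1) % n
        w = x_dep / dx - np.floor(x_dep / dx)        # linear interpolation weight
        return (1.0 - w) * q[i0] + w * q[i1]

    # advect a Gaussian bump with a Courant number of 2.5 (u*dt/dx = 2.5)
    n, dx, dt = 200, 1.0, 2.5
    q = np.exp(-0.5 * ((np.arange(n) - 50.0) / 5.0) ** 2)
    u = np.full(n, 1.0)
    for _ in range(40):
        q = semi_lagrangian_step(q, u, dx, dt)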
Medium-range ensemble forecast (9-day forecast) by GSM
- Incorporate semi-Lagrangian advection scheme: TL159 L40 M25 (in FY 2005)
- More members: TL159 L40 M51 (in FY 2005)
- Higher resolution: TL319 L60 M51 (in FY 2006)
Deterministic global forecast by GSM
- Higher resolution: TL959 L60 (in FY 2006)
Meso-scale analysis
- 4D-Var data assimilation system based on the non-hydrostatic MSM (in FY 2007)
Meso-scale forecast by non-hydrostatic MSM
- Higher resolution: horizontal grid spacing = 5 km, vertical layers = 50 (in FY 2005)
3.3. Parallelization
The HITACHI SR8000 has a three-level programming model:
1. Microprocessor: pseudo-vector processing (PVP),
2. Within a node: shared-memory parallelization by co-operative microprocessors in a single address space (COMPAS),
3. Between nodes: distributed-memory parallelization by MPI (or PVM, HPF).
The PVP feature enables JMA NWP codes optimized for the SR8000 to perform high-speed computation on vector supercomputers as well. COMPAS is similar to OpenMP, but intra-node parallelization by COMPAS is accomplished automatically by the compiler without the insertion of any compiler directives. The computation of the GSM in one time step consists of two phases: grid space and spectral space (Oikawa 2001). The computation in grid space is parallelized over the nodes in latitudinal bands assigned cyclically, and the computation in spectral space over a triangular array of spectral coefficients assigned dynamically to lessen the load imbalance. In the current operational runs, forty nodes of the HITACHI SR8000 model E1 are devoted to the non-hydrostatic MSM. To avoid idle time on the computational nodes while one node is writing data gathered from the other nodes by MPI communication, one node is reserved for the output procedures only and the rest are responsible for the computation, as schematically illustrated in Figure 4.
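The cyclic assignment of latitude bands mentioned above can be illustrated with a small cost model: if the work per latitude varies (here taken, purely for illustration, as proportional to the cosine of latitude), a cyclic assignment spreads expensive and cheap latitudes evenly over the nodes, while contiguous blocks leave some nodes with mostly expensive rows.

    # Toy illustration (not JMA code) of why cyclic assignment of latitude bands
    # balances load better than contiguous blocks when the cost per latitude varies.
    # The per-latitude cost model used here (proportional to cos(latitude)) is an
    # assumption purely for illustration.
    import numpy as np

    def block_assignment(n_lat, n_nodes):
        return {node: rows for node, rows in enumerate(np.array_split(np.arange(n_lat), n_nodes))}

    def cyclic_assignment(n_lat, n_nodes):
        return {node: np.arange(node, n_lat, n_nodes) for node in range(n_nodes)}

    def max_node_cost(assignment, cost):
        return max(cost[rows].sum() for rows in assignment.values())

    n_lat, n_nodes = 320, 40
    lat = np.linspace(-np.pi / 2, np.pi / 2, n_lat)
    cost = np.cos(lat)                                # assumed relative cost per latitude band

    print("block :", max_node_cost(block_assignment(n_lat, n_nodes), cost))   # most loaded node, block layout
    print("cyclic:", max_node_cost(cyclic_assignment(n_lat, n_nodes), cost))  # most loaded node, cyclic layout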
To lessen the load imbalance and reduce the amount of data transfer between nodes, especially in massively parallel computations, a two-dimensional domain decomposition was implemented in the non-hydrostatic MSM, as shown in Figure 5. The non-hydrostatic MSM is integrated for the domain and forecast time span mentioned above in about 23 minutes of calculation time, using both COMPAS for efficient parallel processing within a single node and the MPI library for communication between the processor nodes.
Figure 4. Parallelization: Output node (non-hydrostatic MSM).
Figure 5. Parallelization: Two-dimensional decomposition (non-hydrostatic MSM).
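The dedicated output node of Figure 4 can be sketched with MPI in a few lines; the example below uses mpi4py and toy data, and is a conceptual illustration of the pattern rather than the operational SR8000 code.

    # Sketch of the Figure 4 pattern with mpi4py (toy data, not the operational code):
    # rank 0 only gathers and writes output, so the compute ranks can proceed with
    # the next time steps while the write is in progress.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    OUTPUT_RANK = 0
    N_STEPS, OUTPUT_EVERY = 12, 3

    if rank == OUTPUT_RANK:
        for step in range(0, N_STEPS, OUTPUT_EVERY):
            pieces = comm.gather(None, root=OUTPUT_RANK)          # collect subdomains
            field = np.concatenate([p for p in pieces if p is not None])
            np.save(f"field_{step:04d}.npy", field)               # write while others compute
    else:
        local = np.full(100, float(rank))                         # this rank's subdomain
        for step in range(N_STEPS):
            local = local + 0.1                                   # stand-in for model dynamics
            if step % OUTPUT_EVERY == 0:
                comm.gather(local.copy(), root=OUTPUT_RANK)       # hand the data to the output rank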
4. Summary
The operational NWP of the JMA started in March 1959. Since March 2001, the JMA has been running global, regional, typhoon and meso-scale models on the HITACHI SR8000 model E1. Recently, the JMA implemented a non-hydrostatic model for the operational meso-scale forecast, and is going to implement a four-dimensional variational method for the global analysis and a semi-Lagrangian advection scheme for the global forecast in FY 2004. In the supercomputer procurement announced in April 2004, the HITACHI SR11000 was selected as the next-generation supercomputer. The JMA is planning to improve its analysis and forecast models on the new computer.
Reference
1. Y. Oikawa, Performance of parallelized forecast and analysis models at JMA, Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology, 2001, 53-61.
THE GRID: AN IT INFRASTRUCTURE FOR NOAA IN THE 21ST CENTURY MARK W. GOVETT, MIKE DONEY AND PAUL HYDER NOAA Forecast Systems Laboratory, 325 Broadway, Boulder, CO 80305 USA
Abstract
This paper highlights the need to build a grid-based infrastructure to meet the challenges facing NOAA in the 21st century. Given the enormous expected demands for data, and the increased size and density of observational systems, current systems will not be scalable for future needs without incurring enormous costs. NOAA's IT infrastructure is currently a set of independent systems that have been built up over time to support its programs and requirements. NOAA needs integrated systems capable of handling a huge increase in data volumes from the expected launches of GOES-R, NPOESS and new observing systems being proposed or developed, and capable of meeting the requirements of the Integrated Earth Observation System. Further, NOAA must continue moving toward integrated compute resources to reduce costs, to improve system utilization, to support new scientific challenges and to run and verify increasingly complex models using next-generation high-density data streams. Finally, NOAA needs a fast, well-managed network capable of meeting the needs of the organization: to efficiently distribute data to users, to provide secure access to IT resources, and to be sufficiently adaptable and scalable to meet unanticipated needs in the future.
1. Introduction
The NOAA Strategic Plan, titled "New Priorities for the 21st Century," highlights many significant and complex challenges facing NOAA. The strategic plan states: "NOAA's role is to predict environmental changes, protect life and property, provide decision makers with reliable scientific information, manage the Nation's living marine and coastal resources, and foster global environmental stewardship" [2003]. NOAA's mission is broad, ranging from managing coastal and marine resources to predicting changes in the Earth's environment. Data are central to this mission, including (1) building and maintaining observational platforms such as radars, satellite-based instruments, aircraft, profilers, weather balloons, ships, and buoys, (2) creating data products for environmental prediction, (3) disseminating the data to users, and (4) archiving the data so they can be used by research laboratories to develop new capabilities and forecast products. NOAA has made a significant investment in Information Technology (IT) for the computers, data storage and networks required to handle the processing, dissemination, archival, and management of its data. NOAA IT assets include three super-computing
centers maintained at the Forecast Systems Laboratory (FSL), the Geophysical Fluid Dynamics Laboratory (GFDL), and the National Centers for Environmental Prediction (NCEP). Compute resources also include thousands of support systems ranging from desktop systems to high-performance servers used for data processing, new product development and ongoing scientific research. NCDC, NGDC and NODC are NOAA's primary facilities for the long-term archival of data, but many other secondary facilities store data of interest to their local facilities for shorter periods. Mass storage requirements for short- and long-term data storage are significant. To effectively receive and disseminate these data, NOAA also relies heavily on a network composed of NOAAPORT, the Internet, and many dedicated point-to-point connections. NOAA's IT investment directly relates to the volume of data it receives, processes and disseminates. Data volume has grown tremendously over the last decade; data managers estimate NOAA will archive more data in 2004 than is contained in its entire archive through 1998. This volume of data handling requires large investments in facilities, hardware, software and staff. For example, NESDIS has begun construction on a $61 million facility designed to handle the ingesting and processing of high-volume next-generation satellite data. NCEP is currently operating under a $240 million, 9-year contract with IBM to provide supercomputing facilities for NWS operations. This contract includes computing, mass storage, and the housing and maintenance of a system that is composed of two virtual systems: an operational platform and a research and development platform. There will also be a backup system to the operational platform in FY05. The huge growth in data volume is expected to continue. In the next 10 years, data volume is expected to increase by more than 100 times over current levels. The number of weather satellites is expected to increase from a few dozen to more than 200, carrying 300 instruments, in the next 15 years. New satellite technology offers the most compelling glimpse into NOAA's ingest, dissemination and archival needs for the future. The National Polar-orbiting Operational Environmental Satellite System (NPOESS) program is a 6 billion dollar joint NOAA and DoD program that will begin launching a series of polar orbiters in 2009. The GOES-R satellites are slated to be online in 2012 and will record huge increases in data over existing GOES platforms. Effective use of these satellite data will depend on sufficient IT infrastructure to ingest, process, distribute and archive the data. This IT infrastructure is costly, both in terms of increasing the capacity of systems, storage, and networks, and in terms of the staff required to support the development and maintenance of hardware and software systems. Better coordination and integration of NOAA's IT assets could lead to significant cost reductions and better utilization of systems, software and staff.
The need to better coordinate NOAA's IT assets will grow as NOAA works to meet the expected demands of the proposed Integrated Earth Observation System. Since 2003, participants have been meeting to promote and design a comprehensive, coordinated, and sustained Earth observation system or systems among governments and the international community to understand and address global environmental and economic challenges (www.earthobservationsummit.gov). While the importance of developing such a system is clear, the task of building it is daunting. The size of the proposed system indicates it will need to be distributed across nations, and mechanisms will need to be developed to share these data effectively. NOAA's role in developing this system will be pivotal to its success. To respond to these changing requirements and mission goals, NOAA's IT infrastructure must become increasingly robust, responsive, flexible, efficient, secure, adaptable and cost-effective. Grid computing has emerged as a viable technology that is increasingly utilized by government laboratories, corporate institutions, and high-performance computing centers around the world. Grid computing is being used in diverse applications such as high-energy physics, medical imaging, meteorology, and business applications. Advancements in high-speed networks, the development of software to enable efficient, secure distributed computing, and the role of industry in supporting and promoting the Grid have moved grid computing from a fringe technology just a decade ago to an increasingly mainstream technology today. Grid computing is being explored at the Forecast Systems Laboratory as a means to provide better integration of NOAA's growing data, compute and network resources and to meet the expected needs of these and other projects in the future. Specifically, grid and web technologies can be used to (1) provide better utilization of NOAA's IT assets, (2) provide a more efficient way to ingest, disseminate and archive data for NOAA constituents and the general public, (3) reduce costly duplication in data archival and product generation systems and the network, (4) improve the utilization of data handled by NOAA, and (5) provide more efficient access to compute facilities and requisite storage facilities. A general description of grid computing is given in Section 2. Section 3 shows how grids can benefit NOAA and highlights the need to build a more robust, cost-effective NOAA network. A conclusion is given in Section 4.
2. Grid Computing
The modern concept of grid computing began in the mid-1990s when the term "the Grid" was coined to describe a computing infrastructure to support the large-scale resource sharing necessary for scientific research. Software development efforts began in earnest, and in 2000 the first of two foundational papers was written by Foster et al. to describe functional requirements for a grid computing infrastructure and the
development of the Globus software intended to meet those requirements. The first paper, titled "The Anatomy of the Grid", defined the essential properties of Grids to build Virtual Organizations (VOs) and proposed a set of protocols and services to provide access, coordination and sharing of distributed resources [Foster et al. 2001]. A VO is defined to be those distributed communities willing to share their resources in order to achieve common goals. The second paper, titled "The Physiology of the Grid", describes the interactions between grid components and defines a set of protocols required for interoperability between them [Foster et al. 2002]. The central tenet of these papers is to move toward an end goal of ubiquitous computing serving the needs of VOs, where computing is available on demand and users do not have to be concerned with where their tasks are running or where data reside. Industry support for grid computing has coalesced around Globus, a software toolkit designed to handle resource sharing and discovery, data movement, security, access and other operations required in distributed computing environments. Recently completed efforts to define grid software standards that align with standards from the World Wide Web, and Globus's adherence to these standards, have strengthened the interest and commitment of industry toward grid computing. With support from industry heavyweights such as IBM, HP, Microsoft, and Sun, the Globus toolkit and related grid standards represent a roadmap to generalized grid computing in the future. The most common analogy for grid computing is the electric power grid, a system of power generating plants connected together via transmission lines that supply power to the grid for use by consumers on demand. Instead of a system of individual community power plants, the development of a power grid has increased the availability and reliability of power. As a measure of its success, consumers no longer think in terms of how power gets to their homes; it is simply available to them whenever they need it. Key to the success of the power grid has been the development of and adherence to power standards (e.g. voltage, amps, cycles), which permit consumers to access and use the resources in the same way anywhere on the grid [Wladawsky-Berger 2004]. Similarly, grid computing, or simply "the grid", is most often described in terms of computer hardware: typically compute resources, data storage, and a high-speed network to link them together into a single IT entity. Grid resources can be located at a single site or distributed across the country or around the world. They can be confined to a single institution or include multiple collaborating organizations. They can be restricted to a simple network of desktop systems, or include multiple supercomputing sites, mass storage facilities, and dedicated high-speed wide-area networks. From the user perspective, the grid looks like a single large IT resource with a single sign-on or login required. Users obtain a grid account from their organization and log
Grid Forum (GGF). The Globus Alliance, responsible for the development and free support of Globus, has four core participants: Argonne National Laboratory, the University of Southern California, the University of Edinburgh, and UK e-Science. Industry support for Globus includes partners such as IBM, HP, Sun and Platform Computing, who have contributed to the development both of grid standards and of the software. Several companies also provide commercial support for Globus, including IBM, Platform Computing, and Data Synapse; other vendors have plans to add support for Globus in the near future. Oracle, for example, has assigned a large staff to build "grid enabled" database applications and to support the expected needs of the business and scientific communities. The GGF, represented by both industry and research groups, is primarily concerned with the development of Grid standards. These standards provide common application programming interfaces (APIs) upon which Globus is based, and permit others, including industry, to contribute to and support the toolkit. Globus is standards-based software that builds on three web service standards defined within the W3C (www.w3c.org) and other standards bodies: (1) the Web Services Definition Language (WSDL) provides a means to define a service, (2) Web Service Inspection (WS-Inspection) provides a set of conventions to discover data services, and (3) the Simple Object Access Protocol (SOAP) provides a means for service provider and service requestor to communicate. Grid extends these web services to include grid services and standards for security, resource management, resource discovery, and data access that are briefly described below.
2.1.1. Security
System security is critical everywhere, but it is particularly important in large, distributed and diverse grid environments. The grid standard that governs security is called the Grid Security Infrastructure (GSI). Four important requirements in grid security are single sign-on, authentication, delegation, and authorization. Single sign-on is the ability to authenticate on a single source host and to have that sign-on inherited by jobs running across the grid. Authentication verifies the user's identity; typically a user ID and password assigned on the multitude of local hosts is validated against the certificate as part of the grid job initialization. Delegation is the ability to permit another program to act on the user's behalf, in effect giving that program the ability to access all resources that the user is permitted to access. Authorization restricts users to the grid resources they are permitted to use; it is controlled by the individual sites, thus eliminating the need for a centrally managed security system. GSI addresses these requirements and provides secure communication between elements and security across organizational boundaries. The implementation of GSI is
based on public-key encryption, X.509 certificates, and the Secure Sockets Layer (SSL) protocol. A trusted third-party Certificate Authority (CA) is used to certify that the link between the public key and the user's identity is valid. 2.1.2. Resource Management
Globus includes a set of service components called the Globus Resource Allocation Manager (GRAM) that provides a single standard interface for managing resources on remote systems. Client commands are available for submitting and monitoring remote jobs. In addition, an API is provided that can be used to construct meta-schedulers. Meta-schedulers enable the scheduling of jobs across grid resources. They provide support for advance reservations, job prioritization, and co-allocation of resources at multiple sites. Examples include Cluster Resources' Silver, and meta-schedulers built using the Community Scheduler Framework developed by Platform Computing. 2.1.3. Resource Discovery A grid service extends the concept of a web service to provide information about the
state of a service. For example, users of applications will want to know whether a particular resource is available, its system characteristics (CPU type, system load, job queue length, job queue type, etc.), usage policies (e.g. only night-time access permitted), resource availability (e.g. data storage, network capacity), and process status (e.g. job state, CPU time used). The Grid Resource Information Service (GRIS) runs on grid resources and handles the registration of resource characteristics. The Grid Information Index Server (GIIS) is a user-accessible directory server that supports searching for a resource by specific resource characteristics. Hierarchies of directory services can also be built to permit local-level services (e.g. administrative, security management, load balancing) or to support expanded searches across multiple services. 2.1.4. Data Access Data access and file transfer are accomplished using a service called GridFTP. GridFTP extends the FTP standard to incorporate security protocols, support for partial file access, and high-speed parallel file transfers. Performance and reliability are critical issues when accessing large amounts of data, so GridFTP supports parallel data transfers that use multiple TCP streams between the source and destination, and striped data transfers to increase parallelism and to allow data objects to be interleaved across multiple hosts. Increasing the size of TCP windows can also improve the efficiency of large file transfers. GridFTP also includes mechanisms to improve the reliability of data transfers, provides mechanisms to restart data transfers, and offers fault recovery procedures to handle network and server failures. In addition, emerging standards for
Data Access and Integration (DAI) will provide techniques to integrate data from diverse sources including DBMSs, XML-formatted files and simple binary or flat files. 2.1.5. Grid Tools Equally important as the Globus middleware is the set of application-based tools that build on core grid capabilities. The Grid Tools Layer, illustrated in Figure 1, provides high-level tools, programming environments and web service capabilities to simplify the development of grid applications. Two primary languages in grid tools development are XML and Java: the eXtensible Markup Language (XML) provides a flexible, self-describing web-based data format that is queriable and used widely in dynamic web applications, while the Java language is particularly well suited for interacting with web-based applications and is heavily used. Using these and other web-based languages, a large, growing and active community of developers is building many open-source tools useful in developing grid applications. Two notable developments are the Commodity Grid (CoG) kits (www-unix.globus.org/cog) and GridSphere (www.gridsphere.org). CoG kits provide a language-based framework for grid projects; currently CoGs are available for Java, Python, CORBA, Perl, and Matlab. GridSphere uses both Java and XML to construct grid applications, called portlets, to simplify and speed the development of grid-based web applications.
2.2. Types of Grids There are three types of grids: compute grids, data grids and service grids. 2.2.1. Compute Grids Compute grids, the most familiar type of grid, are generally deployed by an organization and often within a single site. Grid standards development and adherence by Globus middleware have permitted more secure, general-purpose grids to be built that span sites and organizations. These generalized types of grids enable the running of applications that require access to large amounts of compute and data resources that would not be possible within a single site. Additionally, compute grids can be constructed to provide better utilization of existing systems. For example, most user desktop systems are utilized during work periods but rarely used at night or on weekends.
Grids can be constructed to utilize these cycles as a shared resource within a work group or VO. Compute grids can also be constructed between large compute facilities to distribute the workload more evenly between sites. System loads at supercomputing sites often relate to the technology refresh cycle: at the beginning of an HPCS procurement, computers are often under-utilized until researchers are able to fully exploit them, while at the end of the procurement cycle HPCSs are often over-utilized. Compute grids can help smooth out these differences by allowing users to obtain cycles from other sites prior to a new procurement and to export cycles to other sites after a technology refresh. 2.2.2. Data Grids
Data grids are used to seamlessly access and share data resources across remote systems. Typically, a thin abstraction layer is created between the data resident at each site and what the user or application sees. This allows users to more easily access and manage the data they require, and allows the organization to better manage its data sources. From the user or application perspective, data that may be distributed across multiple file systems and sites are viewed as a single large file system. Data grids can be constructed to permit read-only access to data, or they can be viewed as large read-write distributed file systems. Data grids are simpler to implement than compute grids since access can be more easily controlled and there are fewer security and resource allocation issues. As a result, many successful data grids are in use today. For example, the Earth System Grid (ESG) is a Department of Energy (DOE) project that permits researchers to share the results of large climate simulations among collaborators within the U.S. and international climate communities. This kind of structure allows users to share research results more fully, and to coordinate further research among collaborators. The ESG and other examples of data grids will be discussed in Section 2.4.
2.2.3. Service Grids Service grids build on the success of the World Wide Web and the emergence of e-business in a new area of IT called web services. The use of e-business and e-commerce applications has grown exponentially in the last decade because the enterprise community sees web services as a means to select delivery systems on the basis of price/performance and Quality of Service (QoS) requirements rather than being required to invest heavily in a single compute or software platform. Web services and the development of portable, platform-independent applications permit software re-use, componentization, and integration. Grid services extend the web services concept by providing task and resource management functions across heterogeneous computing environments.
2.3.1. The TeraGrid The TeraGrid project was launched by the National Science Foundation in August 2001 with $53 million in funding to four sites: the National Center for Supercomputing Applications (NCSA) at the University of Illinois, Urbana-Champaign, the San Diego Supercomputer Center (SDSC) at the University of California, San Diego, Argonne National Laboratory in Argonne, IL, and the Center for Advanced Computing Research (CACR) at the California Institute of Technology in Pasadena. TeraGrid is nearing the end of a multi-year effort to build and deploy the world's largest, most comprehensive, distributed infrastructure for open scientific research. These facilities are capable of managing and storing nearly one petabyte of data, provide high-resolution visualization environments, and offer toolkits for grid computing. Four new TeraGrid sites are expected to add more scientific instruments, large datasets, and additional computing power and storage capacity to the system. All the components are tightly integrated and connected through a fiber-optic network that operates at 40 gigabits per second
(www.teramid.org). 2.3.2. Earth System Grid ESG is a $15 million. 5 year DOE project to construct a data grid to address needs of the climate community. High-resolution, long-duration simulations performed with advanced DOE / NCAR climate models will produce tens of petabytes of output. To be useful, this output must be made available to global change impacts researchers nationwide, at national laboratories, universities, research laboratories, and other institutions around the world. To this end, the Earth System Grid, a virtual collaborative environment that links distributed centers, users, models, and data, has been created to provide scientists with virtual proximity to the distributed data and resources that they require to perform their research (www.earthsystemgrid.org).Initial focus of this project has been to build the necessary infrastructure required to build a data grid including data cataloging, metadata infrastructure, discovery capabilities and tools to provide data integration, sub-setting and sub-sampling of data streams. 2.3.3. LEAD LEAD is an $1 1 million. five-year multi-university NSF funded project to “create an integrated, scalable cyber-infrastructure for meso-scale meteorology research and education.” This project, led by the University of Oklahoma, proposes to develop an integrated, scalable framework for accessing, preparing, assimilating, predicting, managing, mining, analyzing, and displaying a broad array of meteorological and related information, independent of format and physical location. The focus of LEAD is to improve the capability to provide timely warnings of severe weather events by developing a dynamic computing and networking infrastructure required for on-demand
198
detection, simulation and prediction of high-impact local weather such as thunderstorms (www.lead.ou.edu). In a LEAD scenario, for example, prior to a model run, all data relevant to the forecast would be discovered and gathered into a high-resolution data stream that would be integrated into the model. LEAD is being designed to potentially control select local observing systems, such as local radars, which could be directed by and take special observations for grid-enabled applications.
3. Deploying Grids at NOAA The NOAA Administrator, Vice Admiral Conrad Lautenbacher, describes the end goal of NOAA’s data systems as a “fully wired, networked and integrated system that provides for data processing, distribution and archiving.” The system for the creation, dissemination. distribution and use of these data are complex and involves many facilities in NOAA. Figure 3 illustrates many of the facilities that handle NOAA’s data. The flow of data begins with the ingest of raw data from a host of instruments, remote sensors, and observing platforms including aircraft, satellites, radars, profilers. ships, and buoys. These raw data are used in weather forecast models, climate models, and for research applications. They are archived by NOAA’s data centers, and saved for shorter periods at the local facilities where research or development is performed. To support this data flow, storage is required to hold the data, compute resources are required to create the data, and a network is required to move data to applications and users.
NOAA’s IT irhastructure is currently a set of independent systems that have been built up over time to support its programs and requirements. Grid offers a capability to develop an integrated, coordinated, cost effective IT resource to meet the needs of the entire organization. This section will describe how compute, data and service grids can be effectively built to meet the many scientific and IT challenges facing NOAA in the 21‘‘ century. The successful implementation of these grids relies on a robust network infrastructure, so discussion will begin with an examination of the NOAA network.
term problem of rapidly increasing data volumes. Increasingly, the NWS and operations are using the Internet to access and distribute data. For example, the CRAFT program uses the Internet to distribute WSR-88D radar data to NWS WFOs, to NCDC for archiving, to Unidata for distribution to universities, and to other data users as required [Sandman, 2003]. The CRAFT program, now called the Integrated Radar Data Services (IRaDS) (www.radarservices.org), has demonstrated that reliable, secure data transport can be accomplished for the NWS via the Internet, thus reducing the need for high-cost dedicated point-to-point links and paving the way toward an integrated, high-speed network at NOAA.

3.1.2. National Lambda Rail

At the forefront of high-performance network technologies is the emergence of national-scale fiber optic networks designed for research environments. The National Lambda Rail (NLR) network (www.nationallambdarail.org) is a consortium that is building such a fiber network to meet the research and development requirements for NOAA. NLR will utilize the ubiquitous Ethernet protocol over Dense Wave Division Multiplexing (DWDM) to create a scalable national network that supports both research and production environments simultaneously. DWDM uses multiple wavelengths of light (lambdas) to segregate unique networks as desired. Initial bandwidth will be available in one- and ten-Gigabit Ethernet (GigE) segments, but is currently scalable to 40 x 10 GigE networks. The first segment of the NLR was "lit" in November 2003; the last segment of the first phase is scheduled for completion in December 2004 and will provide coast-to-coast lambda connectivity. NOAA is fortunate to already have excellent connectivity to the primary partners and community network partnerships called GigaPoPs that are driving early development of NLR. NOAA sites in Boulder, Seattle, Miami and Washington D.C. already have NLR connectivity through GigaPoPs. Additional NOAA sites, including Princeton (GFDL), are exploring this opportunity. By utilizing dark fiber, network cost savings over typical wide area network services are substantial. The attractive pricing, coupled with increased acceptance of the Internet by operational and research groups, makes the adoption of NLR by NOAA likely in the near future. This represents an important step toward building an efficient, scalable, distributed IT grid infrastructure at NOAA.

3.2. Compute Grids

Compute facilities are expensive to build and maintain, so being able to share cycles across the organization represents a more effective way to utilize IT resources. Compute grids within NOAA can be deployed at a local level or encompass computing facilities around the nation. Local compute grids can be deployed at a single site to make use of spare compute cycles that would not otherwise be utilized. These intra-site
3.2.1. Scientific Applications

NOAA's requirements for computing and data storage are growing rapidly in response to the increased flow and availability of data through its facilities to support (1) increasingly complex models, (2) new areas of research such as ensembles, data assimilation, and coupled models, and (3) the creation and integration of new high-resolution datasets coming from next-generation observational platforms. More complex models stem from higher model resolutions, more accurate physical parameterizations, and the availability of higher-density datasets. In the next decade, for example, it is likely that regional climate models will be run on a "meso" scale. Since a doubling in model resolution requires a factor of 8 increase in the number of computations (see the scaling sketch below), moves to higher resolutions will be contingent on having the necessary resources available to run these applications. Improvements in forecast accuracy are tied to better model initialization, data assimilation and model ensembles. Four-dimensional data assimilation techniques used to initialize models offer promising research results but are hindered by the lack of the large compute resources required to run them. Ensembles combine the results of several to tens of instances of a model (or models) and/or perturbed initial conditions to achieve more accurate forecasts. An ensemble prediction system based on the Weather Research and Forecasting (WRF) model will be the next operational model in 2004. Since ensemble member runs are independent of one another, it would be feasible to have tens or even hundreds of ensemble components running across a NOAA computational grid.

3.2.2. Verification

As models become more complex, the number of data sources increases, and data density from observing platforms grows, verification will play an increasingly important role in both model development and observing systems evaluation. Improving forecast accuracy and increasing the speed at which new modeling and data assimilation techniques are transferred into operations are central to the NWS mission. The increased role of verification in model development is being demonstrated in the test and evaluation of WRF. A WRF ensemble is slated to become the next national-scale operational weather forecast model for the NWS in October 2004. Prior to acceptance, WRF is undergoing an extensive series of tests identified in an NWS document called the WRF Test Plan [Siemens 2003]. Under this plan, a total of 1920 runs (idealized, platform, and retrospective) were run at the Air Force Weather Agency (AFWA) and FSL to compare two variants to the current operational model. Additional testing and verification at NCEP is planned for the WRF ensemble prior to operational acceptance.
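As a brief aside on the resolution-scaling claim in Section 3.2.1 above, the factor of eight can be seen with a simple scaling argument. This is a sketch only: it assumes that doubling the resolution halves the grid spacing in both horizontal dimensions and that the time step must be halved as well (as the CFL stability condition suggests), and it ignores any change in the vertical.

    % Cost scaling when horizontal grid spacing is halved (resolution doubled):
    % twice as many points in x, twice as many in y, and roughly twice as many
    % time steps, since the stable time step is tied to the grid spacing.
    \[
      \frac{C_{\text{new}}}{C_{\text{old}}}
      = \underbrace{2}_{x}\times\underbrace{2}_{y}\times\underbrace{2}_{\Delta t}
      = 8 .
    \]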
A second benefit of verification is its value in determining cost/benefit analyses of new or existing observing systems that cost NOAA hundreds of millions of dollars annually to build, deploy and maintain. For instance, data denial experiments can be run to quantify the value of existing observing systems. Simple scenarios, such as removing select RAOB observations, have been run to determine the effect on forecast accuracy [NAOS 2000]. More complex scenarios can be run to replace older expensive observing systems with a combination of multiple lower-cost alternatives. Verification can also be used to determine the value of deploying new or proposed observing systems. For example, the National Profiler Network (NPN) and National Radar Network (NRN) are slated for deployment in the next 5 years. Other systems such as the GPS-Met water vapor product may provide significant value if fully developed and deployed. Verification systems, such as the Real-Time Verification System (RTVS) [Mahoney, et al.], can be used to compare model results and to provide verification statistics that help management and budget analysts determine which observing systems to develop and deploy.
To obtain accurate and valid verification statistics, extensive testing must be conducted over long time periods and will require significant IT resources. For example, many variants have been proposed as candidates for the operational WRF that could be evaluated, resulting in tens of thousands of model runs, petabytes of data storage, and massive compute resources. Modest testing scenarios have been proposed for WRF in FY06 that will require 160% to 680% of FSL's HPCS [Hart 2003], a system that is currently the 17th fastest supercomputer in the world (www.top500.org). This type of systematic test and verification can and should be applied to other models important to NOAA; however, a grid infrastructure must be built to coordinate and harness the resources required to run the tests.

3.2.3. Compute Grids to Support VOs

Figure 6 illustrates some examples of compute grids that could be created to support VOs within NOAA. An NWS grid could be designed to provide compute capabilities using unused cycles in the WFOs in support of local modeling efforts. WFO collaborations with NCEP could be used to deliver additional high-resolution data streams and provide modeling support. A NESDIS product development VO could enable a richer collaboration between developers and users of new-generation satellite data products. A Developmental Testbed Center (DTC) grid could be built to bring together research and operational aspects of WRF model testing, sharing both test results and new modeling developments. Finally, an HPCS grid could provide better integration and utilization of a supercomputing test and development facility to tackle grand challenge problems important to NOAA.
The largest and most expensive data stream handled by NOAA is satellite data from GOES, POES and other remote-sensing platforms. These satellite data systems, operated by the NESDIS satellite operations facility, currently distribute over 4 TB/day for NOAA, NASA and DoD; their successful operation is vital to NOAA and the nation. In addition, numerous observational data are taken by NOAA, from RAOBs to precipitation gauges, stream flow data, and snow cover. These data are in different formats, and different QC procedures often apply. Further, new national observing systems including profiler, GPS water vapor, and radar are being proposed or slated to become available. These data represent numerous "stove pipe" developments that make it difficult to coordinate their use across scientific disciplines. The NOAA Observing System Architecture (NOSA), illustrated in Figure 7, is building a system to define, coordinate and integrate NOAA data sources; currently NOAA has found that it has 102 separate observing systems measuring 521 different environmental parameters. NOSA is billed as a program to both identify observing systems and provide a single integrated system to deliver information "on demand" to defense, commercial, civilian and private sectors at the national and international levels. So far, NOSA has focused on identifying NOAA's observing systems as a means to highlight gaps and duplication in current observing systems. However, delivery systems will need to be developed to locate, access and integrate data into the systems and for the users that require them. These systems will need to be distributed in nature, in order to account for data located in repositories around the nation and the world. Grid mechanisms for data access, discovery and integration can provide such a distributed delivery system, improve data utilization and address the future needs of NOAA.
3.3.1. Leveraging Existing NOAA Programs
Data grids can provide an "umbrella" technology to link existing NOAA data infrastructure into a more coherent, integrated system. Programs such as the NOAA Operational Model Archive and Distribution System (NOMADS) project and the Comprehensive Large Array-data Stewardship System (CLASS) are good examples of existing systems that can be used as a foundation for integrated data grids. NOMADS is a distributed data archive system that provides mechanisms to access and integrate model output and other data stored in distributed repositories. NOMADS enables the sharing and inter-comparing of model results and is a major collaborative effort, spanning multiple government agencies and academic institutions. The data available under the NOMADS framework include model input and Numerical Weather Prediction (NWP) gridded output from NCEP, and Global Climate Model (GCM) output and simulations from GFDL and other leading institutions around the world (www.ncdc.noaa.gov/oa/climate/nomads/nomads.html). CLASS is an online operational system that has been designed and funded to archive and distribute NOAA's observational data; by 2015 CLASS is expected to be archiving over 1,200 TeraBytes of data per year. Dual redundant CLASS systems provide access to observational data via the web, and the project plans to offer enhanced database management, search, order, browse and subsetting capabilities to users and applications (www.class.ncdc.noaa.gov). While the NOMADS and CLASS systems represent effective means to obtain observational data within NOAA, more robust mechanisms for the discovery, access, and integration of data will be required to effectively utilize all data across the organization and across scientific disciplines. In a data grid scheme, NOMADS and CLASS can be viewed as two of many potential data streams that also include local data sets (e.g. model runs, experimental datasets), proprietary data (e.g. airline data) and other data important to NOAA. These data can be made available via the grid using web services, discovery languages such as XML, and grid interfaces deployed by a growing number of commercial DBMS systems from Oracle and others. The significant benefits that data grids offer over existing systems stem from grids' underlying use of web technologies to permit dynamic discovery, access, and integration of distributed data. For example, web-accessible metadata catalogues can be constructed and used to identify data stored in the web-friendly XML format. Web and grid services can be used to dynamically locate metadata catalogues and deliver data to the end-user or application upon request. Current static methods used to obtain
data via CONDUIT, OpenDAP, and others require a priori knowledge of data sources and agreements between the data providers and users (client/server). To permit dynamic access and discovery of data, the development of generalized data catalogues that identify data sources, history, and information about the data is very important. Tremendous work within NOAA has already been done to catalogue data and create metadata mechanisms. This is a significant and essential effort necessary for providing generalized access to data. The ESG project, for example, has spent significant effort to define sufficient information to categorize climate data, while building grid mechanisms to access, integrate and subset the data. The LEAD project proposes to extend this work by exploring dynamic access and discovery of data in quasi-operational environments.
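To make the idea of a web-accessible metadata catalogue and a dynamic discovery query concrete, the following is a minimal sketch. The record fields, the example dataset, the URL and the search routine are all hypothetical and do not correspond to an actual NOAA, ESG or LEAD schema.

    # Minimal sketch of a metadata catalogue and a discovery query against it.
    # Field names, the example record and the search routine are illustrative only.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class MetadataRecord:
        dataset_id: str        # unique identifier of the dataset
        source: str            # originating system (model, instrument, ...)
        variables: List[str]   # geophysical parameters contained
        bbox: tuple            # (lon_min, lat_min, lon_max, lat_max)
        time_range: tuple      # (start, end) as ISO 8601 strings
        access_url: str        # service endpoint from which data can be pulled

    CATALOGUE = [
        MetadataRecord("eta-2004-10-25", "Eta model run", ["temperature", "wind"],
                       (-130.0, 20.0, -60.0, 55.0),
                       ("2004-10-25T00:00Z", "2004-10-26T00:00Z"),
                       "https://example.gov/data/eta-2004-10-25"),
    ]

    def discover(variable: str, lon: float, lat: float) -> List[MetadataRecord]:
        """Return catalogue records that contain `variable` and cover (lon, lat)."""
        hits = []
        for rec in CATALOGUE:
            lon_min, lat_min, lon_max, lat_max = rec.bbox
            if variable in rec.variables and lon_min <= lon <= lon_max \
                    and lat_min <= lat <= lat_max:
                hits.append(rec)
        return hits

    print([r.dataset_id for r in discover("temperature", -105.0, 40.0)])

A discovery service built this way can be queried at request time, rather than requiring the a priori provider/user agreements described above.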
3.3.2. New Data Storage Strategies

A fundamental problem with current data systems is an inherent inefficiency in how data are stored, accessed and used. Data delivery mechanisms such as CONDUIT and OpenDAP provide valuable means to distribute data and are useful in operational settings where data utilization may be high. However, in research and operational settings where data utilization may be low, storing local copies of data available elsewhere is not cost effective. Slow networks exacerbate the need to store local copies of data that are often never accessed. For example, FSL stores copies of operational Eta model runs that are also being stored at NCEP and NCDC and likely many other sites in NOAA. FSL saves over 200 Gbytes of its real-time data to its Mass Storage System daily, and only a fraction of it is ever requested. This is a typical pattern of usage mirrored across sites in NOAA. The cost of archiving this data mounts as tapes must be purchased, offsite storage must be located and maintained, older data must be migrated to newer storage media, and operator time is required to handle the data. With a faster network, and better methods to locate data, the requirement that data be stored locally is reduced. Ideally, data should be stored in a few places for redundancy, and obtained from these sites when required (on-demand). Finally, it may also be useful not to archive some data, but to store the recipe used to create it instead. Then, when the data is required, the relevant programs can be run to produce it.
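The "store the recipe, not the product" idea can be sketched as follows. This is illustrative only: the cache directory, the product name and the regeneration command are hypothetical.

    # Sketch of recipe-based retrieval: keep a small description of how a
    # product was made and rerun it on demand instead of archiving every copy.
    # Paths, product names and the regeneration command are hypothetical.
    import os
    import subprocess

    RECIPES = {
        # product name -> command line that recreates it
        "eta_precip_24h.nc": ["python", "make_precip_accum.py", "--hours", "24"],
    }

    def fetch(product: str, cache_dir: str = "/data/cache") -> str:
        """Return a path to `product`, regenerating it from its recipe if absent."""
        os.makedirs(cache_dir, exist_ok=True)
        path = os.path.join(cache_dir, product)
        if os.path.exists(path):
            return path                   # already materialised locally
        recipe = RECIPES[product]         # KeyError if no recipe is known
        subprocess.run(recipe + ["--out", path], check=True)
        return path

The trade-off is compute time versus storage and tape-handling cost; for rarely requested products, regenerating on demand can be the cheaper option.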
3.3.3. Potential Data Grids at NOAA

Figure 6 illustrates several data grids that could be built for VOs in NOAA to satisfy operational and research requirements. For example, a NESDIS grid could be built to integrate deep storage archival systems under a single service-oriented infrastructure.
Users requiring access to archival data could locate and access it when required using grid file transfer mechanisms. Another grid, the Integrated SATellite Data Grid (ISAT), could be similarly available to producers and consumers of these data, allowing new product developments at NESDIS to be more easily tested in next-generation models at NCEP. A prototype grid is already being used at NESDIS to test new products, but systems are currently limited to desktop PCs as stated in Section 3.2. Access to large storage facilities could provide a more efficient means to test new products required by new-generation high-resolution data streams. A Meso-scale Modeling Data (MMD) grid could be designed for the model development described in Section 3.2.2. Collaborators could share test results and access information about what tests have been run, thus reducing duplication.

3.4. Service Grids
NOAA continues to evolve into an organization that is increasingly run like a business: a responsive, flexible, service-oriented organization that can adapt to changing requirements and demands. Service grids can map directly onto functions such as delivery of data, generation of data products, and on-demand computing requirements for operational activities within NOAA. Figure 6 illustrates the tiered relationship between tasks available in a service grid infrastructure, the data and compute grids where they will run, and the hardware required to run them. Service grids represent the future of Grid, but will likely not be sufficiently mature for several more years. These grids are expected to amplify the separation between QoS (on-time delivery of products or data) and the hardware and software systems required to accomplish the tasks. The business community is already moving away from owning and operating its own IT infrastructures, choosing instead to contract these operations out. For example, the fastest growing and largest part of IBM is its Global Services sector. IBM Global Services provides both software and hardware infrastructure support for its customers. IBM offers service contracts to many customers including financial institutions, insurance companies and foreign governments. Contracts are often specified in terms of QoS (reliability, downtime), and Global Services designs an end-to-end system to meet those requirements. In NOAA, the NCEP HPC contract is structured in the same way. The NCEP HPCS system is owned by IBM and housed at IBM-controlled facilities. The NCEP contract with IBM states specific QoS metrics for performance and delivery of products that are critical to the mission of the NWS. IBM must meet those guarantees within the bounds of the agreement, but is free to upgrade or dedicate more or fewer resources as it sees fit.
This decoupling of IT from mission-critical requirements represents a trend that is growing in the commercial sector, and is becoming more common within NOAA. In the future, both hardware and software contracts could be let for compute, data, archival, and network services as a cost-savings measure. This would allow NOAA to focus more on core activities and less on the IT infrastructure required to accomplish its goals.

4. Conclusion
This paper highlights the need to build a grid infrastructure to meet the challenges facing NOAA in the 21st century. Given the enormous expected demands for data, and the increased size and density of observational systems, current systems will not scale to future needs without incurring enormous costs. NOAA needs integrated data systems capable of handling a huge increase in data volumes from the expected launches of GOES-R and NPOESS and from new observing systems being proposed or developed, and of meeting the requirements of the Integrated Earth Observation System. Further, NOAA must continue moving toward the use of shared compute resources in order to reduce costs and improve systems utilization, to support new scientific challenges and to run and verify increasingly complex models using next-generation high-density data streams. Finally, NOAA needs a fast, well-managed network capable of meeting the needs of the organization: to efficiently distribute data to users, to provide secure access to IT resources, and to be sufficiently adaptable and scalable to meet unanticipated needs in the future. Just as the NOAA Observing System Architecture (NOSA) was developed to define, coordinate and integrate data sources, NOAA needs to build an IT counterpart (NITSA) to define, coordinate and integrate its IT resources. We propose the construction of a grid infrastructure, centered around a fast network, which will permit resources including computers, data storage, software systems and services to be managed and shared more effectively. All three types of grids can be utilized at NOAA: compute grids to provide access to under-utilized cycles and to link supercomputing centers, data grids to promote sharing and better utilization of data, and service grids to provide the reliability and redundancy demanded by operational processes. Grid also provides robust mechanisms for data access, discovery and integration that can transform static methods of data distribution into dynamic, demand-driven delivery systems and thereby reduce the need to store redundant copies of data. NOAA currently spends hundreds of millions of dollars on observing systems, supercomputing facilities, data centers, and the network. The development of a NITSA, enabled by Grid, can potentially reap huge cost savings for the organization. We
described the value of building an efficient, scalable, secure and integrated network. NOAA cannot continue to upgrade select point-to-point network links; the entire network needs to be examined, and an integrated, managed, coordinated network resource must be created that is an asset to the entire organization. We also described the value of building a grid infrastructure to coordinate access to the IT resources necessary to handle increasing volumes of data, to develop and run next-generation models and to evaluate the cost/benefit of new and existing observing systems. Further, as NOAA stresses the increased importance of sharing IT resources in the interest of cost savings, code interoperability across HPC systems at GFDL, FSL and NCEP has become a critical requirement. It is not enough to simply achieve code portability, however; improved usability of these codes is also required. Grids provide a supportive distributed computing environment in which to run portable codes and simplify user access to shared system resources. Finally, we described the importance of VOs (Virtual Organizations) to improve collaborations between organizational units, to improve efficiency and to reduce costs. Grids support the creation of VOs, which can map IT resources directly into the cross-cutting programs identified by NOAA's Strategic Plan.
Acknowledgements

The authors wish to thank Dan Schaffer and Jacques Middlecoff from FSL for their valuable feedback on ideas and research discussed in this paper. Paul Hamer and Chris MacDermaid from FSL were helpful in describing NOAA's data systems and offering ideas about how NOAA might handle high data volumes in the future. Jerry Janssen from OAR provided valuable information on the NOAA network and administrative plans to upgrade it. We also thank NOAA's High Performance Computing and Communications (HPCC) for sponsoring the annual NOAA Tech conferences. This forum has been a valuable way to learn about new technologies and developments at NOAA. The NESDIS Data Users and GOES-R conferences underscored the many challenges facing NOAA in the next decade. Finally, we thank the developers of the many excellent NOAA web pages that provided valuable information about numerous exciting programs in the organization.
References

1. Atkinson, M., Chervenak, A., Kunszt, P., Narang, I., Paton, N., Pearson, D., Shoshani, A., and P. Watson, 2004, Data Access and Integration, The Grid 2: A Blueprint for a New Computing Infrastructure, pp. 391-430, Elsevier, ISBN: 1-55860-933-4.
2. Droegemeier, K.K., K. Kelleher, T. Crum, J.J. Levit, S.A. Del Greco, L. Miller, C. Sinclair, M. Benner, D.W. Fulker, and H. Edmon, 2002: Project CRAFT: A test bed for demonstrating the real time acquisition and archival of WSR-88D Level II data. Preprints, 18th Int. Conf. on Interactive Information Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, 13-17 January, Amer. Meteor. Soc., Orlando, Florida, 136-139.
3. Foster, I., C. Kesselman, J. Nick, S. Tuecke, The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, Open Grid Service Infrastructure Working Group, Global Grid Forum, June 22, 2002,
http://www.globus.org/research/papers.html#OGSA.
4. Foster, I., Kesselman, C., Tuecke, S., 2001, The Anatomy of the Grid: Enabling Scalable Virtual Organizations, International Journal of Supercomputer Applications. http://www.globus.org/research/papers/anatomy.pdf.
5. Guch, I., 2003: Harnessing the Spare Computing Power of NOAA Office PCs for Improved Satellite Data Processing and Technology Transition, NOAA Tech Conference, October 2003, http://www.noaatech2004.noaa.gov/abstracts/ab42guch.html.
6. Hart, L., Govett, M., Doney, M. and J. Middlecoff, 2003: A Discussion of the Role of the Boulder High Performance Computer, NOAA Internal Document to Carl Staton, 12/29/2003.
7. James, R., 2003, NOAA Public Key Infrastructure (PKI) Pilot Project, NOAA Tech Conference, 2004, http://www.noaatech2004.noaa.gov/abstracts/ab16james.html.
8. Janssen, J., 2003, The NOAA Network Baseline, Talk given at NOAA Tech 2004, http://www.noaatech2004.noaa.gov/abstracts/ab83janssen.html.
9. Mahoney, J.L., Judy K. Henderson, Barbara G. Brown, Joan E. Hart, Andrew Loughe, Christopher Fischer, and Beth Sigren, 2002: The Real-Time Verification System (RTVS) and its application to aviation weather forecasts. Preprints, 10th Conference on Aviation, Range, and Aerospace Meteorology, Amer. Met. Soc., 13-16 May, Portland, OR.
10. NAOS Report, 2000, Radiosonde/MDCRS Evaluations and Analyses, Findings, Conclusions, and Next Actions, Alexander MacDonald - chair of NAOS Council.
11. NOAA's FY2003-2008 Strategic Plan, 2003, http://www.spo.noaa.gov/stda.htm.
12. Rutledge, G.K., J. Alpert, R. J. Stouffer and B. Lawrence, "The NOAA Operational Model Archive and Distribution System (NOMADS)", 2002, Tenth ECMWF Workshop on the Use of High Performance Computing in Meteorology - Realizing TeraComputing, ECMWF, Reading, UK, November 2002,
http://www.ncdc.noaa.gov/oa/models/publications/nomads-ecmwf-2002.pdf.
13. Sandman, T., and K. Kelleher, 2003: Real Time Dissemination of WSR-88D Radar Data over Internet 2, NOAA Tech Conference, 2004, http://www.noaatech2004.noaa.gov/abstracts/ab45sandman.html.
14. Siemens, N., 2003: The NCEP WRF System Test Plan.
15. Tabb, L., 2003, Grid Computing in Financial Markets: Moving Beyond Compute-Intensive Applications, The Tabb Group, LLC, Westborough, MA.
16. Wladawsky-Berger, I., 2004, The Industrial Imperative, The Grid 2: A Blueprint for a New Computing Infrastructure, pp. 25-34, Elsevier, ISBN: 1-55860-933-4.
17. Wolfe, D. and S. Gutman, 2000, Development of the NOAA/ERL Ground-Based Water Vapor Demonstration Network: Design and Initial Results, J. Atmos. Ocean. Technol., 17, 426-440.
INTEGRATING DISTRIBUTED CLIMATE DATA RESOURCES: THE NERC DATAGRID

ANDREW WOOLF
e-Science Centre, CCLRC Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX, UK

BRYAN LAWRENCE (1), ROY LOWRY (2), KERSTIN KLEESE VAN DAM (3), RAY CRAMER (2), MARTA GUTIERREZ (1), SIVA KONDAPALLI (2), SUSAN LATHAM (1), KEVIN O'NEILL (3), AG STEPHENS (1)
(1) British Atmospheric Data Centre
(2) British Oceanographic Data Centre
(3) CCLRC e-Science Centre

The NERC DataGrid (NDG) is a UK e-Science project that will provide discovery of, and virtualised access to, a wide variety of climate and earth-system science data. We present an overview of key elements of the NDG architecture, beginning with the NDG metadata taxonomy for various information categories. Architecture components include: federation of discovery metadata exploiting the harvesting protocols of the Open Archives Initiative; a domain ontology to support navigation of data relationships during search and discovery; data storage transparency achieved through a wrapper mechanism and a data model based on emerging ISO standards for spatial data; and finally, a role-based authorisation framework which interfaces with existing access control infrastructures and supports mappings between credentials to establish networks of trust.
An essential pattern of Grid computing is the virtualisation of resources.
1. Introduction
There are increasing requirements for sophisticated data management infrastructures in the climate sciences. A number of leading climate modelling activities (e.g. IPCC [1], climateprediction.net [2]) require access to and analysis of widely distributed, high-volume datasets. Additionally, such projects involve verification studies, which compare model output with observational data from a range of sources and in a variety of formats. A consumer of HPC climate simulation products is faced with three issues:
1. Data discovery. Locating suitable data is the first step in the exploitation pipeline. Data must be catalogued, and catalogues searchable in a manner that maximises dataset exposure. Federating catalogues - either through
harvesting [3] or distributed searches [4] - is an essential element. Effective federation relies on standard metadata schemas against which to query. Metadata is thus seen to provide the structures for successful data discovery.
2. Data access. Metadata typically provides a pointer to the location of data, either physical or digital. A consumer of climate data must then obtain access to it. Just as a book is delivered by a library to a remote customer through the postal service, so a network service is used to deliver digital climate data. And, as in the postal world, different classes of service are appropriate for different types of data product - some specialised to high-speed delivery, some to large volume; some provide value-added functionality (e.g. offering an extract of a book or dataset). Access restrictions may require proof of identity for delivery.
3. Data use. Having discovered and obtained access to data, a consumer has to be able to use it. The large range of data formats in use raises a formidable challenge for climate data. An emerging paradigm focuses on characterising the semantics and structure of data (through formal data models) rather than file formats - 'content' rather than 'container'. Conventions may be established [5] for serialisation into different file formats, but primacy lies with the a priori data model.
The distributed nature of climate data resources is fundamental. For example, access to observational data is essential for assimilation models and verification or process studies. Observation data are, by their nature, distributed globally (see, for instance, the national and world data centres of UNESCO's International Oceanographic Data and Information Exchange, IODE). Similarly, model intercomparison studies (e.g. IPCC, AMIP) compare large-scale HPC outputs from various climate model runs - usually archived around the globe at host institutions or national data centres. In the past, network bandwidth has placed severe constraints on the practicality of accessing large-scale remote HPC products. However, recent increases afforded by the evolution of optical technology are reducing distribution costs relative to costs of production (Figure 1). Rapidly increasing network bandwidths are a significant driver for so-called 'Grid computing' [6,7]. For example, GridFTP [8,9] is an extension to the standard FTP protocol tuned for high performance over large-capacity networks. It uses parallel TCP streams, TCP buffer/window tuning, and striped/partial file transfers to achieve high end-to-end data transfer performance.
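The benefit of parallel streams can be illustrated with a generic sketch. This is not the GridFTP protocol itself: it simply fetches byte ranges of one file over several concurrent HTTP connections, assumes the server reports Content-Length and honours Range requests, and uses a hypothetical URL.

    # Illustrative parallel ranged download -- the same idea (several concurrent
    # streams each moving part of one file) that GridFTP applies to FTP.
    # Not GridFTP; assumes an HTTP server that supports HEAD and Range requests.
    import concurrent.futures
    import urllib.request

    URL = "https://example.org/big_model_output.nc"   # hypothetical dataset
    N_STREAMS = 4

    def total_size(url: str) -> int:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            return int(resp.headers["Content-Length"])

    def fetch_range(url: str, start: int, end: int) -> bytes:
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    def parallel_get(url: str, streams: int = N_STREAMS) -> bytes:
        size = total_size(url)
        chunk = size // streams
        ranges = [(i * chunk, size - 1 if i == streams - 1 else (i + 1) * chunk - 1)
                  for i in range(streams)]
        with concurrent.futures.ThreadPoolExecutor(max_workers=streams) as pool:
            parts = list(pool.map(lambda r: fetch_range(url, *r), ranges))
        return b"".join(parts)

Each stream keeps its own TCP connection busy, so aggregate throughput over a high-capacity, high-latency link is less limited by any single connection's window.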
The spectrum of activities within the NDG includes:
• Search and discovery: searching over discovery metadata and browsing of detailed metadata
• Data browse and delivery: browsing of dataset structures, selection of data subsets, and delivery of data
• Workspace management: provision of resources for a Workspace and interaction of a User with Workspace facilities
• Metadata management: updating of metadata and datasets, harvesting of discovery metadata, and registration of Data Providers and Discovery Services
• User administration: logging in and out of the NDG, and assignment and retrieval of security credentials
A number of policies govern NDG activities on matters such as security (authentication, authorisation and accounting), resource usage, and quality of service.

2.2. The NDG Information viewpoint
The major information components in the NDG are the metadata infrastructure (Section Three) and the data model (Section Four). These divisions conform with the 'Domain Reference Model' [13] of ISO Technical Committee 211 for geographic information, shown in Figure 4. This sets out a high-level view of the structures required for interoperability of geographic information (including in distributed computing environments). It underlies an entire series of emerging standards for geographic data, metadata and services (the 'ISO 19xxx' series of standards). Conformance to the abstract Domain Reference Model is a prerequisite for standards-compliant distributed data infrastructures. Core to the Domain Reference Model is a geospatial Dataset. Logical datasets in the NDG are configured by individual Data Providers. They may be based around defined activities (e.g. the 'ECMWF ERA-40' reanalysis or the 'ACSOE' campaign measurements), or particular instruments (e.g. measurements from the 'Chilbolton radar'), or in any other manner deemed appropriate by the Data Provider. The question of what constitutes a dataset is an important one, and usually non-trivial. A dataset contains Feature instances and related objects (a feature is a semantically meaningful abstraction of a specific data type - for instance a snapshot of an HPC model field, or an atmospheric temperature sounding).
Figure 4: 'Domain Reference Model' underlying ISO standards for geographic information.
The logical structure and semantic content of a dataset is described by an Application schema [14]. It defines the feature types that may appear. These are defined for the NDG in a data model named the Climate Science Modelling Language (CSML) [15], described in Section Four below. A Metadata dataset provides metadata related to a dataset. It includes the necessary information to support access to, and transfer of, the dataset. A standardised schema [16] provides metadata element definitions for dataset
description, topic, point of contact, quality, etc., and may include references to an application schema for the dataset. Metadata in the NDG is generated as a by-product of a sophisticated conceptual meta-model (or ontology) for datasets and the relationships between them, described in the following Section Three. Finally, the ISO Domain Reference Model includes Geographic Information Services that operate on the dataset [17]. A broad taxonomy of services is imagined - for human interaction (e.g. viewing data), information and model management, workflow, general geo-processing, annotation, transfer, etc. The NDG is being built fundamentally as a service-oriented architecture to provide a basis for service chaining.

3. Metadata

As mentioned above, a detailed conceptual model underlies the metadata infrastructure in the NDG. The approach is motivated by a use-case supporting sophisticated metadata 'browsing' identified in the NDG architecture. A user initially locates a dataset of interest through a search against an NDG Discovery Service metadata catalogue. The user should then be able to navigate to related datasets - those containing data from the same instrument(s)/model(s), or collected at the same observation station, or generated through the same activity (research project, funded thematic program, etc.). To support this kind of functionality, a taxonomy of different classes of metadata has been identified [18,19] (Figure 5):
• [A]rchive: Usage metadata generated from (or about) the data. Normally generated directly from internal metadata. This is equivalent to the 'Application schema' of the ISO Domain Reference Model discussed above.
• [B]rowse: Generic complete metadata, semantic, including a summary of syntactic (S), not including discipline-specific (E).
• [C]omment: Metadata generated to describe both documentation and annotations (as opposed to binary data).
• [D]iscovery: Metadata suitable for harvesting; the 'Metadata dataset' of the ISO Domain Reference Model.
• [E]xtra: Additional metadata, discipline-specific.
• [S]ummary: Summary metadata (overlap between A and D).
Figure 5: Metadata taxonomy identified in the NDG. (The figure additionally notes that 'D' discovery metadata is probably based on Dublin Core and GEO and is a subset of B or C, and includes 'Q': a schema which defines the supported queries upon A, B, C and D.)
In the NDG, 'D' metadata is harvested from Data Providers into a search catalogue by a Discovery Service, while 'B' metadata captures the relationships between datasets to enable the metadata browsing mentioned above. 'A' metadata is the data model described in Section Four. We consider first the 'B' and 'D' metadata in more detail.

3.1. 'B' metadata

'B' metadata [19,20] represents the concepts and relationships required for the metadata browse scenario, and may be considered an ontology. The conceptual model is shown in Figure 6. The major entities are:
• Activities: these range in scope from entire programmes of work down to individual data gathering exercises, such as flights or cruises.
• Data Production Tools: these are broadly classified into Instruments and Models.
latter. Migration towards the standard ISO 19115 metadata will take place as schemas stabilise. Figure 7 illustrates a conceptual 'metadata flow' in the NDG. Usage metadata ('A') is summarised into the core metadata ('B'), which may be supplemented with ancillary ('C') metadata (annotations, external documentation, etc.). 'B' metadata may be transformed into discovery ('D') metadata in any of a number of formats.

4. Data model
The previous section describes the conceptual models and metadata schemas used in the NDG for data discovery and metadata browsing - those related to the 'Metadata dataset' of the ISO Domain Reference Model. We now consider the 'Dataset' itself and describe the 'Application schema' that defines its logical structure and semantic content (also called 'A' metadata in the taxonomy above). As discussed earlier, a dataset contains feature instances and related information. Features (which may be types or instances) are semantically relevant abstractions of real-world phenomena - gridded model output fields, atmospheric soundings, marine salinity sections, etc. They offer the possibility of constructing a semantically rich abstraction layer over underlying databases (filestores or relational). Feature types - once defined at a conceptual level - may be catalogued for re-use in a repository ("feature-type catalogue") [22,23]. This provides a mechanism to set up a shared understanding of semantics within communities of practice. An application schema describes the content and structure of a dataset in terms of feature instances and related information such as coordinate reference system definitions, dictionaries, etc. In the NDG, the application schema therefore serves two purposes: first, it describes the semantic content of datasets; second, it provides a 'wrapper' for data storage. File-based data are cast onto, and exposed as, feature instances within logical datasets (Figure 8). Access services may be constructed over this abstraction layer, and the MxN mapping reduces to an M+N problem. Feature types may be defined across a wide spectrum of typing granularity. For instance, an abstract 'Sounding' class could be defined, qualified with an instrument attribute of 'sonde' and parameter attribute of 'temperature' for a balloon-launched radiosonde profile of atmospheric temperature. Similarly, the same measurement could be represented with a more specialised 'SondeProfile' feature type having a 'temperature' attribute, or a highly specialised 'AtmosphericTemperatureSounding' feature type could be defined. In practice, the governance structures available to support feature type definitions constrain
the typing abstractness [24] - a large number of strongly-typed features may be defined by an organisation with sufficient interest in, or remit for, maintaining ownership of those definitions. A standards-based semantic approach to data integration based on feature types is very new in the climate sciences, and no existing feature-type catalogues or definitions were available to the NDG. Hence, the curated archives of BADC and BODC were inspected for the purpose of defining a small number of feature types having general utility (i.e. weakly-typed) across a range of climate science data types. The resulting feature types are listed in Table 1, and the resulting application schema is called the 'Climate Science Modelling Language' (CSML) [15]. A principle used in the feature type construction is that semantics are offloaded onto physical parameter type and underlying coordinate-reference systems - no distinction between feature instances is made on the basis of these properties. In addition, the notion of 'sensible plotting' was used as a discriminant - feature types should be sufficiently specialised to enable (in principle) unsupervised plotting of a feature instance in the conventional manner.
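A minimal sketch of the feature-type idea as it might look in code follows. It is illustrative only: CSML itself is defined as an XML application schema, and the class and attribute names below are hypothetical. The point is that access services are written once against a small set of weakly-typed features, not against individual file formats.

    # Sketch of a weakly-typed feature abstraction over heterogeneous storage.
    # Class and attribute names are hypothetical, not the CSML schema itself.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class GridSeriesFeature:
        """Timeseries of gridded parameter fields (e.g. NWP or OGCM output)."""
        parameter: str                      # physical parameter name
        crs: str                            # coordinate reference system identifier
        times: List[str]                    # time coordinates (ISO 8601)
        reader: Callable[[str], object]     # returns the gridded field for one time

    def subset(feature: GridSeriesFeature, t0: str, t1: str) -> List[object]:
        """An access service written against the feature interface:
        it works regardless of how the underlying files are stored."""
        return [feature.reader(t) for t in feature.times if t0 <= t <= t1]

With M storage formats wrapped as features and N access services written against the feature interface, only M wrappers plus N services need to be maintained, rather than MxN format-specific pairings.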
Figure 8: Application schema provides abstraction layer between storage and access services.
Table 1: Feature types defined in the Climate Science Modelling Language.

CSML feature type     | Description                                                          | Examples
TrajectoryFeature     | Discrete path in time and space of a platform or instrument.        | ship's cruise track, aircraft's flight path
PointFeature          | Single point measurement.                                            | raingauge measurement
ProfileFeature        | Single 'profile' of some parameter along a directed line in space.  | wind sounding, XBT, CTD, radiosonde
GridFeature           | Single time-snapshot of a gridded field.                             | gridded analysis field
PointSeriesFeature    | Series of single datum measurements.                                 | tidegauge, rainfall timeseries
ProfileSeriesFeature  | Series of profile-type measurements.                                 | vertical or scanning radar, shipborne ADCP, thermistor chain timeseries
GridSeriesFeature     | Timeseries of gridded parameter fields.                              | numerical weather prediction model, ocean general circulation model
4.1. Standards
Emerging international standards for geographic information (those of ISO Technical Committee 211) lie behind much of the approach adopted in the NDG for data integration [25]. The specifications being developed by the Open Geospatial Consortium (OGC) are also relevant to the NDG. These include web service interfaces for online rendering (Web Map Service) and retrieval (Web Feature Service, Web Coverage Service) of data.

5. Security
The NDG consists fundamentally of a federation of Data Providers agreeing to share common interfaces for data discovery, access, and use. It integrates a number of heterogeneous legacy infrastructures and client communities. This pattern is often referred to as a 'virtual organisation' (VO) in the Grid literature [26]. Since the VO crosses institutional and policy boundaries, considerable effort needs to be spent on ensuring integrity of access restrictions and resource usage in a manner that is hidden from the user. At the same time, it is impractical to re-engineer the existing role-based security infrastructures at each participating Data Provider.
It is also not feasible in the NDG to define a single set of authorisation roles across the VO, as is done in some authorisation systems such as VOMS [27] and PERMIS [28]. Each Data Provider in the NDG assigns to their users a particular set of access control roles that may not correspond to those in use anywhere else. While the integration afforded by the NDG may drive security roles towards convergence in the longer term, in the short term there remain differences. The authorisation framework being developed by the NDG [21] distinguishes between roles assigned to a user and roles required to access data or metadata. Security roles are assigned by an Attribute Authority, while access is granted or denied by a Data Provider. In the usual case, an NDG participant will fulfil both these enterprise roles (e.g. BADC is both an Attribute Authority assigning security roles to its users, and a Data Provider controlling access to resources). However, the need arises in general to decide access to a resource for a user who holds security roles assigned by a different Attribute Authority (e.g. BODC). The NDG security framework allows Data Providers to define mappings between roles for the purpose of deciding access. For instance, a BADC dataset may be available to users who hold a 'staff' role issued by the BADC Attribute Authority. Ordinarily, a request for this data would be denied to a user holding only a 'NERC employee' role granted by the BODC Attribute Authority. However, the BADC Data Provider may define a mapping from the BODC 'NERC employee' role to the BADC 'staff' role, and therefore grant access in this case. In addition, if the Data Provider happens also to be the relevant Attribute Authority (as in this case), they may choose to assign to the user the mapped role for a limited time period. This avoids repeated mappings having to be evaluated. Mapping between roles is based on a trust evaluation by a Data Provider of an Attribute Authority - in the example, BADC examines the policies used by BODC in assigning the 'NERC employee' role and deems them sufficient for holders of that role to be regarded, for its purposes, as holding the BADC 'staff' role. Figure 9 illustrates the authorisation sequence in practice. If an initial request for data is not accompanied by appropriate security credentials, a Data Provider replies with a list of those trusted Attribute Authorities ('Trusted Hosts') granting roles deemed sufficient for access to be granted (either directly, or through the above mapping process). After retrieving credentials from relevant Attribute Authorities, a user may again request access.
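The role-mapping access decision described above can be sketched as follows. This is a simplified illustration: real NDG credentials are signed attribute certificates exchanged via web services, not plain strings, and only the roles and provider names from the example in the text are used.

    # Sketch of a Data Provider access decision with role mapping between
    # Attribute Authorities. Simplified: credentials are plain tuples here.
    from typing import Set, Tuple

    REQUIRED_ROLES = {"badc-restricted-dataset": {("BADC", "staff")}}

    # Data Provider's trust mapping: (foreign AA, foreign role) -> local role
    ROLE_MAP = {("BODC", "NERC employee"): ("BADC", "staff")}

    def access_granted(dataset: str, user_roles: Set[Tuple[str, str]]) -> bool:
        """Grant access if the user holds a required role directly, or a
        foreign role that the Data Provider has mapped onto one."""
        required = REQUIRED_ROLES[dataset]          # KeyError if dataset unknown
        effective = set(user_roles)
        effective |= {ROLE_MAP[r] for r in user_roles if r in ROLE_MAP}
        return bool(effective & required)

    # A BODC user holding only ('BODC', 'NERC employee') is granted access
    # because of the mapping defined by the BADC Data Provider.
    print(access_granted("badc-restricted-dataset", {("BODC", "NERC employee")}))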
Figure 9: Sequence diagram for NDG authorisation procedure (GateKeeper implements the RFC 3281 Attribute Authority).
Conventional public key infrastructure is used for the implementation of NDG security. A user is bound to their identity with an X.509 public key certificate (issued by a Certificate Authority), and bound to their security roles with an XML-based attribute certificate (issued by an Attribute Authority).
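As a rough data-structure view of the two bindings (identity and roles), kept deliberately schematic: the field names below are illustrative only, since real credentials are signed X.509 / RFC 3281-style structures rather than Python objects.

    # Schematic view of the two credentials used in NDG security.
    # Field names are illustrative; real credentials are signed certificates.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class IdentityCertificate:          # X.509 public key certificate
        subject: str                    # the user's distinguished name
        issuer: str                     # the Certificate Authority
        public_key: bytes

    @dataclass
    class AttributeCertificate:         # XML-based attribute certificate
        holder: str                     # subject of the identity certificate
        issuer: str                     # the Attribute Authority (e.g. BADC)
        roles: List[str]                # security roles, e.g. ["staff"]
        not_after: str                  # end of validity period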
6. Conclusions

The need to access a wide variety of distributed data, together with rapidly increasing network bandwidths, is driving the development of new technological approaches to data management in the climate sciences. Large distributed-data integration activities are developing novel approaches for data discovery, access and use. The UK project NERC DataGrid is one such activity, aimed at developing uniform interfaces to a range of environmental data in the UK and beyond. The project is focussed initially on the curated archives of the British Atmospheric Data Centre and British Oceanographic Data Centre. An overall architecture has been constructed, identifying a number of roles, activities and policies that scope the NDG requirements. From an information perspective, the NDG is conformant to the abstract reference model underlying an important emerging series of international standards for geographic information. Key information divisions within the ISO reference model, and in the NDG, are between data and metadata. The NDG has identified a metadata taxonomy that supports a range of activities including 'discovery' and 'browsing' between datasets. A detailed conceptual model (ontology) lies behind the core NDG metadata schema, and enables the export of discovery-level metadata in any of a number of formats. Scalable, robust metadata federation in the NDG is based on harvesting using the OAI Protocol for Metadata Harvesting. For data integration, a semantic standards-based data model - the Climate Science Modelling Language - has been developed. This characterises a number of weakly-typed data classes ('feature types') and defines the structure and content of the NDG logical datasets. Security requirements in the NDG have been met by designing a novel authorisation framework. This enables legacy access control infrastructures to be maintained by supporting mapping between access roles assigned by different authorities.
Acknowledgements

The NERC DataGrid is funded under the UK e-Science programme through grant NER/T/S/2002/00091 from the Natural Environment Research Council.
References

1. Intergovernmental Panel on Climate Change, http://www.ipcc.ch/
2. Stainforth, D.A., et al. (2005), "Uncertainty in predictions of the climate response to rising levels of greenhouse gases", Nature, 433, 403-406, 27 Jan 2005. See also http://www.climateprediction.net.
3. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
4. ISO 23950:1998, Information and documentation - Information retrieval (Z39.50) - Application service definition and protocol specification.
5. NetCDF Climate and Forecast (CF) Metadata Convention,
http://www.cgd.ucar.edu/cms/eaton/cf-metadata/
6. Foster, I. and Kesselman, C. eds. (2004), The Grid: Blueprint for a New Computing Infrastructure, 2e, Morgan Kaufmann.
7. Berman, F., Fox, G. and Hey, A.J.G. eds. (2003), Grid computing - making the global infrastructure a reality, John Wiley and Sons Ltd.
8. Allcock, W. (2003), GridFTP: Protocol Extensions to FTP for the Grid, Global Grid Forum GFD-R-P.020.
9. B. Allcock, I. Foster, V. Nefedova, A. Chervenak, E. Deelman, C. Kesselman, J. Leigh, A. Sim, A. Shoshani, B. Drach, D. Williams (2001), High-Performance Remote Access to Climate Simulation Data: A Challenge Problem for Data Grid Technologies, SC 2001.
10. ISO/IEC 10746-1:1998, Information technology - Open Distributed Processing - Reference model: Overview.
11. Putman, J.R., Architecting with RM-ODP (2001), Prentice Hall PTR, New Jersey.
12. Woolf, A., R. Cramer, M. Gutierrez, K. Kleese van Dam, S. Kondapalli, S. Latham, B. Lawrence, R. Lowry and K. O'Neill (2004), "Enterprise specification of the NERC DataGrid", Proceedings of UK e-Science All Hands Meeting 2004, Nottingham, UK, ISBN 1-904425-21-6.
13. ISO 19101, Geographic information - Reference model.
14. ISO 19109, Geographic information - Rules for application schema.
15. Woolf, A., B. Lawrence, R. Lowry, K. Kleese van Dam, R. Cramer, M. Gutierrez, S. Kondapalli, S. Latham, K. O'Neill and A. Stephens (2005), "Climate Science Modelling Language: standards-based markup for metocean data", Proceedings of 85th meeting of the American Meteorological Society, San Diego, USA.
16. ISO 19115, Geographic information - Metadata.
17. ISO 19119, Geographic information - Services.
18. Lawrence, B., R. Cramer, M. Gutierrez, K. Kleese van Dam, S. Kondapalli, S. Latham, R. Lowry, K. O'Neill and A. Woolf (2003), "The NERC DataGrid Prototype", Proceedings of UK e-Science All Hands Meeting 2003, Nottingham, UK, ISBN 1-904425-11-9.
19. O'Neill, K., R. Cramer, M. Gutierrez, K. Kleese van Dam, S. Kondapalli, S. Latham, B. Lawrence, R. Lowry and A. Woolf (2003), "The Metadata
Model of the NERC DataGrid", Proceedings of UK e-Science All Hands Meeting 2003, Nottingham, UK, ISBN 1-904425-11-9.
20. O'Neill, K., R. Cramer, M. Gutierrez, K. Kleese van Dam, S. Kondapalli, S. Latham, B. Lawrence, R. Lowry and A. Woolf (2004), "A specialised metadata approach to discovery and use of data in the NERC DataGrid", Proceedings of UK e-Science All Hands Meeting 2004, Nottingham, UK, ISBN 1-904425-21-6.
21. Lawrence, B., R. Cramer, M. Gutierrez, K. Kleese van Dam, S. Kondapalli, S. Latham, R. Lowry, K. O'Neill, and A. Woolf (2004), "The NERC DataGrid: 'Googling' Secure Data", Proceedings of UK e-Science All Hands Meeting 2004, Nottingham, UK, ISBN 1-904425-21-6.
22. ISO 19110, Geographic information - Methodology for feature cataloguing.
23. ISO 19135, Geographic information - Procedures for registration of geographical information items.
24. Atkinson, R., et al. (2004), "Next steps to interoperability - Mechanisms for semantic interoperability", EOGEO Workshop, University College London, 23-25 June 2004.
25. Woolf, A., B. Lawrence, R. Lowry, K. Kleese van Dam, R. Cramer, M. Gutierrez, S. Kondapalli, S. Latham, and K. O'Neill (2005), "Standards-based data interoperability for the climate sciences", Met. Apps. (in press).
26. Foster, I., et al. (2001), "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", Int. J. HPC Apps. 15(3), 2001.
27. Alfieri, R., et al. (2003), "VOMS, an Authorization System for Virtual Organizations", 1st European Across Grids Conference, Santiago de Compostela, February 13-14, 2003.
28. Chadwick, D.W., A. Otenko (2002), "The PERMIS X.509 Role Based Privilege Management Infrastructure", Future Generation Computer Systems, 936 (2002) 1-13, December 2002, Elsevier Science BV. See also http://www.permis.org/
TASK GEOMETRY FOR COMMODITY LINUX CLUSTERS AND GRIDS: A SOLUTION FOR TOPOLOGY-AWARE LOAD BALANCING OF SYNCHRONOUSLY COUPLED, ASYMMETRIC ATMOSPHERIC MODELS
I. LUMB* AND B. MCMILLAN
Platform Computing Inc., 3760 14th Avenue, Markham, Ontario L3R 3T7, Canada
E-mail: {ilumb, bmcmillan}@platform.com
* Formerly with Platform Computing Inc.
M. PAGE AND G. CARR
National Center for Atmospheric Research, P.O. Box 3000, Boulder, CO 80307-3000, USA
E-mail: {mpage, gcarr}@ucar.edu
State-of-the-art atmospheric models comprise multiple components that interact synchronously. Because some components need to expend more computational effort than others, it is often the case that these components are computationally imbalanced. Although parallelism within a component can reduce the imbalance, there is still a need to coordinate component interaction via a coupler in real time. To address this issue, NCAR identified requirements for Task Geometry, a construct that specifies, for the purpose of coordination, the run-time topology within and between components. Although it has recently become generically available in Platform LSF HPC, the current focus is commodity architectures based on the Linux operating environment. Crafted originally for CCSM, Task Geometry appears applicable to other systems-coupled atmospheric models such as 4D-Var. Scale-coupled atmospheric models also show promise for application in areas such as subgrid-scale parameterizations for GCM models, and the class of interactions demanded by use cases emerging in the area of Homeland Security. In all cases it is possible to demonstrate via speedup and efficiency that Task Geometry enhances performance; however, both metrics are problematical to quantify except in simple cases. With applicability from isolated Linux clusters to grids, Task Geometry remains useful in contexts that span organizational and/or geographic boundaries.
1. Introduction
Workload Management (WM) solutions aim to match application requirements (demand) with available resources (supply) subject to scientific and/or organizational policies (objectives). Generally speaking, WM solutions address this optimization problem with demonstrable effectiveness even in today's highly heterogeneous distributed computing environments in which multiple users/applications compete for the same resources. However, state-of-the-art atmospheric models present a significant challenge. A convenient, heuristic approach for understanding this challenge from the WM perspective(a) is afforded by the granularity(b) versus concurrency(c) plot provided in Fig. 1 [29,30]. Using standard Cartesian conventions, the four quadrants of Fig. 1 have the following associations:
• Data Parallel (I, Upper Right) - Owing to parallelism that exists in the data, workloads in this quadrant have high degrees of granularity and concurrency. The resulting embarrassingly parallel workloads are well suited to compute farms and desktops that are architected to maximize compute capacity - i.e., workload throughput as a function of time.
• Serial (II, Upper Left) - Serial workloads plot in this quadrant. Owing to their sequential nature, applications in this quadrant have high degrees of granularity but no concurrency.
• Service (III, Lower Left) - Workloads from this quadrant tend to spend more time communicating than computing. As a result they are of low granularity and low concurrency.
• Compute Parallel (IV, Lower Right) - Workloads in this quadrant are the typical focal point for the HPC applications typical of the atmospheric sciences. In this case, the parallelism that exists at the source-code level is exploitable in shared-memory (e.g., via OpenMP) and distributed-memory (e.g., via MPI) contexts. These capability workloads seek to make use of HPC architectures to solve problems that are Grand Challenge in their scope.
(a) Although instruction/process versus data heuristics exist and are widely used, the granularity versus concurrency approach has proven more appropriate in the WM context.
(b) Granularity is a measure of the amount of computation that can take place before there is a need for synchronization or communication. Thus the ratio computation/communication serves as a proxy for the vertical axis of Fig. 1.
(c) Concurrency refers to an ability to carry out activities simultaneously. In other words, it is a measure of the degree of parallelism that is present.
Figure 1. Granularity versus concurrency heuristic for the model components and coupler in CCSM. Double-edged arrows specify interaction between each of the components and the coupler via MPI. Note that the relative placement of a model component within a quadrant is not significant - e.g., CSIM does not necessarily exhibit greater concurrency than POP in quadrant II. Acronyms are expanded in the text.
To fix ideas for the purpose of illustration, i.e., to elucidate the challenge state-of-the-art atmospheric models present to WM solutions, the Community Climate System Model (CCSM)12,13,36 is considered in the context of Fig. 1. CCSM is a 'meta-model' in that it comprises four separate components(d) - i.e., an atmospheric model (The Community Atmosphere Model, CAM),35 a land surface model (The Community Land Model, CLM),37 an oceanic model (The Parallel Ocean Program, POP),42 and a sea-ice model (The Community Sea Ice Model, CSIM).38 In reality, most of these CCSM components can execute in serial or multithreaded/distributed-memory/hybrid parallel modes; the range of possibilities is enumerated elsewhere. Multithreaded versions of CAM and CLM, plus serial versions of CSIM and POP, have been included in Fig. 1 for illustrative purposes. Workload managers have no issue executing any one of these components in isolation. However, CCSM makes synchronous use of these four models as components, and accounts for their interaction via an MPI-based coupler (illustrated as CPL in Fig. 1). The resulting workload is Multiple Process, Multiple Data (MPMD) in the components and Single Process, Multiple Data (SPMD) for the interactions between each of the components and the coupler.12

Synchronous execution(e) of coupled, asymmetric workloads such as CCSM presents a challenge for workload managers. For this reason, NCAR has an established history of collaborating with providers of WM solutions to enable a functionality that can address workloads such as CCSM. Known as "Task Geometry", the following section (§2) continues the present use of CCSM to describe and quantify the impact of this functionality for systems-coupled atmospheric models. Then in §3, Task Geometry is applied conceptually for the first time to scale-coupled atmospheric models. Before identifying conclusions (§5), extending Task Geometry to Grid Computing environments is the subject of §4. Task Geometry contained within a cluster and co-scheduled between clusters, plus interoperability in Grid middleware, are each addressed in turn.

(d) CCSM's architecture allows one or more of these components to be exchanged with alternate models.
(e) In reality, each CCSM component computes, receives data via the coupler, computes, sends data via the coupler, and computes again - at each time step.40 In addition, most of the components operate in stages that introduce processing dependencies. Thus 'synchronous execution' is an oversimplification.

2. Systems-Coupled Atmospheric Models
2.1. Description
In use since 2000 at NCAR, "Task Geometry" originally load balanced CCSM workloads in NCAR's IBM SP-2 operating environment (Black Forest(f)) based on IBM AIX. Implemented as a new functionality that was incorporated into the WM solution IBM LoadLeveler, via a collaboration between NCAR and IBM, Task Geometry is also in use on NCAR's IBM p690 environments (Bluedawn and Thunder).45 More recently NCAR has incorporated IBM Linux clusters with Myricom Myrinet interconnects, for low-latency, high-bandwidth message passing, into their distributed computing environment. This recent addition motivated NCAR to collaborate with Platform Computing to make Task Geometry generically available in the WM solution Platform LSF HPC(g) - i.e., regardless of the underlying operating environment.

In CCSM, four independent components (i.e., atmosphere, land surface, ocean and sea ice) model the dynamics of the physical systems they represent, and are connected via a flux coupler.36 To fix ideas for the purpose of illustration, suppose that the distributed environment consists of three quadruple-CPU compute nodes (see Fig. 2).(h) In this case, single-threaded land surface (CLM), ocean (POP) and sea-ice (CSIM) components execute on a single quadruple-CPU compute node. Because the component related to the atmosphere (CAM) needs to compute roughly eight times as much as the other components, it has been quadruply threaded for execution on two quadruple-CPU nodes. Thus the model components operate in an MPMD mode (Fig. 3). The flux coupler (cpl6) also executes on one of the compute nodes (Fig. 2). As illustrated (Fig. 2), the quadruply threaded component representing the atmosphere makes use of the shared-memory programming semantics available under OpenMP on each of two compute nodes; interactions between these multithreaded processes occur via MPI. In contrast, the Single Process, Multiple Data (SPMD) interaction between each of the components and the coupler is facilitated via the distributed-memory programming semantics available from MPI (Fig. 3). From Fig. 3, it is also clear that the flux coupler facilitates a feedback process aimed at providing a time-evolving CCSM model run. Component-coupler interactions have been simplified for present purposes; detailed interactions are specified elsewhere.40

Figure 2. Schematic for a simple CCSM use case in which a multithreaded atmospheric component interacts via MPI with single-threaded components for each of the land surface, ocean and sea ice through a flux coupler. Note that single-headed arrows denote serial or threaded execution on a compute node, whereas double-headed arrows denote interactions via MPI within or between compute nodes.

Figure 3. Linear-systems style schematic for a simple CCSM use case in which models for the atmosphere, land surface, ocean and sea ice take inputs and provide outputs that are used in a feedback process to generate results at the next time step. The interaction between model components is facilitated by a flux coupler.

(f) NCAR has decommissioned Black Forest and replaced it with the other environments identified in this section.
(g) Task Geometry became officially supported as of Version 6.1 of Platform LSF HPC.
(h) Of course, in practice, other nodes are present - i.e., for compute, storage and other purposes. Here 'other' will also include a node (or more) on which the daemons supporting Platform LSF HPC execute.

To load balance CCSM workloads, Platform LSF HPC requires knowledge of the CCSM topology depicted in Fig. 2. This is achieved via the LSB_PJL_TASK_GEOMETRY environment variable. In addition to clearly identifying the name of this functionality, naming of this environment variable
conveys its batch heritage (i.e., via "LSB") plus its association with the Parallel Job Launcher (i.e., "PJL") capability of the product. For the example provided in Fig. 2, and in the case of the C shell, the appropriate syntax is setenv LSB_PJL_TASK_GEOMETRY "{(0,1,2,3),(4),(5)}". This environment variable specifies that tasks (0,1,2,3) correspond to the single-threaded executables (i.e., CLM, POP, CSIM and CPL) executing on a single quadruple-CPU compute node, while tasks (4) and (5) correspond to the MPI-OpenMP instance of CAM executing on two quadruple-CPU compute nodes.(i) Collectively the three tuple values {(0,1,2,3),(4),(5)} of the LSB_PJL_TASK_GEOMETRY environment variable convey the CCSM topology to Platform LSF HPC. Because script-based submission is the preferred mechanism for initiating CCSM workloads at NCAR, the LSB_PJL_TASK_GEOMETRY environment variable is set non-interactively - along with a number of additional environment variables (e.g., these include variables relating to CCSM itself and the parallel operating environment). These CCSM-specific submission scripts also associate components with the appropriate number of threads, and dynamically generate a command file for the relevant parallel operating environment. It is then this command file that is executed by the mpirun.lsf functionality of Platform LSF HPC.

(i) Note that the Platform LSF HPC implementation of Task Geometry allows for single and multithreaded tasks in the LSB_PJL_TASK_GEOMETRY environment variable.26
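To make the mechanics concrete, the fragment below sketches how a csh-based submission script might convey the Fig. 2 topology. The task numbering follows the example just described, but the script itself (including the use of OMP_NUM_THREADS to set the CAM thread count) is an illustrative assumption rather than the actual NCAR script:

    #!/bin/csh -f
    # Sketch only: convey the Fig. 2 CCSM topology to Platform LSF HPC.
    # Tasks 0-3 (CLM, POP, CSIM and CPL) share one quadruple-CPU node;
    # tasks 4 and 5 are the two quadruply threaded CAM processes, each
    # placed on its own quadruple-CPU node.
    setenv LSB_PJL_TASK_GEOMETRY "{(0,1,2,3),(4),(5)}"

    # Each CAM process runs four OpenMP threads on its node.
    setenv OMP_NUM_THREADS 4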
In addition to the LSB_PJL_TASK_GEOMETRY environment variable setting, the scheduler requires some guidance. In the current example, the maximum number of threads per node is four - a value driven by the atmosphere component. This maximum-number-of-threads-per-node allocation establishes a processor tile (ptile) - i.e., in general a virtual construct for mapping executable units (i.e., processes or threads) to CPUs. ptile is one of the possible entries for the span directive of the resource-requirement string(j) specific to Platform LSF HPC. Because the atmospheric component sets this value to four, the corresponding resource-requirement string is -R "span[ptile=4]", where -R is the bsub(k) command line option to Platform LSF HPC indicating that a resource requirement follows. Finally, the scheduler needs to allocate an appropriate number of execution slots(l) for CCSM. Multiplying the number of tuples in the LSB_PJL_TASK_GEOMETRY environment variable, N_TUP, in the model run (i.e., three in the present example) by the ptile value (i.e., four in this case) results in the total number of required slots (i.e., 12 in this case). In general, n = ptile x N_TUP.
Thus -n 12 is another bsub directive that needs to be set in the CCSM submission script. Additional information on NCAR's ongoing use of Task Geometry in the CCSM context is available elsewhere.12,36,45 Armed with topology awareness of the model components, and scheduling guidance, Platform LSF HPC dynamically determines the most appropriate nodes on which to execute CCSM. In addition to Task Geometry, this determination takes into account stated resource requirements, node availability and characteristics, plus relevant scheduling policies (e.g., fairshare). The Task Geometry implementation in Platform LSF HPC supports both proprietary (e.g., IBM Parallel Operating Environment) and Open Source (e.g., LAM/MPI, MPICH-P4 for TCP/IP, MPICH-GM for Myricom Myrinet) MPI implementations. Additional details on Task Geometry under Platform LSF HPC are available in the associated documentation;26 this includes usage notes and examples.
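Pulling the scheduler guidance together, a submission along the following lines would be consistent with the example above; this is a sketch under the stated assumptions (the script name ccsm_run.csh is hypothetical), not the actual NCAR submission procedure:

    # Sketch: submit the Fig. 2 example to Platform LSF HPC.
    # ptile=4 caps execution units per node at four (set by the CAM component),
    # and -n requests n = ptile x N_TUP = 4 x 3 = 12 slots in total.
    bsub -n 12 -R "span[ptile=4]" < ccsm_run.csh

    # Within ccsm_run.csh, the dynamically generated command file is launched
    # via mpirun.lsf so that each task is placed according to
    # LSB_PJL_TASK_GEOMETRY.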
(j) Resource requirements allow workload-specific information to be detailed in a manner that is digestible by the scheduler. This information may impact how the workload is scheduled and managed.
(k) bsub is the command used in Platform LSF HPC for job submission. It is detailed elsewhere.26 Graphically oriented interfaces are also available.
(l) 'Slot' is a term specific to Platform LSF HPC. It corresponds to a resource provider of processing capability. In HPC, there is typically a single slot allocated for a single thread or process.
overhead while they are pending,(n) suspended, or undergoing a checkpoint/restart/migration operation; and
- Post-dispatch workload initiation and termination times. These times will be increased in cases where execution environment pre- and/or post-conditioning is required.
With the addition of workload-management overhead, Eq. 2 becomes
It is standard practice to follow quantification of speedup with quantification of efficiency - i.e., speedup normalized by the number of processors used, N_CPUs. Summing speedup over all parallel (CAM) and serial (CLM, CSIM, POP and CPL) components yields an efficiency of
$$\mathrm{Efficiency} = \frac{\mathrm{Speedup}_{\mathrm{parallel}} + \mathrm{Speedup}_{\mathrm{serial}}}{N_{\mathrm{CPUs}}} = \frac{N_{\mathrm{THD}} + \mathrm{Speedup}_{\mathrm{serial}}}{N_{\mathrm{CPUs}}} = \frac{8 + 4}{12} = 100\% \qquad (5)$$
for a stage (either R2S or S2R) in an iteration of CCSM. In practice the inherent complexity of CCSM makes efficiency estimation challenging. This complexity derives from two primary sources:

- Multiple modes of parallelization and invocation - As noted previously, most CCSM components can execute in serial, multithreaded, distributed-memory parallel, or even a hybrid mode.36,40,45 In addition, components can be invoked in 'active', 'data' or 'dead' modes.45(o)
- Component-coupler interaction - As noted previously, each CCSM component computes, receives data via the coupler, computes, sends data via the coupler, and computes again.40 In addition, component-coupler interactions are not, in general, synchronized.40
Thus modes of parallelization and invocation have the potential for impact at each time step, while component-coupler interactions have the potential for impact within a time step. It is this variability that makes detailed speedup and efficiency calculations challenging to impractical. In fact, in practice, it is a combination of 'hard' (e.g., run-time statistics) and 'soft' (e.g., intuition gained from experience) skills that are required for effective use of CCSM.40

(n) In Platform LSF HPC, workload is placed in a pending state while it awaits requested resources for scheduling and dispatch.
(o) Briefly, active, data and dead modes correspond respectively to component execution, data manipulation involving the coupler, and simulation.

2.3. Related Usage
Task Geometry allows for topology-aware coordination within and between the components involved in a CCSM model run. As alluded to at the outset of §2, an MPMD implementation of a data-assimilation application also exists. In the case of the Meteorological Service of Canada's (MSC) pre-operational 4D-Var, interaction between a data-assimilation component (based on 3D-Var) and a global model (the MSC Global Environmental Multiscale (GEM) model) is achieved via a coupler.46 Thus it would appear that pre-operational 4D-Var could be executed via the Task Geometry implementation in Platform LSF HPC with minimal effort - i.e., effort would need to focus only on the details of the submission script. As Task Geometry accounts for application topology at run time, no additional effort (e.g., recompilation) is required. Other candidates for systems-coupled Task Geometry are likely to include the Integrated Global System Model (IGSM),48 atmosphere-biosphere interactions9 modeled via CLIMBER,47 plus models for Environmental Prediction.10
3. Scale-Coupled Atmospheric Models

3.1. Description

The Task Geometry functionality shows promise for utility in other atmospheric applications where more effort is likely to be required. For example, phenomenological approaches are currently used to parameterize the physics of cloud formation for GCMs. Today, the range of scales requiring representation makes use of direct numerical simulation of cloud processes (e.g., Cloud Resolving Convection Parameterization,20 CRCP) computationally impractical.28 Although use of IBM BlueGene/L shows significant promise in this context,28 Task Geometry also has the potential to have an impact in more-commoditized operating environments - based on Linux or some other operating system.

To fix ideas for the purpose of illustration,(p) suppose that the overall two-dimensional solution domain for the GCM is represented by the large rectangle shown in Fig. 7; then, each of the 16 small rectangles in the same figure represents a solution domain for an instance of the CRCP model. Thus the feedback from CRCP as a subgrid-scale (SGS) model for the GCM can be illustrated as in Fig. 8.

Figure 7. Highly simplified schematic for the solution domain of a GCM (large rectangle) interacting with a model that parameterizes the effects of cloud formation (small rectangles). To illustrate use of Task Geometry in this context, instances of CRCP acting on this grid are numbered CRCP(1,1) through CRCP(4,4).

(p) It is acknowledged that this example grossly oversimplifies the use of a GCM, CRCP, and their interaction. Its express purpose is to demonstrate the utility of Task Geometry in the context of scale-coupled atmospheric models. Subsequent use will determine more-realistic applications of Task Geometry in this context.
Figure 8. Linear-systems style schematic for a simple GCM-CRCP use case. Here the GCM is involved in a feedback loop with the SGS model, CRCP, to generate results at the next time step. This interaction needs to be facilitated by a coupler.
As in the case with CCSM workloads (§2), Platform LSF HPC acquires topology awareness of a GCM-CRCP workload via an environment variable setting. To further develop the primary example of this section, and supposing that all systems have 4 CPUs, the assignment statement for the topology-awareness environment variable is setenv LSB_PJL_TASK_GEOMETRY "{(0),(1),(2,3,4,5),(6,7,8,9),(10,11,12,13),(14,15,16,17)}". Thus the GCM runs four OpenMP threads on the first-assigned node (the first, i.e., (0) tuple of LSB_PJL_TASK_GEOMETRY), the coupler executes as a single thread on the second-assigned node (the second, i.e., (1) tuple of LSB_PJL_TASK_GEOMETRY), while CRCP runs four separate instances each in a single-threaded mode on each of the remaining 4 nodes (i.e., the remaining four tuples of LSB_PJL_TASK_GEOMETRY).
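By analogy with the CCSM example of §2, the corresponding settings and submission for this GCM-CRCP layout might look as follows; again this is only a sketch under the assumptions of this section (the script name gcm_crcp_run.csh is hypothetical):

    # Sketch: GCM-CRCP topology on six quadruple-CPU nodes.
    # Tuple (0): the GCM, four OpenMP threads on the first node.
    # Tuple (1): the coupler, single-threaded, on the second node.
    # Tuples (2,3,4,5) ... (14,15,16,17): 16 single-threaded CRCP instances,
    # four per node, on the remaining four nodes.
    setenv LSB_PJL_TASK_GEOMETRY "{(0),(1),(2,3,4,5),(6,7,8,9),(10,11,12,13),(14,15,16,17)}"

    # Six tuples at ptile=4 imply 6 x 4 = 24 slots - the 24 CPUs referred to
    # in the efficiency discussion of Sec. 3.2.
    bsub -n 24 -R "span[ptile=4]" < gcm_crcp_run.csh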
scales the summation in Eq. 6 by the number of CPUs, N_CPUs, i.e.,

$$t^{\mathrm{AFTER}}_{\mathrm{GCM+CRCP}} = \frac{\sum_{i=1}^{N_{\mathrm{CALLS}}} \left( t_{\mathrm{GCM}} + t_{\mathrm{CRCP}} + t_{\mathrm{CPL}} \right)}{N_{\mathrm{CPUs}}} \qquad (7)$$
Note that Eq. 7 includes a term to account for the speed of the interaction between CRCP and the GCM via the coupler, and any computational effort (e.g., re-gridding) expended by the coupler, t_CPL. The speedup in this context is

$$\mathrm{Speedup}_{\mathrm{GCM+CRCP}} = \mathrm{Speedup}_{\mathrm{GCM}} + \mathrm{Speedup}_{\mathrm{CRCP}} \qquad (8)$$

if the speed of the coupler is ignored. This means that the overall efficiency is given by
$$\mathrm{Efficiency} = \frac{\mathrm{Speedup}_{\mathrm{GCM}} + \mathrm{Speedup}_{\mathrm{CRCP}}}{N_{\mathrm{CPUs}}} \qquad (9)$$
In the case of the current example,(q) Speedup_GCM = 4, Speedup_CRCP = 16 and Speedup_CPL = 1. Because the total number of CPUs allocated by Platform LSF HPC is 24, the efficiency is about (4 + 16 + 1)/24 ≈ 87.5%. Note that the maximum achievable efficiency is (4 + 16 + 4)/24 = 100%. This overbooking of CPUs encourages GCM and/or CRCP allocations that better account for the requirements of the coupler.

(q) Assuming a speedup of 4 relating to the 4 OpenMP threads in comparison to a purely serial execution of the GCM.
3.3. Related Usage

As another example, experiments like Joint Urban 200351 illustrate that the controlled release of effluents motivates challenges and opportunities for the atmospheric sciences in a Homeland Security context. Such scenarios may necessitate an interplay between atmospheric models in (formerly) independent domains. More specifically, and again with the aid of a 'coupler', Task Geometry might facilitate modeling involving Large Eddy Simulation39 (LES) of passive tracers in the Urban Boundary Layer with mesoscale models (e.g., MM541 or WRF34,43) that address regional effects.31 In this case, an SGS model (i.e., the LES model) provides input to a broader-scale model (e.g., MM5 or WRF) regarding dynamics on unresolved scales (i.e., the Boundary Layer). This approach presents a potentially compelling, more-productive alternative to the somewhat redundant, sequential approach to multigrid methods in common use today.27 Following through at an even higher level of abstraction, these scale-coupled atmospheric models might serve as prognostic instruments in emergency-response platforms.1,49

Use of Task Geometry in scale-coupled contexts was originally suggested in regard to modeling the magnetohydrodynamics of Earth's magnetic field.31 In this case, SGS effects are parameterized for the large-scale, three-dimensional geodynamo models17 by the SGS model11 via a 'coupler'. Because use is made of existing larger-scale SGS models, most of the extra effort needs to focus on the 'coupler'. Fortunately couplers, and the frameworks they serve to define, are active areas of current research.8

4. Extension to The Grid
With Platform LSF MultiCluster, clusters based on Platform LSF HPC can be rapidly transformed into a Grid-computing environment.(r) Because Linux clusters are already virtualized resources, this transformation exploits existing natural affinities. Given that the specifics of this transformation are detailed elsewhere, attention here focuses on Task Geometry in the context of Platform LSF MultiCluster. More specifically, attention here focuses in turn on use cases driven by the two primary use models provided by Platform LSF MultiCluster - namely job forwarding and resource leasing.
4.1. Job Forwarding

In production deployment since 1996, the Job Forwarding use model of Platform LSF HPC is based on send/receive queues.(s) These queues allow for workload exchange (i.e., forwarding) between co-operating clusters, and serve as cluster-wide containers for managing workload against a rich array of scheduling policies. In the Job Forwarding use model, sites retain a very high level of local autonomy (i.e., sites selectively identify resources available for Grid use) while benefiting from resource sharing that crosses geographic and/or organizational boundaries.(t) Typically used to ensure maximal utilization of all compute resources across an entire enterprise, Job Forwarding applies to mixed workloads - e.g., serial applications, parametric processing, plus shared and distributed memory parallel applications. Supported scheduling policies include advance reservation, backfill, fairshare, goal-oriented service level agreements, memory and/or processor reservation, preemption, etc.

Ignoring the challenge of data locality,21 and under the Job Forwarding use model of Platform LSF MultiCluster, MPMD CCSM model runs are 'forwarded' as a whole from a submission cluster to an execution cluster. This means that all activities, including interaction between the four components representing the atmosphere, land surface, ocean and sea ice, via the coupler, are confined to the execution cluster. Thus Fig. 2 adequately illustrates a CCSM model run in the context of the execution cluster. A similar statement applies to Fig. 10 in the case of GCM-CRCP model runs. In the case of the Job Forwarding use model of Platform LSF MultiCluster, it is possible to make use of Task Geometry without any additional effort on the part of the scientist performing the modeling. Additional information on usage of this Platform LSF MultiCluster use model is detailed elsewhere.24

(r) Although a more-formal definition of Grid computing is deferred elsewhere,16 positing 'Grid computing' as a cluster-of-clusters or clusters-of-clusters serves current purposes.
(s) Job Forwarding supports one-to-one, one-to-many, many-to-one and many-to-many relationships between send/receive queues.
(t) Security is a more-significant concern as organizational boundaries need to be crossed. Together with customers and partners, Platform has crafted solutions that address aspects of this security challenge via AFS, DCE/DFS, The Grid Security Infrastructure4 and Kerberos.

4.2. Resource Leasing
Resource Leasing is a use model that appeared more recently(u) in Platform LSF MultiCluster. In this use model, clusters based on Platform LSF HPC have resources (e.g., compute nodes, software licenses, etc.) that can be provided or consumed subject to some agreement. On execution of the workload, the consuming cluster assimilates provided resources in the context of a 'resource lease'.

To fix ideas for the purpose of illustration, and again ignoring data locality, consider a CCSM model run via Task Geometry involving clusters at the two sites depicted by the dashed-line rectangles of Fig. 12. Each of the clusters is running an instance of Platform LSF HPC and the inter-cluster interaction is facilitated by the Resource Leasing use model of Platform LSF MultiCluster. As illustrated here, a CCSM model run is submitted at Site A. It makes use of compute nodes local to Site A for the land surface, ocean and sea-ice CCSM components, plus a leased compute node from Site B for the atmosphere component. Although six additional compute nodes have been exported from Site B, they do not factor into this scenario. As before (e.g., Fig. 2), the flux coupler interacts with all components via MPI. Implicit in this scenario, of course, is the assumption that the interactions between the model components and the coupler are fairly tolerant of the inherent network characteristics (i.e., bandwidth and, more importantly, latency) between the two sites. It is important to note that this is a bona fide example of co-scheduling an MPI application in a Grid computing context29,30 - without any need for human intervention. As in the case of Job Forwarding, Resource Leasing can apply various scheduling policies, and make use of the extensible scheduler framework present in Platform LSF HPC. An analogous example could easily be developed for a scale-coupled use case via Task Geometry.

(u) Resource Leasing first appeared in Version 5.0 of Platform LSF MultiCluster.
Figure 12. Schematic for a simple CCSM use case in which multithreaded atmospheric components interact via MPI with single-threaded components for each of the land surface, ocean and sea ice through a flux coupler. Note that the atmosphere component executes on leased resources that are physically situated at Site B. All other CCSM components, and the flux coupler, execute on compute resources local to the submission cluster (Site A). Co-scheduled Task Geometry is made possible through the Resource Leasing use model of Platform LSF MultiCluster.
4.3. A Grid of Grids
The highly collaborative nature of atmospheric science ultimately demands interoperability between technologies used to craft Grid infrastructures at the middleware level. This means, for example, that Platform LSF MultiCluster must be able to interoperate with the Globus Toolkit2 and/or UNICORE.15 To facilitate this interoperability in the case of The Globus Toolkit (see Fig. 13), and to pave the way for eventual compliance with the emerging
standards for Grid-enabled Web services,5,6 Platform created The Community Scheduler Framework (CSF).23 CSF is an Open Source contribution by Platform that has recently appeared in Version 4.0 of The Globus Toolkit.3 Although CSF is a recent contribution, it has been generally available for over a year, and has a modest following - including members of the public- and private-sector atmospheric sciences community. With job, reservation and queuing services, CSF provides a comprehensive, standards-compliant framework for the development of Grid-level schedulers. In addition to being Open Source, CSF makes use of the Web Services Agreement (WS-Agreement) - a Global Grid Forum (GGF)14 specification co-authored by the Globus Alliance, IBM and Platform. WS-Agreement details the basis for negotiating consumer-provider relationships for entities belonging to one or more virtual organizations. Although CSF currently targets compute-oriented workloads, it has the potential to factor in other contexts where there is a need to negotiate consumer-provider relationships.
Figure 13. Simplified schematic for interaction at the Grid-middleware level. The Grid based on Platform technology interacts with a Globus-based Grid via a Resource Manager Adapter (RMA) that plugs into The Community Scheduler Framework. Ultimately this extends heterogeneity to the workload-manager level - i.e., Platform LSF HPC indirectly interacts with Sun GridEngine. This Grid of Grids is accessible via a portal.
To complete the solution stacks presented in Fig. 13, it is necessary to add access points to these Grid-based environments. This access is typically mediated via a portal.44,50 Because portals abstract away underlying complexity, while serving as community-specific points of presence, this continues to be an active investment area for organizations involved in the atmospheric sciences.18,19
5. Discussion
State-of-the-art atmospheric models present a challenge for WM solutions: they are composed of independent, asymmetric components that are synchronously coupled. The resulting complex application needs to be managed as a whole in today's highly heterogeneous and highly distributed computing environments. Motivated by this scientific imperative, NCAR collaborated first with IBM and then with Platform Computing to detail a new functionality known as Task Geometry. IBM and Platform subsequently implemented Task Geometry in IBM LoadLeveler and Platform LSF HPC, respectively. The Platform implementation is available for all UNIX and Linux-based environments supported by Platform LSF HPC.

Task Geometry is a construct that allows topological information about the atmospheric model to be conveyed to the workload manager. Armed with this information the workload manager is able to optimally schedule the atmospheric model, subject to policies, on the most-appropriate computational resources. By exploiting the parallelism that already exists within and between model components, Task Geometry allows highly asymmetric workloads to be load balanced. This coordinated load balancing also results in significant performance gains, as well-balanced workloads allow for better overall performance. Speedup and efficiency calculations can be used to quantify the improvement achieved.

Although Task Geometry was originally developed to address CCSM workloads, it has a broader range of applicability. Specifically, CCSM and 4D-Var are two examples of systems-coupled workloads. A second class of applicability was also identified as scale-coupled workloads. Task Geometry has not been applied previously to scale-coupled use cases, and as a result, effort will be needed to frame atmospheric models appropriately. More specifically, the required effort will need to focus on submission specifics (e.g., a submission script) and the development of a coupler. Parameterized effects from unresolved scales for larger-scale models, as well as novel usage motivated by Homeland Security, are examples representative of Task Geometry in scale-coupled use cases. It is also anticipated that scale-coupled usage of Task Geometry could play a role in addressing the spatial aspects of Adaptive Mesh Refinement (AMR)22 - a numerical approach that dynamically enhances spatial and/or temporal domains of interest. There is clearly ample motivation for further investigation.

With a storied history of co-dependence spanning a half century already,10 it is not surprising that NWP is still generating leading-edge
challenges for computing. This is fortunate for other areas of the physical sciences, as they can benefit directly from leading-edge solutions like Task Geometry. In addition to allusions to the study of Earth's geodynamo as a scale-coupled use case for Task Geometry, it is clear that this functionality will find utility in areas as diverse as astronomy, the Life Sciences, and materials science. And this applies to both public- and private-sector HPC applications.

The fractal structure of HPC architectures progresses from the processor to the cluster to the grid - the latter of which can be thought of as a cluster of clusters. Platform LSF MultiCluster allows for the cluster-to-grid progression via two use models - Job Forwarding and Resource Leasing. Whereas Job Forwarding workloads are constrained to execute within a cluster, Resource Leasing workloads are co-scheduled between one or more clusters. Because Task Geometry is compatible with both of these use models, the utility of commodity architectures can be leveraged by making organizational and/or geographic boundaries transparent. This transparency also takes into account the requirement for the highest levels of interoperability between grid middleware, a requirement that underlines the importance of open standards.14
Acknowledgments

The authors acknowledge Mariana Vertenstein of the Climate and Global Dynamics Division, NCAR, for various discussions on CCSM, plus Hanna Mullane of ECMWF and Karen Sopuch of Platform Computing for their encouragement and assistance. Additionally, I.L. acknowledges David McMillan of York University (Toronto, Canada) for ongoing discussions regarding modeling the geodynamo.
References

1. N. R. Adam, V. Atluri, and V. P. Janeja. Agency interoperation for effective data mining in border control and homeland security applications. The Fifth National Conference on Digital Government (dg.o 2004), Seattle, Washington, USA, May 2004.
2. The Globus Alliance. The Globus Toolkit. http://www.globus.org.
3. The Globus Alliance. GT4: CSF. http://www.globus.org/toolkit/docs/4.0/contributions/csf/.
4. The Globus Alliance. Overview of The Grid Security Infrastructure. http://www.globus.org/security/overview.html.
5. The Globus Alliance. Towards Open Grid Services Architecture. http://www.globus.org/ogsa.
6. The Globus Alliance. The WS-Resource Framework. http://www.globus.org/wsrf.
7. S. P. Arya. Introduction to Micrometeorology, Second Edition. Academic Press, 2001.
8. V. Balaji. A comparative study of coupling frameworks: The MOM case study. Eleventh ECMWF Workshop: Use of High Performance Computing in Meteorology, Reading, UK, October 2004.
9. G. Brasseur, W. Steffen, and K. Noone. Earth System focus for Geosphere-Biosphere Program. Eos Trans. Am. Geophys. Un., 86(22):209, 213-214, 2005.
10. G. Brunet. The first hundred years of Numerical Weather Prediction. In I. Kotsireas and D. Stacey, editors, Proceedings of The 19th International Symposium on High Performance Computing Systems and Applications, HPCS 2005, pages 276-279. The IEEE Computer Society, 2005.
11. B. A. Buffett. A comparison of subgrid-scale models for large-eddy simulations of convection in the Earth's core. Geophys. J. Int., 153:753-765, 2003.
12. G. Carr. Porting and performance of the Community Climate System Model (CCSM3) on the Cray X1. Eleventh ECMWF Workshop: Use of High Performance Computing in Meteorology, Reading, UK, October 2004.
13. W. D. Collins, C. M. Bitz, M. L. Blackmon, G. B. Bonan, C. S. Bretherton, J. A. Carton, P. Chang, S. C. Doney, J. J. Hack, T. B. Henderson, J. T. Kiehl, W. G. Large, D. S. McKenna, B. D. Santer, and R. D. Smith. The Community Climate System Model: CCSM3. J. Climate, 11(6):to appear, 2005.
14. The Global Grid Forum. The Global Grid Forum. http://www.ggf.org/.
15. The UNICORE Forum. UNICORE. http://www.unicore.org.
16. I. Foster. What is The Grid? A Three Point Checklist. GRIDtoday, 1(6), 2002.
17. G. A. Glatzmaier, D. E. Ogden, and T. L. Clune. Modeling the Earth's dynamo. In R. S. J. Sparks and C. J. Hawkesworth, editors, The State of the Planet: Frontiers and Challenges in Geophysics, volume IUGG Volume 19 of Geophysical Monograph 150, pages 13-24. American Geophysical Union, 2004.
18. M. W. Govett. A WRF portal for model test and verification. NOAA Technical Report, 2003.
19. M. W. Govett. Grid Computing: The development of a portal for model test and verification. NOAATech 2003, Silver Spring, Maryland, USA, October 2003.
20. W. W. Grabowski. Coupling cloud processes with the large-scale dynamics using the Cloud-Resolving Convection Parameterization (CRCP). J. Atmos. Sci., 58(9):978-997, 2001.
21. G. Hayward and I. Lumb. Data agnostic resource scheduling in the grid. In I. Kotsireas and D. Stacey, editors, Proceedings of The 19th International Symposium on High Performance Computing Systems and Applications, HPCS 2005, pages 230-235. The IEEE Computer Society, 2005.
22. R. Henderson, D. Meiron, M. Parashar, and R. Samtaney. Parallel computing in computational fluid dynamics. In J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White, editors, Sourcebook of Parallel Computing, pages 93-144. Morgan Kaufmann, 2003.
23. Platform Computing Inc. The Community Scheduler Framework. http://sourceforge.net/projects/gcsf.
24. Platform Computing Inc. Using Platform LSF MultiCluster, Version 6.1. Platform Computing Inc., 2004.
25. Platform Computing Inc. Using Platform LSF, Version 6.1. Platform Computing Inc., 2004.
26. Platform Computing Inc. Using Platform LSF HPC, Version 6.1. Platform Computing Inc., 2005.
27. E. Kalnay. Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, 2003.
28. R. Loft. Price and power aware approaches to advancing atmospheric science. Eleventh ECMWF Workshop: Use of High Performance Computing in Meteorology, Reading, UK, October 2004.
29. I. Lumb. HPC Grids. In A. Abbas, editor, Grid Computing: A Practical Guide to Technology and Applications, pages 119-133. Charles River, 2003.
30. I. Lumb. Production HPC reinvented. ;login:, 28(4):15-22, 2003.
31. I. Lumb. The metric of scale: Real-world experiences via infrastructural software. High Performance Computing and Communications Conference: Exploring the Next Frontier, 19th Annual, Newport, Rhode Island, USA, April 2005.
32. I. Lumb and K. D. Aldridge. Grid-enabling the Global Geodynamics Project: The introduction of an XML-based data model. In I. Kotsireas and D. Stacey, editors, Proceedings of The 19th International Symposium on High Performance Computing Systems and Applications, HPCS 2005, pages 216-222. The IEEE Computer Society, 2005.
33. I. Lumb and C. Smith. Integrating Linux clusters into the Grid. SysAdmin Magazine, Clustering Supplement 2003, 12(8):49-52, 2003.
34. J. Michalakes, J. Dudhia, D. Gill, T. Henderson, J. Klemp, W. Skamarock, and W. Wang. The Weather Research and Forecast Model: Software architecture and performance. In Proceedings of the Eleventh ECMWF Workshop: Use of High Performance Computing in Meteorology. World Scientific, 2005.
35. NCAR. The Community Atmosphere Model. http://www.ccsm.ucar.edu/models/atm-cam/.
36. NCAR. Community Climate System Model (CCSM3). http://www.ccsm.ucar.edu/models/ccsm3.0.
37. NCAR. The Community Land Model. http://www.cgd.ucar.edu/tss/clm/.
38. NCAR. The Community Sea Ice Model. http://www.cgd.ucar.edu/ccr/bettge/ice/.
39. NCAR. Large Eddy Simulation. http://www.mmm.ucar.edu/research/surface/les.html.
40. NCAR. Load Balancing CCSM3. http://www.ccsm.ucar.edu/models/ccsm3.0/ccsm/doc/UsersGuide/UsersGuide/node10.html.
41. NCAR. MM5 Community Model. http://www.mmm.ucar.edu/mm5/.
42. NCAR. The Parallel Ocean Program. http://climate.lanl.gov/Models/POP/.
43. NCAR. The Weather Research and Forecasting Model. http://www.wrf-model.org/.
44. NICE. NICE EnginFrame. http://www.enginframe.com.
45. M. Page, G. Carr, I. Lumb, and B. McMillan. NCAR CCSM with task-geometry support in LSF. ScicomP 11, Edinburgh, Scotland, June 2005.
46. S. Pellerin. MPMD implementation of preoperational MSC 4D-Var on IBM p690. Eleventh ECMWF Workshop: Use of High Performance Computing in Meteorology, Reading, UK, October 2004.
47. V. Petoukhov, A. Ganopolski, V. Brovkin, M. Claussen, A. Eliseev, C. Kubatzki, and S. Rahmstorf. CLIMBER-2: A climate system model of intermediate complexity: I. Model description and performance for present climate. Clim. Dyn., 16:1-17, 2000.
48. R. G. Prinn. Complexities in the climate system and uncertainties in forecasts. In R. S. J. Sparks and C. J. Hawkesworth, editors, The State of the Planet: Frontiers and Challenges in Geophysics, volume IUGG Volume 19 of Geophysical Monograph 150, pages 297-305. American Geophysical Union, 2004.
49. SAP and Rutgers University. Next-generation emergency response. SAP SAPPHIRE 2005, SAP Innovation Pavilion, Boston, Massachusetts, USA, May 2005.
50. The GridPort Team. GridPort. http://gridport.net.
51. United States Department of Energy. Joint Urban 2003: Atmospheric dispersion study in Oklahoma City. http://ju2003.pnl.gov/.
PORTING AND PERFORMANCE OF THE COMMUNITY CLIMATE SYSTEM MODEL (CCSM3) ON THE CRAY X1
George R Carr Jr, NCAR; Ilene L Carpenter, SGI; Matthew J Cordery, Cray; John B Drake, Michael W Ham, Forrest M Hoffman, Patrick H Worley, ORNL
ABSTRACT: The Community Climate System Model (CCSM3) is the primary model for global climate research in the United States and is supported on a variety of computer systems. We present some of our porting experiences and describe the current performance of the CCSM3 on the Cray X1. We include the status of work in progress on other systems in the Cray product line.
KEYWORDS: CCSM, climate, Cray X1, porting, performance.
1. Introduction: The Community Climate System Model

The Community Climate System Model (CCSM3) is a computer model for simulating the Earth's climate. The CCSM is supported by the National Science Foundation and the Department of Energy and is freely available to the climate community. The CCSM is built from four dynamical component models: atmosphere, ocean, land surface and sea ice. These communicate with each other via a flux coupler component in a "hub and spoke" configuration (see Figure 1). The CCSM and its components are documented in papers available from web pages at the National Center for Atmospheric Research (NCAR) (http://www.ccsm.ucar.edu/models/ccsm3.0) and other papers [Special Issue on Climate Modeling, Int'l J. HPC Apps., Vol 19, #3, August, 2005] [CCSM Special Issue, J. Climate, 11(6)] [Collins, 2005].
The current version of CCSM incorporates contributions from researchers around the world over a period of more than 25 years. The code base is written in multiple versions of Fortran and a small amount of C. The CCSM requires the use of MPI and allows the additional use of OpenMP on some architectures with some components. This, together with being a Multiple Program Multiple Data (MPMD) application, is often a significant challenge for vendors and application porters alike. The CCSM3 can be run with a variety of options. For example, the CCSM3 currently supports options for the ocean model ranging from a "Data" model that is often used for testing CCSM3 to a somewhat simplified slab ocean model to the use of a complete ocean model (POP). In this paper, all reported results from CCSM configurations refer to "fully coupled" options that utilize CAM3, POP1.4.3, CLM3, CSIM5, and CPL6.
1.3 CCSM Platforms

The CCSM runs on various older and current IBM, SGI, Cray, NEC, and commodity cluster systems. The current list of supported systems, i.e., systems on which CCSM has been validated, can be found at http://www.ccsm.ucar.edu. Recent and current work has concentrated on adding the Cray X1, SGI Altix, Intel Xeon clusters, and AMD Opteron clusters to the list of supported machines.
2. Introduction: Cray X1

The Cray X1 architecture combines vector and scalar processor units with cache and a globally addressable memory. The Single Streaming Processor (SSP) is composed of a single scalar unit and two vector pipes. The Multi-Streaming Processor (MSP) is composed of 4 SSPs ganged together with a 2MB cache. Four MSPs comprise an X1 SMP (Symmetric Multiprocessor) node. Memory is physically distributed between SMP nodes, but globally addressable. The X1 allows the applications developer to program to the SSP or to the MSP. At this time, the CCSM uses the MSP as the processor, relying on the compiler to spread the computation across the 8 vector pipes of each MSP. Work is being performed on the Cray X1 at the Oak Ridge National Laboratory (ORNL) known as Phoenix. Phoenix is currently populated with 512 MSPs. Plans exist to upgrade Phoenix to the newer X1E architecture and increase the total number of MSPs to 1024.
3. CCSM port and validation process

3.1 Porting Introduction
The process of porting a code of the size and complexity of the CCSM is complicated. Just getting CCSM to build the first time on a new machine can be several weeks of work. Generation of correct climatic results requires much more work and time. When porting to a new machine, the safe approach is to begin with limited or no compiler optimization. The CCSM regularly finds a
3.3 CAM/CCSM Atmospheric Diagnostics

The next validation step is to compare the monthly history files generated by CAM against a corresponding 100 year baseline. This test computes the monthly averages for the test configuration and the baseline and generates a large number of plots and graphs for analysis. A duration of the test configuration that is less than 100 years can be compared against the 100 years of the baseline and may be enough to show numeric divergence. A complete 100 years of the test configuration is required to accept the test configuration as valid. The atmospheric diagnostics package can be run comparing data from the standalone CAM to a standalone CAM baseline or by comparing data from CCSM CAM to a CCSM CAM baseline. Often, this test is performed as a cutoff check while running 100 years of a controlled CCSM run for the CCSM Statistical Test discussed in the next section. Divergence of the results permits one to stop running the test configuration prior to completing 100 years of simulation. Problems pointed out by the CAM diagnostics package may not actually be attributable to problems in CAM. The ocean, land, and sea ice components also have component-specific tests that can be run to aid in isolation of problems.
3.4 CCSM Statistical Analysis

The final step in the validation of a CCSM port is a statistical analysis of the CAM history files generated from a run of 100 simulation years of CCSM from a controlled baseline. Generating 100 years of data can take considerable computer time. On the Cray X1, running 100 years of the small T31x3 resolution model on 40 MSPs continuously currently takes more than two and a half days to complete. Running 100 years of a much larger T85x1 resolution on 132 MSPs continuously currently takes more than 9 days. Note that it is possible to pass all of the previous CCSM tests and still fail the statistical analysis test.
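For orientation, those wall-clock figures imply upper bounds on throughput of roughly 100/2.5 = 40 simulated years per day for T31x3 on 40 MSPs and about 100/9 ≈ 11 simulated years per day for T85x1 on 132 MSPs. The quick shell check below simply restates that arithmetic (since the quoted durations are "more than" the stated days, these rates are upper bounds):

    # Implied throughput (upper bounds) for the 100-year validation runs.
    echo "T31x3 100 2.5 40"  | awk '{printf "%s: <= %.0f years/day on %s MSPs\n", $1, $2/$3, $4}'
    echo "T85x1 100 9 132"   | awk '{printf "%s: <= %.1f years/day on %s MSPs\n", $1, $2/$3, $4}'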
3.5 CCSM Regression Testing

Once a validated source baseline exists for a machine, simple tests can be performed to test whether a change produces "bit for bit" exactly the same results. Code changes generating results that are not bit for bit must go through the full validation process to be accepted.
4. Some Aspects of the CCSM

4.1 CCSM Performance
The production performance of the CCSM3 is most often expressed as production throughput in number of simulated years per wall clock day for a specified number of processors (or years per day). A century long simulation takes a minimum of 25 days for a computer delivering 4 years per day. Scaling
efficiency is expressed as simulated years per wall clock day per CPU (or years per day per cpu). Table 1 shows the performance and efficiency on each computing platform of the standard Intergovernmental Panel on Climate Change (IPCC) T85 atmosphere, 1 degree ocean model configuration. The number of processors used for a production run is a choice based on load balance of the components, batch queue constraints, and an estimate of the time required to generate the results. The turn around time can be measured in weeks to months when a large simulation of a thousand years or more is computed [Drake, 2005].

Platform         NumCPUs   Years/day   Years/day/CPU
IBM SP3          208       1.57        0.0075
IBM p690         192       3.43        0.0179
ES (NEC SX6)     184       16.2        0.0880
Cray X1          208       13.6        0.0654

Table 1: Computational Performance of CCSM3.0 for an IPCC T85x1 Run
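The rightmost column of Table 1 is simply the throughput divided by the processor count; a throwaway shell check of the Cray X1 entry reproduces the tabulated value:

    # Years/day/CPU for the Cray X1 row of Table 1: 13.6 years/day on 208 MSPs.
    echo "13.6 208" | awk '{printf "%.4f simulated years per day per CPU\n", $1 / $2}'
    # Prints 0.0654, matching Table 1.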
4.2 Processor Load Imbalance

A significant performance challenge in the CCSM is the load imbalance generated by the non-homogeneous structure of a multi-physics, multi-component model. A striking example of the structure of the load imbalance appears in the calculation of the short wave radiation balance. This computation need be done only where the sun is shining, i.e., on half of the computational domain. This region changes for each time step. Load imbalances within a component are typically resolved using component-specific data decomposition schemes [Worley, 2005]. Load imbalances are also generated from the concurrent component execution model used by CCSM. CCSM launches five individual binaries that run concurrently on separate processor sets. Each of the four dynamic components communicates with each other via the coupler component at prescribed stages of processing. Choosing a "correct" number of processors for each component is at best a compromise. The goal for a specified total number of processors is to provide a number of processors to each component such that the maximum simulation years per day is achieved, typically by attempting to minimize the overall idle processor time. This is complicated as each component has different scaling attributes and different data decomposition restrictions. Some component processing is dependent on other component results. A poorly chosen assignment of processors may result in one component waiting excessively on the results from another. Typically, for a fully active T85x1 configuration, 2 of every 3 processors of the total processor count are assigned to the atmospheric component. The number of processors assigned to the ocean component is chosen to best match the processing time of the atmosphere. The balance of the processors in the
- (optional) Long term archive: data transfer to tape.
- Monitoring of production process (often manual).
- Submission of next job.
A 1000 year production process is broken into many separate jobs. A single job is set to run for a specified simulation duration such that the wall clock runtime required satisfies queue, disk, and/or other limitations of the computer environment. Restart files are written at the end of a job and can be written at multiple points in a job, limiting the amount of computation that might be lost if the job fails to complete, at a cost of extra I/O time and disk space. Performance numbers are often reported (as in this paper) only for the subset of the production job where the actual simulation is performed. This presumes that the length of the production job is long enough to amortize the startup and termination time of the jobs. The creation of data can be very substantial. Standard output from a T85x1 IPCC run generates about 10 GB of simulation data per simulation year (excluding restart and some other output). Many experiments will generate much more. Staging and storage of this amount of data can challenge data center capabilities.
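Such a production sequence is typically driven by a self-resubmitting batch script. The csh fragment below is only a schematic of that pattern; the script and helper names, the bookkeeping file, and the use of bsub for resubmission are illustrative assumptions, not the actual CCSM run scripts:

    #!/bin/csh -f
    # Schematic: run one segment of a long CCSM production sequence, then
    # resubmit the next job while simulated years remain.
    set CASEROOT      = /ptmp/$USER/ccsm_case     # hypothetical case directory
    set SEGMENT_YEARS = 5                         # simulated years per batch job
    set TARGET_YEARS  = 1000                      # length of the full experiment

    set done_years = `cat $CASEROOT/years_completed`   # hypothetical bookkeeping file

    # Run the model for this segment; restart files written at the end (and
    # optionally at intermediate points) limit what a failed job can lose.
    ./run_segment.csh $SEGMENT_YEARS || exit 1

    # Short-term archive of history output. At roughly 10 GB per simulated
    # year for a T85x1 IPCC run, a century amounts to about 1 TB.
    ./archive_segment.csh

    @ done_years = $done_years + $SEGMENT_YEARS
    echo $done_years >! $CASEROOT/years_completed

    # Resubmit until the target duration is reached (the submission command
    # is site-specific; bsub is shown purely for illustration).
    if ($done_years < $TARGET_YEARS) then
        bsub < production_job.csh
    endif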
5. Porting Status of the CCSM on the Cray X1

5.1 December, 2003

Preliminary work was performed to vectorize the individual component models. Some of this effort benefited from vectorization work on the Earth Simulator in Japan.

5.2 April, 2004
The preliminary vector versions of the CCSM components were merged into the development branch including basic support for the X1. The standalone CAM3 and CLM2 models were successfully validated on the Earth Simulator and the X1.
5.3 June, 2004

The CCSM was validated on the Earth Simulator. This version of CCSM achieved the required percentage of vectorization allowing the use of the Earth Simulator for production runs. Over 6000 simulation years were run on the Earth Simulator in 2004 as part of NCAR's support for the IPCC. The data from these runs were placed on tapes and shipped to NCAR for analysis and archive. CCSM3 was formally released to the community. This included basic support for the X1.
5.4 Fall, 2004 (current)
The first successful T31x3 validation on the Cray X1 was completed with programming environment 5.1.0.5 (PE5105). With a normal conservative first-try approach, the performance of the T31x3 runs with this baseline was adequate and scaling was acceptable to begin production runs. The performance and scaling of the T85x1 runs were not quite as good as with the T31x3 runs. However, science could now be performed on the X1. The primary goal was to achieve correct scientific results. Performance was a secondary priority. This stage of work on the X1 has concentrated on an MPI-only configuration. The X1 does not currently support using OpenMP with some components and not with others, and not all components in the released version of CCSM support OpenMP parallelism. Some preliminary CCSM load balance work has been completed (assigning processors to components). This was used to choose the configuration that was used for the validation effort and to produce the preliminary raw performance and scaling performance data. A small configuration of 40 MSPs was chosen for the T31x3 validation on the X1 as a good compromise of model performance, model efficiency, and job queue wait time. Work to date has been specifically concentrated on basic functionality.
6. Remaining Work

With a validated baseline in hand, work has begun to focus more on performance. In addition, Cray has introduced a newer programming environment that will also be evaluated. There are a number of issues remaining to be addressed. The model timer in POP and CSIM can become corrupted after approximately 10 simulation years in a single job. The numeric results can differ when using dynamic CAM load balancing. This is a small difference in the range of roundoff. These results should be bit for bit, and are with slightly different compiler options than were used during model validation. The alternative settings will be used in future validations. One of the performance timers in the coupler is broken. The run scripts for the ORNL environment need to be hardened for general use. There can be a significant variation in the performance from simulation day to simulation day, partly due to process migration as other jobs on the system are loaded and complete. General performance needs to be improved.
Conclusion

Significant progress has been made porting the CCSM to the Cray X1. With a successful validation, production work can begin on the X1. The Cray X1 is becoming a significant resource for the CCSM community.
Acknowledgments

The research and development activities of Carr at NCAR were sponsored by the National Science Foundation. There are far too many at NCAR who contributed to George's work to be able to list them all here. Special thanks to Lawrence Buja, Bill Collins, Jim Hack, Tom Henderson, Mark Moore, Phil Rasch, and Mariana Vertenstein for their support, encouragement, and tolerance. The research and development activities of Drake, Ham, Hoffman, and Worley were sponsored by the Climate Change Research Division of the Office of Biological and Environmental Research, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes. Special thanks to Dr. Yoshikatsu Yoshida, all his colleagues of the Central Research Institute of Electric Power Industry (CRIEPI), and to Dave Parks and John Snyder from NEC for all their support to CCSM.
About the Authors

George R Carr Jr is a Software Engineer at the National Center for Atmospheric Research (NCAR) in Boulder, CO. He has worked with many of the high performance computing platforms performing systems engineering and systems architecture analysis in a variety of applications areas including signal and image processing, numeric weather prediction, and climate modeling. He is currently responsible for the portability and performance of the Community Climate System Model (CCSM). He received a Bachelor of Arts degree in Mathematics at Drake University and a Master of Sciences degree in Computer Science at The Ohio State University. He is a member of the ACM, the IEEE Computer Society, and USENIX. George can be reached at NCAR, P.O. Box 3000, Boulder, CO 80307-3000, USA, E-mail:
[email protected]. Ilene L Carpenter is currently a business development manager at Silicon Graphics Inc. She was previously an applications engineer at Cray Inc. and Silicon Graphics where she specialized in performance analysis and optimization of weather and climate models. She received her Ph.D. in physical chemistry from the University of Wisconsin, Madison in 1987 and has worked on weather, climate and ocean models since 1995. She can be reached at Silicon Graphics Inc., 2750 Blue Water Rd., Eagan, MN 55121, E-mail:
[email protected]. Matthew Cordery is a Software Engineer for Cray Inc, where he specializes in optimizing weather and climate applications. He has a PhD in marine geology and geophysics from MIT and the Woods Hole Oceanographic Institution. He
can be reached at Cray, Inc., 411 1st Ave S, Suite 600, Seattle, WA 98104 USA, E-mail:
[email protected]. John Drake is the Climate Dynamics Group Leader at the Oak Ridge National Laboratory. He received his Ph.D. in Mathematics from the University of Tennessee in 1991 and has worked on projects involving fluid dynamics simulation among the labs of the Oak Ridge complex since 1979. His primary research interest is in numerical algorithms for atmospheric flows that involve problems of scale and closure. Parallel algorithms and high-end computational techniques have been the focus of several recent projects to develop state of the art climate models for parallel computers. He can be reached at Oak Ridge National Laboratory, PO Box 2008, Oak Ridge, Tennessee, 37831-6016, USA, Email: [email protected]. Michael W. Ham is a researcher at Oak Ridge National Laboratory in the Computer Science and Mathematics Division. He received a Bachelor of Science degree from The Pennsylvania State University and a Master of Science degree from the University of Illinois at Urbana-Champaign. He is interested in maximizing the performance and correctness of large scientific applications, particularly those related to weather and climate. He can be reached at Oak Ridge National Laboratory, PO Box 2008, Oak Ridge, TN 37831-6016, USA, Email:
[email protected]. Forrest Hoffman is a researcher at Oak Ridge National Laboratory where he holds joint appointments in the Computer Science & Mathematics and the Environmental Sciences Divisions. He received Master's and Bachelor's of Science degrees in physics from the University of Tennessee. Forrest performs computational science research in global climate, landscape ecology, and terrestrial biogeochemistry on Linux clusters as well as some of the world's largest supercomputers in the National Center for Computational Sciences at ORNL. He can be reached at Oak Ridge National Laboratory, P.O. Box 2008, MS 6008, Oak Ridge, TN 37831-6008, USA, Email:
[email protected]. Patrick H. Worley is a Senior Research Scientist in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. His research interests include parallel algorithm design and implementation, the performance evaluation of high performance computer systems, the performance evaluation and optimization of parallel scientific applications, and the numerical simulation of PDEs. Worley has a Ph.D. in computer science from Stanford University. He is a member of SIAM and the ACM. He can be reached at Oak Ridge National Laboratory, P.O. Box 2008, MS 6016, Oak Ridge, TN 37831-6016, USA, Email:
[email protected].
A UNIFORM MEMORY MODEL FOR DISTRIBUTED DATA OBJECTS ON PARALLEL ARCHITECTURES *
V. BALAJI†
Princeton University and NOAA/GFDL, PO Box 308, Princeton University, Princeton, NJ 08542, USA
ROBERT W. NUMRICH
Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455 USA and NASA Goddard Space Flight Center, Greenbelt, MD 20771 USA
Most modern architectures employ a hybrid memory model where memory is shared across some processors and distributed among others (clusters of SMPs). We describe a syntax for expressing memory dependencies in grid codes that takes the same form on distributed or shared memory, and can be optimally implemented on either, and on any hybrid layering thereof. This syntax can be applied to scalars, arrays or any other distributed data object. The syntax may be implemented on any of several current parallel programming standards, and also provides a way forward to future higher-level libraries.
1. Introduction

Modern scalable architectures employ a hybrid memory model, often referred to as clusters of shared-memory processors, where memory is shared across some processors and distributed among others. Currently, basic low-level communication and synchronization protocols on such machines are based on a few community standards (MPI, OpenMP) or platform-specific libraries such as the shmem library. While these protocols are useful on a variety of architectures, their low-level nature is widely recognized as a significant impediment to scientific productivity.
* 05 April 2005, version 2.0: papers/ecmwf2004/ecmwf2004-2.tex
† Corresponding author: [email protected]
The development of more natural and algorithm-friendly expressions of parallelism has been the focus of considerable research. Development of high-productivity programming paradigms is a central goal of the High-End Computing Revitalization Task Force (HECRTF) report (http://www.itrd.gov/hecrtf-outreach/20040112_cra_hecrtf_report.pdf), solicited by the U.S. National Science and Technology Council in March 2003 to determine the nation's high-end computing needs. Research in this area spans the range from the development of new high-level parallel languages (e.g., Carlson et al., 1999; Chamberlain et al., 2000; HPF Forum, 1997; Numrich and Reid, 1998; Supercomputing Technologies Group, 1998; Yelick et al., 1998), to general scientific computing libraries for parallel platforms (e.g., Balay et al., 2003; Blackford et al., 1996), to domain-specific frameworks in various fields such as the geosciences, plasma physics, and so on. We describe a uniform memory model for expressing memory dependencies in codes for weather, climate and ocean models. Our model addresses the central issue in parallel processing: avoiding memory race conditions with minimal time wasted waiting for signals. It is a high-level formalism that takes the same form on a distributed-memory architecture as it does on a shared-memory architecture. It is language-independent and hardware-independent, but it nonetheless allows specific implementations to take maximal advantage of underlying hardware primitives on specific architectures, including hybrid clusters of shared-memory processors. Our model works for grid codes that represent differential operators as finite-difference expressions on underlying coordinate grids. These codes exhibit two important properties that we call persistence and unison. By persistence we mean that concurrent execution threads allocate distributed data objects, and the lifetime of the threads is at least as long as the lifetime of the data objects. By unison we mean that the execution threads proceed roughly in tandem. Each thread executes independently of the others, but no thread gets too far ahead or too far behind, because it needs data from other threads before it can move forward. These two characteristics of grid codes allow us to develop a memory model that is less general than other models, since it applies to a restricted set of codes, but has the advantage that it has the same interface on all types of parallel architectures. Our uniform model is based on the underlying symmetry of the current communication protocols as they apply to this restricted set of grid codes. We show that all current communication protocols can be represented by this model. We define two memory access states, READ and WRITE, which
form a pair of orthogonal mutual exclusion locks. To manage the memory access states, we define three functions, REQUEST, REQUIRE and RELEASE. Only the REQUIRE function is blocking, to allow overlap of computation and communication. A thread requests or releases a memory access state without blocking and comes back later, after doing some other work, to require the access state. As an example, we apply our model to distributed arrays with halos. If one thread owns WRITE access to a distributed array, its neighbours cannot have READ access to their halo points, because those points may be updated by the thread that owns them. Conversely, if one thread owns READ access to its halo, its neighbours must wait for the array to be released before they can obtain WRITE access. We show that the rules implied by our model for writing distributed array loops are the same rules followed for vectorization. Vectorization rules are quite familiar within our community because earlier generations of weather, climate and ocean models relied very heavily on vector architectures. These architectures are well suited to the memory-intensive algorithms widely used in such models, where a stream of data is pipelined through sequences of computations. The decline in popularity of vector architecture is widely considered to be due to economic rather than technical considerations. Many researchers still believe it to be the ideal architecture for our models, pointing to the success of the Earth Simulator, which achieves resolution and throughput far above those of any other machine worldwide. Memory race conditions have always been difficult to detect and difficult to eliminate. It may at times be forgotten that shared-memory vector processors, such as the Cray-XMP through the Cray-T90 machines, had specialized hardware to deal with memory race conditions. These machines provided a very efficient hardware mechanism for resolving race conditions that does not exist on current machines. Be that as it may, vector architectures, even in their heyday, were well beyond the reach of most researchers. The transition to scalable architectures has the potential to increase scientific productivity, as it places the same infrastructure that is deployed at the high end within reach of university researchers with modest computing budgets. The downside of these architectures, however, is that they encourage a plethora of disparate approaches for resolving race conditions. We shall reveal the underlying symmetries that exist among these approaches. In Section 2, we introduce the abstractions of persistence and unison. In Section 3, we introduce the uniform memory model (UMM), a general
communication and synchronization model for distributed objects on parallel systems that subsumes the other current models. In Section 4, we use a simple example to illustrate the different ways to handle data dependencies in different models. In Section 5, we show how this model might be applied to the key example of distributed arrays. We present our conclusions in Section 6.

2. Persistence and unison
The uniform memory model is restricted to grid codes, very common in weather, climate and ocean models, which exhibit two important characteristics: persistence and unison. These two characteristics allow us to recognize common features in several disparate memory models and to subsume them into one uniform model. Without these two characteristics, memory models are necessarily more complicated, because they must be more general, and more difficult to implement, because they tend to be more dependent on specific architectures.

2.1. Persistence Property

The basic element in concurrency models is the notion of an execution thread (ET). To achieve parallel speedup, we want several ETs to operate concurrently. In classifying concurrency models, it is important to consider the temporal relationship between an ET and the data objects associated with it. There are two distinct semantics: one where an ET exists first, followed by creation of data objects, as shown at the top of Figure 1, and the other where the data objects exist first, followed by creation of an ET, as shown at the bottom of Figure 1. We call the first kind of thread a persistent execution thread (PET) and the second kind a transient execution thread (TET).

Figure 1. Persistent execution threads (PETs) at the top and transient execution threads (TETs) at the bottom. PETs exist prior to the allocation of the data objects they operate upon; TETs are created after. PETs and TETs may be layered upon each other (not shown).

We have chosen to frame our discussion around the PET abstraction. A TET can be layered on top of a PET within the user code. For example, the hybrid programming model (OpenMP+MPI) (e.g., Smith and Bull, 2001) can be thought of as PETs implemented as MPI processes and TETs implemented as the OpenMP threads launched within an MPI process. Allowing TETs would lead to added complications outside our uniform model. The persistence requirement is that a PET must have a lifetime at least as long as the data objects associated with it. For most current architectures it is very likely that one PET will be scheduled on each physical processor (PE), and thus one may find it conceptually easier to think of PETs as PEs. There is no requirement, however, that there be a one-to-one correspondence between PETs and PEs. It is possible that in the future there will be multi-threaded architectures where one PET might be active while another stalls for memory access on the same PE. The PET abstraction allows us to extend our model to such platforms. Persistence, not memory distribution, is what distinguishes different concurrency models. PETs with distinct address spaces can operate within a shared-memory arena, using low-latency communication primitives within that arena, as well as across distributed-memory arenas, using different communication primitives required for that arena. The MPP Kernels (Balaji et al., 2005b) use this mechanism extensively. This mechanism is also the basis for optimizations used in other programming models such as multilevel parallelism (MLP: see, e.g., Ciotti et al., 2000; Taft, 2001) and the ARMCI library of Nieplocha and Carpenter (1999).
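As an aside, the following minimal sketch (ours, not the paper's) shows one way the hybrid OpenMP+MPI layering just described can be expressed in C: each MPI process is a long-lived PET that allocates its data objects, while the OpenMP threads spawned inside it act as TETs. All file and variable names are illustrative.

```c
/* Hypothetical sketch: PETs as MPI processes, TETs as OpenMP threads.
 * Each MPI rank (PET) outlives the data it allocates; the OpenMP threads
 * (TETs) are created after the data exists and vanish when the parallel
 * region ends. Compile e.g. with: mpicc -fopenmp pet_tet.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank, n = 1000000;

    /* The PET (MPI process) is created first ... */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... then it allocates its part of a distributed data object. */
    double *a = malloc(n * sizeof(double));

    /* TETs: transient OpenMP threads operate on the PET's data. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = (double)rank + 0.001 * i;

    printf("PET %d done; up to %d OpenMP TETs were available\n",
           rank, omp_get_max_threads());

    free(a);        /* the data object dies before ...       */
    MPI_Finalize(); /* ... the PET does: persistence holds.  */
    return 0;
}
```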
2.2. Unison Property

Grid codes exhibit a property that we call the unison property. Each PET executes an instruction sequence to perform work on a subset of data objects. Unison requires that PETs execute roughly the same instruction sequence without getting too far out of kilter. Typically, PETs stay roughly together across each model time step. Although some PETs may have less work to do than others, they must wait for their neighbors to exchange data before they can proceed too far from one time step into the next. They are forced into unison at each time step by their need for new data from other PETs. This unison requirement can at times lead to imbalance in the work distribution among PETs. Because of the unison property, mechanisms for avoiding memory race conditions can be quite simple. Each PET can expect to receive a signal at a known spot in the time step that matches its own signal at another known spot in the step, either earlier or later. Low-level libraries allow arbitrary memory access patterns, without this unison assumption, to be expressible in the semantics. This generality is suitable at the level of communication and memory primitives, but higher-level libraries targeted at grid codes can work, within the constraint of unison, with a restricted semantics without a loss of functionality.

Some of our models are composed of multiple components, each of which is composed of a set of PETs running in unison within the component. This unison within components induces unison among components. Communication among components consists of data exchange through coupler components, which block at each exchange point to transfer data. For example, Figure 2 shows components C0 and C1 combining and sending data to C3 in a tripartite coupler C0 C1 → C3, which in turn sends data to C2 in another coupling event C3 → C2. We have shown component execution times as uneven to highlight the fact that idle time is incurred in a waiting event. Just how far the components can get out of kilter can be quite precisely defined, for this example, as being one coupling timestep. This issue is discussed in greater detail in Balaji et al. (2005a).

Figure 2. Concurrent components using the blocking model of communication. Note that because of uneven execution times, idle time is incurred at the coupling events, while waiting for data to arrive.

In principle we could go to a fully non-blocking model for component communication. In practice, for the class of applications we are dealing with (a coupled climate model being a typical instance) we find the blocking model to be adequate. Consider a simple example of two components, say an atmosphere and an ocean, stepping forward in time and providing each other with surface boundary conditions. These models can never be more than one coupling timestep out of synchronicity, as each model needs data from the other for the next step. This is true even if the data exchange uses a fully non-blocking communication model. Non-blocking calls are of great importance in parallel computing, and appeared in the shmem and MPI-1 libraries. Specialized hardware is required to take advantage of non-blocking communication. Some systems dedicate an entire extra processor just to handle communication. Some systems have independent network controllers, which control data flow between disjoint memories without involving the processors on which computation takes place, as shown in Figure 3. True non-blocking communication originally appeared on the Cray-T3E, where a special set of registers communicated directly with remote memory. Many architectures implement non-blocking communication calls as deferred communication, where a communication event is registered and queued, but only executed when the matching block is issued. Such systems are unable to realize the benefits of overlapping communication with computation. Note also in Figure 3 that caches are on the far side of memory in this schematic of tightly-coupled networks. This requires special cache-coherent protocols to ensure that the data state in cache is consistent with any inter-memory communication that may take place.

Figure 3. Commodity networks and LANs generally are loosely coupled, and the processor is required to generate traffic over the network, as shown on the left. Advanced architectures can have independent links between memory and network, as shown on the right. Non-blocking communication can show significant performance advantages on such tightly-coupled systems, if the code is structured to overlap communication and computation. Caches, shown here by C, add another complication on tightly-coupled systems: the system must make sure that the value transmitted from memory over the network is consistent with any cached value.

3. The Uniform Memory Model
Our uniform memory model (UMM) is a schematic method for describing memory dependencies in a neutral and uniform manner, with no explicit reference to the implementation language or the underlying communication protocol (Balaji et al., 2005c). Our model applies to distributed data objects.

Definition 3.1. A distributed data object is an object that assigns memory locations to persistent execution threads (PETs). The memory locations assigned to a PET are the union of sets of memory locations, which may
be (1) owned exclusively by the PET, (2) owned by another PET, or (3) shared with other PETs. Any or all of these three sets may be empty. Each set of memory locations assigned to a PET may be in one of two access states.
Definition 3.2. A PET with READ access for a set of memory locations may read its contents, even if it does not own the set, but cannot modify its contents. Multiple PETs may have simultaneous READ access to the same set.

Definition 3.3. A PET with WRITE access for a set of memory locations may modify its contents, even if it does not own it. No other PET can either read from or write to that set.

To obtain either READ or WRITE access to a set of memory locations, a PET must use the following access primitives.
Definition 3.4. REQUEST is a non-blocking call. A REQUEST for WRITE access for a set of memory locations must be posted, and fulfilled, before modifying the contents of the set. A REQUEST for READ access must be posted, and fulfilled, before reading the contents of a set of memory locations.

Definition 3.5. REQUIRE is a blocking call. A REQUIRE for READ or WRITE access follows a matching REQUEST. The PET waits until the REQUEST is fulfilled.

Definition 3.6. RELEASE is a non-blocking call. It follows a matching REQUIRE. A RELEASE of WRITE access must be posted immediately following completion of updates of a memory location. A RELEASE of READ access must be posted following completion of a computation that reads a memory location.

We have split the process of acquiring access to a set of memory locations into a REQUEST and REQUIRE phase so that a PET may perform other work, not dependent on the REQUEST being fulfilled, while waiting for access. Similarly, the RELEASE call returns without waiting for remote PETs to acknowledge the RELEASE. In the following sections, we show how to apply our uniform memory model to two specific examples. The first example, where the distributed object is a single variable, is the simplest possible case. The single variable
may be owned exclusively by one PET or, more likely, shared among PETs. Somewhat surprisingly, with this first example, we encounter the essence of all current parallel programming models: avoiding memory race conditions. The second example is a distributed array with halos. This object, along with higher dimensional analogs like distributed matrices, is a ubiquitous abstraction in parallel codes, especially in grid codes used in weather, climate and ocean models. In this case, the interior points of the array are owned exclusively by a single PET while the halo points may be owned exclusively by other PETs or shared among PETs.
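The paper defines these primitives abstractly and deliberately prescribes no API, but as a purely hypothetical illustration the two access states and three access functions might be declared in C along the following lines; every name below is our own invention.

```c
/* Hypothetical rendering of the UMM primitives; the paper specifies only
 * their semantics, not an interface, so all names here are illustrative. */
typedef enum { UMM_READ, UMM_WRITE } umm_access_t;

typedef struct umm_object umm_object_t;   /* a distributed data object */

/* Non-blocking: post a request for READ or WRITE access to (part of) obj. */
void umm_request(umm_object_t *obj, umm_access_t access);

/* Blocking: wait until a previously posted matching request is fulfilled.
 * Work not depending on obj can be scheduled between request and require. */
void umm_require(umm_object_t *obj, umm_access_t access);

/* Non-blocking: give up the access state once the update (WRITE) or the
 * computation that reads the data (READ) is complete.                    */
void umm_release(umm_object_t *obj, umm_access_t access);
```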
4. Memory race conditions in parallel programs
The central issue in parallel processing is the avoidance of memory race conditions with the least amount of time wasted waiting for signals. A race condition occurs when one concurrent execution thread writes to a memory location while another one is trying either to read it or to write it. In this section, we describe several approaches for dealing with race conditions and show how they fit into our uniform memory model. The simple computation shown in Figure 4 illustrates the problem. A single PET owns all three variables, a, b and c, and can read them or write them any time it pleases. In the traditional von Neumann model of computing, the PET fetches operands from memory to a register file R, performs the computation, perhaps holding intermediate results in the register file, and writes the result back to memory. The semantics of the programming language orders the operations so that, at the end of the computation, both a and b have the value 3.

Suppose now that it is expensive to compute the variables b and c and that they have no mutual dependencies. If two PETs share the same memory and each computes b and c independently, as shown in Figure 5, several issues arise. A minor issue is that the memory traffic is somewhat increased. In Figure 4, the values of b and c can stay in registers or in cache without updating the memory values. In Figure 5, the value of b must be transferred to memory by one PET where it can be read by the other PET. A major issue is the ownership of each variable and the READ and WRITE access for each variable. On shared-memory architectures, all variables are typically shared among PETs, while READ and WRITE access is exchanged among them to enforce the correct semantics.

Figure 4. Sequential processing: the operation a=b+c requires two words to be read from memory M into the registers R, to be computed on P0, and one word to be written back to M.
Figure 5 implies, for example, that P1, on the right, always holds WRITE access to a and c, while P0, on the left, alternately acquires and releases WRITE access to b so that P1 can acquire READ access, which it must then release back to P0. P0 signals P1 that its computation is done and that the result has been written to memory by the RELEASE of WRITE access. P1 signals P0 that it has obtained the new value of b by the RELEASE of READ access, so that P0 can again obtain WRITE access. If P1 reads b too soon, before it has been updated and stored to memory by P0, or too late, after P0 has updated its value a second time, the result on P1 will be incorrect.

Figure 5. Parallel processing: the operation a=b+c now splits the work between two processors. Total memory traffic is somewhat larger. The processor computing b must signal to the other when the computation is complete.
As outlined in Section 1, climate and weather modeling came of age in the era of shared-memory, parallel vector processors like the Cray-XMP, which evolved over two decades through the Cray-YMP, Cray-2, Cray-C90 and Cray-T90 as well as the Fujitsu and NEC machines. Cray parallel compilers used shared memory directives, embedded within comments in the code, to signal state between processors. These directives, taking the form of compiler hints, have since been standardized as the OpenMP protocol (Chandra et al., 2001). It has perhaps been forgotten that these machines contained efficient hardware mechanisms for signaling state through shared semaphores and shared registers. As shown in Figure 6(1), for example, a mutual exclusion lock, associated with the variable b, can be held by one PET at a time and prohibits other PETs from reading it or writing it. The lock is passed twice, once for P0 to inform P1 that the data is available, and once again for P1 to inform P0 that it has been successfully consumed. In terms of our uniform model, the bullet on the line between the two PETs corresponds to a REQUEST-REQUIRE call by the PET with the bullet.
Figure 6. Parallel processing in shared memory and distributed memory. The computations b=1 and c=2 are concurrent, and their order in time cannot be predicted. (1) In shared-memory processing, mutex locks are used to ensure that b=1 is complete before P1 computes a=b+c, and that this step is complete before P0 further updates b. (2) In distributed-memory processing, each PE retains an independent copy of b, which is exchanged in paired send/receive calls. Only after the transmission is complete is P0 free to update its value of b. The bullet in these and subsequent pictures indicates the PET which makes the call shown.
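As a purely illustrative rendering of the distributed-memory half of Figure 6, panel (2) in the caption above and discussed in more detail in the text that follows, the paired send/receive exchange might look like this in MPI C; the acknowledgement message plays the role of handing the lock back, and all names here are our own.

```c
/* Hypothetical MPI sketch of the Figure 6(2) exchange: rank 0 plays P0,
 * rank 1 plays P1. The second message acts as the returned "lock": only
 * after receiving it may P0 update b again. Run with two MPI processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, a = 0, b = 0, c = 0, ack = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                 /* P0: produces b                    */
        b = 1;
        MPI_Send(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* release b    */
        MPI_Recv(&ack, 1, MPI_INT, 1, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE); /* wait until P1 has consumed b      */
        b = 3;                       /* now safe to update b again        */
    } else if (rank == 1) {          /* P1: consumes b                    */
        c = 2;
        MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE); /* acquire the value of b            */
        a = b + c;
        MPI_Send(&ack, 1, MPI_INT, 0, 1, MPI_COMM_WORLD); /* hand it back */
        printf("P1: a = %d\n", a);   /* expect 3                          */
    }

    MPI_Finalize();
    return 0;
}
```

The non-blocking refinement of Figure 7(2) would replace MPI_Send/MPI_Recv with MPI_Isend/MPI_Irecv followed by MPI_Wait, allowing each rank to do unrelated work between posting and completing the transfer.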
Physical limitations on the number of processors that can effectively share a memory bus or a set of specialized semaphores and registers hampered early attempts at shared-memory multiprocessing. Advances in networking led to distributed-memory architectures where independent computers communicate over a network. Each execution stream has its own independent memory, and shared variables must be passed explicitly from one memory to another. This protocol is now standardized in the Message-Passing Interface (MPI: Gropp et al., 1999), the canonical interface for distributed memory. As shown in Figure 6(2), P0 and P1 share the variable b but have their own independent copies. They exchange values through paired send/receive calls, which act as a lock exchange. P0 releases WRITE access to its copy of b by the send(b). Until P0 knows that P1 has received a copy of b, it is forbidden to change the value of its own copy. P0 blocks at the send(b) call, as a REQUIRE for WRITE access, until it receives a signal from P1 that it has updated its copy of b. The return from send(b) acts as a REQUIRE for WRITE access. At the same time, P1 blocks at the recv(b) call, as a REQUIRE for WRITE access to its copy of b. The return from recv(b) acts as a RELEASE of WRITE access.

Figure 6 reveals a symmetry between the two protocols. In both cases, race conditions are avoided and unison is enforced, as a necessary side effect, by exchanging a lock, in a shared-memory machine, or by exchanging the value of a variable, in a distributed-memory machine. A single semantics works for both. A send/receive pair on a shared-memory machine might be implemented as a lock exchange. A lock exchange on a distributed-memory machine might be implemented as a send/receive pair. The important point in either case is that the PETs are forced into unison by the need to avoid a race condition.

The single lock model shown here is too restrictive. The lock acquired for reading also locks out other readers, whereas only writers need to be locked out. A refined model has separate locks for reading and writing. A PET owning a write-lock prevents other PETs from reading a variable; conversely, a PET owning a read-lock prevents other PETs from writing to it. Figure 7(1) shows this protocol. Note that P0 never relinquishes the read-lock, so that P1 is never permitted to update b. Similarly, we may refine the simple send/receive model of Figure 6(2) by introducing non-blocking calls. Non-blocking versions of send and receive are termed isend and irecv in MPI-1 terminology. Figure 7(2) shows the same example implemented in non-blocking communication.
Figure 7. Parallel processing: the full synchronization model applied to shared-memory protocols and message-passing. Note that in shared memory, P0 never relinquishes the write-lock. In message-passing, this is done using non-blocking isend and irecv operations.
P1 issues a non-blocking irecv as early as possible in the calling sequence, that is, a REQUEST for READ access, but it is prohibited from actually reading b until the block is lifted by the return from the wait call. P0 issues a non-blocking isend, that is, a RELEASE of WRITE access, but is forbidden to update b until the block is lifted.

An alternative to the message-passing protocol is Remote Memory Access (RMA), a mechanism for direct loads and stores to remote memory, as shown in Figure 8. Originally developed for the Cray-T3D/T3E, the simplest RMA idiom is the shmem library (Barriuso and Knies, 1994; Quadrics, 2001). It is also represented in language extensions such as F-- (Numrich, 1997), Co-Array Fortran (Numrich and Reid, 1998), and UPC (Carlson et al., 1999). The MPI library has been extended (MPI-2: Gropp et al., 1998) to include the RMA calls MPI_Put and MPI_Get. The ARMCI library of Nieplocha and Carpenter (1999) is a promising new native implementation of RMA on a variety of platforms, similar in spirit to shmem, which can be thought of as a universally portable implementation. The name one-sided message passing is often applied to RMA, as opposed to two-sided message passing in the MPI model, but this is a misleading term. Although the shmem library allows one PET to read data from or write data to another PET any time it likes, there must be some form of synchronization between PETs to avoid race conditions. In the shmem model, this synchronization often takes the form of a full barrier among PETs. Since this mechanism may be expensive and lead to work-load imbalance, synchronization sometimes takes the form of a notify/wait protocol that exchanges flags among PETs. At other times, PETs exchange data through extra memory buffers, which must be managed by the programmer. In the MPI-2 model, instead of paired send/receive calls, we now have transmission events on one side (put, get) paired with exposure events (start, wait) and (post, complete), respectively, on the other side. It is thus still "two-sided". A variable exposed for a remote get may not be written to by the PE that owns it; and a variable exposed for a remote put may not be read. Note that the get may be placed early; it will be consummated by P0 when its computation of b is complete. Each of the three models of concurrent access to variables in memory needs to be further refined. This is easiest to see in the RMA model, in thinking about exposure events. Assume that the hardware allows direct stores to remote memory, so that the put in Figure 8(1) can be directly accomplished. How does P0 know that it is safe to perform a put, which requires that P1 not read b until the put is complete? In phrasing the question thus, the answer is obvious: an exposure event must have a beginning as well as an end. In MPI-2, a PE can declare a variable open to remote put by a start operation, and open to a remote get by a post operation, as shown in Figure 8. Reading a variable exposed for a remote put, or writing to a variable exposed for a remote get, can lead to undefined behaviour.
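For concreteness, here is one hedged sketch of how exposure and transmission events can pair up using MPI-2's general active-target synchronization: rank 1 exposes its copy of b with post/wait while rank 0 transmits with start/put/complete. This is our illustration, not code from the paper, and all variable names are ours.

```c
/* Hypothetical sketch of an MPI-2 exposure/transmission pairing:
 * rank 0 (origin, "P0") puts its value of b into rank 1's window;
 * rank 1 (target, "P1") exposes b with post/wait. Run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, a = 0, b = 0, c = 0;
    MPI_Win win;
    MPI_Group world_grp, peer_grp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    /* Every rank exposes its local b in a window (collective call). */
    MPI_Win_create(&b, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (rank == 0) {                     /* origin: plays P0             */
        int peer = 1;
        MPI_Group_incl(world_grp, 1, &peer, &peer_grp);
        b = 1;                           /* compute b                    */
        MPI_Win_start(peer_grp, 0, win); /* access epoch to rank 1       */
        MPI_Put(&b, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);           /* transmission done            */
        b = 3;                           /* safe to update local b again */
    } else if (rank == 1) {              /* target: plays P1             */
        int peer = 0;
        MPI_Group_incl(world_grp, 1, &peer, &peer_grp);
        c = 2;
        MPI_Win_post(peer_grp, 0, win);  /* expose b for the remote put  */
        MPI_Win_wait(win);               /* put has landed; safe to read */
        a = b + c;
        printf("rank 1: a = %d\n", a);   /* expect 3                     */
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```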
Figure 8. Parallel processing: comparison of signals for RMA put and get. Now, in addition, P1 signals its readiness for a put by exposing a memory window using a start call. If instead it is P1 issuing a get, P0 indicates data availability by exposing its memory in a post operation.
There are several subtleties worth underlining. In Figure 8(1), P1 begins its exposure to receive b even before executing c=2. This is a key optimization in parallel processing, overlapping computation with communication. Similarly, in Figure 8(1) the order of operations is not the same as in Figure 8(2). P1 issues a get early; it cannot be fulfilled until P0 issues a matching post. The get call returns program control and P1 continues execution before the get has been fulfilled. This is an example of a non-blocking call.

We have outlined mechanisms to share the contents of a variable using shared memory, message-passing and RMA protocols. The complete model separates the request for a variable from the requirement that it be available. As described above, this permits overlapping of computation with communication: work that does not depend on the requested variable may be scheduled even when a posted request is unfulfilled. This is exactly analogous to the pre-fetch optimization of a memory load: when compilers encounter the statement a=b, they may attempt to request b from memory as early as possible, but no earlier than the last modification of b.

5. Distributed arrays
A key feature of our uniform memory model is that it can be applied to arbitrarily complicated data objects, not just to single variables. For example, we can define UMM semantics directly for a distributed array where PETs own different regions of an array, an almost universal abstraction in parallel processing. We describe how to do this using, as a minimal example code, a one-dimensional shallow water model (see, e.g., Haltiner and Williams, 1980):
∂u/∂t = -g ∂h/∂x,    ∂h/∂t = -H ∂u/∂x        (1)

Some discretized forms of equation (1), the details of which are unimportant for this discussion, take reasonably simple schematic forms. A forward-backward shallow water code might look like:
h_i^{t+1} = h( h_i^t, u_i^t, u_{i-1}^t, u_{i+1}^t )
u_i^{t+1} = u( u_i^t, h_{i-1}^{t+1}, h_{i+1}^{t+1} )        (2)

and might take this simple form in pseudocode:
BEGIN TIME LOOP
  h(i) = h(i) - (0.5*H*dt/dx)*( u(i+1) - u(i-1) )   FORALL i
  u(i) = u(i) - (0.5*g*dt/dx)*( h(i+1) - h(i-1) )   FORALL i
END TIME LOOP
This schematic form is sufficient for an analysis of data dependencies between distributed arrays, and forms the basis for the subsequent discussion.

5.1. Memory allocation
The first step in a discrete computation of equation (2) on a distributed set of PETs is to allocate memory for a distributed array:
h = makeDistributedArray( nx, np, -w, +w )        (3)
where we have requested a discretization on nx points, distributed across np PETs, with data dependencies of -w and +w, that is, extending w points below the local computational domain and w points above. On each PET the procedure makeDistributedArray returns a pointer h(s-w:e+w), using Fortran notation for clarity, to the local portion of the distributed array. The index s=lbound(h)+w marks the start of the local computational domain and the index e=ubound(h)-w marks the end of the local computational domain. We briefly describe how this memory allocation might be implemented on different architectures. On pure distributed memory, each PET gets an array h(s-w:e+w), where the extra points on either side store a local copy of data from neighbouring PETs (the "halo"). The user performs operations ("halo updates") at appropriate times to ensure that the halo contains a correct copy of the remote data. On flat shared memory, one PET requests memory for all nx points while the others wait. Each PET is given a pointer into its local portion of this single array. On distributed shared memory (DSM or ccNUMA: Lenoski and Weber, 1995) PETs can share an address space beyond the flat shared portion. This means access is non-uniform: the processor-to-memory distance varies across the address space and this variation must be taken into account in coding for performance. Most modern scalable systems fall into the DSM category. We call the set of PETs sharing flat (uniform) access to a block of physical memory an mNode, and the set of PETs sharing an address space an aNode. Whether it is optimal to treat an entire aNode as shared memory, or distributed memory, or some tunable combination thereof depends on the platform (hardware), the efficiency of the underlying parallelism semantics for shared and distributed memory (software), and even the problem size. Memory request procedures must retain the flexibility to optimize distributed array allocation under a variety of circumstances.
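As an illustration of the pure distributed-memory case only (our sketch; the paper prescribes no implementation and all names below are invented), a makeDistributedArray-like routine might compute the local extent with a simple block decomposition and pad it with w halo points on each side:

```c
/* Hypothetical C sketch of the distributed-memory allocation described above:
 * nx global points are split across np PETs, and each PET allocates its local
 * slice plus w halo points on either side. Names follow the text loosely.   */
#include <stdlib.h>

typedef struct {
    double *data;   /* storage for h(s-w : e+w), flattened to 0-based C      */
    int s, e;       /* global indices of the local computational domain      */
    int w;          /* halo width                                            */
} dist_array;

dist_array make_distributed_array(int nx, int np, int pet, int w)
{
    dist_array h;
    int chunk = (nx + np - 1) / np;          /* simple block decomposition   */
    h.s = pet * chunk + 1;                   /* 1-based, as in the Fortran   */
    h.e = (pet + 1) * chunk;                 /* notation used in the text    */
    if (h.e > nx) h.e = nx;
    h.w = w;
    /* allocate interior plus w halo points on each side */
    h.data = calloc((size_t)(h.e - h.s + 1 + 2 * w), sizeof(double));
    return h;
}

/* Index helper: map a global index i in [s-w, e+w] to the local buffer. */
static inline double *elem(dist_array *h, int i)
{
    return &h->data[i - (h->s - h->w)];
}
```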
5.2. Memory access
The array pointer h and the indices s and e offer a uniform semantic view into a distributed array, with the following access rules:

- Each array element h(s:e) is exclusively "owned" by one and only one PET.
- A PET always has READ access to the array elements h(s:e), which it owns.
- A PET can request READ access to its halos, h(s-w:s-1) and h(e+1:e+w).
- A PET can acquire WRITE access to the array elements h(s:e). No other PET can ever acquire WRITE access to this portion of the distributed array.
- A PET never has WRITE access to its halos, h(s-w:s-1) and h(e+1:e+w).
In practice, it is usually incumbent on the user to exercise restraint in writing outside the computational domain. There may be no practical means to enforce the last rule. Also, there may be occasions when it is advantageous to perform some computations in the halo locally, rather than seeking to share remote data. But for this discussion we assume the last rule is never violated. Under these rules, we extend the access states and access operations, which we defined earlier for a single variable, to apply to a single object called a DistributedArray. These access semantics are the key step in defining a uniform programming interface for distributed and shared memory, and we urge close reading. We again define two access states:

- READ access for a DistributedArray always refers to the halo points. READ access to interior points is always granted.
- WRITE access always refers to the interior points. It is never granted for the halo points.
We also define three access operations:

- A REQUEST for WRITE access to a DistributedArray must be posted, and fulfilled, before modifying the array contents, i.e., before it appears on the LHS of an equation. A REQUEST for READ access must be posted, and fulfilled, before an array's halo appears on the RHS of an equation.
- A REQUIRE for READ or WRITE access follows a matching REQUEST.
- A RELEASE of WRITE access must be posted immediately following completion of updates of the computational domain of an array. A RELEASE of READ access must be posted following completion of a computation using an array's halo points.
Only the REQUIRE call is blocking; REQUEST and RELEASE are non-blocking. These rules are specialized for arrays representing grid fields, where data dependencies are regular and predictable. They constitute a vast simplification over rules developed for completely general and unpredictable data access, where read and write permission require more complicated semantics, as one sees for example in pthreads or any other general multithreading library.

We summarize the main points of our model. Each array element is owned by a PET, which is the only one that can ever have WRITE access to it. WRITE access on a PET is to the interior elements of the array and READ access is to the halo. It is taken for granted that WRITE access to the halo is always forbidden, and READ access to the interior is always permitted. Simultaneous READ and WRITE access to an array is never permitted. If a PET has WRITE access to its computational domain, its neighbours will have WRITE access to its halo. Hence, a PET cannot have READ access to its halos when it has WRITE access to its interior points. To put it another way, a loop of the following form

  h(i) = a*( h(i+1) - h(i-1) )   FORALL i        (4)

is not well-defined on distributed arrays. This is not an onerous restriction. As is easily seen, this is the same rule that must be obeyed to avoid vector dependencies. The rules for writing distributed array loops under the model outlined here are those that must be followed for vectorization as well. READ access and WRITE access form a pair of orthogonal mutual exclusion locks. When a PET owns WRITE access to a DistributedArray, its neighbours cannot have READ access to their halo points, because they may be updated by the PET that owns them. Conversely, when a PET owns READ access to its halo, its neighbours must wait for the array to be released before they can get WRITE access. At this point in the development of the argument, we return to the unison assumption introduced in Section 2.2. Each distributed array goes through an access cycle, typically once per iteration. The unison assumption requires that no PET can ever be more than one access cycle ahead
in the instruction sequence of any other PET with which it has a data dependency. The following code block illustrates how our uniform memory model might be applied to the shallow water equations.
BEGIN TIME LOOP
  RequireREAD( u )
  RequireWRITE( h )
  h(i) = h(i) - (0.5*H*dt/dx)*( u(i+1) - u(i-1) )   FORALL i
  Release( u )
  Release( h )
  RequestREAD( h )
  RequestWRITE( u )

  RequireREAD( h )
  RequireWRITE( u )
  u(i) = u(i) - (0.5*g*dt/dx)*( h(i+1) - h(i-1) )   FORALL i
  Release( h )
  Release( u )
  RequestREAD( u )
  RequestWRITE( h )
END TIME LOOP
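On distributed memory, one plausible (and purely illustrative) realization of the RequestREAD/RequireREAD pair for u is a halo exchange: RequestREAD(u) posts the receives and sends for the halo points, and RequireREAD(u) completes them before the h update reads u(i-1) and u(i+1). A minimal MPI sketch, with hypothetical names and a simple one-dimensional decomposition, follows.

```c
/* Hypothetical MPI halo exchange standing in for RequestREAD/RequireREAD on
 * the distributed array u(s-w:e+w) with w = 1. The neighbour ranks and local
 * buffer layout are illustrative; u[0] and u[n+1] are the halo points, and
 * MPI_PROC_NULL can be passed for left/right at the physical boundaries.   */
#include <mpi.h>

/* Post the exchange: the non-blocking analogue of RequestREAD(u). */
static void request_read_halo(double *u, int n, int left, int right,
                              MPI_Comm comm, MPI_Request req[4])
{
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, comm, &req[0]); /* left halo   */
    MPI_Irecv(&u[n + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]); /* right halo  */
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, comm, &req[2]); /* my first pt */
    MPI_Isend(&u[n],     1, MPI_DOUBLE, right, 0, comm, &req[3]); /* my last pt  */
}

/* Block until the halo is valid: the analogue of RequireREAD(u). */
static void require_read_halo(MPI_Request req[4])
{
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}

/* Usage inside the time loop (schematically):
 *   request_read_halo(u, n, left, right, comm, req);   ... other work ...
 *   require_read_halo(req);
 *   for (int i = 1; i <= n; i++)
 *       h[i] -= 0.5 * H * dt / dx * (u[i + 1] - u[i - 1]);
 *   // Release(u) has no explicit MPI counterpart in this simple sketch;
 *   // the next exchange plays that role.                                  */
```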
6. Summary

This article reviews several parallel programming models and demonstrates their essential unity. All memory models require semantics to know when a load from memory or a store to memory is allowed. We have extended this to include remote memory. Our access semantics are couched entirely in terms of data dependencies. We have shown that the implied high-level syntax can be implemented in existing distributed-memory and shared-memory programming models.
While the uniform memory model of Section 3 is stated in terms of a single memory location, it can be applied to any distributed data object, as we demonstrated by extending the approach to arrays in Section 5. Distributed arrays have always been a topic of interest in scalable programming. Language extensions like Co-Array Fortran (Numrich and Reid, 1998) propose this as an essential extension to programming languages, and general-purpose class libraries for arrays and matrices have been built on this basis (Numrich, 2005). The Earth System Modeling Framework (ESMF: Hill et al., 2004) also proposes a general distributed array class for grid codes. We have demonstrated a general method to describe data dependencies in distributed arrays that is agnostic about the underlying transport, but is described entirely in terms of the algorithmic stencils. For typical (e.g., nearest-neighbour) data dependencies, we show that the coding restrictions imposed by the memory model are exactly equivalent to those needed to ensure vectorization of loops. However, the model can be generalized to arbitrarily complex data dependencies, such as those resulting from arrays discretized on unstructured grids. High-level expressions of distributed data objects, we believe, are the way forward: they allow scientists and algorithm developers to describe data dependencies in a natural way, while at the same time allowing scalable library developers considerable breadth of expression of parallelism on conventional and novel parallel architectures.
Acknowledgements

Balaji is funded by the Cooperative Institute for Climate Science (CICS) under award number NA17RJ2612 from the National Oceanic and Atmospheric Administration, U.S. Department of Commerce. The statements, findings, conclusions, and recommendations are those of the author and do not necessarily reflect the views of the National Oceanic and Atmospheric Administration or the Department of Commerce. Numrich is supported in part by grant DE-FC02-01ER25505 from the U.S. Department of Energy as part of the Center for Programming Models for Scalable Parallel Computing and by grant DE-FG02-04ER25629 as part of the Petascale Application Development Analysis Project, both sponsored by the Office of Science. He is also supported in part by the NASA Goddard Earth Sciences and Technology Center, where he holds an appointment as a Goddard Visiting Fellow for the years 2003-2005.
References
Balaji, V., J. Anderson, I. Held, Z. Liang, S. Malyshev, R. Stouffer, M. Winton, and B. Wyman, 2005a: FMS: the GFDL Flexible Modeling System: Coupling algorithms for parallel architectures. Mon. Wea. Rev., in preparation.
Balaji, V., J. Anderson, and the FMS Development Team, 2005b: FMS: the GFDL Flexible Modeling System. Part II. The MPP parallel infrastructure. Mon. Wea. Rev., in preparation.
Balaji, V., T. L. Clune, R. W. Numrich, and B. T. Womack, 2005c: An architectural design pattern for problem decomposition. Workshop on Patterns in High Performance Computing.
Balay, S., W. D. Gropp, L. C. McInnes, and B. F. Smith, 2003: Software for the scalable solution of partial differential equations. Sourcebook of Parallel Computing, Morgan Kaufmann Publishers Inc., pp. 621-647.
Barriuso, R., and A. Knies, 1994: SHMEM User's Guide: SN-2516. Cray Research Inc.
Blackford, L. S., et al., 1996: ScaLAPACK: A portable linear algebra library for distributed memory computers: design issues and performance. Proceedings of SC'96, p. 5.
Carlson, W. W., J. M. Draper, D. E. Culler, K. Yelick, E. Brooks, and K. Warren, 1999: Introduction to UPC and language specification. Center for Computing Sciences, Tech. Rep. CCS-TR-99-157, http://www.super.org/upc/.
Chamberlain, B. L., S.-E. Choi, E. C. Lewis, C. Lin, L. Snyder, and D. Weathersby, 2000: ZPL: A machine independent programming language for parallel computers. IEEE Transactions on Software Engineering, 26(3), 197-211.
Chandra, R., R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald, 2001: Parallel Programming in OpenMP. Morgan-Kaufmann, Inc.
Ciotti, R. B., J. R. Taft, and J. Petersohn, 2000: Early experiences with the 512 processor single system image Origin2000. Proceedings of the 42nd International Cray User Group Conference, SUMMIT 2000, Cray User Group, Noordwijk, The Netherlands.
Gropp, W., S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, and M. Snir, 1998: MPI: The Complete Reference. The MPI-2 Extensions, vol. 2, MIT Press.
Gropp, W., E. Lusk, and A. Skjellum, 1999: Using MPI: Portable Parallel Programming with the Message Passing Interface. 2nd ed., MIT Press.
Haltiner, G. J., and R. T. Williams, 1980: Numerical Prediction and Dynamic Meteorology. John Wiley and Sons, New York.
Hill, C., C. DeLuca, V. Balaji, M. Suarez, A. da Silva, and the ESMF Joint Specification Team, 2004: The Architecture of the Earth System Modeling Framework. Computing in Science and Engineering, 6(1), 1-6.
HPF Forum, 1997: High Performance Fortran Language Specification V2.0.
Lenoski, D. E., and W.-D. Weber, 1995: Scalable Shared-Memory Multiprocessing. Morgan Kaufmann.
Nieplocha, J., and B. Carpenter, 1999: ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. Lecture Notes in Computer Science, 1586, 533-546.
Numrich, R. W., 1997: F--: A parallel extension to Cray Fortran. Scientific Programming, 6(3), 275-284.
Numrich, R. W., 2005: Parallel numerical algorithms based on tensor notation and Co-Array Fortran syntax. Parallel Computing, in press.
Numrich, R. W., and J. K. Reid, 1998: Co-array Fortran for parallel programming. ACM Fortran Forum, 17(2), 1-31.
Quadrics, 2001: Shmem Programming Manual, Quadrics Supercomputing World Ltd.
Smith, L., and M. Bull, 2001: Development of mixed mode MPI/OpenMP applications. Scientific Programming, 9(2-3), 83-98. Presented at Workshop on OpenMP Applications and Tools (WOMPAT 2000), San Diego, Calif., July 6-7, 2000.
Supercomputing Technologies Group, 1998: Cilk 5.2 Reference Manual, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
Taft, J. R., 2001: Achieving 60 GFLOP/s on the production code OVERFLOW-MLP. Parallel Computing, 27, 521-536.
Yelick, K., et al., 1998: Titanium: A high-performance Java dialect. Proceedings of the Workshop on Java for High-Performance Network Computing, Stanford, California.
PANEL EXPERIENCE ON USING HIGH PERFORMANCE COMPUTING IN METEOROLOGY: SUMMARY OF THE DISCUSSION

GEORGE MOZDZYNSKI
European Centre for Medium-Range Weather Forecasts, Shinfield Park, Reading, Berkshire, RG2 9AX, U.K.
E-mail: [email protected]
As has become customary at these workshops, the final session was dedicated to a panel where brief statements raising fundamental or controversial issues were made to open the discussions.

New Languages and New Hardware Features

The following statements introduced this topic:
- Many of the sites with large application codes have tried for years to keep codes highly portable (standard FORTRAN, MPI, OpenMP).
- There have been plenty of examples where the indiscriminate use of vendor-specific features stored up problems for the future.
- The academic argument for better languages (programming paradigms) is compelling; but what would it take before any of the major meteorological codes will be developed in one?
- There are promising ideas on the HW front (PIM, vector features in the router chip, etc.): will these allow new programming paradigms to make a breakthrough?
A user representative started the discussion by commenting that the stability of their application codes does not stress manufacturers and that only new applications would do this. He added that the cost of developing new applications is a major concern, as was portability. Another user expressed that his organisation could see opportunities for systems with 3-4 orders of magnitude more computing capability and suggested that commodity clusters could play a part in realising this. A vector manufacturer representative commented that the more sites employ these clusters, the less manufacturers can provide for the future, as development budgets would be reduced.
The discussion moved on to the Top500 report and the use of Linpack to rank performance. Some user representatives commented that the Linpack benchmark was not representative of their applications. A vector manufacturer representative stated that their vector processor was about 10x faster than some scalar processors but that it was also about 10x more expensive. He added that manufacturers are more interested in strong new science on systems than in Linpack. Somebody suggested that the NWP community should provide its own report based on meteorological applications. A user representative added that his organisation did not buy systems based on their Top500 ranking and that price/performance is the most important consideration. He added that we could judge this aspect by the number and size of contracts awarded to manufacturers. Would the meteorological community be willing to develop new codes to realise performance increases of between 2 and 3 orders of magnitude? A hardware expert stated that reconfigurable logic could provide a 2 orders of magnitude increase in performance and that this was not an issue of cost. A vector manufacturer representative commented that climate/NWP codes are very diverse and do not have any significant peaks that could be handled by special hardware. A user representative endorsed this argument, and commented that his organisation's current NWP application started development in 1987 and added that this application has been in use longer than a number of computer architectures. A vendor representative stated that most NWP applications he had experience of were simply not large enough to scale well on today's supercomputers.
Another user saw a need for running NWP applications on 100 nodes today to be followed by many ensemble runs to study small scales.
Linux Clusters

The following statements introduced this topic:

- Small scale clusters seem to work well.
- Large clusters for time-critical operations?
  - Maintenance: have your own kernel programmers?
  - Applications: what can we not run on this type of system?
It was broadly agreed that the Linux O/S was as reliable as any other operating system and, given the number of systems installed, Linux was probably one of the best supported. A user commented that in his experience Linux clusters require more support than vendor-supplied systems, as the site is normally responsible for maintaining facilities such as compilers, MPI libraries and batch sub-systems. The user added that there was no obvious need for kernel programmers for small clusters. A user commented that the rate of development for Linux clusters could be viewed as a problem if you want to upgrade your system. He added that it was probably more cost effective to totally replace a cluster (e.g. faster clock, more memory) after a couple of years than to upgrade the existing system. The chairman asked the question: does the NWP community have applications that cannot be satisfied by Linux clusters? There was no clear answer. A user representative said that I/O is a problem for his NWP applications. He added that Linux clusters would require installation of parallel file system software such as Lustre to get acceptable performance. A user asked whether the success of Linux clusters could be viewed as a consequence of the fact that the HPC market is too small to cover the development costs of special hardware. A user representative replied that future model and data assimilation workloads might still need to run on special hardware to meet operational deadlines. A hardware expert expressed the opinion that government should help to fund the development of new technologies. As an example he stated that early work on Beowulf clusters (leading to Linux clusters) was government funded.
Frameworks

A user introduced this topic by expressing concern about the effort required to support ESMF in their NWP model. He said that he expected some loss of model efficiency to support ESMF and asked what should be considered acceptable. An ESMF expert suggested that the number of lines of code involved would most likely be just a few hundred and that this should have a negligible impact on performance. Another user commented that frameworks are a necessary component for coupling models, as software such as GRIB is not that portable. There was some disagreement on how intrusive frameworks would be in this model. However, it was agreed that this could provide a useful case study and a topic for the next workshop at ECMWF in 2006.
LIST OF PARTICIPANTS
Mr. Eugenio Almeida
INPE/CPTEC, Rod. Presidente Dutra, km 39, Cachoeira Paulista, SP, CEP 12630-000, Brazil [email protected]
Dr. Mike Ashworth
CCLRC Daresbury Laboratory, Warrington, WA4 3BZ, United Kingdom [email protected]
Dr. Venkatramani Balaji
Princeton University/GFDL, PO Box 308, Princeton, N.J. 08542, USA [email protected]
Mr. Stuart Bell
Met Office, FitzRoy Road, Exeter, EX1 3PB, United Kingdom [email protected]
Dr. Ilia Bermous
Australian Bureau of Meteorology, GPO Box 1289K, Melbourne, VIC 3001, Australia [email protected]
Mr. Tom Bettge
NCAR, 1850 Table Mesa Drive, Boulder, CO 80305, USA [email protected]
Mr. David Blaskovich
Deep Computing Group, IBM, 519 Crocker Ave., Pacific Grove, CA 93950, USA [email protected]
Mr. Jan Boerhout
NEC HPC Europe, Antareslaan 65, 2132 JE Hoofddorp, Netherlands [email protected]
Dr. Ashwini Bohra
National Centre for Medium-Range Weather Forecasts, A-50, Sector-62, Phase II, Noida-201307, U.P., India [email protected]
Mr. Reinhard Budich
Max Planck Institute for Meteorology, 20146 Hamburg, Germany [email protected]
Mr. Arry Carlos Buss Filho
INPE/CPTEC, Rod. Pres. Dutra, km 40, Cachoeira Paulista - SP, Brazil [email protected]
Dr. Ilene Carpenter
SGI, 2750 Blue Water Rd., Eagan, MN 55121, USA [email protected]
Mr. George Carr Jr.
NCAR/CGD, PO Box 3000, Boulder, CO 80307-3000, USA [email protected]
Dr. Bob Carruthers
Cray (UK) Ltd., 2 Brewery Court, High Street, Theale, Reading RG7 5AH, United Kingdom [email protected]
Mrs. Zoe Chaplin
University of Manchester, Manchester Computing, Oxford Road, Manchester, M13 9PL, United Kingdom [email protected]
Mr. Peter Chen
World Meteorological Organization, 7 bis Ave. de la Paix, 1211 Geneva, Switzerland [email protected]
Dr. Gerardo Cisneros
Silicon Graphics, SA de CV, Av. Vasco de Quiroga 3000, P.O.-1A, Col. Santa Fe, 01210 Mexico, DF, Mexico [email protected]
Dr. Thomas Clune
NASA/Goddard Space Flight Center, Code 931, Greenbelt, MD 20771, USA [email protected]
Dr. Herbert Cornelius
Intel GmbH, Dornacher Str. 1, 85622 Feldkirchen/Munchen, Germany [email protected]
Mr. François Courteille
NEC HPC Europe, "Le Saturne", 3 Parc Ariane, 78284 Guyancourt Cedex, France [email protected]
Mr. Jason Cowling
Fujitsu Systems Europe, Hayes Park Central, Hayes End Road, Hayes, Middx., UB4 8FE, United Kingdom [email protected]
Ms. Marijana Crepulja
Rep. Hydrometeorological Service of Serbia, Kneza Viseslava 66, 11030 Belgrade, Serbia and Montenegro [email protected]
Mr. Jacques David
DSI, CEA-Saclay, 91191 Gif sur Yvette Cedex, France [email protected]
Mr. David Dent
NEC High Performance Computing Europe, Pinewood, Chineham Business Park, Basingstoke, RG24 8AL, United Kingdom [email protected]
Mr. Michel Desgagné
Environment Canada, 2121 North Service Road, Trans-Canada Highway, Suite 522, Dorval, Quebec, Canada H9P 1J3 [email protected]
Mr. Martin Dillmann
EUMETSAT, Am Kavalleriesand 31, 64295 Darmstadt, Germany [email protected]
Mr. Vladimir Dimitrijević
Rep. Hydrometeorological Service of Serbia, Kneza Viseslava 66, 11030 Belgrade, Serbia and Montenegro [email protected]
Mr. Douglas East
Lawrence Livermore National Laboratory, PO Box 808, MS-L-73, Livermore, CA 94551, USA [email protected]
Mr. Ben Edgington
Hitachi Europe Ltd., Whitebrook Park, Lower Cookham Road, Maidenhead, SL6 8YA, United Kingdom [email protected]
Mr. Jean-François Estrade
METEO-FRANCE, DSI, 42 Av. G. Coriolis, 31057 Toulouse Cedex, France [email protected]
Dr. Juha Fagerholm
CSC-Scientific Computing, PO Box 4305, FIN-02101 Espoo, Finland [email protected]
Mr. Torgny Faxén
National Supercomputer Center, NSC, SE-581 83 Linkoping, Sweden [email protected]
Dr. Lars Fiedler
EUMETSAT, Am Kavalleriesand 31, 64295 Darmstadt, Germany [email protected]
Dr. Enrico Fucile
Italian Air Force Met Service, Via Pratica di Mare, Aer. M. de Bernardi, Pomezia (RM), Italy [email protected]
Mr. Toshiyuki Furui
NEC HPC Marketing Division, 7-1, Shiba 5-chome, Minato-ku, Tokyo 108-8001, Japan [email protected]
Mr. Fabio Gallo
Linux Networx (LNXI), Europaallee 10, 67657 Kaiserslautern, Germany [email protected]
Mr. Jose A. Garcia-Moya
Instituto Nacional de Meteorologia (INM), Calle Leonardo Prieto Castro 8, 28040 Madrid, Spain [email protected]
Dr. Koushik Ghosh
NOAA/GFDL, Princeton University Forrestal Campus, Princeton, N.J., USA [email protected]
Dr. Eng Lim Goh
Silicon Graphics, 1500 Crittenden Lane, MS 005, Mountain View, CA 94043, USA [email protected]
Mr. Mark Govett
NOAA Forecast Systems Lab., 325 Broadway, R/FSL, Boulder, CO 80305, USA [email protected]
Dr. Don Grice
IBM, 2 Autumn Knoll, New Paltz, NY 12561, USA [email protected]
Mr. Paul Halton
Met Eireann, Glasnevin Hill, Dublin 9, Ireland [email protected]
Dr. James Hamilton
Met Eireann, Glasnevin Hill, Dublin 9, Ireland [email protected]
Mr. Detlef Hauffe
Potsdam Institut fur Klimafolgenforschung, Telegrafenberg A31, D-14473 Potsdam, Germany [email protected]
Mr. Pierre Hildebrand
IBM Canada, 1250 blvd. René-Lévesque West, Montreal, Quebec, Canada H3B 4W2 [email protected]
Mr. Chris Hill
Massachusetts Institute of Technology, 54-1515, M.I.T., Cambridge, MA 02139, USA [email protected]
Dr. Richard Hodur
Naval Research Laboratory, Monterey, CA 93943-5502, USA [email protected]
Mr. Jure Jerman
Environmental Agency of Slovenia, SI-1000 Ljubljana, Slovenia [email protected]
Mr. Hu Jiangkai
National Meteorological Center, Numerical Weather Prediction Div., 46 Baishiqiao Rd., Beijing 100081, P.R. of China [email protected]
Dr. Zhiyan Jin
Chinese Academy of Meteorological Sciences, 46 Baishiqiao Road, Beijing 100081, P.R. of China [email protected]
Mr. Jess Joergensen
University College Cork, College Road, Cork, Co. Cork, Ireland [email protected]
Mr. Bruce Jones
NEC HPC Europe, Pinewood, Chineham Business Park, Basingstoke, RG24 8AL, United Kingdom [email protected]
Mr. Dave Jursik
IBM Corporation, 3600 SE Crystal Springs Blvd., Portland, Oregon 97202, USA [email protected]
Mr. Tuomo Kauranne
Lappeenranta University of Technology, POB 20, FIN-53851 Lappeenranta, Finland [email protected]
Dr. Crispin Keable
SGI, 1530 Arlington Business Park, Theale, Reading, RG7 4SB, United Kingdom [email protected]
Mr. Al Kellie
NCAR, 1850 Table Mesa Drive, Boulder, CO 80305, USA [email protected]
Mrs. Kerstin Kleese-van Dam
CCLRC-Daresbury Laboratory, Warrington, WA4 4AD, United Kingdom [email protected]
Ms. Maryanne Kmit
Danish Meteorological Institute, Lyngbyvej 100, DK-2100 Copenhagen Ø, Denmark [email protected]
Dr. Luis Kornblueh
Max-Planck-Institute for Meteorology, D-20146 Hamburg, Germany [email protected]
Mr. William Kramer
NERSC, 1 Cyclotron Rd., M/S 50B-4230, Berkeley, CA 94550, USA [email protected]
Dr. Elisabeth Krenzien
Deutscher Wetterdienst, POB 10 04 65, 63004 Offenbach, Germany [email protected]
Mr. Martin Kucken
Potsdam Institut fur Klimafolgenforschung, Telegrafenberg A31, D-14473 Potsdam, Germany [email protected]
Mr. Kolja Kuse
SC SuperComputing Services Ltd., Oberfohringer Str. 175a, 81925 Munich, Germany [email protected]
Mr. Christopher Lazou
HiPerCom Consultants Ltd., 10 Western Road, London, N2 9HX, United Kingdom [email protected]
Ms. Vivian Lee
Environment Canada, 2121 N. Trans Canada Hwy, Dorval, Quebec, Canada, H9P 1J3 [email protected]
Mr. John Levesque
Cray Inc., 10703 Pickfair Drive, Austin, TX 78750, USA [email protected]
Mr. René van Lier
KNMI, PO Box 201, 3730 AE De Bilt, The Netherlands [email protected]
Dr. Rich Loft
NCAR, 1850 Table Mesa Drive, Boulder, CO 80305, USA [email protected]
Mr. Thomas Lorenzen
Danish Meteorological Institute, Lyngbyvej 100, DK-2100 Copenhagen Ø, Denmark [email protected]
Mr. Ian Lumb
Platform Computing, 3760 14th Avenue, Markham, Ontario L3R 3T7, Canada [email protected]
Mr. Wai Man Ma
Hong Kong Observatory, 134A, Nathan Road, Tsim Sha Tsui, Kowloon, Hong Kong [email protected]
Dr. Alexander MacDonald
NOAA Forecast Systems Lab., 325 Broadway, R/FSL, Boulder, CO 80305, USA [email protected]
Mr. Moray McLaren
Quadrics Ltd., One Bridewell St., Bristol, BS1 2AA, United Kingdom [email protected]
Mr. John Michalakes
NCAR, 3450 Mitchell Lane, Boulder, CO 80301, USA [email protected]
Mr. Aleksandar Miljković
“Coming-Computer Engineering”, Tose Jovanovica 7, 11030 Belgrade, Serbia and Montenegro [email protected]
Prof. Nikolaos Missirlis
University of Athens, Dept. of Informatics, Panepistimiopolis, Athens, Greece [email protected]
Mr. Chuck Morreale
Cray Inc., 6 Paddock Dr., Lawrence, N.J. 08648, USA [email protected]
Mr. Guy de Morsier
MeteoSwiss, Krahbuhlstr. 58, CH-8044 Zurich, Switzerland [email protected]
Mr. Masami Narita
Japan Meteorological Agency, 1-3-4 Ote-machi, Chiyoda-ku, Tokyo, Japan [email protected]
Dr. Lars Nerger
Alfred-Wegener-Institut fur Polar- und Meeresforschung, Am Handelshafen 12, 27570 Bremerhaven, Germany [email protected]
Mr. Wouter Nieuwenhuizen
KNMI, PO Box 201, 3730 AE De Bilt, The Netherlands [email protected]
Mr. Dave Norton
HPFA, 2351 Wagon Train Trail, South Lake Tahoe, CA 96150-6828, USA [email protected]
Mr. Per Nyberg
Cray Inc., 3608 Boul. St-Charles, Kirkland, Quebec, H9H 3C3, Canada [email protected]
Mr. Michael O’Neill
Fujitsu Systems Europe, Hayes Park Central, Hayes End Road, Hayes, Middx., UB4 8FE, United Kingdom Mike.O'[email protected]
Mr. Yuji Oinaga
Fujitsu, 1-1, Kamikodanaka 4-chome, Nakahara-ku, Kawasaki-shi, Kanagawa, 211-8588, Japan [email protected]
Dr. Stephen Oxley
Met Office, FitzRoy Road, Exeter, EX1 3PB, United Kingdom [email protected]
Mr. Jairo Panetta
INPE/CPTEC, Rod. Presidente Dutra, Km 40, Cx. Postal 01, 12630-000 Cachoeira Paulista - SP, Brazil [email protected]
Dr. Hyei-Sun Park
Korea Institute of Science and Technology Information (KISTI), 52 Yuseong, Daejeon, S. Korea [email protected]
Mr. Simon Pellerin
Meteorological Service of Canada, 2121 Trans-Canada Highway North, Suite 504, Dorval, Quebec, Canada, H9P 1J3 [email protected]
Mr. Kim Petersen
NEC High Performance Computing Europe, Prinzenallee 11, 40549 Dusseldorf, Germany [email protected]
Dr. Jean-Christophe Rioual
NEC HPCE, 1 Emperor Way, Exeter Business Park, Exeter, EX1 3GS, United Kingdom [email protected]
Dr. Ulrich Schattler
Deutscher Wetterdienst, POB 10 04 65, 63004 Offenbach, Germany [email protected]
Dr. Joseph Sela
US National Weather Service, 5200 Auth Rd., Room 207, Suitland, MD 20746, USA [email protected]
Mr. Wolfgang Sell
Deutsches Klimarechenzentrum GmbH (DKRZ), Bundesstr. 55, D-20146 Hamburg, Germany [email protected]
Dr. Paul Selwood
Met Office, FitzRoy Road, Exeter, EX1 3PB, United Kingdom [email protected]
Mr. Robin Sempers
Met Office, FitzRoy Road, Exeter, EX1 3PB, United Kingdom [email protected]
Mr. Eric Sevault
METEO-FRANCE, 42 Av. Coriolis, 31057 Toulouse Cedex, France [email protected]
Dr. Johan Silén
Finnish Meteorological Institute, Helsinki, Finland [email protected]
Dr. Roar Skålin
Norwegian Meteorological Institute, PO Box 43, Blindern, 0313 Oslo, Norway [email protected]
Mr. Niko Sokka
Finnish Meteorological Institute, PO Box 503, 00101 Helsinki, Finland [email protected]
Mr. Jorg Stadler
NEC High Performance Computing Europe, Prinzenallee 11, 40549 Düsseldorf, Germany [email protected]
Mr. Alain St-Denis
Meteorological Service of Canada, 2121 North Service Road, Transcanada Highway, Dorval, Quebec, Canada H9P 1J3 [email protected]
Dr. Lois Steenman-Clark
University of Reading, Dept. of Meteorology, PO Box 243, Reading, RG6 6BB, United Kingdom [email protected]
Mr. Thomas Sterling
Caltech, CACR, MC 158-79, 1200 E. California Blvd., Pasadena, CA 91125, USA [email protected]
Dr. Conor Sweeney
Met Eireann, Glasnevin Hill, Dublin 9, Ireland [email protected]
Dr. Mark Swenson
FNMOC, 7 Grace Hopper Ave., Monterey, CA 93943, USA [email protected]
Dr. Keiko Takahashi
The Earth Simulator Center/JAMSTEC, 3173-25 Showamachi, Kanazawa-ku, Yokohama, Kanagawa 236-0001, Japan [email protected]
Mr. Naoya Tamura
Fujitsu Systems Europe, Hayes Park Central, Hayes End Road, Hayes, Middx., UB4 8FE, United Kingdom [email protected]
Dr. Ulla Thiel
Cray Europe, Waldhofer Str. 102, D-69123 Heidelberg, Germany [email protected]
Dr. Mikhail Tolstykh
Russian Hydrometeorological Research Centre, 9/11 B. Predtecenskii per., 123242 Moscow, Russia [email protected]
Ms. Simone Tomita
CPTEC/INPE, Rod. Pres. Dutra, km 40, CP 01, 12630-000 Cachoeira Paulista - SP, Brazil [email protected]
Mr. Joseph-Pierre Toviessi
Meteorol. Service of Canada, 2121 Trans-Canada Highway, Montreal, Que., Canada H9P 1J3 [email protected]
Mr. Eckhard Tschirschnitz
Cray Computer Deutschland GmbH, Wulfsdorfer Weg 66, D-22359 Hamburg, Germany [email protected]
Mr. Robert Übelmesser
SGI GmbH, Am Hochacker 3, 85630 Grasbrunn, Germany [email protected]
Dr. Atsuya Uno
Earth Simulator Center, JAMSTEC, 3173-25 Showa-machi, Kanazawa-ku, Yokohama, Kanagawa 236-0001, Japan [email protected]
Dr. Ole Vignes
Norwegian Meteorological Institute, PO Box 43, Blindern, 0313 Oslo, Norway [email protected]
Mr. John Walsh
NEC HPC Europe, Pinewood, Chineham Business Park, Basingstoke, RG24 8AL, United Kingdom [email protected]
Mr. Bruce Webster
NOAA/National Weather Service, National Centers for Environmental Prediction, 5200 Auth Road, Camp Springs, MD 20746, USA [email protected]
Dr. Jerry Wegiel
HQ Air Force Weather Agency, 106 Peacekeeper Drive, STE 2N3, Offutt AFB, NE 68113-4039, USA [email protected]
Mr. Jacob Weismann
Danish Meteorological Institute, Lyngbyvej 100, DK-2100 Copenhagen Ø, Denmark [email protected]
Mr. Shen Wenhai
National Meteorological Information Center, 46 Baishiqiao Rd., Beijing, 100081, PR of China [email protected]
Dr. Gunter Wihl
ZAMG, Hohe Warte 38, A-1191 Vienna, Austria [email protected]
Mr. Tomas Wilhelmsson
Swedish Meteorological & Hydrological Institute, SE-601 76 Norrkoping, Sweden [email protected]
Dr. Andrew Woolf
CCLRC, Rutherford Appleton Laboratory, Chilton, Didcot, OX11 0QX, United Kingdom [email protected]
Prof. Hans Zima
University of Vienna, Austria, & Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, USA [email protected]
ECMWF:
Erik Anderson: Head, Data Assimilation Section
Sylvia Baylis: Head, Computer Operations Section
Anton Beljaars: Head, Physical Aspects Section
Horst Bottger: Head, Meteorological Division
Philippe Bougeault: Head, Research Department
Paul Burton: Numerical Aspects Section
Jens Daabeck: Head, Graphics Section
Matteo Dell'Acqua: Head, Networking & Security Section
Mithat Ersoy: Servers & Desktop Group
Richard Fisker: Head, Servers & Desktops Section
Mats Hamrud: Data Assimilation Section
Alfred Hofstadler: Head, Meteorological Applications Section
Mariano Hortal: Head, Numerical Aspects Section
Lars Isaksen: Data Assimilation Section
Peter Janssen: Head, Ocean Waves Section
Norbert Kreitz: User Support Section
François Lalaurette: Head, Meteorological Operations Section
Dominique Marbouty: Director
Martin Miller: Head, Model Division
Stuart Mitchell: Servers & Desktop Group
Umberto Modigliani: Head, User Support Section
George Mozdzynski: Systems Software Section
Tim Palmer: Head, Probabilistic Forecasting & Diagnostics Division
Pam Prior: User Support Section
Luca Romita: Servers & Desktop Group
Sami Saarinen: Satellite Data Section
Deborah Salmond: Numerical Aspects Section
Adrian Simmons: Head, Data Division
Neil Storer: Head, Systems Software Section
Jean-Noel Thepaut: Head, Satellite Data Section
Saki Uppala: ERA Project Leader
Nils Wedi: Numerical Aspects Section
Walter Zwieflhofer: Head, Operations Department